Choosing a Cortex State Store: RocksDB, SQL, Cassandra, DuckDB, and More

The moment your pipeline does more than map and filter — the moment it counts, sums, joins, or windows — it has to remember something between events. A running total, the last value per key, the events buffered inside a window: that is state, and it has to live somewhere. Where it lives determines whether your application survives a restart, how fast it runs, how large it can grow, and how much you pay to operate it.

The Cortex Data Framework treats that "somewhere" as a swappable detail. Stateful operators in Cortex.States talk to a single key-value interface, and you choose the concrete backend at construction time. This post is a decision guide: what the options are, how they differ, and how to pick one without guessing.

Why stateful streaming needs a store

Stateless operators are easy — each event is transformed in isolation and forgotten. Stateful operators are the opposite: they accumulate. Consider the three most common shapes:

Aggregations keep a value per key (a count, a sum, a max) that is read and updated on every matching event.
Joins buffer events from one or both sides until a match arrives on a shared key.
Windows hold the events that fall inside a time range until the window closes and emits.
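
As a rough illustration of two of these shapes, here is a minimal sketch using plain collections. It is not Cortex code: the `Click` record and the `StateShapes` helper are hypothetical names introduced only for this example. A join buffer would look much like the window buffer, keyed by the join key instead of the window start.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event type for this sketch only.
public record Click(string UserId, DateTime At);

public static class StateShapes
{
    // Aggregation: one value per key, read and updated on every matching event.
    public static void Count(Dictionary<string, int> counts, Click c)
    {
        counts.TryGetValue(c.UserId, out var current); // current is 0 for a new key
        counts[c.UserId] = current + 1;
    }

    // Tumbling window: round the timestamp down to the start of its time range.
    public static DateTime WindowStart(DateTime at, TimeSpan size) =>
        new DateTime(at.Ticks - at.Ticks % size.Ticks, at.Kind);

    // Window buffer: hold each event in the bucket for its window until it closes.
    public static void Buffer(
        Dictionary<DateTime, List<Click>> windows, Click c, TimeSpan size)
    {
        var start = WindowStart(c.At, size);
        if (!windows.TryGetValue(start, out var bucket))
        {
            bucket = new List<Click>();
            windows[start] = bucket;
        }
        bucket.Add(c);
    }
}
```

Both dictionaries here are exactly the data that a state store would own in a real pipeline.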

All three need somewhere to put that accumulating data. If it lives only in plain CLR objects on the heap, it vanishes the instant the process stops — fine for a dashboard you can rebuild from scratch, fatal for a financial ledger that must resume exactly where it left off. The job of a state store is to own that data with the durability, scale, and performance characteristics your use case actually requires.

Cortex makes the backend pluggable behind one interface. Every store in Cortex.States implements the same key-value contract — Put, Get, ContainsKey, Remove, GetAll, GetKeys — so your aggregation logic never changes when you switch where state is kept. You write the operator once and decide the backend separately.
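
To make the idea concrete, here is a minimal sketch of what such a contract could look like. The interface name `IKeyValueStore`, the exact signatures, and the `DictionaryStore` and `Counter` types are assumptions made for illustration; the actual types in Cortex.States may be named and shaped differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract mirroring the six operations named above.
public interface IKeyValueStore<TKey, TValue>
{
    void Put(TKey key, TValue value);
    TValue Get(TKey key);
    bool ContainsKey(TKey key);
    void Remove(TKey key);
    IEnumerable<KeyValuePair<TKey, TValue>> GetAll();
    IEnumerable<TKey> GetKeys();
}

// Dictionary-backed implementation: fast, but nothing survives a restart.
public sealed class DictionaryStore<TKey, TValue> : IKeyValueStore<TKey, TValue>
    where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _data = new();

    public void Put(TKey key, TValue value) => _data[key] = value;
    public TValue Get(TKey key) => _data[key]; // throws KeyNotFoundException if absent
    public bool ContainsKey(TKey key) => _data.ContainsKey(key);
    public void Remove(TKey key) => _data.Remove(key);
    public IEnumerable<KeyValuePair<TKey, TValue>> GetAll() => _data;
    public IEnumerable<TKey> GetKeys() => _data.Keys;
}

public static class Counter
{
    // Business logic depends only on the interface, never on a concrete backend.
    public static void CountEvent(IKeyValueStore<string, int> store, string key)
    {
        var current = store.ContainsKey(key) ? store.Get(key) : 0;
        store.Put(key, current + 1);
    }
}
```

Because `Counter.CountEvent` sees only the interface, any other implementation of the same six methods could be passed in without touching it.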

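[Diagram: a stateful operator on the left, connected through the shared key-value interface to the interchangeable store backends on the right]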

The operator on the left does not know — and does not care — which node on the right is wired in. That decoupling is the whole point: you can prototype against memory and ship against RocksDB without rewriting a line of business logic.

The landscape

The stores split along two axes that matter more than any feature list.

The first axis is persistent vs in-memory. In-memory state is fast and zero-config but disappears on restart. Persistent state is written to durable storage and survives crashes, restarts, and redeployments.

The second axis is embedded vs server/distributed. Embedded stores run inside your process — no separate service to deploy, state lives next to the app on local disk. RocksDB, SQLite, and DuckDB are embedded. Server-backed stores point at an external database — SQL Server, PostgreSQL, MongoDB, Cassandra, ClickHouse — which adds operational weight but unlocks shared access, central backups, and (for some) horizontal scale.

A third, quieter distinction is the storage model. Most stores are row- or key-value-oriented. Two are columnar: DuckDB (embedded, in-process analytics) and ClickHouse (server, large-scale analytics). Columnar storage is the right tool when you want to run analytical queries — scans, group-bys, aggregations — over the state itself, not just look up one key at a time.
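
The difference in layout can be sketched in a few lines. This only illustrates the general row-versus-column idea and says nothing about how DuckDB or ClickHouse are implemented internally; `PageCount` and `Layouts` are hypothetical names for this example.

```csharp
using System;

// Hypothetical record for this sketch only.
public record PageCount(string Page, int Count, DateTime LastSeen);

public static class Layouts
{
    // Row-oriented: each record is stored as a unit, so a scan over one field
    // still walks through every whole record.
    public static long SumRows(PageCount[] rows)
    {
        long total = 0;
        foreach (var row in rows) total += row.Count;
        return total;
    }

    // Columnar: each field lives in its own contiguous array, so an aggregate
    // over one field reads only that array.
    public static long SumColumn(int[] counts)
    {
        long total = 0;
        foreach (var count in counts) total += count;
        return total;
    }
}
```

Both methods return the same total; the columnar version simply touches less data to get there, which is why that layout suits scans and group-bys.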

Comparison table

The verified options in Cortex.States today, compared along the axes that drive the decision:

Store	Persistent?	Embedded / Server	Scale	Best for
`InMemoryStateStore`	No	Embedded	Single process, bounded by RAM	Tests, prototypes, transient metrics, ephemeral state
`RocksDbStateStore`	Yes	Embedded	Single node, very large on-disk state	Durable local state with high write throughput
`SqliteKeyValueStateStore`	Yes	Embedded	Single node, modest volume	Zero-config durability on one machine
`DuckDbKeyValueStateStore`	Yes (or in-memory)	Embedded	Single node, large columnar datasets	Embedded analytics, Parquet/CSV export
`SqlServerStateStore` / `SqlServerKeyValueStateStore`	Yes	Server	Vertical + enterprise HA	ACID durability, SQL-inspectable state
`PostgresStateStore` / `PostgresKeyValueStateStore`	Yes	Server	Vertical + replicas	Reliable relational state, rich querying
`MongoDbStateStore`	Yes	Server	Horizontal (sharding/replica sets)	Flexible-schema, document-shaped state
`CassandraStateStore`	Yes	Distributed	Horizontal, multi-node HA	High availability, fault tolerance at scale
`ClickHouseStateStore`	Yes	Server	Horizontal, analytical	Large-scale columnar analytics on state

How to choose, axis by axis

No store wins on every dimension. Decide which axes matter for your workload, then read across.

Latency and throughput

For the lowest possible latency with no durability, in-memory is unbeatable: it is just CLR dictionaries. When you need durability and high write throughput, RocksDB is the embedded sweet spot: its log-structured merge-tree design turns updates into sequential writes, so it absorbs heavy update rates on local SSD without a network hop. Server-backed relational stores (SQL Server, PostgreSQL) pay a network round-trip per operation, so they trade raw speed for transactional guarantees and shared access.

Durability

If losing state on restart is acceptable (a live dashboard, a short-lived job), stay in-memory and skip the operational cost entirely. If data loss is unacceptable, every other store on the list persists, with one caveat: DuckDB persists only when it is given a database file rather than run in its in-memory mode. RocksDB and SQLite give you embedded durability on the local disk; SQL Server and PostgreSQL give you full ACID transactions and write-ahead logging so committed state survives a crash.

Data size

In-memory is capped by available RAM. RocksDB and DuckDB handle very large datasets on a single node's disk. When state outgrows one machine, you need a distributed store: Cassandra for partitioned, highly available key-value state, or MongoDB for sharded document state.

Distribution and high availability

A single embedded store is tied to one process and one disk — if that node dies, its state is unreachable until it comes back. For multi-node high availability, Cassandra is purpose-built: it replicates across nodes and keeps serving through node failures. MongoDB replica sets and ClickHouse clusters offer their own replication models. Choose these when uptime through failure is a hard requirement.

Analytics

If you need to query your state analytically — not just look up keys but scan, group, and aggregate — pick a columnar store. DuckDB runs analytical queries in-process with vectorized execution and exports natively to Parquet and CSV. ClickHouse does the same at cluster scale. A row-oriented key-value store can do point lookups all day but will struggle with wide analytical scans.

Operational cost

Embedded stores have the lowest operational cost: nothing extra to deploy, no separate service to monitor. In-memory, RocksDB, SQLite, and DuckDB all run inside your app. Server-backed stores add a database to provision, secure, back up, and scale — justified when you already run that database, or when shared/central state is a requirement rather than a convenience.
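
One way to read across these axes is as an ordered series of questions. The sketch below encodes one possible ordering; the `StateRequirements` record, the `StoreAdvisor` class, and the priority given to each question are assumptions for illustration, not part of Cortex. Only the returned store names come from the comparison table.

```csharp
using System;

// Hypothetical description of a workload, one flag per axis.
public record StateRequirements(
    bool MustSurviveRestart,
    bool MustSpanMultipleNodes,
    bool NeedsAnalyticalQueries,
    bool AlreadyRunsRelationalDb);

public static class StoreAdvisor
{
    // Walks the axes in one plausible priority order and names a starting point.
    public static string Recommend(StateRequirements r)
    {
        // No durability needed: skip the operational cost entirely.
        if (!r.MustSurviveRestart) return "InMemoryStateStore";

        // State or availability must extend beyond a single machine.
        if (r.MustSpanMultipleNodes)
            return r.NeedsAnalyticalQueries ? "ClickHouseStateStore" : "CassandraStateStore";

        // Single node, but the state itself should be queryable analytically.
        if (r.NeedsAnalyticalQueries) return "DuckDbKeyValueStateStore";

        // A relational database is already operated, so shared state is cheap to add.
        if (r.AlreadyRunsRelationalDb)
            return "PostgresKeyValueStateStore or SqlServerKeyValueStateStore";

        // Default: durable, embedded, high write throughput.
        return "RocksDbStateStore";
    }
}
```

For example, `StoreAdvisor.Recommend(new StateRequirements(true, false, false, false))` returns `"RocksDbStateStore"`. Treat the result as a first candidate to benchmark, not a verdict.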

The same aggregation, three backends

Here is what "pluggable" buys you in practice. The aggregation below counts events per key with AggregateSilently — the silent variant keeps records flowing downstream while it updates state. The business logic is identical in all three versions; only the stateStore: argument changes.

In-memory (fast, transient)

Perfect for tests and prototypes. State is gone when the process stops.

using Cortex.States;
using Cortex.Streams;

var store = new InMemoryStateStore<string, int>("counts");

var stream = StreamBuilder<PageView>.CreateNewStream("page-views")
    .Stream()
    .AggregateSilently<string, int>(
        e => e.Page,
        (count, e) => count + 1,
        stateStoreName: "counts",
        stateStore: store)
    .Sink(e => Console.WriteLine($"saw {e.Page}"))
    .Build();

stream.Start();
stream.Emit(new PageView("/home"));
stream.Emit(new PageView("/home"));
stream.Emit(new PageView("/pricing"));

// Read the accumulated state back out
foreach (var kv in stream
    .GetStateStoreByName<InMemoryStateStore<string, int>>("counts")
    .GetAll())
{
    Console.WriteLine($"{kv.Key} = {kv.Value}");
}

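stream.Stop();

// Event type used throughout these examples: a minimal definition consistent
// with how PageView is used above. In a top-level program it must come last.
public record PageView(string Page);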

RocksDB (durable, embedded)

Swap one line and the same counts now survive restarts — written to local disk with high write throughput, no external service required. Install Cortex.States.RocksDb. The constructor takes the store name and a storage path:

using Cortex.States.RocksDb;
using Cortex.Streams;

// new RocksDbStateStore<TKey, TValue>(name, path)
var store = new RocksDbStateStore<string, int>("counts", "./data/counts");

var stream = StreamBuilder<PageView>.CreateNewStream("page-views")
    .Stream()
    .AggregateSilently<string, int>(
        e => e.Page,
        (count, e) => count + 1,
        stateStoreName: "counts",
        stateStore: store)         // only this changed
    .Sink(e => Console.WriteLine($"saw {e.Page}"))
    .Build();

stream.Start();
// ... emit events ...
stream.Stop();

store.Dispose();   // flush and release the RocksDB handle

On the next run, RocksDB reopens the same path and the operator resumes from the persisted counts.

DuckDB (durable, analytics-ready)

When you want the state itself to be queryable and exportable, point the same operator at DuckDB. Install Cortex.States.DuckDb. The key-value store constructor takes the store name, a database file path, and a table name:

using Cortex.States.DuckDb;
using Cortex.Streams;

// new DuckDbKeyValueStateStore<TKey, TValue>(name, databasePath, tableName)
var store = new DuckDbKeyValueStateStore<string, int>(
    name: "counts",
    databasePath: "./data/counts.duckdb",
    tableName: "PageCounts");

var stream = StreamBuilder<PageView>.CreateNewStream("page-views")
    .Stream()
    .AggregateSilently<string, int>(
        e => e.Page,
        (count, e) => count + 1,
        stateStoreName: "counts",
        stateStore: store)         // only this changed
    .Sink(e => Console.WriteLine($"saw {e.Page}"))
    .Build();

stream.Start();
// ... emit events ...
stream.Stop();

// DuckDB extras: durable checkpoint + native columnar export
store.Checkpoint();
store.ExportToParquet("./exports/page-counts.parquet");
store.Dispose();

Checkpoint() and ExportToParquet(...), along with Count() (not shown above), are DuckDB-specific conveniences on top of the shared key-value contract. They are exactly the kind of capability that justifies choosing one particular backend over another.

Reaching for a server-backed store

The server-backed stores follow the same pattern — build the store, pass it as stateStore:. The only difference is the constructor, which carries connection details instead of a file path. A few verified signatures from the docs:

// SQL Server — Cortex.States.MSSqlServer
var sql = new SqlServerKeyValueStateStore<string, int>(
    name: "counts",
    connectionString: "Server=.;Database=CortexDb;Trusted_Connection=True;",
    tableName: "Counts");

// PostgreSQL — Cortex.States.PostgreSQL
var pg = new PostgresKeyValueStateStore<string, int>(
    name: "counts",
    connectionString: "Host=localhost;Database=cortex;Username=postgres;Password=secret",
    tableName: "counts",
    schemaName: "public");

// Cassandra — Cortex.States.Cassandra (uses an existing ISession)
var cassandra = new CassandraStateStore<string, int>(
    "counts", "keyspace", "tableName", session);

Each ships as its own NuGet package, so you only pull in the driver you actually use.

Recommendations and rules of thumb