What's the big deal about key-value databases like FoundationDB and RocksDB?

Let's assume you're familiar with basic SQL databases like PostgreSQL and MySQL, and document databases like MongoDB and Elasticsearch. You probably know Redis too.

But you're hearing more and more about embedded key-value stores like RocksDB, LevelDB, PebbleDB, and so on. And you're hearing about distributed key-value databases like FoundationDB and TiKV.

What's the big deal? Aren't these just the equivalent of Redis or Java's ConcurrentHashMap?

Let's take a look.

Extensible databases

Over the last 10 years or so (at least), databases have become more extensible. MySQL has around 10 different open-source storage engines. More surely exist in the wild.

Mongo supports multiple storage engines. Relatively late, PostgreSQL version 12 added support for pluggable storage engines.

OrioleDB and Citus 10's columnar compression are two particularly interesting databases making use of PostgreSQL's pluggable storage engine. But since neither use an embedded key-value store, I won't talk about them more in this post.

And so on.

But why?

Swapping out storage engines allows you to tune the performance of your database. It can allow you to swap out row-oriented storage for column-oriented storage (useful for analytics workloads).

It can allow you to swap B-Trees (traditional choice) for LSM Trees (new hotness) as the underlying storage method (useful for optimizing write-heavy workloads).

And since some storage engines themselves are built on distributed consensus (like FoundationDB and TiKV), it may even allow you to turn a non-distributed database into a distributed database.

Mapping SQL to key-value storage

But how the heck do you turn SQL, row-oriented data, into key-value data?

CockroachDB is a SQL database built on RocksDB originally and now their own LevelDB-inspired database called PebbleDB.

The reason I mention that here is because they maintain a great doc about their method of encoding rows to key-value form.

To simplify that doc though you can imagine mapping each row to a key-value form like this:

${TABLE_IDENTIFIER}_${PRIMARY_KEY}_${ROW_IDENTIFIER}: ${ENCODED_VALUE}

Embedded key-value stores almost always support efficient scanning of rows by a key-prefix. This means that you can efficiently grab all rows within a table by prefix-scanning on the table identifier. If you also include a primary key value along with the table identifier prefix, you get efficient primary key lookup.

Even though the key space is flat.

For the encoded value you can pick any encoding scheme; as inefficient as JSON or as efficient as some binary scheme like Protocol Buffers or Parquet.

Thanks to Justin Jaffray for pointing me at the CockroachDB doc and confirming some of my thoughts on encoding strategies.

Tutorials

I've written a couple of tutorials on building a database. They build on top of embedded key-value stores. If you're interested in seeing minimal code walkthroughs of how this process can work, check these posts out:

Major aspects of key-value stores

Now that you understand how a database can map to a key-value store, let's take a look at the particular properties that distinguish all these key-value stores from systems like Redis and Memcached.

Reliable storage

Maybe the single most important thing a storage system does is actual store data reliably. You can't just open() and write(). To quote Dan Luu, files are hard.

Deferring storage correctness to a dedicated system means database developers can worry about other aspects of database development.

Embeddable

Along with reliable storage is the fact that the storage needs to run in process. Redis, for example, is not embeddable. There are many other things on top of the storage that need to happen in a high-level database and RPC calls between processes for storage is an unnecessary overhead.

Efficient prefix scans

As mentioned above, support for scans is pretty important for how indexes and namespaces (tables in SQL) get mapped to key-value queries.

You shouldn't need to look through all table rows in the flat key space to find the rows for one table.

Additional aspects

The above isn't a complete list. Different stores provide different useful aspects like improved performance on certain workloads/in certain environments, builtin transactions, and so on.

And sometimes it's helpful just to have an embedded store in your language rather than going through a foreign-function interface.

Survey of databases built on embedded key-value stores

Lastly, let's take a look at a few databases that build on top of embedded key-value stores.

Note that some of them are not the primary version of the database (e.g. MyRocks vs MySQL, MongoRocks vs Mongo). Some of them are the primary version (e.g. CockroachDB, YugabyteDB).

Document databases built on key-value stores

MongoRocks (Mongo on RocksDB)

SQL databases built on key-value stores

MyRocks (MySQL on RocksDB)
CockroachDB (RocksDB originally, now their own PebbleDB)
YugabyteDB (on DocDB on RocksDB)
Apache Ignite (Calcite on RocksDB)

Redis-compatible databases built on key-value stores

ZippyDB (Redis-compatible database on RocksDB)
Redis Enterprise Flash (Redis on RocksDB)

Other databases built on key-value stores

Rocksandra (Cassandra on RocksDB)

Missing a database? Let me know!

Separately, distributed key-value stores

There is a different kind of key-value store that is a standalone app designed for distributed data. This list includes Consul, etcd, likely FoundationDB, and likely ZippyDB. (There's a nice comparison table about some of these databases on the etcd page).

These systems are designed to be used sort of like Redis except for that they are persistant and reliable stores. They are designed to always be up and always correct. For that reason they form the data storage backbone of core infrastructure like Kubernetes.

It is possibly how Snowflake uses FoundationDB but I'm not 100% sure.

TiKV is not an embedded key-value database but it's not being used the same way etcd/Consul are as far as I can tell. It forms the backbone of TiDB, an HTAP (hybrid OLAP/OLTP) SQL database.

Maybe FoundationDB and TiKV deserve their own new category.

But in general these databases have an RPC API that you communicate with over TCP. They are not generally embedded. You manage their process(es) separately.

Conclusion

So in this post we saw that databases are extensible. Storage engines are often swappable. Dedicated embedded key-value stores allow database developers to hand off data storage to a dedicated library. Different key-value stores have different performance characteristics that help developers and operators tune a database for their workload.

Embedded key-value stores are a great foundation for all kinds of databases; SQL databases like CockroachDB, document databases like Mongo, wide-store databases like Cassandra, and caching databases like ZippyDB or Redis Enterprise Flash.

This is a complex topic with many, many variations of systems. Hopefully this was a useful introduction.

Overall if you're not a database developer and you're not running databases at a massive scale, you can probably ignore the details of the storage layer.

Did I get something wrong? Or miss something important? Let me know. :)

Corrections

An earlier version of this post suggested that FoundationDB was embedded. It is not. Thanks adaszko on Lobsters for correcting.
An earlier version of this post suggested that TiKV was embedded. It is not. Thanks eis on Hacker News.

What's the big deal about embedded key-value databases like FoundationDB ands RocksDB?

I wrote a new blog post that might give a little context. :)https://t.co/kNFM1hVGx6 pic.twitter.com/H4SouStZHk
— Phil Eaton (@phil_eaton) August 23, 2022