Let's assume you're familiar with basic SQL databases like PostgreSQL and MySQL, and document databases like MongoDB and Elasticsearch. You probably know Redis too.
What's the big deal? Aren't these just the equivalent of Redis or Java's ConcurrentHashMap?
Let's take a look.
Over the last 10 years or so (at least), databases have become more extensible. MySQL has around 10 different open-source storage engines. More surely exist in the wild.
OrioleDB and Citus 10's columnar compression are two particularly interesting databases making use of PostgreSQL's pluggable storage engine. But since neither use an embedded key-value store, I won't talk about them more in this post.
And so on.
Swapping out storage engines allows you to tune the performance of your database. It can allow you to swap out row-oriented storage for column-oriented storage (useful for analytics workloads).
It can allow you to swap B-Trees (traditional choice) for LSM Trees (new hotness) as the underlying storage method (useful for optimizing write-heavy workloads).
Mapping SQL to key-value storage
But how the heck do you turn SQL, row-oriented data, into key-value data?
CockroachDB is a SQL database built on RocksDB originally and now their own LevelDB-inspired database called PebbleDB.
The reason I mention that here is because they maintain a great doc about their method of encoding rows to key-value form.
To simplify that doc though you can imagine mapping each row to a key-value form like this:
Embedded key-value stores almost always support efficient scanning of rows by a key-prefix. This means that you can efficiently grab all rows within a table by prefix-scanning on the table identifier. If you also include a primary key value along with the table identifier prefix, you get efficient primary key lookup.
Even though the key space is flat.
For the encoded value you can pick any encoding scheme; as inefficient as JSON or as efficient as some binary scheme like Protocol Buffers or Parquet.
Thanks to Justin Jaffray for pointing me at the CockroachDB doc and confirming some of my thoughts on encoding strategies.
I've written a couple of tutorials on building a database. They build on top of embedded key-value stores. If you're interested in seeing minimal code walkthroughs of how this process can work, check these posts out:
- Let's build a distributed Postgres proof of concept
- Writing a document database from scratch in Go: Lucene-like filters and indexes
Major aspects of key-value stores
Now that you understand how a database can map to a key-value store, let's take a look at the particular properties that distinguish all these key-value stores from systems like Redis and Memcached.
Maybe the single most important thing a storage system does is actual
store data reliably. You can't just
write(). To quote
Dan Luu, files are
Deferring storage correctness to a dedicated system means database developers can worry about other aspects of database development.
Along with reliable storage is the fact that the storage needs to run in process. Redis, for example, is not embeddable. There are many other things on top of the storage that need to happen in a high-level database and RPC calls between processes for storage is an unnecessary overhead.
Efficient prefix scans
As mentioned above, support for scans is pretty important for how indexes and namespaces (tables in SQL) get mapped to key-value queries.
You shouldn't need to look through all table rows in the flat key space to find the rows for one table.
The above isn't a complete list. Different stores provide different useful aspects like improved performance on certain workloads/in certain environments, builtin transactions, and so on.
And sometimes it's helpful just to have an embedded store in your language rather than going through a foreign-function interface.
Survey of databases built on embedded key-value stores
Lastly, let's take a look at a few databases that build on top of embedded key-value stores.
Note that some of them are not the primary version of the database (e.g. MyRocks vs MySQL, MongoRocks vs Mongo). Some of them are the primary version (e.g. CockroachDB, YugabyteDB).
Document databases built on key-value stores
- MongoRocks (Mongo on RocksDB)
SQL databases built on key-value stores
- MyRocks (MySQL on RocksDB)
- CockroachDB (RocksDB originally, now their own PebbleDB)
- YugabyteDB (on DocDB on RocksDB)
- Apache Ignite (Calcite on RocksDB)
Redis-compatible databases built on key-value stores
Other databases built on key-value stores
- Rocksandra (Cassandra on RocksDB)
Missing a database? Let me know!
Separately, distributed key-value stores
There is a different kind of key-value store that is a standalone app designed for distributed data. This list includes Consul, etcd, likely FoundationDB, and likely ZippyDB. (There's a nice comparison table about some of these databases on the etcd page).
These systems are designed to be used sort of like Redis except for that they are persistant and reliable stores. They are designed to always be up and always correct. For that reason they form the data storage backbone of core infrastructure like Kubernetes.
It is possibly how Snowflake uses FoundationDB but I'm not 100% sure.
TiKV is not an embedded key-value database but it's not being used the same way etcd/Consul are as far as I can tell. It forms the backbone of TiDB, an HTAP (hybrid OLAP/OLTP) SQL database.
Maybe FoundationDB and TiKV deserve their own new category.
But in general these databases have an RPC API that you communicate with over TCP. They are not generally embedded. You manage their process(es) separately.
So in this post we saw that databases are extensible. Storage engines are often swappable. Dedicated embedded key-value stores allow database developers to hand off data storage to a dedicated library. Different key-value stores have different performance characteristics that help developers and operators tune a database for their workload.
Embedded key-value stores are a great foundation for all kinds of databases; SQL databases like CockroachDB, document databases like Mongo, wide-store databases like Cassandra, and caching databases like ZippyDB or Redis Enterprise Flash.
This is a complex topic with many, many variations of systems. Hopefully this was a useful introduction.
Overall if you're not a database developer and you're not running databases at a massive scale, you can probably ignore the details of the storage layer.
Did I get something wrong? Or miss something important? Let me know. :)
- An earlier version of this post suggested that FoundationDB was embedded. It is not. Thanks adaszko on Lobsters for correcting.
- An earlier version of this post suggested that TiKV was embedded. It is not. Thanks eis on Hacker News.