February 8, 2024

An intuition for distributed consensus in OLTP systems

Distributed consensus in transactional databases (e.g. etcd or Cockroach) is a big deal these days. Most often under the hood are variations of log-based Paxos-like algorithms such as MultiPaxos, Viewstamped Replication, or Raft. While there are new variations that come out each year, optimizing for various workloads, these algorithms are fairly standard and well-understood.

In fact they are used in so many places, Kubernetes for example, that even if you don't decide to implement Raft (which is fun and I encourage it), it seems worth building an intuition for distributed consensus.

What happens as you tweak a configuration. What happens as the production environment changes. Or what to reach for as product requirements change.

I've been thinking about the basics of distributed consensus recently. There has been a lot to digest and characterize. And I'm only beginning to get an understanding.

This post is an attempt to share some of the intuition built up reading about and working in this space. Originally this post was also going to end with a walkthrough of my most recent Raft implementation in Rust. But I'm going to hold off on that for another time.

I was fortunate to have a few excellent reviewers looking at versions of this post: Paul Nowoczynski, Alex Miller, Jack Vanlightly, Daniel Chia, and Alex Petrov. Thank you!

Let's start with Raft.

Raft

Raft is a distributed consensus algorithm that allows you to build a replicated state machine on top of a replicated log.

A Raft library handles replicating and durably persisting a sequence (or log) of commands to at least a majority of nodes in a cluster. You provide the library a state machine that interprets the replicated commands. From the perspective of the Raft library, commands are just opaque byte strings.

For example, you could build a replicated key-value store out of SET and GET commands that are passed in by a client. You provide the Raft library with state machine code that interprets the Raft log of SET and GET commands to modify or read from an in-memory hashtable. You can find concrete examples of exactly this replicated key-value store modeling in previous Raft posts I've written.
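
To make that concrete, here is a minimal sketch in Rust of what such user-provided state machine code could look like. The StateMachine trait and the plain-text command format are invented for illustration; they are not the API of any particular Raft library.

    // A minimal sketch (not any specific library's API): the state machine the
    // user hands to a Raft library, interpreting opaque command bytes as
    // SET/GET against an in-memory hashtable.
    use std::collections::HashMap;

    trait StateMachine {
        // Called once a command is committed; returns bytes sent back to the client.
        fn apply(&mut self, command: &[u8]) -> Vec<u8>;
    }

    #[derive(Default)]
    struct KvStore {
        data: HashMap<String, String>,
    }

    impl StateMachine for KvStore {
        fn apply(&mut self, command: &[u8]) -> Vec<u8> {
            // Commands are plain text here for readability: "SET k v" or "GET k".
            let text = String::from_utf8_lossy(command);
            let mut parts = text.split_whitespace();
            match parts.next() {
                Some("SET") => {
                    let key = parts.next().unwrap_or_default().to_string();
                    let value = parts.next().unwrap_or_default().to_string();
                    self.data.insert(key, value);
                    b"OK".to_vec()
                }
                Some("GET") => {
                    let key = parts.next().unwrap_or_default();
                    self.data.get(key).cloned().unwrap_or_default().into_bytes()
                }
                _ => b"ERR".to_vec(),
            }
        }
    }

    fn main() {
        let mut kv = KvStore::default();
        kv.apply(b"SET title consensus");
        assert_eq!(kv.apply(b"GET title"), b"consensus".to_vec());
    }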

All nodes in the cluster run the same Raft code (including the state machine code you provide), communicating among themselves. Nodes elect a semi-permanent leader that accepts all reads and writes from clients. (Again, reads and writes are modeled as commands.)

To commit a new command to the cluster, clients send the command to all nodes in the cluster. Only the leader accepts this command, if there is currently a leader. Clients retry until there is a leader that accepts the command.

The leader appends the command to its log and makes sure to replicate all commands in its log to followers in the same order. The leader sends periodic heartbeat messages to all followers to prolong its term as leader. If a follower hasn't heard from the leader within a period of time, it becomes a candidate and requests votes from the cluster.

When a follower is asked to accept a new command from a leader, it checks if its history is up-to-date with the leader. If it is not, the follower rejects the request and asks the leader to send previous commands to bring it up-to-date. It does this ultimately, in the worst case of a follower that has lost all history, by going all the way back to the very first command ever sent.
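
Here is a rough sketch in Rust of that consistency check, loosely following the AppendEntries rules in the Raft paper. The types and the 1-indexed log convention are simplified for illustration.

    #[allow(dead_code)]
    struct Entry {
        term: u64,
        command: Vec<u8>,
    }

    // Returns true if the follower's log contains the leader's previous entry
    // at the same term, so new entries can be appended right after it.
    fn log_matches(log: &[Entry], prev_log_index: u64, prev_log_term: u64) -> bool {
        if prev_log_index == 0 {
            return true; // Appending from the very start of the log.
        }
        match log.get(prev_log_index as usize - 1) {
            Some(entry) => entry.term == prev_log_term,
            None => false, // Follower is missing history; the leader must back up.
        }
    }

    fn main() {
        let log = vec![
            Entry { term: 1, command: b"SET a 1".to_vec() },
            Entry { term: 2, command: b"SET b 2".to_vec() },
        ];
        assert!(log_matches(&log, 2, 2)); // follower has everything through index 2
        assert!(!log_matches(&log, 3, 2)); // follower is missing entry 3
    }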

When a quorum (typically a majority) of nodes has accepted a command, the leader marks the command as committed and applies the command to its own state machine. When followers learn about newly committed commands, they also apply committed commands to their own state machine.
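
And a sketch of the commit rule itself, assuming the leader tracks a "match index" per follower (the highest log index it knows that follower has replicated). The names here are illustrative.

    // A sketch of the leader's commit rule. match_indexes holds, for each
    // follower, the highest log index the leader knows that follower has
    // replicated. (Real Raft also requires the entry to be from the leader's
    // current term before it can be committed this way.)
    fn highest_committed_index(leader_last_index: u64, match_indexes: &[u64]) -> u64 {
        let cluster_size = match_indexes.len() + 1; // +1 for the leader itself
        let majority = cluster_size / 2 + 1;
        // Walk down from the leader's newest entry until a majority has it.
        for candidate in (1..=leader_last_index).rev() {
            let replicas = 1 + match_indexes.iter().filter(|&&m| m >= candidate).count();
            if replicas >= majority {
                return candidate;
            }
        }
        0 // Nothing is committed yet.
    }

    fn main() {
        // 5-node cluster: leader has entries through 10; followers have
        // replicated up through 10, 9, 4, and 2 respectively.
        // Entry 9 is on three of five nodes, so everything through 9 is committed.
        assert_eq!(highest_committed_index(10, &[10, 9, 4, 2]), 9);
    }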

For the most part, these details are graphically summarized in Figure 2 of the Raft paper.

Availability and linearizability

Taking a step back, distributed consensus helps a group of nodes, a cluster, agree on a value. A client of the cluster can treat a value from the cluster as if the value was atomically written to and read from a single thread. This property is called linearizability.

However, with distributed consensus, the client of the cluster has better availability guarantees from the cluster than if the client atomically wrote to or read from a single thread. A single thread that crashes becomes unavailable. But some number f of nodes can crash in a cluster implementing distributed consensus and the cluster will still 1) be available and 2) provide linearizable reads and writes.

That is: distributed consensus solves the problem of high availability for a system while remaining linearizable.

Without distributed consensus you can still achieve high availability. For example, a database might have two read replicas. But a client reading from a read replica might get stale data. Thus, this system (a database with two read replicas) is not linearizable.

Without distributed consensus you can also try synchronous replication. It would be very simple to do: have a fixed leader and require all nodes to acknowledge before committing. But the value here is extremely limited. If a single node in the cluster goes down, the entire cluster is down.

You might think I'm proposing a strawman. We could simply designate a permanent leader that handles all reads and writes, and require a majority of nodes to commit a command before the leader responds to a client. But in that case, what's the process for getting a lagging follower up-to-date? And what happens if it is the leader who goes down?

Well, these are not trivial problems! And, beyond linearizability that we already mentioned, these problems are exactly what distributed consensus solves.

Why does linearizability matter?

It's very nice, and often even critical, to have a highly available system that will never give you stale data. And regardless, it's convenient to have a term for what we might naively think of as the "correct" way you'd always want to set and get a value.

So linearizability is a convenient way of thinking about complex systems, if you can use or build a system that supports it. But it's not the only consistency approach you'll see in the wild.

As you increase the guarantees of your consistency model, you tend to sacrifice performance. Going the opposite direction, some production systems sacrifice consistency to improve performance. For example, you might allow stale reads from any node, reading only from local state and avoiding consensus, so that you can reduce load on a leader and avoid the overhead of consensus.

There are formal definitions for lower consistency models, including sequential and read-your-writes. You can read the Jepsen page for more detail.

Best and worst case scenarios

A distributed system relies on communicating over the network. The worse the network, whether in terms of latency or reliability, the longer it will take for communication to happen.

Aside from the network, disks can misdirect writes or corrupt data. Or your disk could itself be network-attached storage such as EBS.

And processes themselves can crash due to low disk space or the OOM killer.

It will take longer to achieve consensus to commit messages in these scenarios. If messages take longer to reach nodes, or if nodes are constantly crashing, followers will time out more often, triggering leader election. And the leader election itself (which also requires consensus) will also take longer.

The best case scenario for distributed consensus is where the network is reliable and low-latency. Where disks are reliable and fast. And where processes don't often crash.

TigerBeetle has an incredible visual simulator that demonstrates what happens across ever-worsening environments. While TigerBeetle and this simulator use Viewstamped Replication, the demonstrated principles apply to Raft as well.

What happens when you add nodes?

Distributed consensus algorithms make sure that some minimum number of nodes in a cluster agree before continuing. The minimum number is proportional to the total number of nodes in the cluster.

A typical implementation of Raft for example will require 3 nodes in a 5-node cluster to agree before continuing. 4 nodes in a 7-node cluster. And so on.

Recall that the p99 latency for a service is at least as bad as the slowest external request the service must make. As you increase the number of nodes you must talk to in a consensus cluster, you increase the chance of a slow request.

Consider the extreme case of a 101-node cluster requiring 51 nodes to respond before returning to the client. That's 51 chances for a slower request. Compared to 4 chances in a 7-node cluster. The 101-node cluster is certainly more highly available though! It can tolerate 50 nodes going down. The 7-node cluster can only tolerate 3 nodes going down. The scenario where 50 nodes go down (assuming they're in different availability zones) seems pretty unlikely!
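
The arithmetic behind these numbers is just majority quorums. A quick sketch:

    // Majority quorum size and fault tolerance for an n-node cluster.
    fn majority(n: u64) -> u64 {
        n / 2 + 1
    }

    fn tolerated_failures(n: u64) -> u64 {
        n - majority(n)
    }

    fn main() {
        for n in [3, 5, 7, 101] {
            println!(
                "{}-node cluster: quorum of {}, tolerates {} failures",
                n,
                majority(n),
                tolerated_failures(n)
            );
        }
        // Prints, among others:
        //   7-node cluster: quorum of 4, tolerates 3 failures
        //   101-node cluster: quorum of 51, tolerates 50 failures
    }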

Horizontal scaling with distributed consensus? Not exactly

All of this is to say that the most popular algorithms for distributed consensus, on their own, have nothing to do with horizontal scaling.

The way that horizontally scaling databases like Cockroach or Yugabyte or Spanner work is by sharding the data, transparently to the client. Within each shard, data is replicated by a dedicated distributed consensus cluster.

So, yes, distributed consensus can be a part of horizontal scaling. But again what distributed consensus primarily solves is high availability via replication while remaining linearizable.

This is not a trivial point to make. etcd, consul, and rqlite are examples of databases that do not do sharding, only replication, via a single Raft cluster that replicates all data for the entire system.

For these databases there is no horizontal scaling. If they support "horizontal scaling", they support this by doing non-linearizable (stale) reads. Writes remain a challenge.

This doesn't mean these databases are bad. They are not. One obvious advantage they have over Cockroach or Spanner is that they are conceptually simpler. Conceptually simpler often equates to easier to operate. That's a big deal.

Optimizations

We've covered the basics of operation, but real-world implementations get more complex.

Snapshots

Rather than letting the log grow indefinitely, most libraries implement snapshotting. The user of the library provides a state machine and also provides a method for serializing the state machine to disk. The Raft library periodically serializes the state machine to disk and truncates the log.

When a follower is so far behind that the leader no longer has a log entry (because it has been truncated), the leader transfers an entire snapshot to the follower. Then once the follower is caught up on snapshots, the leader can transfer normal log entries again.

This technique is described in the Raft paper. While it isn't necessary for Raft to work, it's so important that it is hardly an optimization and more a required part of a production Raft system.
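
Here is a sketch of the snapshot-and-truncate bookkeeping, with a hypothetical Snapshot trait standing in for whatever serialization hook a real library exposes.

    use std::collections::HashMap;

    // A user-provided hook for serializing the state machine, alongside a
    // simplified in-memory view of the log. All names here are hypothetical.
    trait Snapshot {
        fn serialize(&self) -> Vec<u8>;
    }

    struct Log {
        snapshot_last_index: u64, // index of the last entry the snapshot covers
        snapshot: Vec<u8>,
        entries: Vec<Vec<u8>>, // entries after the snapshot, as opaque commands
    }

    impl Log {
        // Serialize the state machine, remember how much of the log it covers,
        // and drop that prefix of the log.
        fn take_snapshot<S: Snapshot>(&mut self, state_machine: &S, applied_index: u64) {
            let covered = (applied_index - self.snapshot_last_index) as usize;
            self.snapshot = state_machine.serialize();
            self.entries.drain(..covered);
            self.snapshot_last_index = applied_index;
        }
    }

    impl Snapshot for HashMap<String, String> {
        fn serialize(&self) -> Vec<u8> {
            // Illustrative only; a real implementation would use a stable binary format.
            let mut out = Vec::new();
            for (key, value) in self {
                out.extend_from_slice(format!("{}={}\n", key, value).as_bytes());
            }
            out
        }
    }

    fn main() {
        let mut state_machine = HashMap::new();
        state_machine.insert("a".to_string(), "1".to_string());

        let mut log = Log {
            snapshot_last_index: 0,
            snapshot: Vec::new(),
            entries: vec![b"SET a 1".to_vec(), b"SET b 2".to_vec()],
        };

        // Entries through index 1 have been applied to the state machine.
        log.take_snapshot(&state_machine, 1);
        assert_eq!(log.entries.len(), 1); // only the unapplied suffix remains
        assert_eq!(log.snapshot_last_index, 1);
        assert!(!log.snapshot.is_empty());
    }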

Batching

Rather than limiting clients of the cluster to submitting only one command at a time, it's common for the cluster to accept many commands at a time. Similarly, many commands at a time are submitted to followers. When any node needs to write commands to disk, it can batch commands to disk as well.

But you can go a step beyond this in a way that is completely opaque to the Raft library. Each opaque command the client submits can also contain a batch of messages. In this scenario, only the user-provided state machine needs to be aware that each command it receives is actually a batch of messages that it should pull apart and interpret separately.
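
As a sketch, the batching and unbatching on either side of the Raft library might look like this, using a length-prefixed framing invented for illustration.

    // Pack many application-level messages into one opaque Raft command using a
    // length-prefixed encoding, and unpack them again inside the state machine.
    fn encode_batch(messages: &[&[u8]]) -> Vec<u8> {
        let mut command = Vec::new();
        for message in messages {
            command.extend_from_slice(&(message.len() as u32).to_le_bytes());
            command.extend_from_slice(message);
        }
        command
    }

    fn decode_batch(mut command: &[u8]) -> Vec<Vec<u8>> {
        let mut messages = Vec::new();
        while command.len() >= 4 {
            let len = u32::from_le_bytes(command[..4].try_into().unwrap()) as usize;
            messages.push(command[4..4 + len].to_vec());
            command = &command[4 + len..];
        }
        messages
    }

    fn main() {
        // One Raft command carrying three application messages.
        let command = encode_batch(&[
            b"SET a 1".as_slice(),
            b"SET b 2".as_slice(),
            b"GET a".as_slice(),
        ]);
        // The state machine pulls the batch apart and applies each message.
        assert_eq!(decode_batch(&command).len(), 3);
    }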

This latter technique is a fairly trivial way to increase throughput by an order of magnitude or two.

Disk and network

In terms of how data is stored on disk and how data is sent over the network there is obvious room for optimization.

A naive implementation might store JSON on disk and send JSON over the network. A slightly more optimized implementation might store binary data on disk and send binary data over the network.

Similarly, you can swap out your RPC layer for gRPC, or introduce zlib compression for data sent over the network or written to disk.

You can swap out synchronous IO for libaio or io_uring or SPDK/DPDK.

A little tweak I made in my latest Raft implementation was to index log entries so that searching the log was not a linear operation. Another little tweak was to introduce a page cache to eliminate unnecessary disk reads. This increased throughput by an order of magnitude.
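
To sketch the log index idea (an illustration of the idea, not the actual code from that implementation): keep an in-memory map from log index to byte offset in the log file.

    use std::collections::BTreeMap;

    // Map from log index to byte offset in the log file so a lookup does not
    // require scanning the file from the beginning.
    struct LogIndex {
        offsets: BTreeMap<u64, u64>, // log index -> byte offset in the log file
    }

    impl LogIndex {
        fn new() -> Self {
            LogIndex { offsets: BTreeMap::new() }
        }

        // Record where an entry was written as it is appended.
        fn record(&mut self, log_index: u64, file_offset: u64) {
            self.offsets.insert(log_index, file_offset);
        }

        // O(log n) lookup instead of a linear scan of the log.
        fn offset_of(&self, log_index: u64) -> Option<u64> {
            self.offsets.get(&log_index).copied()
        }
    }

    fn main() {
        let mut index = LogIndex::new();
        index.record(1, 0);
        index.record(2, 128); // entry 2 starts at byte 128 of the log file
        assert_eq!(index.offset_of(2), Some(128));
    }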

Flexible quorums

This brilliant optimization by Heidi Howard and co. shows you can relax the quorum required for committing new commands so long as you increase the quorum required for electing a leader.

In an environment where leader election doesn't happen often, flexible quorums can increase throughput and decrease latency. And it's a pretty easy change to make!
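
The core constraint from the flexible quorums result is that every leader-election quorum must overlap every commit (replication) quorum. For simple counting quorums, that reduces to the two quorum sizes summing to more than the cluster size. A sketch:

    // Flexible quorums: any leader-election quorum must overlap any commit
    // quorum. For simple counting quorums this means the two sizes must sum
    // to more than the cluster size.
    fn quorums_overlap(cluster_size: usize, commit_quorum: usize, election_quorum: usize) -> bool {
        commit_quorum + election_quorum > cluster_size
    }

    fn main() {
        // Classic majorities in a 5-node cluster: 3 + 3 > 5.
        assert!(quorums_overlap(5, 3, 3));
        // Flexible: commit with only 2 nodes if elections require 4 of 5 nodes.
        assert!(quorums_overlap(5, 2, 4));
        // Unsafe: quorums of 2 and 3 in a 5-node cluster might not intersect.
        assert!(!quorums_overlap(5, 2, 3));
    }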

More

These are just a couple common optimizations. You can also read about parallel state machine apply, parallel append to disk, witnesses, compartmentalization, and leader leases. TiKV, Scylla, RedPanda, and Cockroach tend to have public material talking about this stuff.

There are also a few people I follow who are often reviewing relevant papers, if they are not producing their own. I encourage you to follow them too if this is interesting to you:

Safety and testing

The other aspect to consider is safety. For example, checksums for everything written to disk and passed over the network; or being able to recover from corruption in the log.
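
As a sketch of the checksum idea: frame every entry written to disk with a checksum and verify it on read. FNV-1a is used below only to keep the example dependency-free; a real implementation would more likely use something like CRC32C.

    // Compute a simple FNV-1a hash over the payload.
    fn fnv1a(data: &[u8]) -> u64 {
        let mut hash: u64 = 0xcbf29ce484222325;
        for &byte in data {
            hash ^= byte as u64;
            hash = hash.wrapping_mul(0x100000001b3);
        }
        hash
    }

    // Frame an entry as: checksum (8 bytes) || length (4 bytes) || payload.
    fn frame_entry(payload: &[u8]) -> Vec<u8> {
        let mut framed = Vec::new();
        framed.extend_from_slice(&fnv1a(payload).to_le_bytes());
        framed.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        framed.extend_from_slice(payload);
        framed
    }

    // Returns the payload only if the stored checksum still matches.
    fn read_entry(framed: &[u8]) -> Option<&[u8]> {
        let stored = u64::from_le_bytes(framed.get(..8)?.try_into().ok()?);
        let len = u32::from_le_bytes(framed.get(8..12)?.try_into().ok()?) as usize;
        let payload = framed.get(12..12 + len)?;
        (fnv1a(payload) == stored).then_some(payload)
    }

    fn main() {
        let mut framed = frame_entry(b"SET a 1");
        assert!(read_entry(&framed).is_some());
        framed[13] ^= 0xff; // flip a bit in the payload to simulate corruption
        assert!(read_entry(&framed).is_none());
    }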

Testing is also a big deal. There are prominent tools like Jepsen that check for consistency in the face of fault injection (process failure, network failure, etc.). But even Jepsen has its limits. For example, it doesn't test disk failure.

FoundationDB made popular a number of testing techniques. And the people behind this testing went on to build a product, Antithesis, around deterministic testing of non-deterministic code while injecting faults.

And on that topic there's Facebook Experimental's Hermit deterministic Linux hypervisor that may help to test complex distributed systems. However, my experience with it has not been great and the maintainers do not seem very engaged with other people who have reported bugs. I'm hopeful for it but we'll see.

Antithesis and Hermit seem like a boon when half the trouble of working on distributed consensus implementations is avoiding flaky tests.

Another promising avenue is emitting logs during the Raft lifecycle and validating the logs against a TLA+ spec. Microsoft has such a project that has begun to see adoption among open-source Raft implementations.

Conclusion

Everything aside, consensus is expensive. There is overhead to the entire consensus process. So if you do not need this level of availability and can settle for some recovery process based on backups, your system is going to have lower latency and higher throughput than if it had to go through distributed consensus.

If you do need high availability, distributed consensus can be a great choice. But consider the environment and what you want from your consensus algorithm.

Also, while MultiPaxos, Raft, and Viewstamped Replication are some of the most popular algorithms for distributed consensus, there is a world beyond. Two-phase commit, ZooKeeper Atomic Broadcast, PigPaxos, EPaxos, Accord by Cassandra. The world of distributed consensus also gets especially weird and interesting outside of OLTP systems.

But that's enough for one post.

Further reading