
Paxosmon: Gotta Consensus Them All



tl;dr - I go through a (near complete?) selection of research papers and approaches therein that solve distributed consensus with strong consistency as provided by the Paxos family of algorithms.

Multi-part blog post alert

This is a multi-part blog-post!

  1. Part 1 - Paxosmon: Gotta consensus them all (this post)
  2. Part 2 - Paxosmon 2: The Journey Continues

Foreword: I made an icon!

Before we start, a quick note – I made an icon for Paxos since it doesn’t seem to have one – what do you think? It was <30 minutes in Inkscape (and honestly, it shows), but I felt it might be a waste to just let it live on my hard drive without showing it to people. I started with the Paxos island flag’s trident motif and added some vaguely distributed-system-y ideas:

Improvised paxos logo

The wonderful font is Aller created by Dalton Maag Ltd. Anyways, on to the real important stuff.

The future is now and it is distributed

The future (and present) of computing is almost certainly distributed – while people are quibbling over whether it should be federated or centralized, the ability to think in terms of distributed systems and reason about them is paramount. Whether you consider distribution a problem (as an implementer) or a paradigm (as an architect) differs from person to person, but regardless of which side you’re on, Paxos is one of the most promising classical solutions to the problem of sharing state across a set of machines in a consistent manner.

I’m not going to do a general introduction to distributed systems here, or debate the CAP theorem, so please make sure to do other reading as necessary when you find a term you don’t recognize. What I will do is give an insanely reductive and unprincipled look at the state of the field (from where I’m standing, as a layman), and the broad strokes of the approaches and incremental improvements that have been happening in the research world over the last decade or two in the paxos-related distributed system space.

What came before: 2 Phase Commit

Before diving into Paxos and all its flavors, let’s do a tiny bit of background on an approach that came before it – 2 Phase Commit (2PC). 2PC is an approach that was widely used to commit transactions atomically across multiple nodes before Paxos arrived on the scene.

2PC basically takes Consistency (the C in the CAP theorem) above everything else – you give up the availability of your system in the event that any node is unable to contribute to the consensus process. It can very quickly become impossible to make progress (which in this case I define to be accepting writes), though you can still serve reads.
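To make that blocking behavior concrete, here’s a tiny Python sketch of 2PC from the coordinator’s point of view (all the names here – `Participant`, `prepare`, and so on – are made up for illustration, not from any real library):

```python
# Toy 2-Phase Commit: every participant must vote yes, or the whole write aborts.

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "idle"

    def prepare(self, txn):
        # Phase 1: vote yes/no. A real participant would persist the vote first.
        self.state = "prepared"
        return True

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(txn, participants):
    # Phase 1 (voting): a single "no" -- or a single silent node -- blocks the write
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # Phase 2 (commit): safe to apply everywhere
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"

nodes = [Participant(f"n{i}") for i in range(3)]
print(two_phase_commit("SET x=5", nodes))  # committed
```

The point of the sketch is the `all(votes)` line: unanimity is the availability killer that Paxos’s quorums later relax.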

There’s even a StackOverflow question about comparisons between the two that is worth reading through. There’s also a much more detailed comparison available in the form of a paper called “Consensus on Transaction Commit” – you can actually just skip to the conclusion. Badly paraphrased, the issue with 2PC is that it forces a single coordinator, while Paxos improves on this by weakening the requirement to a majority of nodes. This is a huge paradigm shift because it means we’re now OK with some number n of nodes failing. We take advantage of the properties of a majority in a set (a quorum) to guarantee that our writes are safe from the system-level view (even if they’re not for a single specific node).

An interesting use of this algorithm is surprisingly not a distributed system at all, but Postgres, a well-loved database in the open source world. In my opinion Postgres is the best F/OSS database you could pick for projects today, but I’ll write more on that another time. Before Postgres adopted Multi-Version Concurrency Control (MVCC), it actually used 2PC to achieve consistency across the processes that make up the database system – Postgres actually uses a process-per-connection model.

Enter Paxos (1998)

The previous section hinted at it a bit, but the crucial improvement of Paxos over something like 2 Phase Commit (2PC) was the introduction and use of quorums, instead of requiring every single node to agree on some shared value. Paxos has basically become the bread and butter of modern-day consensus and distributed system algorithms as a result. As people found out (very quickly) that serving lots of traffic (or performing lots of operations) often required scaling a system to more than one machine, the problem of coordinating those systems became very prominent very quickly. The seminal paper that introduced Paxos is The Part-Time Parliament. The paper didn’t become very well known very quickly, however, due to its heavy use of analogies and examples that were hard to understand.

Here’s a reductive but hopefully useful description of the mechanics:

  • Establish an ordering for events in the system (let’s use a logical clock – so start at time=0), and make sure it counts up
  • Nodes propose values with some clock value attached (in JSON, let’s say {"time":1, "value": 5})
  • Other nodes receive (or don’t receive) those values w/ times and make a decision on whether to accept the value based on what they’ve seen
  • When a majority agrees on some value at a given time, it tells everyone so nodes can save that value to permanent storage
  • Once we have agreement @ time t, that round is over, let’s do it all again.
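The bullets above can be sketched in a few lines of toy Python. Fair warning: this skips the prepare/promise phase and all the role machinery entirely, so it’s a cartoon of one round, not real Paxos:

```python
# Cartoon of one Paxos round: proposers send {"time": t, "value": v}, acceptors
# take the newest proposal they've seen, and a value is chosen once a majority accepts.

class Acceptor:
    def __init__(self):
        self.highest_time = -1
        self.accepted = None  # (time, value) this acceptor last accepted

    def receive(self, time, value):
        # Accept only proposals newer than anything seen so far
        if time > self.highest_time:
            self.highest_time = time
            self.accepted = (time, value)
            return True
        return False

def propose(acceptors, time, value):
    accepts = sum(a.receive(time, value) for a in acceptors)
    # Consensus when a majority accepts; learners would now persist the value
    return accepts > len(acceptors) // 2

cluster = [Acceptor() for _ in range(5)]
print(propose(cluster, 1, 5))  # True: a majority accepts {"time": 1, "value": 5}
print(propose(cluster, 0, 9))  # False: time 0 is stale, nobody accepts it
```

Once a round succeeds at time t, the clock moves on and the whole dance repeats for the next slot.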

I’ve completely ignored the basic concepts and roles of Paxos, but hopefully just by reading this you can get the general gist – it mirrors what happens with humans sitting in a meeting room and voting. More concretely, there are different roles for nodes to play:

  • Proposers: nodes which can suggest values
  • Acceptors: nodes which can accept suggestions (and tell everyone else)
  • Learners: nodes which write values to disk when consensus succeeds (a majority of acceptors accept some value)
  • Leader: a place to centralize the state for all the timeslots (more than one might be happening at a time)

It’s important to note that any node can play any number of roles (though of course only one node can be the leader at a time).

The safety properties of the algorithm are easier to grasp when you think of the natural mechanics of majorities. For example, with humans, once you get a majority of people to agree on something, it’s impossible to pick another majority that doesn’t have someone from the first majority. This means that at least 1 person will definitely know about the previous agreement you reached, and they’ll be able to tell the team. If you have a leader, you can speed up the process by just having everyone talk to the leader instead of trying to talk and hear each other individually in a crowded room.
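That majority-overlap property is small enough to check exhaustively in a few lines of Python:

```python
# Pigeonhole check: any two majorities of the same set share at least one member.
# This overlap is what lets a new quorum always learn about a prior agreement.
from itertools import combinations

nodes = {"a", "b", "c", "d", "e"}
majority = len(nodes) // 2 + 1  # 3 of 5

for q1 in combinations(nodes, majority):
    for q2 in combinations(nodes, majority):
        # The intersection is never empty: 3 + 3 > 5
        assert set(q1) & set(q2), "two majorities always intersect"

print("every pair of majorities overlaps")
```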

For machines, this majority voting system (quorum) is a way to introduce fault tolerance – even if some number of nodes are in a bad state and either broken or never respond (it’s impossible to tell the difference), as long as the majority of the nodes are functioning, the system can withstand the fact that some participants are faulty. Of course, if you have a majority of nodes in an error state, you probably have larger problems – Paxos can’t help you there.

There is a distinction that is often mentioned right about now which is the difference between regular fault tolerance and “byzantine fault tolerance” (BFT). Basically BFT here indicates the class of consensus problems that arise when nodes in your cluster are malicious, and trying to break the algorithm on purpose. As you might expect, this is a much harder problem to solve, but surprisingly it is solvable – granted you have enough nodes and the network itself isn’t byzantine.

Paxos: Advancements

If we consider 2 Phase Commit (2PC) as what came before, then Paxos improves upon it fundamentally by:

  • Introducing quorums (with clock values to help manage concurrent instances)

Paxos: Issues

Some issues with classical Paxos:

  • No one could figure it out, or realized they should try and figure it out for years
  • Hard to implement

Paxos in a single reductive sentence

Get a majority of nodes to agree on some value (ex. v=5) @ a certain time (t=1, using logical clocks), and you have distributed consensus.

Paxos tries again: Paxos Made Simple (2001)

The next entry in the Paxos lineage is actually pretty much identical to the first, but just a tiny bit better looking. The original Part Time Parliament was so hard to read and stifled innovation for so long that Leslie Lamport wrote Paxos Made Simple in an attempt to simplify the idea so others could approach it. After this paper, academia was able to start really exploring the ideas posed by Paxos and has given birth to a wide variety of optimizations and improvements on Paxos (and resultingly the distributed systems landscape).

Unfortunately, Paxos is still hard to implement, but it’s at least easier than it was.

Paxos Made Simple: Advancements

Paxos Made Simple doesn’t really improve on Paxos per se, but simply makes it easier to read and understand.

Paxos Made Simple: Issues

Paxos Made Simple carries with it some of the same issues as the original Paxos, most importantly that it’s still hard to implement. It’s so hard to implement that some folks at Google actually wrote a paper called Paxos Made Live to make things clearer for us mere mortals.

Paxos Made Simple in a single reductive sentence

Same as original Paxos but better explained

Paxos parallelizes: Multipaxos (2001)

The naming is somewhat confusing (well explained by an SO post I came across), but Multipaxos is an improvement over basic Paxos by running multiple paxos rounds at the same time. This optimization is generally called “pipelining” – running multiple instances. Multipaxos was introduced as an optimization in the Paxos Made Simple paper, but not fully explored. As far as I can tell there’s no standalone paper that introduced “Multipaxos” by itself, but it’s what it came to be called.

Multipaxos: Advancements

  • Higher throughput through running multiple instances

Multipaxos: Issues

  • Increased complexity
  • Larger messages over the network

Multipaxos in a single reductive sentence

Do more paxos while you paxos by running multiple consensus rounds/instances at the same time.

Paxos gets faster: FastPaxos (2004~2005)

Given some time to digest the ideas put forth by the Paxos papers, the next thing to show up on the scene was FastPaxos – an optimization of Paxos that reduces messaging delays in the original model.

While we didn’t go into too much detail on how the original model of Paxos works to begin with, the reliance on a leader node for coordination meant that it took an extra round trip’s worth of time (RTT = Round Trip Time) to get a value accepted:

  • A client with a value to save needed to talk to the leader
  • the leader needs to talk to acceptors
  • the rest of the fucking owl paxos algorithm

What Leslie Lamport puts forward in the FastPaxos paper is to save a single RTT worth of computation by allowing clients to send messages directly to nodes acting as acceptors, bypassing the leader entirely. Another key innovation introduced in this paper was for messages to be sent to enough replicas to make a quorum, not necessarily to all replicas that were present. The leader isn’t completely useless though – it’s still required to resolve conflicts and start rounds.

One more improvement that FastPaxos introduced was batching of Paxos instances – so instead of having a round for every value that needed to be written, multiple values could co-exist as operations in a log (think SET x=5; SET y=4; ...) in one round of Paxos. Without this batching, you might have had 2+ rounds of paxos to decide on the values for x and y respectively.
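Here’s a sketch of what batching buys you, with a stand-in `run_paxos_round` function (hypothetical – it just represents one full round of consensus, whatever its contents):

```python
# Batching: decide one log of operations per Paxos round instead of one value per round.

def run_paxos_round(batch):
    # Stand-in for a full Paxos instance deciding this batch as a single log entry
    return list(batch)

pending = ["SET x=5", "SET y=4", "SET z=1"]

# Without batching: one round of consensus messaging per operation
unbatched_rounds = [run_paxos_round([op]) for op in pending]

# With batching: a single round decides the whole suffix of the log
batched_round = run_paxos_round(pending)

print(len(unbatched_rounds), "rounds vs 1 round")  # 3 rounds vs 1 round
```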

We’re jumping ahead a bit but the Generalized Consensus and Paxos paper provides an excellent explanation which was the most useful for me personally in understanding what benefit FastPaxos provided. The paper itself is really about FastPaxos, along with Temporal Logic of Actions (TLA / TLA+), which is a framework invented by Leslie to formally specify the semantics of distributed systems (which can then lead to testing them in part if not in total).

FastPaxos: Advancements

  • -1 RTT
  • Less rounds of Paxos by batching

FastPaxos: Issues

The FastPaxos lunch isn’t free, however – FastPaxos hinges on all acceptors assigning the same command number to a client’s command, and race conditions abound for clients who submit at exactly the same time. Conflicts like this can be resolved by a leader (who has a single, ordered view of when requests came in), but this is the worst case, i.e. the original Paxos case.

The issue with conflicts due to race conditions is also compounded by the fact that if the commands don’t commute, then you have to incur the extra RTT to the leader to make sure you have the ordering right. If commands don’t commute, then order matters, and you need to know the order.
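A quick Python illustration of what “commute” means here (the command format is made up for the example):

```python
# Two commands commute when applying them in either order yields the same state.
# Only non-commuting pairs force the extra leader round trip to fix an order.

def apply(state, cmd):
    key, op, val = cmd
    s = dict(state)
    s[key] = s.get(key, 0) + val if op == "ADD" else val
    return s

def commute(state, c1, c2):
    # Order-independence check: c1 then c2 vs c2 then c1
    return apply(apply(state, c1), c2) == apply(apply(state, c2), c1)

s0 = {"x": 0, "y": 0}
print(commute(s0, ("x", "SET", 5), ("y", "SET", 4)))  # True: different keys, order is irrelevant
print(commute(s0, ("x", "SET", 5), ("x", "SET", 9)))  # False: last write wins, order matters
```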

FastPaxos in a single reductive sentence

Let clients skip the leader and send messages straight to all acceptors, batch up the values to reduce rounds of Paxos that have to be run, and resolve conflicts at the leader only when needed.

Relevant Papers

Paxos gets general: Generalized Paxos (2004~2005)

This paper is actually somewhat of a rehashing of the innovations of FastPaxos, but was written somewhat out of order. Leslie notes in the description:

I’ve been sitting on this paper for so long because it doesn’t seem right to publish a paper on a generalization of Fast Paxos before publishing something about Fast Paxos itself. Since generalized Paxos is a generalization, this paper also explains Fast Paxos. But people’s minds don’t work that way. They need to understand Fast Paxos before they can really understand its generalization.

Ironically this paper was key to me understanding FastPaxos properly. In particular it describes some key Paxos properties that sum up to roughly:

  • Commands can be committed faster if they commute (i.e. they don’t interfere/interact with each other)
  • In the fast (non interfering) case replicas learn after two message delays which is optimal
  • Commands that do interfere with each other require a stable leader for conflict resolutions, and O(N) messages

Generalized Paxos: Advancements

This paper lays out the foundations of a generalized Paxos (and in doing so explains FastPaxos). It’s not really bringing new information to the table, but it does introduce a formalization and executable code in TLA+, so there’s that.

Generalized Paxos: Issues

All the usual issues with Paxos apply to its generalized form.

Generalized paxos in a single reductive sentence

Paxos is generalizable, and if you want to impress machines and get them to check your work, take a look at TLA+.

Paxos sets up a rotation: Mencius (2008)

With the single-leader optimization now well understood, one of the next improvements to be made in the Paxos family of algorithms was Mencius. While having a single leader take most writes reduced the need for coordination, it also made the leader likely to get overloaded as the cluster got bigger or received more traffic. The Mencius paper’s solution was to use a rotating coordinator scheme, rotating leaders to avoid downtime from overloading and increase throughput.
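The rotation scheme itself is simple to sketch – consensus instance i is coordinated by node i mod n (this is an illustration of the idea, not the full Mencius protocol):

```python
# Mencius-style rotation: ownership of consensus instances is striped across
# nodes, so no single node coordinates every round.

def coordinator_for(instance, nodes):
    # Instance numbers are partitioned round-robin among the nodes
    return nodes[instance % len(nodes)]

nodes = ["n0", "n1", "n2"]
for i in range(6):
    print(f"instance {i} -> {coordinator_for(i, nodes)}")
# instance 0 -> n0, instance 1 -> n1, instance 2 -> n2, instance 3 -> n0, ...
```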

Mencius also seems to be the first time that Wide Area Network (WAN) environments were taken into consideration in a Paxos paper. While most papers mention the network in passing (and of course aren’t too far disconnected from it), Mencius makes special mention of performance in the WAN setting.

Mencius: Advancements

  • Increased throughput
  • Increased reliability through better loadbalancing
  • WAN consideration

Mencius: Issues

Unfortunately, if you’re unlucky and your leader fails, the next leader election requires you to wait for coordination with all nodes to safely proceed. This means that while Mencius crashes less, each individual crash (however unlikely) costs more.

Mencius in a single reductive sentence

Designate a leader, but also rotate them for better throughput and loadbalancing at the expense of a more troublesome restart after leader failure.

Paxos gets its act together repeatedly: Multicoordinated Paxos (2006)

Following in the footsteps of the approach outlined in Generalized Paxos, Multicoordinated Paxos introduces more coordinators to increase availability in the case where two commands don’t conflict with each other. In the conflict case (where two commands must be ordered properly) clusters still need a stable leader, but in the (hopefully more common) easy no-conflict case, any fast acting quorum of coordinators can take writes.

Multicoordinated Paxos: Advancements

  • Increased availability with multiple coordinators

Multicoordinated Paxos: Issues

  • Noisier network (more messages required to achieve coordinator quorum)

Multicoordinated Paxos in a single reductive sentence

Roughly the same as Generalized Paxos (basically FastPaxos), but pick a leader quorum, not a single leader, for more availability.

Paxos goes vertical: Vertical Paxos (2009)

From what I can tell, the Vertical Paxos paper is focused on presenting a solution to highly geo-distributed use of Paxos. The main thrust is to partition (AKA “shard”) data, with shards assigned statically to regions. This paper heavily takes into account real-world conditions over a WAN and is pretty practical in its motivations and resulting discussion. It also includes a section towards the end comparing and contrasting Vertical Paxos with other “primary-backup protocols” like the Niobe replication protocol and the Google File System.

Vertical Paxos: Advancements

  • Dynamically reconfigurable Paxos clusters
  • Multiple region support/consideration
  • TLA+ spec

Vertical Paxos: Issues

  • Added complexity of external cross-region coordinator
  • Static assignment of regions (I don’t think this was a hard constraint)

Vertical Paxos in a single reductive sentence

Do the usual Paxos dance but shard your data, assign them static regions and add a coordinating service for relocating objects so you can avoid cross-region Paxos rounds where possible.

Paxos gets even more scalable: SPaxos (2012)

The SPaxos (Scalable Paxos) paper is one of the big lurches forward after some time in paxos land – it introduces the idea of having all replicas respond to all kinds of requests, but simply batch them up and send them to the leader. In this scenario, the leader handles ordering, and while the algorithm is still weak to slow/faulty leaders the ability for replicas to help in the process and gather batches of commands improves throughput.
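Here’s a toy Python sketch of the SPaxos division of labor – any replica batches client requests, and the leader only sequences references to those batches (the class and method names are mine, made up for illustration):

```python
# SPaxos sketch: replicas do the heavy lifting of collecting requests into
# batches; the leader's job shrinks to ordering (replica, batch) references.

class Replica:
    def __init__(self, name):
        self.name = name
        self.batch = []

    def accept(self, request):
        # Any replica takes client requests -- not just the leader
        self.batch.append(request)

    def seal(self):
        # Hand off the accumulated batch, tagged with this replica's name
        sealed, self.batch = self.batch, []
        return (self.name, sealed)

class Leader:
    def __init__(self):
        self.order = []  # the leader only sequences batch references

    def sequence(self, batch_ref):
        self.order.append(batch_ref)

r1, r2 = Replica("r1"), Replica("r2")
leader = Leader()
r1.accept("SET x=5")
r1.accept("SET y=4")
r2.accept("SET z=1")
leader.sequence(r1.seal())
leader.sequence(r2.seal())
print(leader.order)
# [('r1', ['SET x=5', 'SET y=4']), ('r2', ['SET z=1'])]
```

The design choice here is the one the paper makes: the leader stays on the critical path for ordering, but no longer for collecting and disseminating request payloads.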

The throughput graphs show a pretty staggering difference:

spaxos throughput graph

As the paper goes on to state, there’s a roughly 2–3x performance improvement.

SPaxos: Advancements

  • 2–3x performance gains

SPaxos: Issues

  • Weak to slow/faulty leaders (still a single leader system after all)

SPaxos in a single reductive sentence

Batch writes at all replicas, and let the coordinator focus on ordering bigger chunks to watch your throughput soar.

Paxos plays ring around the rosie: Ring Paxos (2010)

The Ring Paxos paper introduces the use of Atomic broadcast in assisting Paxos. Network-level multicast and structuring nodes in a ring play a large part in the gains provided by Ring Paxos.

This is somewhat similar to Mencius, but the main difference is that network multicast is brought into play, and the actual ring structure requirement on node communication.

Ring Paxos: Advancements

  • Reduces load to the coordinator/leader node (due to ring structure)
  • Increased throughput compared to classical paxos
  • Better performance (in particular “Maximum Throughput Efficiency” in the paper) than most other atomic broadcast protocols

Ring Paxos: Issues

  • Relies on IP-multicast to be available
  • Requires additional configuration of things like socket buffer sizes, overflow buffers, etc.
  • High load on the coordinator/leader node

Ring Paxos in a single reductive sentence

Use IP multicast to send messages, and network the nodes in a ring while the coordinator/leader node works hard.

Paxos plays ring around the rosie multiple times: Multi-Ring Paxos (2010)

The Multi-Ring Paxos paper introduces improvements over Ring Paxos, the key one being running multiple instances of Ring Paxos at the same time. This is a common theme in making Paxos more efficient, called “pipelining” (another common improvement is batching). Receivers (non-leader nodes) in Multi-Ring Paxos subscribe to different sets of groups, and participate in consensus instances/rounds for the groups they’re in. Nodes that engage in multiple groups have to merge them deterministically, which adds a bunch of complexity.

Here’s a graph of the performance increase from the paper:

multi ring paxos perf graph

A little disorienting at first – I found it most convenient to compare the Disk M-RP and Ring Paxos labels. The more partitions (X-axis), the better Multi-Ring Paxos performs relative to regular Ring Paxos. Multi-Ring Paxos also uses drastically less CPU, though its usage is still fairly high – if you look at RAM Multi-Ring Paxos (the RAM M-RP label), the CPU essentially becomes the bottleneck, presumably because RAM operations are so fast.

Multi-Ring Paxos: Advancements

  • Higher throughput than simple Ring Paxos
  • Better load balancing than Ring Paxos

Multi-Ring Paxos: Issues

  • More complex
  • Still relies on multicast being available (same as Ring Paxos)

Multi-Ring Paxos in a single reductive sentence

Multiple parallel Ring Paxos instances for more throughput and balanced work.

Paxos goes out to sea: Raft (2014)

Raft was introduced as an even more approachable Paxos implementation – the title of the paper was “In Search of an Understandable Consensus Algorithm”, which speaks to just how difficult the Paxos family of algorithms was to understand and implement. The core premise that Raft introduces is the hard requirement of a designated leader through which all writes flow. Raft brings Paxos to modern developer ergonomics – even including interactive diagrams and explanations on the raft website (and having a logo). The basic “trick” to Raft is to have the leader node work on coordinating and committing the writes, and “fall back” to basic Paxos only in the case of leader failure. So when nodes get writes, they forward them to the “leader” (Raft lingo) or “distinguished proposer/acceptor” (Paxos lingo).

Other Paxos optimizations like batching and pipelining can also be applied to Raft to speed it up, as well as SPaxos’s main contribution which was allowing non-leader nodes to batch up writes, though it depends on what the implementation you’re using does. Generally, if an application can be structured as a distributed log (which I’m pretty sure is universally true for any distributed state machine, since you just store the transitions), Raft is for you.

Raft can be viewed as a specialized case of MultiPaxos tailored for log replication.
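The “application as a distributed log” framing is easy to demo: if replicas agree on the committed log, they agree on the state. A toy key/value example (purely illustrative):

```python
# Replicated state machine in miniature: every replica applies the same
# committed log in the same order, so every replica derives the same state.

def apply_log(log):
    state = {}
    for cmd in log:
        # Each entry is a "key=value" transition, applied in log order
        key, val = cmd.split("=")
        state[key.strip()] = int(val)
    return state

committed_log = ["x=5", "y=4", "x=7"]
# Any replica holding this log computes the identical state
print(apply_log(committed_log))  # {'x': 7, 'y': 4}
```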

Raft: Advancements

  • Understandability (Raft was so understandable that it actually got used in production widely by key value stores like etcd)
  • Simplicity (no need to worry about or resolve conflicts on commands that don’t commute)

Raft: Issues

  • High load on leader node
  • Leader failover is more time-consuming than other failures

Raft in a single reductive sentence

Paxos with a guaranteed leader node through which all writes flow, unless the leader fails then one regular paxos instance happens to elect the next leader.

Relevant Papers

Paxos cares about equality: Egalitarian Paxos AKA EPaxos (2013)

My biggest initial impression from the EPaxos paper was that the solution space was starting to run out of air – more and more concessions and environment/problem-space tweaks seemed to be getting made in order to make Paxos more efficient on some axis or another. Here’s a quick list of the ones that EPaxos requires/introduces:

  • Ensure the initial pair of leader-like nodes have the same data
  • Nodes must have distinct unique IDs and epoch tracking for the ballot system
  • Commands must be unique and idempotent
  • Distinguish a command commit and a command execution (so when a client “commits” a write, reads take a different path that forces the client to re-sync)

The last point (the finer grained command distinction) actually parallels some thinking I’ve been doing lately about treating applications as proper state machines to achieve application-level distribution/consensus. This also meshes with the well-known practice of Command Query Responsibility Segregation (CQRS) – sometimes clients don’t care about the result of their command (or at least they can pretend it actually happened locally), they just want to fire-and-forget commands.

Returning to EPaxos, however, the basic thrust is that by distinguishing the commit and execution times of a particular command, you can ensure that you get into the conflict-free case much more often. EPaxos combines techniques from other Paxos derivatives to produce a pretty good system. In particular, given that you’ve met all the restrictions above, it provides:

  • Fast/slow paths for command executions
  • Allows commits at every replica at every node (better load balancing, batching)
  • Multiple paxos instances at the same time (nodes become “command leaders”, this is basically pipelining AFAICT)

Along with the paper there’s the go-distributed/epaxos repository which contains an implementation of EPaxos written in Golang. It also contains implementations for Mencius and Generalized Paxos.

EPaxos: Advancements

  • Clearer specification of the operations-as-command-logs problem space
  • WAN (i.e. Realistic conditions) consideration – they use EC2
  • Load balancing across replicas
  • Graceful performance degradation (no drop in throughput even when the leader crashes)
  • Increased reliability due to multiple leaders
  • Executable F/OSS code in a repository with the paper

EPaxos: Issues

  • On average 3/4s of acceptors (more than a simple majority) need to listen in certain cases
  • Complexity
    • All the additional requirements must be met
    • Both reads and writes become more complicated
    • Correct metadata on how commands interact must be kept, commands must be idempotent, etc

EPaxos in a single reductive sentence

Multiple leaders, writes at any replica (batching), multiple instances (pipelining) – sprinkle in a little CQRS and accurate metadata on how commands interact, and you get performant Paxos even over WAN.

Paxos limbers up: Flexible Paxos AKA FPaxos (2016)

The FPaxos (Flexible Paxos) paper is a bit of a step forward and a step back at the same time. Although it requires a single elected leader/distinguished node, it introduces the idea of flexible quorums – the novel observation that simple majority quorums are actually not needed at every step of a given instance/round of Paxos. This is a subtle but huge step forward for the protocol in that it makes the fast (happy) path even faster if things go right.

The idea is that there must be an intersection between the nodes required for quorum across leader switches. So when a brand new instance of Paxos (cluster-wide) starts, a leader is elected, and it can skip a full majority for the writes it performs/accepts; the requirement is that the next leader must pick a (possibly less than simple) majority that includes at least a single node from the previous instance. If you sprinkle on a bit of math, you get the concept of grid quorums – a way to ensure that quorums with the right acceptors are picked.
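The rule underneath all this reduces to a one-liner: the leader-election (phase 1) quorum and the replication (phase 2) quorum must intersect, i.e. |Q1| + |Q2| > N. A quick sketch:

```python
# FPaxos relaxes "majorities everywhere" to a single intersection condition:
# any phase-1 quorum must overlap any phase-2 quorum, so |Q1| + |Q2| > N.

def quorums_safe(n, q1, q2):
    # If the two quorum sizes sum past n, they cannot avoid sharing a node
    return q1 + q2 > n

n = 10
# Classic Paxos: simple majorities for both phases
print(quorums_safe(n, 6, 6))  # True
# FPaxos happy path: a tiny phase-2 quorum, paid for with a big phase-1 quorum
print(quorums_safe(n, 9, 2))  # True: faster replication, slower leader change
# Too greedy: these quorums can miss each other entirely
print(quorums_safe(n, 5, 5))  # False
```

The trade is explicit in the numbers: shrinking the write quorum speeds up the steady state, at the cost of a bigger (slower) quorum whenever the leader changes.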

This paper also introduces the fpaxos/fpaxos-lib repository that contains runnable code from the paper.

FPaxos: Advancements

  • Higher throughput in the happiest of paths, no need to wait for a full simple majority
  • Executable F/OSS code in a repository with the paper

FPaxos: Issues

  • Single leader (reliability, load balancing, leader failure downtime issues)

FPaxos in a single reductive sentence

F isn’t for fast but flexible – make sure one node stays in the quorum across leader changes.

Paxos ditches user-space: Kernel Paxos AKA KPaxos (2017)

The KPaxos (Kernel Paxos) paper’s contribution is to move Paxos computations into the kernel itself. I didn’t really read much of this paper, so I’m going to have to leave anything that passes for deep analysis to someone else. While there is a performance gain, I’m not sure I really want code like this moved into the kernel until it’s bulletproof. I’m no kernel developer, but this seems like a pretty intense way to solve this problem. I’d personally rather that Paxos algorithms be implemented as a layer in a unikernel system than patched into a monolithic kernel.

Despite my reservations, it seems like this approach might be best suited for same-machine Paxos clusters, which might prove useful for local databases? I’m not quite sure. The performance gains are undeniable though:

kernel paxos perf graphs

The esposem/Kernel_Paxos repository is included along with the paper if you want to check out the code.

KPaxos: Advancements

  • Bonkers performance
  • Executable F/OSS code in a repository with the paper
  • Code inside the kernel

KPaxos: Issues

  • Code inside the kernel

KPaxos in a single reductive sentence

Let the kernel do your Paxos for you

Relevant Papers

Paxos spreads out: WAN Paxos AKA WPaxos (2017)

As the theme of solution space asphyxia continues, the WPaxos paper is a pretty complicated solution that introduces a bunch of new requirements in order to gain an edge. The basic gist of WPaxos is to add and implement an automatically partitioned, rebalanced, and replicated Paxos configuration. While that’s quite a mouthful it’s basically a combination of all the best features we’ve seen so far (batching, pipelining, multiple leaders, flexible quorums), but with the addition of features like work stealing and enforcing network closeness of certain nodes and certain leaders.

Here are a few techniques and requirements that WPaxos introduces to gain an edge:

  • Multiple concurrent leaders (i.e. flexible quorums ala FPaxos)
  • Requires phase 2 acceptors (i.e. the non-fast path, for conflicting writes that we need to think about more) to be close (in terms of network round trip) to the appropriate leaders
  • Shared commit log per-object that’s being stored
  • Object stealing between leaders so that writes happen close to leaders
  • Separate ballot # for every object, resolves conflict for single object ballot control w/ zone and node IDs and random backoff
  • Every command must only touch one object at a time (!!)

There is a lot happening here – a lot of additional complexity and some pretty startling/hard-to-achieve requirements. The payoff, however, is some pretty amazing real-world performance and consistency that actually beats KPaxos and some of the other advanced approaches (like EPaxos) on some metrics:


I must admit that reading the ideas in this paper reminded me very much of what I envisioned when laying out my ideas behind a distributed SQLite DB (which was not quite the same as rqlite or dqlite). My main contribution to the state of the art (rqlite and dqlite) was going to be write-anywhere functionality backed by automatically partitioned, rebalanced, and replicated data. Nice to know I was on a track, maybe even the right one – though of course WPaxos goes way further than my scribblings and ruminations – just reading this paper was enough to convince me that things were a lot more complicated than I thought.

Back to WPaxos though – it’s pretty clear that it’s an improvement over EPaxos with the exploitation of data locality (the closeness requirement) when it’s possible. Enabling dynamic repartitioning and work stealing between leaders also increases the likelihood of that happy path case. The per-object commit log and one-object-per-command restrictions would normally be dealbreakers, but there are some cases where they’re a natural fit – a distributed key/value store, for example. While I don’t know of many key/value stores that work this hard to be consistent (most just make do with eventual consistency), I guess it would be beneficial.

Runnable Golang code is also available at the ailidani/paxi repository. As a side note, this paper also mentions and dives into FPaxos, so it’s a good place to pick up some details.

WPaxos: Advancements

  • Excellent performance (due to using effectively every trick in the book)
  • Heavy production-readiness (WAN) consideration in the paper
  • Resilient and (dynamically) load balanced

WPaxos: Issues

  • Only one object can be touched by a given command
  • A bunch of the node ID, network setup, and extra requirements of EPaxos
  • Complexity
    • We’re basically also trying to solve scheduling/work placement (another old research problem) at the same time as the normal consensus problem

WPaxos in a single reductive sentence

All the advanced features (Multiple leaders, flexible quorums, etc), along with dynamic partitioning and work stealing to keep data localized, most production-ready except for the restrictions like commands only touching one object at a time.

Paxos gets more careful with setting values: Compare And Set AKA CASPaxos (2018)

In somewhat stark contrast to the WPaxos paper, the CASPaxos paper makes a much simpler contribution. The gist of it is the use of Compare And Swap semantics to build a Paxos-replicated system that doesn’t send operations but instead sends state transitions. CASPaxos is actually more of a return to the Synod (AKA Single Decree Paxos) algorithm (which focuses on the management of a single piece of data with Paxos), which is also explained in a nice table on pages 4 and 5 of the paper. The basic idea is that since the problem has been rephrased such that clients contribute state transitions (i.e. a function state0 -> state1), no two clients with conflicting transitions could ever both succeed – once one of them succeeds, all other commands based on the older state are ignored (the compare-and-swap semantic at work).
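The compare-and-swap core of the idea can be sketched in a few lines. This is a minimal single-node illustration of the semantic only – ballots, acceptors, and quorums (i.e. all of the actual Paxos) are elided, and the names are mine, not the paper's:

```go
package main

import "fmt"

// Register is a toy stand-in for the replicated value CASPaxos manages.
type Register struct {
	state int
}

// CAS applies the client's transition f only if the register still
// holds the state the client based its transition on. A client whose
// snapshot is stale gets false and must re-read before retrying.
func (r *Register) CAS(expected int, f func(int) int) bool {
	if r.state != expected {
		return false
	}
	r.state = f(r.state)
	return true
}

func main() {
	r := &Register{state: 0}
	inc := func(s int) int { return s + 1 } // a state0 -> state1 transition
	fmt.Println(r.CAS(0, inc)) // true: first writer wins
	fmt.Println(r.CAS(0, inc)) // false: conflicting writer saw stale state
	fmt.Println(r.state)       // 1
}
```

The interesting part of the paper is that this same "only one of two conflicting transitions wins" property is preserved when the register is replicated via ballots and quorums.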

Personally, this paper was really enlightening for me and led me to EPaxos and WPaxos as places to look at what it was building on. I’d heard of EPaxos but not WPaxos in the past, and CASPaxos builds on tricks from both of these, so it was a great find. CASPaxos allows for non-sharded multi-master mode, which neither EPaxos nor WPaxos could achieve – pretty impressive. A CAS value has to be set for every value you store (a monotonically increasing number works).

At this point we’ve seen a bunch of papers that came with code, but the gryadka/js repository that comes with CASPaxos is even better than normal – it actually gets similar performance to etcd itself while being an actual master-master replicated database with full consistency, and it comes with a TLA+ specification. gryadka loses to etcd only on the recovery time metric, based on the measurements in rystsov/perseus, but it wins on every other metric.

Another really cool thing about CASPaxos’s implementation gryadka is that it’s composable, in that it was built to work on top of some underlying storage (a local redis instance in the reference implementation). This is cool to me because it’s another idea I’ve been playing around with and thinking about lately – application-level consensus. Of course, databases are just applications themselves, but they seem to be where we first try to solve distribution/scaling issues, when it might actually be better (and make for purer microservices) to solve consensus at the application level, above the database itself. The fact that gryadka uses this approach and beats a minimal purpose-built tool like etcd (which has a full company and lots more resources behind it, and is the industry standard) is an amazing result.

CASPaxos: Advancements

  • TLA+ Spec
  • Executable F/OSS code in a repository with the paper (core has <500 lines of code!)
  • Simple
  • Composable with other software (gryadka manipulates redis)

CASPaxos: Issues

  • Restricted/slightly more specific problem space – clients must contribute transitions that transform the whole state at once
  • Large individual objects will slow it down

CASPaxos in a single reductive sentence

Use the best features of EPaxos but make all clients submit state transformations (like CAS on a hardware register) and periodically garbage collect when people delete.

Paxos cuts back on distributing itself: SDPaxos (2018)

The SDPaxos (Semi-Decentralized) paper is the most recent paxos-family paper I’ve seen, and it’s proof that there are still interesting things happening in the field. There’s an excellent introductory blog post by Murat Demirbas over at SUNY Buffalo that’s worth a read, but I’ll also take a crack at explaining it here.

The general gist behind SDPaxos is to avoid the drawbacks of both leader-based and leaderless approaches by striking a balance in the middle – designating per-command leaders, plus an ordering leader. Per-command leaders first appeared in EPaxos as far as I can tell, and leaders that only worried about ordering first appeared in SPaxos (writes got batched at all replicas and sent to the leader to be ordered). By splitting up command replication/consensus and ordering, SDPaxos allows replication to happen as fast as possible, but leaves ordering to one node to gain easy serializability. The split goes even deeper, because the command-related and ordering-related instances/consensus can actually happen completely independently, in parallel on the same machine.
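The replication/ordering split can be sketched as a toy (this is my own illustration of the idea, not the paper's protocol – there's no actual consensus here, just the shape of the division of labor):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Sequencer stands in for SDPaxos's ordering leader: the ONLY globally
// serialized step is handing out ordering slots.
type Sequencer struct{ next int64 }

// AssignSlot atomically hands out the next global ordering slot.
func (s *Sequencer) AssignSlot() int64 { return atomic.AddInt64(&s.next, 1) }

func main() {
	seq := &Sequencer{}
	var wg sync.WaitGroup
	slots := make([]int64, 3)
	for i := range slots {
		wg.Add(1)
		// Command replication happens in parallel, per-command...
		go func(i int) {
			defer wg.Done()
			// ...only the ordering step goes through the sequencer.
			slots[i] = seq.AssignSlot()
		}(i)
	}
	wg.Wait()
	seen := map[int64]bool{}
	for _, s := range slots {
		seen[s] = true
	}
	fmt.Println(len(seen)) // 3: every command got a unique global slot
}
```

The appeal is that the expensive part (replicating command payloads) is spread across all replicas, while the cheap part (assigning a sequence number) is the only thing funneled through one node.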

As called out in the paper, SDPaxos improves on the weaknesses of a bunch of other approaches:

  • Raft (overloaded leader, painful leader crashes)
  • Mencius/Ring Paxos (weak to stragglers/slow nodes, painful leader crashes)
  • EPaxos (conflicting commands can’t fast path)

The performance is also pretty impressive when compared against these three approaches:

sdpaxos perf graphs

SDPaxos solves the traditional general state-machine-replication-through-logs problem (not the slightly different Synod/Single Decree Paxos problem which CASPaxos solves), and all the advances of previous approaches should apply to SDPaxos, though they haven’t necessarily been applied in the paper as it sits. In particular this approach might look even more interesting with the advances of WPaxos like automatic sharding/partitioning, though the increased complexity would make it very hard to implement. Theoretically there’s nothing stopping SDPaxos from utilizing the improvements introduced by WPaxos or any of the other paxos-based algorithms (except for maybe CASPaxos).

Runnable Golang code is also available at zhypku/SDPaxos on Github.

SDPaxos: Advancements

  • Efficient command ordering through separation of responsibility
  • Better throughput even in the case of conflicts (compared to EPaxos)
  • Fastest multi-master setup with strong consistency
  • Executable F/OSS code in a repository with the paper
  • Optimized read path through the single sequencer (no need to get quorum to read)

SDPaxos: Issues

  • Complexity
    • More consensus groups to manage
    • Straggler detection
    • Intricate rules during recovery for slot management
  • Parallel (threaded) operation of command and ordering instances

SDPaxos in a single reductive sentence

Single leader serializability, multiple command instances and a single serializer for super fast reads and writes, at the cost of lots more complexity.

Relevant Papers

Honorable mentions

Speculative Paxos (2015) focuses on data-center-specific optimizations that didn’t interest me too much (since I don’t run a data center).

NOPaxos (2016) focuses on a new network primitive and some other data-center-specific optimizations that I didn’t find interesting (also because I don’t run a data center).

Adjacent approaches/technology

There is a lot of other research/tech in the distributed systems space, so I’ll take a few seconds to write about some things that have sparked my interest.


Gossip, popularized most recently by the SWIM paper, is another simpler but surprisingly effective approach to distributed (eventual) consistency. It’s up to users/implementors to determine when they need eventual consistency versus the ordering +/- serializability that stricter algorithms can provide, but gossip is much simpler to understand.

SWIM improves on the basic gossip idea by picking the nodes you gossip to at random, and piggybacking messages/information about cluster state on the health checks that are necessary for the system anyway. Through the magic of statistics and probability this both reduces the number of messages you need to send (naive gossip’s biggest weakness) and still ensures that nodes see updates reasonably fast. What’s more, SWIM’s approach doesn’t add that much complexity to an already very simple algorithm, so you’ve got a good chance of implementing it correctly.
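The two tricks – random target selection and piggybacking – can be sketched in a few lines. This is a rough illustration with made-up names (nothing like the real failure-detection state machine in the paper or in memberlist):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Update is a piece of membership news (e.g. "node d is suspect").
type Update struct{ Node, Status string }

// Member is a toy cluster member that accumulates updates it has heard.
type Member struct {
	Name    string
	Updates []Update
}

// Ping health-checks a randomly chosen peer and piggybacks any pending
// membership updates on that same message, instead of sending separate
// gossip traffic.
func (m *Member) Ping(peers []*Member, pending []Update) *Member {
	target := peers[rand.Intn(len(peers))] // random target selection
	target.Updates = append(target.Updates, pending...) // piggybacked gossip
	return target
}

func main() {
	a := &Member{Name: "a"}
	peers := []*Member{{Name: "b"}, {Name: "c"}}
	target := a.Ping(peers, []Update{{Node: "d", Status: "suspect"}})
	fmt.Println(len(target.Updates)) // 1: the update rode along with the ping
}
```

The statistics do the rest: because targets are random, any given update reaches all nodes with high probability after O(log n) rounds, without anyone broadcasting.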

Production-ready implementations exist, for example hashicorp’s memberlist golang library, which is used inside their widely-used Consul product.


Once you’ve started loosening your ordering/serializability requirements, you’ll no doubt land on Conflict-free Replicated Data Types (CRDTs) eventually – as conflicting values fly around your distributed system, the easiest way to deal with them is often Last Write Wins, but the key innovation of CRDTs is introducing and defining the formal concepts/theory that ensure a data type is always cleanly mergeable. The idea is that if you have two values you should be able to combine them and know that they will converge to a well-known/predictable state. Put simply, CRDTs answer the question “what if we had data types that guarantee automatic conflict resolution?”.

The classical example of a CRDT is a G-Counter – basically an incrementing counter where each node only ever increments its own slot. It’s fairly trivial to understand on the surface – if you’ve got values that are doing nothing but incrementing, it’s pretty easy to always resolve conflicts – just take the per-node max when merging.
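A minimal G-Counter sketch makes the merge rule concrete – each replica keeps a map of per-node counts, merge takes the per-node maximum, and the value is the sum:

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: each node increments only its
// own slot, so per-node max is always a safe merge.
type GCounter struct {
	me     string
	counts map[string]int
}

func NewGCounter(me string) *GCounter {
	return &GCounter{me: me, counts: map[string]int{}}
}

// Inc bumps this node's own slot; no other node ever writes it.
func (g *GCounter) Inc(n int) { g.counts[g.me] += n }

// Merge takes the per-node maximum. It's commutative, associative and
// idempotent, which is what guarantees convergence.
func (g *GCounter) Merge(other *GCounter) {
	for node, n := range other.counts {
		if n > g.counts[node] {
			g.counts[node] = n
		}
	}
}

// Value is the sum over every node's slot.
func (g *GCounter) Value() int {
	total := 0
	for _, n := range g.counts {
		total += n
	}
	return total
}

func main() {
	a, b := NewGCounter("a"), NewGCounter("b")
	a.Inc(3)
	b.Inc(4)
	a.Merge(b) // replicas exchanged state; no coordination needed
	fmt.Println(a.Value()) // 7
}
```

Note that merging the same state twice is harmless – idempotence is what lets you gossip these around without bookkeeping.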

A slightly more interesting example is sets – lists without duplicates. The G-Set (Grow Only Set) is easy to merge, because in the end all you need to do is union the two sets – duplicates collapse automatically and you don’t need to worry about conflicts. An even more interesting example is the “OpSet” – you can easily imagine being able to store things like Write Ahead Logs (WALs) by treating them as sets of operations. Of course things aren’t quite that easy, but it’s the kind of idea that, once worked through, provides you with strongly eventually consistent databases at very little cost.
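The G-Set is even shorter to sketch than the G-Counter – merge really is just set union:

```go
package main

import "fmt"

// GSet is a grow-only set CRDT: elements can only be added, so union
// is a conflict-free merge.
type GSet map[string]bool

func (s GSet) Add(v string) { s[v] = true }

// Merge unions the other set in; duplicates collapse for free.
func (s GSet) Merge(other GSet) {
	for v := range other {
		s[v] = true
	}
}

func main() {
	a, b := GSet{}, GSet{}
	a.Add("x")
	b.Add("x") // concurrent duplicate add on another replica
	b.Add("y")
	a.Merge(b)
	fmt.Println(len(a)) // 2: {x, y}
}
```

The catch, of course, is that you can never remove anything – supporting deletion is what leads to the more elaborate set CRDTs (2P-Sets, OR-Sets and friends).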

There are lots of talks out there on CRDTs, here are a few I’ve personally enjoyed:

Classically there are two types of CRDTs: Operation-based CRDTs (CmRDTs) and State-based CRDTs (CvRDTs), and the difference between them is exactly what it sounds like – with operation-based CRDTs you ferry around messages like ADD 1, and with state-based CRDTs you ferry around the entire state, like {a: 1, b: 3, c: -1}. It turns out they’re both reducible to just operation-based CRDTs, an idea introduced by the Making Operation-Based CRDTs Operation-Based paper, which introduces “pure operation-based CRDTs”. It’s pretty easy to see why: since the definition of “operation” is so wide, you could trivially represent any state change as an operation (for example STATE_CHANGE {a: 1, b: 3, c: -1}) and pass those messages around. In the general scheme of things state-based CRDT use should probably be somewhat limited though, as the network isn’t free.



Some rough notes that I took on RAMP, though I didn’t get a chance to fully digest the ideas:

  • Basically, what happens if you throw serializability out of the window yet still go for ACID (I think it actually fits all the letters)
  • The idea is that you make sure transactions either apply for everything or not at all, and ensure readers can’t see across transactions. Also, run all transactions at the same time and if some stall it’s OK (order not guaranteed)
    • Technically this satisfies ACID, right? Atomicity is there, Consistency is there because you can only change data in allowed ways (rules can still be checked in transactions) and changes stay after being committed, I because transactions don’t see partials of each other, and D if you just write stuff to an on-disk WAL.
    • Basically other than this, readers race writers but writers never overlap
  • Seems really good for graph issues (mentioned in the paper) @ massive scale, as it is capable of checking constraints, pretty interesting because there are atomicity issues that were accepted as a tradeoff with scalability (in BigTable, Dynamo and others) that paper authors think is unnecessary
  • I do wonder what would happen if this was combined with compare & set w/ versions being checked… Wonder if you could take a step towards serializability/some sort of partial order (?)
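My reading of the "readers can't see across transactions" idea above can be sketched very loosely. This is just an illustration of atomic-visibility detection as I understand it (names and structure are mine, not the RAMP paper's actual algorithm): writes carry their transaction's timestamp and full write-set, so a reader can tell when it saw only part of a transaction and must fetch the rest:

```go
package main

import "fmt"

// Version is a toy versioned value: stamped with its transaction's
// timestamp and the full set of keys that transaction wrote.
type Version struct {
	Value    string
	TxnTS    int64    // transaction timestamp
	WriteSet []string // every key written by the same transaction
}

// fracturedRead reports whether the read set could have come from a
// partially visible transaction: some key in a write-set was read at
// an older timestamp than the transaction that claims to have written it.
func fracturedRead(reads map[string]Version) bool {
	for _, v := range reads {
		for _, k := range v.WriteSet {
			if other, ok := reads[k]; ok && other.TxnTS < v.TxnTS {
				return true // reader must do a second round for the newer version
			}
		}
	}
	return false
}

func main() {
	// Transaction 2 wrote both x and y, but this reader saw y from txn 1.
	reads := map[string]Version{
		"x": {Value: "new", TxnTS: 2, WriteSet: []string{"x", "y"}},
		"y": {Value: "old", TxnTS: 1, WriteSet: []string{"y"}},
	}
	fmt.Println(fracturedRead(reads)) // true: partial transaction detected
}
```

The nice property is that writers never block readers – the reader just pays an extra round trip when it happens to race a writer.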

On the use of Golang

One more thing I noticed was the number of times Golang was being used for experimentation. I don’t think Golang is a particularly good language in the classical computer-science-y sense, but it sure does “do what it says on the tin”. It set out to be simple and performant, with a decent feature set that doesn’t require too much runtime tweaking/mastery, and it is delivering. It’s very popular in the “devops” space where scripting languages were king, and big important tools are being written at an alarming pace and put into production, and the world hasn’t burned down (which is a good sign for a language that is so young). Golang is already outperforming Java in some places, which is crazy given the amount of man-hours that have been put into Java. If researchers are also finding Golang to be a nice language with a feature set they can work with, it is likely going to start upending a lot of other languages that were traditionally used. I’ve said it multiple times on HackerNews, but I think Golang is going to replace Java within 10 years.

I’m not a researcher but I assume the logic is something like this:

  • Researchers are not professional software developers, and normally write less maintainable code, but I bet the forced simplicity of Golang is making their lives easier – it’s very hard to be clever with Go.
  • Researchers might try to write these algos in something like Python only to be hobbled by the GIL (and need to now struggle with using multiprocessing or asyncio)
    • They might try Perl which has native system threads but then no one would be able to read the code afterwards to try and reproduce the experiment
  • Researchers might have wanted to try Rust, but then they’d have to spend time learning and struggling with the borrow checker and a stricter compiler than most languages. Haskell might have been a good choice, but they might get caught in the Monad burrito/category theory bear trap.
  • Researchers might reach for something like base Erlang or Elixir but then what if they didn’t want to fully commit to the actor model? Actor-paradigm based systems are likely really good for simulations but sometimes you just want to call a function on an object.

Of course I have no basis for the fake situations/musings above, but they seem vaguely plausible. Either way, it looks like Golang (and of course a little bit of JS, in gryadka’s case) is what researchers are reaching for, which I thought was interesting.


This article was really fun to write, and a good way to get my current understanding of Paxos-based algorithms out on virtual paper somewhere. Paxos has been through a lot of iterations, and a lot of shifts in thinking as the state of the art moved around. A few trends emerged that I can clearly make out now:

  • Batching & Pipelining makes things better (this is basically always the case)
    • You have to be somewhat careful that you don’t mess up ordering (usually involves requiring a leader somewhere)
  • A single leader is bad for availability/resiliency/throughput, though it is likely good enough for most real-world cases.
    • Load balancing becomes an issue as writes increase, so if this becomes a problem it’s one that can’t be ignored (it doesn’t matter how many nodes you have if becoming leader is a death sentence)
  • If you want to receive writes at multiple nodes, you have to start caring about dependencies/relations between commands/edits/writes, and deal with the fast/slow path dichotomy & complexity
    • Partitioning helps the multi-leader case, but brings its own complexity/problems
    • Flexible quorums, though more complex, are a step forward – more consistent, faster instances by sending to fewer nodes, as long as you ensure quorum intersection in the next phase.

Along with these basic takeaways, the combination of these ideas with other techniques emerging in distributed computing is really exciting – even just adding CRDTs and SWIM seems like it could produce a whole new breed of systems: merge without worries, and function in modes across the range from eventual to full consistency.