Awesome FOSS Logo
Discover awesome open source software
Launched 🚀🧑‍🚀

Paxosmon 2: The Paxos Journey Continues

Paxos logo

tl;dr - A few more Paxos papers have been produced since the original post, I do the usual writeup. Main benefit/focuses this time look to be centered around increasing throughput by taking load off of the leader node(s) (and assigning the extra work to existing or new nodes).

Multi-part blog post alert

This is (now) a multi-part blog-post!

  1. Part 1 - Paxosmon: Gotta consensus them all
  2. Part 2 - Paxosmon 2: The Journey Continues (this post)

What’s new?

I’m not sure if there’s a lot of new Paxos research out there, since I’m not a post graduate student, but thanks to judofyr from HN I found out about a few more that are new (to me):

I’ll give them the same kind of rundown as I did before but do make sure to do your own reading! If you find some mischaracterized descriptions, reach out and let me know. I also just realized that “consensus” is spelled wrong in the original post 🤦.

Not quite a new Paxos variant: Linearizable Quorum reads in Paxos (2019)

This paper is a nice and short one – it doesn’t introduce a new variant of Paxos but it introduces a new interesting concept – reading primarily from a quorum of followers to acknowledge a round, combined with a forced immediately subsequent phase to account for pending updates. They call this the “Paxos Quorum Read” protocol and go into the implementation.

There’s a nice picture that sums up how it works:

paxos quorum read overview

The image is well made and explains it pretty clearly – the ability to do a simple quorum for reads that doesn’t involve the master, and only check one follower in the subsequent round means it’s pretty easy to do more reads. This effect is compounded when Flexible quorums are brought into play, so that load can be spread across followers.

Paxos Quorum Read: Advancements

PQR does serve for a nice improvement over stock Paxos:

Paxos quorum read performance highlight

Overall PQR offers/introduces:

  • Higher throughput

Paxos Quorum Read: Issues

  • Higher complexity assuming flexible quorums are used
  • The rinse phase is important and linearizability violations can occur if not implemented properly

Paxos Quorum Read in a single reductive sentence

Quorum read from a possibly shifting quorum of non-leader nodes, then check one nodes and you can return your answer.

Paxos gets into politics: Bipartisan Paxos (2018-2019?)

Bipartisan Paxos (“BPaxos”) is really interesting! It introduces a modularized conception of paxos – Leaders, dependency service nodes, proposers, acceptors and replicas (only one of those node types is new). BPaxos’s novel contribution is to “scale” the leaders and proposers, as they are the easily-identifiable bottleneck. With this simple change, the researchers were able to obtain 2x the throughput of EPaxos which is pretty impressive. Bipartisan takes a very pragmatic approach (if that wasn’t already obvious – “scale” is generally something people in non-academic realms think of/optimize for) and uses an off-the-shelf Paxos implementation which is a smart choice, and in doing so actually find bugs in the implementation of EPaxos that went unnoticed for 6 years (!). It can happen to anyone.

This paper serves as a great refresher on the Paxos landscape (as good papers should), starting off with a brief and succint review of Paxos & MultiPaxos and their well-documented failings of such a setup (single leader). Mencius, EPaxos and Caesar (which I don’t think I’ve covered?) are also introduced as solutions, as they enable “multileader” protocols.

So how does BPaxos achieve a speed up? Hello again, our old friend commutability. BPaxos is a “generalized” state replication protocol, which means that it only requires conflicting commands to be executed in the same order (which means that you need to know which commands are conflicting, which can get complex real quick on anything other than a simple key-value store). Paxos introduces the partial ordering of it’s commands by creating a directed graph (notably not acyclic), which are analogous to MultiPaxos logs.

Here’s an image from the paper that lays this out:

Bipartisan paxos command graph illustration

The command graph is analogous to the the command log so we don’t have to think too hard about it, but it introduces an interesting way of modeling executions, and the introduction/possibility of cycles is interesting. The plot thickens as we look at an overview of BPaxos’s execution:

Bipartisan paxos execution overview

The dependency service plays an important role in deciding which operations conflict and which don’t, which affects required ordering operations. What’s interesting is that there’s an extra quorum step in talking to the dependency service. Leaders must aggregate dependencies from all the dependency service nodes before they can move ahead with forwarding to a proposer. After that, the proposers talk to the “consensus service” (isn’t that what we’re solving here?) which perform a more “traditional” round of consensus (the BPaxos paper uses stock conensus with a ballot zero optimization that allows for choices in only 1 round trip, similar to MultiPaxos).

All this said though, BPaxos does very well compared to EPaxos and MultiPaxos – disaggregating roles and scaling the leaders and proposers is very effective on increasing throughput, and shifting the bottle neck to other parts of the system (which can’t be scaled quite as easily if at all). It’s never made quite clear how the leaders’ work roles are divyed up but I assume this is something that dependency nodes track (so if a write to x goes to two leaders, the coordination point with the dependency service is when the first/second node will realize any needed dependencies on shared data). Most important to remember is that there are two keys to BPaxos:

  • Disaggregation
  • Scaling

Bipartisan Paxos: Advancements

Here’s a nice picture of the performance comparison:

Bipartisan paxos performance comparison to Multipaxos and EPaxos

BPaxos offers:

  • More throughput (over EPaxos and MultiPaxos)
  • Lower latencies with more leaders in the very high client count case (better than MultiPaxos)
  • Disaggregated roles that offers slightly less complexity when dealing with individual pieces

Bipartisan Paxos: Issues

  • Increased complexity overall at the system level (logs -> graphs, more communication, dependency service algorithm and management, there’s a consensus in your consensus, dependency graph compaction)
  • Delays are possible due to graph dependency failures
  • More storage usage, replicas must record the IDs of all commands they have executed for a client (compaction for this may be possible as well, but wasn’t discussed)
  • slightly worse latency for a small number of clients (only one client, which is a pretty pathological case)

Bipartisan Paxos in a single reductive sentence

Throw in a graph, model your operations as a graph, add a dependency service, make sure everything runs separately and scale scale scale.

Paxos splits and reorganizes work: Compartmentalized Paxos (2020)

This Paxos variant is interesting as far as I can see – it looks like someone has applied “web scale” to Paxos. The “compartmentalization” approach is focused on two “simple” ideas – decoupling and scaling. What’s nice is this paper also contains a nice concise explanation of the field – Paxos, MultiPaxos and the others are all covered briefly and succinctly. The key ingredient to getting the amazing throughput results that this paper generate lying in identifying unnecessary work being done by key parts of the system. In the MultiPaxos case this means letting the work of notifying Replicas replicas be given to other nodes (leaders in MultiPaxos traditionally have to order writes and communicate paxos round results to Replicas).

Here’s what it looks like when MultiPaxos leaders do the hard work themselves:

Compartmentalized Paxos MultiPaxos overview

And here’s what it looks like when MultiPaxos leaders leave the hard work to others:

Compartmentalized Paxos MultiPaxos overview

Note the introduction of the “proxy leader” class of nodes – it’s quite a nice solution/shift of the problem – simply freeing up the leader nodes to do only log ordering means that there is more (CPU) time to order log messages! The researchers also do well to get to the question before I can ask it in my head:

Have we really eliminated the leader as a bottleneck, or have we just moved the bottleneck into the proxy leaders? To answer this question, we scale.

They go on to note that you can’t just scale any which way though, the key insight is that once the commands have been ordered, then you can choose whichever downstream acceptors you want (and scale can be applied almost indiscriminately). In addition to scaling the Proxy leaders, they then scale the acceptors, and then the replicas. then, we sprinkle some batching on top! Here’s what things look like by the end for MultiPaxos:

Compartmentalized Paxos MultiPaxos final form

Whew, that’s quite the number of arrows – considering most non Hyperscale “distributed systems” out in the wild have either 3 or 5 nodes (at least those are the numbers most frequently used in the guides), this represents a really interesting shift in the thinking – maybe it’s not that hard to manage this many replicas of each part of the system?

Compartmentalized Paxos: Advancements

The performance looks absolutely stunning compared to MultiPaxos:

Compartmentalized Paxos performance comparison to Multipaxos and an unreplicated state machine

Compartmentalized Paxos offers:

  • Introduction of a general technique for increasing throughput (applied to MultiPaxos, Mencius, and S-Paxos)
  • Much higher throughput (unbatched MultiPaxos got ~8x faster, batched MultiPaxos got 4.5x faster)

Compartmentalized Paxos: Issues

  • Significantly increased infrastructure complexity (more actors/servers, more problems)
  • Higher resource requirements

Compartmentalized Paxos in a single reductive sentence

Separate your functionality, introduce nodes that act as demi-leaders, buy more of everything with load balancing and randomization and you’ve got throughput gains with your web-scale.

You can also find information related to the paper here:

Paxos takes aim with a bow and arrow: Matchmaker Paxos (2020)

Matchmaker Paxos focuses on the problems of reconfiguration – the process of replacing dead machines with live ones during failure scenarios in Paxos algorithms like MultiPaxos and Raft. The researchers capitalized on the fact that it was an underexplored area and sought to make reconfiguration require less (machine) effort. A bit of a twist though – where old reconfigurations were applied retroactively (once a machine fails, replace it), the techniques in this paper optimize for a world where machines are reconfigured proactively. The key to making this all work is allowing different rounds to have different configurations of acceptors.

There’s a nice explanatory diagram:

matchmaker paxos overview

See those that new set of blue machines? Those are the matchmaker machines that let can tell proposers which acceptors to talk to, essentially functioning as a “acceptor service discovery”.

Matchmaker Paxos: Advancements

Here’s what “stock” MultiPaxos looks like around horizontal reconfigurations:

Multipaxos without Matchmakers during a horizontal reconfiguration

And here’s Matchmaker Paxos, providing a nearly-degradation-free horizontal reconfiguration:

Multipaxos with matchmakers during a horizontal reconfiguration

Matchmaker Paxos offers:

  • Allow for dynamic reconfiguration of acceptors, with space for proactive matchmaking
  • Builds on (is interoperable with) MultiPaxos and “classic” Paxos

Matchmaker Paxos: Issues

  • More machines to manage (the group of machines that does the matchmaking)
  • Increased complexity

Matchmaker Paxos in a single reductive sentence

Reconfigure so easily you can do it ahead of time by paying the cost of doing acceptor service discovery.

Paxos steps up to the trough: PigPaxos (2020)

Pig Paxos is a bit interesting because it looks like compartmentalized Paxos at first, but the key to the benefits there are related to increasing efficiency of communication by doing some (aptly named) piggy backing and essentially aggregating/batching of messages. The researchers who worked on Pig Paxos seem to have realized the same thing that compartmentalized paxos did as well – leader nodes are simply doing too much. Pig’s contribution here is that it replaces the leader-follower communication with a a “relay-/aggregate-based message flow”. In somewhat plain-er terms this means that nodes are selected to perform the communication would have done (forwarding the leader’s message and gathering responses). Pig also focuses doing this particularly on strong consistency replication – the only kind of consistency that most application developers can actually wrap their heads around properly.

Here’s a nice explanatory image from the paper:

pig paxos overview

Pig Paxos relies on a randomly selected (and thus load-sharing) group of relay nodes who will serve as a sort of “surrogate” leaders. Similar, but slightly different to the ideas of compartmentalized Paxos (there are definitely less mentions of “scale”, at least). The communication flow diagram is a bit more instructive:

pig paxos overview

This all looks well and good but PigPaxos may be the weakest entry so far – the optimizations here seem somewhat optimistic compared to other approaches, and there is a considerable amount of new complexity. To be fair to Pig Paxos though, other papers covered in this edition of Paxosmon could be characterized as being somewhat hand-wavy by ignoring the difficulty of managing gobs of new machines (I’m looking at you, compartmentalized paxos).

PigPaxos: Advancements

PigPaxos gets quite an impressive throughput/latency curve:

Pig Paxos throughput/latency curve, showing around 3x improvement in throughput at lower latencies

  • Greatly increased throughput
  • Load balancing

PigPaxos: Issues

  • More complexity required to manage relaying topologies/schemes
  • More failure modes (communication between a relay group)
  • More overhead due to increased messaging
    • You need to do some sort of compression and dedup of messages as an optimization
  • Probably more nodes to manage
    • This isn’t necessarily true (if you use flexible quorums for example) but you probably want more followers to build relay groups out of

PigPaxos in a single reductive sentence

Shunt work from the leader node from your big quorum to communication-only leaders in small quorums, and optimistically you’ll get more throughput (compared to EPaxos and stock Paxos).


Well hopefully it was fun to read about these new developments in the Paxos ecosystem. After reading some of these papers I started wondering if it was Michael Whittaker himself who recommended all these papers to me, but it wasn’t – THanks to Magnus Holm for making me aware of all these developments. Looks like 2020 was a productive year for the Paxos sphere.

It seems like the main advancement of Paxos in this series of papers was shifting work off of the master nodes. I wonder if this indicates a local maximum in the optimization of other parts of the protocol (operation ordering/guarantees, batching, etc). What someone really needs to do is put all these approaches together into a benchmark/shootout of some sort so we can see just how good they all are stacked up next to each other…