tl;dr - I go through a (near complete?) selection of research papers and approaches therein that solve distributed consensus with strong consistency as provided by the Paxos family of algorithms.
This is a multi-part blog-post!
Before we start, a quick note – I made an icon for Paxos since it doesn’t seem to have one – what do you think? It was <30 minutes in Inkscape (and honestly, it shows), but I felt it might be a waste to just let it live on my hard drive without showing it to people. I started with the Paxos island flag’s trident motif and added some vaguely distributed-system-y ideas:
The wonderful font is Aller created by Dalton Maag Ltd. Anyways, on to the real important stuff.
The future (and present) of computing is almost certainly distributed – while people are quibbling over whether it should be federated or centralized, the ability to think in terms of distributed systems and reason about them is paramount. Whether you consider distribution a problem (as an implementer) or a paradigm (as an architect) will differ, but regardless of which side you’re on, Paxos is one of the most promising classical solutions to the problem of getting state shared across a set of machines in a consistent manner.
I’m not going to do a general introduction to distributed systems here, or debate the CAP theorem, so please make sure to do other reading as necessary when you find a term you don’t recognize. What I will do is give an insanely reductive and unprincipled look at the state of the field (from where I’m standing, as a layman), and the broad strokes of the approaches and incremental improvements that have been happening in the research world over the last decade or two in the paxos-related distributed system space.
Before diving into Paxos and all its flavors, let’s do a tiny bit of background on an approach that came before it – 2 Phase Commit (2PC). 2PC is an approach that was widely used to coordinate distributed transactions (getting every participant to commit or abort together) before Paxos-style consensus took hold.
2PC basically takes Consistency (the C in the CAP theorem) above everything else – you give up the availability of your system in the event that any node is unable to contribute to the consensus process. It can very quickly become impossible to make progress (which in this case I define to be accepting writes), though you can still serve reads.
There’s even a StackOverflow question comparing the two that is worth reading through. There’s also a much more detailed comparison available in the form of a paper called “Consensus on Transaction Commit” – you can actually just skip to the conclusion. Badly paraphrased, the issue with 2PC is that it forces a single coordinator, while Paxos improves on this by weakening the requirement to a majority of nodes. This is a huge paradigm shift because it means we’re now OK with some number n of nodes failing. We take advantage of the properties of a majority in a set (a quorum) to guarantee that our writes are safe from the system-level view (even if they’re not for a single specific node).
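As a quick illustration of the arithmetic involved (my own toy example, not from any of the papers), here’s how majority quorum size and failure tolerance relate:

```go
package main

import "fmt"

// quorumSize is the number of nodes that must agree for a majority quorum.
func quorumSize(n int) int {
	return n/2 + 1
}

// maxFailures is how many nodes can fail while a majority quorum is still reachable.
func maxFailures(n int) int {
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d: quorum is %d, tolerates %d failures\n",
			n, quorumSize(n), maxFailures(n))
	}
}
```

Any two majorities drawn from the same n nodes must overlap in at least one node, which is exactly the property the rest of this post leans on.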
An interesting use of this algorithm is surprisingly not a distributed system at all, but Postgres, a well-loved database in the open source world. In my opinion Postgres is the best F/OSS database you could pick for projects today, but I’ll write more on that another time. Before Postgres adopted Multi-Version Concurrency Control (MVCC), it actually used 2PC to achieve consistency across the busy processes that make up the database system – Postgres uses a process-per-connection model.
The previous section hinted at it a bit, but the crucial improvement of Paxos over something like 2 Phase Commit (2PC) was the introduction and use of quorums, instead of requiring every single node to agree on some shared value. Paxos has basically become the bread and butter of modern day consensus and distributed system algorithms as a result. As people found out (very quickly) that serving lots of traffic (or performing lots of operations) often required scaling a system to more than one machine, the problem of coordinating those systems became very prominent very quickly. The seminal paper that introduced Paxos is The Part-Time Parliament. The paper didn’t become very well known very quickly, however, due to the heavy use of analogies and examples that were hard to understand.
Here’s a reductive but hopefully useful description of the mechanics:

1. Keep some notion of time (ex. `time=0`), and make sure it counts up
2. Get a majority of nodes to accept a value at a given time (ex. `{"time": 1, "value": 5}`)
3. Once a majority has accepted a value at a time `t`, that round is over, let’s do it all again.

I’ve completely ignored the basic concepts and roles of Paxos but hopefully just by reading this you can get the general gist – it mirrors what happens with humans sitting in a meeting room and voting. More concretely, there are different roles for nodes to play (proposers, which suggest values; acceptors, which vote on and promise to honor proposals; and learners, which find out which value was chosen).
It’s important to note that any node can play any number of roles (except of course the leader role).
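To make those mechanics a little more concrete, here’s a minimal sketch of the acceptor side of a single Paxos instance – my own heavily simplified toy, not code from any of the papers; a real acceptor would also need to persist its state and talk over a network:

```go
package main

import "fmt"

// Acceptor holds the minimal state a Paxos acceptor needs for one instance.
type Acceptor struct {
	promised      int // highest proposal number we've promised not to go below
	acceptedNum   int // proposal number of the value we last accepted
	acceptedValue string
}

// Prepare handles phase 1: promise to ignore proposals numbered below n,
// and report anything already accepted so the proposer can adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedNum, a.acceptedValue
}

// Accept handles phase 2: accept the value unless we promised a higher number.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedNum = n
	a.acceptedValue = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println("promise granted:", ok)                   // true
	fmt.Println("value accepted:", acc.Accept(1, "v=5"))  // true
	ok, prevN, prevV := acc.Prepare(2)
	fmt.Println("later prepare sees:", ok, prevN, prevV)  // true 1 v=5
}
```

The important bit is the promise: once an acceptor has promised a proposal number, anything lower gets rejected, and later proposers learn about previously accepted values so they can adopt them instead of overwriting them.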
The safety properties of the algorithm are easier to grasp when you think of the natural mechanics of majorities. For example, with humans, once you get a majority of people to agree on something, it’s impossible to pick another majority that doesn’t have someone from the first majority. This means that at least 1 person will definitely know about the previous agreement you reached, and they’ll be able to tell the team. If you have a leader, you can speed up the process by just having everyone talk to the leader instead of trying to talk and hear each other individually in a crowded room.
For machines, this majority voting system (quorum) is a way to introduce fault tolerance – even if some number of nodes are in a bad state and either broken or never respond (it’s impossible to tell the difference), as long as the majority of the nodes are functioning, the system can withstand the fact that some participants are faulty. Of course, if you have a majority of nodes in an error state, you probably have larger problems – Paxos can’t help you there.
There is a distinction that is often mentioned right about now which is the difference between regular fault tolerance and “byzantine fault tolerance” (BFT). Basically BFT here indicates the class of consensus problems that arise when nodes in your cluster are malicious, and trying to break the algorithm on purpose. As you might expect, this is a much harder problem to solve, but surprisingly it is solvable – granted you have enough nodes and the network itself isn’t byzantine.
If we consider 2 Phase Commit (2PC) as what came before, then Paxos improves upon it fundamentally by:

- Requiring only a majority (quorum) of nodes to agree, rather than every single node
- Remaining available for writes as long as a majority of nodes are healthy, instead of stalling the moment one participant goes quiet
Some issues with classical Paxos:

- It’s notoriously hard to understand, and even harder to implement correctly
- Every value needs its own round of messaging unless you bolt on optimizations like batching and pipelining
- Reliance on a single distinguished leader makes that node a bottleneck
Get a majority of nodes to agree on some value (ex. `v=5`) @ a certain time (`t=1`, using vector clocks), and you have distributed consensus.
The next entry in the Paxos lineage is actually pretty much identical to the first, just a tiny bit better looking. The original Part-Time Parliament was so hard to read and stifled innovation for so long that Leslie Lamport wrote Paxos Made Simple in an attempt to simplify the idea so others could approach it. After this paper, academia was able to start really exploring the ideas posed by Paxos, which has given birth to a wide variety of optimizations and improvements on Paxos (and, as a result, on the distributed systems landscape).
Unfortunately, Paxos is still hard to implement, but it’s at least easier than it was.
Paxos Made Simple doesn’t really improve on Paxos per se, it simply makes it easier to read and understand.
Paxos Made Simple carries with it some of the same issues as original Paxos, the most important being that it’s still hard to implement. It’s so hard to implement that some folks at Google actually wrote a paper called Paxos Made Live to make things clearer for us mere mortals.
Same as original Paxos but better explained
The naming is somewhat confusing (well explained by an SO post I came across), but MultiPaxos improves over basic Paxos by running multiple Paxos rounds at the same time. This optimization is generally called “pipelining” – running multiple instances concurrently. MultiPaxos was introduced as an optimization in the Paxos Made Simple paper, but not fully explored. As far as I can tell there’s no standalone paper that introduced “MultiPaxos” by itself, but that’s the name it came to be known by.
Do more paxos while you paxos by running multiple consensus rounds/instances at the same time.
Given some time to digest the ideas put forth by the Paxos papers, the next thing to show up on the scene was FastPaxos – an optimization of Paxos that reduces messaging delays in the original model.
While we didn’t go into too much detail on how the original model of Paxos works to begin with, the reliance on a leader node for coordination meant that it took one round trip’s worth of time (RTT = Round Trip Time) to get a value accepted.
What Leslie Lamport puts forward in the FastPaxos paper is to save a single RTT’s worth of communication by allowing clients to send messages directly to the nodes acting as acceptors, bypassing the leader entirely. Another key innovation introduced in this paper was for messages to be sent to enough replicas to make a quorum, not necessarily to all replicas that were present. The leader isn’t completely useless though – it’s still required to resolve conflicts and start rounds.
One more improvement that FastPaxos introduced was batching of Paxos instances – instead of having a round for every value that needed to be written, multiple values could co-exist as operations in a log (think `SET x=5; SET y=4; ...`) in one round of Paxos. Without this batching, you might have needed 2+ rounds of Paxos to decide on the values for `x` and `y` respectively.
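A rough sketch of what batching buys you – the types and names here are made up for illustration, not pulled from the paper:

```go
package main

import "fmt"

// Command is a single client operation, e.g. SET x=5.
type Command struct {
	Key   string
	Value int
}

// Batch groups many commands into one value that a single Paxos
// instance can agree on, instead of one instance per command.
type Batch struct {
	Commands []Command
}

func main() {
	// Without batching these would each need their own round of consensus.
	pending := []Command{{"x", 5}, {"y", 4}}

	// With batching, one agreed-upon value carries both operations.
	entry := Batch{Commands: pending}
	fmt.Printf("one consensus instance decides %d commands: %+v\n",
		len(entry.Commands), entry)
}
```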
We’re jumping ahead a bit, but the Generalized Consensus and Paxos paper provides an excellent explanation, and was the most useful thing for me personally in understanding what benefit FastPaxos provides. The paper itself is really about FastPaxos, along with Temporal Logic of Actions (TLA / TLA+), a framework invented by Leslie Lamport to formally specify the semantics of distributed systems (which can then lead to testing them in part, if not in total).
The FastPaxos lunch isn’t free, however – FastPaxos hinges on all acceptors assigning the same command number to a client’s command, and race conditions abound for clients who submit at exactly the same time. Conflicts like this can be resolved by a leader (who has a single, ordered view of when requests came in), but this is the worst case, i.e. the original Paxos case.
The issue with conflicts due to race conditions is also compounded by the fact that if the commands don’t commute, then you have to incur the extra RTT to the leader to make sure you have the ordering right. If commands don’t commute, then order matters, and you need to know the order.
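The notion of “commuting” can be as simple as asking whether two commands touch the same data – here’s a hypothetical check along those lines (nothing like this appears in the paper, it’s just to make the idea concrete):

```go
package main

import "fmt"

// Command is a simplistic write to a single key.
type Command struct {
	Key   string
	Value int
}

// commute reports whether two commands can be applied in either order
// with the same result. For plain key writes, that's true whenever
// they touch different keys.
func commute(a, b Command) bool {
	return a.Key != b.Key
}

func main() {
	fmt.Println(commute(Command{"x", 5}, Command{"y", 4})) // true: fast path, no leader needed
	fmt.Println(commute(Command{"x", 5}, Command{"x", 9})) // false: the leader must order them
}
```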
Let clients skip the leader and send messages straight to all acceptors, batch up the values to reduce rounds of Paxos that have to be run, and resolve conflicts at the leader only when needed.
This paper is actually somewhat of a rehashing of the innovations of FastPaxos, but was written somewhat out of order. Leslie notes in the description:
I’ve been sitting on this paper for so long because it doesn’t seem right to publish a paper on a generalization of Fast Paxos before publishing something about Fast Paxos itself. Since generalized Paxos is a generalization, this paper also explains Fast Paxos. But people’s minds don’t work that way. They need to understand Fast Paxos before they can really understand its generalization.
Ironically this paper was key to me understanding FastPaxos properly. In particular it describes some key Paxos properties that sum up to roughly:
This paper lays out the foundations of a generalized Paxos (and in doing so explains FastPaxos). It’s not really bringing new information to the table, but it does introduce a formalization and executable code in TLA+, so there’s that.
All the usual issues with Paxos apply to its generalized form.
Paxos is generalizable, and if you want to impress machines and get them to check your work, take a look at TLA+.
With the optimization of having a single leader now well understood, one of the next improvements made in the Paxos family of algorithms was Mencius. While having a single leader take most writes reduced the need for coordination, it also made the leader likely to get overloaded as the cluster got bigger or received more traffic. The Mencius paper’s solution was to use a rotating coordinator scheme, rotating leaders to avoid downtime from overloading and to increase throughput.
Mencius also seems to be the first time that Wide Area Network (WAN) environments were taken into consideration in a Paxos paper. While most papers mention the network and of course aren’t completely disconnected from it, Mencius makes special mention of performance in the WAN setting.
Unfortunately, if you’re unlucky and your leader fails, the next leader election requires you to wait for coordination with all nodes to safely proceed. This means that while Mencius runs into trouble less often, each individual crash (however unlikely) costs more.
Designate a leader, but also rotate them for better throughput and load balancing, at the expense of a more troublesome restart after a leader failure.
Following in the footsteps of the approach outlined in Generalized Paxos, Multicoordinated Paxos introduces more coordinators to increase availability in the case where two commands don’t conflict with each other. In the conflict case (where two commands must be ordered properly) clusters still need a stable leader, but in the (hopefully more common) easy no-conflict case, any fast acting quorum of coordinators can take writes.
Roughly the same as Generalized Paxos (basically FastPaxos), but pick a leader quorum, not a single leader, for more availability.
From what I can tell, the Vertical Paxos paper is focused on presenting a solution to highly geo-distributed use of Paxos. The main thrust is to partition (AKA “shard”) data, assigning shards statically to regions. The paper heavily takes real world conditions over a WAN into account and is pretty practical in its motivations and resulting discussion. It also includes a section towards the end comparing and contrasting Vertical Paxos with other “primary-backup protocols” like the Niobe replication protocol and the Google File System.
Do the usual Paxos dance but shard your data, assign them static regions and add a coordinating service for relocating objects so you can avoid cross-region Paxos rounds where possible.
The SPaxos (Scalable Paxos) paper is one of the big lurches forward after some time in paxos land – it introduces the idea of having all replicas respond to all kinds of requests, but simply batch them up and send them to the leader. In this scenario, the leader handles ordering, and while the algorithm is still weak to slow/faulty leaders the ability for replicas to help in the process and gather batches of commands improves throughput.
The throughput graphs show a pretty staggering difference – as the paper goes on to state, there’s roughly a 2-3x performance improvement.
Batch writes at all replicas, and let the coordinator focus on ordering bigger chunks to watch your throughput soar.
The Ring Paxos paper introduces the use of Atomic broadcast in assisting Paxos. Network-level multicast and structuring nodes in a ring play a large part in the gains provided by Ring Paxos.
This is somewhat similar to Mencius, but the main differences are that network multicast is brought into play, and that an actual ring structure is imposed on node communication.
Use IP multicast to send messages, and network the nodes in a ring while the coordinator/leader node works hard.
The Multi-Ring Paxos paper introduces improvements over Ring Paxos, the key one being running multiple instances of Ring Paxos at the same time. This is a common theme in making Paxos more efficient, called “pipelining” (another common improvement is batching). Receivers (non-leader nodes) in Multi-Ring Paxos subscribe to different sets of groups, and participate in consensus instances/rounds for the groups they’re in. Nodes that take part in multiple groups have to merge decisions deterministically, which adds a bunch of complexity.
Here’s a graph of the performance increase from the paper:
A little disorienting at first, I found it most convenient to compare the `Disk M-RP` and `Ring Paxos` labels. The more partitions (X-axis), the better Multi-Ring Paxos performs relative to regular Ring Paxos. Multi-Ring Paxos also uses drastically less CPU, though usage is still pretty high – if you look at the in-memory variant (`RAM M-RP` label), the CPU essentially becomes the bottleneck, presumably because RAM operations are so fast.
Multiple parallel Ring Paxos instances for more throughput and balanced work.
Raft was introduced as an even more approachable take on consensus – the title of the paper was “In Search of an Understandable Consensus Algorithm”, which speaks to just how difficult the Paxos family of algorithms was to understand and implement. The core premise of Raft is a hard requirement on a designated leader through which all writes flow. Raft brings modern developer ergonomics to the Paxos space – even including interactive diagrams and explanations on the Raft website (and having a logo). The basic “trick” of Raft is to have the leader node do the work of coordinating and committing writes, and to “fall back” to basic Paxos only in the case of leader failure. So when nodes get writes they forward them to the “leader” (Raft lingo) or “distinguished proposer/acceptor” (Paxos lingo).
Other Paxos optimizations like batching and pipelining can also be applied to Raft to speed it up, as well as SPaxos’s main contribution which was allowing non-leader nodes to batch up writes, though it depends on what the implementation you’re using does. Generally, if an application can be structured as a distributed log (which I’m pretty sure is universally true for any distributed state machine, since you just store the transitions), Raft is for you.
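Here’s a toy sketch of the “all writes flow through the leader” idea – the names and structure are mine, and it ignores elections, terms, and actual replication entirely:

```go
package main

import "fmt"

// Node is a wildly simplified Raft-style participant: every write is
// forwarded to the node we currently believe is the leader, which appends
// it to its log.
type Node struct {
	id     int
	leader *Node    // nil means "I am the leader"
	log    []string // replicated log (only the leader's copy matters here)
}

// Submit appends the entry locally if we're the leader, otherwise forwards
// it to the leader.
func (n *Node) Submit(entry string) {
	if n.leader == nil {
		n.log = append(n.log, entry)
		return
	}
	n.leader.Submit(entry)
}

func main() {
	leader := &Node{id: 1}
	follower := &Node{id: 2, leader: leader}

	follower.Submit("SET x=5") // forwarded to the leader
	leader.Submit("SET y=4")   // appended directly

	fmt.Println(leader.log) // [SET x=5 SET y=4]
}
```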
Raft can be viewed as a specialized case of MultiPaxos tailored for log replication.
Raft is used in plenty of production systems (ex. `etcd`).

Paxos with a guaranteed leader node through which all writes flow – unless the leader fails, in which case one regular Paxos instance happens to elect the next leader.
My biggest initial impression from the EPaxos paper was that the solution space was starting to run out of air – more and more concessions and environment/problem-space tweaks seemed to be getting made in order to make Paxos more efficient on some axis or another. Here’s a quick list of the ones that EPaxos requires/introduces:
The second point (of finer grained command distinction) actually parallels some thinking I’ve been doing lately about treating applications as proper state machines to achieve application-level distribution/consensus. This also meshes with the well-known practice of Command Query Responsibility Segregation (CQRS) – sometimes clients don’t care about the result of their command (or at least they can pretend it happened locally); sometimes they just want to fire-and-forget commands.
Returning to EPaxos, however, the basic thrust is that by distinguishing the commit and execution times of a particular command, you can ensure that you hit the conflict-free case much more often. EPaxos takes techniques from other Paxos derivatives and combines them to produce a pretty good system. In particular, given that you’ve met all the restrictions above, it:
Along with the paper there’s the `go-distributed/epaxos` repository, which contains an implementation of EPaxos written in Golang. It also contains implementations for Mencius and Generalized Paxos.
Multiple leaders, writes at any replica (batching), multiple instances (pipelining), a sprinkle of CQRS, and accurate metadata on how commands interact – all adding up to performant Paxos even over WAN.
The FPaxos (Flexible Paxos) paper is a bit of a step forward and a step back at the same time. Although it requires a single elected leader/distinguished node, it introduces flexible quorums – the novel observation that simple majority quorums are actually not needed at every step of a given instance/round of Paxos. This is a subtle but huge step forward for the protocol in that it makes the fast (happy) path even faster if things go right.
The idea is that there must be an intersection in the nodes required for quorum across leader switches. So when a brand new instance of Paxos (cluster-wide) starts, a leader is elected, and it can skip a full majority for the writes it performs/accepts, but the requirement is that the next leader must pick a (possibly less than simple) majority that includes at least a single node from the previous instance. If you sprinkle on a bit of math, you get the concept of grid quorums – a way to ensure that quorums with the right acceptors are picked.
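The core safety condition is easy to state in code – assuming simple counting quorums, a leader-election (phase 1) quorum and a replication (phase 2) quorum are fine as long as they’re guaranteed to overlap. A tiny sketch of that check (my own illustration, not code from the paper):

```go
package main

import "fmt"

// validQuorums reports whether the chosen phase-1 (leader election) and
// phase-2 (replication) quorum sizes are safe under Flexible Paxos:
// any phase-1 quorum must intersect any phase-2 quorum, which for
// simple counting quorums means q1 + q2 > n.
func validQuorums(n, q1, q2 int) bool {
	return q1+q2 > n
}

func main() {
	n := 10
	// Classic Paxos: majorities everywhere.
	fmt.Println(validQuorums(n, 6, 6)) // true
	// Flexible Paxos: a tiny replication quorum, paid for with a large
	// election quorum, so the steady-state write path touches fewer nodes.
	fmt.Println(validQuorums(n, 9, 2)) // true
	// Unsafe: the two quorums might not overlap at all.
	fmt.Println(validQuorums(n, 5, 5)) // false
}
```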
This paper also introduces the `fpaxos/fpaxos-lib` repository, which contains runnable code from the paper.
F isn’t for fast but flexible – make sure one node stays in the quorum across leader changes.
The KPaxos (Kernel Paxos) paper’s contribution is to move Paxos computations into the kernel itself. I didn’t really read much of this paper, so I’m going to have to leave anything that passes for deep analysis to someone else. While there is a performance gain, I’m not sure I really want code like this moved into the kernel until it’s bulletproof. I’m no kernel developer, but this seems like a pretty intense way to solve this problem. I’d personally rather that Paxos algorithms be implemented as a layer in a unikernel system than be patched into a monolithic kernel.
Despite my reservations, it seems like this approach might be best suited for same-machine Paxos clusters, which might prove useful for local databases? I’m not quite sure. The performance gains are undeniable though:
The `esposem/Kernel_Paxos` repository is included along with the paper if you want to check out the code.
Let the kernel do your Paxos for you
As the theme of solution-space asphyxia continues, the WPaxos paper is a pretty complicated solution that introduces a bunch of new requirements in order to gain an edge. The basic gist of WPaxos is an automatically partitioned, rebalanced, and replicated Paxos configuration. While that’s quite a mouthful, it’s basically a combination of all the best features we’ve seen so far (batching, pipelining, multiple leaders, flexible quorums), but with the addition of features like work stealing and enforcing network closeness between certain nodes and certain leaders.
Here are a few techniques and requirements that WPaxos introduces to gain an edge:
There is a lot happening here – a lot of additional complexity and some pretty startling/hard to achieve requirements. The payoff, however, is some pretty amazing real world performance and consistency that actually beats KPaxos and some of the other advanced approaches (like EPaxos) on some metrics:
I must admit that reading the ideas in this paper reminded me very much of what I envisioned when laying out my ideas behind a distributed SQLite DB (which was not quite the same as rqlite or dqlite). My main contribution to the state of the art (`rqlite` and `dqlite`) was going to be write-anywhere functionality backed by automatically partitioned, rebalanced, and replicated data. Nice to know I was on a track, maybe even the right one – though of course WPaxos goes way further than my scribblings and ruminations – just reading this paper was enough to convince me that things were a lot more complicated than I thought.
Back to WPaxos though – it’s pretty clear that it’s an improvement over EPaxos with the exploitation of data locality (the closeness requirement) when possible. Enabling dynamic repartitioning and work stealing between leaders also increases the likelihood of hitting that happy path case. The per-object commit log and one-object-per-command restrictions would normally be dealbreakers, but there are some cases where they fit naturally – a distributed key/value store, for example. While I don’t know of many key/value stores that work this hard to be consistent (most just make do with eventual consistency), I guess it would be beneficial.
Runnable Golang code is also available at the `ailidani/paxi` repository. As a side note, this paper also mentions and dives into FPaxos, so it’s a good place to pick up some details.
All the advanced features (multiple leaders, flexible quorums, etc.), along with dynamic partitioning and work stealing to keep data localized – nearly production-ready except for restrictions like commands only touching one object at a time.
In somewhat stark contrast to the WPaxos paper, the CASPaxos paper makes a much simpler contribution. The gist is the use of Compare-And-Swap semantics to build a Paxos-replicated system that doesn’t send operations but actually sends state transitions. CASPaxos is actually more of a return to the Synod (AKA Single Decree Paxos) algorithm (which focuses on managing a single piece of data with Paxos), which is also explained in a nice table on pages 4 and 5 of the paper. The basic idea is that since the problem has been rephrased such that clients contribute state transitions (i.e. a function `state0 -> state1`), no two clients with conflicting transitions could ever both succeed – once one of them succeeds, all other commands based on the older state are ignored (the compare-and-swap semantic at work).
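As a toy, single-node analogy of that compare-and-swap flavor (this is not the actual CASPaxos protocol, which runs the CAS across a quorum of acceptors using ballot numbers – the names here are all mine):

```go
package main

import "fmt"

// Transition is a client-supplied change function: it takes the current
// state and returns the next one (the "state0 -> state1" idea).
type Transition func(state int) int

// Register is a toy single-value store with compare-and-swap flavored
// acceptance: a change only lands if the caller saw the latest version.
type Register struct {
	version int
	state   int
}

// Apply runs the transition only if the caller's view of the version is
// current; otherwise the change is rejected and must be retried.
func (r *Register) Apply(seenVersion int, t Transition) (ok bool, newState int) {
	if seenVersion != r.version {
		return false, r.state
	}
	r.state = t(r.state)
	r.version++
	return true, r.state
}

func main() {
	reg := &Register{}
	ok, s := reg.Apply(0, func(s int) int { return s + 5 })
	fmt.Println(ok, s) // true 5

	// A second client that read version 0 before the update loses the race.
	ok, s = reg.Apply(0, func(s int) int { return s * 10 })
	fmt.Println(ok, s) // false 5
}
```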
Personally, this paper was really enlightening for me and led me to EPaxos and WPaxos as places to look at what this paper was building on. I’d heard of EPaxos but not WPaxos in the past, and CASPaxos builds on tricks from both, so it was a great find. CASPaxos allows for a non-sharded multi-master mode, which neither EPaxos nor WPaxos could achieve, which is pretty impressive. A CAS has to be set for every value you store (you can use a monotonically increasing number).
At this point we’ve seen a bunch of papers that came with code, but the `gryadka/js` repository that comes with CASPaxos is even better than usual – it actually gets performance similar to etcd itself, while being an actual master-master replicated store with full consistency, and it comes with a TLA+ specification. `gryadka` loses to etcd only on the recovery-time metric, based on the measurements in `rystov/perseus`, but it wins on every other metric.
Another really cool thing about CASPaxos’s implementation `gryadka` is that it’s composable – it was built to work on top of some underlying storage (a local `redis` instance in the reference implementation). This is cool to me because it’s another idea I’ve been playing around with and thinking about lately – application-level consensus. Of course, databases are just applications themselves, but they seem to be where we first try to solve distribution/scaling issues, when it might actually be better (and make for purer microservices) to solve consensus at the application level, above the database itself. The fact that `gryadka` uses this approach and beats a minimal purpose-built tool like `etcd` (which has a full company and lots more resources behind it, and is the industry standard) is an amazing result.
(`gryadka` manipulates `redis` underneath.)

Use the best features of EPaxos, but make all clients submit state transformations (like CAS on a hardware register), and periodically garbage collect when people delete.
The SDPaxos (Semi-Decentralized Paxos) paper is the most recent Paxos-family paper I’ve seen, and it’s proof that there are still interesting things happening in the field. There’s an excellent introductory blog post by Murat Demirbas over at SUNY Buffalo that’s worth a read, but I’ll also take a crack at explaining it here.
The general gist behind SDPaxos is to avoid the drawbacks of both leader-based and leaderless approaches by striking a balance in the middle – designating per-command leaders, plus a single ordering leader. Per-command leaders first appeared in EPaxos as far as I can tell, and leaders that only worry about ordering first appeared in SPaxos (writes got batched at all replicas and sent to the leader to be ordered). This mix of both approaches means that by splitting up command replication/consensus and ordering, SDPaxos lets replication happen as fast as possible, but leaves ordering to one node to gain easy serializability. The split goes even deeper, because the command-related and ordering-related instances/consensus can actually happen completely independently, in parallel on the same machine.
As called out in the paper, SDPaxos improves on the weaknesses of a bunch of other approaches:
The performance is also pretty impressive when compared against these three approaches:
SDPaxos solves the traditional general state-machine-replication-through-logs problem (not the slightly different Synod/Single Decree Paxos problem which CASPaxos solves), and all the advances of previous approaches should apply to SDPaxos, though they haven’t necessarily been applied in the paper as it stands. In particular this approach might look even more interesting with the advances of WPaxos like automatic sharding/partitioning, though the increased complexity would make it very hard to implement. Theoretically there’s nothing stopping SDPaxos from utilizing the improvements introduced by WPaxos or any of the other Paxos-based algorithms (except maybe CASPaxos).
Runnable Golang code is also available at `zhypku/SDPaxos` on GitHub.
Single leader serializability, multiple command instances and a single serializer for super fast reads and writes, at the cost of lots more complexity.
Speculative Paxos (2015) focuses on data-center-specific optimizations that didn’t interest me too much (since I don’t run a data center).
NOPaxos (2016) focuses on a new network primitive and some other data-center-specific optimizations that I didn’t find interesting (also because I don’t run a data center).
There is a lot of other research/tech in the distributed systems space, so I’ll take a few seconds to write about some things that have sparked my interest.
Gossip, popularized most recently by the SWIM paper, is another simpler but surprisingly effective approach to distributed (eventual) consistency. It’s up to users/implementers to determine when they need eventual consistency versus the ordering +/- serializability that stricter algorithms can provide, but gossip is much simpler to understand.
SWIM improves on the basic gossip idea by picking the nodes you gossip to randomly, and by piggybacking messages about cluster state on the health checks the system has to do anyway. Through the magic of statistics and probability, this reduces the number of messages you need to send (naive gossip’s biggest weakness) while still ensuring that nodes see updates reasonably fast. What’s more, SWIM’s approach doesn’t add much complexity to an already very simple algorithm, so you’ve got a good chance of implementing it correctly.
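A sketch of those two tricks together – random peer selection plus piggybacking membership updates on the probe message. All of the names are made up for illustration; this is not the `memberlist` API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Update is a piece of membership news ("node-5 joined") that gets
// piggybacked on health checks instead of broadcast separately.
type Update string

// Ping is the periodic health-check probe; in SWIM-style gossip it carries
// a few recent membership updates along for the ride.
type Ping struct {
	From      string
	Piggyback []Update
}

func main() {
	peers := []string{"node-1", "node-2", "node-3", "node-4"}
	pending := []Update{"node-5 joined", "node-6 suspected"}

	// Pick a random peer to probe, and attach pending updates to the probe.
	target := peers[rand.Intn(len(peers))]
	msg := Ping{From: "node-0", Piggyback: pending}

	fmt.Printf("probing %s with %d piggybacked updates: %v\n",
		target, len(msg.Piggyback), msg.Piggyback)
}
```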
Production-ready implementations exist, for example HashiCorp’s `memberlist` Golang library, which is used inside their widely-used Consul product.
Once you’ve started loosening your ordering/serializability requirements, you’ll no doubt land on Convergent Replicated Data Types (CRDTs) eventually – as conflicting values fly around your distributed system, the easiest way to deal with them is often Last Write Wins, but the key innovation of CRDTs is introducing and defining the formal concepts/theory around ensuring a data type is always cleanly mergeable. The idea is that if you have two values you should be able to combine them and know that they will converge to a well-known/predictable state. Put simply, CRDTs answer the question “what if we had data types that guarantee automatic conflict resolution?”.
The classical example of a CRDT is a G-Counter, basically an incrementing counter. It’s fairly trivial to understand on the surface – if you’ve got values that are doing nothing but incrementing, it’s pretty easy to always resolve conflicts – just take the `max`.
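A minimal G-Counter sketch (the usual formulation gives each node its own slot so increments from different replicas never clobber each other, and merging is an element-wise max):

```go
package main

import "fmt"

// GCounter is a grow-only counter: each node increments only its own slot,
// and merging two copies takes the element-wise max, so merges commute
// and always converge.
type GCounter map[string]int

func (c GCounter) Increment(node string) { c[node]++ }

func (c GCounter) Merge(other GCounter) {
	for node, n := range other {
		if n > c[node] {
			c[node] = n
		}
	}
}

func (c GCounter) Value() int {
	total := 0
	for _, n := range c {
		total += n
	}
	return total
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Increment("node-a")
	a.Increment("node-a")
	b.Increment("node-b")

	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // 3 3 — both replicas converge
}
```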
A slightly more interesting example is sets – lists without duplicates. The G-Set (Grow-Only Set) is easy to merge, because in the end all you need to do is union the two sets – duplicates disappear by definition and you don’t need to worry about conflicts. An even more interesting example is the “OpSet” – you can easily imagine storing things like Write Ahead Logs (WALs) by treating them as sets of operations. Of course things aren’t quite that easy, but it’s the kind of idea that, once worked through, provides you with databases offering strong eventual consistency at very little cost.
There are lots of talks out there on CRDTs, here are a few I’ve personally enjoyed:
Classically there are two types of CRDTs, operation-based CRDTs (CmRDTs) and state-based CRDTs, and the difference between them is exactly what it sounds like – with operation-based CRDTs you ferry around messages like `ADD 1`, and with state-based CRDTs you ferry around the entire state, like `{a: 1, b: 3, c: -1}`. It turns out they’re actually both reducible to just operation-based CRDTs, an idea introduced by the Making Operation-Based CRDTs Operation-Based paper, which introduces “pure operation-based CRDTs”. It should be pretty easy to see why: since the definition of “operation” is so wide, you could trivially represent any state change as an operation (for example `STATE_CHANGE {a: 1, b: 3, c: -1}`) and pass those messages around. In the general scheme of things, state-based CRDT use should probably be somewhat limited though, as the network isn’t free.
Paper: http://www.bailis.org/papers/ramp-sigmod2014.pdf
Some rough notes that I took on RAMP, though I didn’t get a chance to fully digest the ideas:
One more thing I noticed was the number of times Golang was being used for experimentation. I don’t think Golang is a particularly good language in the classical computer science-y sense, but it sure does “do what it says on the tin”. It set out to be simple, performant, and bring a decent feature set without requiring too much runtime tweaking/mastery, and it is delivering. It’s very popular in the “devops” space where scripting languages were king, and big important tools are being written in it at an alarming pace and put into production, and the world hasn’t burned down (which is a good sign for a language this young). Golang is already outperforming Java in some places, which is crazy given the number of man-hours that have been put into Java. If researchers are also finding Golang to be a nice language with a feature set they can work with, it’s likely going to start upending a lot of other languages that were traditionally used. I’ve said it multiple times on HackerNews, but I think Golang is going to replace Java within 10 years.
I’m not a researcher, but I assume the logic is something like this: the languages you might otherwise prototype in make concurrent, networked code awkward (in Python you’re reaching for `multiprocessing` or `asyncio`), while Golang gives you that more or less out of the box, along with reasonable performance.
Of course I have no basis for the fake situations/musings above, but they seem vaguely plausible. Either way, it looks like Golang (and of course a little bit of JS, with the gryadka code I thought was interesting) is where a lot of this experimentation is happening.
This article was really fun to write, and a good way to get my current understanding of Paxos-based algorithms out on virtual paper somewhere. Paxos has been through a lot of iterations, and a lot of shifts in thinking as the state of the art moved around. A few trends emerged that I can clearly make out now:
Along with these basic takeaways, the combination of these ideas with the other techniques emerging in distributed computing is really exciting, even just adding CRDTs and SWIM seems like it could produce a whole new breed of systems – merge without worries, and function in modes across the range from eventual to full consistency.