Research Idea: Maintaining A Write Ahead Log For Application Runtime State

tl;dr: What if there was a write-ahead-logging layer above/outside the database? Then you could get CQRS+ES capabilities and assurances from your database without rewriting it, and even abstract out the database as a whole, going as far as to create your own DSL for application-level changes: ADD_USER instead of INSERT ... INTO USERS ...

I spend a bunch of time thinking about how I can waste less time writing applications. Over time I’ve worked on various prototypes, with varying levels of completeness, but as I move across languages, frameworks, databases, and stacks, I always find myself solving the same problems, and having the same worries. I gradually discover interesting ways to solve the problems in different languages as I go (or fall back on some previous approach I’ve used), but a large part of me pines for a cross-language/platform/whatever way to solve all the things, in development, production, test, wherever.

An example of a piece of tooling that bridges this gap between languages/etc consistently is entr – it’s a simple, small project that does something popularized by tools like grunt and gulp in JS land, but that everyone grew to like – recompiling or running some arbitrary commands when a file changes. Another somewhat more canonical case of a tool that bridges these gaps is good ol’ GNU make. I use make across various projects, and am able to abstract away from the language/platform/whatever-specific commands that actually build the project (from npm install to stack build to pip install to bundle install, etc.), and give my projects a consistent feel – make, make test, make build, make deploy SSH_ADDR=<path-to-server>, make publish.

Recently, I spent some time thinking about all the things that come into play when building the backend of an application. There are so many concerns to worry about, outside of just setting up the data schema and pulling data. I wrote down some of what I think it takes to get to a bullet-proof backend (less in terms of reliability and more in terms of consistency – knowing what it’s doing and being confident about it). I wanted to think about what it would really take to remove 99% of the nail-biting/nervousness from deploys.

I’m going to basically go into a lot of context/internal thinking and then come back to the original idea towards the end, so buckle in (or skip to the end)!

Step 0: Regular backups

The importance of these can’t be overstated once you’re at any kind of scale or have active users.

Backups (and a tested way to restore from them) mean that neither malicious/incompetent devs (including future you) nor stray DROP TABLEs scare you. That’s a nice starting level of confidence in your operation.

Step 1: DB versioning & migrations

If you use some form of database (whether SQL-compliant, noSQL, or whatever else) that lets you model your data, the one thing you can likely count on is that the model is going to change. If you make enough tables (basically one), there’s an ever-increasing chance that at some point in the future, someone is going to want to add a column, or remove a column, or update an index, or do something else to the database in relation to that table. Note that for noSQL databases, these worries just move out of the database and into your application code – whether this is the wrong or right tradeoff depends on the project.

This stuff is pretty standard, so I won’t talk about it too much – database versioning and migrations are pretty pervasive in modern software development.
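
For reference, the usual shape – regardless of language or framework – is a pair of up/down steps keyed by a version number. Here’s a minimal TypeScript sketch, assuming a hypothetical runner that hands each migration a sql helper and applies migrations in order (the Migration interface, the sql parameter, and addEmailToUsers are all made up for illustration):

// A minimal sketch of versioned migrations, assuming a hypothetical runner
// that applies each migration exactly once, in order.
interface Migration {
  version: number;
  up: (sql: (query: string) => Promise<void>) => Promise<void>;
  down: (sql: (query: string) => Promise<void>) => Promise<void>;
}

const addEmailToUsers: Migration = {
  version: 2,
  // Forward migration: add the new column.
  up: async (sql) => {
    await sql("ALTER TABLE users ADD COLUMN email TEXT");
  },
  // Rollback: remove it again.
  down: async (sql) => {
    await sql("ALTER TABLE users DROP COLUMN email");
  },
};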

Step 2: Schema-level versioning & migrations

While you could arguably make a case for this being so intricately related to DB versioning as to be the same thing, schema-level versioning is row/document-level versioning for items in your database. Columns/properties like created_at and updated_at are extremely common because of how useful they are – they let you know when something was changed or when it was first inserted. Why not also have versioning that reflects the schema version at the row level? When a DB version is updated, there are often changes to the underlying schema that may or may not break past versions. One of the most common ways to deal with this is to just write a migration that does all the necessary conversions to whatever format is newest.

While convert-to-newest is a pretty easy-to-implement solution, it kind of leaves you in a quandary when dealing with live web applications – if you’re doing something like a rolling deploy to avoid downtime, what happens to the old instance, running old code, that expects the freshly updated database to still hold an older version of the data? There are some wrinkles here that could certainly use some ironing.

At this level, what could really help would be a pattern of defining transformations from one version to another, in part enabling arbitrary versions of the application to gracefully handle picking up older data or newer data from the backend, as requested. Here’s an example:

car-model.json

{
  "make": String,
  "model": String,
  "productionYear": Number
}

Let’s say that after this model has been in production for a little while, we realize that productionYear should actually be a possibly non-contiguous range of years. What to do? productionYear as a whole is now kind of wrong (at the very least it really meant firstProductionYear, and it isn’t how we want to think of the data), and we have to add at least one new field to represent the better way to model this data. Today I think most software handles this in the following way (a rough sketch follows the list):

  1. Write a migration that creates the new properties/columns
  2. Write a script that will do conversions on the existing database (possibly part of the migration)
  3. Either write specific code that can handle both types, or kill the old servers and replace them with the new ones, allowing (hopefully short/temporary) failures on the old nodes.
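
As a rough sketch of that convert-to-newest approach, the migration might boil down to something like the following – the table/column names are made up, and the SQL assumes Postgres-style array columns:

// Convert-to-newest: add the new column, rewrite every existing row to the
// newest shape, and eventually drop the old column – old readers are left
// to fend for themselves.
const convertToNewest: string[] = [
  "ALTER TABLE car_models ADD COLUMN production_years INTEGER[]",
  "UPDATE car_models SET production_years = ARRAY[production_year]",
  "ALTER TABLE car_models DROP COLUMN production_year",
];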

I think if this car-model.json was expanded to always include a version following this pattern, the process could look like this (a sketch of the transformations follows the list):

  1. Allow clients to specify what version of the data they want (and of course be explicit about it in the application itself)
  2. Write a migration that creates the new properties/columns without removing the old ones, with a new version of the old class
  3. Write transformations between old-class-version-x and old-class-version-x+1 in both directions
  4. Write and run a migration – the big difference here is that both old and new services should be unbothered by the different data versions.
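
To make step 3 concrete, here’s a minimal TypeScript sketch of what those bidirectional transformations might look like for the car model above – the version numbers, type names, and the choice to map old clients to the earliest production year are all illustrative assumptions:

// Two versions of the same model, distinguished by an explicit version field.
interface CarModelV1 {
  version: 1;
  make: string;
  model: string;
  productionYear: number;
}

interface CarModelV2 {
  version: 2;
  make: string;
  model: string;
  productionYears: number[]; // possibly non-contiguous
}

// v1 -> v2: the old single year becomes a one-element list of years.
function upgradeCar(car: CarModelV1): CarModelV2 {
  return { version: 2, make: car.make, model: car.model, productionYears: [car.productionYear] };
}

// v2 -> v1: lossy but reasonable – old clients see the earliest production year.
function downgradeCar(car: CarModelV2): CarModelV1 {
  return { version: 1, make: car.make, model: car.model, productionYear: Math.min(...car.productionYears) };
}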

Assuming there is a reasonable (or near-reasonable) transformation in both directions, code written this way could avoid downtime as well as errors for old clients. Put simply, this just makes a consistent/repeatable pattern out of the very specific code you’d otherwise write to gracefully handle post-migration data queries, but I think it increases confidence in the backend. Knowing that this is the process for this use case (“what happens when a migration changes the underlying database for services/clients that rely on it?”) gives me a lot of confidence that rolling deployments would be truly problem-free, on the database side at least.

There’s another bit of versioning that could be useful but hasn’t been enshrined – why not have atomic versioning for all the objects as they change over time? Whether it’s a monotonically increasing number or a SHA sum, having a property like version is usually useful for reducing data sent to frontends, as they can check the version of some resource before performing the relatively expensive call to retrieve it (assuming they have version 23 already in local cache, there’s no reason to retrieve the whole record again).
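
A small sketch of that check-before-fetch idea, assuming hypothetical endpoints that expose a cheap version lookup next to the full resource (the URLs and response shapes are invented for illustration):

// Cache keyed by resource id; the cheap /version call decides whether the
// expensive full fetch is needed at all.
const cache = new Map<string, { version: number; body: unknown }>();

async function fetchIfStale(id: string): Promise<unknown> {
  const cached = cache.get(id);
  const { version } = await (await fetch(`/api/cars/${id}/version`)).json();
  if (cached && cached.version === version) {
    return cached.body; // nothing changed, skip the full retrieval
  }
  const body = await (await fetch(`/api/cars/${id}`)).json();
  cache.set(id, { version, body });
  return body;
}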

I’ve never seen either of these approaches used/enshrined in a large-scale codebase in the wild before, and am starting to want to.

Step 3: Testing contract boundaries between services

Micro-service-driven architecture is all the rage these days, and one of the things that doesn’t get tested as much as the applications themselves, but possibly should, is the contracts between the services. As services multiply and data grows, the number of schemas, representations, and clients that have to be kept compatible grows rapidly.

Someone needs to come up with some sort of end-to-end integration testing layer that caters well to how microservices are often architected. Some ideas in this area (a rough sketch of the contract-checking idea follows the list):

  • Does a meta service that orchestrates contractual obligation checks make sense? If one service depends on another, should some service (let’s call it “overseer”) be notified? Maybe that service could even notify relevant engineers of the change.
  • Should an “overseer” type service gate deploys of lesser services? Maybe it should be impossible for you to deploy Service B if it removes/changes APIs that Service A depends on? If both services lay out their contracts in some cross-compatible manner, it would be trivial to verify (of course, creating the standard is not so trivial).
  • Build-time checks for contracts? A quick check of other repositories or even this “overseer” service would let developers know closer to build time what services they had a chance of breaking with their changes. Right now this is handled by basically lots of talking between engineers in a given department, and that’s IMO pretty wasteful and unnecessary – my goal isn’t always to build knowledge about the years of decisions in a codebase, it’s to ship a feature, WITHOUT breaking another service I know nothing about.
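
Here’s a rough sketch of what such a contract check might look like: each service declares what it provides and what it requires, and a deploy is rejected if it would remove an endpoint another service still depends on. The contract format and the brokenDependencies helper are invented for illustration:

// A declared contract: what a service exposes, and what it needs from others.
interface ServiceContract {
  name: string;
  provides: string[]; // e.g. "GET /users/:id"
  requires: { service: string; endpoint: string }[];
}

// Given the contract of the service being deployed and the registry of all
// current contracts, list every dependency the deploy would break.
function brokenDependencies(deploying: ServiceContract, registry: ServiceContract[]): string[] {
  const stillProvided = new Set(deploying.provides);
  return registry
    .filter((svc) => svc.name !== deploying.name)
    .flatMap((svc) =>
      svc.requires
        .filter((req) => req.service === deploying.name && !stillProvided.has(req.endpoint))
        .map((req) => `${svc.name} still requires ${req.endpoint}`)
    );
}

// A non-empty result is exactly the signal an "overseer" could use to gate
// the deploy or notify the relevant engineers.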

Step 4: Immutable databases

Why not immutable databases? They basically force/ensure a lot of these patterns (except contractual boundary testing), but it looks like they haven’t reached mainstream adoption, and some people don’t even seem to be having a good time using them.

Understanding CQRS + ES

CQRS can mean a lot of things to different people, but at its heart is the separation of code that produces side effects (mutates something in the environment) from code that tells you about the environment. Basically, pseudocode like x = x + 1 and get_value(x) should always be in separate functions. This approach shares some common ground with languages like Haskell in encouraging referential transparency and making it very clear when things can have side effects. It’s important to note that code that has no side effects AT ALL is very likely not useful.

ES (Event Sourcing) can be pretty loosely described as implementing your database as a list of the things that happened to it. Think of a shopping or TODO list. For most use cases, if you manage that kind of list with a pencil, you can completely erase previous states – you don’t know what was on the list in the spot you put “eggs” into, because you erased it before you put “eggs” in. The ES approach essentially suggests that you use a pen instead. This way, all changes exist forever (assuming you cross them out reasonably of course, with just a single stroke or whatever) – you can always see what was removed, what was added, what was changed. Of course, this means that the current state of the list is somewhat harder to reason about – you have to purposefully train your eyes to ignore the struck-through items, and maybe even parts of items that were struck out.
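
The pen-vs-pencil analogy translates fairly directly into code. Here’s a minimal TypeScript sketch of an event-sourced TODO list – the event shapes and function names are made up – which also shows the CQRS split: commands only append events, and the query is a pure projection of the log:

// The append-only log: the "pen". Events are never edited or deleted.
type TodoEvent =
  | { kind: "added"; item: string }
  | { kind: "removed"; item: string };

const log: TodoEvent[] = [];

// Commands: produce side effects (new events), return nothing interesting.
function addItem(item: string): void {
  log.push({ kind: "added", item });
}

function removeItem(item: string): void {
  log.push({ kind: "removed", item });
}

// Query: derive "what's on the list right now" by replaying the history,
// ignoring everything that was later crossed out.
function currentList(): string[] {
  const items = new Set<string>();
  for (const event of log) {
    if (event.kind === "added") items.add(event.item);
    else items.delete(event.item);
  }
  return [...items];
}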

The theoretical case for CQRS + ES

There’s a lot to consider when using these approaches together, but for my purposes, it seems like almost the perfect setup. With appropriate backups, the combination of CQRS and ES gives you the best tools to reason about your database at rest, during updates, and during a wide range of circumstances. You can speak with a certain amount of confidence that just isn’t possible with many other approaches. It’s extremely difficult to answer a question like “when did this object last change” with much certainty in traditional mutable-state approaches to databases because it’s hard to completely trust a field like updated_at. Did it trigger properly? Did someone go in and change it after the fact? Did some database quirk make it update incorrectly?

Unfortunately, a lot of people have tried CQRS+ES (weirdly, a lot of them seem to be in the Microsoft-powered/.NET space) and come away frustrated – a quick look around HN reveals some common complaints. But it still seems perfect to me… Maybe I could think about it in the context of the flexible-schema problem I discussed in Step 2 and see what solutions it could offer?

Problem

A new application with a different schema/database version needs to be deployed, but the current app runs on the old version.

Solution without CQRS+ES

See Step 2

Solution with CQRS+ES

With CQRS+ES, you have a few more options:

  • Don’t allow the old application to ever see into the future where things have changed
  • Use CQRS+ES versioning schemes to mark schema changes, and just follow the strategy laid out in Step 2

In addition to the immutability that CQRS+ES provides, it seems to offer more options and more peace of mind – but it also seems like no one enjoys actually running a CQRS+ES backend these days. Maybe the answer is to extract the desirable properties of CQRS+ES and tack them onto databases that people do want to use?

Back to the idea

What if a system was built that basically externalized the concept of a write-ahead log, and in return for what are potentially large performance penalties, gave you CQRS+ES-level certainty about how your database was mutated, while still allowing you to access your databases the way you normally do? This isn’t much different from how CQRS+ES is done today (there’s often/always a virtual/materialized view that is projected from the underlying data), but I think the subtle difference of adapting CQRS+ES to traditional approaches rather than the other way around is significant. In the same vein, this feature (write-ahead logging, basically) is also implemented in just about every database, as a way to supply ACID guarantees.

So why is this still worth thinking about? It’s basically a part of both solutions (CQRS+ES backends and regular backends) already. I think the addition here is the mechanism that could be put around the externalized write-ahead log – one could choose and finely configure the mutations allowed onto one or more database nodes by shuffling the writes through a database-adjacent service.

Let’s call this imaginary service “WAL-y”. Look at it this way – if I now make my database service Postgres + WAL-y, and start to feed my writes to Postgres through WAL-y, I get the benefits of CQRS without the cost of a change in paradigm. WAL-y would silently record all the changes, and do whatever was necessary to ensure consistency (likely this would need to be configured), possibly to multiple nodes at once. Once you start distributing the system, you run into classical coordination (quorum-seeking) and performance degradation problems, depending on how consistent you want WAL-y to be. I think those problems have been reasonably solved at this point – the research is already in the mainstream, and for a lot of applications the peace of mind is well worth it.
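
As a sketch of what that write path might look like – the LoggedWrite shape and the appendToLog/applyToPostgres parameters are placeholders, not a real API, just one way the idea could be expressed:

// Every mutation is appended to an external, append-only log before it is
// allowed to touch the real database.
interface LoggedWrite {
  sequence: number;  // monotonically increasing position in the log
  issuedAt: string;  // when the write was accepted
  statement: string; // the mutation as the application expressed it
}

let nextSequence = 0;

async function walyWrite(
  statement: string,
  appendToLog: (write: LoggedWrite) => Promise<void>,
  applyToPostgres: (statement: string) => Promise<void>
): Promise<void> {
  const entry: LoggedWrite = {
    sequence: nextSequence++,
    issuedAt: new Date().toISOString(),
    statement,
  };
  // The log is the source of truth: only once the entry is durable does the
  // write get applied to the underlying database.
  await appendToLog(entry);
  await applyToPostgres(statement);
}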

Another feature is that WAL-y could actually let you start to abstract your database out of the equation altogether. One could start to think of the operations as application-specific backend translations. INSERT ... INTO .... becomes CREATE_ACCOUNT ... ..., and WAL-y does the work to translate that into whatever the underlying layer needs to do, whether it’s Mongo, or Postgres, or anything else. WAL-y could also become the place that stores the DSL for these operations, as well as the versions and schemas for the data that lives in your application. Data modeling in the native database (Postgres, let’s say) gets you part of the way there, but it just doesn’t deal with migrations and immutability very well; WAL-y could be that layer.
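
A sketch of that translation layer, reusing the ADD_USER and CREATE_ACCOUNT examples from above – the fields and the SQL they map to are invented, just to show the shape of the idea:

// Application-level operations: what the DSL might record in the log.
type AppOperation =
  | { op: "ADD_USER"; email: string; name: string }
  | { op: "CREATE_ACCOUNT"; accountName: string; ownerEmail: string };

// WAL-y owns the mapping to the underlying store; swapping Postgres for
// something else means swapping this translation, not the application code.
function toPostgres(operation: AppOperation): { text: string; values: string[] } {
  switch (operation.op) {
    case "ADD_USER":
      return {
        text: "INSERT INTO users (email, name) VALUES ($1, $2)",
        values: [operation.email, operation.name],
      };
    case "CREATE_ACCOUNT":
      return {
        text: "INSERT INTO accounts (name, owner_email) VALUES ($1, $2)",
        values: [operation.accountName, operation.ownerEmail],
      };
  }
}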

What do others do?

I think one thing missing from all this thought is a real understanding of what others do to solve this issue. I’ve only worked at a handful of companies – surely there are other companies that do it right, and have a principled approach to removing all uncertainty from their backend administration and deployment?