tl;dr - I’ve been doing a lot of work with SQLite lately, using it in one of my projects to try and test it’s limits before I moved to something like Postgres. I wanted to scale the API up and after ensuring that SQLite was thread safe and parallel access safe, I tried but was limited by shared file system mode limitations of my platform (kubernetes). This post details some prior art (rqlite, dqlite) and what I want to make for distributing SQLite, which is expressly not what SQLite is for but seems like it would be fun. This post doesn’t have any actual code in it, or tangible software engineering, so you might want to skip it if you’re looking for deep insights.

I’ve been doing a lot of work with SQLite lately, and I love it’s simplicity and relative performance. Most people would wince when hearing that I was planning to run SQLite in “production” (my app is pretty small), but I am convinced that most apps (especially when starting out), SQLite is more than good enough. Most applications are more read heavy than anything, and read speed is often mitigated not by a faster database, but by aggressively caching data at various levels. SQLite has been a quiet workhorse in various programming environments for years, and it doesn’t get as much credit as often as it should.

All that said, I’m looking to absute/misuse SQLite. The very case that SQLite is not good for is the very case I want to look for. Check out this awesome quote:

Client/server database engines are designed to live inside a lovingly-attended datacenter at the core of the network. SQLite works there too, but SQLite also thrives at the edge of the network, fending for itself while providing fast and reliable data services to applications that would otherwise have dodgy connectivity.

Lately, deploying the application in question on Kubernetes I ran into the issue that since Kubernetes didn’t quiet yet have any RWX (ReadWriteMany) mountable volumes that I could use (that would be as robust as I wanted), I realized that I would need to start looking into how to distribute SQLite across multiple nodes – it’s either that or switch to Postgres before I even get the chance to test the limits of SQLite.

Is SQLite thread-safe? (single process concurrency)

The question here is whether multiple threads in a single process can access and use SQLite, presumably with different connections.

The answer is a resounding yes. SQLite is indeed threadsafe, though they’re pretty against using threads in the first place. The paper detailing why threads should be avoided they link to is also an interesting read, though I will say now that it doesn’t actually pick a paradigm that can beat threads for managing asynchrony, instead detailing the common approaches. I personally really like how Haskell and Golang handle concurrency.

So if you’re keeping score at home, You can write down SQLite is threadsafe, it’s data structures are generally protected by mutexes.

Is SQLite parallel-access safe (multi process parallelism)?

The simple question here is whether SQLite will become corrupted quickly if accessed by multiple processes at the same time. While I was still looking to share the same drive

The answer is a resounding yes. SQLite can be accessed from multiple processes. The better question is how SQLite supports parallel access.

For those keeping score, SQLite is indeed parallel-access-safe from multiple processes, using a model generally like 2 phase commit. In contrast, PostgreSQL uses MVCC which is another interesting approach. There’s an excellent talk by Tejas Manohar outlining MVCC that might be worth a watch if you’re interested, also postgres internals in pictures is also interesting.

So I’m all good to scale my API right?

So at this point, my checklist is all checked:

I changed my k8s Persistent Volume Claim to run in ReadWriteMany (“RWX”) mode
Ensure that Ceph supported fcntl(2), since that’s how SQLite locks
Up the replicas count in my kubernetes deployment
Configure the deployment to use a rolling deploy strategy

Soon I saw the second pod start up, but a funny thing happened – it never got past the Init stage, because it turns out that CephFS (as deployed by Rook) doesn’t support PVCs in RWX mode. Welp that’s unfortunate, looks like I’m stuck with only a single API server for now, which is definitely not what I wanted. If I was using a DB like postgres, this problem would not exist, since postgres is made with the client/server model in mind.

What if I actually want to make a version of sqlite that works equally well when distributed and when not?

You really shouldn’t try to scale SQLite, but…

There are a few choices already out there, namely RQLite and DQLite. Either approach has it’s ups and downs, but basically it looks like the main axes in approach are the following:

Replication strategy => shipping statements vs WAL frames
Multi-master support => how many nodes can write data?

RQLite makes the choice to ship statements, but this means you can’t write statements that use functions like time() or random() – different servers would generate different results, and you’d get inconsistent state instantaneously. DQLite, on the other hand, makes the choice to ship WAL frames, but runs into another issue – you need to patch your SQLite. It remains to be seen whether the patch will be accepted/embraced, I can only find scant conversations about it.

RQLite makes the the choice to allow multiple nodes to write, once the statement is quorum-approved, the statements can be executed at every node. DQLite opts for a master node to do the writes – it uses Raft consensus algorithm which is based on the Paxos family of consensus algorithms. The note from the Github is particularly helpful:

Q: How does dqlite behave during conflict situations? Does Raft select a winning WAL write and any others in flight are aborted? A: There can’t be a conflict situation. Raft’s model is that only the leader can append new log entries, which translated to dqlite means that only the leader can write new WAL frames. So this means that any attempt to perform a write transaction on a non-leader node will fail with a sqlite3x.ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).

While these pieces of software seem great and well-written, I’m not sure they’re the be-all-end-all with regards to the tradeoffs that are involved.

How what I want to make is different

Introduce a new axis: Sharding

I didn’t find any mentions of sharding in either project, and I think this could be a happy medium between a single-writer-only cluster and being able to distribute and/or optimize write load for the cluster.

Writing an extensible coordination layer, allowing for Raft optimizations (batching & pipelining) & trying a Gossip protocol if/when it makes sense

Raft is currently the state of the art as far as I know for consensus algorithms, but it’s beat in some cases by a Gossip-based protocol like SWIM. Maybe it makes sense to coordinate sharding and replication of different kinds of objects (tables, schemas, however the database is sharded), and be able to choose between replication strategies on the fly. Choosing on the fly is a pretty big feature, but at the very least this system needs to abstract coordination in a way that I can at the very least swap out the raft-based approach for etcd or some other consistent store easily.

Memory safety with Rust

Rust has the expressive types, memory safety, speed, and other features that would make it a killer language for this project. Whether I implement this as a SQLite extension or as a standalone process (maybe I could write it such that both are possible?), Rust has just enough Haskell-like type machinery and checking, as well as enough C/C++ to be as fast as possible, yet safe.

Included/automated read/write routing

It should be possible to send a write to a read-only node and have the write request be transparently forwarded to a node that can write for the specific cluster. Write nodes could also proactively forward read requests to other nodes.

Dynamic discovery of other nodes

It should be able to support just specifying ONE other node in the cluster (and maybe a total cluster size) to build a full cluster. I think it makes sense to at the very least wait until the cluster is healthy (and elects the first set of leaders) for the first start before sending any traffic at all.

Discovering other nodes could also be handled abstractly, enabling using DNS, or Kubernetes configs, or Consul, or whatever else people cared to write in support for.

Bundled Admin UI

Since this thing is clearly growing to be a bit more than just humble SQLite, I want to include a nice Admin UI, much like RethinkDB’s to make managing the cluster and viewing it’s state easy, along with a swagger-compliant JSON API where people could send queries or even REST-ful GET/POST/PUTs.

Abstracted data storage/replication mechanisms

It would be cool to abstract the storage/replication mechanism so much so that I could just as easily use SQLite’s Online Backup API or the knowledge that you’re on a shared network disk (ex. CephFS, NFS) to change your storage/replication strategy.

Wrap up

Is this a shit idea? is this a great idea? I want to find out. I’m thinking I’ll call it “feathers” or “flock” or something rather than “xxqlite”. In this post I haven’t done much more than just put out a bunch of ideas I’ve been stewing over for a long time, I can’t wait to actually test out some of these ideas.

I also have no idea when I’ll start on it, but just wanted to at least lay out my rationale somewhere! Hope you enjoyed the read.

VADOSWARE

Living in a yak shaver's paradise.

SQLite Is threadsafe, parallel access safe, but I still want to extend it.

Categories