UPDATE (2021/12/21)

After lots of feedback on Reddit (thanks /u/BucketOfSpinningRust!) and doing some more experimenting and digging, I've updated this post with more information -- new/updated sections are marked "Update".

tl;dr - Tips for running Postgres on (Open)ZFS-on-Linux (ZoL), which I’m nicknaming “PoZoL”, check out the TOC (or skip to the resource list at the bottom) and browse to whatever looks interesting.

This post is a shameless ripoff/reformulation of everything I’ve seen so far across conference presentations, the internet, reddit, forum posts and other places about how to run Postgres on ZFS reliably and with good performance. I’ve been doing a lot of tinkering with my storage setup (deciding on mdraid +/- LVM vs ZFS for direct disk access, Ceph vs Longhorn for distributed storage, etc) since I’m gearing up to offer Postgres as a Service on NimbusWS, but I thought I should share some of the information I’ve managed to gather.

As you might imagine, this post is not exhaustive – there are great companies out there that can take your Postgres installation to the next level:

Delphix (leading maintainers of OpenZFS)
2nd Quadrant (They’ve got a post on ZFS)
EDB
Percona
CrunchyData
CYBERTEC

They’ve almost certainly got lots of hard-won secret sauce tunables over years of experience so obviously you’re going to want to pay them first if you need problems diagnose or sage advice ASAP.

If you’re still here, what we’re going to be talking about is running Postgres on (Open)ZFS-on-Linux I’m going to call the effort “PoZoL”.

ZFS tunables

Setting `recordsize` to `8k`

While the default recordsize of ZFS is 128k, a recordsize of 8k exactly matches the sizing that Postgres’s page size – this is great for postgres since it cuts down on unnecessary writes (especially to solid state disks), and it gets even better with compression made available by ZFS.

One thing I misunderstood about this (and thankfully was corrected by /u/mercenary_sysadmin on Reddit), is that while the default recordsize of 128k can be considered an “upper bound”, random IO in large files is done in increments of recordsize. He’s laid it out pretty nicely so I’ll copy his comment here:

recordsize is only “an upper bound” in the sense that small files get small records. For any workload which does random I/O inside larger files, recordsize is exactly the increment which the random I/O happens in.

If you set recordsize=1M as you originally did, but then test with eg blocksize=4K, you get an amplification factor of 256x… and, even worse, you get a read-modify-write cycle when re-writing blocks.

Let’s say you ask fio to do a bs=4K run when recordsize=1M. When fio wants to write 4K of new data to the middle of a testfile, it first has to read in 1M of data, then modify 4K out of that data, then write all 1M back out again.

If you re-run the test on a dataset with recordsize=4K there is no RMW and no amplification—you ask fio to write 4K of data to the middle of the testfile, it writes 4K of data, done.

Obviously a blocksize=1M is not a reasonable (the tests I ran that he’s referring to are fio tests) but it was a pretty big insight to me to realize that recordsize isn’t automatically pared down. I’ve seen some arguments against modifying recordsize but at this point it looks pretty clear that aligning the storage to the IO you’re going to do (and avoiding write amplification wher epossible) is a good idea.

(optional) Setting `recordsize` to `16k`

A talk from Scale15x in 2017 recommends blocksize=16k. The primary reason here was being able to essentially pre-fault the next page, which is very useful for sequential scans.

Enable `compression`

Regardless of your recordsize, Postgres and your storage density in general will benefit from enabling compression. While compression may increase the CPU resources required by ZFS, it’s worth it, as with compresssion=lz4 (or the newer zstd, you can expect more data to fit in the same space on disk – compression rates can vary hugely but members of the community (giving talks) have reported compression ratios anywhere from 1 to 100 (realistically you’ll probably see 1-4x space savings).

Benchmark (`pgbench`): `recordsize=8k` vs `recordsize=16k`

As this is relatively easy to test, I set up some pgbench marks to figure out if there was a large difference between 8k and 16k. In general the settings were as follows:

pgbench jobs have 4 CPU 8GB
postgres instances have 4 CPU 16GB, with storage of 32GB
full_page_writes=off (because ZFS)
wal_init_zero=off (because ZFS)
wal_recycle=off (because ZFS)

The results for recordsize=8k:

📜️ PGBench output (click to expand)

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 409828
latency average = 29.273 ms
initial connection time = 102.610 ms
tps = 1366.435557 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.606  BEGIN;
         2.105  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.788  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.918  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.225  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.822  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        22.795  END;

The results for recordsize=16k:

📜️ PGBench output (recordsize=16k) (click to expand)

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 506702
latency average = 23.679 ms
initial connection time = 92.940 ms
tps = 1689.255984 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.602  BEGIN;
         1.094  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.757  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.786  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         1.100  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.799  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
        18.530  END;

Wow, a ~20% improvement there – looks like recordsize=16k is the better setting in this very limited set of testing.

Reducing read-ahead

ZFS read ahead should be minimized as much as possible – as a lot of random IO will be happening inside of given files which may not be read completely. This tip was contributed by /u/mercenary_sysadmin and seems to be good advice as well. I’ll let the quote speak for itself:

There are some other tweaks you want with postgres as a workload; readhead should generally be off for one thing (as it should be in any dataset where you’re primarily doing random I/O inside files which are not expected to need to be read in their entirety).

Tuning `primarycache`

primarycache is a ZFS tunable that can be set to all or metadata, and generally the rule seems to be:

If it fits in RAM = primarycache=metadata
If your working set exceeds RAM, primary_cache=all

This goes with a few more modifications to prevent double-caching:

max ARC size = 15%->25%
shared_buffers => physical RAM + 50%

Setting `full_page_writes=off`

This is a pretty common recommendation, for good reason. It comes up in the Scale15x talk from 2017 and lots of other places. ZFS does not need full page writes to prevent torn pages, because ZFS never tears pages (yay for Copy on Write.

Drawback: “interferes” with cascading replication

This does suffer from a drawback that it can interfere with cascading replication – I haven’t run into this issue in particular myself wiht this setup, but as cascading replication itself is pretty niche (and there are a bunch of options/things you can try to do), I didn’t think to look into it too much, or try to reproduce.

Disable postgres checksumming

ZFS is doing all the checksumming, scrubbing and safe writing you’ll need. By default data checksums are off:

By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster.

Disable Postgres compression

Since ZFS will be compressing all the data before it hits disk (and has access to some better compression tech like zstd first), you can pretty safely let ZFS handle your compression as well. Postgres compression may still offer an advantage in certain situations with certain data storage patterns but in general it should be much less advantageous considering the lower layer compression at work.

Tune `wal_init_zero` & `wal_recycle`

In a paper created by Tomas Munro over at Microsoft, there’s lots of insight on how to run Postgres on FreeBSD (which of course utilizes ZFS), but one that stuck out to me was tuning the wal_init_zero and wal_recycle options for Postgres – they help you avoid zero-filling of new files used for Postgres’s Write Ahead Log. The reasoning here is that since ZFS overwrites these anyway, you might as well get yourself a new inode.

Tune `wal_sync_method`, trying `fdatasync`

This tip also comes from Thomas’s paper, but the insight here is that we can skip writing metadata (the difference between fdatasync and fsync), and save ourselves some time writing to disk.

WARNING: Don’t do this if you can’t rely on your drive’s writeback caching

I’ve found out the painful way (by getting instant data corruption on large pgbenchruns) that it may be possible that you can’t rely on your drives writeback cache – fsync must be used in these cases. I found this in particular with consumer-grade Samsung MZVLB512HBJQ disks, but probably best to be wary of it in general, and run some punishing pgbench runs before you put anything in production (and of course, have backups handy.

Drawback: modified time may be lost after crash

As Thomas goes into (read the slides!), there are some issues where modified time may be lost after a crash.

Alternative: `wal_sync_method=open_datasync`

An alternative to fdatasync is to use open_datasync, which is equivalent to trying to write WAL with open() and the option O_DSYNC set. As of the time of this post, ZoL does not support Direct IO (it’s in the work), so Im’ not sure how much benefit there is to this.

Future Alternative: `wal_sync_method=aio_f[sync|datasync]`?

Plugging into the Linux async IO pipeline for WAL syncing seems like a pretty attractive option that will be available in the future.

Increasing Postgres `blocksize`

Vladimir Mihailenco who maintains Bun (A DB client for Postres, MySQL and SQLite) has a great guide and and explaration on PoZoL. His insight flips the recordsize insight in the ZFS section and flips it on it’s head – we focused on decreasing recordsize to match Postgres’s default blocksize, but there’s also the option of actually increasing blocksize on the postres side. This requires rebuilding Postgres but can considerably improve data-hungry queries. A quote:

You can improve this situation by increasing PostgreSQL block size to 32k and WAL block size to 64k. This requires re-compiling PostgreSQL and re-initializing [the] database.

As he notes the principal tension here is that smaller blocksize will usually mean higher Transactions Per Second (TPS). This one I’m not too sure about doing but it’s definitely a fascinating experiment to try someday.

*** Drawback: requires recompile

(Controversial) `zfs_txg_timeout=1` & `synchronous_commit=off` & `logbias=throughput`

This intense bit of advice comes from the Scale15x talk – the idea is that setting the ZFS transaction flushing timeout (zfs_txg_timeout) which is 5 seconds by default to only 1 second gives you lots more performance at the risk of 1 second of data loss (with writes not being written to disk). Not waiting for writes to be served at all at the Postgres level, but allowing for approximately 1 second of possible data loss seems like a pretty reasonable tradeoff, but of course turning synchronous_commit to off is definitely something anyone should think about well before doing.

.ZFS plays into this beause it will never be inconsistent, so you don’t have to worry so much about data integrity. Considering the problems I had with fdatasync I’m a little worried about this but it’s definitely worth trying. Another thing that makes this hard to try is that this is a pool-wide setting – it can’t be set on just one dataset

Drawback: 1s of data loss

Drawback: “horrific” fragmentation

Yet another quote from /u/mercenary_sysadmin (https://www.reddit.com/r/zfs/comments/azt8sz/logbiasthroughput_without_a_slog/):

logbias=throughput with no SLOG will likely improve performance if your workload is lots of big block writes, which is a workload that usually isn’t suffering from performance issues much in the first place.

Logbias=throughput with no SLOG and small block writes will result in the most horrific fragmentation imaginable, which will penalize you both in the initial writes AND when you reread that data from metal later.

Unfortunately I don’t have any direct tests I can show for this (they’d be easy to run, only one ZFS setting needs to be changed), but hopefully I’ll get a chance to test this out. The logic is pretty sound so it looks like logbias=latency results in much better metadta/data spacing.

Setting `logbias=latency` (instead of `logbias=throughput`)

The Bun guide has a great setction on logbias – logbias=throughput looks to cause lots of fragmentation and is generally not a good idea.

Update (2021/12/21): Results from `zfs_txg_timeout=1`,`sync=disabled`,`logbias=latency`

In local testing I’ve found that 16k has the slight edge over 8k, and zstd has the slight edge over lz4, so I wanted to test zfs_txg_timeout as well, and I did – it turns out it also has an edge over the 5s wait (on NVMe, at least) – with no visible downsides (so far). The timeout used to be much longer for ZFS and went from 30 seconds to 5 – moving the rest of the distance from 5 to 1 (and why you should/shouldn’t) seems to be more rarely discussed.

If I had to guess (as I’m not a ZFS expert) the issues would be drive wear, inefficient transaction groupings (maybe when doing particularly large writes), and in general this is probably something that only make sense on NVMe.

Anyway, here are the results for recordsize=16k,compression=zstd,logbias=latency, with zfs_txg_timeout=5 (the default):

pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 1088807
latency average = 11.018 ms
initial connection time = 102.951 ms
tps = 3630.302820 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.585  BEGIN;
         1.141  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.685  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.676  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.775  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.733  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         6.410  END;

And with zfs_txg_timeout=1, sync=disabled (postgres still thinks it’s doing synchronous writes):

kubectl logs job/pgbench
pgbench (14.1)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 1219432
latency average = 9.838 ms
initial connection time = 98.798 ms
tps = 4066.023297 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.640  BEGIN;
         1.134  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.881  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.809  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.891  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.921  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         4.553  END;

About a ~12% improvement! I’m not sure this improvement is quite worth such a drastic change – zfs_txg_timeout is set at the node level so it will affect other workloads. Again to go over the logic:

(Controversial) Re-convincing myself `zfs_txg_timeout=1` and `sync=disabled` is OK in the case of Postgres

The Idea is that ZFS will be lying to postgres about synchronous writes speed (we’ll be essentially writing to RAM via the ARC) but since txg_timeout means writes are flushed every second we actually only have one potential second of data loss.

Normally, even one second of dataloss (that the upstream application thinks was written) would be an issue, but since Postgres has a WAL, if we suffer a crash we will lose 1 second of data, but not completely corrupt our database.Postgres is OK with this data loss because upon restart it will replay the WAL, and ZFS cannot ever be in an inconsistent state (torn pages, etc), so there is virtually no risk outside of 1 second of data loss.

When postgres replays it’s WAL it will have either:

The exact right (there were no new writes in the second before interruption)
An updated WAL that doesn’t match the data files (writes came in but didn’t make it to disk), these writes will be replayed
(impossible) Updated data files that don’t match the WAL – WAL writes go before data writes @ checkpoints

BTW if you want to join me on this dangerous journey, you can set zfs_txg_timeout via the kernel module, or temporarily by echo-ing values to /sys/module/zfs/parameters/zfs_txg_timeout (see the docs for your distribution or this reddit post).

Separate WAL writes into a separate ZFS dataset

The Bun guide covers this prfetty succinctly – by splitting the WAL off into it’s own ZFS dataset, it’s easier to take snapshots of without the rest of the data. In the end PGBackrest, Barman or other backup tooling are probably better to use than simply zfs send/zfs recv but if you find yourself using zfs send/zfs recv then this is worth paying attention to.

General Postgres tuning (in no particular order)

There are lots more tips for just general postgres, but those are mostly not covered here. Here are a few that I came across that are worth a mention. They’re not really related to PoZoL per-say but are great to take note of.

Tuning `shared_buffers`

Pretty obvious of course, but shared_buffers is a big part of tuning Postgres performance – Percona has a fantastic set of slides on it. shared_buffers is probably one of the quickest ways to more perf out there.

Tuning `work_mem`

Another large part of tuning postgres performance which is mentioned in Percona’s set of slides on Performance tuning. work_mem is used for operations like in-memory sorting, which I think most people can spare more than the default (2MB) for!

Tuning `maintenance_work_mem`

Similar to work_mem, maintenance_work_mem is used for maintenance tasks (ex. VACUUM, CREATE INDEX, ALTER_TABLE), and the default is only 64MB. I think most modern databases can afford to spare a bit more memory for such important operations. These days there are lots of things that can be done CONCURRENTLY but this is a relatively free win in my opinion.

Tuning `wal_buffers`

Increasing the amount of shared memory used for WAL data that has not been written to disk (which defaults to 16MB, w/ 8k blocks) can improve the performance of your database if you have a lot of concurrent connections. Bigger WAL can increase your concurrent performance but also makes backup/recovery take longer (bigger files to lug around) so this has to be considered.

Use `pg_repack`

Also in the scale15x 2017 talk, pg_repack allows you to remove bloat (similar to CLUSTER/VACUUM FULL), but with the database online, without holding an exclusive lock on the processed table.

Encode using UTF8, sort using C

Another gem from the scale15x 2017 talk, the insight here is to only enable local changes when you need them.

Tuning `effective_io_concurrency`

This controls the maximum posix_fadvise/pread sequences generated by a query – tuning this should help you avoid stalls and encourage concurrent IO.

Tuning `max_connections`

Obviously, you’re going to want to manage max_connections – the default is 100 (with 3 reserved for super users). There’s some nuance here in how you should tune it, and CYBERTEC has a great post on that

Tuning `effective_cache_size`

Giving Postgres hints on the size of the kernel’s disk buffer cache (and ZFS ARC) can also improve the heuristics and statistics that Postgres will use internally.

Bonus: What discovering silent `fdatasync` data corruption looks like

Alternatively, how to discover that volatile cache writebacks can bite you on consumer-grade hardware.

When I tried out a lot of these tips, I quickly found out that fdatasync wasn’t going to work for me, by running into database corruption with just pgbench. Leaving shared_buffers set to the default value made the corruption pretty easy to trigger. Here’s what the logs look like:

📜️ Postgres block read error output (click to expand)

2021-11-28 16:31:45.700 UTC [299] ERROR:  could not read block 0 in file "base/13754/16431_vm": Bad address
2021-11-28 16:31:45.700 UTC [299] CONTEXT:  automatic vacuum of table "postgres.public.pgbench_history"
2021-11-28 16:32:45.728 UTC [308] ERROR:  could not read block 0 in file "base/13754/16431_vm": Bad address
2021-11-28 16:32:45.728 UTC [308] CONTEXT:  automatic vacuum of table "postgres.public.pgbench_history"
2021-11-28 16:33:45.787 UTC [317] ERROR:  could not read block 0 in file "base/13754/16431_vm": Bad address
2021-11-28 16:33:45.787 UTC [317] CONTEXT:  automatic vacuum of table "postgres.public.pgbench_history"
2021-11-28 16:34:45.835 UTC [326] ERROR:  could not read block 0 in file "base/13754/16431_vm": Bad address

When I saw these messages at first I was really confused because of course the only thing that these messages point to is database corruption, and all I was doing was running dinky old pgbench! A few resources pointed me in a few directions:

https://www.postgresql.org/message-id/CAA-aLv6Dp_ZsV-44QA-2zgkqWKQq%3DGedBX2dRSrWpxqovXK%3DPg%40mail.gmail.com
https://github.com/LiskHQ/lisk-sdk/issues/268 (leads to an issue with their logic/transactions?)
https://dba.stackexchange.com/questions/44508/error-could-not-read-block-x-of-relation-base-y-z

And all roads lead to ~Rome~ database corruption. At first I suspected mdraid since that’s where I was running the tests at the time (to test against ZFS) so I ran it some more:

📜️ more Postgres block read error output (click to expand)

pgbench: error: client 3 script 0 aborted in command 5 query 0: ERROR:  could not read block 263392 in file "base/13754/16404.2": Bad address
pgbench: error: client 14 script 0 aborted in command 5 query 0: ERROR:  could not read block 193763 in file "base/13754/16404.1": Bad address
pgbench: error: client 7 script 0 aborted in command 5 query 0: ERROR:  could not read block 209098 in file "base/13754/16404.1": Bad address
pgbench: error: client 27 script 0 aborted in command 5 query 0: ERROR:  could not read block 62738 in file "base/13754/16404": Bad address
pgbench: error: client 9 script 0 aborted in command 5 query 0: ERROR:  could not read block 24230 in file "base/13754/16404": Bad address
pgbench: error: client 35 script 0 aborted in command 5 query 0: ERROR:  could not read block 21950 in file "base/13754/16404": Bad address
pgbench:pgbench: error: client 26 script 0 aborted in command 5 query 0: ERROR:  could not read block 53585 in file "base/13754/16404": Bad address
 error: client 17 script 0 aborted in command 5 query 0: ERROR:  could not read block 112867 in file "base/13754/16404": Bad address
pgbench: error: client 31 script 0 aborted in command 5 query 0: ERROR:  could not read block 8971 in file "base/13754/16404": Bad address
pgbench: error: client 32 script 0 aborted in command 5 query 0: ERROR:  could not read block 127138 in file "base/13754/16404": Bad address
pgbench: error: client 34 script 0 aborted in command 5 query 0: ERROR:  could not read block 566008 in file "base/13754/16396.4": Bad address
pgbench: error: client 38 script 0 aborted in command 5 query 0: ERROR:  could not read block 116455 in file "base/13754/16404": Bad address
pgbench: error: pgbench: error: client 4 script 0 aborted in command 5 query 0: ERROR:  could not read block 31640 in file "base/13754/16404": Bad address
client 29 script 0 aborted in command 5 query 0: ERROR:  could not read block 137124 in file "base/13754/16404.1": Bad address
pgbench: error: client 19 script 0 aborted in command 5 query 0: ERROR:  could not read block 78349 in file "base/13754/16404": Bad address
pgbench: error: pgbench: error: client 20 script 0 aborted in command 5 query 0: ERROR:  could not read block 56435 in file "base/13754/16404": Bad address
client 0 script 0 aborted in command 5 query 0: ERROR:  could not read block 141619 in file "base/13754/16404.1": Bad address
pgbench: error: client 12 script 0 aborted in command 5 query 0: ERROR:  could not read block 46705 in file "base/13754/16404": Bad address
pgbench: error: client 1 script 0 aborted in command 5 query 0: ERROR:  could not read block 335297 in file "base/13754/16396.2": Bad address
pgbench: error: client 15 script 0 aborted in command 5 query 0: ERROR:  could not read block 131663 in file "base/13754/16404.1": Bad address
pgbench: error: client 6 script 0 aborted in command 5 query 0: ERROR:  could not read block 2895 in file "base/13754/16404": Bad address
pgbench: error: client 22 script 0 aborted in command 5 query 0: ERROR:  could not read block 251229 in file "base/13754/16404.1": Bad address
pgbench: error: client 8 script 0 aborted in command 5 query 0: ERROR:  could not read block 129323 in file "base/13754/16404": Bad address
pgbench: error: client 36 script 0 aborted in command 5 query 0: ERROR:  could not read block 25085 in file "base/13754/16404": Bad address
pgbench: error: client 2 script 0 aborted in command 5 query 0: ERROR:  could not read block 259029 in file "base/13754/16404.1": Bad address
pgbench: error: client 23 script 0 aborted in command 5 query 0: ERROR:  could not read block 168628 in file "base/13754/16404.1": Bad address
pgbench: error: client 33 script 0 aborted in command 5 query 0: ERROR:  could not read block 233469 in file "base/13754/16404.1": Bad address
pgbench: error: client 24 script 0 aborted in command 5 query 0: ERROR:  could not read block 92632 in file "base/13754/16404": Bad address
pgbench: error: client 18 script 0 aborted in command 5 query 0: ERROR:  could not read block 40478 in file "base/13754/16404": Bad address
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 40
number of threads: 4
duration: 300 s
number of transactions actually processed: 1251
pgbench: fatal: Run was aborted; the above results are incomplete.
latency average = 10.681 ms
initial connection time = 100.183 ms
tps = 3744.914729 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.447  BEGIN;
         2.017  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.519  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.545  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.585  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.595  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         5.635  END;

Well that’s not good – the errors actually still happens on ZFS, which is what tipped me off to this being a underlying hardware problem. ZFS was protecting me a bit more than I thought (due to the ARC absorbing some load) – the problem definitely happens a lot more, and it looks increasingly clear that it is likely filesystem corruption.

Well luckily for me the Postgres Wiki has a page on Corruption, so I took a look through there and thought if there were any things there that could have been issues… Defective CPU or RAM didn’t seem to be the case (I didn’t have ECC RAM in the server but I ran some checks on the RAM and they seemed fine, the tests also consistently failed across machines).

At the very least the fact that it was happening on both mdraid and ZFS was a good sign that it wasn’t the ZFS subsystem and it wasn’t mdraid in particular. The problem seemed to be “masked” by shared_buffers so I made a reddit post and got some great help from the community.

📜️ pgbench error output (click to expand)

2021-11-29 04:36:49.649 UTC [115] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -2070 WHERE aid = 30688590;
2021-11-29 04:36:49.649 UTC [95] ERROR:  could not read block 66695 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.649 UTC [95] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 3387 WHERE aid = 24410232;
2021-11-29 04:36:49.649 UTC [114] ERROR:  could not read block 27935 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.649 UTC [114] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 4906 WHERE aid = 10195188;
2021-11-29 04:36:49.649 UTC [109] ERROR:  could not read block 33635 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.649 UTC [109] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 649 WHERE aid = 12338677;
2021-11-29 04:36:49.649 UTC [126] ERROR:  could not read block 70731 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.649 UTC [126] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -283 WHERE aid = 25795973;
2021-11-29 04:36:49.649 UTC [116] ERROR:  could not read block 64302 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.649 UTC [116] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -3404 WHERE aid = 23451372;
2021-11-29 04:36:49.650 UTC [93] ERROR:  could not read block 213483 in file "base/13754/16404.1": Bad address
2021-11-29 04:36:49.650 UTC [93] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -1552 WHERE aid = 77858559;
2021-11-29 04:36:49.651 UTC [111] ERROR:  could not read block 67287 in file "base/13754/16404": Bad address
2021-11-29 04:36:49.651 UTC [111] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 4587 WHERE aid = 24539716;
2021-11-29 04:37:37.732 UTC [135] ERROR:  could not read block 0 in file "base/13754/16431_vm": Bad address
2021-11-29 04:37:37.732 UTC [135] CONTEXT:  while scanning block 0 of relation "public.pgbench_history"

With some of the error messages I could directly see the statements that were failing and they seemed like very average statements so I started to get a bit worried. The next thing I tried was downgrading to PG 13, thinking I might have found a Postgres bug as version 14.1 was very new (of course not). The logs were very similar:

📜️ Postgres container output (click to expand)

...
PostgreSQL init process complete; ready for start up.

2021-12-02 11:16:23.317 UTC [1] LOG:  starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2021-12-02 11:16:23.317 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-12-02 11:16:23.317 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2021-12-02 11:16:23.323 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-12-02 11:16:23.329 UTC [51] LOG:  database system was shut down at 2021-12-02 11:16:23 UTC
2021-12-02 11:16:23.334 UTC [1] LOG:  database system is ready to accept connections
2021-12-02 11:18:17.933 UTC [52] LOG:  checkpoints are occurring too frequently (29 seconds apart)
2021-12-02 11:18:17.933 UTC [52] HINT:  Consider increasing the configuration parameter "max_wal_size".
2021-12-02 11:19:47.736 UTC [116] ERROR:  could not read block 648535 in file "base/13444/16396.4": Bad address
2021-12-02 11:19:47.736 UTC [116] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 2027 WHERE aid = 39560672;
2021-12-02 11:19:47.736 UTC [123] ERROR:  could not read block 585913 in file "base/13444/16396.4": Bad address
2021-12-02 11:19:47.736 UTC [123] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 2180 WHERE aid = 35740744;
2021-12-02 11:19:47.736 UTC [115] ERROR:  could not read block 838163 in file "base/13444/16396.6": read only 4352 of 8192 bytes
2021-12-02 11:19:47.736 UTC [115] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -4292 WHERE aid = 51127988;
2021-12-02 11:19:47.736 UTC [127] ERROR:  could not read block 52370 in file "base/13444/16396": Bad address
2021-12-02 11:19:47.736 UTC [127] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -89 WHERE aid = 3194591;
2021-12-02 11:19:47.736 UTC [122] ERROR:  could not read block 166523 in file "base/13444/16396.1": Bad address
2021-12-02 11:19:47.736 UTC [122] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 1505 WHERE aid = 10157924;
2021-12-02 11:19:47.736 UTC [119] ERROR:  could not read block 54214 in file "base/13444/16404": Bad address
2021-12-02 11:19:47.736 UTC [119] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -3274 WHERE aid = 19771731;
2021-12-02 11:19:47.736 UTC [94] ERROR:  could not read block 25913 in file "base/13444/16404": Bad address
2021-12-02 11:19:47.736 UTC [94] STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + 3160 WHERE aid = 9450375;
2021-12-02 11:19:47.736 UTC [129] ERROR:  could not read block 29298 in file "base/13444/16404": Bad address

Very curiously ,the folder seemed to be stuck at around 16GB before the errors would show up.

Fake resolution: Bigger `shared_buffers`

Generally shared_buffers should be 25% of RAM it seems… the nodes did have 16GB of RAM so 4GB would be what to set it to. The runs would complete with shared_buffers set to 4GB (but thankfully I didn’t stop there) – obviously, a lack of memory space (shared_buffers) should not cause a database like Postgres to los integrity. I may have found a workaround but something is fundamentally wrong.

Dead end: `max_wal_size` ?

Increasing the max_wal_size to 4096 (4GB) also delayed the problem but didn’t fix it – pgbench runs were still failing. Looking at max_wal_size did give me an idea though – maybe I was running out of some kind of resource?

Dead end: ulimit?

I started thinking that maybe I was hitting some file system limit, so I checked ulimit -a:

root@node-2 ~ # ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 255813
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 255813
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And the “real” usage during pgbench seemed to be:

root@node-2 ~ # cat /proc/sys/fs/file-nr
50112   0       9223372036854775807

Based on this I thought maybe the open files limit was the issue, but then I checked the max:

root@node-2 ~ # sysctl fs.file-max
fs.file-max = 9223372036854775807

Well… It looks like the limit is not the issue. In fact, you can actually check by a specific process and find out what the limits are and get the right values (which are set by container orchestration tooling):

root@node-2 ~ # ulimit -n 65536
root@node-2 ~ # ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 255813
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 255813
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Clearly no problem there.

Solution: Force proper disk syncing

The insight came from /u/nikowek on reddit – the lost writes (that couldn’t be read back) coupled with absolutely no feedback from the Kernel or Postgres itself on writes failing means that the drives below must have been reporting writes successful when they weren’t.

It looks like I’d run head first into a feature of the drives below that wasn’t quite working as I’d expected – the “Volatile cache” write cache is to blame (there’s a great post on this in some microsoft forums).

As this is the problem, there are only two easy ways around it:

wal_sync_method=fsync at the postgres level (instead of fdatasync)
Disable write back cache at the drive level with (apt install nvme-cli)

Since I don’t ever want to run into this problem again (at least until I upgrade to enterprise HDDs across the board, I guess), I’ve gone with the second one – which is disabling the volatile write cache at the drive level:

$ nvme set-feature -f 6 -v 0 /dev/nvme0n1

Same same, but different: Longhorn has the same issue

While I was here, I ran the pgbench tests against a postgres instance with storage backed by my Longhorn installation. I was instantly suspicious because the Longhorn runs actually ran as fast as mdraid and ZFS runs which just can’t be right – Longhorn performs synchronous remote writes – there’s no way the performace should be at par with local-only disk writes, even with increased shared_buffers size.

Once I reduced the size of shared_buffers the issue presented itself (I guess longhorn uses fdatasync or one of the other options underneath?), so the right way to fix the problem was indeed disabling the volatile write cache at the drive.

All the links

Here’s all the sources so you can look for yourself. Thanks to all these people for sharing their wealth of knowledge. Special thanks to /u/mercenary_sysadmin on Reddit and all the comitters and contributors on the Postgres and ZFS mailing lists.

Update (2021/12/21): Way more links

So there are lots more links and information to read through. I got going down a rabbit hole, so I want to leave all these resources here for anyone else who wants to jump down:

Percona blog: About ZFS Performance
- Benchmark seems to be flawed because ZFS w/ L2ARC was not compared to the equivalent for XFS (bcache) or EXT4 (dm-cache +/- LVM caching?)
- Commenter called out logbias=throughput as terrible for causing fragmentation (just read all the quotes by “Janet Campbell”)
Paper: Evaluation of ZFS as an efficient WLCG storage backend
- ZFS is compared with XFS and EXT4 directly
- Talk schedule page which comes with more PDFs (at the bottom)
Slides: Best Practices for OpenZFS L2ARC in the Era of NVMe by Ryan McKenzie of iXsystems
Long Paper: napp-it ZFS storage performance tests
- This seems like kinda like a paid advertisement for Oracle Solaris / Intel Optane, but it has a very helfpul TL;DR.
Article on ZFS tuning by Martin Heiland
- I think he reaches the wrong conclusion about compression here, but some interesting insights for VM workloads.

And here’s the rabbit hole I went down….

Another completely separate rabbit hole which looks like it holds even richer treasures is Chris Seibermann of UToronto’s blog posts on ZFS (thanks to /u/BucketOfSpinningRust on Reddit).

/u/BucketOfSpinningRust on reddit also was kind enough to share some explanations of behavior on the ZFS side in two separate comments:

https://www.reddit.com/r/zfs/comments/rjavyc/everything_ive_seen_on_optimizing_postgres_on_zfs/hp6vrra/
- This was a correction of the notion that multiple pages ever go into the same “ZFS Record” – this is a mischaracterization that I parroted, what happens is that more pages will be able to likely fit in the same area of disk – compression is done after the concept of a ZFS record is gone (i.e. when you know the physical size of one or more writes in a transaction group).
https://www.reddit.com/r/zfs/comments/rjavyc/everything_ive_seen_on_optimizing_postgres_on_zfs/hp8u2rk/
- Great explanation on how record size is treated, reproduced below, just in case in some weird future this site is up and Reddit isn’t (or it’s paywalled):

Files smaller than the record size are compressed (if applicable), and then stored as unique records that physically occupy unique sectors on disk (IE round up to the nearest ashift value). With a record size of 8k (or larger), you can store a 5k file that compresses down to 3k in 4k of disk space with ashift=12. Files larger than a record size are broken into blocks of the record size, then compressed and stored, again rounding up. A 5M file with 1M records could be stored as 2 uncompressed 1M records, a 512k record, a 100k record, and another 1M record. Side note: There are some weird edge cases with very large record sizes. A 1.001M file with 1M record sizes will be stored as 2 1M records, then compressed.

In response to your edit: You can set your record size to some multiple of the page size in whatever database you’re using. There are some perfectly legitimate reasons to do this. Jumping to 16 and 32k records often gives you moderate bumps in compression ratios. (Going past 32k rarely gives you worthwhile benefits in my experience.) This means you can store more in the ARC and/or use less memory. It also means that you warm the cache faster which is relevant for spinning up disposable VMs for things like load balancing. For databases where you frequently read multiple adjacent pages, it acts as a form of prefetching. For exceptionally large databases that aren’t under heavy I/O load the compression can save on disk space (that’s a rare usage case admittedly).

The catch is write amplification, or even worse, read modify writes. Write amplification itself isn’t always a major issue on rust because the limiting factor is seek time. The difference between doing 4, 8, 16 and 32k writes is fairly small in terms of IOPS. Yeah, you can do fewer operations with larger writes because the head needs to spend longer writing (duh), but the time spent writing on full random workloads is somewhat trivial compared to the time the head spends moving around. What usually fucks your performance when oversizing is read modify writes. If everything is cached in RAM, yeah you’re using a bit of extra memory throughput and burning some CPU time doing compression, checksums, etc, but that often doesn’t matter. What matters is disk throughput. If oversizing records gives you better compression to fit more stuff, or your workload does enough adjacent reads to make the pseudo prefetching give you better ARC hit rates, you can actually reduce disk usage.

It’s a balancing act. Time spent reading off of rust is time that cannot be spent writing to disk, and vice versa. Oversizing ZFS records can reduce the number of reads going to disk, which can result in increased performance, but if you go too far it’s very easy to completely tank your performance.

Wrapup

There is lots of information and links out there on the internet about Postgres and ZFS – hopefully you’ve discovered some new sources to read up on! As always, if you have some nitpicks or good corrections I’d love to hear them so I can update and make this post even better (and help all the lost souls who wander in). I’d love to hear about areas that I’ve overlooked or new optimization opportunities I’ve never heard of.

Thanks for reading!

Categories

Table of Contents

UPDATE (2021/12/21)

ZFS tunables

Setting recordsize to 8k

(optional) Setting recordsize to 16k

Enable compression

Benchmark (pgbench): recordsize=8k vs recordsize=16k

Reducing read-ahead

Tuning primarycache

ZFS-related tunables on the Postgres side

Setting full_page_writes=off

Drawback: “interferes” with cascading replication

Disable postgres checksumming

Disable Postgres compression

Tune wal_init_zero & wal_recycle

Tune wal_sync_method, trying fdatasync

WARNING: Don’t do this if you can’t rely on your drive’s writeback caching

Drawback: modified time may be lost after crash

Alternative: wal_sync_method=open_datasync

Future Alternative: wal_sync_method=aio_f[sync|datasync]?

Increasing Postgres blocksize

(Controversial) zfs_txg_timeout=1 & synchronous_commit=off & logbias=throughput