tl;dr - A drive in one of my Hetzner dedicated servers failed. The machine runs both mdraid and ZFS, and I restored both with the help of Hetzner’s documentation, Joyent’s nice high level docs on ZFS drive replacement, the FreeBSD Handbook’s section on ZFS, and the OpenZFS (ZFS on Linux) docs.

Recently I’ve undergone the pretty traumatic expansion of my Kubernetes control plane from single controller (which was also acting as a node in the cluster and running workloads) to a 3 Hetzner Cloud machine Highly Available (“HA”) setup. During this transition one of the drives in a node actually failed – the node was not the node being reconfigured but nonetheless Murphy’s Law saw fit to strike me while I was in the middle of a pretty complicated process (a post is coming about that too!).

In the past when I’ve had drives die, I was in a setup where it was easy to actually completely rebuild the machine, so once the replacement was done I could just re-run my automation to provision everything. Unfortunately, this time there were stateful workloads running on the machine with the dead disk. Obviously, this requires recovery and restoration of the workloads that were actually running and getting them access to the data they had.

Symptoms

Well what does it look like when you have a drive failure? Here are a few things:

At the application layer

The application may or may not be running
- This does happen in ZFS – ZFS will not provide a mirror that has become degraded unless you force it to (also for the purposes of restoring)
- This doesn’t happen on “raw” mdraid – your reads will silently go to the last remaining drive and it’s up to you to check/know that you’re running in a degraded state and replace the drive ASAP.
- I’m not sure exactly what LVM will do here, but it should continue serving writes from the disk that is remaining, but it really depends on your setup:
  - LVM / mdraid (“LVM over mdraid”) => LVM will serve writes from the one disk, because underlying mdraid (ex. /dev/md0) is still serving reads/writes
  - LVM with a multi-disk Volume Group (“VG”) pool / raw disks => You’re hosed, data on the dead disk is gone
  - LVM with RAID1 logical volumes (“LVs”) (man pages) over multi-disk VGs => data from every LV is mirrored onto both disks, so even with one disk gone/missing from the VG you can serve reads/writes (not sure if LVM requires intervention to enable this in the short term while you recover…)

At the orchestration layer

Whether it’s the linux kernel itself, k8s or nomad, your orchestrator is aware of the missing storage (usually by not being able to set up volumes/usable disk) – your workloads are not starting, and errors are being thrown
- This does happen for Kubernetes (obviously, which is why this post is here) – OpenEBS ZFS LocalPV has quite the hard time mounting datasets from ZFS pools that aren’t imported! Of course, the ZFS pools are not imported automatically because they’re degraded.
- I have no idea how this would show up in Nomad, though I assume it’s similar to Kubernetes. IIRC Nomad can also use Container Storage Interface (“CSI”) drivers, so a driver like OpenEBS ZFS LocalPV would probably fail to provide storage and stop the workload from starting.
- On plain Linux you’re either completely fine or mega hosed – depending on how your underlying storage is set up, you’re either dead because your root disk (and other programs) had important data on the dead disk, or you’re fine (at least for now) if you used mdraid, since it will keep serving reads and writes from the remaining disk. This is why Hetzner uses RAID1 by default on their disks – it’s a much safer option.

ZFS

The zpool list command returned no pools at all, and obviously zfs list returned no datasets. In the past I have had accidental ZFS version downgrades: the Ubuntu packaged version of ZFS is 0.8.3 but I run 2.1.1, and making sure the kernel uses the custom-installed 2.1.1 (and not the Ubuntu-provided 0.8.3) can be tricky. Older ZFS versions can’t read pools that use features from newer versions, so that’s one reason the pool might not show up.

The first step for me was to double-check the version (zfs version) to see if that was the reason the pools weren’t showing up:

root@node-3 ~ # zfs version
zfs-2.1.1-1
zfs-kmod-2.1.1-1

OK so looks like I at least don’t have the accidental-ZFS-version-downgrade problem!
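If you want to script that check, here is a minimal sketch. It assumes the two-line output format shown above (a zfs- line for userland and a zfs-kmod- line for the kernel module); the function name zfs_versions_match is my own invention, not part of ZFS.

```bash
# Compare the userland and kernel-module versions printed by `zfs version`.
# Reads that command's output on stdin; returns 0 only when both versions match.
zfs_versions_match() {
  awk '
    /^zfs-kmod-/ { kmod = substr($1, 10); next }  # drop the "zfs-kmod-" prefix
    /^zfs-/      { user = substr($1, 5) }         # drop the "zfs-" prefix
    END {
      if (user == "" || kmod == "") { print "could not parse versions"; exit 2 }
      if (user != kmod) { print "MISMATCH: userland " user ", kmod " kmod; exit 1 }
      print "OK: " user
    }'
}
```

On a real machine this would be run as zfs version | zfs_versions_match. A mismatch only tells you the userland tools and the loaded module disagree, which is the situation I have tripped over before.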

Operating System (smoking gun)

Other nodes were fine so I knew this wasn’t a shared configuration problem, so I went ahead and checked the disk layout (I run a somewhat non-standard layout):

root@node-3 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1     259:0    0   477G  0 disk
├─nvme0n1p1 259:1    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:2    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:3    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:4    0     1K  0 part
└─nvme0n1p5 259:5    0   316G  0 part

Uhhh where the heck is nvme1n1? This machine is a Hetzner AX41-NVME server, which comes with two NVMe SSDs, normally called nvme0n1 and nvme1n1 at the OS level (available @ /dev/nvme0n1, etc). Here’s what it normally looks like, taken from a healthy node (node-5) just for reference:

root@node-5 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
zd0         230:0    0   150G  0 disk  /var/lib/longhorn
nvme0n1     259:0    0   477G  0 disk
├─nvme0n1p1 259:2    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:3    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:4    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:5    0     1K  0 part
└─nvme0n1p5 259:6    0   316G  0 part
nvme1n1     259:1    0   477G  0 disk
├─nvme1n1p1 259:7    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2 259:8    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme1n1p3 259:9    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme1n1p4 259:10   0     1K  0 part
└─nvme1n1p5 259:11   0   316G  0 part

Oh shit. A drive is missing completely (the kernel hides drives that are sufficiently broken). After a small freak out, I can rest relieved because this is the very reason I’ve agonized over the disadvantages/advantages of disassembling software RAID on Hetzner machines. I’m running RAID1 like a sensible sysadmin for this very eventuality, so I can survive single hard drive failures.
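Eyeballing lsblk works, but a check like this is easy to automate. The following is a rough sketch, assuming you know which disk names a node should have; missing_disks is a hypothetical helper name.

```bash
# Print every expected disk that is absent from a list of disk names on stdin
# (one name per line, e.g. the output of `lsblk -dn -o NAME`).
missing_disks() {
  present=$(cat)
  for disk in "$@"; do
    # -x: whole-line match, -F: treat the name as a fixed string
    printf '%s\n' "$present" | grep -qxF "$disk" || echo "$disk"
  done
}
```

On the broken node, lsblk -dn -o NAME | missing_disks nvme0n1 nvme1n1 would have printed nvme1n1, which is the kind of thing worth wiring into monitoring.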

Hard drive failures are pretty low on my “storage hierarchy of needs”:

[Figure: storage hierarchy of needs]

NOTE If you’re wondering when you may want to actually disassemble software RAID1 on a provider like Hetzner, I think it makes sense when you’re running a distributed storage system like Ceph – Ceph makes sure multiple copies of every bit of data make it to various nodes (and as a result various drives), so even if you were to lose an entire drive from one node, workloads should still be able to run as other copies will exist. This also requires running your OS off of a drive that is not one of those drives (ex. running Alpine Linux from memory), but that’s a post for another day.

Recovery

Finding my missing zpools

I started with ZFS because it is arguably the most “risky” part of this level of my stack – not that ZFS is risky in absolute terms but compared to in-kernel software like mdraid or LVM that are well understood and common, it is. A lot of my attention was geared towards understanding and seeing how ZFS reacts to this kind of failure and building good intuition on how to use it properly.

Searching produced a great FreeBSD thread which wasn’t a bad place to start, and once I ran zpool import I was able to see the zpools that were there but not automatically imported:

root@node-3 ~ # zpool import
   pool: tank
     id: 7705949272249162819
  state: DEGRADED
status: One or more devices contains corrupted data.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
 config:

        tank           DEGRADED
          mirror-0     DEGRADED
            nvme0n1p5  ONLINE
            nvme1n1p5  UNAVAIL

Well this confirms the problem – clearly the missing disk is causing the pool to not be imported (as you might imagine – this is equivalent to if the disk was present but damaged).
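For scripting around this, the pool name and state can be pulled out of that output. This is only a sketch that assumes the pool: / state: layout shown above; importable_pools is a name I made up.

```bash
# Summarize `zpool import` output (read from stdin) as "<pool> <state>" lines.
importable_pools() {
  awk '
    $1 == "pool:"  { pool = $2 }
    $1 == "state:" { print pool, $2 }
  '
}
```

Used as zpool import | importable_pools, this would have printed tank DEGRADED on my node.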

Requesting a new drive from Hetzner

Hetzner does use reasonably worn drives in their dedicated servers. As far as I know they aim to maximize utility and minimize waste of their hardware, which is great for me (I get great deals on good-enough hardware) and the environment (they use drives as long as possible). This however means that drive failure on dedicated machines happens more frequently.

Luckily Hetzner is well prepared for this and they have a support form for reporting dead drives:

[Screenshot: Hetzner broken drive report form]

With the form you can specify either the intact drive or the dead drive. I specified the intact drive, since I didn’t know how to retrieve the UUID of the drive that wasn’t being surfaced by the OS.

Hetzner replaced the drive in <30 minutes from when I filed the tickets (about 20 minutes!), so outstanding service there. This would have resulted in about that much downtime for the data on the node, so that’s not great, but as noted in the “storage hierarchy of needs” illustration, if I want robustness to node failures (which this basically becomes) I have a few options:

- Allow ZFS to import pools even when they’re degraded (a good short term solution, but every running workload is in even more danger if we have a second drive failure)
- Use a higher layer abstraction like Ceph or Longhorn to distribute writes synchronously (if we want 0 data loss) across nodes

There’s no free lunch here either, but at least Hetzner replacing the drive is “free”!

Restoring data (`mdraid` & ZFS)

Once the new drive is installed we can at least confirm it’s present:

root@node-3 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme1n1     259:0    0   477G  0 disk
nvme0n1     259:1    0   477G  0 disk
├─nvme0n1p1 259:2    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:3    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:4    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:5    0     1K  0 part
└─nvme0n1p5 259:6    0   316G  0 part

Hurray, welcome back nvme1n1! As noted earlier, my drive setup is a bit complicated – I’m using a combination of both mdraid (boot, root disk, swap, workload storage) and ZFS (workload storage). Each drive carries partitions for both tools: the first three belong to the mdraid arrays and the last one to the ZFS mirror. The new drive knows nothing of the setup for either storage tool, and is just an empty “512GB” NVMe drive.

Restoring partition structure

Now that we know we’ve got the new drive in, we need to replicate the partition structure before we can start restoring the individual RAID1 implementations we have onboard (mdraid and ZFS). Hetzner’s docs go into how to do this, but I’ll partially reproduce it here.

NOTE: if you want to do a hard wipe of the new disk, this would be the time to do so!

If you have MBR (which is likely on Hetzner at least), you can back up the partition table of one disk and restore it to another.

To backup the existing partition structure:

$ sfdisk --dump /dev/nvme0n1 > nvme0n1.mbr.bak # feel free to open this file, it's text!

To “restore” the partition structure you backed up (to another disk):

$ sfdisk /dev/nvme1n1 < nvme0n1.mbr.bak # /dev/nvme1n1 is the new empty disk being restored

If you’re running GPT you’ll want to use sgdisk instead. To back up the existing partition structure:

$ sgdisk --backup=nvme0n1.gpt.bak /dev/nvme0n1 # this file is *not* textual.

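To restore (possibly to another disk) with GPT, then give the copy fresh GUIDs so the two disks don’t share identifiers:

$ sgdisk --load-backup=nvme0n1.gpt.bak /dev/nvme1n1 # /dev/nvme1n1 is the new empty disk being restored
$ sgdisk --randomize-guids /dev/nvme1n1 # the backup carries nvme0n1's disk and partition GUIDs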
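Since picking the wrong tool (or the wrong disk!) here is destructive, I like having the commands printed before running anything. Below is a small dry-run sketch: clone_table_cmds and the table.bak filename are hypothetical, and it only echoes commands rather than executing them. The "dos"/"gpt" labels are an assumption based on what blkid -o value -s PTTYPE reports for a disk.

```bash
# Print (do NOT run) the commands that copy a partition table from SRC to DST.
# PT is the partition table type: "dos" (MBR) or "gpt".
clone_table_cmds() {
  pt=$1 src=$2 dst=$3
  case "$pt" in
    dos)
      echo "sfdisk --dump $src > table.bak"
      echo "sfdisk $dst < table.bak"
      ;;
    gpt)
      echo "sgdisk --backup=table.bak $src"
      echo "sgdisk --load-backup=table.bak $dst"
      echo "sgdisk --randomize-guids $dst"  # avoid duplicate GUIDs on the copy
      ;;
    *)
      echo "unknown partition table type: $pt" >&2
      return 1
      ;;
  esac
}
```

For my node that would be clone_table_cmds dos /dev/nvme0n1 /dev/nvme1n1, and after reading the output twice I’d paste the commands in by hand.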

Once you’ve used either sfdisk or sgdisk to restore the partition structure, you should see the following after running lsblk:

root@node-3 ~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme1n1     259:0    0   477G  0 disk
├─nvme1n1p1 259:7    0    32G  0 part
├─nvme1n1p2 259:8    0     1G  0 part
├─nvme1n1p3 259:9    0   128G  0 part
├─nvme1n1p4 259:10   0     1K  0 part
└─nvme1n1p5 259:11   0   316G  0 part
nvme0n1     259:1    0   477G  0 disk
├─nvme0n1p1 259:2    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:3    0     1G  0 part
│ └─md1       9:1    0  1022M  0 raid1 /boot
├─nvme0n1p3 259:4    0   128G  0 part
│ └─md2       9:2    0 127.9G  0 raid1 /
├─nvme0n1p4 259:5    0     1K  0 part
└─nvme0n1p5 259:6    0   316G  0 part

Hurray, nvme1n1 looks a little like the other drive. There are some subtle differences but it’s overall a similar layout. Now let’s get to actually restoring the data that was on the partitions to begin with.

Restoring (“resilvering”) `mdraid` data

Hetzner’s got fantastic documentation on exactly how to restore after a failure – their guide is excellent (earlier instructions are based off of it), and expanding on it here would only be an incomplete reflection of the work they’ve already done, so I’ll only reproduce a tiny bit.

With the right partition structure in place, we can restore each of the 3 mdraid arrays (md0, md1, md2) that exist on the other drive. For me that looks like this (make sure to match up the partitions correctly!):

mdadm /dev/md0 -a /dev/nvme1n1p1
mdadm /dev/md1 -a /dev/nvme1n1p2
mdadm /dev/md2 -a /dev/nvme1n1p3

The output looks like the following:

root@node-3 ~ # mdadm /dev/md0 -a /dev/nvme1n1p1
mdadm: added /dev/nvme1n1p1
root@node-3 ~ # mdadm /dev/md1 -a /dev/nvme1n1p2
mdadm: added /dev/nvme1n1p2
root@node-3 ~ # mdadm /dev/md2 -a /dev/nvme1n1p3
mdadm: added /dev/nvme1n1p3

mdadm automatically springs to life and does the rest! You can follow progress by looking at /proc/mdstat:

root@node-3 ~ # cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]
      [=========>...........]  recovery = 45.3% (15210880/33520640) finish=1.5min speed=200219K/sec

md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0]
      134085632 blocks super 1.2 [2/1] [U_]
        resync=DELAYED
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/1] [U_]
        resync=DELAYED

unused devices: <none>

If you want to watch this live, you can run watch -n1 cat /proc/mdstat, but otherwise it’s easy to see how far mdadm is in the process of “resilvering”. When mdadm is completely finished you should see output like the following:

root@node-3 ~ # cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[2] nvme0n1p1[0]
      33520640 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme1n1p3[2] nvme0n1p3[0]
      134085632 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

unused devices: <none>

That [UU] (not to be confused with “UwU”) means we’re golden (as opposed to [U_]).
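Since raw mdraid won’t tell you it’s degraded unless you look, here’s a minimal sketch of a check that looks for you. It assumes the /proc/mdstat layout shown above, where the status field ([UU], [U_]) is the last thing on the “blocks” line; degraded_md_arrays is my own name for it.

```bash
# List md arrays whose member-status field (e.g. [U_]) shows a missing member.
# Reads /proc/mdstat-formatted text on stdin.
degraded_md_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                               # remember the array name
    /blocks/ && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }  # "_" = missing member
  '
}
```

Run it as degraded_md_arrays < /proc/mdstat; no output means every array is showing all of its members.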

Surprise: GRUB

I didn’t think of it at all but thankfully it was included in the Hetzner guide – we need to adjust the boot loader! I would have been VERY angry to reboot the machine and have it not boot. I’ve had problems with Grub on Hetzner dedicated machines before (it’s ultimately why I’m not running Arch Linux on the server right now). This time though, it looks like I’ll be able to avoid those belly aches – I needed to:

Create a new Grub device map (I’m using Grub2):

$ grub-mkdevicemap -n

Install Grub on the new (partially restored) device:

grub-install /dev/nvme1n1

Great, now all the mdraid data is transferred and Grub is installed.

ZFS

ZFS has lots of documentation out there, but I found the Joyent docs, which offer a very high level view, particularly helpful (note that this article is for SmartOS so some tools like fmadm are not present on Linux). This was a decent place to start, and if you’re not too fazed by docs written for another OS, the FreeBSD Handbook section on ZFS is also a fantastic resource. We’re using ZFS on Linux so of course the OpenZFS documentation is more relevant to our needs.

Anyway, back to doing the restoration! Now that I have the drive in the right place, with the mdraid partitions restored, all that’s left is that last partition (around 300GB) which should house ZFS data. If I had some offsite backups I could restore with a zfs recv or something of that sort, but we’re going to need to go for a full mirror resilver from the existing drive (this can be risky because it does put load on the one drive still alive). After crossing some of my fingers I got started.

First here’s the current state of the system:

root@node-3 ~ # zpool list
no pools available

We know this is kind of a lie – the zpool is there but it’s not imported because it’s degraded as a disk was missing. That disk is now back, but obviously doesn’t contain any of the data that ZFS expects it to. zfs list will of course also show no datasets available.

And if we zpool import we’ll see the “missing” pool:

root@node-3 ~ # zpool import
   pool: tank
     id: 7705949272249162819
  state: DEGRADED
status: One or more devices contains corrupted data.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
 config:

        tank           DEGRADED
          mirror-0     DEGRADED
            nvme0n1p5  ONLINE
            nvme1n1p5  UNAVAIL  invalid label

The keen reader will notice that the drive has a new message next to the UNAVAIL status – invalid label! ZFS sees a new drive there but doesn’t see the data that it expects.

At this point I was a bit confused – I thought I might need to run zpool replace on the drive that was missing, but I needed to read the (wrong) docs a bit more. After some more confusion I figured out that to actually do a zpool replace I needed the pool to actually be present! The first step is importing the pool explicitly, even in its degraded state:

root@node-3 ~ # zpool import tank
root@node-3 ~ # zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank   314G  92.1G   222G        -         -    12%    29%  1.00x  DEGRADED  -
root@node-3 ~ # zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Sun Oct 10 00:24:01 2021
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            nvme0n1p5            ONLINE       0     0     0
            1845643194530042201  UNAVAIL      0     0     0  was /dev/nvme1n1p5

errors: No known data errors

OK, now we’ve got even more movement underneath the surface – the previous mirror vdev (the 5th partition of the second drive – /dev/nvme1n1p5) is now listed by its numeric GUID, and there’s no vdev that represents the new nvme1n1p5.

Now the arguments of zpool replace make sense! ZFS shows a GUID there that we can use to refer to the old device, and we can give it the “new drive”:

root@node-3 ~ # zpool replace tank 1845643194530042201 nvme1n1p5
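If you’d rather not copy that long number by hand, the command can be assembled from the zpool status output. A dry-run sketch follows; replace_cmds is a made-up helper, it assumes the STATE column is the second field as in the output above, and it only prints the command so you can read it before running it.

```bash
# Read `zpool status POOL` output on stdin and print (do NOT run) a
# `zpool replace` command for each vdev whose state is UNAVAIL.
replace_cmds() {
  pool=$1 new_dev=$2
  awk -v pool="$pool" -v dev="$new_dev" \
    '$2 == "UNAVAIL" { print "zpool replace " pool " " $1 " " dev }'
}
```

For my pool, zpool status tank | replace_cmds tank nvme1n1p5 would print exactly the command I ran above.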

Now we can check zpool status:

root@node-3 ~ # zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec  8 03:22:07 2021
        92.1G scanned at 1.21G/s, 29.2G issued at 393M/s, 92.1G total
        29.6G resilvered, 31.69% done, 00:02:43 to go
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            nvme0n1p5              ONLINE       0     0     0
            replacing-1            DEGRADED     0     0     0
              1845643194530042201  UNAVAIL      0     0     0  was /dev/nvme1n1p5/old
              nvme1n1p5            ONLINE       0     0     0  (resilvering)

errors: No known data errors

Similar to mdadm, ZFS gets started with resilvering automatically, and it was done quite quickly:

root@node-3 ~ # zpool status
  pool: tank
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 93.9G in 00:03:03 with 0 errors on Wed Dec  8 03:25:10 2021
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p5  ONLINE       0     0     0
            nvme1n1p5  ONLINE       0     0     0

errors: No known data errors

Awesome, we’re back to a fully functional ZFS setup!
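To keep an eye on this going forward, something like the sketch below could run from cron or a monitoring agent. It assumes the tab-separated output of zpool list -H -o name,health; unhealthy_pools is a hypothetical name.

```bash
# Print "<pool>: <health>" for every pool that is not ONLINE.
# Expects tab-separated "name<TAB>health" lines on stdin.
unhealthy_pools() {
  awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}
```

Usage would be zpool list -H -o name,health | unhealthy_pools. Keep in mind that a pool which was never imported (my original problem) won’t show up in zpool list at all, so this complements rather than replaces checking zpool import.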

Bonus: ZFS upgrade

Well I’m not sure what features are not enabled but I guess I better run zpool upgrade. Maybe this was an issue from back when I downgraded to 0.8.3 by accident… The pool was fully formed, and 0.8.3 couldn’t access it, but AFAIK upgrades are fine to do in-place on ZFS so let’s get right in:

root@node-3 ~ # zpool upgrade tank
This system supports ZFS pool feature flags.

Enabled the following features on 'tank':
  redaction_bookmarks
  redacted_datasets
  bookmark_written
  log_spacemap
  livelist
  device_rebuild
  zstd_compress
  draid

Well that just looks like some nice new features coming in (ex. ZSTD compression, DRAID). Running zpool upgrade didn’t cause any issues and I could continue on my merry way.

Wrapup

While the drive going dead did cost me quite a bit of uptime (my Statping instance was red for quite a bit), the downtime could actually have been mitigated by importing the pool despite the single disk failure. While that’s a questionable prospect (given the possibility of a second drive failure), Hetzner did replace the drive in <30 minutes so even a completely responsibly performed restore would have yielded “only” 30 minutes of downtime. 30 minutes is a lot of downtime, but as long as it happens only once a month, you’re still at 99.9% uptime!
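For the skeptical, the arithmetic: a 30-day month has 30 × 24 × 60 = 43,200 minutes, so a 99.9% target leaves a budget of about 43.2 minutes of downtime. A tiny sketch to play with the numbers (uptime_percent is just an illustrative helper):

```bash
# Uptime percentage for DOWN minutes of downtime over a period of DAYS days.
uptime_percent() {
  awk -v down="$1" -v days="$2" \
    'BEGIN { printf "%.2f\n", 100 * (1 - down / (days * 24 * 60)) }'
}

uptime_percent 30 30   # 30 minutes down in a 30-day month -> 99.93
```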

As mentioned earlier, there are at least a few solutions to becoming completely robust to this kind of failure:

Allow ZFS to import pools even when they’re degraded
- This may be a workable short term solution, but it’s risky as every running workload is in even more danger if we have a second drive failure
Use a higher layer abstraction like Ceph or Longhorn to distribute writes synchronously (if we want 0 data loss) across nodes
- This is a much better solution, but will cut into performance
Add a third drive to the mirrors (both mdraid and ZFS)
- This has great performance and is robust to drive failures (assuming the node is alive), but costs a little more.

Hopefully this post has helped you out if you’ve got a setup similar to mine, thanks for reading!

VADOSWARE

Living in a yak shaver's paradise.

Handling your first dead Hetzner hard drive
