Disassembling Raid on Hetzner Without Rescue Mode

Hetzner logo + Kubernetes logo + Rook logo

tl;dr - Disassembling the default-installed RAID1 on Hetzner dedicated servers so you can give one drive to Rook (Ceph underneath) to manage is doable without booting into Hetzner rescue mode: just shrink each array down to one drive (credit to user forstschutz on Stack Overflow), then remove the second drive.

I’m a huge fan of Hetzner dedicated servers and in particular their Robot Marketplace. Long story short, discovering the Robot Marketplace thanks to someone on HN opened my eyes to the world of affordable dedicated servers (I’ve also written about Hetzner in some previous posts). There are lots of companies that offer dedicated servers, but I haven’t found any as well-priced and consistent/reliable as Hetzner (to be fair, I haven’t tried a whole lot – Scaleway seems to be a good bet as well). Most Hetzner dedicated machines have two attached local disks that are replicated via software RAID1 – which is pretty awesome.

Unfortunately, the automatic RAID1 setup is not awesome when you actually want to give that second drive to Ceph to manage (via Rook, an add-on) as part of a Kubernetes cluster, which is where I find myself. Normally disassembling RAID is a step in the wrong direction, but in this case, given that my data-plane (worker node) servers are very much cattle, I don’t mind some instability on the main hard drive as long as the important data is replicated at the cluster level by Ceph. Removing RAID1 on the Hetzner disks requires booting into rescue mode, according to their guide.

At first I thought Rook (Ceph underneath) supported working with just a plain folder on an already-existing disk, but this doesn’t seem to be the case (if you know otherwise, please, please let me know). This means that if I want to use Rook across my cluster, I’m going to need to figure out a way to undo the RAID1 setup that comes in by default. Luckily for me, I found a Stack Overflow post that suggested a clever way to do just that.

This post is a writeup on what I went through trying to get it to work.

Working through trying it

Figuring out the current state of the system

Let’s start by figuring out where we’re at with the server so far.

root@Ubuntu-1804-bionic-64-minimal ~ # lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0    87M  1 loop  /snap/core/5145
loop1     7:1    0  87.9M  1 loop  /snap/core/5328
loop2     7:2    0   240K  1 loop  /snap/jq/6
loop3     7:3    0 104.6M  1 loop  /snap/conjure-up/1029
loop4     7:4    0 104.6M  1 loop  /snap/conjure-up/1027
loop6     7:6    0 104.5M  1 loop  /snap/conjure-up/1025
loop7     7:7    0  87.9M  1 loop  /snap/core/5548
sda       8:0    0 223.6G  0 disk
├─sda1    8:1    0    12G  0 part
│ └─md0   9:0    0    12G  0 raid1
├─sda2    8:2    0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:3    0 211.1G  0 part
  └─md2   9:2    0   211G  0 raid1 /
sdb       8:16   0 223.6G  0 disk
├─sdb1    8:17   0    12G  0 part
│ └─md0   9:0    0    12G  0 raid1
├─sdb2    8:18   0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sdb3    8:19   0 211.1G  0 part
  └─md2   9:2    0   211G  0 raid1 /

A few things we can learn from this:

  • There are two drives: sda and sdb (corresponding to /dev/sda and /dev/sdb)
  • There are three RAID1’d partitions on each drive: sda1/sda2/sda3 and sdb1/sdb2/sdb3, corresponding to md0, md1 and md2.

Just as a reminder, the basic strategy we’re going for is to shrink each RAID1 array down to one drive, then remove the second drive from the array altogether once we know the data is safely on the first drive only.

Here’s the status mdadm (you might enjoy the Arch Wiki entry on mdadm as well) reports for each RAID array:

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Jul 30 10:55:29 2018
        Raid Level : raid1
        Array Size : 12574720 (11.99 GiB 12.88 GB)
     Used Dev Size : 12574720 (11.99 GiB 12.88 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Fri Oct  5 01:05:04 2018
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : da12f611:5bb391ce:ec88c344:fc568fbd
            Events : 30

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Jul 30 10:55:29 2018
        Raid Level : raid1
        Array Size : 523712 (511.44 MiB 536.28 MB)
     Used Dev Size : 523712 (511.44 MiB 536.28 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Fri Oct  5 22:17:02 2018
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:1
              UUID : d7431387:53d27d65:03d51052:2f9e84d6
            Events : 83

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Mon Jul 30 10:55:29 2018
        Raid Level : raid1
        Array Size : 221190720 (210.94 GiB 226.50 GB)
     Used Dev Size : 221190720 (210.94 GiB 226.50 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Oct  6 08:54:51 2018
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : rescue:2
              UUID : 7dd7e498:71bc1c71:9a225a36:10f7a242
            Events : 1158

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

Let’s try running the commands on one array

The commands that accomplish what we want for one disk look like this (feel free to refer back to the Stack Overflow post):

mdadm /dev/mdx --fail /dev/disky1
mdadm /dev/mdx --remove /dev/disky1
mdadm --grow /dev/mdx --raid-devices=1 --force

Note that mdx will need to be md0, md1 and md2 in turn (with disky1 as the matching sdb partition), so these commands will need to be run three times. As far as what the commands are trying to achieve, it should look something like this:

  • Mark the disk we want to remove as faulty
  • Remove the disk from the setup
  • “Grow” the array to a single device (in effect shrinking it)

Since I have to do this with three different partitions, I’m going to start with the boot partition – a flawed change here shouldn’t affect the currently running system, since the partition is only used at boot (conversely, this is a risky choice, because if the machine reboots while things are in a bad state, life will be hard). Note the small mismatch – md1 is backed by sdb2 (and md0 by sdb1) – a manifestation of one of the hardest problems in computer science. This is consistent with the status shown for md1 above, in the list of supporting devices at the bottom. With that said, let’s let ‘er rip:

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1

OK, so far so good – the partition is marked as faulty. Let’s move on to removing it!

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1

OK, so far so good – the partition is removed; let’s confirm this with mdadm -D:

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Jul 30 10:55:29 2018
        Raid Level : raid1
        Array Size : 523712 (511.44 MiB 536.28 MB)
     Used Dev Size : 523712 (511.44 MiB 536.28 MB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Sat Oct  6 09:04:41 2018
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:1
              UUID : d7431387:53d27d65:03d51052:2f9e84d6
            Events : 86

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       -       0        0        1      removed

OK, great everything seems to be progressing smoothly, let’s do the final step:

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm --grow /dev/md1 --raid-devices=1 --force
raid_disks for /dev/md1 set to 1

OK, now let’s check and ensure that mdadm -D reflects these changes:

root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Jul 30 10:55:29 2018
        Raid Level : raid1
        Array Size : 523712 (511.44 MiB 536.28 MB)
     Used Dev Size : 523712 (511.44 MiB 536.28 MB)
      Raid Devices : 1
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Sat Oct  6 09:06:22 2018
             State : clean
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:1
              UUID : d7431387:53d27d65:03d51052:2f9e84d6
            Events : 89

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2

Great! It looks like the only backing device for the md1 array is /dev/sda2, which is what we want (we’re freeing up sdb to give to Rook)! Let’s see if lsblk confirms:

root@Ubuntu-1804-bionic-64-minimal ~ # lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0     7:0    0    87M  1 loop  /snap/core/5145
loop1     7:1    0  87.9M  1 loop  /snap/core/5328
loop2     7:2    0   240K  1 loop  /snap/jq/6
loop3     7:3    0 104.6M  1 loop  /snap/conjure-up/1029
loop4     7:4    0 104.6M  1 loop  /snap/conjure-up/1027
loop6     7:6    0 104.5M  1 loop  /snap/conjure-up/1025
loop7     7:7    0  87.9M  1 loop  /snap/core/5548
sda       8:0    0 223.6G  0 disk
├─sda1    8:1    0    12G  0 part
│ └─md0   9:0    0    12G  0 raid1
├─sda2    8:2    0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:3    0 211.1G  0 part
  └─md2   9:2    0   211G  0 raid1 /
sdb       8:16   0 223.6G  0 disk
├─sdb1    8:17   0    12G  0 part
│ └─md0   9:0    0    12G  0 raid1
├─sdb2    8:18   0   512M  0 part
└─sdb3    8:19   0 211.1G  0 part
  └─md2   9:2    0   211G  0 raid1 /

Awesome – lsblk confirms what we wanted: sdb2 is no longer part of a RAID1 array, while sda2 still is (a RAID1 array with a single disk).

All we have to do is repeat this process for the other two partitions and we’ll be all good!
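As a sketch (this loop is mine, not from the original Stack Overflow post), the remaining two runs can be generated like this – the loop only prints the commands so you can eyeball them first, then pipe the output to sh as root to actually run them:

```shell
# Print the fail/remove/shrink sequence for the remaining arrays,
# using the md0 -> sdb1 and md2 -> sdb3 mapping from the lsblk output above.
# Review the output, then pipe it to `sh` (as root) to execute for real.
for md in 0 2; do
  part="/dev/sdb$((md + 1))"
  echo "mdadm /dev/md$md --fail $part"
  echo "mdadm /dev/md$md --remove $part"
  echo "mdadm --grow /dev/md$md --raid-devices=1 --force"
done
```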

Making it repeatable with Ansible

Here’s what the Ansible tasks to support this look like:

---
- block:
    - {name: "query md0 setup", become: true, shell: "mdadm --query /dev/md0", register: "md0_query"}
    - {name: "query md1 setup", become: true, shell: "mdadm --query /dev/md1", register: "md1_query"}
    - {name: "query md2 setup", become: true, shell: "mdadm --query /dev/md2", register: "md2_query"}

############
# /dev/md0 #
############

- block:
    - name: mark /dev/sdb1 as faulty on /dev/md0
      become: yes
      shell: "mdadm /dev/md0 --fail /dev/sdb1"

    - name: remove /dev/sdb1 from /dev/md0
      become: yes
      shell: "mdadm /dev/md0 --remove /dev/sdb1"

    - name: shrink /dev/md0 to one disk
      become: yes
      shell: "mdadm --grow /dev/md0 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md0
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md0"
  when: '"raid1 2 devices" not in md0_query.stdout_lines[0]'

############
# /dev/md1 #
############
- block:
    - name: mark /dev/sdb2 as faulty on /dev/md1
      become: yes
      shell: "mdadm /dev/md1 --fail /dev/sdb2"

    - name: remove /dev/sdb2 from /dev/md1
      become: yes
      shell: "mdadm /dev/md1 --remove /dev/sdb2"

    - name: shrink /dev/md1 to one disk
      become: yes
      shell: "mdadm --grow /dev/md1 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md1
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md1"
  when: '"raid1 2 devices" not in md1_query.stdout_lines[0]'

############
# /dev/md2 #
############

- block:
    - name: mark /dev/sdb3 as faulty on /dev/md2
      become: yes
      shell: "mdadm /dev/md2 --fail /dev/sdb3"

    - name: remove /dev/sdb3 from /dev/md2
      become: yes
      shell: "mdadm /dev/md2 --remove /dev/sdb3"

    - name: shrink /dev/md2 to one disk
      become: yes
      shell: "mdadm --grow /dev/md2 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md2
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md2"
  when: '"raid1 2 devices" not in md2_query.stdout_lines[0]'

Note that this bundle of Ansible tasks is all or nothing – it will end your entire play if even one thing is not as expected. I want to minimize the chance of running any commands if the environment isn’t exactly like I expect it to be. If you want to run this for individual drives, comment out the other blocks accordingly – and make sure you tag this role at the upper (playbook + whatever else) level.
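For reference, hooking the role in at the playbook level with a tag might look something like this (the host group, role name, and tag here are hypothetical, not from my actual setup):

```yaml
# playbook.yml (sketch – names are made up)
- hosts: workers
  roles:
    - role: disassemble-raid
      tags: ["disassemble-raid"]
```

You can then run (or avoid) just this role with `ansible-playbook playbook.yml --tags disassemble-raid` (or `--skip-tags disassemble-raid`).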

Wrapup

While Ansible is an awesome tool, it would be much better to not have to do this at all; unfortunately, dedicated servers (and the Hetzner features around them currently) don’t permit a more immutable way of doing this. In the future I might start using Hetzner Cloud (since the prices are pretty reasonable as well).

After running through this process I was able to set up Rook very easily on my small cluster and get to use all its Ceph-y goodness! Pro-tip on setting up Rook: in general you only have to point it at a hard drive, but make sure to run wipefs -a /dev/<drive> first!
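Since the sdb partitions used to be md members, I’d also clear the leftover md superblocks along with the wipefs. Here’s a dry-run sketch of that prep (these are my commands, not from the Hetzner or Rook docs – set DRY_RUN=0 to actually execute):

```shell
# Clear leftover RAID metadata before handing /dev/sdb to Rook.
# DRY_RUN=1 (the default) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

for part in /dev/sdb1 /dev/sdb2 /dev/sdb3; do
  run mdadm --zero-superblock "$part"  # wipe the md superblock on each old member
done
run wipefs -a /dev/sdb                 # wipe remaining signatures on the disk itself
```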