tl;dr - Disassembling the default-installed RAID1 on Hetzner dedicated servers so you can give one drive to Rook (Ceph underneath) to manage is doable without going into Hetzner rescue mode: just shrink each array to one drive (credit to user frostschutz on StackOverflow), then remove the second drive.
I’m a huge fan of Hetzner dedicated servers and in particular their Robot Marketplace. Long story short, discovering the Robot Marketplace thanks to someone on HN opened my eyes to the world of affordable dedicated servers (I’ve also written about Hetzner in some previous posts). There are lots of companies that offer dedicated servers but I haven’t found any as well-priced and consistent/reliable as Hetzner (to be fair, I haven’t tried a whole lot – Scaleway seems to be a good bet as well). Most Hetzner dedicated machines have two attached local disks that are replicated via software RAID1 – which is pretty awesome.
Unfortunately, the automatic RAID1 setup is not awesome when you actually want to give that second drive to Ceph to manage (via Rook, an add-on) as part of a Kubernetes cluster, which is where I find myself. Normally disassembling RAID is a step in the wrong direction, but in this case, given that my data-plane (worker node) servers are very much cattle, I don’t mind some instability on the main hard drive as long as the important data is replicated at the cluster level by Ceph. Removing RAID1 on the Hetzner disks requires booting into rescue mode, according to their guide.
At first I thought Rook (Ceph underneath) supported working with just a plain folder on an already existing disk but this seems to not be the case (if you know otherwise, please, please let me know). This means that if I want to use Rook across my cluster, I’m going to need to figure out a way to undo the RAID1 setup that comes in by default. Luckily for me, I found an SO post that suggested a clever way to do just that.
This post is a writeup on what I went through trying to get it to work.
Let’s start by figuring out where we’re at with the server so far.
root@Ubuntu-1804-bionic-64-minimal ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 87M 1 loop /snap/core/5145
loop1 7:1 0 87.9M 1 loop /snap/core/5328
loop2 7:2 0 240K 1 loop /snap/jq/6
loop3 7:3 0 104.6M 1 loop /snap/conjure-up/1029
loop4 7:4 0 104.6M 1 loop /snap/conjure-up/1027
loop6 7:6 0 104.5M 1 loop /snap/conjure-up/1025
loop7 7:7 0 87.9M 1 loop /snap/core/5548
sda 8:0 0 223.6G 0 disk
├─sda1 8:1 0 12G 0 part
│ └─md0 9:0 0 12G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 211.1G 0 part
└─md2 9:2 0 211G 0 raid1 /
sdb 8:16 0 223.6G 0 disk
├─sdb1 8:17 0 12G 0 part
│ └─md0 9:0 0 12G 0 raid1
├─sdb2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sdb3 8:19 0 211.1G 0 part
└─md2 9:2 0 211G 0 raid1 /
A few things we can learn from this:
- There are two disks, sda and sdb (corresponding to /dev/sda and /dev/sdb).
- Each disk has three partitions – sda1/sda2/sda3 and sdb1/sdb2/sdb3 – corresponding to the arrays md0, md1 and md2.
Just as a reminder, the basic strategy we’re going for is to shrink one drive out of the RAID1 setup, then remove it from the array altogether once we know the data is safely on the first drive only.
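By the way, a compact view of every array, its member partitions, and any sync activity is always one command away – I find it handy to check this before and after each step:
# Compact status of all md arrays (members, sync/rebuild progress)
cat /proc/mdstat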
Here’s the status mdadm can provide me (you might enjoy the Arch Wiki entry on mdadm as well), for each raided partition:
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Jul 30 10:55:29 2018
Raid Level : raid1
Array Size : 12574720 (11.99 GiB 12.88 GB)
Used Dev Size : 12574720 (11.99 GiB 12.88 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Fri Oct 5 01:05:04 2018
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Name : rescue:0
UUID : da12f611:5bb391ce:ec88c344:fc568fbd
Events : 30
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Mon Jul 30 10:55:29 2018
Raid Level : raid1
Array Size : 523712 (511.44 MiB 536.28 MB)
Used Dev Size : 523712 (511.44 MiB 536.28 MB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Fri Oct 5 22:17:02 2018
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Name : rescue:1
UUID : d7431387:53d27d65:03d51052:2f9e84d6
Events : 83
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Mon Jul 30 10:55:29 2018
Raid Level : raid1
Array Size : 221190720 (210.94 GiB 226.50 GB)
Used Dev Size : 221190720 (210.94 GiB 226.50 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Oct 6 08:54:51 2018
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : rescue:2
UUID : 7dd7e498:71bc1c71:9a225a36:10f7a242
Events : 1158
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
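If the full -D output is more than you need, mdadm --query prints a one-line summary per array – this is the same output the guard in the Ansible tasks at the end of this post keys off (it contains "raid1 2 devices" while both disks are still attached):
# One-line summary per array
mdadm --query /dev/md0
mdadm --query /dev/md1
mdadm --query /dev/md2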
The commands that accomplish what we want to do for one disk look like this (feel free to refer back to the SO post):
mdadm /dev/mdx --fail /dev/disky1
mdadm /dev/mdx --remove /dev/disky1
mdadm --grow /dev/mdx --raid-devices=1 --force
Note that mdx will need to be md0, md1 and md2 respectively, so these commands will need to be run more than once. The three steps are: mark the second drive’s partition as failed, hot-remove it from the array, then shrink the array so it expects only one device.
Since I have to do this with three different partitions, I’m going to start with the boot partition – a flawed change here shouldn’t affect the currently running system, since the partition is only used at boot (conversely, this is a risky choice because if the machine reboots while things are in a bad state, life will be hard). Note the small mismatch: md1 is backed by sdb2 (and md0 by sdb1) – a manifestation of one of the hardest problems in computer science – which is consistent with the device list at the bottom of the mdadm -D output for md1 above. With that said, let’s let ’er rip:
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
OK, so far so good – the disk is marked as faulty. Let’s move on to removing it!
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
OK, so far so good – the disk is removed, let’s confirm this with mdadm -D:
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Mon Jul 30 10:55:29 2018
Raid Level : raid1
Array Size : 523712 (511.44 MiB 536.28 MB)
Used Dev Size : 523712 (511.44 MiB 536.28 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Sat Oct 6 09:04:41 2018
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Name : rescue:1
UUID : d7431387:53d27d65:03d51052:2f9e84d6
Events : 86
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
- 0 0 1 removed
OK, great everything seems to be progressing smoothly, let’s do the final step:
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm --grow /dev/md1 --raid-devices=1 --force
raid_disks for /dev/md1 set to 1
OK, now let’s check and ensure that mdadm -D reflects these changes:
root@Ubuntu-1804-bionic-64-minimal ~ # mdadm -D /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Mon Jul 30 10:55:29 2018
Raid Level : raid1
Array Size : 523712 (511.44 MiB 536.28 MB)
Used Dev Size : 523712 (511.44 MiB 536.28 MB)
Raid Devices : 1
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Sat Oct 6 09:06:22 2018
State : clean
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Name : rescue:1
UUID : d7431387:53d27d65:03d51052:2f9e84d6
Events : 89
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
Great! It looks like the only backing drive for the md1 array is /dev/sda2, which is what we want (we’re freeing up sdb to give to Rook)! Let’s see if lsblk can confirm:
root@Ubuntu-1804-bionic-64-minimal ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 87M 1 loop /snap/core/5145
loop1 7:1 0 87.9M 1 loop /snap/core/5328
loop2 7:2 0 240K 1 loop /snap/jq/6
loop3 7:3 0 104.6M 1 loop /snap/conjure-up/1029
loop4 7:4 0 104.6M 1 loop /snap/conjure-up/1027
loop6 7:6 0 104.5M 1 loop /snap/conjure-up/1025
loop7 7:7 0 87.9M 1 loop /snap/core/5548
sda 8:0 0 223.6G 0 disk
├─sda1 8:1 0 12G 0 part
│ └─md0 9:0 0 12G 0 raid1
├─sda2 8:2 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:3 0 211.1G 0 part
└─md2 9:2 0 211G 0 raid1 /
sdb 8:16 0 223.6G 0 disk
├─sdb1 8:17 0 12G 0 part
│ └─md0 9:0 0 12G 0 raid1
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 211.1G 0 part
└─md2 9:2 0 211G 0 raid1 /
Awesome – lsblk confirms what we wanted: sdb2 is no longer part of the md1 array, while sda2 still is (a RAID1 with a single disk).
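One extra precaution that isn’t strictly part of the process above: the removed partition still carries its md superblock, so something could try to re-assemble it as a degraded array later. If you want to be thorough, the metadata can be cleared right away (the wipefs -a tip at the end of this post achieves much the same for the whole drive):
# Wipe the leftover md metadata from the partition we just removed
# (only safe once it is no longer part of any array)
mdadm --zero-superblock /dev/sdb2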
All we have to do is repeat this process for the other two partitions and we’ll be all good!
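If you’d rather do that repetition by hand than via the Ansible tasks below, a minimal shell sketch looks like this – it assumes the same array/partition pairing shown in the lsblk output above (md0 backed by sdb1, md2 by sdb3):
# Repeat fail/remove/shrink for the remaining two arrays
for pair in "md0 sdb1" "md2 sdb3"; do
  set -- $pair   # $1 = array, $2 = partition to pull out
  mdadm "/dev/$1" --fail "/dev/$2"
  mdadm "/dev/$1" --remove "/dev/$2"
  mdadm --grow "/dev/$1" --raid-devices=1 --force
done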
Here’s what the Ansible tasks to support this look like:
---
- block:
    - {name: "query md0 setup", become: true, shell: "mdadm --query /dev/md0", register: "md0_query"}
    - {name: "query md1 setup", become: true, shell: "mdadm --query /dev/md1", register: "md1_query"}
    - {name: "query md2 setup", become: true, shell: "mdadm --query /dev/md2", register: "md2_query"}

############
# /dev/md0 #
############

- block:
    - name: mark /dev/sdb1 as faulty on /dev/md0
      become: yes
      shell: "mdadm /dev/md0 --fail /dev/sdb1"

    - name: remove /dev/sdb1 from /dev/md0
      become: yes
      shell: "mdadm /dev/md0 --remove /dev/sdb1"

    - name: shrink /dev/md0 to one disk
      become: yes
      shell: "mdadm --grow /dev/md0 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md0
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md0"
  when: '"raid1 2 devices" not in md0_query.stdout_lines[0]'

############
# /dev/md1 #
############

- block:
    - name: mark /dev/sdb2 as faulty on /dev/md1
      become: yes
      shell: "mdadm /dev/md1 --fail /dev/sdb2"

    - name: remove /dev/sdb2 from /dev/md1
      become: yes
      shell: "mdadm /dev/md1 --remove /dev/sdb2"

    - name: shrink /dev/md1 to one disk
      become: yes
      shell: "mdadm --grow /dev/md1 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md1
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md1"
  when: '"raid1 2 devices" not in md1_query.stdout_lines[0]'

############
# /dev/md2 #
############

- block:
    - name: mark /dev/sdb3 as faulty on /dev/md2
      become: yes
      shell: "mdadm /dev/md2 --fail /dev/sdb3"

    - name: remove /dev/sdb3 from /dev/md2
      become: yes
      shell: "mdadm /dev/md2 --remove /dev/sdb3"

    - name: shrink /dev/md2 to one disk
      become: yes
      shell: "mdadm --grow /dev/md2 --raid-devices=1 --force"

    - name: wait for resync/rebuild/recovery on md2
      ignore_errors: true
      become: yes
      shell: "mdadm --wait /dev/md2"
  when: '"raid1 2 devices" not in md2_query.stdout_lines[0]'
Note that this bundle of Ansible tasks is all or nothing – it will end your entire play if even one thing is not what was expected. I want to minimize the chance of running any commands if the environment isn’t exactly like I expect it to be. If you want to run this for individual drives, comment accordingly – also make sure you tag this role at the upper (playbook + whatever else) level, as sketched below.
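For illustration, tagging the role at the playbook level might look something like the following – the role and tag names here are made up, so substitute your own:
# site.yml (hypothetical role/tag names)
- hosts: workers
  roles:
    - role: disassemble-raid1
      tags: ["disassemble-raid1"]

# ...then target one machine at a time:
#   ansible-playbook site.yml --tags disassemble-raid1 --limit worker-1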
While Ansible is an awesome tool, it would be much better to not have to do this at all, but unfortunately dedicated servers (and the Hetzner features around them currently) don’t permit a more immutable way of doing this. In the future I might start using Hetzner Cloud (since the prices are pretty reasonable as well).
After running through this process I was able to set up Rook very easily on my small cluster and get to use it and all its Ceph-y goodness! Pro-tip on setting up Rook: generally you only have to point it at a hard drive, but make sure to run wipefs -a /dev/drive first!
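For the curious, “pointing it at a hard drive” amounts to an entry in the storage section of the Rook cluster manifest. A rough sketch, with a hypothetical node name (the exact spec depends on your Rook version, so check the docs for yours):
# Excerpt of a Rook cluster spec – node name is hypothetical
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: "worker-1"
      devices:
        - name: "sdb"   # the drive we just freed from RAID1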