In my previous blog, I introduced Rook Ceph, a storage solution for Kubernetes. In that architecture, 3 workers were dedicated to storage using the Ceph file system. Each worker had 2 disks of 100 GB and we saw that, from the storage point of view, one OSD equals one disk. With these 6 disks we had a Ceph cluster with a total capacity of 600 GB. That disk size was fine for our initial tests, but we wanted to extend each disk to 1.5 TB to be closer to a production-like design.
The pools of our Ceph cluster use the default configuration of 3 replicas for the placement groups (pgs) and tolerate the loss of 1 replica: the cluster still operates with only 2 replicas. Knowing this, we can run this disk extension in a controlled way by working on one worker at a time. We both keep the Ceph cluster available to users and avoid any risk of data corruption (or even data loss), which is what could happen if we extended the disks abruptly without properly preparing the cluster for this operation.
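As a quick sanity check before starting, you can read the replica settings from the toolbox pod. This is only a sketch: the pool name myfs-data0 is just a typical Rook CephFS data pool name used as an example, replace it with one of your own pools (listed by ceph osd pool ls). With the defaults described above, the expected output is size: 3 and min_size: 2:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool get myfs-data0 size
size: 3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool get myfs-data0 min_size
min_size: 2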
Preparing the Ceph cluster before the disk extension
First let’s have a look at the status of our Ceph cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph status
cluster:
id: eb11b571-65e7-480a-b15a-e3a200946d3a
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 5d)
mgr: a(active, since 13d), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 6 osds: 6 up (since 5d), 6 in (since 5d)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 337 pgs
objects: 87.92k objects, 4.9 GiB
usage: 19 GiB used, 481 GiB / 500 GiB avail
pgs: 337 active+clean
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Then let's look at the OSD configuration of this cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 0.58612 - 600 GiB 122 GiB 117 GiB 153 MiB 5.1 GiB 478 GiB 20.35 1.00 - root default
-5 0.19537 - 200 GiB 40 GiB 39 GiB 57 MiB 1.4 GiB 160 GiB 20.22 0.99 - host worker1
0 hdd 0.09769 1.00000 100 GiB 24 GiB 23 GiB 7.0 MiB 738 MiB 76 GiB 23.72 1.17 169 up osd.0
3 hdd 0.09769 1.00000 100 GiB 17 GiB 16 GiB 50 MiB 737 MiB 83 GiB 16.73 0.82 168 up osd.3
-3 0.19537 - 200 GiB 41 GiB 39 GiB 35 MiB 1.8 GiB 159 GiB 20.37 1.00 - host worker2
1 hdd 0.09769 1.00000 100 GiB 18 GiB 17 GiB 23 MiB 836 MiB 82 GiB 17.74 0.87 189 up osd.1
4 hdd 0.09769 1.00000 100 GiB 23 GiB 22 GiB 12 MiB 967 MiB 77 GiB 23.00 1.13 148 up osd.4
-7 0.19537 - 200 GiB 41 GiB 39 GiB 61 MiB 1.9 GiB 159 GiB 20.44 1.00 - host worker3
2 hdd 0.09769 1.00000 100 GiB 22 GiB 21 GiB 29 MiB 848 MiB 78 GiB 21.78 1.07 163 up osd.2
5 hdd 0.09769 1.00000 100 GiB 19 GiB 18 GiB 33 MiB 1.0 GiB 81 GiB 19.09 0.94 174 up osd.5
TOTAL 600 GiB 122 GiB 117 GiB 153 MiB 5.1 GiB 478 GiB 20.35
MIN/MAX VAR: 0.82/1.17 STDDEV: 2.64
We will detail the disk extension procedure on worker1. The output above shows worker1's disks as osd.0 and osd.3.
The first step is to stop the Ceph operator and delete the 2 OSD deployments related to worker1:
[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=0
[benoit@master ~]$ kubectl delete deployment.apps/rook-ceph-osd-0 deployment.apps/rook-ceph-osd-3 -n rookceph
deployment.apps "rook-ceph-osd-0" deleted
deployment.apps "rook-ceph-osd-3" deleted
With the operator stopped, these deployments will not be re-created automatically. We can then continue by moving osd.0 and osd.3 out of this Ceph cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.3
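As the same sequence will have to be repeated later for the other workers, it can also be scripted. Here is a minimal sketch, assuming you replace the ids 0 and 3 with the ids of the OSDs you are draining (the -it flags are dropped because no interactive terminal is needed):
for i in 0 3; do
  kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd out $i
  kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd crush remove osd.$i
  kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph auth del osd.$i
  kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd down osd.$i
  kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd rm osd.$i
done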
All references to these 2 OSDs have now been removed from this Ceph cluster. We then wait for the cluster to recover from this removal and monitor its status:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
cluster:
id: eb11b571-65e7-480a-b15a-e3a200946d3a
health: HEALTH_WARN
Degraded data redundancy: 55832/263757 objects degraded (21.168%), 88 pgs degraded
services:
mon: 3 daemons, quorum a,b,c (age 5d)
mgr: a(active, since 13d), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 4 osds: 4 up (since 32s), 4 in (since 15m); 104 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 337 pgs
objects: 87.92k objects, 4.9 GiB
usage: 15 GiB used, 385 GiB / 400 GiB avail
pgs: 55832/263757 objects degraded (21.168%)
32087/263757 objects misplaced (12.165%)
145 active+undersized
88 active+undersized+degraded
82 active+clean+remapped
22 active+clean
io:
client: 1023 B/s rd, 1 op/s rd, 0 op/s wr
...
...
...
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
cluster:
id: eb11b571-65e7-480a-b15a-e3a200946d3a
health: HEALTH_WARN
services:
mon: 3 daemons, quorum a,b,c (age 5d)
mgr: a(active, since 13d), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 6 osds: 6 up (since 5d), 4 in (since 102s)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 337 pgs
objects: 87.92k objects, 4.9 GiB
usage: 19 GiB used, 481 GiB / 500 GiB avail
pgs: 337 active+clean
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
You can use the ceph command with -s (same as status) to follow this redistribution, and combine it with the Linux watch command to follow the changes live, as shown below. Notice that only 4 OSDs are now "in" because we removed 2 of them. The cluster is ready when all pgs are in the active+clean state.
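For example, to refresh the status every 5 seconds (a simple sketch; the -it flags are dropped because watch does not provide a terminal):
[benoit@master ~]$ watch -n 5 'kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph -s'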
Let’s check the OSDs status:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 0.39075 - 400 GiB 15 GiB 12 GiB 154 MiB 2.3 GiB 385 GiB 3.65 1.00 - root default
-3 0 - 0 B 0 B 0 B 0 B 0 B 0 B 0 0 - host worker1
-7 0.19537 - 200 GiB 8.0 GiB 6.2 GiB 83 MiB 1.6 GiB 192 GiB 3.98 1.09 - host worker2
1 hdd 0.09769 1.00000 100 GiB 3.7 GiB 3.1 GiB 57 MiB 563 MiB 96 GiB 3.68 1.01 200 up osd.1
5 hdd 0.09769 1.00000 100 GiB 4.3 GiB 3.2 GiB 26 MiB 1.1 GiB 96 GiB 4.27 1.17 191 up osd.5
-5 0.19537 - 200 GiB 6.6 GiB 5.9 GiB 71 MiB 667 MiB 193 GiB 3.32 0.91 - host worker3
2 hdd 0.09769 1.00000 100 GiB 3.3 GiB 2.9 GiB 51 MiB 367 MiB 97 GiB 3.35 0.92 208 up osd.2
4 hdd 0.09769 1.00000 100 GiB 3.3 GiB 3.0 GiB 20 MiB 299 MiB 97 GiB 3.30 0.90 179 up osd.4
TOTAL 400 GiB 15 GiB 12 GiB 154 MiB 2.3 GiB 385 GiB 3.65
MIN/MAX VAR: 0.90/1.17 STDDEV: 0.39
Both disk references have been properly removed from this Ceph cluster. We can now safely proceed with the disk extension.
Extending the disks
In our environment, the worker nodes used for storage are VMware virtual machines, so extending their disks is just a matter of changing a parameter on the VM. Once done (a device rescan may be needed for the guest OS to see the new size, see the note after these commands), we can wipe each disk as follows:
[root@worker1 ~]# lsblk | egrep "sdb|sde"
[root@worker1 ~]# sgdisk --zap-all /dev/sdb
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
[root@worker1 ~]# dd if=/dev/zero of="/dev/sdb" bs=1M count=100 oflag=direct,dsync
100+0 records in
100+0 records out
[root@worker1 ~]# partprobe /dev/sdb
[root@worker1 ~]# sgdisk --zap-all /dev/sde
[root@worker1 ~]# dd if=/dev/zero of="/dev/sde" bs=1M count=100 oflag=direct,dsync
[root@worker1 ~]# partprobe /dev/sde
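If lsblk still reports the old size after the extension, the kernel may simply not have noticed the change yet. One possible way to trigger a rescan without rebooting, assuming sdb is a SCSI disk (adapt the device name to your setup):
[root@worker1 ~]# echo 1 > /sys/class/block/sdb/device/rescan
[root@worker1 ~]# lsblk /dev/sdb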
We can now bring those disks back into our Ceph cluster.
Bringing back those extended disks
This step shows all the power of operating Ceph with Rook: we simply scale the operator back up, let it re-create the OSDs on the freshly wiped disks and watch the cluster rebuild itself:
[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=1
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -w
...
...
...
2023-02-16T14:16:06.419312+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3960/263763 objects degraded (1.501%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:12.426883+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3078/263763 objects degraded (1.167%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:18.260414+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2223/263763 objects degraded (0.843%), 8 pgs degraded, 8 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:26.427226+0000 mon.a [WRN] Health check update: Degraded data redundancy: 1459/263763 objects degraded (0.553%), 6 pgs degraded, 6 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:31.429443+0000 mon.a [WRN] Health check update: Degraded data redundancy: 848/263763 objects degraded (0.322%), 4 pgs degraded, 4 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:36.432270+0000 mon.a [WRN] Health check update: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:38.557667+0000 mon.a [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized)
2023-02-16T14:16:38.557705+0000 mon.a [INF] Cluster is now healthy
You can use the ceph command with -w to monitor the health of the Ceph cluster live. And just like that, by scaling the operator back up, the Ceph cluster is up and running again after a few minutes:
[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph status
cluster:
id: eb11b571-65e7-480a-b15a-e3a200946d3a
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 3h)
mgr: b(active, since 3h), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 6 osds: 6 up (since 3m), 6 in (since 6m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 337 pgs
objects: 87.92k objects, 4.9 GiB
usage: 19 GiB used, 481 GiB / 500 GiB avail
pgs: 337 active+clean
io:
client: 852 B/s rd, 1 op/s rd, 0 op/s wr
[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 3.32034 - 3.3 TiB 18 GiB 16 GiB 180 MiB 2.7 GiB 3.3 TiB 0.54 1.00 - root default
-3 2.92960 - 2.9 TiB 6.0 GiB 5.2 GiB 0 B 805 MiB 2.9 TiB 0.20 0.37 - host worker1
0 hdd 1.46480 1.00000 1.5 TiB 2.1 GiB 1.9 GiB 0 B 296 MiB 1.5 TiB 0.14 0.26 160 up osd.0
3 hdd 1.46480 1.00000 1.5 TiB 3.8 GiB 3.3 GiB 0 B 509 MiB 1.5 TiB 0.25 0.47 177 up osd.3
-7 0.19537 - 200 GiB 6.5 GiB 5.4 GiB 109 MiB 1.0 GiB 194 GiB 3.24 5.99 - host worker2
1 hdd 0.09769 1.00000 100 GiB 3.6 GiB 2.9 GiB 57 MiB 639 MiB 96 GiB 3.60 6.65 179 up osd.1
5 hdd 0.09769 1.00000 100 GiB 2.9 GiB 2.4 GiB 52 MiB 414 MiB 97 GiB 2.89 5.33 161 up osd.5
-5 0.19537 - 200 GiB 6.0 GiB 5.0 GiB 71 MiB 924 MiB 194 GiB 2.98 5.51 - host worker3
2 hdd 0.09769 1.00000 100 GiB 2.9 GiB 2.4 GiB 51 MiB 478 MiB 97 GiB 2.90 5.36 171 up osd.2
4 hdd 0.09769 1.00000 100 GiB 3.1 GiB 2.6 GiB 20 MiB 446 MiB 97 GiB 3.06 5.65 163 up osd.4
TOTAL 3.3 TiB 18 GiB 16 GiB 180 MiB 2.7 GiB 3.3 TiB 0.54
MIN/MAX VAR: 0.26/6.65 STDDEV: 2.12
We can see the new size of both disks! All the steps we did to remove those disks have been automatically reverted. Brilliant!
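This works because the operator detects the now empty devices on worker1 and provisions new OSDs on them, according to the storage section of the CephCluster resource. To display that section, something like the following can be used (a sketch, assuming a single CephCluster resource in the namespace):
[benoit@master ~]$ kubectl -n rookceph get cephcluster -o jsonpath='{.items[0].spec.storage}'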
We then just had to repeat the same procedure on the 2 other workers used for storage, and we now have a Ceph cluster with a total capacity of 9 TB!
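Once all three workers have been processed, the overall raw capacity can be confirmed from the toolbox with ceph df (or ceph osd df tree as shown above), for example:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph df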