In my previous blog, I introduced Rook Ceph, a storage solution for Kubernetes. We saw that our architecture had 3 workers dedicated to storage using the Ceph file system. Each worker had 2 disks of 100 GB, and from the storage point of view one OSD equals one disk. With these 6 disks we had a Ceph cluster with a total raw capacity of 600 GB. This disk size was fine for our initial tests, but we wanted to extend each disk to 1.5 TB to get closer to a production-like design.
The pools of our Ceph cluster use the default configuration of 3 replicas for their placement groups (PGs) and tolerate the loss of 1 replica: the cluster keeps operating with only 2 replicas. Knowing this, we can perform the disk extension in a controlled way by working on one worker at a time. We keep the Ceph cluster available to users and avoid any risk of data corruption. Data corruption (or even loss) is what could happen if we extended the disks abruptly, without properly preparing the cluster for this operation.
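As a quick sanity check before touching any disk, you can verify those replication settings from the toolbox pod (a small sketch; replace <pool-name> with one of your own pools):

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool get <pool-name> size      # replica count, 3 in our case
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool get <pool-name> min_size  # minimum replicas to keep serving I/O, 2 by default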
Preparing the Ceph cluster before the disk extension
First, let’s have a look at the status of our Ceph cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Then let’s look at the configuration of the OSDs in this cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         0.58612         -  600 GiB  122 GiB  117 GiB  153 MiB  5.1 GiB  478 GiB  20.35  1.00    -          root default
-5         0.19537         -  200 GiB   40 GiB   39 GiB   57 MiB  1.4 GiB  160 GiB  20.22  0.99    -          host worker1
 0    hdd  0.09769   1.00000  100 GiB   24 GiB   23 GiB  7.0 MiB  738 MiB   76 GiB  23.72  1.17  169      up  osd.0
 4    hdd  0.09769   1.00000  100 GiB   17 GiB   16 GiB   50 MiB  737 MiB   83 GiB  16.73  0.82  168      up  osd.3
-3         0.19537         -  200 GiB   41 GiB   39 GiB   35 MiB  1.8 GiB  159 GiB  20.37  1.00    -          host worker2
 1    hdd  0.09769   1.00000  100 GiB   18 GiB   17 GiB   23 MiB  836 MiB   82 GiB  17.74  0.87  189      up  osd.1
 3    hdd  0.09769   1.00000  100 GiB   23 GiB   22 GiB   12 MiB  967 MiB   77 GiB  23.00  1.13  148      up  osd.4
-7         0.19537         -  200 GiB   41 GiB   39 GiB   61 MiB  1.9 GiB  159 GiB  20.44  1.00    -          host worker3
 6    hdd  0.09769   1.00000  100 GiB   22 GiB   21 GiB   29 MiB  848 MiB   78 GiB  21.78  1.07  163      up  osd.2
 7    hdd  0.09769   1.00000  100 GiB   19 GiB   18 GiB   33 MiB  1.0 GiB   81 GiB  19.09  0.94  174      up  osd.5
                       TOTAL  600 GiB  122 GiB  117 GiB  153 MiB  5.1 GiB  478 GiB  20.35
MIN/MAX VAR: 0.82/1.17  STDDEV: 2.64
We will detail the disk extension procedure on worker1. The output above shows that the worker1 disks are the OSDs labelled osd.0 and osd.3.
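If you also want to confirm which block device on the node backs each of these OSDs, the OSD metadata can tell you (a quick sketch; the hostname and devices fields are those of a typical BlueStore OSD):

[benoit@master ~]$ kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd metadata 0 | grep -E '"hostname"|"devices"'
[benoit@master ~]$ kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph osd metadata 3 | grep -E '"hostname"|"devices"'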
The first thing to do is to stop the Ceph operator and delete the 2 OSD deployments related to worker1:
[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=0
[benoit@master ~]$ kubectl delete deployment.apps/rook-ceph-osd-0 deployment.apps/rook-ceph-osd-3 -n rookceph
deployment.apps "rook-ceph-osd-0" deleted
deployment.apps "rook-ceph-osd-3" deleted
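Before deleting anything, it is worth confirming that the operator pod is really gone, so that it cannot re-create the OSD deployments behind our back (a quick check; the label selector assumes a default Rook install):

[benoit@master ~]$ kubectl -n rookceph get deployment rook-ceph-operator
[benoit@master ~]$ kubectl -n rookceph get pods -l app=rook-ceph-operator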
Since the operator is stopped, these deployments will not be automatically re-created. We can then continue by removing osd.0 and osd.3 from the Ceph cluster:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.3
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.3
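As a side note, on recent Ceph releases the crush remove / auth del / osd rm sequence can be collapsed into a single purge command per OSD (the OSD must already be down and out). This is a hedged alternative, not what we actually ran:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd purge 0 --yes-i-really-mean-it
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd purge 3 --yes-i-really-mean-it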
All references to these 2 OSDs have now been removed from this Ceph cluster. We then wait for the cluster to recover from this removal and monitor its status:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_WARN
            Degraded data redundancy: 55832/263757 objects degraded (21.168%), 88 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 4 up (since 32s), 4 in (since 15m); 104 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   15 GiB used, 385 GiB / 400 GiB avail
    pgs:     55832/263757 objects degraded (21.168%)
             32087/263757 objects misplaced (12.165%)
             145 active+undersized
             88  active+undersized+degraded
             82  active+clean+remapped
             22  active+clean

  io:
    client:   1023 B/s rd, 1 op/s rd, 0 op/s wr

...
...
...

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_WARN

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5d), 4 in (since 102s)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
You can use the ceph command with -s (same as status) to follow this redistribution (combine it with the Linux watch command to follow the changes live). Notice that only 4 OSDs are now “in”, because we removed 2 of them. The cluster is ready when all PGs are in the active+clean state.
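For example, a simple way to follow the recovery live (the 10-second interval is just a suggestion; the -t flag is dropped because watch does not allocate a TTY):

[benoit@master ~]$ watch -n 10 "kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph -s"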
Let’s check the status of the OSDs:
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         0.39075         -  400 GiB   15 GiB   12 GiB  154 MiB  2.3 GiB  385 GiB  3.65  1.00    -          root default
-3               0         -      0 B      0 B      0 B      0 B      0 B      0 B     0     0    -          host worker1
-7         0.19537         -  200 GiB  8.0 GiB  6.2 GiB   83 MiB  1.6 GiB  192 GiB  3.98  1.09    -          host worker2
 1    hdd  0.09769   1.00000  100 GiB  3.7 GiB  3.1 GiB   57 MiB  563 MiB   96 GiB  3.68  1.01  200      up  osd.1
 5    hdd  0.09769   1.00000  100 GiB  4.3 GiB  3.2 GiB   26 MiB  1.1 GiB   96 GiB  4.27  1.17  191      up  osd.5
-5         0.19537         -  200 GiB  6.6 GiB  5.9 GiB   71 MiB  667 MiB  193 GiB  3.32  0.91    -          host worker3
 2    hdd  0.09769   1.00000  100 GiB  3.3 GiB  2.9 GiB   51 MiB  367 MiB   97 GiB  3.35  0.92  208      up  osd.2
 4    hdd  0.09769   1.00000  100 GiB  3.3 GiB  3.0 GiB   20 MiB  299 MiB   97 GiB  3.30  0.90  179      up  osd.4
                       TOTAL  400 GiB   15 GiB   12 GiB  154 MiB  2.3 GiB  385 GiB  3.65
MIN/MAX VAR: 0.90/1.17  STDDEV: 0.39
Both disk references have been properly removed from the Ceph cluster. We can now safely proceed with the disk extension.
Extension of disks
In our environment, the worker nodes used for storage are VMware virtual machines, so extending their disks is just a parameter to change. Once done, we wipe each disk (zap its partition table and zero its first blocks) as follows:
[root@worker1 ~]# lsblk | egrep "sdb|sde"

[root@worker1 ~]# sgdisk --zap-all /dev/sdb
Warning: Partition table header claims that the size of partition table entries is 0 bytes, but this program supports only 128-byte entries. Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table entries is 0 bytes, but this program supports only 128-byte entries. Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.

[root@worker1 ~]# dd if=/dev/zero of="/dev/sdb" bs=1M count=100 oflag=direct,dsync
100+0 records in
100+0 records out

[root@worker1 ~]# partprobe /dev/sdb

[root@worker1 ~]# sgdisk --zap-all /dev/sde

[root@worker1 ~]# dd if=/dev/zero of="/dev/sde" bs=1M count=100 oflag=direct,dsync

[root@worker1 ~]# partprobe /dev/sde
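A side note on the resize itself: if the guest OS does not immediately report the new 1.5 TB size in lsblk, the SCSI devices can usually be rescanned online (a hedged sketch; device names and the need for a rescan depend on your hypervisor and OS):

[root@worker1 ~]# echo 1 > /sys/class/block/sdb/device/rescan
[root@worker1 ~]# echo 1 > /sys/class/block/sde/device/rescan
[root@worker1 ~]# lsblk /dev/sdb /dev/sde    # both disks should now report 1.5T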
We can now bring those disks back into our Ceph cluster.
Bringing back those extended disks
This step shows all the power of using Rook for operating Ceph:
[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=1
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -w
...
...
...
2023-02-16T14:16:06.419312+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3960/263763 objects degraded (1.501%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:12.426883+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3078/263763 objects degraded (1.167%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:18.260414+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2223/263763 objects degraded (0.843%), 8 pgs degraded, 8 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:26.427226+0000 mon.a [WRN] Health check update: Degraded data redundancy: 1459/263763 objects degraded (0.553%), 6 pgs degraded, 6 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:31.429443+0000 mon.a [WRN] Health check update: Degraded data redundancy: 848/263763 objects degraded (0.322%), 4 pgs degraded, 4 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:36.432270+0000 mon.a [WRN] Health check update: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:38.557667+0000 mon.a [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized)
2023-02-16T14:16:38.557705+0000 mon.a [INF] Cluster is now healthy
You can use the ceph command with -w to follow the health of the Ceph cluster live. And just like that, simply by scaling the operator back up, the operator detects the wiped disks on worker1, re-creates the OSDs on them, and the cluster is back up and running after a few minutes:
[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph status
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 3h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 3m), 6 in (since 6m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr

[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         3.32034         -  3.3 TiB   18 GiB   16 GiB  180 MiB  2.7 GiB  3.3 TiB  0.54  1.00    -          root default
-3         2.92960         -  2.9 TiB  6.0 GiB  5.2 GiB      0 B  805 MiB  2.9 TiB  0.20  0.37    -          host worker1
 0    hdd  1.46480   1.00000  1.5 TiB  2.1 GiB  1.9 GiB      0 B  296 MiB  1.5 TiB  0.14  0.26  160      up  osd.0
 3    hdd  1.46480   1.00000  1.5 TiB  3.8 GiB  3.3 GiB      0 B  509 MiB  1.5 TiB  0.25  0.47  177      up  osd.3
-7         0.19537         -  200 GiB  6.5 GiB  5.4 GiB  109 MiB  1.0 GiB  194 GiB  3.24  5.99    -          host worker2
 1    hdd  0.09769   1.00000  100 GiB  3.6 GiB  2.9 GiB   57 MiB  639 MiB   96 GiB  3.60  6.65  179      up  osd.1
 5    hdd  0.09769   1.00000  100 GiB  2.9 GiB  2.4 GiB   52 MiB  414 MiB   97 GiB  2.89  5.33  161      up  osd.5
-5         0.19537         -  200 GiB  6.0 GiB  5.0 GiB   71 MiB  924 MiB  194 GiB  2.98  5.51    -          host worker3
 2    hdd  0.09769   1.00000  100 GiB  2.9 GiB  2.4 GiB   51 MiB  478 MiB   97 GiB  2.90  5.36  171      up  osd.2
 4    hdd  0.09769   1.00000  100 GiB  3.1 GiB  2.6 GiB   20 MiB  446 MiB   97 GiB  3.06  5.65  163      up  osd.4
                       TOTAL  3.3 TiB   18 GiB   16 GiB  180 MiB  2.7 GiB  3.3 TiB  0.54
MIN/MAX VAR: 0.26/6.65  STDDEV: 2.12
We can see the new size of both disks! Everything we did to remove those OSDs has been automatically reverted by the operator. Brilliant!
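You can also watch this happen from the Kubernetes side: once the operator is back, the osd-prepare jobs run again on worker1 and the two OSD deployments reappear (a quick sketch; the label selectors are those of a default Rook install):

[benoit@master ~]$ kubectl -n rookceph get pods -l app=rook-ceph-osd-prepare
[benoit@master ~]$ kubectl -n rookceph get deployments -l app=rook-ceph-osd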
We just had to repeat the same procedure on the 2 other workers used for storage. We now have a Ceph cluster with 9 TB of raw capacity (6 disks of 1.5 TB), which with 3 replicas gives roughly 3 TB of usable space!