In my previous blog post, I introduced Rook Ceph as a storage solution for Kubernetes. In our architecture, 3 workers are dedicated to storage with the Ceph file system. Each worker has 2 disks of 100 GB and, from the storage point of view, one OSD equals one disk. With these 6 disks we had a Ceph cluster with a total raw capacity of 600 GB. This disk size was fine for our initial tests, but we wanted to extend each disk to 1.5 TB to get closer to a production-like design.

The pools of our Ceph cluster use the default configuration of 3 replicas for their placement groups (pgs) and tolerate the loss of 1 replica: the cluster still operates with only 2 replicas available. Knowing this, we can perform the disk extension in a controlled way by working on one worker at a time. This keeps our Ceph cluster available to users and avoids any risk of data corruption (or even data loss), which is what could happen if we extended the disks abruptly without properly preparing the cluster for this operation.
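
If you want to double-check these replication settings, the size and min_size of every pool can be listed from the Rook toolbox (a quick sanity check, using the same rook-ceph-tools deployment as in the rest of this post); with our defaults each replicated pool reports size 3 and min_size 2:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail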

Preparing the Ceph cluster before the disk extension

First let’s have a look at the status of our Ceph cluster:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

Then let's look at the configuration of the OSDs in this cluster:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         0.58612         -  600 GiB  122 GiB  117 GiB  153 MiB  5.1 GiB  478 GiB  20.35  1.00    -          root default
-5         0.19537         -  200 GiB   40 GiB   39 GiB   57 MiB  1.4 GiB  160 GiB  20.22  0.99    -              host worker1
 0    hdd  0.09769   1.00000  100 GiB   24 GiB   23 GiB  7.0 MiB  738 MiB   76 GiB  23.72  1.17  169      up          osd.0
 4    hdd  0.09769   1.00000  100 GiB   17 GiB   16 GiB   50 MiB  737 MiB   83 GiB  16.73  0.82  168      up          osd.3
-3         0.19537         -  200 GiB   41 GiB   39 GiB   35 MiB  1.8 GiB  159 GiB  20.37  1.00    -              host worker2
 1    hdd  0.09769   1.00000  100 GiB   18 GiB   17 GiB   23 MiB  836 MiB   82 GiB  17.74  0.87  189      up          osd.1
 3    hdd  0.09769   1.00000  100 GiB   23 GiB   22 GiB   12 MiB  967 MiB   77 GiB  23.00  1.13  148      up          osd.4
-7         0.19537         -  200 GiB   41 GiB   39 GiB   61 MiB  1.9 GiB  159 GiB  20.44  1.00    -              host worker3
 6    hdd  0.09769   1.00000  100 GiB   22 GiB   21 GiB   29 MiB  848 MiB   78 GiB  21.78  1.07  163      up          osd.2
 7    hdd  0.09769   1.00000  100 GiB   19 GiB   18 GiB   33 MiB  1.0 GiB   81 GiB  19.09  0.94  174      up          osd.5
                       TOTAL  600 GiB  122 GiB  117 GiB  153 MiB  5.1 GiB  478 GiB  20.35
MIN/MAX VAR: 0.82/1.17  STDDEV: 2.64

We will detail the disk extension procedure on worker1. The output above shows that worker1 hosts the disks with the IDs osd.0 and osd.3.
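
If you are not sure which OSD deployments belong to which worker, you can also list the OSD pods per node before touching anything (a small helper, assuming Rook's default app=rook-ceph-osd label):

[benoit@master ~]$ kubectl -n rookceph get pods -l app=rook-ceph-osd -o wide | grep worker1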

The first thing to do is to stop the Ceph operator and delete the 2 OSD deployments related to worker1:

[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=0

[benoit@master ~]$ kubectl delete deployment.apps/rook-ceph-osd-0 deployment.apps/rook-ceph-osd-3 -n rookceph
deployment.apps "rook-ceph-osd-0" deleted
deployment.apps "rook-ceph-osd-3" deleted

As the operator is stopped, these deployments will not be automatically re-created. We can then continue by taking osd.0 and osd.3 out of the Ceph cluster and removing them:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd out 3

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd crush remove osd.3

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph auth del osd.3

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd down osd.3

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.0
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd rm osd.3
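
For reference, on recent Ceph releases the crush remove / auth del / osd rm sequence can usually be collapsed into a single purge per OSD (an equivalent shortcut, not what we ran here):

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd purge 0 --yes-i-really-mean-it
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd purge 3 --yes-i-really-mean-it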

All references to these 2 OSDs have now been removed from this Ceph cluster. We then wait for the cluster to recover from this removal and monitor its status:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_WARN
            Degraded data redundancy: 55832/263757 objects degraded (21.168%), 88 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 4 up (since 32s), 4 in (since 15m); 104 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   15 GiB used, 385 GiB / 400 GiB avail
    pgs:     55832/263757 objects degraded (21.168%)
             32087/263757 objects misplaced (12.165%)
             145 active+undersized
             88  active+undersized+degraded
             82  active+clean+remapped
             22  active+clean

  io:
    client:   1023 B/s rd, 1 op/s rd, 0 op/s wr
...
...
...
[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_WARN

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 13d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5d), 4 in (since 102s)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

You can use the ceph command with -s (short for status) to follow this redistribution; combine it with the Linux watch command to follow the changes live. Notice that only 4 OSDs are now “in”, because we removed 2 of them. The cluster has fully recovered when all pgs are back in the active+clean state.
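
For example, a minimal way to refresh the status every 10 seconds (note that the interactive -it flags of kubectl exec should be dropped for watch to work):

[benoit@master ~]$ watch -n 10 'kubectl -n rookceph exec deploy/rook-ceph-tools -- ceph -s'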

Let’s check the status of the OSDs:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         0.39075         -  400 GiB   15 GiB   12 GiB  154 MiB  2.3 GiB  385 GiB  3.65  1.00    -          root default
-3               0         -      0 B      0 B      0 B      0 B      0 B      0 B     0     0    -              host worker1
-7         0.19537         -  200 GiB  8.0 GiB  6.2 GiB   83 MiB  1.6 GiB  192 GiB  3.98  1.09    -              host worker2
 1    hdd  0.09769   1.00000  100 GiB  3.7 GiB  3.1 GiB   57 MiB  563 MiB   96 GiB  3.68  1.01  200      up          osd.1
 5    hdd  0.09769   1.00000  100 GiB  4.3 GiB  3.2 GiB   26 MiB  1.1 GiB   96 GiB  4.27  1.17  191      up          osd.5
-5         0.19537         -  200 GiB  6.6 GiB  5.9 GiB   71 MiB  667 MiB  193 GiB  3.32  0.91    -              host worker3
 2    hdd  0.09769   1.00000  100 GiB  3.3 GiB  2.9 GiB   51 MiB  367 MiB   97 GiB  3.35  0.92  208      up          osd.2
 4    hdd  0.09769   1.00000  100 GiB  3.3 GiB  3.0 GiB   20 MiB  299 MiB   97 GiB  3.30  0.90  179      up          osd.4
                       TOTAL  400 GiB   15 GiB   12 GiB  154 MiB  2.3 GiB  385 GiB  3.65
MIN/MAX VAR: 0.90/1.17  STDDEV: 0.39

Both disk references have been properly removed from the Ceph cluster. We can now safely proceed with extending the disks.

Extension of disks

In our environment, the worker nodes used for storage are VMware virtual machines, so extending their disks is just a matter of changing a parameter. Once done, we wipe each disk as follows so that Ceph can later re-create a clean OSD on it:

[root@worker1 ~]# lsblk | egrep "sdb|sde"
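
If lsblk still reports the old 100 GB size at this point, a rescan of the SCSI devices usually makes the guest pick up the new geometry without a reboot (a hedged sketch, device names taken from our example):

[root@worker1 ~]# echo 1 > /sys/class/block/sdb/device/rescan
[root@worker1 ~]# echo 1 > /sys/class/block/sde/device/rescan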

[root@worker1 ~]# sgdisk --zap-all /dev/sdb
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.

[root@worker1 ~]# dd if=/dev/zero of="/dev/sdb" bs=1M count=100 oflag=direct,dsync
100+0 records in
100+0 records out

[root@worker1 ~]# partprobe /dev/sdb

[root@worker1 ~]# sgdisk --zap-all /dev/sde

[root@worker1 ~]# dd if=/dev/zero of="/dev/sde" bs=1M count=100 oflag=direct,dsync

[root@worker1 ~]# partprobe /dev/sde
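
Before handing the disks back to Rook, it is worth confirming that the kernel now reports the new 1.5 TB size and that no partitions are left on them (same device names as above):

[root@worker1 ~]# lsblk -o NAME,SIZE,TYPE /dev/sdb /dev/sde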

We can now bring those disks back into our Ceph cluster.

Bringing back those extended disks

This step shows all the power of using Rook for operating Ceph:

[benoit@master ~]$ kubectl -n rookceph scale deployment rook-ceph-operator --replicas=1
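
On the Kubernetes side, the operator will detect the cleaned disks and re-create the two OSD deployments on its own; you can watch the pods come back (assuming Rook's default app=rook-ceph-osd label) while ceph -w, just below, follows the recovery from the Ceph side:

[benoit@master ~]$ kubectl -n rookceph get pods -l app=rook-ceph-osd -o wide -w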

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph -w
...
...
...
2023-02-16T14:16:06.419312+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3960/263763 objects degraded (1.501%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:12.426883+0000 mon.a [WRN] Health check update: Degraded data redundancy: 3078/263763 objects degraded (1.167%), 9 pgs degraded, 9 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:18.260414+0000 mon.a [WRN] Health check update: Degraded data redundancy: 2223/263763 objects degraded (0.843%), 8 pgs degraded, 8 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:26.427226+0000 mon.a [WRN] Health check update: Degraded data redundancy: 1459/263763 objects degraded (0.553%), 6 pgs degraded, 6 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:31.429443+0000 mon.a [WRN] Health check update: Degraded data redundancy: 848/263763 objects degraded (0.322%), 4 pgs degraded, 4 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:36.432270+0000 mon.a [WRN] Health check update: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
2023-02-16T14:16:38.557667+0000 mon.a [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 260/263763 objects degraded (0.099%), 2 pgs degraded, 2 pgs undersized)
2023-02-16T14:16:38.557705+0000 mon.a [INF] Cluster is now healthy

You can use the ceph command with -w to monitor the health of the Ceph cluster live. And just like that, simply by scaling the operator back up, the Ceph cluster is up and running again after a few minutes:

[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph status
  cluster:
    id:     eb11b571-65e7-480a-b15a-e3a200946d3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 3h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 3m), 6 in (since 6m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 87.92k objects, 4.9 GiB
    usage:   19 GiB used, 481 GiB / 500 GiB avail
    pgs:     337 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr

[rook@rook-ceph-tools-9967d64b6-n6rnk /]$ ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         3.32034         -  3.3 TiB   18 GiB   16 GiB  180 MiB  2.7 GiB  3.3 TiB  0.54  1.00    -          root default
-3         2.92960         -  2.9 TiB  6.0 GiB  5.2 GiB      0 B  805 MiB  2.9 TiB  0.20  0.37    -              host worker1
 0    hdd  1.46480   1.00000  1.5 TiB  2.1 GiB  1.9 GiB      0 B  296 MiB  1.5 TiB  0.14  0.26  160      up          osd.0
 3    hdd  1.46480   1.00000  1.5 TiB  3.8 GiB  3.3 GiB      0 B  509 MiB  1.5 TiB  0.25  0.47  177      up          osd.3
-7         0.19537         -  200 GiB  6.5 GiB  5.4 GiB  109 MiB  1.0 GiB  194 GiB  3.24  5.99    -              host worker2
 1    hdd  0.09769   1.00000  100 GiB  3.6 GiB  2.9 GiB   57 MiB  639 MiB   96 GiB  3.60  6.65  179      up          osd.1
 5    hdd  0.09769   1.00000  100 GiB  2.9 GiB  2.4 GiB   52 MiB  414 MiB   97 GiB  2.89  5.33  161      up          osd.5
-5         0.19537         -  200 GiB  6.0 GiB  5.0 GiB   71 MiB  924 MiB  194 GiB  2.98  5.51    -              host worker3
 2    hdd  0.09769   1.00000  100 GiB  2.9 GiB  2.4 GiB   51 MiB  478 MiB   97 GiB  2.90  5.36  171      up          osd.2
 4    hdd  0.09769   1.00000  100 GiB  3.1 GiB  2.6 GiB   20 MiB  446 MiB   97 GiB  3.06  5.65  163      up          osd.4
                       TOTAL  3.3 TiB   18 GiB   16 GiB  180 MiB  2.7 GiB  3.3 TiB  0.54
MIN/MAX VAR: 0.26/6.65  STDDEV: 2.12

We can see the new size of both disks! All the steps we performed to remove those disks have been automatically reverted. Brilliant!

We just had to repeat the same procedure on the 2 other workers used for storage. We now have a Ceph cluster with 9 TB of raw capacity!
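
As a final check, once all three workers have been processed, the new total raw capacity can be confirmed from the toolbox:

[benoit@master ~]$ kubectl -n rookceph exec -it deploy/rook-ceph-tools -- ceph df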