Test and document what happens when an osd goes down, it's down for a while, and comes up on the codfw cluster.
Mentioned in SAL (#wikimedia-cloud) [2021-08-05T09:37:08Z] <dcaro> Taking one osd daemon down ot codfw cluster (T288203)
TL;DR:
It's OK to take an OSD down: if the downtime is short (less than 10 minutes by default; configurable, see the commands after these notes), no rebalancing is needed. Otherwise some rebalancing will happen, though the cluster stays available the whole time.
Note that rebalancing after bringing the OSD back up is way faster than when taking it down (as it reuses data).
Note 2: it was using the private interface to transfer data; we don't collect metrics for that, so I was monitoring it manually on the host.
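For reference, I believe the timeout mentioned above is the standard mon_osd_down_out_interval option (600 seconds by default), which controls how long an OSD can be down before it is marked 'out' and rebalancing starts. Checking or changing it would be something like:

# show the current value (in seconds)
ceph config get mon mon_osd_down_out_interval
# example: raise it to 20 minutes to fit slightly longer planned work
ceph config set mon mon_osd_down_out_interval 1200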
The test was: stop one OSD daemon (osd.2 on cloudcephosd2002-dev), leave it down for ~50 minutes, then start it again.
The full log of 'ceph status' for the procedure:
root@cloudcephosd2002-dev:~# systemctl stop ceph-osd@2.service
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 09:41:37 AM UTC
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 10:32:32 AM UTC
root@cloudcephosd2002-dev:~# systemctl start ceph-osd@2.service
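For anyone reproducing this, standard commands to watch recovery while the OSD is down would be along the lines of:

# overall cluster state, refreshed every 10s (degraded PGs, recovery throughput)
watch -n 10 ceph status
# detailed health messages (e.g. '1 osds down', degraded object counts)
ceph health detail
# where the down OSD sits in the CRUSH tree
ceph osd tree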
'ceph osd df' after the procedure shows little variance and well-balanced data:
root@cloudcephmon2002-dev:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE   VAR   PGS  STATUS
 0  ssd    0.87299   1.00000  894 GiB  172 GiB  171 GiB  2.5 MiB  1022 MiB  722 GiB  19.29  1.04  210      up
 1  ssd    0.87299   1.00000  894 GiB  159 GiB  158 GiB  2.8 MiB  1021 MiB  735 GiB  17.76  0.96  191      up
 2  ssd    0.87299   1.00000  894 GiB  156 GiB  155 GiB  2.7 MiB  1021 MiB  738 GiB  17.45  0.94  190      up
 3  ssd    0.87299   1.00000  894 GiB  175 GiB  174 GiB  1.4 MiB   1.2 GiB  719 GiB  19.62  1.06  211      up
 4  ssd    0.87299   1.00000  894 GiB  169 GiB  168 GiB  3.2 MiB  1021 MiB  725 GiB  18.89  1.02  202      up
 5  ssd    0.87299   1.00000  894 GiB  162 GiB  161 GiB  5.6 MiB  1018 MiB  732 GiB  18.16  0.98  199      up
                       TOTAL  5.2 TiB  994 GiB  988 GiB   18 MiB   6.2 GiB  4.3 TiB  18.53
MIN/MAX VAR: 0.94/1.06  STDDEV: 0.80
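(For reading the table: VAR is each OSD's utilization relative to the cluster average, e.g. for osd.0 that's 19.29 / 18.53 ≈ 1.04, so all OSDs are within ±6% of the mean.)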
Just to doublecheck -- the takeaway from this is that when we need to do quick service on an OSD node (e.g. drive replacement for T287838) we can just shut down the host, do the service, switch it back on, and Ceph will do something reasonable?
Yes, and even if it's not quick, Ceph will manage; it might take a bit more time and bandwidth if the downtime is longer, but it will eventually get there.
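For that kind of planned maintenance, the standard routine (using the noout flag so nothing gets marked 'out' while the host is off; osd.2 is just an example id here) would be roughly:

ceph osd set noout                   # down OSDs won't be marked 'out', so no rebalancing starts
systemctl stop ceph-osd@2.service    # on the OSD host (repeat per OSD), then power off
# ... replace the drive, power the host back on ...
systemctl start ceph-osd@2.service
ceph osd unset noout                 # back to normal behavior
ceph status                          # wait for HEALTH_OK once recovery finishes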