This task is to test and document a way to achieve rack level HA.
Note that in codfw we don't really have the hosts in different racks, so we would not be able to test a real case, but we can experiment.
Note that C8 and D5 are grouped together in codfw because there are only 3 hosts there (in eqiad,
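For context, rack-level HA in CRUSH comes down to declaring `rack` buckets in the hierarchy and having the replication rule choose one leaf per rack instead of per host. A minimal sketch of what the relevant parts of a decompiled crushmap look like (bucket names, ids, and weights here are illustrative, not the actual codfw layout):
```
# Illustrative fragment of a decompiled crushmap (names are made up).
# Each host sits under a rack bucket; the rule then picks one
# leaf (OSD) per rack rather than per host.
rack rack-a {
        id -10
        alg straw2
        hash 0  # rjenkins1
        item host-a1 weight 1.000
}
rack rack-b {
        id -11
        alg straw2
        hash 0  # rjenkins1
        item host-b1 weight 1.000
}
rule replicated_rack_ha {
        id 1
        type replicated
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
```
The rack buckets also need to be referenced as items under the root (e.g. `default`) bucket for the hierarchy to be complete.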
= Deployment
== Setting it manually, all at once
Reference: https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/
* [] Dump the current crushmap:
```
ceph osd getcrushmap -o crushmap.bin
```
* [] Decompile it into text:
```
crushtool -d crushmap.bin -o crushmap.txt
```
* [] Make a backup copy:
```
cp crushmap.txt crushmap.$(date +%Y%m%d%H%M%S).before_rack_ha.txt
```
* [] Get the new prepared crushmap from:
{P44926}
** [] Compile it:
```
crushtool -c new_crushmap.txt -o crushmap.bin
```
** [] Test that the rules still work correctly (check that there are no misplaced PGs, that it shows 1024/1024, and that placements are roughly even across devices; current output example P44923):
```
crushtool --test -i crushmap.bin --show-utilization --num-rep=3
```
** [] Load the new crush map and wait for the cluster to shift data around (this will take a long time):
```
ceph osd setcrushmap -i crushmap.bin
```
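Once the new map is loaded, the rebalance can be followed with the standard Ceph status commands (a sketch; exact output varies by Ceph version):
```
# Overall health, plus recovery/backfill progress and misplaced PG counts
ceph -s
# Compact per-PG summary
ceph pg stat
# Confirm the new rack buckets show up in the hierarchy
ceph osd tree
```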
== Tests
Followed the above procedure (adapted for codfw) to set up a "fake" rack HA there, with three virtual racks and one node in each rack. The cluster started rebalancing (shown as repair and misplaced PGs) without loss of service or redundancy.
It took on the order of 3 hours to rebalance 6 OSDs.