[ceph] Use upmap PG balancer
There are currently some small balancing issues. If we don't actively move the PGs around, they will become a problem soon; enabling upmap will help deal with that.

dcaro triaged this task as Medium priority. Feb 11 2021, 5:31 PM
dcaro created this task.

It seems that we could achieve quite nice balancing. A preliminary test (just checking offline how the balancer would act):

dcaro@cloudcephmon1001:~$ sudo ceph osd getmap -o osd_map
got osdmap epoch 366517

dcaro@cloudcephmon1001:~$ sudo osdmaptool osd_map --upmap out.txt --upmap-pool eqiad1-compute --upmap-active | phaste
osdmaptool: osdmap file 'osd_map'
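
To review what the offline run above proposes before touching the cluster, something like the following could be used. It is only a sketch: it assumes that `out.txt` contains one `ceph osd pg-upmap-items ...` or `ceph osd rm-pg-upmap-items ...` command per line, and `summarize_upmap_plan` is a hypothetical helper name, not part of Ceph.

```shell
#!/usr/bin/env bash
# Summarize an osdmaptool --upmap output file without applying anything.
# Assumption: the file holds one ceph command per line, either
# `ceph osd pg-upmap-items ...` or `ceph osd rm-pg-upmap-items ...`.
summarize_upmap_plan() {
    local plan_file="$1"
    local added removed
    # grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`.
    added=$(grep -c '^ceph osd pg-upmap-items ' "$plan_file" || true)
    removed=$(grep -c '^ceph osd rm-pg-upmap-items ' "$plan_file" || true)
    echo "upmap entries to add: ${added}, to remove: ${removed}"
}
```

Calling `summarize_upmap_plan out.txt` only reads the file. The Ceph documentation describes applying such a file with `source out.txt`, which would start the actual PG movements.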

dcaro renamed this task from [ceph] Us upmap PG balancer to [ceph] Use upmap PG balancer. Mar 9 2021, 5:15 PM

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T10:43:59Z] <dcaro> enabled ceph upmap balancer on codfw (T274573,T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T14:43:07Z] <dcaro> enabling ceph upmap pg balancer on eqiad (T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T14:49:37Z] <dcaro> Running the first_pass balancing plan on ceph eqiad, current eval 0.030622 (T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:02:16Z] <dcaro> First pass finished, improved eval to 0.030075 (T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:03:48Z] <dcaro> Executing a second pass, there's still movements to improve the eval of 0.030075 (T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T15:08:43Z] <dcaro> Activating continuous upmap balancer, keeping a close eye (T274573)

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T16:42:54Z] <dcaro> Ceph balancer got the cluster to eval 0.014916, that is 88-77% usage for compute pool, and 28-19% usage for the cinder one \o/ (T274573)
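
For reference, the steps named in the log entries above (evaluate, run a named plan, then switch on the continuous balancer) roughly map onto the standard `ceph balancer` subcommands. The script below is a dry-run sketch, not a record of the exact commands that were run: the `run`, `balance_first_pass` and `enable_continuous_balancer` helpers and the `DRY_RUN` variable are made up for illustration, and whether the min-compat-client step was needed on these clusters is an assumption.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a manual balancing pass followed by the continuous balancer.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

balance_first_pass() {
    local plan="${1:-first_pass}"
    # Assumption: upmap requires all clients to be luminous or newer.
    run ceph osd set-require-min-compat-client luminous
    run ceph balancer mode upmap
    run ceph balancer eval               # current score, lower is better
    run ceph balancer optimize "$plan"   # compute a plan
    run ceph balancer show "$plan"       # inspect the proposed changes
    run ceph balancer eval "$plan"       # expected score after the plan
    run ceph balancer execute "$plan"    # apply the plan
}

enable_continuous_balancer() {
    run ceph balancer on
    run ceph balancer status
}
```

With the default `DRY_RUN=1`, `balance_first_pass first_pass` prints the seven commands prefixed with `+` so they can be reviewed first. A second pass would be the same sequence with a new plan name.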

Closing this as a success :)

ceph_balancer_after.png (396 KB)

pools_usages.png (69 KB)