
Ceph performance tuning
Closed, Resolved · Public

Description

Over the last couple of weeks we've seen some degradation in IO performance on VMs. In particular, IO is now too slow for reliable etcd.

This is the tracking task for investigating the change in performance and optimizing things.
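As background for the "too slow for reliable etcd" observation above: a common way to quantify whether a disk can keep up with etcd's write-ahead log is an fio fdatasync benchmark. The invocation below is an illustrative sketch, not something recorded in this task; the test directory and sizes are placeholders.

    # Benchmark fsync latency the way etcd's WAL exercises it:
    # small writes, each followed by fdatasync. Run it against the
    # disk that backs the etcd data directory.
    fio --name=etcd-fsync-test --directory=/var/lib/etcd/fio-test \
        --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
    # Upstream etcd guidance suggests the reported 99th-percentile
    # fdatasync latency should stay below roughly 10ms.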

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T22:14:45Z] <andrewbogott> setting pg number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T22:16:07Z] <andrewbogott> setting pgp number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305 (same as pg)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T20:46:34Z] <andrewbogott> setting pg and pgp number to 4096 for eqiad1-compute as joachim thinks 8192 might be too much T270305
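For readers unfamiliar with the knobs involved, the SAL entries above correspond roughly to the following ceph CLI invocations. This is a sketch of the likely commands rather than a transcript; pool names and values are taken from the logged messages.

    # Raise placement group counts for the affected pools
    # (pgp_num should follow pg_num so data actually rebalances).
    ceph osd pool set eqiad1-compute pg_num 8192
    ceph osd pool set eqiad1-compute pgp_num 8192
    ceph osd pool set eqiad1-glance-images pg_num 2048
    ceph osd pool set eqiad1-glance-images pgp_num 2048

    # Later walked back to 4096 for eqiad1-compute:
    ceph osd pool set eqiad1-compute pg_num 4096
    ceph osd pool set eqiad1-compute pgp_num 4096

    # Verify the current value for a pool:
    ceph osd pool get eqiad1-compute pg_num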

Things seem moderately better with 4096 PGs; latency numbers seemed even better with 8192. Once we have a few weeks of data at 4096, let's switch back to 8192 for another few weeks so we can compare good data for both settings.

etcd fsync latencies seem more consistent now. Thanks!
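One way to keep an eye on that consistency over time is etcd's own fsync latency histogram. This is an illustrative sketch; the metrics endpoint and port depend on how the etcd members are configured.

    # etcd exposes WAL fsync latency as a Prometheus histogram;
    # scrape it (or spot-check it) to watch for regressions.
    curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds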

The current behavior is pretty good (maybe better than it was with 8192), so we aren't going to mess with success.

A couple of ideas left over from this, for the next round of improvements: