
Ceph performance tuning
Closed, Resolved · Public

Description

Over the last couple of weeks we've seen some degradation in IO performance on VMs. In particular, IO is now too slow for reliable etcd.

This is the tracking task for investigating the change in performance and optimizing things.
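As background for the "too slow for reliable etcd" observation above: a common way to quantify whether a disk can keep up with etcd's write-ahead log is an fio fdatasync benchmark. The invocation below is an illustrative sketch, not something recorded in this task; the test directory and sizes are placeholders.

    # Benchmark fsync latency the way etcd's WAL exercises it:
    # small writes, each followed by fdatasync. Run it against the
    # disk that backs the etcd data directory.
    fio --name=etcd-fsync-test --directory=/var/lib/etcd/fio-test \
        --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
    # Upstream etcd guidance suggests the reported 99th-percentile
    # fdatasync latency should stay below roughly 10ms.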

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T22:14:45Z] <andrewbogott> setting pg number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305

Mentioned in SAL (#wikimedia-cloud) [2020-12-17T22:16:07Z] <andrewbogott> setting pgp number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305 (same as pg)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T20:46:34Z] <andrewbogott> setting pg and pgp number to 4096 for eqiad1-compute as joachim thinks 8192 might be too much T270305
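For readers unfamiliar with the knobs involved, the SAL entries above correspond roughly to the following ceph CLI invocations. This is a sketch of the likely commands rather than a transcript; pool names and values are taken from the logged messages.

    # Raise placement group counts for the affected pools
    # (pgp_num should follow pg_num so data actually rebalances).
    ceph osd pool set eqiad1-compute pg_num 8192
    ceph osd pool set eqiad1-compute pgp_num 8192
    ceph osd pool set eqiad1-glance-images pg_num 2048
    ceph osd pool set eqiad1-glance-images pgp_num 2048

    # Later walked back to 4096 for eqiad1-compute:
    ceph osd pool set eqiad1-compute pg_num 4096
    ceph osd pool set eqiad1-compute pgp_num 4096

    # Verify the current value for a pool:
    ceph osd pool get eqiad1-compute pg_num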

Things seem moderately better with 4096 PGs; latency numbers seemed even better with 8192. Once we have a few weeks of data at 4096, let's switch back to 8192 for another few weeks so we can compare good data for both settings.

etcd fsync latencies seem more consistent now. Thanks!
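One way to keep an eye on that consistency over time is etcd's own fsync latency histogram. This is an illustrative sketch; the metrics endpoint and port depend on how the etcd members are configured.

    # etcd exposes WAL fsync latency as a Prometheus histogram;
    # scrape it (or spot-check it) to watch for regressions.
    curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds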

The current behavior is pretty good (maybe better than it was with 8192), so we aren't going to mess with success.

A couple of ideas left over from this, for the next round of improvements: