
Improve ceph performance
Closed, Resolved · Public

Description

Do a general look around and see how we can improve the current performance.


Some additional context

To add some specific watermarks worth keeping in mind for the place where we see the most pain around ceph performance (etcd): besides the etcd issues themselves (which can also be influenced by the VMs and the network connection between them), we should generally treat it as a fairly heavily burdened cluster due to the number of namespaces, the number of etcd nodes (more nodes = more demanding) and the amount of traffic on the API. From https://etcd.io/docs/v3.4.0/op-guide/hardware/ we see that a healthy cluster *should* need just 50 sequential IOPS if it doesn't do much. We'd be a lot happier closer to "For heavily loaded clusters, 500 sequential IOPS (e.g., a typical local SSD or a high performance virtualized block device) is recommended.".

Note: our issues often happen on toolsbeta where the cluster is only conducting minimal automated tasks, so IOPS will help, but latency, iowait and throughput may be bigger issues.
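
As a quick check from the etcd side itself (complementary to the fio tests below), etcdctl ships a small built-in benchmark that reports whether the backing storage and network keep up. A minimal sketch, assuming etcd v3 and that endpoints/certificates are passed the same way we usually do for these clusters:

ETCDCTL_API=3 etcdctl check perf --load=s -> runs a short standard workload and gives a pass/fail on throughput and latency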

General objectives to aim toward:

  • 90th percentile of fsync/fdatasync latency is less than 10ms, more or less consistently
  • higher sequential IOPS than we are seeing (how close can we get to 500, or at least close enough for our throttles to matter?). In initial testing ceph was faster than local disk without throttles, and now it is nowhere near that. We may have changed VM processor settings and the like since then to help enable live migration (worth checking there if raw ceph performance just doesn't line up with VM disk performance).
  • iowait is not constantly at whole numbers on etcd nodes
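
For the iowait objective, a quick way to watch it on the etcd nodes (sar comes from the sysstat package, same as iostat):

vmstat 1 -> the "wa" column should hover around 0 instead of sitting at whole numbers
sar -u 1 10 -> per-second %iowait over 10 samples, to tell spikes from a constant baseline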

Test used to check the performance:

fio --name=fio-seqrw --bs=64k --direct=1 --filename=/tmp/fio.seqrw --fsync=256 --gtod_reduce=1 --iodepth=64 --ioengine=libaio --rw=rw --size=5G --group_reporting

for fsync:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/test-data --size=22m --bs=2300 --name=mytest
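
To check the "90th percentile of fsync/fdatasync under 10ms" objective directly, the same job can be run with JSON output and the sync latency percentiles extracted. A sketch, assuming a reasonably recent fio (the JSON field layout changes a bit between versions) and jq being available; percentiles come out in nanoseconds, so 10ms = 10000000:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/test-data --size=22m --bs=2300 --name=mytest --output-format=json | jq '.jobs[0].sync.lat_ns.percentile'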

Storage performance test results

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T17:14:15Z] <dcaro> Changed osd memory limit from 4G to 8G (T273649)

Had a chat with Joachim; we agreed to increase the memory limit for the OSDs to 8G (was 4G), as they have more than enough RAM and CPU. Used the command:
ceph config set global osd_memory_target 8589934592
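
To double-check that the new limit is actually effective (and not overridden at a more specific level), something like:

ceph config get osd.0 osd_memory_target -> effective value for a given osd (should be 8589934592 now)
ceph config dump | grep osd_memory_target -> shows at which level(s) the option is set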

Some other useful commands:
ceph tell osd.<id> bench -> for doing a quick non-destructive benchmark
ceph features -> to see what versions and features the clients are using, for backwards compatibility
cpufreq-info -> to check (and other cpufreq-* commands to set) the cpu throttling
ceph crash * -> to manage, see details of crashes
iostat -x 1
rbd perf image iotop -> AWESOME to check the io usage for each rbd image
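
For reference, roughly how the first and last ones get invoked (osd id and pool name are just placeholders):

ceph tell osd.112 bench 1073741824 4194304 -> write 1GiB in 4MiB blocks (the defaults, more or less), non-destructive
rbd perf image iotop --pool <pool> -> top-like per-image io view; iirc it needs the rbd_support mgr module enabled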

Some possible followups:

  • Upgrade to octopus (v15)
  • Tweak disk queue scheduler (/sys/block/sdc/queue/scheduler, currently mq-deadline)
  • Investigate possible network bottlenecks
  • Use upmap to balance the pgs (see the sketch below)

All these are also recommended even if not having performance issues.
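
For the upmap item, a rough sketch of what enabling the balancer would look like; upmap requires all clients to be luminous or newer, which is what the ceph features check above helps verify:

ceph osd set-require-min-compat-client luminous -> refuse older clients, prerequisite for upmap
ceph balancer mode upmap
ceph balancer on
ceph balancer status -> see what it is doing/planning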

dcaro triaged this task as High priority. Feb 3 2021, 3:39 PM

Today's meeting minutes:

Adding more RAM seemed to reduce the number of spikes as tested from the VM
tools-k8s-etcd-4, though there's no clear improvement on the etcd health graph
or in the ceph tell osd.112 bench tests, as those seem quite flaky.

Still seeing some spikes on iostat -x 1, though less than before.

Next things to try are:

  • Change the kernel io scheduler to none/noop, as the current scheduler (mq-deadline) gives preference to read requests (though in IOPS we have almost twice as many writes as reads); see the sketch after this list.
  • Upgrade the kernel to 5.x on the osds
  • Upgrade ceph to 15.2.8
  • Use a specialized pool for low latency (and tweak those specific osds)
  • Disable logging by default, enabling it only on demand (only in cases of very extreme performance need).
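
For the scheduler change (first item above), a sketch of what it looks like on an osd host; the device name is a placeholder and we would want to manage this from puppet rather than by hand:

cat /sys/block/sdc/queue/scheduler -> lists the available schedulers, the active one in brackets
echo none > /sys/block/sdc/queue/scheduler -> switches at runtime, not persistent across reboots (a udev rule or kernel cmdline option would be needed for that)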

Some other side things that we talked about

  • rbd mirroring for multisite is not recommended, as it degrades performance considerably (at least 50%) and adds extra complexity.
  • Recommended strategy -> do it at the app level... and if not possible, use rbd export with differential exports instead (see the sketch after this list).
  • Other maybe possible strategies:
    • Create a mirrored pool only for HA vms (still having performance issues)
    • Create snapshots of the images on a mirrored pool from the non-mirrored one (interesting)
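
For the differential export approach, the rough shape of it with rbd (pool/image/snapshot names are placeholders):

rbd snap create <pool>/<image>@snap1
rbd export-diff <pool>/<image>@snap1 image-snap1.diff -> export everything up to snap1
rbd snap create <pool>/<image>@snap2
rbd export-diff --from-snap snap1 <pool>/<image>@snap2 image-delta.diff -> only the changes between the two snapshots
rbd import-diff image-delta.diff <destination-pool>/<image> -> replay on the destination, which must already have snap1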

@Bstorm @aborrero ^ that's what was in the meeting notes; we did not go in depth, but it seems that he has had lots of issues with production setups and mirroring.

After the change I made to the driver in T275430, it's possible that a full set of new etcd VMs might perform ok. My tests showed very consistently better performance so far (both in seq IOPS and fsync).

Change 672670 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] ceph-common: add fio for testing and debugging

https://gerrit.wikimedia.org/r/672670

Change 672670 merged by David Caro:
[operations/puppet@production] ceph-common: add fio for testing and debugging

https://gerrit.wikimedia.org/r/672670

So far it's not very clear whether there's an improvement between the new machines using the new cinder setup and the old ones, but maybe? (see attached files)

Closing as this seems to have lost relevance; will open a new task with new goals when needed.