
Improve ceph performance
Closed, Resolved · Public

Description

Do a general look around and see how we can improve the current performance.


Some additional context

To add some specific watermarks worth keeping in mind for the place where we see the most pain around ceph performance (etcd): besides the etcd issues themselves (which can also be influenced by the VMs and the network connection between them), we should generally treat it as a fairly heavily burdened cluster due to the number of namespaces, the number of etcd nodes (more nodes = more demanding) and the amount of traffic on the API. From https://etcd.io/docs/v3.4.0/op-guide/hardware/ we see that a healthy cluster *should* need just 50 sequential IOPS if it doesn't do much. We'd be a lot happier closer to "For heavily loaded clusters, 500 sequential IOPS (e.g., a typical local SSD or a high performance virtualized block device) is recommended.".

Note: our issues often happen on toolsbeta where the cluster is only conducting minimal automated tasks, so IOPS will help, but latency, iowait and throughput may be bigger issues.
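
As a quick check from the etcd side itself (complementary to the fio tests below), etcdctl ships a small built-in benchmark that reports whether the backing storage and network keep up. A minimal sketch, assuming etcd v3 and that endpoints/certificates are passed the same way we usually do for these clusters:

ETCDCTL_API=3 etcdctl check perf --load=s -> runs a short standard workload and gives a pass/fail on throughput and latency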

General objectives to aim toward:

  • 90th percentile of fsync/fdatasync latency is less than 10ms, more or less consistently
  • higher sequential IOPS than we are seeing (how close can we get to 500, or at least close enough for our throttles to matter?). In initial testing ceph was faster than local disk without throttles, and now it is nowhere near that. We may have changed VM processor settings and the like since then to help enable live migration (worth checking there if raw ceph performance just doesn't line up with VM disk performance).
  • iowait is not constantly at whole numbers on etcd nodes
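
For the iowait objective, a quick way to watch it on the etcd nodes (sar comes from the sysstat package, same as iostat):

vmstat 1 -> the "wa" column should hover around 0 instead of sitting at whole numbers
sar -u 1 10 -> per-second %iowait over 10 samples, to tell spikes from a constant baseline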

Test used to check the performance:

fio --name=fio-seqrw --bs=64k --direct=1 --filename=/tmp/fio.seqrw --fsync=256 --gtod_reduce=1 --iodepth=64 --ioengine=libaio --rw=rw --size=5G --group_reporting

for fsync:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/test-data --size=22m --bs=2300 --name=mytest
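
To check the "90th percentile of fsync/fdatasync under 10ms" objective directly, the same job can be run with JSON output and the sync latency percentiles extracted. A sketch, assuming a reasonably recent fio (the JSON field layout changes a bit between versions) and jq being available; percentiles come out in nanoseconds, so 10ms = 10000000:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/test-data --size=22m --bs=2300 --name=mytest --output-format=json | jq '.jobs[0].sync.lat_ns.percentile'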

Storage performance test results

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T17:14:15Z] <dcaro> Changed osd memory limit from 4G to 8G (T273649)

Had a chat with Joachim; we agreed to increase the memory limit for the OSDs to 8G (was 4G), as they have more than enough RAM and CPU. Used the command:
ceph config set global osd_memory_target 8589934592
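
To double-check that the new limit is actually effective (and not overridden at a more specific level), something like:

ceph config get osd.0 osd_memory_target -> effective value for a given osd (should be 8589934592 now)
ceph config dump | grep osd_memory_target -> shows at which level(s) the option is set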

Some other useful commands:
ceph tell osd.<id> bench -> for doing a quick non-destructive benchmark
ceph features -> to see what versions and features the clients are using, for backwards compatibility
cpufreq-info -> to check (and other cpufreq-* commands to set) the cpu throttling
ceph crash * -> to manage, see details of crashes
iostat -x 1
rbd perf image iotop -> AWESOME to check the io usage for each rbd image
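
For reference, roughly how the first and last ones get invoked (osd id and pool name are just placeholders):

ceph tell osd.112 bench 1073741824 4194304 -> write 1GiB in 4MiB blocks (the defaults, more or less), non-destructive
rbd perf image iotop --pool <pool> -> top-like per-image io view; iirc it needs the rbd_support mgr module enabled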

Some possible followups:

  • Upgrade to octopus (v15)
  • Tweak disk queue scheduler (/sys/block/sdc/queue/scheduler, currently mq-deadline)
  • Investigate possible network bottlenecks
  • Use upmap to balance the pgs (see the sketch below)

All these are also recommended even if not having performance issues.
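
For the upmap item, a rough sketch of what enabling the balancer would look like; upmap requires all clients to be luminous or newer, which is what the ceph features check above helps verify:

ceph osd set-require-min-compat-client luminous -> refuse older clients, prerequisite for upmap
ceph balancer mode upmap
ceph balancer on
ceph balancer status -> see what it is doing/planning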

dcaro triaged this task as High priority. Feb 3 2021, 3:39 PM

Today's meeting minutes:

Adding more RAM seemed to reduce the number of spikes as tested from the VM
tools-k8s-etcd-4, though there's no clear improvement on the etcd health graph
or in the ceph tell osd.112 bench tests, as those seem quite flaky.

Still seeing some spikes on iostat -x 1, though less than before.

Next things to try are:

  • Change the kernel io scheduler to none/noop, as the current scheduler (mq-deadline) gives preference to read requests (though in IOPS we have almost twice as many writes as reads); see the sketch after this list.
  • Upgrade the kernel to 5.x on the osds
  • Upgrade ceph to 15.2.8
  • Use a specialized pool for low latency (and tweak those specific osds)
  • Disable logging by default, enabling it only on demand (only in cases of very extreme performance need).
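
For the scheduler change (first item above), a sketch of what it looks like on an osd host; the device name is a placeholder and we would want to manage this from puppet rather than by hand:

cat /sys/block/sdc/queue/scheduler -> lists the available schedulers, the active one in brackets
echo none > /sys/block/sdc/queue/scheduler -> switches at runtime, not persistent across reboots (a udev rule or kernel cmdline option would be needed for that)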

Some other side things that we talked about

  • rbd mirroring for multisite is not recommended, as it degrades performance considerably (at least 50%) and adds extra complexity.
  • Recommended strategy -> do it at the app level... and if not possible, use rbd export with differential exports instead (see the sketch after this list).
  • Other maybe possible strategies:
    • Create a mirrored pool only for HA vms (still having performance issues)
    • Create snapshots of the images on a mirrored pool from the non-mirrored one (interesting)
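
For the differential export approach, the rough shape of it with rbd (pool/image/snapshot names are placeholders):

rbd snap create <pool>/<image>@snap1
rbd export-diff <pool>/<image>@snap1 image-snap1.diff -> export everything up to snap1
rbd snap create <pool>/<image>@snap2
rbd export-diff --from-snap snap1 <pool>/<image>@snap2 image-delta.diff -> only the changes between the two snapshots
rbd import-diff image-delta.diff <destination-pool>/<image> -> replay on the destination, which must already have snap1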

@Bstorm @aborrero ^ that's what was in the meeting notes; we did not go in depth, but it seems that he has had lots of issues with production setups and mirroring.

After the change I made to the driver in T275430, it's possible that a full set of new etcd VMs might perform ok. My tests showed very consistently better performance so far (both in seq IOPS and fsync).

Change 672670 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] ceph-common: add fio for testing and debugging

https://gerrit.wikimedia.org/r/672670

Change 672670 merged by David Caro:
[operations/puppet@production] ceph-common: add fio for testing and debugging

https://gerrit.wikimedia.org/r/672670

So far it's not very clear whether there's an improvement between the new machines using the new cinder setup and the old ones, but maybe? (see attached files)

Closing as this seems to have lost relevance; will open a new task with new goals when needed.