
Find a sustainable local storage solution for cloud-vps
Closed, Resolved · Public

Description

So far the only file system fast enough to properly support etcd is hypervisor-local storage. Each of our etcd clusters is distributed among these three cloudvirts:

- cloudvirt1019
- cloudvirt1020
- cloudvirt1028

Cloudvirt1019 and cloudvirt1020 are (I hope!) right on the verge of decommission. Cloudvirt1028 is scheduled for decom in 2023. Indeed, by the end of 2023 we don't expect to have any of the old 'fatvirt' hosts in service, just ceph-backed 'thinvirts'.

Unless someone scares up a miracle with ceph, we're going to need a new plan for running etcd on cloud-vps. The good news is that etcd nodes require very little space -- the toolforge nodes are running on 20GB volumes with only 20% in use. We may or may not want to support other use cases that require local storage performance.

Here are some ideas:

  1. Dedicate three existing thinvirt hypervisors to the cause

This is the simplest option, and also the least efficient (given that we currently run a grand total of 6 etcd VMs)

  2. Buy or repurpose three special-purpose mini-sized servers for the cause

Slightly more efficient than #1 in terms of HW cost, but quite a bit more trouble for everyone involved

  3. Buy three special-purpose fatvirt hypervisors

This is the same as #1 except it supports expanded future local storage uses.

  4. Figure out how to support both local and ceph VM images on a given hypervisor

4a) ... and configure three cloudvirts for mixed use
4b) ... and configure all cloudvirts for mixed use

I don't immediately know how to do this but it may be possible! 4a is a technically worse solution (because it requires host-specific config) but it would have the advantage that we'd be able to remember which cloudvirts are special and not to be rebooted at the same time.

  5. Set up a cinder pool that uses local storage, mount and use local cinder volumes on etcd nodes.

This feels like the most elegant solution, and would also support future expansion. It would mean that any HV could be drained and rebooted at any time without risk of causing an etcd outage. We could also safely permit users to opt in to local storage for select cinder volumes if they accept the trade-off of performance vs. uptime.

On the other hand, we'd still cause an outage when rebooting the storage server itself, and even if we have three different backend storage hosts it might be hard to keep track of what's scheduled where in order to avoid killing the whole cluster at once.

I'm currently leaning towards #4b although both 4 and 5 could do with a bit more research. #1 would be very easy to implement but I'm not sure I can stand to underuse hardware like that.

Event Timeline

I've done some research!

It's not clear to me that 4b is actually an option. All examples I see of mixed-type storage on hypervisors involve different hardware doing the different types of storage.

Option 5, on the other hand (local storage on cinder), seems to work fine, as long as I split cinder-volume.conf out from cinder.conf so that the scheduler can know about volume drivers that aren't present on the same host as the scheduler. I've done a small proof-of-concept of this on codfw1dev.
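
For reference, here's a minimal sketch of what the per-host cinder-volume.conf could look like with an LVM backend (the backend name, volume group, and target helper below are illustrative assumptions, not the exact values from the proof-of-concept):

[DEFAULT]
# Backends defined here are reported to the scheduler over RPC, so the
# scheduler host's cinder.conf no longer needs to know about drivers
# that only exist on this cloudvirt.
enabled_backends = lvm-local

[lvm-local]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = cinder-volumes
target_helper = tgtadm
volume_backend_name = lvm-local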

That leaves the question of hardware. If we think that local cinder storage is something we want to support for all users, and at scale, we'll need to acquire some (presumably at least three) dedicated servers with big drive arrays.

For now, though, I propose a different, smaller solution:

  • local storage will be used ONLY for etcd, and only available for use by admins or by special request.
  • creation of local storage volumes will be limited to projects that need etcd (which, conveniently, are also only admin-managed projects)
  • that local storage will be assigned to the free (mirrored ssd) space on existing cloudvirts
  • I'll create three different local-storage pools, named etcd1, etcd2, and etcd3, with a different cloudvirt (or group of cloudvirts) assigned to each pool.

Users (or cookbook-assisted users) will need to explicitly assign each etcd host in a cluster to a different pool, but after that we'll be able to reboot any one cloudvirt without knocking out any etcd clusters.
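
For example, the volume-type wiring might look roughly like this (assuming each pool maps to a matching volume_backend_name; names and sizes here are illustrative, not the exact production values):

# one volume type per local-storage pool
openstack volume type create etcd1 --property volume_backend_name=etcd1
openstack volume type create etcd2 --property volume_backend_name=etcd2
openstack volume type create etcd3 --property volume_backend_name=etcd3

# each etcd member gets its data volume from a different pool, so a
# single cloudvirt reboot can only take down one member
openstack volume create --type etcd1 --size 20 my-etcd-1-data
openstack server add volume my-etcd-1 my-etcd-1-data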

This isn't as clean as having dedicated hardware, but it's something we can implement immediately. If we later decide we have the money for a dedicated trio of storage servers, all the better, but the puppet work to do the above will be mostly re-usable on those hosts.

I spoke a bit too soon: I can't get VMs to attach properly to the new lvm devices. Still working on this in codfw1dev.

OK, now I have it working -- going to work on puppetizing.

@fnegri suggests that I use generic volume-type names 'local1, local2, local3' and that I give these cinder-hosting hosts a different name like cloudvirtblock or something.

Change 886934 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Split cinder-volume.conf out from cinder.conf

https://gerrit.wikimedia.org/r/886934

> @fnegri suggests that I use generic volume-type names 'local1, local2, local3' and that I give these cinder-hosting hosts a different name like cloudvirtblock or something.

yep, I'd even be more verbose if possible, like 'localstorage1' or similar.

About the host names, those are for/if we have dedicated hosts right? If so ok from me (though maybe something like cloudvirtlocal or similar would be better)

Otherwise I'd keep the host names generic, and instead try to make it easier to see what aggregates they are in, maybe in a shell prompt or similar (is there a way to have that in puppet? maybe a cumin query can pick it up?)
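
For what it's worth, aggregate membership is also easy enough to check from the CLI (the aggregate name below is illustrative, not an existing one):

openstack aggregate list
openstack aggregate show localstorage1 -c hosts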

> About the host names, those are for/if we have dedicated hosts right? If so ok from me (though maybe something like cloudvirtlocal or similar would be better)

My thinking was that even if those hosts contained other workloads and not just etcd, it would be nice to immediately spot that they have something "special". Puppet or shell prompt are also valid options, but the hostname makes it even more obvious (even when having a quick look in Netbox) that those hosts need to be treated more carefully than the other cloudvirts.

If they should be dedicated or not... I'm not sure.

Change 887320 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Move cinder-volume into its own class and profile

https://gerrit.wikimedia.org/r/887320

>> About the host names, those are for/if we have dedicated hosts right? If so ok from me (though maybe something like cloudvirtlocal or similar would be better)
>
> My thinking was that even if those hosts contained other workloads and not just etcd, it would be nice to immediately spot that they have something "special". Puppet or shell prompt are also valid options, but the hostname makes it even more obvious (even when having a quick look in Netbox) that those hosts need to be treated more carefully than the other cloudvirts.
>
> If they should be dedicated or not... I'm not sure.

ack, my concern there is mostly that it's "hard" to change the hostname, but as I understand we might want to move around/expand/swap/replace the hosts that provide the local storage too (maybe even add local storage to all the hosts?), so having to rename them every time we move the service to or from them seems impractical (with our current renaming process at least xd), which brings me to the question.

@Andrew how many hosts are you proposing to use? (it seems at least 3, but maybe more?)

Change 886934 merged by Andrew Bogott:

[operations/puppet@production] Split cinder-volume.conf out from cinder.conf

https://gerrit.wikimedia.org/r/886934

Change 887368 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder-volume.conf: include common oslo-messaging-rabbit section

https://gerrit.wikimedia.org/r/887368

Change 887368 merged by Andrew Bogott:

[operations/puppet@production] cinder-volume.conf: include common oslo-messaging-rabbit section

https://gerrit.wikimedia.org/r/887368

I'm thinking this extra service will be added to either three or six cloudvirts -- I'm not sure we get any more stability from six than from three, at the moment. We could add it to all of them just for consistency. It depends on whether we want the rule to be:

  • Don't reboot more than one of (this special set of hosts) at a time, or
  • Don't reboot more than one cloudvirt at a time

or, I guess

  • Don't reboot more than one of any three consecutively-numbered hosts at the same time

Regardless it's going to be cumbersome.

Change 887320 merged by Andrew Bogott:

[operations/puppet@production] Move cinder-volume into its own class and profile

https://gerrit.wikimedia.org/r/887320

Change 887390 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add a new role for cloudvirt nodes with a cinder/lvm client.

https://gerrit.wikimedia.org/r/887390

Change 887390 merged by Andrew Bogott:

[operations/puppet@production] Add a new role for cloudvirt nodes with a cinder/lvm client.

https://gerrit.wikimedia.org/r/887390

Change 887395 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add cinder-volume nodes to cinder grant and fw rules.

https://gerrit.wikimedia.org/r/887395

Change 887395 merged by Andrew Bogott:

[operations/puppet@production] Add cinder-volume nodes to cinder grant and fw rules.

https://gerrit.wikimedia.org/r/887395

Change 887403 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder-volume: make lvm volume group configurable via hiera

https://gerrit.wikimedia.org/r/887403

Change 887411 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cinder-volume lvm: clarify backend type/name confusion

https://gerrit.wikimedia.org/r/887411

Change 887403 merged by Andrew Bogott:

[operations/puppet@production] cinder-volume: make lvm volume group configurable via hiera

https://gerrit.wikimedia.org/r/887403

Change 887411 merged by Andrew Bogott:

[operations/puppet@production] Cinder-volume lvm: clarify backend type/name confusion

https://gerrit.wikimedia.org/r/887411

> I'm thinking this extra service will be added to either three or six cloudvirts -- I'm not sure we get any more stability from six than from three, at the moment. We could add it to all of them just for consistency. It depends on whether we want the rule to be:
>
>   • Don't reboot more than one of (this special set of hosts) at a time, or
>   • Don't reboot more than one cloudvirt at a time
>
> or, I guess
>
>   • Don't reboot more than one of any three consecutively-numbered hosts at the same time
>
> Regardless it's going to be cumbersome.

After speaking with Andrew, it sounds like having dedicated hardware is the option that avoids this complexity. A request for this has been filed.

nskaggs mentioned this in Unknown Object (Task). Feb 7 2023, 10:35 PM

Change 887439 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder-volume.conf.erb: remove an erb typo

https://gerrit.wikimedia.org/r/887439

Change 887439 merged by Andrew Bogott:

[operations/puppet@production] cinder-volume.conf.erb: remove an erb typo

https://gerrit.wikimedia.org/r/887439

Change 887441 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder::volume: include 'tgt' package on hosts

https://gerrit.wikimedia.org/r/887441

Change 887441 merged by Andrew Bogott:

[operations/puppet@production] cinder::volume: include 'tgt' package on hosts

https://gerrit.wikimedia.org/r/887441

This is now set up (using cloudvirts) in codfw1dev. It won't take much to get it working in eqiad1, but we may do it on dedicated storage hosts rather than on cloudvirts. See T329126

Some quick performance tests:

local-etcd-1 (LVM drive):

root@local-etcd-1:~# fio --name=test_write --directory=/srv/fio_test --ioengine=libaio --rw=randwrite --size=500M
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=15.5MiB/s][w=3972 IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=36010: Wed Feb  8 17:36:37 2023
  write: IOPS=5165, BW=20.2MiB/s (21.2MB/s)(500MiB/24778msec); 0 zone resets
    slat (usec): min=4, max=42011, avg=184.68, stdev=1626.54
    clat (nsec): min=1314, max=986911, avg=4883.03, stdev=9689.71
     lat (usec): min=5, max=42024, avg=190.41, stdev=1627.47
    clat percentiles (nsec):
     |  1.00th=[  1544],  5.00th=[  1560], 10.00th=[  1592], 20.00th=[  1944],
     | 30.00th=[  4768], 40.00th=[  4832], 50.00th=[  4960], 60.00th=[  5024],
     | 70.00th=[  5088], 80.00th=[  5856], 90.00th=[  6496], 95.00th=[  6880],
     | 99.00th=[ 10560], 99.50th=[ 21632], 99.90th=[ 66048], 99.95th=[ 92672],
     | 99.99th=[585728]
   bw (  KiB/s): min= 8704, max=244312, per=100.00%, avg=20746.16, stdev=32762.14, samples=49
   iops        : min= 2176, max=61078, avg=5186.51, stdev=8190.54, samples=49
  lat (usec)   : 2=22.14%, 4=2.30%, 10=74.30%, 20=0.72%, 50=0.40%
  lat (usec)   : 100=0.10%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=3.85%, sys=12.34%, ctx=1908, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=500MiB (524MB), run=24778-24778msec

Disk stats (read/write):
  sdb: ios=0/35576, merge=0/21, ticks=0/620677, in_queue=622187, util=93.15%

ceph-etcd-1 (throttled Ceph):

root@ceph-etcd-1:~# fio --name=test_write --directory=/srv/fio_test --ioengine=libaio --rw=randwrite --size=500M
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=5376KiB/s][w=1344 IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=5042: Wed Feb  8 17:37:37 2023
  write: IOPS=1702, BW=6810KiB/s (6973kB/s)(500MiB/75184msec); 0 zone resets
    slat (usec): min=3, max=146353, avg=579.66, stdev=3988.52
    clat (nsec): min=1180, max=10546k, avg=4323.99, stdev=29718.27
     lat (usec): min=5, max=146366, avg=584.68, stdev=3989.70
    clat percentiles (nsec):
     |  1.00th=[  1224],  5.00th=[  1272], 10.00th=[  1320], 20.00th=[  1560],
     | 30.00th=[  4576], 40.00th=[  4640], 50.00th=[  4704], 60.00th=[  4768],
     | 70.00th=[  4832], 80.00th=[  4896], 90.00th=[  5024], 95.00th=[  5600],
     | 99.00th=[ 10432], 99.50th=[ 24192], 99.90th=[ 38144], 99.95th=[ 48896],
     | 99.99th=[100864]
   bw (  KiB/s): min= 3584, max=239592, per=100.00%, avg=6820.25, stdev=19192.51, samples=150
   iops        : min=  896, max=59898, avg=1705.06, stdev=4798.13, samples=150
  lat (usec)   : 2=25.47%, 4=0.95%, 10=72.34%, 20=0.70%, 50=0.50%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (msec)   : 20=0.01%
  cpu          : usr=1.05%, sys=3.22%, ctx=3177, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=6810KiB/s (6973kB/s), 6810KiB/s-6810KiB/s (6973kB/s-6973kB/s), io=500MiB (524MB), run=75184-75184msec

Disk stats (read/write):
  sdb: ios=0/37499, merge=0/73, ticks=0/1825687, in_queue=1825751, util=99.68%

ceph-etcd-2 (unthrottled Ceph):

root@ceph-etcd-2:~# fio --name=test_write --directory=/srv/fio_test --ioengine=libaio --rw=randwrite --size=500M
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
test_write: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=80.1MiB/s][w=20.5k IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=4130: Wed Feb  8 17:46:30 2023
  write: IOPS=30.6k, BW=120MiB/s (125MB/s)(500MiB/4179msec); 0 zone resets
    slat (usec): min=3, max=18325, avg=26.84, stdev=396.11
    clat (nsec): min=1335, max=3798.0k, avg=3429.30, stdev=22629.20
     lat (usec): min=5, max=18340, avg=30.89, stdev=397.37
    clat percentiles (nsec):
     |  1.00th=[   1464],  5.00th=[   1592], 10.00th=[   1608],
     | 20.00th=[   1624], 30.00th=[   1656], 40.00th=[   1928],
     | 50.00th=[   2160], 60.00th=[   2384], 70.00th=[   2832],
     | 80.00th=[   3536], 90.00th=[   4896], 95.00th=[   6048],
     | 99.00th=[  12096], 99.50th=[  25472], 99.90th=[ 142336],
     | 99.95th=[ 391168], 99.99th=[1073152]
   bw (  KiB/s): min=79528, max=331536, per=100.00%, avg=123532.00, stdev=84897.99, samples=8
   iops        : min=19882, max=82884, avg=30883.00, stdev=21224.50, samples=8
  lat (usec)   : 2=40.91%, 4=43.11%, 10=14.93%, 20=0.41%, 50=0.36%
  lat (usec)   : 100=0.13%, 250=0.08%, 500=0.03%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=15.15%, sys=31.28%, ctx=7581, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=120MiB/s (125MB/s), 120MiB/s-120MiB/s (125MB/s-125MB/s), io=500MiB (524MB), run=4179-4179msec

Disk stats (read/write):
  sdb: ios=0/36635, merge=0/17, ticks=0/128918, in_queue=128918, util=93.59%

I have rebuilt a test host with no throttles on either cinder or the VM. It is much faster than fnegri's tests but still slightly slower than ceph. This may be because we're testing the wrong thing, or because codfw1dev is an odd use case (very low traffic) OR it might be that the lvm backend is worse than ceph.

If lvm is worse, then we're back to needing dedicated hypervisors with locally-stored VMs. Either way I think we want the proposed three servers for eqiad1 -- we can decide to use them as cinder nodes or as HVs.

The same run on a current etcd node backed by local disk (not using iscsi):

root@tools-k8s-etcd-15:~# fio --name=test_write --directory=/tmp/fio_test --ioengine=libaio --rw=randwrite --size=500M
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
test_write: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=71.1MiB/s][w=18.2k IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=11782: Thu Feb  9 13:47:25 2023
  write: IOPS=24.8k, BW=96.9MiB/s (102MB/s)(500MiB/5161msec); 0 zone resets
    slat (usec): min=3, max=11171, avg=29.67, stdev=371.83
    clat (nsec): min=1162, max=4406.2k, avg=3552.86, stdev=34752.36
     lat (usec): min=5, max=11176, avg=34.98, stdev=374.74
    clat percentiles (nsec):
     |  1.00th=[   1336],  5.00th=[   1384], 10.00th=[   1416],
     | 20.00th=[   1512], 30.00th=[   1656], 40.00th=[   1704],
     | 50.00th=[   1768], 60.00th=[   1944], 70.00th=[   3088],
     | 80.00th=[   3280], 90.00th=[   3472], 95.00th=[   3856],
     | 99.00th=[  19840], 99.50th=[  52480], 99.90th=[ 195584],
     | 99.95th=[ 395264], 99.99th=[1417216]
   bw (  KiB/s): min=42240, max=216656, per=99.90%, avg=99102.40, stdev=69928.87, samples=10
   iops        : min=10560, max=54164, avg=24775.60, stdev=17482.22, samples=10
  lat (usec)   : 2=60.61%, 4=34.84%, 10=2.51%, 20=1.04%, 50=0.48%
  lat (usec)   : 100=0.31%, 250=0.13%, 500=0.05%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=17.05%, sys=28.06%, ctx=7947, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=96.9MiB/s (102MB/s), 96.9MiB/s-96.9MiB/s (102MB/s-102MB/s), io=500MiB (524MB), run=5161-5161msec

Disk stats (read/write):
  sda: ios=0/29018, merge=0/184, ticks=0/15547, in_queue=16368, util=98.91%

Here are the current testing hosts in codfw1dev. They should have all throttles removed (both on the VM itself and on the mounted cinder volume).

local-ceph-1a.testlabs.codfw1dev.wikimedia.cloud <- uses an unthrottled ceph volume under /srv

local-etcd-1a.testlabs.codfw1dev.wikimedia.cloud <- uses an unthrottled lvm volume under /srv which is hosted on cloudvirt2001-dev, an older server with possibly worse performance

local-etcd-2a.testlabs.codfw1dev.wikimedia.cloud <- uses an unthrottled lvm volume under /srv

local-etcd-3a.testlabs.codfw1dev.wikimedia.cloud <- uses an unthrottled lvm volume under /srv which is hosted on the same HV as the vm

Using direct=1 on the current local etcd nodes (not using iscsi):

root@tools-k8s-etcd-15:~# fio --name=test_write --directory=/tmp/fio_test --ioengine=libaio --rw=randwrite --size=500M --direct=1
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=11.6MiB/s][w=2973 IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=12032: Thu Feb  9 13:57:26 2023
  write: IOPS=3204, BW=12.5MiB/s (13.1MB/s)(500MiB/39946msec); 0 zone resets
    slat (usec): min=9, max=11957, avg=32.19, stdev=70.38
    clat (usec): min=3, max=12327, avg=269.17, stdev=197.90
     lat (usec): min=144, max=12359, avg=303.68, stdev=208.56
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  174], 10.00th=[  186], 20.00th=[  206],
     | 30.00th=[  221], 40.00th=[  233], 50.00th=[  247], 60.00th=[  260],
     | 70.00th=[  269], 80.00th=[  281], 90.00th=[  318], 95.00th=[  437],
     | 99.00th=[  857], 99.50th=[ 1106], 99.90th=[ 2573], 99.95th=[ 3687],
     | 99.99th=[ 7570]
   bw (  KiB/s): min=10744, max=15792, per=100.00%, avg=12825.89, stdev=1061.16, samples=79
   iops        : min= 2686, max= 3948, avg=3206.47, stdev=265.29, samples=79
  lat (usec)   : 4=0.02%, 10=0.41%, 20=0.02%, 50=0.04%, 100=0.06%
  lat (usec)   : 250=52.37%, 500=43.12%, 750=2.42%, 1000=0.89%
  lat (msec)   : 2=0.52%, 4=0.10%, 10=0.04%, 20=0.01%
  cpu          : usr=4.41%, sys=12.11%, ctx=137555, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=12.5MiB/s (13.1MB/s), 12.5MiB/s-12.5MiB/s (13.1MB/s-13.1MB/s), io=500MiB (524MB), run=39946-39946msec

Disk stats (read/write):
  sda: ios=91/131435, merge=0/681, ticks=41/34646, in_queue=41680, util=99.90%

Using direct=1 on local-ceph-1a, latency is also about 10x higher than without direct=1 (both runs below for comparison):

root@local-ceph-1a:~# fio --name=test_write --directory=/srv/fio_test --ioengine=libaio --rw=randwrite --size=500M
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
test_write: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1)
test_write: (groupid=0, jobs=1): err= 0: pid=49237: Thu Feb  9 14:04:00 2023
  write: IOPS=62.4k, BW=244MiB/s (256MB/s)(500MiB/2050msec); 0 zone resets
    slat (usec): min=3, max=12512, avg=12.21, stdev=198.89
    clat (nsec): min=1194, max=2056.0k, avg=2280.37, stdev=12428.58
     lat (usec): min=5, max=12523, avg=14.92, stdev=199.98
    clat percentiles (nsec):
     |  1.00th=[  1208],  5.00th=[  1224], 10.00th=[  1256], 20.00th=[  1288],
     | 30.00th=[  1352], 40.00th=[  1528], 50.00th=[  1560], 60.00th=[  1592],
     | 70.00th=[  1752], 80.00th=[  2128], 90.00th=[  3440], 95.00th=[  4576],
     | 99.00th=[  6688], 99.50th=[ 17280], 99.90th=[ 74240], 99.95th=[132096],
     | 99.99th=[514048]
   bw (  KiB/s): min=100279, max=445376, per=100.00%, avg=253567.75, stdev=149846.47, samples=4
   iops        : min=25069, max=111344, avg=63391.75, stdev=37461.87, samples=4
  lat (usec)   : 2=76.86%, 4=14.72%, 10=7.56%, 20=0.43%, 50=0.28%
  lat (usec)   : 100=0.08%, 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=20.35%, sys=47.49%, ctx=4307, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=500MiB (524MB), run=2050-2050msec

Disk stats (read/write):
  sdb: ios=0/9560, merge=0/12, ticks=0/30519, in_queue=30520, util=76.88%
root@local-ceph-1a:~# fio --name=test_write --directory=/srv/fio_test --ioengine=libaio --rw=randwrite --size=500M --direct=1
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
test_write: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=25.9MiB/s][w=6631 IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=49253: Thu Feb  9 14:04:34 2023
  write: IOPS=6457, BW=25.2MiB/s (26.5MB/s)(500MiB/19821msec); 0 zone resets
    slat (usec): min=11, max=2647, avg=17.28, stdev=10.69
    clat (usec): min=2, max=10077, avg=135.94, stdev=55.75
     lat (usec): min=77, max=10094, avg=153.59, stdev=56.94
    clat percentiles (usec):
     |  1.00th=[   90],  5.00th=[  103], 10.00th=[  113], 20.00th=[  122],
     | 30.00th=[  125], 40.00th=[  129], 50.00th=[  133], 60.00th=[  137],
     | 70.00th=[  141], 80.00th=[  149], 90.00th=[  161], 95.00th=[  176],
     | 99.00th=[  217], 99.50th=[  243], 99.90th=[  306], 99.95th=[  396],
     | 99.99th=[ 2704]
   bw (  KiB/s): min=23368, max=28136, per=100.00%, avg=25858.26, stdev=1146.35, samples=39
   iops        : min= 5842, max= 7034, avg=6464.56, stdev=286.59, samples=39
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=4.08%
  lat (usec)   : 250=95.53%, 500=0.36%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=3.46%, sys=10.99%, ctx=128373, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=25.2MiB/s (26.5MB/s), 25.2MiB/s-25.2MiB/s (26.5MB/s-26.5MB/s), io=500MiB (524MB), run=19821-19821msec

Disk stats (read/write):
  sdb: ios=0/126882, merge=0/929, ticks=0/17219, in_queue=17371, util=99.56%

To check if values in eqiad are different, I ran the same command on tools-db-2, which has a Ceph-based root volume:

root@tools-db-2:~# fio --name=test_write --directory=/root/fio_test --ioengine=libaio --rw=randwrite --size=500M --direct=1
test_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
test_write: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=19.6MiB/s][w=5005 IOPS][eta 00m:00s]
test_write: (groupid=0, jobs=1): err= 0: pid=3537038: Thu Feb  9 17:15:26 2023
  write: IOPS=5013, BW=19.6MiB/s (20.5MB/s)(500MiB/25532msec); 0 zone resets
    slat (usec): min=12, max=4725, avg=23.42, stdev=22.32
    clat (usec): min=2, max=9525, avg=173.62, stdev=227.28
     lat (usec): min=71, max=9549, avg=197.42, stdev=228.22
    clat percentiles (usec):
     |  1.00th=[   81],  5.00th=[   93], 10.00th=[  100], 20.00th=[  111],
     | 30.00th=[  116], 40.00th=[  120], 50.00th=[  124], 60.00th=[  129],
     | 70.00th=[  135], 80.00th=[  145], 90.00th=[  163], 95.00th=[  223],
     | 99.00th=[ 1237], 99.50th=[ 1254], 99.90th=[ 1319], 99.95th=[ 1336],
     | 99.99th=[ 1663]
   bw (  KiB/s): min=19480, max=23944, per=100.00%, avg=20084.94, stdev=566.35, samples=50
   iops        : min= 4870, max= 5986, avg=5021.22, stdev=141.59, samples=50
  lat (usec)   : 4=0.01%, 50=0.01%, 100=9.77%, 250=85.52%, 500=0.26%
  lat (usec)   : 750=0.05%, 1000=0.04%
  lat (msec)   : 2=4.36%, 4=0.01%, 10=0.01%
  cpu          : usr=2.59%, sys=12.51%, ctx=128040, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,128000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=19.6MiB/s (20.5MB/s), 19.6MiB/s-19.6MiB/s (20.5MB/s-20.5MB/s), io=500MiB (524MB), run=25532-25532msec

Disk stats (read/write):
  sda: ios=0/127456, merge=0/1502, ticks=0/22175, in_queue=22620, util=99.74%

@dcaro found this blog post with an example fio command that should tell us if the performance is good enough for etcd, looking at the 99th percentile of fdatasync latency (the rule of thumb from that post is that it should stay under 10ms).

Running it on the same host with Ceph storage, I see slightly different results when using the root volume (/dev/sda, "compute pool" in Ceph) and an attached Cinder volume (/dev/sdb, "Cinder pool" in Ceph).

Root volume (99th sync percentile is 11731 usec):

root@tools-db-2:~# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/root/fio_test --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.25
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=539KiB/s][w=240 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=3540437: Thu Feb  9 18:05:42 2023
  write: IOPS=212, BW=478KiB/s (490kB/s)(21.0MiB/47093msec); 0 zone resets
    clat (usec): min=8, max=1550, avg=35.47, stdev=25.82
     lat (usec): min=9, max=1551, avg=36.40, stdev=25.87
    clat percentiles (usec):
     |  1.00th=[   16],  5.00th=[   23], 10.00th=[   27], 20.00th=[   29],
     | 30.00th=[   30], 40.00th=[   31], 50.00th=[   32], 60.00th=[   35],
     | 70.00th=[   38], 80.00th=[   40], 90.00th=[   45], 95.00th=[   56],
     | 99.00th=[   89], 99.50th=[  102], 99.90th=[  233], 99.95th=[  338],
     | 99.99th=[ 1254]
   bw (  KiB/s): min=    4, max=  759, per=100.00%, avg=535.62, stdev=141.92, samples=84
   iops        : min=    2, max=  338, avg=238.67, stdev=63.20, samples=84
  lat (usec)   : 10=0.10%, 20=3.57%, 50=89.78%, 100=6.00%, 250=0.45%
  lat (usec)   : 500=0.06%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1061, max=4337.9k, avg=4650.64, stdev=46238.53
    sync percentiles (usec):
     |  1.00th=[   1565],  5.00th=[   1762], 10.00th=[   1909],
     | 20.00th=[   2114], 30.00th=[   2343], 40.00th=[   2802],
     | 50.00th=[   4015], 60.00th=[   4424], 70.00th=[   4752],
     | 80.00th=[   5080], 90.00th=[   5538], 95.00th=[   6390],
     | 99.00th=[  11731], 99.50th=[  14484], 99.90th=[  66847],
     | 99.95th=[ 149947], 99.99th=[1182794]
  cpu          : usr=0.49%, sys=3.23%, ctx=29292, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=478KiB/s (490kB/s), 478KiB/s-478KiB/s (490kB/s-490kB/s), io=21.0MiB (23.1MB), run=47093-47093msec

Disk stats (read/write):
  sda: ios=0/25759, merge=0/11272, ticks=0/44192, in_queue=81635, util=91.18%

Attached volume, 99th sync percentile is 5145 usec (but it was 9765 on a previous run):

root@tools-db-2:~# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/mnt/perftest --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=890KiB/s][w=396 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=3540488: Thu Feb  9 18:08:06 2023
  write: IOPS=354, BW=797KiB/s (816kB/s)(21.0MiB/28266msec); 0 zone resets
    clat (usec): min=5, max=10609, avg=641.26, stdev=582.29
     lat (usec): min=5, max=10610, avg=642.15, stdev=582.30
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   17], 10.00th=[   23], 20.00th=[   25],
     | 30.00th=[   27], 40.00th=[   30], 50.00th=[  889], 60.00th=[ 1012],
     | 70.00th=[ 1090], 80.00th=[ 1172], 90.00th=[ 1270], 95.00th=[ 1352],
     | 99.00th=[ 1516], 99.50th=[ 1631], 99.90th=[ 3130], 99.95th=[ 4228],
     | 99.99th=[ 7898]
   bw (  KiB/s): min=  705, max= 1037, per=99.88%, avg=796.91, stdev=75.83, samples=56
   iops        : min=  314, max=  462, avg=354.96, stdev=33.77, samples=56
  lat (usec)   : 10=0.42%, 20=8.05%, 50=35.04%, 100=0.29%, 250=0.03%
  lat (usec)   : 500=0.01%, 750=0.61%, 1000=14.42%
  lat (msec)   : 2=40.88%, 4=0.20%, 10=0.05%, 20=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=916, max=16762, avg=2165.57, stdev=782.96
    sync percentiles (usec):
     |  1.00th=[ 1303],  5.00th=[ 1500], 10.00th=[ 1614], 20.00th=[ 1762],
     | 30.00th=[ 1860], 40.00th=[ 1942], 50.00th=[ 2040], 60.00th=[ 2147],
     | 70.00th=[ 2245], 80.00th=[ 2409], 90.00th=[ 2638], 95.00th=[ 2966],
     | 99.00th=[ 5145], 99.50th=[ 6718], 99.90th=[10421], 99.95th=[13173],
     | 99.99th=[16581]
  cpu          : usr=1.56%, sys=5.47%, ctx=35768, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=797KiB/s (816kB/s), 797KiB/s-797KiB/s (816kB/s-816kB/s), io=21.0MiB (23.1MB), run=28266-28266msec

Disk stats (read/write):
  sda: ios=5606/20057, merge=0/18, ticks=5858/20953, in_queue=45035, util=99.77%

hv-local lvm drive:

root@local-etcd-3a:/srv# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/srv/fio_test --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1875KiB/s][w=834 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=58074: Thu Feb 9 17:53:16 2023
  write: IOPS=694, BW=1559KiB/s (1596kB/s)(21.0MiB/14450msec); 0 zone resets
  clat (usec): min=3, max=9169, avg=263.82, stdev=267.78
   lat (usec): min=4, max=9169, avg=264.48, stdev=267.86
  clat percentiles (usec):
   |  1.00th=[    6],  5.00th=[    7], 10.00th=[    9], 20.00th=[   15],
   | 30.00th=[   21], 40.00th=[   34], 50.00th=[  253], 60.00th=[  326],
   | 70.00th=[  429], 80.00th=[  510], 90.00th=[  586], 95.00th=[  709],
   | 99.00th=[  889], 99.50th=[  914], 99.90th=[  938], 99.95th=[ 1004],
   | 99.99th=[ 1090]
 bw (  KiB/s): min= 1073, max= 2767, per=98.53%, avg=1536.32, stdev=464.72, samples=28
 iops        : min=  478, max= 1232, avg=684.21, stdev=206.93, samples=28
lat (usec)   : 4=0.01%, 10=11.55%, 20=18.38%, 50=13.67%, 100=0.21%
lat (usec)   : 250=5.65%, 500=29.31%, 750=16.87%, 1000=4.29%
lat (msec)   : 2=0.05%, 10=0.01%
fsync/fdatasync/sync_file_range:
  sync (usec): min=417, max=11526, avg=1168.61, stdev=426.85
  sync percentiles (usec):
   |  1.00th=[  502],  5.00th=[  578], 10.00th=[  644], 20.00th=[  775],
   | 30.00th=[  889], 40.00th=[ 1029], 50.00th=[ 1156], 60.00th=[ 1254],
   | 70.00th=[ 1385], 80.00th=[ 1549], 90.00th=[ 1713], 95.00th=[ 1844],
   | 99.00th=[ 2114], 99.50th=[ 2180], 99.90th=[ 2638], 99.95th=[ 3458],
   | 99.99th=[ 7963]
cpu          : usr=1.34%, sys=7.71%, ctx=40257, majf=0, minf=17
IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
   submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
   latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1559KiB/s (1596kB/s), 1559KiB/s-1559KiB/s (1596kB/s-1596kB/s), io=21.0MiB (23.1MB), run=14450-14450msec

Disk stats (read/write):
  sdb: ios=5603/19957, merge=0/2, ticks=2261/11201, in_queue=20364, util=99.32%

Andrew added a subtask: Unknown Object (Task). Feb 10 2023, 3:21 PM

Presently I'm thinking we will use the three new servers as local-storage hypervisors. Cinder/lvm performance varies dramatically depending on co-location (or not) of the volume with the VM, which means special-purpose local-storage lvm is probably a non-starter, and I don't see ready-made openstack features for keeping VMs and their volumes colocated.

Change 894050 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack codfw1dev: remove the local cinder code from cloudvirts

https://gerrit.wikimedia.org/r/894050

Change 894050 merged by Andrew Bogott:

[operations/puppet@production] Openstack codfw1dev: remove the local cinder code from cloudvirts

https://gerrit.wikimedia.org/r/894050

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Mar 17 2023, 2:12 PM
fnegri triaged this task as High priority. Apr 12 2023, 2:45 PM

Now we have etcd running on cloudvirtlocal100[1-3] and things seem to be working fine.