
Set up a mechanism for etcd nodes to be local storage VMs
Closed, Resolved · Public

Description

Etcd nodes simply do not operate well on ceph as we currently have it laid out. We've made ceph very good and highly performant in general, but etcd has peculiar sensitivities that make nodes which should be blazing fast behave terribly.

We have been trying to move everything to a one-size-fits-all model where instance storage is in ceph, period, with cinder for attachable volumes. It seems that the only thing that won't quite work right under this model is etcd. Interestingly, it operates pretty badly even in PAWS, which is a very quiet cluster compared to tools. Toolsbeta isn't any better, with persistent single- to double-digit iowait, timeouts, failures, etc. We cannot be all-in on Kubernetes while the backing datastore performs this badly.
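(For anyone reproducing the symptoms: etcd is mostly sensitive to fsync latency on its write-ahead log, and one way to quantify that is fio's sync engine with an fdatasync after every write. This is only an illustrative sketch; the directory, size, and block size below are assumptions, not measurements we actually ran.)

  # Approximate etcd's WAL pattern: small sequential writes, each followed
  # by fdatasync. High tail fdatasync latency here lines up with the iowait
  # and timeout symptoms described above. All parameters are illustrative;
  # the target directory must already exist on the filesystem under test.
  fio --name=etcd-wal-test --directory=/var/lib/etcd-fio-test \
      --rw=write --ioengine=sync --fdatasync=1 \
      --size=22m --bs=2300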

I suggest we move clouddb1003/4 to cinder/ceph-backed systems with appropriately downsized storage, and use cloudvirt1019, cloudvirt1020, and maybe one more cloudvirt to make reboots easier. I figure the etcd VMs need non-ceph flavors and probably something else I'm not thinking of? If there are three cloudvirts and etcd servers come in sets of three with hard anti-affinity, you can always reboot one cloudvirt without evacuating (once toolsdb is not in the picture, anyway).
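(For illustration, the hard anti-affinity mentioned above is usually expressed with a Nova server group. This is only a sketch; the group, flavor, image, and instance names are placeholders rather than anything that was actually deployed.)

  # Create a server group with a hard anti-affinity policy (placeholder name).
  openstack server group create --policy anti-affinity toolsbeta-etcd

  # Boot each etcd VM into that group so the scheduler refuses to place two
  # members on the same hypervisor. Flavor, image, and name are placeholders.
  openstack server create \
      --flavor <non-ceph-flavor> \
      --image <debian-image> \
      --hint group=<server-group-uuid> \
      toolsbeta-test-etcd-1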

Event Timeline

I've created an aggregate called 'localdisk' which schedules VMs using local storage -- that aggregate currently contains cloudvirt1018, 1019 and 1020. I've also created a private flavor (available only to testlabs, tools, and toolsbeta) named g3.cores1.ram2.disk20.localdisk which makes use of this aggregate.
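(For the record, the standard way to wire an aggregate to a flavor is a metadata property on the aggregate matched by the flavor's extra specs via the AggregateInstanceExtraSpecsFilter. The commands below are a reconstruction, not a transcript; in particular the property key and the hosts added to the aggregate are assumptions.)

  # Create the aggregate, add the local-storage hypervisors, and tag it.
  openstack aggregate create localdisk
  openstack aggregate add host localdisk cloudvirt1019.eqiad.wmnet
  openstack aggregate add host localdisk cloudvirt1020.eqiad.wmnet
  openstack aggregate set --property localdisk=true localdisk

  # Create a private flavor pinned to that aggregate, then grant each
  # project access (repeat the last command for toolsbeta and testlabs).
  openstack flavor create --private --vcpus 1 --ram 2048 --disk 20 \
      --property aggregate_instance_extra_specs:localdisk=true \
      g3.cores1.ram2.disk20.localdisk
  openstack flavor set --project tools g3.cores1.ram2.disk20.localdisk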

Cloudvirt1019 and 1020 have plenty of room for a few etcd nodes so there's no immediate need to move the db instances out of the way.

Before closing this task we need to adjust our hypervisor-draining scripts to notice these localdisk VMs and respond appropriately (probably by displaying a message along the lines of "go ahead and reboot this host, but only reboot one of these at a time").
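(As an illustration of the kind of check the draining tooling could grow, and not the actual wmcs scripts, something like the following would flag localdisk VMs on the host being drained; matching on the flavor name is an assumption.)

  # Warn if the hypervisor being drained carries any localdisk-flavored VMs.
  host="cloudvirt1019.eqiad.wmnet"
  if openstack server list --all-projects --host "$host" --long | grep -q localdisk; then
      echo "Host $host carries local-storage VMs: go ahead and reboot it,"
      echo "but only drain/reboot one localdisk hypervisor at a time."
  fi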

That sounds good to me. Thanks for pushing this forward so far so fast :)

I was overly optimistic about our ability to host local-storage and ceph-backed VMs on the same hypervisor. We mostly can't.

Some options are:

  1. Pick a cloudvirt, drain it, and declare it 'local storage only'
  2. Add local-storage support in cinder and use that for etcd storage
  3. The weird hack where one hypervisor presents itself to the scheduler as two different nodes with two different configs: https://ceph.io/geen-categorie/openstack-nova-configure-multiple-ceph-backends-on-one-hypervisor/
  4. Just cram two etcd nodes together onto either 1019 or 1020 (which are already using local storage)

We're going to try #4 for starters, but both 2 and 3 are intriguing (a sketch of the config difference involved is below).
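(For context: a nova-compute host normally configures a single ephemeral-disk backend, which is why local- and ceph-backed VMs don't mix well on one hypervisor without tricks like option 3. The excerpts below are only a rough sketch of the two configurations; the pool and cephx user names are assumptions.)

  # ceph-backed hypervisor: instance disks are RBD volumes.
  [libvirt]
  images_type = rbd
  images_rbd_pool = eqiad1-compute   # pool name is an assumption
  rbd_user = eqiad1-compute          # cephx user is an assumption

  # local-storage hypervisor: instance disks live on the hypervisor itself.
  [libvirt]
  images_type = qcow2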

Mentioned in SAL (#wikimedia-cloud) [2021-05-26T18:07:06Z] <andrewbogott> draining cloudvirt1018, converting it to a local-storage host like cloudvirt1019 and 1020 -- T283296

Change 695441 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Convert cloudvirt1018 to a local-storage hypervisor

https://gerrit.wikimedia.org/r/695441

Change 695441 merged by Andrew Bogott:

[operations/puppet@production] Convert cloudvirt1018 to a local-storage hypervisor

https://gerrit.wikimedia.org/r/695441

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1018.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105261905_andrew_1008.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1018.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105261940_andrew_5449.log.

Completed auto-reimage of hosts:

['cloudvirt1018.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1018.eqiad.wmnet']

We're now using cloudvirt1018, 1019 and 1020 as local-storage nodes. 1019 and 1020 still share space with toolsdb, but that seems to work fine; 1018 is a dedicated (and almost totally empty) host for just this use.

Toolforge etcd is now running entirely on local storage VMs.
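(A quick way to confirm placement, for the record; the instance name below is a placeholder.)

  # Show which hypervisor an etcd VM landed on and which flavor it uses.
  openstack server show <etcd-instance-name> \
      -f value -c flavor -c OS-EXT-SRV-ATTR:hypervisor_hostname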

Andrew claimed this task.