
Support Cinder for CI docker workers
Closed, Resolved · Public

Description

Instances in the 'Integration' project implement some of the interesting use cases for LVM (e.g. having multiple LVM volumes on a single VM).

This task is to keep track of attempts to move from LVM to Cinder use for these storage needs.

Event Timeline

Change 670524 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] profile::ci::slave::labs::common: move to cinder-based storage

https://gerrit.wikimedia.org/r/670524

Hashar requested on the task that I make a feature flag. That's a reasonable request but it would be nice to get the existing puppet code organized and understandable before I add a module param all over the place. A few requests/suggestions:

  • All of the integration-agent-docker-xxxx nodes seem to use the same puppet config, but it is set individually on each instance and there are slight variations. Could they be standardized using a prefix?
  • integration-agent-qemu-1001.integration.eqiad1.wikimedia.cloud is an outlier that uses a unique hiera set; if that VM is useful that's fine but if it can be deleted then the docker_lvm_volume hiera key can be removed in favor of default behavior.
  • same question re: integration-castor03

Each of the instances has a Puppet role applied to it explicitly, and indeed there is no prefix applied. Most probably that is because they predate the introduction of the puppet prefix in Horizon and, most importantly, it is one less layer of inception to look up when figuring out which Hiera settings get applied.

The per-instance settings also make it easy to progressively roll out a new Hiera setting one instance after the other. Though nowadays we don't rely much on Puppet beyond setting up the partitions, a few files and installing Docker.

The instances running a Docker daemon have role::ci::slave::labs::docker::docker_lvm_volume: 'true', which ensures the extended disk space is split into two partitions, one dedicated to Docker, the other to the Jenkins build workspace. The reasoning, iirc, is that if Docker is full we still have the images and can still execute jobs, and / is still happy. However if the job area is full, no more jobs can run there and we have an outage.

That got introduced in T203841: Provide dedicated storage space to Docker for images/containers by @dduvall, to isolate the Docker images from the root partition, and apparently Docker at the time wanted a dedicated volume group (so we could not point it to /srv). Hence the split.
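Schematically the flag amounts to something like the following (a sketch only; the resource titles, parameters and sizes are assumptions, not the actual manifest gated by docker_lvm_volume):

# Sketch only: the resource titles, parameters and sizes are assumptions.
if $docker_lvm_volume {
    # Split the extended disk: one volume for Docker's images/containers,
    # one for the Jenkins build workspaces, so a full Docker volume does
    # not starve the workspaces or the root filesystem.
    labs_lvm::volume { 'docker':
        mountat => '/var/lib/docker',
        size    => '70%FREE',
    }
    labs_lvm::volume { 'second-local-disk':
        mountat => '/srv',
        size    => '100%FREE',
    }
} else {
    # Default behaviour: the whole extended disk as a single /srv volume.
    labs_lvm::volume { 'second-local-disk':
        mountat => '/srv',
    }
}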

The two other servers do not use Docker and thus do not have role::ci::slave::labs::docker::docker_lvm_volume: 'true':

  • integration-agent-qemu-1001.integration.eqiad1.wikimedia.cloud has a different naming scheme since it has a different role. The instance doesn't have a Docker daemon; the jobs running there use Qemu to boot a VM that has a running and writable Docker daemon. We can't have those jobs mess with the Docker daemon used to run the other CI jobs. That setup is used to build and test Docker images, afaik.
  • integration-castor03 just uses the whole extended disk as /srv and does not have Docker. That machine is an rsync server we use to store package manager caches. Maybe that whole system can be dropped and we could instead use the WMCS per-tenant shared storage. But that is an entirely different story ;-]

I thought we could drop that docker_lvm_volume trick, use a single slightly larger /srv and point Docker to it (e.g. /srv/docker), but the past task seems to indicate that is not possible.
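For reference, pointing Docker at a directory on /srv would essentially just be the data-root setting in its daemon.json; an illustration only, not the actual docker module wiring we use (Service['docker'] is assumed to be declared elsewhere):

# Illustration only: the real setup manages daemon.json via the docker
# module; Service['docker'] is assumed to be declared elsewhere.
file { '/etc/docker/daemon.json':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    content => "{ \"data-root\": \"/srv/docker\" }\n",
    notify  => Service['docker'],
}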

The above was for the whole context. To elaborate on the review comment I gave on https://gerrit.wikimedia.org/r/670524 : it would be ideal to have a Hiera setting we can flip to use the new Cinder-based storage (see the sketch further below). Then to do the migration we would, for each instance:

  • unpool an instance from Jenkins
  • stop puppet
  • nuke /srv and /var/lib/docker and unmount them
  • remove the docker_lvm_volume Hiera setting
  • turn on the Cinder feature flag
  • enable puppet
  • ensure the new partitions are properly set up
  • restart Docker to use the new system
  • run a random image as a sanity check
  • pool the instance back in Jenkins

And if that all works fine, proceed with the next instance.

And if Docker is happy about it, maybe we could use a single volume that hosts both the job workspaces and the Docker daemon data. The original use case was to use the extended disk and to separate those from the root partition holding the system. Assuming Docker works, I would be fine with a single /srv Cinder volume.
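To make that Hiera switch concrete, here is a hedged sketch of what it could look like in profile::ci::slave::labs::common (the $use_cinder parameter and its lookup are assumptions; the cinderutils::ensure part is lifted from the proposed change):

# Sketch only: $use_cinder is a hypothetical parameter, it does not
# exist in the profile today.
class profile::ci::slave::labs::common (
    Boolean $use_cinder = lookup('profile::ci::slave::labs::common::use_cinder',
                                 { 'default_value' => false }),
) {
    if $use_cinder {
        # New path: claim an attached Cinder volume for /srv.
        cinderutils::ensure { 'srv':
            mount_point => '/srv',
            max_gb      => 40,
        }
    } else {
        # Legacy path: keep the LVM-based /srv on the extended disk.
        require ::profile::labs::lvm::srv
    }
    # ... the rest of the profile stays unchanged ...
}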

Another note: from T278689 , when creating a new instance it is apparently no longer possible to use the legacy LVM partitioning. I haven't looked at it though; it was the evening and I had a veryyyy long day.

Side track: we also have a task to rebuild all those instances on Buster (aka start with fresh ones), which is T252071, but it is not scheduled. We might well just jump to the next Debian version and thus do that migration at the end of the year.

Andrew triaged this task as Medium priority. Apr 13 2021, 4:22 PM

https://gerrit.wikimedia.org/r/c/operations/puppet/+/670524 got applied, which caused Puppet to fail on all instances:

Error while evaluating a Resource Statement, Duplicate declaration: Mount[/srv] is already declared
  at (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 57);
cannot redeclare (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 57)
(file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 57, column: 9)
(file: /etc/puppet/modules/profile/manifests/ci/slave/labs/common.pp, line: 23)

The Puppet change adds to profile::ci::slave::labs::common:

# Mount a cinder volume on /srv. Max_gb is specified
#  to keep this from accidentally claiming a volume
#  designated for Docker
cinderutils::ensure { 'srv':
    mount_point => '/srv',
    max_gb      => 40,
}

The profile already also has require ::profile::labs::lvm::srv. They both declare a Mount[/srv], and that fails as a result. I guess we want a feature switch of some sort, or to follow the migration procedure I mentioned in my previous comment.

I have removed the change from the Puppet master.

Change 721090 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud lvm: add back an optional labs_lvm init

https://gerrit.wikimedia.org/r/721090

Change 721090 merged by Bstorm:

[operations/puppet@production] cloud lvm: add back an optional labs_lvm init

https://gerrit.wikimedia.org/r/721090

Change 721105 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud lvm: finish up volume group creation for ephemeral disk

https://gerrit.wikimedia.org/r/721105

Change 721105 merged by Bstorm:

[operations/puppet@production] cloud lvm: finish up volume group creation for ephemeral disk

https://gerrit.wikimedia.org/r/721105

@Krinkle and @Jdforrester-WMF I think that should unblock your efforts at making new nodes. If you use the ephemeral disk flavors with that profile (either applied directly to the instance or included in the role) you should be able to use the ephemeral disk without changes to anything else. It will not work with an actual cinder volume because it specifically checks for the ephemeral disk label.

I would recommend a different approach for instances where the data is something you want to live longer than the instance (that's what cinder is for), but just to unblock replacing disposable instances, this should work. To use it, you don't want ANY cinderutils puppet stuff in there, though. That isn't directly compatible. If you need a larger ephemeral disk (and the use case of the instance doesn't match what cinder is for), then you'll need to request a flavor with a larger ephemeral disk.

Where it might *not* work is if the init for LVM in general gets applied too early in the whole setup, but I'm hoping this will work if it is the first thing applied 😁

If it doesn't, I have another trick that should do it, so let me know.

I've done the following:

  • Deleted agent-qemu-1002, which we had both made various manual changes on.
  • Un-cherry-picked https://gerrit.wikimedia.org/r/717732 which did a partial migration from lvm to cinderutils. The integration puppetmaster is now up-to-date with operations/puppet with one unrelated cherry-pick on top.
  • Created agent-qemu-1003 with flavour g3.cores8.ram24.disk20.ephemeral40.4xiops (Debian 11 Bullseye).
  • Initial puppet run.
  • Applied role role::ci::slave::labs::docker in Horizon.
  • Main puppet run:
Info: Caching catalog for integration-agent-qemu-1003.integration.eqiad1.wikimedia.cloud
Info: Applying configuration version '(e6b892e97a) root - ci: Add 'bullseye' to docker lsbdistcodename hack'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: /Stage[main]/Profile::Ci::Docker/Package[acl]/ensure: created
Notice: /Stage[main]/Docker::Configuration/File[/etc/docker]/ensure: created
Notice: /Stage[main]/Docker::Configuration/File[/etc/docker/daemon.json]/ensure: defined content as '{md5}49b16946fdce3875daa02eeaa158f67f'
Notice: /Stage[main]/Docker/Package[docker.io]/ensure: created
Notice: /Stage[main]/Profile::Ci::Docker/File[/usr/local/bin/docker-credential-environment]/ensure: defined content as '{md5}a48067b5809a2703033c7bf7b89c98a8'
Notice: /Stage[main]/Profile::Ci::Docker/Exec[jenkins user docker membership]/returns: executed successfully
Notice: /Stage[main]/Java/Java::Package[openjdk-jre-headless-11]/Package[openjdk-11-jre-headless]/ensure: created
Notice: /Stage[main]/Labs_lvm/Package[lvm2]/ensure: created
Notice: /Stage[main]/Labs_lvm/Package[parted]/ensure: created
Notice: /Stage[main]/Labs_lvm/File[/usr/local/sbin/make-instance-vg]/ensure: defined content as '{md5}4a19ecb20e1ea5b8fb152e6e772e7e4a'
Notice: /Stage[main]/Labs_lvm/File[/usr/local/sbin/make-instance-vg-ephem]/ensure: defined content as '{md5}352a418e3db6e44cd58014d130b9ddac'
Notice: /Stage[main]/Labs_lvm/File[/usr/local/sbin/pv-free]/ensure: defined content as '{md5}c1def11fe917fa99078e0b71662f1165'
Notice: /Stage[main]/Labs_lvm/File[/usr/local/sbin/make-instance-vol]/ensure: defined content as '{md5}31bba45e3cfdfa50c8421d67ceb6366a'
Notice: /Stage[main]/Labs_lvm/File[/usr/local/sbin/extend-instance-vol]/ensure: defined content as '{md5}d6486da5b09024f34c4755866c63238f'
Notice: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: /usr/local/sbin/make-instance-vg: lvm is not active on this host; unable to create a volume.
Error: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of [0]
Error: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: change from 'notrun' to ['0'] failed: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of [0]
Info: Class[Labs_lvm]: Unscheduling all events on Class[Labs_lvm]
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns:   File "/usr/local/sbin/pv-free", line 17, in <module>
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns:     assert pvfree.endswith("G")
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: AssertionError
Error: '/usr/local/sbin/pv-free' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: change from 'notrun' to ['0'] failed: '/usr/local/sbin/pv-free' returned 1 instead of one of [0]
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]: Dependency Exec[create-volume-group] has failures: true
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]: Dependency Exec[available-space-second-local-disk] has failures: true
Warning: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-mountpoint-second-local-disk]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Mount[/srv]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/File[/srv]: Skipping because of failed dependencies

I don't know if this is meant to work. But, if the intent was to make existing lvm-related manifests work exactly as-is in all possible cases, that seems to not be the case yet for the combination of factors I happen to be in right now. As before, I'm happy to approach this differently or make changes to the manifest in question. Whatever works best :)

@Krinkle I left one step undone so that I didn't cause any potential breakage on your things. You need to include the profile profile::wmcs::lvm. You can add it directly to the puppet classes of the VM you are working with to see how it works, or you could add an include to any role you like. I wasn't sure which one you needed it on.
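For the role-based option, that would be roughly (a sketch; the role's existing includes are omitted and the exact placement is an assumption):

# Sketch of the role-based option; existing includes omitted.
class role::ci::slave::labs::docker {
    # Sets up the instance volume group from the ephemeral disk so the
    # existing labs_lvm-based profiles can keep carving /srv out of it.
    include profile::wmcs::lvm
    # ... existing profile includes ...
}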

@Bstorm OK. I've added profile::wmcs::lvm to qemu-agent-1003 in Horizon before role::ci::slave::labs::docker as a minimal change, to experimentally try this on that one instance first. I've intentionally done nothing else yet. With that change the puppet agent now says the following:

Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Exec[create-volume-group]' in parameter 'require' (file: /etc/puppet/modules/labs_lvm/manifests/volume.pp, line: 55) on node integration-agent-qemu-1003.integration.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

I've tried it on the older "pre cinder" instances qemu-agent-1001 and integration-agent-docker-1013 for comparison. Those fail with the same error. (I've removed the profile from the latter two as they are actively in use; I've left the new, unused qemu-1003 in a broken state.)

That's interesting. Thanks for leaving it. I'll take a look and try to figure out what it did. Naturally that didn't happen in my test instances :)

@Krinkle I cannot find that error anywhere on qemu-agent-1003. It looks more fundamentally broken in other ways.

[  135.936341] cloud-init[1527]: Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: integration-puppetmaster-02.integration.eqiad.wmflabs]
[  135.936716] cloud-init[1527]: Warning: Not using cache on failed catalog
[  135.937927] cloud-init[1527]: Error: Could not retrieve catalog; skipping run
[  136.008112] cloud-init[1527]: Error: Could not send report: SSL_connect returned=1

Puppet seems entirely broken on it for some reason. As for a retrofit on an old VM, I would expect it to do nothing, since the mountpoint should already be in use unless it was somehow modified at some point. I'll look at those separately and maybe see what happens on a toolsbeta VM with a retrofit.

Ah, I know why I am not finding it :) I was trying to ssh to qemu-agent-1003 instead of integration-agent-qemu-1003. Never mind!

Change 722431 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] labslvm: fix the branch around ephemeral vols

https://gerrit.wikimedia.org/r/722431

Change 722431 merged by Bstorm:

[operations/puppet@production] labslvm: fix the branch around ephemeral vols

https://gerrit.wikimedia.org/r/722431

That fixes the problem, @Krinkle. It should now be a no-op on old things and actually make new ephemeral-disk things work. Sorry about that mistake. Your puppetmaster has some cherry-pick or something going on, or I'd put it on your actual setup.

Oops, yeah, I got those name parts backward. Sorry about that.

The integration puppetmaster has 1 cherry-picked commit indeed, but afaik it's not conflicting with any recent commits. I see the latest origin/production is at HEAD~2 and it seems to still be updated automatically, but maybe not? It rebased cleanly just now when I ran git pull. The labslvm patch is live there now.

Puppet now runs cleanly with the new profile applied on agent-qemu-1003 (new) as well as -qemu-1001 (old) and docker-1013 (random current agent). I noticed the following was added in Horizon for the qemu-1003 instance:

labs_lvm::disk: /dev/sdb
labs_lvm::ephemeral: true

Was this for debugging or would this be required in order for the same to work on another new instance created with this profile/role applied?

That is the functional equivalent of applying profile::wmcs::lvm. You can remove that hiera and apply that profile, or leave it there, whichever. It was for debugging. Sorry, I forgot to remove it.

Change 722476 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] ci: Apply profile::wmcs::lvm as needed for new integration instances

https://gerrit.wikimedia.org/r/722476

Ah okay, no problem. I've applied it now via the role as well (rather than via Horizon) with https://gerrit.wikimedia.org/r/722476 and confirmed everything still runs cleanly: the expected /srv exists on the new qemu agent, and on the docker agents space is still allocated the same way as before. Resolved from my POV :)

Change 722476 merged by Giuseppe Lavagetto:

[operations/puppet@production] ci: Apply profile::wmcs::lvm as needed for new integration instances

https://gerrit.wikimedia.org/r/722476

Change 670524 abandoned by Andrew Bogott:

[operations/puppet@production] profile::ci::slave::labs::common: move to cinder-based storage

Reason:

For now we're just using the old lvm-based roles with a shim to let the lvm classes use cinder/ephemeral volumes instead.

https://gerrit.wikimedia.org/r/670524

Krinkle assigned this task to Bstorm.