https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup will probably have to be updated a fair bit to cope with cinder.
Description
Status | Assigned | Task |
---|---|---|
Resolved | hashar | T292729 TAR_ENTRY_ERROR ENOSPC: no space left on device |
Resolved | hashar | T252071 Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye |
Resolved | Jdforrester-WMF | T290783 Create first CI agent with the new disk system |
Resolved | Bstorm | T277078 Support Cinder for CI docker workers |
Resolved | aborrero | T299704 Request increased quota for integration Cloud VPS project |
Event Timeline
OK, sitrep:
- I created integration-agent-docker-1021 as a bullseye instance, ran into issues, we deleted it and started again.
Note: New things don't get called .eqiad.wmflabs any more; it was integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
- I created integration-agent-docker-1022, pulled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/670524 on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/717732 on the puppet master, and got:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Exec[prepare_cinder_volume_/srv] is already declared at (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74); cannot redeclare (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74) (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74, column: 17) (file: /etc/puppet/modules/profile/manifests/ci/slave/labs/common.pp, line: 23) on node integration-agent-docker-1022.integration.eqiad1.wikimedia.cloud
So more to do there. I'm stopping for the day.
If I get it right, the bulk of the work has been done via T277078. The goal was to create a Bullseye-based image in order to benefit from a newer Qemu version; that ended up hitting the migration to Cinder / ephemeral disks, but it should be working now.
integration-agent-docker-1022 is still around though it is in shutdown state since December: https://horizon.wikimedia.org/project/instances/52eedb5b-f450-4b73-9ad6-39e426eab5eb/
It uses the flavor g3.cores8.ram24.disk20.ephemeral40.4xiops which is the new standard.
Mentioned in SAL (#wikimedia-releng) [2022-01-14T14:59:12Z] <hashar> Starting VM integration-agent-docker-1022 which was in shutdown state since December and is Bullseye based # T290783
When bringing back the instance, it has Docker shipped from Debian: docker.io 20.10.5+dfsg1-1+deb11u1 which sounds good.
For the disks:
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   20G  0 disk
├─sda1    8:1    0 19.9G  0 part /
├─sda14   8:14   0    3M  0 part
└─sda15   8:15   0  124M  0 part /boot/efi
sdb       8:16   0   40G  0 disk
After running puppet:
NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                         8:0    0   20G  0 disk
├─sda1                      8:1    0 19.9G  0 part /
├─sda14                     8:14   0    3M  0 part
└─sda15                     8:15   0  124M  0 part /boot/efi
sdb                         8:16   0   40G  0 disk
├─vd-docker               254:0    0   28G  0 lvm  /var/lib/docker
└─vd-second--local--disk  254:1    0   12G  0 lvm  /srv
The partitions got created via:
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[available-space-docker]/returns) executed successfully (corrective)
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-vd-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-mountpoint-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Mount[/var/lib/docker]/ensure) defined 'ensure' as 'mounted'
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Mount[/srv]/ensure) defined 'ensure' as 'mounted'
Comparison of partitions:
Partition | Old disk.80 | New disk20.ephemeral40 |
---|---|---|
/ | 20 G | 20 G |
/var/lib/docker | 42.7 G | 28 G |
/srv | 18.3 G | 12 G |
What we found out via T292729 is that our builds are way larger nowadays, and the heavier ones write to /var/lib/docker because they do not fit in the 18.3 G /srv.
We also prune Docker images every weekend. Looking at docker system df this Friday, the top usage is 14 G, so the 28 G of the new flavor fits.
For /srv we definitely need more than 12 G: at the very least 21 G, but I think 30 G would give us more breathing room. That adds up to a total of roughly 60 G of ephemeral disk.
Puppet creates the docker volume using 70% of the free disk:
class profile::ci::dockervolume {
    labs_lvm::volume { 'docker':
        size => '70%FREE',
    }
}
After that the rest (100%FREE) is allocated to /srv by profile::labs::lvm::srv.
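The sizes seen in lsblk follow directly from that 70% / remainder rule. Below is a minimal Python sketch of the arithmetic, not the actual labs_lvm logic: the function name `split_ephemeral` is made up for illustration, the roughly 61 G of free space on the old disk80 layout is an assumption inferred from 42.7 G + 18.3 G, and real LVM allocation rounds to extents, so actual numbers can differ slightly.

```python
def split_ephemeral(free_gib: float, docker_pct: float = 70.0) -> tuple[float, float]:
    """Return (docker_gib, srv_gib) for a volume group with free_gib of free space.

    Docker takes docker_pct percent of the free space first; /srv then takes
    everything that is left (the 100%FREE step).
    """
    docker = round(free_gib * docker_pct / 100, 1)
    srv = round(free_gib - docker, 1)
    return docker, srv


# The 61 G figure for the old layout is an assumption, see the note above.
for label, free in [("old disk80 (assumed 61G free)", 61),
                    ("ephemeral40", 40),
                    ("ephemeral60", 60)]:
    docker, srv = split_ephemeral(free)
    print(f"{label}: /var/lib/docker={docker}G /srv={srv}G")
```

With the 70% rule kept as is, a 60 G ephemeral disk would give /srv only 18 G, which is still below the 21 G minimum mentioned above.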
The current flavor is g3.cores8.ram24.disk20.ephemeral40.4xiops; we would need a new one providing 60G of ephemeral disk. I am guessing we can split it as 25G for Docker and 35G for /srv.
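A quick way to sanity-check candidate layouts against the numbers quoted in this task is sketched below in Python. The thresholds come from the comments above (14 G peak Docker usage, 21 G minimum for /srv); the helper name `fits` and the layout labels are hypothetical, and this is only a back-of-the-envelope check, not something Puppet runs.

```python
DOCKER_PEAK_GIB = 14  # top usage reported by `docker system df`
SRV_MIN_GIB = 21      # stated bare minimum for /srv


def fits(docker_gib: float, srv_gib: float) -> bool:
    """True when both volumes meet the minimums quoted in the task."""
    return docker_gib >= DOCKER_PEAK_GIB and srv_gib >= SRV_MIN_GIB


# (docker, srv) sizes in G for each candidate layout
layouts = {
    "ephemeral40, 70% docker / rest srv": (28, 12),
    "ephemeral60, 70% docker / rest srv": (42, 18),
    "ephemeral60, fixed 25G docker / rest srv": (25, 35),
}

for name, (docker, srv) in layouts.items():
    verdict = "OK" if fits(docker, srv) else "too small"
    print(f"{name}: {verdict}")
```

Only the fixed split passes both checks, which is why a bigger flavor alone is not enough while Docker keeps taking 70% of the free space.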
Alternatively for each instance we could create a docker and srv volume via https://horizon.wikimedia.org/project/volumes/ and attach them to the instance when we create it. That might require a bit of puppet work but that does not sound like the end of the world :]
With https://gerrit.wikimedia.org/r/c/operations/puppet/+/755713 (applied on integration project), /var/lib/docker is now set to 24G and I have recreated the partitions on all agents.
integration-agent-docker-1022 (Bullseye) would need to be switched to a flavor with more disk (either an 80G disk or 20G + 60G ephemeral).
Mentioned in SAL (#wikimedia-releng) [2022-01-26T16:45:42Z] <hashar> integration: creating integration-agent-docker-1023 based on buster with new flavor g3.cores8.ram24.disk20.ephemeral60.4xiops # T290783
Thank you @Krinkle for the work on T277078 and @Jdforrester-WMF for creating the first instance. I have successfully pooled a new Bullseye instance using the ephemeral disk layout.
I think we can drop integration-agent-docker-1022 since its disk is too small?
It is done. Assigning to @Jdforrester-WMF, who did all the work; I just got a WMCS flavor with slightly more disk. Thank you!