
Create first CI agent with the new disk system
Open, Needs Triage, Public

Description

https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup will probably have to be updated a fair bit to cope with cinder.

Event Timeline

OK, sitrep:

  • I created integration-agent-docker-1021 as a Bullseye instance, ran into issues, deleted it, and started again.

Note: new instances no longer get .eqiad.wmflabs hostnames; this one was integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Exec[prepare_cinder_volume_/srv] is already declared at (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74); cannot redeclare (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74) (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74, column: 17) (file: /etc/puppet/modules/profile/manifests/ci/slave/labs/common.pp, line: 23) on node integration-agent-docker-1022.integration.eqiad1.wikimedia.cloud

So more to do there. I'm stopping for the day.

If I have it right, the bulk of the work was done via T277078, which was to create a Bullseye-based image in order to benefit from a newer QEMU version. That ended up hitting the migration to Cinder / ephemeral disk, but it should be working now.

integration-agent-docker-1022 is still around, though it has been in a shutdown state since December: https://horizon.wikimedia.org/project/instances/52eedb5b-f450-4b73-9ad6-39e426eab5eb/

It uses the flavor g3.cores8.ram24.disk20.ephemeral40.4xiops which is the new standard.

Mentioned in SAL (#wikimedia-releng) [2022-01-14T14:59:12Z] <hashar> Starting VM integration-agent-docker-1022 which was in shutdown state since December and is Bullseye based # T290783

After bringing the instance back, it has Docker as shipped by Debian: docker.io 20.10.5+dfsg1-1+deb11u1, which sounds good.

For the disks:

lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   20G  0 disk 
├─sda1    8:1    0 19.9G  0 part /
├─sda14   8:14   0    3M  0 part 
└─sda15   8:15   0  124M  0 part /boot/efi
sdb       8:16   0   40G  0 disk

After running puppet:

lsblk
NAME                     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                        8:0    0   20G  0 disk 
├─sda1                     8:1    0 19.9G  0 part /
├─sda14                    8:14   0    3M  0 part 
└─sda15                    8:15   0  124M  0 part /boot/efi
sdb                        8:16   0   40G  0 disk 
├─vd-docker              254:0    0   28G  0 lvm  /var/lib/docker
└─vd-second--local--disk 254:1    0   12G  0 lvm  /srv

The partitions got created via:

(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[available-space-docker]/returns) executed successfully (corrective)
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-vd-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-mountpoint-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Mount[/var/lib/docker]/ensure) defined 'ensure' as 'mounted'
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Mount[/srv]/ensure) defined 'ensure' as 'mounted'
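For reference, the Puppet log above roughly corresponds to the following LVM operations. This is a hedged reconstruction: the exact commands labs_lvm::volume runs, the volume group name vd, and the ext4 filesystem are all assumptions inferred from the lsblk output, not the actual module code.

```shell
# Hypothetical equivalent of the labs_lvm::volume runs on /dev/sdb;
# the "vd" volume group name and ext4 are guesses from the lsblk output.
pvcreate /dev/sdb                            # register sdb as an LVM physical volume
vgcreate vd /dev/sdb                         # create the "vd" volume group on it
lvcreate -l 70%FREE -n docker vd             # ~28G logical volume for Docker
mkfs.ext4 /dev/vd/docker
mkdir -p /var/lib/docker
mount /dev/vd/docker /var/lib/docker
lvcreate -l 100%FREE -n second-local-disk vd # remaining ~12G for /srv
mkfs.ext4 /dev/vd/second-local-disk
mount /dev/vd/second-local-disk /srv
```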

Comparison of partitions:

Partition        Old disk.80   New disk20.ephemeral40
/                20 G          20 G
/var/lib/docker  42.7 G        28 G
/srv             18.3 G        12 G

What we found out via T292729 is that our builds are way larger nowadays, and the heavier ones write to /var/lib/docker because they do not fit in the 18.3 G /srv/.

We also prune Docker images every weekend; looking at docker system df this Friday, the top usage is at 14G. Thus the 28 G of the new flavor fits.
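For context, these are the two docker CLI subcommands involved (both are real commands; the exact flags used by the weekend cleanup job are an assumption):

```shell
# Show Docker's disk usage broken down by images, containers,
# local volumes, and build cache:
docker system df

# The weekend cleanup boils down to something like this
# (the actual flags used by the scheduled job may differ):
docker image prune --all --force
```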

For /srv/ we definitely need more than 12G: at the very least 21G, but I think 30G would give us more breathing room.

Or a total of 60G.

Puppet creates the docker volume using 70% of the free disk:

class profile::ci::dockervolume {
    labs_lvm::volume { 'docker':
        size => '70%FREE',
    }
}

After that, the remainder (100%FREE of what is left) is allocated to /srv by profile::labs::lvm::srv.
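The resulting sizes on the 40G ephemeral disk follow from plain arithmetic (a sketch of the allocation order, not actual LVM output): the docker volume takes 70% of the free space first, then /srv takes everything that remains.

```shell
# Plain arithmetic mirroring the two-pass percentage allocation
# on the 40G ephemeral disk; not LVM output.
total=40
docker=$(( total * 70 / 100 ))  # labs_lvm::volume 'docker' takes 70%FREE
srv=$(( total - docker ))       # profile::labs::lvm::srv takes 100%FREE of the rest
echo "docker=${docker}G srv=${srv}G"
```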

The current flavor is g3.cores8.ram24.disk20.ephemeral40.4xiops; we would need a new one providing 60G of ephemeral disk. I am guessing we can split that as 25G for Docker and 35G for /srv.

Alternatively, for each instance we could create a docker and a srv volume via https://horizon.wikimedia.org/project/volumes/ and attach them to the instance when we create it. That might require a bit of Puppet work, but that does not sound like the end of the world :]
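The Horizon steps above can also be done with the openstack CLI. This is a sketch only: the volume names and sizes are illustrative assumptions, not an agreed convention.

```shell
# Hypothetical Cinder volumes for one agent; names and sizes are
# illustrative, not an agreed convention.
openstack volume create --size 25 integration-agent-docker-1023-docker
openstack volume create --size 35 integration-agent-docker-1023-srv

# Attach them to the instance once it exists:
openstack server add volume integration-agent-docker-1023 \
    integration-agent-docker-1023-docker
openstack server add volume integration-agent-docker-1023 \
    integration-agent-docker-1023-srv
```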

With https://gerrit.wikimedia.org/r/c/operations/puppet/+/755713 (applied on integration project), /var/lib/docker is now set to 24G and I have recreated the partitions on all agents.

integration-agent-docker-1022 (Bullseye) would need to be switched to a flavor with more disk (either 80G, or 20G + 60G ephemeral).