Create first CI agent with the new disk system
Closed, Resolved, Public

Authored By

	Jdforrester-WMF
	Sep 11 2021, 12:04 AM

Description

https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup will probably have to be updated a fair bit to cope with cinder.

Related Objects

Status	Assigned	Task
Resolved	hashar	T292729 TAR_ENTRY_ERROR ENOSPC: no space left on device
Resolved	hashar	T252071 Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye
Resolved	Jdforrester-WMF	T290783 Create first CI agent with the new disk system
Resolved	• Bstorm	T277078 Support Cinder for CI docker workers
Resolved	aborrero	T299704 Request increased quota for integration Cloud VPS project

Event Timeline

Jdforrester-WMF created this task. Sep 11 2021, 12:04 AM

Jdforrester-WMF updated the task description.

OK, sitrep:

I created integration-agent-docker-1021 as a bullseye instance and ran into issues, so we deleted it and started again.

Note: New things don't get called .eqiad.wmflabs any more; it was integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud

I created integration-agent-docker-1022, pulled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/670524 on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/717732 on the puppet master, and got:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Exec[prepare_cinder_volume_/srv] is already declared at (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74); cannot redeclare (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74) (file: /etc/puppet/modules/cinderutils/manifests/ensure.pp, line: 74, column: 17) (file: /etc/puppet/modules/profile/manifests/ci/slave/labs/common.pp, line: 23) on node integration-agent-docker-1022.integration.eqiad1.wikimedia.cloud

So more to do there. I'm stopping for the day.

Jdforrester-WMF added a subtask: T277078: Support Cinder for CI docker workers. Sep 11 2021, 12:08 AM

Krinkle closed subtask T277078: Support Cinder for CI docker workers as Resolved. Oct 14 2021, 10:12 PM

If I understand correctly, the bulk of the work has been done via T277078. The goal was to create a Bullseye-based image in order to benefit from a newer Qemu version. That ended up running into the migration to Cinder / ephemeral disks, but it should be working now.

integration-agent-docker-1022 is still around, though it has been in shutdown state since December: https://horizon.wikimedia.org/project/instances/52eedb5b-f450-4b73-9ad6-39e426eab5eb/

It uses the flavor g3.cores8.ram24.disk20.ephemeral40.4xiops which is the new standard.

Mentioned in SAL (#wikimedia-releng) [2022-01-14T14:59:12Z] <hashar> Starting VM integration-agent-docker-1022 which was in shutdown state since December and is Bullseye based # T290783

After bringing the instance back, it has Docker as shipped by Debian: docker.io 20.10.5+dfsg1-1+deb11u1, which sounds good.

For the disks:

lsblk

NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   20G  0 disk 
├─sda1    8:1    0 19.9G  0 part /
├─sda14   8:14   0    3M  0 part 
└─sda15   8:15   0  124M  0 part /boot/efi
sdb       8:16   0   40G  0 disk

After running puppet:

lsblk

NAME                     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                        8:0    0   20G  0 disk 
├─sda1                     8:1    0 19.9G  0 part /
├─sda14                    8:14   0    3M  0 part 
└─sda15                    8:15   0  124M  0 part /boot/efi
sdb                        8:16   0   40G  0 disk 
├─vd-docker              254:0    0   28G  0 lvm  /var/lib/docker
└─vd-second--local--disk 254:1    0   12G  0 lvm  /srv

The partitions got created via:

(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[available-space-docker]/returns) executed successfully (corrective)
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-vd-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Exec[create-mountpoint-docker]/returns) executed successfully
(/Stage[main]/Profile::Ci::Dockervolume/Labs_lvm::Volume[docker]/Mount[/var/lib/docker]/ensure) defined 'ensure' as 'mounted'

(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[create-vd-second-local-disk]/returns) executed successfully
(/Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Mount[/srv]/ensure) defined 'ensure' as 'mounted'

hashar mentioned this in T292729: TAR_ENTRY_ERROR ENOSPC: no space left on device. Jan 14 2022, 3:29 PM

Comparison of partitions:

Partition	Old disk.80	New disk20.ephemeral40
/	20G	20G
/var/lib/docker	42.7G	28G
/srv	18.3G	12G

What we found out via T292729 is that our builds are much larger nowadays, and the heavier ones write to /var/lib/docker because they do not fit in the 18.3G /srv/.

We also prune Docker images every weekend. Looking at docker system df this Friday, the top usage is 14G, so the 28G of the new flavor is enough.

For /srv/ we definitely need more than 12G: at the very least 21G, but I think 30G would give us more breathing room.

In other words, roughly 60G in total.

Puppet creates the docker volume using 70% of the free disk:

class profile::ci::dockervolume {
    labs_lvm::volume { 'docker':
        size => '70%FREE',
    }
}

After that the rest (100%FREE) is allocated to /srv by profile::labs::lvm::srv.
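To make the arithmetic concrete, here is a minimal sketch of the 70%FREE then 100%FREE split. The helper name split_free_space is made up for illustration and is not part of labs_lvm. It also ignores LVM extent rounding, so real sizes can differ slightly. The 61G figure for the old flavor is an assumption, taken as the sum of the 42.7G and 18.3G volumes reported on the old agents.

```python
def split_free_space(total_gb, first_pct):
    """Sizes in GB when the first volume takes first_pct% of the free
    space and the second volume takes everything that is left.

    Hypothetical helper: it only mirrors the '70%FREE' / '100%FREE'
    arithmetic and does not call LVM.
    """
    first = total_gb * first_pct / 100
    rest = total_gb - first
    return round(first, 1), round(rest, 1)


# New flavor: 40G ephemeral disk, docker volume first, then /srv.
print(split_free_space(40, 70))

# Old flavor, assuming about 61G ended up in the volume group.
print(split_free_space(61, 70))
```

The first call yields 28G for /var/lib/docker and 12G for /srv, which matches the lsblk output on integration-agent-docker-1022.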

The current flavor is g3.cores8.ram24.disk20.ephemeral40.4xiops; we would need a new one providing 60G of ephemeral disk. I am guessing we could split that as 25G for Docker and 35G for /srv.
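As a rough sanity check of the candidate layouts, the sketch below compares each one against the numbers quoted in this task: about 14G of peak Docker usage and at least 21G for /srv. Treating those two figures as hard thresholds is an assumption, and the names DOCKER_PEAK_GB, SRV_MIN_GB and fits are invented for the example.

```python
# Thresholds in GB, taken from the figures quoted in this task.
# Using them as hard limits is an assumption made for the example.
DOCKER_PEAK_GB = 14  # top usage seen in `docker system df`
SRV_MIN_GB = 21      # "at the very least 21G" for /srv


def fits(docker_gb, srv_gb):
    """True when both volumes are at least as large as the thresholds."""
    return docker_gb >= DOCKER_PEAK_GB and srv_gb >= SRV_MIN_GB


# (docker, srv) sizes in GB for each candidate layout.
layouts = {
    "ephemeral40, 70%FREE split": (28, 12),
    "ephemeral60, 70%FREE split": (42, 18),
    "ephemeral60, 25G + 35G": (25, 35),
}

for name, (docker_gb, srv_gb) in layouts.items():
    verdict = "ok" if fits(docker_gb, srv_gb) else "too small"
    print(f"{name}: docker={docker_gb}G srv={srv_gb}G -> {verdict}")
```

Under these assumptions a bigger disk alone is not enough: keeping the 70%FREE rule on 60G would still leave /srv at 18G, which is why the split itself has to change.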

Alternatively for each instance we could create a docker and srv volume via https://horizon.wikimedia.org/project/volumes/ and attach them to the instance when we create it. That might require a bit of puppet work but that does not sound like the end of the world :]

hashar mentioned this in T252071: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye. Jan 20 2022, 10:55 AM

class profile::ci::dockervolume {
    labs_lvm::volume { 'docker':
        size => '70%FREE',
    }
}

With https://gerrit.wikimedia.org/r/c/operations/puppet/+/755713 (applied on integration project), /var/lib/docker is now set to 24G and I have recreated the partitions on all agents.

integration-agent-docker-1022 (Bullseye) would need to be switched to a flavor with more disk (either an 80G disk or 20G + 60G ephemeral).
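A small sketch of why the 40G ephemeral disk no longer works once /var/lib/docker is fixed at 24G. It assumes /srv keeps taking whatever is left (100%FREE) and reuses the 21G minimum for /srv mentioned earlier in this task. The function srv_size_gb is hypothetical.

```python
DOCKER_GB = 24   # fixed size of /var/lib/docker after the puppet change
SRV_MIN_GB = 21  # minimum wanted for /srv, as quoted earlier in this task


def srv_size_gb(lvm_total_gb):
    """Space left for /srv, assuming it takes all remaining free space."""
    return lvm_total_gb - DOCKER_GB


for ephemeral_gb in (40, 60):
    srv_gb = srv_size_gb(ephemeral_gb)
    verdict = "enough" if srv_gb >= SRV_MIN_GB else "not enough"
    print(f"ephemeral{ephemeral_gb}: /srv gets {srv_gb}G ({verdict})")
```

With 40G of ephemeral disk /srv would end up at 16G, while 60G leaves 36G.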

hashar added a subtask: T299704: Request increased quota for integration Cloud VPS project. Jan 20 2022, 8:52 PM

aborrero closed subtask T299704: Request increased quota for integration Cloud VPS project as Resolved. Jan 26 2022, 2:17 PM

Mentioned in SAL (#wikimedia-releng) [2022-01-26T16:45:42Z] <hashar> integration: creating integration-agent-docker-1023 based on buster with new flavor g3.cores8.ram24.disk20.ephemeral60.4xiops # T290783

Thank you @Krinkle for the work on T277078 and @Jdforrester-WMF for creating the first instance. I have successfully pooled a new Bullseye instance using the ephemeral disk layout.

I think we can drop integration-agent-docker-1022 since it has a disk too small?

In T290783#7653512, @hashar wrote:

I think we can drop integration-agent-docker-1022 since it has a disk too small?

Yes.

It is done. Assigning to @Jdforrester-WMF, who did all the work; I just got a WMCS flavor with slightly more disk. Thank you!

hashar mentioned this in T340070: Rebuild WMCS integration instances to larger flavor. Jun 29 2023, 1:54 PM
