
Rebuild WMCS integration instances to larger flavor
Closed, ResolvedPublic

Description

The WMCS instances for the integration project are having a full /srv partition more and more often. The maintenance-disconnect-full-disks job unpools the instances, which eventually recover once the builds have completed, but the builds running at the time do fail.

Analysis

MatmaRex looked at the #wikimedia-releng IRC channel logs to count the messages originating from Jenkins (logged as wmf-insecte) in which maintenance-disconnect-full-disks complains, which indicates how frequent the error is. The raw data are in P49465 and the rendering:

(image: chart rendering of the message counts from P49465)
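For illustration, a rough sketch of how such a count could be reproduced from plain-text channel logs (the log location and line format below are assumptions, not necessarily the method MatmaRex used):

# Count, per day, how often wmf-insecte relayed a maintenance-disconnect-full-disks
# message in #wikimedia-releng; assumes the date is the first field of each log line.
grep -h 'wmf-insecte.*maintenance-disconnect-full-disks' /path/to/wikimedia-releng/*.log \
  | awk '{ count[$1]++ } END { for (day in count) print day, count[day] }' \
  | sort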

Previous analyses of the issue are in T338627#8922715 & T338317#8909563 and are summarized below:

The instances have a 36GB /srv. 1.8G is consumed by git mirrors and roughly 200MB by the Jenkins agent, for a total of 2GB, which leaves 34GB.

The Jenkins agent allows up to 3 concurrent builds. In all cases I have investigated they ran wmf-quibble* jobs (most probably the selenium variants) and the workspace for a build was ~11GB. With 3 concurrent builds that is 33GB, which together with some other consumption fills the partition.

Moreover, I have sometimes witnessed the 24G /var/lib/docker being full, which comes from T338317#8909563: machinelearning/liftwing/inference-services introduces a 13G layer in the Docker BuildKit cache, which overflows the 24G partition.

Needless to say we are short on disk.
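As a side note, the numbers above can be re-checked on any agent with something like the following (the Jenkins workspace path is an assumption):

# Overall usage of the two LVs carved out of the ephemeral disk
df -h /srv /var/lib/docker
# Size of each build workspace (path assumed)
sudo du -sh /srv/jenkins/workspace/* 2>/dev/null | sort -h | tail
# Docker images and build cache usage
sudo docker system df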

The list of instances is publicly accessible via https://openstack-browser.toolforge.org/project/integration

Debian version

Instances are based on Debian Bullseye and I don't think we should upgrade to Bookworm right now (different Java, different Docker, different kernel, different versions of pretty much every library, etc.).

Disk space flavor

The flavor is g3.cores8.ram24.disk20.ephemeral60.4xiops, with the following partitioning:

NAME                     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                        8:0    0   20G  0 disk 
├─sda1                     8:1    0 19.9G  0 part /
├─sda14                    8:14   0    3M  0 part 
└─sda15                    8:15   0  124M  0 part /boot/efi
sdb                        8:16   0   60G  0 disk 
├─vd-docker              254:0    0   24G  0 lvm  /var/lib/docker
└─vd-second--local--disk 254:1    0   36G  0 lvm  /srv

Thus:

  • disk20 is the 20G for the system on /
  • ephemeral60 is 60G split between:
    • 36G on /srv for Jenkins workspaces and git mirrors
    • 24G on /var/lib/docker for Docker and its build cache (an LVM sketch of this layout follows below)
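For reference, a minimal hand-rolled LVM sketch of that split (in practice Puppet sets this up; the volume group name vd and the LV names come from the lsblk output above, and ext4 is assumed):

# Carve the 60G ephemeral disk (sdb) into the two LVs shown above.
sudo pvcreate /dev/sdb
sudo vgcreate vd /dev/sdb
sudo lvcreate -L 24G -n docker vd
sudo lvcreate -l 100%FREE -n second-local-disk vd
sudo mkfs.ext4 /dev/vd/docker
sudo mkfs.ext4 /dev/vd/second-local-disk
sudo mount /dev/vd/docker /var/lib/docker
sudo mount /dev/vd/second-local-disk /srv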

Goal

We need much more disk space. I am guessing:

  • the Jenkins area at /srv could be bumped to 50G (5G for git mirrors and 3 builds * 15GB = 45G)
  • roughly double the Docker cache to 45G

I am aiming at requesting 20G for the system and 90G ephemeral disk space.


Event Timeline

I have moved the bits about moving the builds to use tmpfs to a standalone task T340073.

The previous flavor g3.cores8.ram24.disk20.ephemeral60.4xiops was created as follows (from T299704#7652833):

aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack flavor create --ram 24576 --vcpus 8 --project integration --disk 20 --ephemeral 60 --private --property "aggregate_instance_extra_specs:ceph=true" --property "quota:disk_read_iops_sec=20000" --property "quota:disk_total_bytes_sec=800000000" --property "quota:disk_write_iops_sec=2000" --property "quota:disk_write_iops_sec_max=6000" --property " quota:disk_write_iops_sec_max_length=10" g3.cores8.ram24.disk20.ephemeral60.4xiops
[..]
| id                         | 7455701f-42bf-469c-9362-703eded22dc0  
hashar renamed this task from "Rebuild WMCS insegration instances to larger flavor" to "Rebuild WMCS integration instances to larger flavor". Jun 22 2023, 7:55 AM

Holding off on that: after some deeper investigation, the root cause for the instances overflowing their disk is the ever-growing npm cache:

hashar@integration-castor05:~$ du -m -s  /srv/castor/castor-mw-ext-and-skins/master/wmf-quibble*selenium*/npm/_cacache
5436	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm/_cacache
5822	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm/_cacache

Asking npm to garbage collect entries in the cache (npm cache verify) would shrink them to ~500M, which for 3 concurrent builds would save 4.5G * 3 = 13.5G and addresses the root cause. Filed as yet another task: T340092: Figure out how to garbage collect the npm cache.
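For illustration, a sketch of how that garbage collection could be run over the per-job caches listed above (an assumed invocation, not the solution worked out in T340092):

# npm expects the cache root, i.e. the directory containing _cacache.
for cacache in /srv/castor/castor-mw-ext-and-skins/master/wmf-quibble*selenium*/npm/_cacache; do
    npm cache verify --cache "$(dirname "$cacache")"
done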

hashar changed the task status from Open to Stalled. Jun 22 2023, 11:24 AM

Stalled pending the outcome of running npm cache verify to garbage collect the huge npm cache (T340092).

I think the issue got solved by shrinking the giant npm cache (T340092), saving us the pain of rebuilding all instances ;)

Reopening because installing pytorch generates a 14GB layer in the Docker BuildKit cache (T338317). It is certainly optimizable, but I guess we can benefit from a large Docker build cache regardless.

@hashar The integration project currently has its gigabytes quota at 400G; I'm not quite understanding how much more is desired. Could you post the amount that is desired in total (400 + the amount requested)? Thank you!

I filed it as a placeholder to request the quota while I was investigating the root cause of the instances filling their disks. I fixed one cause (T340092) but there is another that seems to definitely require more disk (T338317, pytorch uses a lot of disk).

The 400G quota is used by two volumes:

  • cache-storage, 320G, attached to integration-castor05. That is a volume so we can easily resize it without having to rebuild the instance, and so the cache can be kept when rebuilding.
  • train-dev-workspace, 50G (area for testing the MediaWiki train tooling)

From the task description I am aiming at raising the ephemeral disk from 60G to 90G which would let us bump:

  • /srv from 36G to 45G
  • /var/lib/docker from 24G to 45G

That is for 17 instances (integration-agent-docker-*), so an increase of 17 instances * (90G - 60G) = 510G of ephemeral storage. That does not count against the Volume storage quota though.

So I guess a new flavor: g3.cores8.ram24.disk20.ephemeral90.4xiops

Can we change the flavor of an existing instance or does it require a new one?

Oh I see, sorry I did not understand that you were seeking a new flavor. My newfound clarity brings a new line of questions. Why would attaching additional volumes not resolve the issue? For instance attaching a 45G volume and mounting it as /var/lib/docker ?

The volumes were introduced in 2021 when a new kind of Jenkins agent was created with an attached volume (T277078, but it is a long read). Around the same time we considered moving all the Jenkins agents to attached volumes. That was T290783, and eventually we decided to stick with ephemeral storage because:

  • we don't need to persist the data, the CI builds clean up automatically on completion anyway
  • a bunch of Puppet changes would have been required
  • some extra manual steps are needed (wikitech: Attachable block storage for cloud VPS)
  • managing 20+ volumes sounded like a lot of overhead

The drawback is that whenever we need to resize the ephemeral disk we have to ask for a new flavor, which seems to happen roughly every two years :)

All set!

root@cloudcontrol1005:~# export OS_PROJECT_ID=integration
root@cloudcontrol1005:~# openstack flavor create --ram 24576 --disk 20 --ephemeral 90 --vcpus 8 --property "aggregate_instance_extra_specs:ceph=true" --property "quota:disk_read_iops_sec=20000" --property "quota:disk_total_bytes_sec=800000000" --property "quota:disk_write_iops_sec=2000" --property "quota:disk_write_iops_sec_max=6000" g3.cores8.ram24.disk20.ephemeral90.4xiops
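For context, switching an existing agent over to the new flavor would presumably look something like this (the exact resize commands are not recorded in this task, so treat them as an assumption):

# Resize an existing agent to the new flavor, then confirm once it is back up.
export OS_PROJECT_ID=integration
openstack server resize --flavor g3.cores8.ram24.disk20.ephemeral90.4xiops integration-agent-docker-1039
openstack server resize confirm integration-agent-docker-1039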

I have changed the flavor of integration-agent-docker-1039, which instantly rebooted the instance. After the reboot, the kernel still sees sdb as a 60G disk:

[Thu Jun 29 17:27:03 2023] sd 2:0:0:1: [sdb] 125829120 512-byte logical blocks: (64.4 GB/60.0 GiB)

Same after reboot.

https://access.redhat.com/solutions/3081971#fn:1 is titled Instance cannot resize ephemeral disk in Red Hat OpenStack and is marked as Solution in progress, but it is behind a paywall so there are not many details. I am guessing OpenStack is unable to resize ephemeral disk space.

Essentially that means rebuilding the whole fleet of agents :-/

I have created a new instance, integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud, and did not have any sudo access on it. I recreated it and hit the same issue. Something is broken somewhere.

I have split the ephemeral disk resize failure to another task, T340825: OpenStack silently fail to resize an Ephemeral volume, in the hope that maybe we can fix the resizing and thus avoid rebuilding all instances.

The rebuilding of instances is blocked on T340814.

Change 934505 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: parameterize the docker lvm disk size

https://gerrit.wikimedia.org/r/934505

Summary

The existing instances can't be switched to a flavor with a larger ephemeral disk (60G to 90G) since that is not supported by OpenStack (T340825). Maybe they can be manually adjusted though.
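For completeness, a manual adjustment would presumably look something like the following, assuming the underlying sdb had actually been grown to 90G (per T340825 it was not) and assuming ext4 filesystems; the target sizes are the ones from this task:

# Hypothetical manual grow, only valid once sdb really is 90G.
sudo pvresize /dev/sdb                                  # let LVM see the extra space
sudo lvextend -L 45G /dev/vd/docker                     # Docker LV: 24G -> 45G
sudo lvextend -l +100%FREE /dev/vd/second-local-disk    # /srv LV gets the rest (~45G)
sudo resize2fs /dev/vd/docker
sudo resize2fs /dev/vd/second-local-disk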

The alternative is to create new instances and switch over to them (which is really time consuming). The instances fail to provision properly when using a standalone puppet master due to an invalid apt repository being provisioned on Bullseye; that is an issue with how we integrate with cloud-init: T340814

Change 934505 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] contint: parameterize the docker lvm disk size

https://gerrit.wikimedia.org/r/934505

Mentioned in SAL (#wikimedia-releng) [2023-07-04T13:49:30Z] <hashar> integration: pooled new Jenkins agents integration-agent-docker-1040 and integration-agent-docker-1041 with larger partitions for Docker and Jenkins workspace - T340070

Mentioned in SAL (#wikimedia-releng) [2023-07-04T14:48:27Z] <hashar> integration: pooled new Jenkins agents 1042, 1043, 1044, 1045, 1046 - T340070

I have rebuilt all instances with the flavor g3.cores8.ram24.disk20.ephemeral90.4xiops. Their hostnames are integration-agent-docker-[1040-1057] and I have pooled them all in Jenkins: https://integration.wikimedia.org/ci/computer/
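As a quick sanity check on any rebuilt agent, something like the following should show the 90G ephemeral disk and the two enlarged LVs (roughly 45G each, per the plan above):

# Expect sdb at 90G, with /var/lib/docker and /srv each around 45G.
lsblk /dev/sdb
df -h /srv /var/lib/docker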