I'm seeing this in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/117038/console, which caused the job to fail.
The agent ran out of disk space following the releases of new Quibble images. The images piled up in Docker; I had them cleaned up last week after it was reported over IRC.
The build ran on integration-agent-docker-1011. I checked the disk usage via https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?viewPanel=18&orgId=1&var-project=integration&var-server=integration-agent-docker-1011&from=now-3h&to=now . /var/lib/docker apparently went from 19GB of free disk down to nothing.
$ df -h /var/lib/docker
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vd-docker   42G   22G   19G  54% /var/lib/docker

$ sudo docker system df
TYPE           TOTAL  ACTIVE  SIZE     RECLAIMABLE
Images         43     1       14.2GB   13.06GB (91%)
Containers     1      1       4.874GB  0B (0%)
Local Volumes  4      0       0B       0B
Build Cache    0      0       0B       0B
There was a stalled Quibble container using ~5GB of disk. There were also a few volumes, though they were rather small.
I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.
I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.
That sounds reasonable to me.
We currently do a fancy version of that in the maintenance-disconnect-full-disks job.
In this instance, the job left integration-agent-docker-1009 alone at 83% full, since the cleanup threshold is 85% disk utilization.
That could be adjusted if we're hitting the limit. By the next run we were only using 57% of the disk, since the Docker runs in progress had all finished.
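A periodic cleanup along the lines discussed above could look like this sketch (a hypothetical wrapper, not the production Puppet code; the daily/weekly split is an assumption). It is wrapped in a dry-run guard so it can be previewed safely:

```shell
#!/bin/sh
# Sketch of a periodic Docker cleanup (illustrative, not the real job).
# With DRY_RUN=1 (the default here) it only prints what it would run.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Frequently: drop dangling images, stopped containers and unused volumes.
run docker system prune --force --volumes
# Less often: also drop *all* unused images, since they tend to pile up.
run docker system prune --force --all
```

Flipping DRY_RUN=0 would execute the prune commands for real.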
Change 731840 had a related patch set uploaded (by Hashar; author: Hashar):
[operations/puppet@production] contint: regularly prune docker material
I have sent a series of patches to prune the images on a daily basis:
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/731838 systemd::timer: spec coverage for splay parameter
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/731839 systemd::timer::job: add support for splay
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/731840 contint: regularly prune docker material
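For illustration, the "splay" added in those patches maps onto systemd's RandomizedDelaySec= timer setting. A minimal unit along those lines might look like the following (the real units are generated by Puppet's systemd::timer::job, so the names, schedule and delay here are assumptions; the file is written to /tmp just to show the shape):

```shell
# Hypothetical sketch of the daily prune timer; values are illustrative.
cat > /tmp/docker-system-prune-dangling.timer <<'EOF'
[Unit]
Description=Periodic execution of docker-system-prune-dangling.service

[Timer]
# Run once a day, early in the morning
OnCalendar=*-*-* 03:00:00
# "splay": add a random delay so agents do not all prune at the same second
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
EOF
grep -c 'RandomizedDelaySec' /tmp/docker-system-prune-dangling.timer
```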
Change 731840 merged by Jbond:
[operations/puppet@production] contint: regularly prune docker material
The timers are up and running:
NEXT                         LEFT         LAST                         PASSED       UNIT                                ACTIVATES
Thu 2021-10-21 03:14:24 UTC  20h left     Wed 2021-10-20 03:45:26 UTC  3h 9min ago  docker-system-prune-dangling.timer  docker-system-prune-dangling.service
Sun 2021-10-24 03:01:32 UTC  3 days left  n/a                          n/a          docker-system-prune-all.timer       docker-system-prune-all.service

-- Logs begin at Wed 2021-10-13 14:50:07 UTC, end at Wed 2021-10-20 06:54:47 UTC. --
Oct 19 08:15:11 integration-agent-docker-1002 systemd[1]: docker-system-prune-dangling.timer: Adding 33min 44.821881s random time.
Oct 19 08:15:11 integration-agent-docker-1002 systemd[1]: Started Periodic execution of docker-system-prune-dangling.service.
Oct 20 03:45:26 integration-agent-docker-1002 systemd[1]: docker-system-prune-dangling.timer: Adding 14min 24.070205s random time.
The command works:
-- Logs begin at Wed 2021-10-13 14:50:07 UTC, end at Wed 2021-10-20 06:54:53 UTC. --
Oct 20 03:45:26 integration-agent-docker-1002 systemd[1]: Started Prune dangling Docker images.
Oct 20 03:45:26 integration-agent-docker-1002 docker[26823]: Total reclaimed space: 0B
Oops, sorry @thcipriani, I missed your comment. Looks like those Docker steps are now superseded by the systemd timers and can probably be cleaned up.
Just saw this again today in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/119339/console, fwiw.
Change 739515 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] jjb: normalize jobs to have /src on the host
Change 739515 merged by jenkins-bot:
[integration/config@master] jjb: normalize jobs to have /src on the host
Change 739517 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] jjb: normalize Quibble jobs to use /src
Change 739517 merged by jenkins-bot:
[integration/config@master] jjb: normalize Quibble jobs to use /src
@hashar The error started happening pretty consistently around 6AM UTC, so those patches might have made something worse.
TL;DR: a Wikibase patch causes some job workspaces to fill 7GB+ of disk space. We run on an 18G partition, so if there are too many builds in parallel the disk fills up and builds fail.
The issue is definitely still going on. On one of the instances I spotted:
7220 ./wmf-quibble-apache-selenium-php72-docker
That is 7G for one build out of an 18G /srv partition.
Breakdown
 118  wmf-quibble-selenium-php72-docker/src/vendor/.git
 140  wmf-quibble-selenium-php72-docker/src/extensions/UniversalLanguageSelector
 148  wmf-quibble-selenium-php72-docker/src/vendor/wikimedia
 197  wmf-quibble-selenium-php72-docker/src/extensions/Echo
 213  wmf-quibble-selenium-php72-docker/src/extensions/Cite
 223  wmf-quibble-selenium-php72-docker/src/node_modules
 247  wmf-quibble-selenium-php72-docker/src/extensions/ProofreadPage
 249  wmf-quibble-selenium-php72-docker/src/extensions/FileImporter
 281  wmf-quibble-selenium-php72-docker/src/extensions/AbuseFilter
 311  wmf-quibble-selenium-php72-docker/src/extensions/GrowthExperiments
 334  wmf-quibble-selenium-php72-docker/src/vendor
 355  wmf-quibble-selenium-php72-docker/src/.git/objects
 356  wmf-quibble-selenium-php72-docker/src/.git
 383  wmf-quibble-selenium-php72-docker/src/extensions/VisualEditor
 424  wmf-quibble-selenium-php72-docker/cache/npm
 424  wmf-quibble-selenium-php72-docker/cache/npm/_cacache
 444  wmf-quibble-selenium-php72-docker/cache
 701  wmf-quibble-selenium-php72-docker/src/extensions/MobileFrontend
2538  wmf-quibble-selenium-php72-docker/src/extensions/Wikibase
5655  wmf-quibble-selenium-php72-docker/src/extensions
6736  wmf-quibble-selenium-php72-docker/src
7210  wmf-quibble-selenium-php72-docker/
Notably
1016  wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/view/lib
1042  wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/client/data-bridge

Which could theoretically be moved to other jobs (that is T287582).
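A breakdown like the one above can be produced with du. Here is a self-contained sketch against a throwaway directory (the real workspace path on the agent differs; /tmp/ws-demo and the file sizes are made up for the demonstration):

```shell
# Build a tiny fake workspace so the command can be demonstrated anywhere.
mkdir -p /tmp/ws-demo/src/extensions/Wikibase
dd if=/dev/zero of=/tmp/ws-demo/src/extensions/Wikibase/blob bs=1M count=3 2>/dev/null

# Per-directory usage in MiB, biggest last -- same shape as the listing above.
du -m --max-depth=3 /tmp/ws-demo | sort -n
```

On the agents, pointing this at the Jenkins workspace root gives the per-extension numbers quoted in this task.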
I uploaded a change yesterday that increases the size of extensions/Wikibase/view/lib/wikibase-tainted-ref/dist/tainted-ref.common.js from 86k to 291k bytes; could that have broken other builds as well?
Probably not, that is just 210KB more. If we could get Wikibase's view/lib and client/data-bridge split out into their own job, that would help a bit, since the 1.8G of node modules would no longer have to be installed in the wmf-quibble jobs. I gave it a try but hit a wall with some test failures and haven't revisited it since.
The issue is that a Wikibase change causes the wmf-quibble job to use ~7.2 GB of disk space, and if 3 such builds happen to run on the same instance we overflow the 18GB partition. So I guess I will look at making the partition on the Jenkins agents larger than the 18G it currently offers :]
Hey, looking at some other runs I can see jobs failing left and right for other repos too, but I guess this might be because you are currently working on this.
In terms of people pushing to Wikibase, I think this is now blocking development. Is there anything we can do to assist? Do you think picking up T287582: Move some Wikibase selenium tests to a standalone job now is a good idea to mitigate this?
@hashar WMDE appreciates any magic changes/boosts to the infrastructure that you could make to improve the situation short term. It seems other extensions also suffer from it; I hope (and am certain as well) that you'll find a way to unblock the problem without the solution being that everyone just works slower. If there's anything we can do from our side to support resolving this, please shout direction east!
@hashar I am also thinking: when the dust settles, would it make sense to attempt to estimate the current limitations of the CI infrastructure, as we clearly have non-rare cases of massively resource-consuming builds? I am imagining counting how many BIG jobs can run in parallel, how many MediaWiki core patches mean the CI queue will be stuck for more than 30 minutes, that kind of overview (my examples are just for illustration, I am not a CI-infra expert, but I am sure someone would come up with meaningful things to look at).
While this kind of overview wouldn't likely lead to any long-term improvement, it might help to identify some really critical weak spots of the CI infra and lead to some short-term improvements that would at least reduce the frequency of situations like the one we observed today.
@hashar WMDE appreciates any magic changes/boosts to the infrastructure that you could make to improve the situation short term. It seems other extensions also suffer from it; I hope (and am certain as well) that you'll find a way to unblock the problem without the solution being that everyone just works slower. If there's anything we can do from our side to support resolving this, please shout direction east!
Thank you for the kind words and encouragement! Wikibase's large node_modules is nothing new; that was definitely the case last week and months ago. I just highlighted one of the symptoms. It turns out that, before, the source code (held inside the containers under /workspace/src) was kept inside the Docker containers and thus written on the host to a 42GB partition, /var/lib/docker. That also causes Docker to take ages to delete the container when there is high I/O, but that is another topic.
The recent issue is due to https://gerrit.wikimedia.org/r/c/integration/config/+/739517 : it moved the source files, node modules, etc. to the CI agent's /srv partition, which is only 18G. It thus fills up way quicker and causes the outages. I will revert it.
@hashar I am also thinking: when the dust settles, would it make sense to attempt to estimate the current limitations of the CI infrastructure, as we clearly have non-rare cases of massively resource-consuming builds? I am imagining counting how many BIG jobs can run in parallel, how many MediaWiki core patches mean the CI queue will be stuck for more than 30 minutes, that kind of overview (my examples are just for illustration, I am not a CI-infra expert, but I am sure someone would come up with meaningful things to look at).
While this kind of overview wouldn't likely lead to any long-term improvement, it might help to identify some really critical weak spots of the CI infra and lead to some short-term improvements that would at least reduce the frequency of situations like the one we observed today.
That is well known and unfortunately we lack the people and infrastructure. It has been going on for quite a while and indeed has to be addressed, but it has been a bit challenging to prioritize those improvements over a lot of other things. Still, it is known and there is momentum to get improvements.
Change 739939 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] jjb: stop using host src for Quibble jobs
That should fix it, but it is too late for me to babysit that deployment right now (11pm). Will do it tomorrow.
Change 739939 merged by jenkins-bot:
[integration/config@master] jjb: stop using host src for Quibble jobs
Mentioned in SAL (#wikimedia-releng) [2021-11-18T22:14:48Z] <hashar> Updated Quibble jobs so that they no longer fill the /srv/ partition https://gerrit.wikimedia.org/r/739939 # T292729
This happened again today:
https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/123780/console
The job ran on integration-agent-docker-1008. By the time I logged in to check on it, its disk usage was OK:

dancy@integration-agent-docker-1008:~$ df -h -t ext4
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda2                            19G  3.5G   15G  20% /
/dev/mapper/vd-second--local--disk   18G  1.5G   16G   9% /srv
/dev/mapper/vd-docker                42G   16G   25G  39% /var/lib/docker
So, in short, a build of wmf-quibble-selenium-php72-docker uses ~7.2 GB, and since we allow up to 3 concurrent builds on the host we require 21.6G of disk, which does not fit in the 18G /srv partition.
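The arithmetic above can be sketched as a small shell check (numbers taken from this comment; the script itself is just illustrative):

```shell
# Do N concurrent builds of a given footprint fit on the partition?
per_build_gb=8     # ~7.2 GB per wmf-quibble-selenium-php72-docker build, rounded up
max_builds=3       # concurrent builds allowed per agent
partition_gb=18    # size of the /srv partition

needed_gb=$((per_build_gb * max_builds))
if [ "$needed_gb" -gt "$partition_gb" ]; then
    echo "overflow: need ${needed_gb}G, /srv has ${partition_gb}G"
fi
```

With these numbers it prints an overflow, matching what the agents hit in practice.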
ef4e32f15c121c456b56bb395ce19c8af5470b86 on Nov 18th 2021 made the Quibble jobs write inside the Docker containers again, thus filling the larger /var/lib/docker partition, where they fit (it is a 42G one).
We have to rebuild all the agents to move off Stretch (T290783), so I guess we will go with instances having a larger disk so we can write the build data to /srv.
The immediate fix has been to clean up the disk space. In the end we need more disk space; I have added some figures at T290783#7622515. That will be included in a new flavor with a larger disk, which I guess will be applied as instances are moved from Debian Stretch to Bullseye (T252071).
Marking this specific one resolved since we have other tasks covering the long term fix.
Grblmbl. I have just completed the partition shuffle via https://gerrit.wikimedia.org/r/c/operations/puppet/+/755713/ so hosts have:
/var/lib/docker | 24G (was 42G)
/srv            | 37G (was 18G)
Change 755713 had a related patch set uploaded (by Hashar; author: Hashar):
[operations/puppet@production] ci: set Docker partition size explicitly
Change 755743 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] Revert "jjb: stop using host src for Quibble jobs"
Mentioned in SAL (#wikimedia-releng) [2022-01-20T18:07:44Z] <hashar> Updating Quibble jobs to have MediaWiki files written on the hosts /srv partition (38G) instead of inside the container which ends in /var/lib/docker (24G) https://gerrit.wikimedia.org/r/755743 # T292729
Change 755743 merged by jenkins-bot:
[integration/config@master] Revert "jjb: stop using host src for Quibble jobs"
Change 755713 merged by Alexandros Kosiaris:
[operations/puppet@production] ci: set Docker partition size explicitly