
TAR_ENTRY_ERROR ENOSPC: no space left on device
Open, Needs Triage, Public

Event Timeline

hashar claimed this task.
hashar added a subscriber: hashar.

The agent ran out of disk space following the releases of Quibble images. The images piled up in Docker; I had them cleaned up last week after it was reported over IRC.

The build ran on integration-agent-docker-1011. I checked the disk usage via https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?viewPanel=18&orgId=1&var-project=integration&var-server=integration-agent-docker-1011&from=now-3h&to=now . /var/lib/docker went from 19GB of free disk down to nothing, apparently.

$ df -h /var/lib/docker
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vd-docker   42G   22G   19G  54% /var/lib/docker
$ sudo docker system df
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              43                  1                   14.2GB              13.06GB (91%)
Containers          1                   1                   4.874GB             0B (0%)
Local Volumes       4                   0                   0B                  0B
Build Cache         0                   0                   0B                  0B

There was a stalled Quibble container using ~5GB of disk. There were also a few volumes, though they were rather small.

I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.

I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.

That sounds reasonable to me.

I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.

We do currently do a fancy version of that in the maintenance-disconnect-full-disks job.

In this instance, the job left integration-agent-docker-1009 alone at 83% full since the cleanup threshold is 85% of disk utilization.

That could be adjusted if we're hitting the limit. By the next run we were only using 57% of the disk, since the Docker runs in progress had all finished.
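The threshold logic described above can be sketched as a small shell check (hypothetical sketch only; the real maintenance-disconnect-full-disks job is a Jenkins job, and the 85% figure is taken from the comment above):

```shell
#!/bin/sh
# Hypothetical sketch: only prune when disk usage crosses a threshold.
# The prune command is commented out so the check is safe to run anywhere.
THRESHOLD=85
# Default to / for portability; on the CI agents this would be /var/lib/docker.
path=${1:-/}
usage=$(df --output=pcent "$path" | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "disk at ${usage}%, would prune"
    # docker system prune --force --volumes
else
    echo "disk at ${usage}%, below ${THRESHOLD}%, leaving alone"
fi
```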

hashar changed the task status from Open to In Progress. Oct 18 2021, 9:11 PM

Change 731840 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: regularly prune docker material

https://gerrit.wikimedia.org/r/731840

I have sent a series of patches to prune the images on a daily basis:

Change 731840 merged by Jbond:

[operations/puppet@production] contint: regularly prune docker material

https://gerrit.wikimedia.org/r/731840

The timers are up and running:

systemctl list-timers '*docker*'
NEXT                         LEFT        LAST                         PASSED      UNIT                               ACTIVATES
Thu 2021-10-21 03:14:24 UTC  20h left    Wed 2021-10-20 03:45:26 UTC  3h 9min ago docker-system-prune-dangling.timer docker-system-prune-dangling.service
Sun 2021-10-24 03:01:32 UTC  3 days left n/a                          n/a         docker-system-prune-all.timer      docker-system-prune-all.service
journalctl -u docker-system-prune-dangling.timer
-- Logs begin at Wed 2021-10-13 14:50:07 UTC, end at Wed 2021-10-20 06:54:47 UTC. --
Oct 19 08:15:11 integration-agent-docker-1002 systemd[1]: docker-system-prune-dangling.timer: Adding 33min 44.821881s random time.
Oct 19 08:15:11 integration-agent-docker-1002 systemd[1]: Started Periodic execution of docker-system-prune-dangling.service.
Oct 20 03:45:26 integration-agent-docker-1002 systemd[1]: docker-system-prune-dangling.timer: Adding 14min 24.070205s random time.
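The "Adding NNmin ... random time" lines come from a randomized delay on the timer. A unit of roughly this shape would produce them (an illustrative sketch only; the actual puppet-managed unit may differ in schedule and naming):

```shell
# Illustrative sketch: write a timer unit similar in shape to the one above.
# RandomizedDelaySec is what produces the "Adding NNmin ... random time" lines.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Periodic execution of docker-system-prune-dangling.service

[Timer]
OnCalendar=daily
RandomizedDelaySec=3600

[Install]
WantedBy=timers.target
EOF
grep -c 'RandomizedDelaySec' "$unit"   # prints 1
rm -f "$unit"
```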

The command works:

journalctl -u docker-system-prune-dangling
-- Logs begin at Wed 2021-10-13 14:50:07 UTC, end at Wed 2021-10-20 06:54:53 UTC. --
Oct 20 03:45:26 integration-agent-docker-1002 systemd[1]: Started Prune dangling Docker images.
Oct 20 03:45:26 integration-agent-docker-1002 docker[26823]: Total reclaimed space: 0B

I guess we should run docker system prune --force --volumes regularly, possibly with --all to get rid of every image as well, since they tend to pile up.

We do currently do a fancy version of that in the maintenance-disconnect-full-disks job.

In this instance, the job left integration-agent-docker-1009 alone at 83% full since the cleanup threshold is 85% of disk utilization.

That could be adjusted if we're hitting the limit. By the next run we were only using 57% of the disk, since the Docker runs in progress had all finished.

Oops, sorry @thcipriani, I missed your comment. Looks like those Docker steps are now superseded by the systemd timer and can probably be cleaned up.

Change 739515 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: normalize jobs to have /src on the host

https://gerrit.wikimedia.org/r/739515

Change 739515 merged by jenkins-bot:

[integration/config@master] jjb: normalize jobs to have /src on the host

https://gerrit.wikimedia.org/r/739515

Change 739517 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: normalize Quibble jobs to use /src

https://gerrit.wikimedia.org/r/739517

Change 739517 merged by jenkins-bot:

[integration/config@master] jjb: normalize Quibble jobs to use /src

https://gerrit.wikimedia.org/r/739517

@hashar The error started happening pretty consistently around 6AM UTC, so those patches might have made something worse.

Tgr triaged this task as Unbreak Now! priority. Wed, Nov 17, 11:49 PM

TL;DR: a Wikibase patch causes some job workspaces to fill 7GB+ of disk space. We run on an 18G partition, so if too many builds run in parallel the disk fills up and builds fail.

The issue is definitely still going on. On one of the instances I spotted:

7220	./wmf-quibble-apache-selenium-php72-docker

That is 7G for one build out of an 18G /srv partition.

Breakdown

sudo du -m -d3 wmf-quibble-selenium-php72-docker/|sort -n
118	wmf-quibble-selenium-php72-docker/src/vendor/.git
140	wmf-quibble-selenium-php72-docker/src/extensions/UniversalLanguageSelector
148	wmf-quibble-selenium-php72-docker/src/vendor/wikimedia
197	wmf-quibble-selenium-php72-docker/src/extensions/Echo
213	wmf-quibble-selenium-php72-docker/src/extensions/Cite
223	wmf-quibble-selenium-php72-docker/src/node_modules
247	wmf-quibble-selenium-php72-docker/src/extensions/ProofreadPage
249	wmf-quibble-selenium-php72-docker/src/extensions/FileImporter
281	wmf-quibble-selenium-php72-docker/src/extensions/AbuseFilter
311	wmf-quibble-selenium-php72-docker/src/extensions/GrowthExperiments
334	wmf-quibble-selenium-php72-docker/src/vendor
355	wmf-quibble-selenium-php72-docker/src/.git/objects
356	wmf-quibble-selenium-php72-docker/src/.git
383	wmf-quibble-selenium-php72-docker/src/extensions/VisualEditor
424	wmf-quibble-selenium-php72-docker/cache/npm
424	wmf-quibble-selenium-php72-docker/cache/npm/_cacache
444	wmf-quibble-selenium-php72-docker/cache
701	wmf-quibble-selenium-php72-docker/src/extensions/MobileFrontend
2538	wmf-quibble-selenium-php72-docker/src/extensions/Wikibase
5655	wmf-quibble-selenium-php72-docker/src/extensions
6736	wmf-quibble-selenium-php72-docker/src
7210	wmf-quibble-selenium-php72-docker/

Notably

1016	wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/view/lib
1042	wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/client/data-bridge

Which can theoretically be moved to other jobs (that is T287582)
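The breakdown above is just du sorted numerically, so the biggest offender ends up near the bottom. Here is a self-contained sketch of the same triage (it uses a throwaway directory, since the real workspace paths only exist on the CI agents):

```shell
# Build a throwaway "workspace" and find its biggest subtrees, as above.
ws=$(mktemp -d)
mkdir -p "$ws/src/extensions/Wikibase" "$ws/cache/npm"
dd if=/dev/zero of="$ws/src/extensions/Wikibase/blob" bs=1M count=3 2>/dev/null
dd if=/dev/zero of="$ws/cache/npm/blob" bs=1M count=1 2>/dev/null
# -m: sizes in MB; -d 3: three levels deep; sort -n puts the largest last
du -m -d 3 "$ws" | sort -n
rm -rf "$ws"
```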

I uploaded a change yesterday that increases the size of extensions/Wikibase/view/lib/wikibase-tainted-ref/dist/tainted-ref.common.js from 86 KB to 291 KB; could that have broken other builds as well?

Probably not, that is just ~210 KB more. If we could split Wikibase's view/lib and client/data-bridge into their own job, that would help a bit, since the 1.8G of node modules would no longer have to be installed in the wmf-quibble jobs. I gave it a try but hit a wall with some test failures and haven't revisited it since.

The issue is that a Wikibase change causes the wmf-quibble job to use ~7.2 GB of disk space, and if 3 such builds happen to run on the same instance we overflow the 18GB partition. So I guess I will look at getting the partition on the Jenkins agents to be larger than the 18G it currently offers :]
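The arithmetic behind that failure mode is simple enough to check:

```shell
# Three concurrent ~7.2 GB workspaces against an 18 GB /srv partition.
per_build_mb=7200
builds=3
partition_mb=$(( 18 * 1024 ))   # 18 GB expressed in MB
needed_mb=$(( per_build_mb * builds ))
echo "need ${needed_mb} MB, have ${partition_mb} MB"
if [ "$needed_mb" -gt "$partition_mb" ]; then
    echo "overflow: short by $(( needed_mb - partition_mb )) MB"
fi
# prints: need 21600 MB, have 18432 MB
#         overflow: short by 3168 MB
```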

TL;DR: a Wikibase patch causes some job workspaces to fill 7GB+ of disk space. We run on an 18G partition, so if too many builds run in parallel the disk fills up and builds fail.

1016	wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/view/lib
1042	wmf-quibble-selenium-php72-docker/src/extensions/Wikibase/client/data-bridge

Which can theoretically be moved to other jobs (that is T287582)

Hey, looking at some other runs I can see jobs failing a bit left and right for other repos too, but I guess this might be because you are currently working on this.

I think that in terms of people pushing to Wikibase this is now blocking development; is there anything we can do to assist? Do you think picking up T287582: Move some Wikibase selenium tests to a standalone job now is a good idea to mitigate this?

@hashar WMDE appreciates any magic changes/boosts to the infrastructure that you could make to improve the situation short term. It seems other extensions also suffer from it; I hope (and am certain) that you'll find a way to resolve the problem without the solution being that everyone just works slower. If there's anything we can do from our side to support resolving this, please shout direction east!

@hashar I am also thinking: when the dust settles, would it make sense to try to estimate the current limitations of the CI infrastructure, since we clearly have non-rare cases of massively resource-consuming builds? I am imagining counting how many BIG jobs can run in parallel, how many MediaWiki core patches mean the CI queue will be stuck for more than 30 minutes, that kind of overview (my examples are just for illustration; I am not a CI-infra expert, but surely someone would come up with meaningful things to look at).
While this kind of overview wouldn't likely lead to any long-term improvement by itself, it might help identify some really critical weak spots of the current CI infra and lead to short-term improvements that would at least reduce the frequency of situations like the one we observed today.

@hashar WMDE appreciates any magic changes/boosts to the infrastructure that you could make to improve the situation short term. It seems other extensions also suffer from it; I hope (and am certain) that you'll find a way to resolve the problem without the solution being that everyone just works slower. If there's anything we can do from our side to support resolving this, please shout direction east!

Thank you for the kind words and encouragement! Wikibase's large node_modules is nothing new; that was just as true last week and months ago. I only highlighted one of the symptoms. It turns out that previously the source code (held inside the containers under /workspace/src) was stored inside the Docker containers, and thus written on the host to the 42GB /var/lib/docker partition. That also causes Docker to take ages to delete the container when there is high I/O, but that is another topic.

The recent issue is due to https://gerrit.wikimedia.org/r/c/integration/config/+/739517 : it moved the source files / node modules etc. to the CI agent's /srv partition, which is only 18G. It thus fills up much faster and causes the outages. I will revert it.

@hashar I am also thinking: when the dust settles, would it make sense to try to estimate the current limitations of the CI infrastructure, since we clearly have non-rare cases of massively resource-consuming builds? I am imagining counting how many BIG jobs can run in parallel, how many MediaWiki core patches mean the CI queue will be stuck for more than 30 minutes, that kind of overview (my examples are just for illustration; I am not a CI-infra expert, but surely someone would come up with meaningful things to look at).
While this kind of overview wouldn't likely lead to any long-term improvement by itself, it might help identify some really critical weak spots of the current CI infra and lead to short-term improvements that would at least reduce the frequency of situations like the one we observed today.

That is well known, but unfortunately we lack the people and infrastructure. It has been going on for quite a while and indeed has to be addressed, though it has been challenging to prioritize those improvements over a lot of other things. Still, it is known and there is momentum to get improvements.

Change 739939 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: stop using host src for Quibble jobs

https://gerrit.wikimedia.org/r/739939

That should fix it, but it is too late for me to babysit that deployment right now (11pm). Will do it tomorrow.

Change 739939 merged by jenkins-bot:

[integration/config@master] jjb: stop using host src for Quibble jobs

https://gerrit.wikimedia.org/r/739939

Mentioned in SAL (#wikimedia-releng) [2021-11-18T22:14:48Z] <hashar> Updated Quibble jobs so that they no more fil the /srv/ partition https://gerrit.wikimedia.org/r/739939 # T292729

Tgr lowered the priority of this task from Unbreak Now! to Needs Triage. Thu, Nov 18, 11:06 PM

This happened again today:
https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/123780/console
The job ran on integration-agent-docker-1008. By the time I logged in to check on it, its disk usage was OK:

dancy@integration-agent-docker-1008:~$ df -h -t ext4
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda2                            19G  3.5G   15G  20% /
/dev/mapper/vd-second--local--disk   18G  1.5G   16G   9% /srv
/dev/mapper/vd-docker                42G   16G   25G  39% /var/lib/docker