
runner-1030.gitlab-runners.eqiad1.wikimedia.cloud out of space
Closed, ResolvedPublic

Description

Just got the error below:

Runner:

Runner: #642 (m4MQFjvT) runner-1030.gitlab-runners.eqiad1.wikimedia.cloud

Error:

ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
=================================== log end ====================================
ERROR: could not install deps [-rrequirements-test.txt]; v = InvocationError('/builds/repos/structured-data/image-suggestions/.tox/lint/bin/python -m pip install -rrequirements-test.txt', 1)
___________________________________ summary ____________________________________
ERROR:   lint: could not install deps [-rrequirements-test.txt]; v = InvocationError('/builds/repos/structured-data/image-suggestions/.tox/lint/bin/python -m pip install -rrequirements-test.txt', 1)

Example failed job: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/jobs/87652

Event Timeline

Usage right now in / is 40% and in /var/lib/docker it's 85%.

After running apt-get clean, usage in / is down to 27%.
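For reference, a quick way to check this (a sketch; apt-get clean itself needs root and simply empties the apt package cache):

```shell
# Print the usage percentage of the root filesystem; the cleanup command is
# shown commented out since it requires root.
usage=$(df -P / | awk 'NR==2 {print $5}')
echo "usage of /: $usage"
# sudo apt-get clean   # removes cached .deb files under /var/cache/apt/archives
```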

Seems like something already got cleaned up in the meantime.

Still a valid ticket, of course.

I wonder if this happens to be the runner from:

< dancy> We enabled another instance wide Gitlab Runner that accepts untagged jobs.  Let us (Release Engineering Team) know if you experience any issues.

I wonder if this happens to be the runner from:

< dancy> We enabled another instance wide Gitlab Runner that accepts untagged jobs.  Let us (Release Engineering Team) know if you experience any issues.

runner-1030.gitlab-runners.eqiad1.wikimedia.cloud is the existing WMCS instance-wide runner, not the new one.

Alright, thanks dancy!

So, the remaining space in /var/lib/docker is right now about 5.9GB.

The clear-docker-cache.timer ran 42 min ago.

The docker-gc.timer ran 3 min ago.

Manually starting both did not result in a change of used disk space.
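For anyone reproducing this, the timers can be inspected and fired manually roughly like so (a sketch, guarded so it is a no-op on hosts without systemd; the unit names are the ones mentioned above):

```shell
# Show the last/next run of the two cleanup timers; triggering the services
# by hand (commented out) needs root.
pattern='clear-docker-cache|docker-gc'
if command -v systemctl >/dev/null 2>&1; then
  systemctl list-timers --all | grep -E "$pattern" || true
fi
# sudo systemctl start clear-docker-cache.service docker-gc.service
```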

Seems like I just happened to run my pipeline in an unlucky window?

Does it make sense to consider running the cron jobs more frequently?

Change 904616 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab_runner: run clear-docker-cache every hour

https://gerrit.wikimedia.org/r/904616
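The patch itself lives in Puppet; in plain systemd terms, an hourly cadence amounts to something like this timer setting (a sketch, not the actual Puppet-managed unit):

```
[Timer]
OnCalendar=hourly
```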

You can get a view of the partitions' free space at https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=gitlab-runners&var-instance=runner-1030&viewPanel=41&from=now-7d&to=now

Last time I investigated a Gitlab runner being full, it was due to caches:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          13        7         5.864GB   762.5MB (13%)
Containers      15        6         66.75MB   38.61MB (57%)
Local Volumes   56        9         30.43GB   26.44GB (86%) <------- reclaimable
Build Cache     0         0         0B        0B

And sure enough, there are a few which are way too big:

df -m
...
306	/var/lib/docker/volumes/runner-m4mqfjvt-project-1187-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
737	/var/lib/docker/volumes/runner-m4mqfjvt-project-1014-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
1075	/var/lib/docker/volumes/runner-m4mqfjvt-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
1103	/var/lib/docker/volumes/runner-m4mqfjvt-project-1177-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
1410	/var/lib/docker/volumes/runner-m4mqfjvt-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
1557	/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
1649	/var/lib/docker/volumes/runner-m4mqfjvt-project-1187-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
1661	/var/lib/docker/volumes/runner-m4mqfjvt-project-828-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
1665	/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
1734	/var/lib/docker/volumes/runner-m4mqfjvt-project-1177-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
1817	/var/lib/docker/volumes/runner-m4mqfjvt-project-837-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
1890	/var/lib/docker/volumes/runner-m4mqfjvt-project-1187-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
2028	/var/lib/docker/volumes/runner-m4mqfjvt-project-1215-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
2149	/var/lib/docker/volumes/runner-m4mqfjvt-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
4146	/var/lib/docker/volumes/runner-m4mqfjvt-project-1187-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
4367	/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
4572	/var/lib/docker/volumes/runner-m4mqfjvt-project-1177-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8

I.e. that is the same issue we had at T291221#7369764.

Does it make sense to consider running the cron jobs more frequently?

Yes, I think it does. I uploaded a change suggesting to run it hourly instead of daily. Will wait for reviews by others, though.

I.e. that is the same issue we had at T291221#7369764.

Yes, looks like it indeed. The "clear-docker-cache" command that was manually run back then is the same one we now have in the timer above. But so far it only runs once daily.

As an immediate action, one can delete the volumes on the runner, and that solves this task.

Among the largest caches, two have a directory repos/mwbot-rs, another countcount/mwbot, and another repos/data-engineering/airflows-dag.

So that is https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot builds storing 4G caches, which should be filed as a separate task. Surely the caches should have an upper limit (if that is possible), and those two repositories should be investigated (I am pretty sure airflows-dag uses conda, which duplicates a large amount of binary packages, which is itself an issue).

(I am pretty sure airflows-dag uses conda, which duplicates a large amount of binary packages, which is itself an issue).

It does use conda. I can help change airflow-dag's behavior if you could elaborate on the issue?

root@runner-1030:/var/lib/docker/volumes# for volume in $(du -hs * | grep G | cut -d "G" -f2 | xargs); do ls -1 ${volume}/_data/*; du -hs ${volume}; done
mwbot
mwbot.tmp
1.1G	runner-m4mqfjvt-project-1177-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
1.7G	runner-m4mqfjvt-project-1177-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
4.5G	runner-m4mqfjvt-project-1177-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
1.1G	runner-m4mqfjvt-project-1187-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
poty-stuff
poty-stuff.tmp
2.0G	runner-m4mqfjvt-project-1215-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
apt-browser
apt-browser.tmp
1.7G	runner-m4mqfjvt-project-828-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
upcoming-mainpage
upcoming-mainpage.tmp
1.8G	runner-m4mqfjvt-project-837-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
3.2G	runner-m4mqfjvt-project-860-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
1.7G	runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
1.6G	runner-m4mqfjvt-project-860-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
data-engineering
1.4G	runner-m4mqfjvt-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
data-engineering
2.1G	runner-m4mqfjvt-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
data-engineering
1.1G	runner-m4mqfjvt-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
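The one-liner above works but its `grep G | cut -d "G" -f2` parsing is fragile (it breaks on any path that itself contains a "G"). A slightly more robust sketch of the same idea as a helper function (`large_dirs` and the MB threshold are assumptions, not existing tooling):

```shell
# Print subdirectories of $1 whose size is at least $2 MB, smallest first.
large_dirs() {
  du -sm "$1"/*/ 2>/dev/null | awk -v limit="$2" '$1 >= limit' | sort -n
}
```

Running e.g. `large_dirs /var/lib/docker/volumes 1024` would list the oversized cache volumes shown above.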

Mentioned in SAL (#wikimedia-cloud) [2023-03-30T21:52:30Z] <mutante> root@runner-1030:/var/lib/docker/volumes# rm -rf runner-m4mqfjvt-project-1177-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8 (T333586)

Mentioned in SAL (#wikimedia-cloud) [2023-03-30T21:55:23Z] <mutante> root@runner-1030:/var/lib/docker/volumes# rm -rf runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8 ; rm -rf runner-m4mqfjvt-project-860-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8 (T333586)

As an immediate action, one can delete the volumes on the runner, and that solves this task.

Done.

I deleted the three volumes logged above and additionally "...project-860-concurrent-0-cache-..". So all 4 were from project-860, mwbot-rs.

The same ones you can also see in your output above, in the lines with "project-860", which were all large.

Dzahn claimed this task.
Dzahn added a subscriber: Jelto.

Available disk space in /var/lib/docker is back to: used: 21G, available: 17G, usage: 56%.

@Jelto and others: feel free to reopen if you think we should do more follow-up than this.

Well, there is still https://gerrit.wikimedia.org/r/c/operations/puppet/+/904616 as a follow-up action, so maybe it should still be open.

But I also wanted to make clear that _right now_ there is no disk space issue anymore, as described in the title.

(I am pretty sure airflows-dag uses conda, which duplicates a large amount of binary packages, which is itself an issue).

It does use conda. I can help change airflow-dag's behavior if you could elaborate on the issue?

TL;DR: there is surely a way to optimize how dependencies are shipped and installed, to speed up the build and avoid filling the cache so much ;)

I could not find the related tasks last night, but "disk space issue", "gitlab volume cache" and "airflow-dag" surely rang a bell somehow. I found previous traces:

T310593#8008684 from July 2022: that task was about a disk space issue in Gitlab CI for the airflow-dags repo, which is why yesterday I remembered to look at the Docker volumes via docker system df. At the time I dug a bit deeper into the cache contents: the data-engineering conda-base-env and airflow-dags rely on Conda to ship fairly large dependencies (Pyspark, Numpy, Pandas, Babel etc.) and their even larger set of library dependencies (lib*.so, because the images do not use Debian packages). Conda itself seems to add a 200M overhead for its execution environment. All of that is stored in caches rather than being shipped in the container image. That task got resolved the same way, by clearing the volumes.

Filed a few days later was T311111: Improve speed of Gitlab CI, which talks about the dependencies taking a while to install (and indirectly consuming disk space). After the comment I made on the other task (the previous paragraph above), I left a few hints on that task at T311111#8019134, suggesting that maybe Numpy could be installed from the Debian package, which would save the time to install it from scratch and would no longer have it stored in the Gitlab cache, though the version might be too old :/ I am guessing that T311111 can be used as the canonical task to improve the dependency management for those repositories.

More or less related is T309046: Airflow: pin dependency versions to prevent long installs, which predates the above tasks; apparently pip at the time downloaded multiple versions of the pyspark tarball, which really sounds like a huge bug in either pip itself or in the way the requirements are defined. Maybe pip has to download the tarballs to find out the requirements of each version; a requirements file with explicit versions generated via pip freeze would probably solve it.
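For illustration, a pinned requirements file generated with pip freeze fixes every version explicitly, so pip never has to download several candidate tarballs to resolve one (the version numbers below are made up):

```
numpy==1.24.2
pandas==1.5.3
pyspark==3.3.2
```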

I have filed T333663: mwbot-rs large cache usage on Gitlab CI for mwbot-rs/mwbot and countcount/mwbot.

For data engineering I think T311111: Improve speed of Gitlab CI is the right place for follow up actions

Other contenders for which I haven't filed a task (one can find the name from the id by querying the Gitlab API: https://gitlab.wikimedia.org/api/v4/projects/837 ):

All three are Rust repositories, so I guess they have a similar issue to T333663 (mwbot is written in Rust).

Change 904616 merged by Dzahn:

[operations/puppet@production] gitlab_runner: run clear-docker-cache every hour

https://gerrit.wikimedia.org/r/904616

Does it make sense to consider running the cron jobs more frequently?

Yes, I think it does. I uploaded a change suggesting to run it hourly instead of daily. Will wait for reviews by others, though.

Now that I think about it more, perhaps running clear-docker-cache.timer hourly is too aggressive, as in, folks would not benefit from the cache?

So that is https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot builds storing 4G caches, which should be filed as a separate task. Surely the caches should have an upper limit (if that is possible), and those two repositories should be investigated (I am pretty sure airflows-dag uses conda, which duplicates a large amount of binary packages, which is itself an issue).

Agreed we should pursue the top users, and I will follow up on T311111 with my team. But also: seems like we have a pretty small cache?

Now that I think about it more, perhaps running clear-docker-cache.timer hourly is too aggressive, as in, folks would not benefit from the cache?

Hmm, yes, not sure, could be. I wouldn't mind merging a patch or uploading one that goes back to "every 8 hours". Not sure how to measure "benefit from cache".

I wouldn't mind merging a patch or uploading one that goes back to "every 8 hours".

But then again, see T333663#8746160.

I wouldn't mind merging a patch or uploading one that goes back to "every 8 hours".

But then again, see T333663#8746160.

Fair enough.

As a result of the cleanup, the runner is broken!

https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/jobs/88477

Running with gitlab-runner 15.8.3 (080abeab)
  on runner-1030.gitlab-runners.eqiad1.wikimedia.cloud m4MQFjvT, system ID: s_6b9bb70d6611
Preparing the "docker" executor 00:13
Using Docker executor with image rust:latest ...
ERROR: Preparation failed: adding cache volume: set volume permissions: running permission container "da7444432858dd3a9db2b7cc10cc49cc336d3bbccf00cbd9ed3fcc15c1a74a60" for volume "runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8": starting permission container: Error response from daemon: error evaluating symlinks from mount source "/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data": lstat /var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8: no such file or directory (linux_set.go:105:0s)
Will be retried in 3s ...
Using Docker executor with image rust:latest ...
ERROR: Preparation failed: adding cache volume: set volume permissions: running permission container "4bcbafa4463017656c62a9aea926ee127d0f2a57eefaf3dd3fde28013ae79ef5" for volume "runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8": starting permission container: Error response from daemon: error evaluating symlinks from mount source "/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data": lstat /var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8: no such file or directory (linux_set.go:105:0s)
Will be retried in 3s ...
Using Docker executor with image rust:latest ...
ERROR: Preparation failed: adding cache volume: set volume permissions: running permission container "448ef80ad3d0821aebee268d41b1e29e60292c67d58281c292b2f3fc832e8365" for volume "runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8": starting permission container: Error response from daemon: error evaluating symlinks from mount source "/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data": lstat /var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8: no such file or directory (linux_set.go:105:0s)
Will be retried in 3s ...
ERROR: Job failed (system failure): adding cache volume: set volume permissions: running permission container "448ef80ad3d0821aebee268d41b1e29e60292c67d58281c292b2f3fc832e8365" for volume "runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8": starting permission container: Error response from daemon: error evaluating symlinks from mount source "/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data": lstat /var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8: no such file or directory (linux_set.go:105:0s)

The runner tries to reuse a volume and gets a symlink error because the files have vanished:

Error response from daemon:
 error evaluating symlinks from mount source "/var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data":
 lstat /var/lib/docker/volumes/runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8: no such file or directory

But the docker volume is still known:

runner-1030:~$ sudo docker volume ls|grep runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8

That is 100% because someone deleted the underlying files in /var/lib/docker instead of using docker volume rm.
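For the record, the metadata-safe way to drop such a volume is through the Docker CLI (a sketch, guarded so it is a no-op on hosts without Docker; the volume name is the one from the error above):

```shell
# Remove a cache volume via Docker so the daemon's bookkeeping stays in sync,
# instead of running rm -rf under /var/lib/docker/volumes/.
volume="runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8"
if command -v docker >/dev/null 2>&1; then
  docker volume rm "$volume" || true
fi
```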

Mentioned in SAL (#wikimedia-releng) [2023-04-03T08:11:56Z] <hashar> gitlab: runner-1030: running docker volume prune to discard all volumes metadata after they have been manually deleted from /var/lib/docker ( Reclaimed 13.92GB) | T333586

root@runner-1030:~# docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8

...

Total reclaimed space: 13.92GB
root@runner-1030:~# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          9         3         2.109GB   1.768GB (83%)
Containers      3         2         40.71MB   38.6MB (94%)
Local Volumes   1         1         6.145kB   0B (0%)
Build Cache     0         0         0B        0B
root@runner-1030:~# docker volume ls
DRIVER    VOLUME NAME
local     docker-resource-monitor