Hi folks,
on the airflow-dags project, we're facing regular pipeline failures due to disk space restrictions (see https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/20923 for instance). Rerunning them usually solves the problem, but it's quite annoying.
It'd be great if some more space could be allocated for the pipelines.
Many thanks!
Related Objects
- Mentioned In
  - T333586: runner-1030.gitlab-runners.eqiad1.wikimedia.cloud out of space
  - T311111: Improve speed of Gitlab CI
- Mentioned Here
  - T311111: Improve speed of Gitlab CI
  - T210993: Deprecate Diamond collectors in Cloud VPS
  - T307655: Replacement needed for obsolete Diamond/Graphite monitoring of integration instances
Event Timeline
That is more or less a recurring issue as I understand it.
The WMCS instances are in the gitlab-runners WMCS project; there is no monitoring for them. Diamond has been phased out (T210993) and we have the same issue on the integration project (T307655).
So gotta dig through https://horizon.wikimedia.org/
The instances are g3.cores8.ram24.disk20.ephemeral40.4xiops, so 20G for the system and 40G of extra ephemeral space.
Then there are ten 60G Cinder volumes described as being for /var/lib/docker. They were created in October 2021 but are not attached to any instances, so they can be ignored.
The build ran on runner-1025.gitlab-runners.eqiad1.wikimedia.cloud
$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  40G  0 disk /var/lib/docker
There is still some disk space:
$ df -h /var/lib/docker /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         40G   33G  4.8G  88% /var/lib/docker
/dev/sda1        20G  5.3G   14G  28% /
I did a docker image prune, which reclaimed 2.5 GB, but it looks like most of the disk space is used by volumes:
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   73        1         23.46GB   23.46GB (99%)
Build Cache     0         0         0B        0B
DRIVER    VOLUME NAME
local     docker-resource-monitor
local     runner-2zb4qjpg-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-30-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-30-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-55-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-55-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-182-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-182-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-235-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-235-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-276-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-276-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-319-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-319-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-332-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-332-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-340-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-340-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-343-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-343-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-359-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-360-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-364-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-364-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
I have pruned them:
$ sudo docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
...
Total reclaimed space: 23.46GB
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   1         1         10.74kB   0B (0%)
Build Cache     0         0         0B        0B
Solved for that runner-1025 instance.
Hi, it's still happening now:
- https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env/-/jobs/21111
  E: You don't have enough free space in /var/cache/apt/archives/.
  This is a job to build a .deb file containing a conda env.
- https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/jobs/21105
  error: could not create 'build': No space left on device
  This is a simple linting job.
Are volumes created for all GitLab CI jobs? If yes, why?
⚠ Rerunning the job is not working currently.
The issue is partly related to the jobs. If their builds take up many gigabytes, especially in the cache, they have a high chance of filling the runner's disk. As the Docker volumes pile up on a specific GitLab runner, it will eventually run out of disk space, at which point subsequent builds fail regardless of the job.
Rerunning a job might work if it happens to be scheduled on a runner that still has enough disk space.
Two of the builds ran on runner-1026.gitlab-runners.eqiad1.wikimedia.cloud and another on runner-1030.gitlab-runners.eqiad1.wikimedia.cloud.
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          28        4         5.508GB   5.036GB (91%)
Containers      8         3         659.6MB   494B (0%)
Local Volumes   83        5         28.88GB   28.88GB (99%)
Build Cache     0         0         0B        0B
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        4         9.802GB   9.33GB (95%)
Containers      5         2         377.9MB   247B (0%)
Local Volumes   77        3         24.09GB   24.09GB (99%)
Build Cache     0         0         0B        0B
I have cleared the volumes with sudo docker volume prune -f.
I brought the topic to the Wednesday GitLab sync meeting, but this week the slot got used for the Phabricator upgrade and we haven't had a chance to talk about it.
The workaround is to manually prune the volumes until we have a chance to discuss the issue and implement a solution.
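A hedged sketch of that manual workaround as a small loop (the hostnames are just examples taken from this task; docker volume prune only removes volumes not used by any container):

  # Prune unused Docker volumes on each runner and report remaining space.
  for host in runner-1025 runner-1026 runner-1030; do
    echo "== ${host} =="
    ssh "${host}.gitlab-runners.eqiad1.wikimedia.cloud" \
      'sudo docker volume prune -f && df -h /var/lib/docker'
  done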
Note to self: one can see the volume sizes with docker system df -v. On runner-1024 there are a few 1 GB, 1.2 GB and 1.5 GB volumes piled up:
runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.807GB
runner-4kunvzhc-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.553GB
runner-at3hz-ze-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.096GB
runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.201GB
runner-4kunvzhc-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.201GB
runner-4kunvzhc-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.217GB
runner-4kunvzhc-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.075GB
runner-4kunvzhc-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.176GB
runner-4kunvzhc-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.096GB
runner-at3hz-ze-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.178GB
runner-at3hz-ze-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0   1.177GB
runner-4kunvzhc-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.568GB
runner-4kunvzhc-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.568GB
runner-4kunvzhc-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.552GB
runner-at3hz-ze-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0   1.105GB
With docker volume inspect we can retrieve the volume labels set by GitLab, which conveniently refer to the jobs:
$ (sudo docker system df -v | grep GB | awk '{print $1}' | xargs sudo docker volume inspect) | grep job.url | sed -e 's%/-/jobs.*%%' | sort | uniq -c
      6 "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
      1 "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
      4 "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
      3 "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/knowledge-gaps
      1 "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/research-common
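A slightly more readable variant of the same idea, using docker volume inspect --format to pull the job URL label directly (a sketch, not a verified improvement; image rows from docker system df -v that also match "GB" will produce harmless inspect errors, and volumes without that label print "<no value>"):

  # Group GB-sized volumes by the GitLab job URL stored in their labels.
  sudo docker system df -v | grep GB | awk '{print $1}' \
    | xargs -r -n 1 sudo docker volume inspect \
        --format '{{ index .Labels "com.gitlab.gitlab-runner.job.url" }}' 2>/dev/null \
    | sed -e 's%/-/jobs/.*%%' | sort | uniq -c | sort -rn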
Direct links:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
https://gitlab.wikimedia.org/repos/research/knowledge-gaps
https://gitlab.wikimedia.org/repos/research/research-common
The volumes can be further inspected under /var/lib/docker/volumes/; I have looked at two of them:
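For example, a hedged sketch of how to see where the space goes inside one of the large volumes listed above (the volume name is one of the runner-1024 entries; adjust as needed):

  # Show the largest subdirectories of the volume, three levels deep.
  VOL=runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
  sudo du -h --max-depth=3 "/var/lib/docker/volumes/$VOL/_data" | sort -h | tail -n 15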
data-engineering/conda-base-env
It is a 1.807GB one named runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8. The volume has most of its data in _data/repos/data-engineering/conda-base-env/dist/, which has two subdirectories, debian and conda_dist_env, each 697M.
The contents come from Python dependencies under conda-base-env/usr/share/conda-base-env/lib/python3.10/site-packages such as pyspark (301M), pyarrow (98M), and numpy (70M), and they are apparently duplicated between the debian and conda_dist_env directories.
data-engineering/airflow-dags
It is 1.248G and named runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8. There is 834M in the Python 3.9 site-packages, and I can see a .tox/lint environment which has a 310M pyspark (mostly jars, apparently), a 40M f9cd58ee140be1597db0__mypyc.cpython-39-x86_64-linux-gnu.so, and the usual fat pandas, numpy and babel :D. Maybe all dependencies are installed for linting?
There is also 200M for a conda execution environment.
There might be some optimizations to be made regarding dependencies; regardless, we need more disk space or a scalable, distributed storage system to hold the caches.
Change 807103 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache
Change 807103 merged by Jelto:
[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache
The Docker cache is now cleaned every 24h on GitLab Runner nodes, so jobs failing due to a full Docker volume should happen less frequently.
For the scope of this task, that solves the issue. Additional tasks can be filed to keep the cache longer, potentially share it across runners, etc.
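For reference, a rough sketch of what this periodic cleanup amounts to on a runner, assuming a systemd timer/service pair wrapping the clear-docker-cache helper shipped with gitlab-runner (unit names and the helper path are taken from later comments on this task; the actual puppet code in change 807103 and the schedule are assumptions):

  # clear-docker-cache.service (sketch)
  [Unit]
  Description=Clear the GitLab Runner Docker cache
  [Service]
  Type=oneshot
  # Helper script shipped with gitlab-runner that removes old runner cache data.
  ExecStart=/usr/share/gitlab-runner/clear-docker-cache

  # clear-docker-cache.timer (sketch)
  [Unit]
  Description=Periodic execution of clear-docker-cache.service
  [Timer]
  # Assumption: a daily schedule, matching the "cleaned every 24h" statement.
  OnCalendar=daily
  [Install]
  WantedBy=timers.target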
CI builds fail again with No space left on device. See: https://gitlab.wikimedia.org/repos/releng/cli/-/jobs/28118. I'm re-opening the task.
Shared Runners have a dedicated disk for /var/lib/docker. This disk holds the cache for CI builds and was full, preventing CI jobs from running properly. There is also more pressure on that disk because BuildKit's dedicated cache also uses this folder.
I executed /usr/share/gitlab-runner/clear-docker-cache on all Shared Runners manually which cleaned the cache.
I also created a change https://gerrit.wikimedia.org/r/q/853312 with more aggressive cache cleanup (every 12h instead of every 24h).
Long term, another solution is needed: we either have to increase disk sizes and quotas for the Docker disk or store the cache somewhere else.
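One possible shape for "store the cache somewhere else" is GitLab Runner's distributed cache, which puts cache: contents in object storage instead of per-job local volumes. A hedged sketch of the relevant config.toml stanza, with placeholder endpoint and bucket names (credentials omitted; this would not cover the builds-directory volumes):

  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "object-storage.example.org"   # placeholder
      BucketName = "gitlab-runner-cache"             # placeholder
      BucketLocation = "default"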
Change 853312 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily
Change 853312 merged by Dzahn:
[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily
on runner-1021:
dzahn@runner-1021:~$ systemctl status clear-docker-cache.timer
● clear-docker-cache.timer - Periodic execution of clear-docker-cache.service
     Loaded: loaded (/lib/systemd/system/clear-docker-cache.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2022-10-24 15:54:41 UTC; 1 weeks 4 days ago
    Trigger: Sat 2022-11-05 05:00:00 UTC; 9h left
The "9h left" shows it is now going to run twice daily.
I did the first-pass investigation back in June to free up disk space, but I am otherwise not working on addressing the GitLab caching system.
Change 876184 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs
Change 876184 merged by Dzahn:
[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs
Change 876240 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):
[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc
Change 876240 merged by Dzahn:
[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc
Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:56:53Z] <mutante> gitlab-runner1002 - systemctl start docker-gc; run puppet on all gitlab-runners T310593
Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:57:42Z] <mutante> systemctl start docker-gc on all gitlab-runners via cumin T310593