Experiencing pipeline failure due to disk-space issues
Closed, ResolvedPublic

Description

Hi folks,
on the airflow-dags project, we're facing regular pipeline failures due to disk space restrictions (see https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/20923 for instance). Rerunning them usually solves the problem, but it's quite annoying.
It'd be great if some more space could be allocated for the pipelines.
Many thanks!

Event Timeline

hashar added subscribers: Jelto, brennen, hashar.

That is more or less a recurring issue as I understand it.

The WMCS instances are in the gitlab-runners WMCS project, and there is no monitoring for them: Diamond has been phased out (T210993), and we have the same issue on the integration project (T307655).

So gotta dig through https://horizon.wikimedia.org/

The instances use the g3.cores8.ram24.disk20.ephemeral40.4xiops flavor, so 20G for the system and 40G of extra ephemeral space.

Then there are ten 60G Cinder volumes described as being for /var/lib/docker. They were created in October 2021 but are not attached to any instance, so they can be ignored.
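
For reference, the attachment status can also be checked from the CLI instead of Horizon (a sketch, assuming the OpenStack client and credentials for the gitlab-runners project):

$ openstack volume list --long
# the "Attached to" column is empty for the ten unattached 60G volumes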

The build ran on runner-1025.gitlab-runners.eqiad1.wikimedia.cloud

$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  40G  0 disk /var/lib/docker

There is still some disk space:

$ df -h /var/lib/docker /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         40G   33G  4.8G  88% /var/lib/docker
/dev/sda1        20G  5.3G   14G  28% /

I did a docker image prune, which reclaimed 2.5 GB, but it looks like most of the disk space is used by volumes:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   73        1         23.46GB   23.46GB (99%)
Build Cache     0         0         0B        0B
$ sudo docker volume ls
DRIVER    VOLUME NAME
local     docker-resource-monitor
local     runner-2zb4qjpg-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-30-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-30-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-55-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-55-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-182-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-182-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-235-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-235-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-276-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-276-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-319-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-319-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-332-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-332-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-340-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-340-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-343-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-343-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-359-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-360-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-364-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-364-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

I have pruned them:

$ sudo docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
...
Total reclaimed space: 23.46GB
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   1         1         10.74kB   0B (0%)
Build Cache     0         0         0B        0B

Solved for that runner-1025 instance.

Antoine_Quhen subscribed.

Hi, it's still happening now.

Are volumes created for all GitLab CI jobs? If yes, why?

⚠ Re-running the job is not working currently.

The issue is partly related to the jobs: if their builds take up a lot of gigabytes, especially in the cache, they have a high chance of filling the runner disk. As the Docker volumes pile up on a specific GitLab runner, it will eventually run out of disk space, at which point subsequent builds will fail regardless of the job.

Re-running a job might work if it happens to be scheduled on a runner that still has enough disk space.
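
For context, a quick way to spot runners that are close to full (a sketch, assuming SSH access to the gitlab-runners hosts; the hostname range is an assumption based on the runners mentioned in this task):

# Hypothetical spot check of Docker disk and volume usage on each runner
for host in runner-10{21..30}.gitlab-runners.eqiad1.wikimedia.cloud; do
    echo "== ${host} =="
    ssh "${host}" 'df -h /var/lib/docker && sudo docker system df'
done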

Two of the builds ran on runner-1026.gitlab-runners.eqiad1.wikimedia.cloud and another on runner-1030.gitlab-runners.eqiad1.wikimedia.cloud.

runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          28        4         5.508GB   5.036GB (91%)
Containers      8         3         659.6MB   494B (0%)
Local Volumes   83        5         28.88GB   28.88GB (99%)
Build Cache     0         0         0B        0B
runner-1030.gitlab-runners.eqiad1.wikimedia.cloud
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        4         9.802GB   9.33GB (95%)
Containers      5         2         377.9MB   247B (0%)
Local Volumes   77        3         24.09GB   24.09GB (99%)
Build Cache     0         0         0B        0B

I have cleared the volumes with sudo docker volume prune -f.

I brought the topic to the Wednesday GitLab sync meeting, but this week the slot was used for the Phabricator upgrade and we haven't had a chance to talk about it.

The workaround is to manually prune the volumes until we have a chance to discuss the issue and implement a solution.

Note to self: one can see the volume sizes with docker system df -v. On runner-1024 there are a few 1 GB, 1.2 GB, and 1.5 GB volumes piled up:

runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.807GB
runner-4kunvzhc-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.553GB
runner-at3hz-ze-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.096GB
runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.201GB
runner-4kunvzhc-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.201GB
runner-4kunvzhc-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.217GB
runner-4kunvzhc-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.075GB
runner-4kunvzhc-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.176GB
runner-4kunvzhc-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.096GB
runner-at3hz-ze-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.178GB
runner-at3hz-ze-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.177GB
runner-4kunvzhc-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.568GB
runner-4kunvzhc-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.568GB
runner-4kunvzhc-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.552GB
runner-at3hz-ze-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.105GB

With docker volume inspect we can retrieve the volume labels set by GitLab, which conveniently refer to the jobs:

$ (sudo docker system df -v|grep GB|awk '{print $1}'|xargs sudo docker volume inspect)|grep job.url| sed -e 's%/-/jobs.*%%'|sort|uniq -c
      6             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
      1             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
      4             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
      3             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/knowledge-gaps
      1             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/research-common

Direct links:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
https://gitlab.wikimedia.org/repos/research/knowledge-gaps
https://gitlab.wikimedia.org/repos/research/research-common
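
For reference, the same per-project counts can be produced with Docker's Go-template formatting instead of grepping the JSON, at the cost of counting all labelled volumes rather than only the gigabyte-sized ones (a sketch):

$ sudo docker volume ls -q \
    | xargs sudo docker volume inspect \
        --format '{{ with .Labels }}{{ index . "com.gitlab.gitlab-runner.job.url" }}{{ end }}' \
    | grep . | sed -e 's%/-/jobs.*%%' | sort | uniq -c

The with guard skips volumes that have no labels at all (such as docker-resource-monitor).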

The volumes can be further inspected under /var/lib/docker/volumes/. I have looked at two of them:

data-engineering/conda-base-env

It is a 1.807GB volume named runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8. Most of the data is in _data/repos/data-engineering/conda-base-env/dist/, which has two subdirectories, debian and conda_dist_env, each 697M.

The contents come from Python dependencies under conda-base-env/usr/share/conda-base-env/lib/python3.10/site-packages, such as pyspark at 301M, pyarrow at 98M, and numpy at 70M, and they are apparently duplicated between the debian and conda_dist_env directories.

data-engineering/airflow-dags

It is 1.248G and named runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8. There is 834M in the Python 3.9 site-packages, and I can see a .tox/lint environment which has a 310M pyspark (mostly jars apparently), a 40M f9cd58ee140be1597db0__mypyc.cpython-39-x86_64-linux-gnu.so, and the usual fat pandas, numpy, babel :D. Maybe all dependencies are installed for linting?

There is also 200M for a conda execution environment.
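
For reference, the per-volume breakdowns above come from walking the volume's _data directory; something along these lines (a sketch, using the conda-base-env volume name from above):

$ sudo du -xh --max-depth=3 \
    /var/lib/docker/volumes/runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data \
    | sort -rh | head -n 10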

There might be some optimizations to be made regarding dependencies; regardless, we need more disk space or a scalable distributed storage system to hold the caches.

Change 807103 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache

https://gerrit.wikimedia.org/r/807103

Change 807103 merged by Jelto:

[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache

https://gerrit.wikimedia.org/r/807103

The Docker cache is now cleaned every 24h on the GitLab Runner nodes, so job failures due to full Docker volumes should happen less frequently.

For the scope of this task, that solves the issue. Additional tasks can be filed to keep the cache longer, potentially share it across runners, etc.
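
For illustration, this kind of periodic cleanup is wired up as a systemd service/timer pair; a minimal sketch (the unit names match the clear-docker-cache units that show up later in this task, but the contents here are an assumption, not the actual Puppet-managed files):

# clear-docker-cache.service (sketch)
[Unit]
Description=Clear GitLab Runner Docker cache

[Service]
Type=oneshot
# clear-docker-cache ships with gitlab-runner and prunes unused images/volumes
ExecStart=/usr/share/gitlab-runner/clear-docker-cache

# clear-docker-cache.timer (sketch)
[Unit]
Description=Periodic execution of clear-docker-cache.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target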

Thanks!

Also, for space & speed, we may not be using the CI cache properly; a hypothetical example follows.
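
To illustrate what "using the CI cache properly" could look like, here is a hypothetical .gitlab-ci.yml cache stanza keyed on the dependency lock file, so identical dependency sets share one cache entry instead of each job filling its own volume (file names and paths are made up for the example):

# Hypothetical job: cache pip downloads and the tox environment,
# keyed on the lock file so the cache is reused across pipelines.
test:
  image: python:3.9
  cache:
    key:
      files:
        - poetry.lock   # assumed lock file; adjust to the project layout
    paths:
      - .cache/pip
      - .tox/
  script:
    - pip install tox
    - tox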

Jelto reopened this task as Open. Edited Nov 4 2022, 3:54 PM

CI builds fail again with No space left on device. See: https://gitlab.wikimedia.org/repos/releng/cli/-/jobs/28118. I'm re-opening the task.

Shared Runners have a dedicated disk for /var/lib/docker, which holds the cache for CI builds. This disk was full, preventing CI jobs from running properly. There is also more pressure on that disk because BuildKit's dedicated cache uses the same folder.

I executed /usr/share/gitlab-runner/clear-docker-cache on all Shared Runners manually, which cleaned the cache.

I also created a change https://gerrit.wikimedia.org/r/q/853312 with more aggressive cache cleanup (every 12h instead of every 24h).

Long term, another solution is needed: we either have to increase disk sizes and quotas for the Docker disk or store the cache somewhere else.
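
One option for "somewhere else" is GitLab Runner's built-in distributed cache, which pushes job caches to object storage instead of keeping them in local Docker volumes; a sketch of the relevant config.toml section (endpoint and bucket names are placeholders):

# config.toml excerpt (sketch): store job caches in S3-compatible storage
[runners.cache]
  Type = "s3"
  Shared = true
  [runners.cache.s3]
    ServerAddress = "s3.example.wmcloud.org"   # placeholder endpoint
    BucketName = "gitlab-runner-cache"         # placeholder bucket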

Change 853312 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily

https://gerrit.wikimedia.org/r/853312

Change 853312 merged by Dzahn:

[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily

https://gerrit.wikimedia.org/r/853312

on runner-1021:

dzahn@runner-1021:~$ systemctl status clear-docker-cache.timer
● clear-docker-cache.timer - Periodic execution of clear-docker-cache.service
     Loaded: loaded (/lib/systemd/system/clear-docker-cache.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2022-10-24 15:54:41 UTC; 1 weeks 4 days ago
    Trigger: Sat 2022-11-05 05:00:00 UTC; 9h left

The "9h left" shows it is now going to run twice daily.

hashar removed hashar as the assignee of this task. Nov 5 2022, 8:06 AM

I did the first-pass investigation back in June to free up disk space, but I am otherwise not working on addressing the GitLab caching system.

LSobanski subscribed.

Likely needs a design discussion between RelEng and ServiceOps.

Change 876184 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Change 876184 merged by Dzahn:

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Change 876240 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

Change 876240 merged by Dzahn:

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:56:53Z] <mutante> gitlab-runner1002 - systemctl start docker-gc; run puppet on all gitlab-runners T310593

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:57:42Z] <mutante> systemctl start docker-gc on all gitlab-runners via cumin T310593

dancy claimed this task.

@JAllemandou This should be resolved now thanks to the ServiceOps team.