Experiencing pipeline failure due to disk-space issues
Closed, ResolvedPublic

Description

Hi folks,
on the airflow-dags project, we're facing regular pipeline failures due to disk space restrictions (see https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/20923 for instance). Rerunning them usually solves the problem, but it's quite annoying.
It'd be great if some more space could be allocated for the pipelines.
Many thanks!

Event Timeline

hashar added subscribers: Jelto, brennen, hashar.

That is more or less a recurring issue as I understand it.

The WMCS instances are in the gitlab-runners WMCS project, and there is no monitoring for them: Diamond has been phased out (T210993), and we have the same issue on the integration project (T307655).

So gotta dig through https://horizon.wikimedia.org/

The instances use the g3.cores8.ram24.disk20.ephemeral40.4xiops flavor, so 20G for the system and 40G of extra ephemeral space.

Then there are ten 60G Cinder volumes described as being for /var/lib/docker. They were created in October 2021 but are not attached to any instance, so they can be ignored.
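
For reference, the attachment status can also be checked from the CLI instead of Horizon (a sketch, assuming the OpenStack client and credentials for the gitlab-runners project):

$ openstack volume list --long
# the "Attached to" column is empty for the ten unattached 60G volumes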

The build ran on runner-1025.gitlab-runners.eqiad1.wikimedia.cloud

$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  40G  0 disk /var/lib/docker

There is still some disk space:

$ df -h /var/lib/docker /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         40G   33G  4.8G  88% /var/lib/docker
/dev/sda1        20G  5.3G   14G  28% /

I did a docker image prune, which reclaimed 2.5 GB, but it looks like most of the disk space is used by volumes:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   73        1         23.46GB   23.46GB (99%)
Build Cache     0         0         0B        0B
$ sudo docker volume ls
DRIVER    VOLUME NAME
local     docker-resource-monitor
local     runner-2zb4qjpg-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-2zb4qjpg-project-211-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-2zb4qjpg-project-211-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-30-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-30-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-55-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-55-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-93-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-149-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-149-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-182-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-182-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-203-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-211-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-211-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-212-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-235-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-235-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-276-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-276-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-315-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-315-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-319-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-319-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-332-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-332-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-338-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-338-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-340-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-340-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-343-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-343-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-359-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-360-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-38lvnqir-project-364-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-38lvnqir-project-364-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

I have pruned them:

$ sudo docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
...
Total reclaimed space: 23.46GB
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        1         6.134GB   5.935GB (96%)
Containers      1         1         0B        0B
Local Volumes   1         1         10.74kB   0B (0%)
Build Cache     0         0         0B        0B

Solved for that runner-1025 instance.

Antoine_Quhen subscribed.

Hi, it's still happening now.

Are volumes created for all GitLab CI jobs? If yes, why?

⚠ Re-running the job is not working currently.

The issue is partly related to the jobs: if their builds take up a lot of gigabytes, especially in the cache, they have a high chance of filling the runner disk. As the Docker volumes pile up on a specific GitLab runner, it will eventually run out of disk space, at which point subsequent builds will fail regardless of the job.

Re-running a job might work if it happens to be scheduled on a runner that still has enough disk space.
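
For context, a quick way to spot runners that are close to full (a sketch, assuming SSH access to the gitlab-runners hosts; the hostname range is an assumption based on the runners mentioned in this task):

# Hypothetical spot check of Docker disk and volume usage on each runner
for host in runner-10{21..30}.gitlab-runners.eqiad1.wikimedia.cloud; do
    echo "== ${host} =="
    ssh "${host}" 'df -h /var/lib/docker && sudo docker system df'
done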

Two of the builds ran on runner-1026.gitlab-runners.eqiad1.wikimedia.cloud and another on runner-1030.gitlab-runners.eqiad1.wikimedia.cloud.

runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          28        4         5.508GB   5.036GB (91%)
Containers      8         3         659.6MB   494B (0%)
Local Volumes   83        5         28.88GB   28.88GB (99%)
Build Cache     0         0         0B        0B
runner-1030.gitlab-runners.eqiad1.wikimedia.cloud
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        4         9.802GB   9.33GB (95%)
Containers      5         2         377.9MB   247B (0%)
Local Volumes   77        3         24.09GB   24.09GB (99%)
Build Cache     0         0         0B        0B

I have cleared the volumes with sudo docker volume prune -f.

I brought the topic to the Wednesday GitLab sync meeting, but this week the slot was used for the Phabricator upgrade and we haven't had a chance to talk about it.

The workaround is to manually prune the volumes until we have a chance to discuss the issue and implement a solution.

Note to self: one can see the volume sizes with docker system df -v. On runner-1024 there are a few 1 GB, 1.2 GB, and 1.5 GB volumes piled up:

runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.807GB
runner-4kunvzhc-project-203-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.553GB
runner-at3hz-ze-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.096GB
runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.201GB
runner-4kunvzhc-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.201GB
runner-4kunvzhc-project-93-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.217GB
runner-4kunvzhc-project-360-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.075GB
runner-4kunvzhc-project-93-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.176GB
runner-4kunvzhc-project-212-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.096GB
runner-at3hz-ze-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.178GB
runner-at3hz-ze-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8    0         1.177GB
runner-4kunvzhc-project-203-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.568GB
runner-4kunvzhc-project-203-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.568GB
runner-4kunvzhc-project-203-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.552GB
runner-at3hz-ze-project-212-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8   0         1.105GB

With docker volume inspect we can retrieve the volume labels set by GitLab, which conveniently refer to the jobs:

$ (sudo docker system df -v|grep GB|awk '{print $1}'|xargs sudo docker volume inspect)|grep job.url| sed -e 's%/-/jobs.*%%'|sort|uniq -c
      6             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
      1             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
      4             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
      3             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/knowledge-gaps
      1             "com.gitlab.gitlab-runner.job.url": "https://gitlab.wikimedia.org/repos/research/research-common

Direct links:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env
https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines
https://gitlab.wikimedia.org/repos/research/knowledge-gaps
https://gitlab.wikimedia.org/repos/research/research-common
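
For reference, the same per-project counts can be produced with Docker's Go-template formatting instead of grepping the JSON, at the cost of counting all labelled volumes rather than only the gigabyte-sized ones (a sketch):

$ sudo docker volume ls -q \
    | xargs sudo docker volume inspect \
        --format '{{ with .Labels }}{{ index . "com.gitlab.gitlab-runner.job.url" }}{{ end }}' \
    | grep . | sed -e 's%/-/jobs.*%%' | sort | uniq -c

The with guard skips volumes that have no labels at all (such as docker-resource-monitor).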

The volumes can be further inspected under /var/lib/docker/volumes/. I have looked at two of them:

data-engineering/conda-base-env

It is a 1.807GB volume named runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8. Most of the data is in _data/repos/data-engineering/conda-base-env/dist/, which has two subdirectories, debian and conda_dist_env, each 697M.

The contents come from Python dependencies under conda-base-env/usr/share/conda-base-env/lib/python3.10/site-packages, such as pyspark at 301M, pyarrow at 98M, and numpy at 70M, and they are apparently duplicated between the debian and conda_dist_env directories.

data-engineering/airflow-dags

It is 1.248G and named runner-4kunvzhc-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8. There is 834M in the Python 3.9 site-packages, and I can see a .tox/lint environment which has a 310M pyspark (mostly jars apparently), a 40M f9cd58ee140be1597db0__mypyc.cpython-39-x86_64-linux-gnu.so, and the usual fat pandas, numpy, babel :D. Maybe all dependencies are installed for linting?

There is also 200M for a conda execution environment.
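
For reference, the per-volume breakdowns above come from walking the volume's _data directory; something along these lines (a sketch, using the conda-base-env volume name from above):

$ sudo du -xh --max-depth=3 \
    /var/lib/docker/volumes/runner-4kunvzhc-project-359-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data \
    | sort -rh | head -n 10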

There might be some optimizations to be made regarding dependencies; regardless, we need more disk space or a scalable distributed storage system to hold the caches.

Change 807103 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache

https://gerrit.wikimedia.org/r/807103

Change 807103 merged by Jelto:

[operations/puppet@production] gitlab_runner: add job to cleanup old docker volumes/cache

https://gerrit.wikimedia.org/r/807103

The Docker cache is now cleaned every 24h on the GitLab Runner nodes, so job failures due to full Docker volumes should happen less frequently.

For the scope of this task, that solves the issue. Additional tasks can be filed to keep the cache longer, potentially share it across runners, etc.
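
For illustration, this kind of periodic cleanup is wired up as a systemd service/timer pair; a minimal sketch (the unit names match the clear-docker-cache units that show up later in this task, but the contents here are an assumption, not the actual Puppet-managed files):

# clear-docker-cache.service (sketch)
[Unit]
Description=Clear GitLab Runner Docker cache

[Service]
Type=oneshot
# clear-docker-cache ships with gitlab-runner and prunes unused images/volumes
ExecStart=/usr/share/gitlab-runner/clear-docker-cache

# clear-docker-cache.timer (sketch)
[Unit]
Description=Periodic execution of clear-docker-cache.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target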

Thanks!

Also, for space & speed, we may not be using the CI cache properly; a hypothetical example follows.
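
To illustrate what "using the CI cache properly" could look like, here is a hypothetical .gitlab-ci.yml cache stanza keyed on the dependency lock file, so identical dependency sets share one cache entry instead of each job filling its own volume (file names and paths are made up for the example):

# Hypothetical job: cache pip downloads and the tox environment,
# keyed on the lock file so the cache is reused across pipelines.
test:
  image: python:3.9
  cache:
    key:
      files:
        - poetry.lock   # assumed lock file; adjust to the project layout
    paths:
      - .cache/pip
      - .tox/
  script:
    - pip install tox
    - tox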

Jelto reopened this task as Open. Edited Nov 4 2022, 3:54 PM

CI builds fail again with No space left on device. See: https://gitlab.wikimedia.org/repos/releng/cli/-/jobs/28118. I'm re-opening the task.

Shared Runners have a dedicated disk for /var/lib/docker, which holds the cache for CI builds. This disk was full, preventing CI jobs from running properly. There is also more pressure on that disk because BuildKit's dedicated cache uses the same folder.

I executed /usr/share/gitlab-runner/clear-docker-cache on all Shared Runners manually, which cleaned the cache.

I also created a change https://gerrit.wikimedia.org/r/q/853312 with more aggressive cache cleanup (every 12h instead of every 24h).

Long term, another solution is needed: we either have to increase disk sizes and quotas for the Docker disk or store the cache somewhere else.
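
One option for "somewhere else" is GitLab Runner's built-in distributed cache, which pushes job caches to object storage instead of keeping them in local Docker volumes; a sketch of the relevant config.toml section (endpoint and bucket names are placeholders):

# config.toml excerpt (sketch): store job caches in S3-compatible storage
[runners.cache]
  Type = "s3"
  Shared = true
  [runners.cache.s3]
    ServerAddress = "s3.example.wmcloud.org"   # placeholder endpoint
    BucketName = "gitlab-runner-cache"         # placeholder bucket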

Change 853312 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily

https://gerrit.wikimedia.org/r/853312

Change 853312 merged by Dzahn:

[operations/puppet@production] gitlab_runner: run cleanup of docker cache twice daily

https://gerrit.wikimedia.org/r/853312

on runner-1021:

dzahn@runner-1021:~$ systemctl status clear-docker-cache.timer
● clear-docker-cache.timer - Periodic execution of clear-docker-cache.service
     Loaded: loaded (/lib/systemd/system/clear-docker-cache.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2022-10-24 15:54:41 UTC; 1 weeks 4 days ago
    Trigger: Sat 2022-11-05 05:00:00 UTC; 9h left

The "9h left" shows it is now going to run twice daily.

hashar removed hashar as the assignee of this task. Nov 5 2022, 8:06 AM

I did the first-pass investigation back in June to free up disk space, but I am otherwise not working on addressing the GitLab caching system.

LSobanski subscribed.

Likely needs a design discussion between RelEng and ServiceOps.

Change 876184 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Change 876184 merged by Dzahn:

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Change 876240 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

Change 876240 merged by Dzahn:

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:56:53Z] <mutante> gitlab-runner1002 - systemctl start docker-gc; run puppet on all gitlab-runners T310593

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:57:42Z] <mutante> systemctl start docker-gc on all gitlab-runners via cumin T310593

dancy claimed this task.

@JAllemandou This should be resolved now thanks to the ServiceOps team.