GitLab CI: "ENOSPC: no space left on device, mkdir"
Closed, Duplicate (Public)

Description

Seen on https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/jobs/113009:

#13 [6/7] RUN npm install
#13 5.934 npm WARN deprecated uuid@3.4.0: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
#13 6.804 npm WARN deprecated kad-memstore@0.0.1: This package is no longer maintained.
#13 6.804 npm WARN deprecated kad-fs@0.0.4: This package is no longer maintained.
#13 7.111 npm WARN deprecated har-validator@5.1.5: this library is no longer supported
#13 7.237 npm WARN deprecated mkdirp@0.5.1: Legacy versions of mkdirp are no longer supported. Please update to mkdirp 1.x. (Note that the API surface has changed to use Promises in 1.x.)
#13 7.820 npm WARN deprecated request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
#13 10.37 npm WARN tar TAR_ENTRY_ERROR ENOENT: no such file or directory, lstat '/srv/service/node_modules/clarinet/test'
#13 10.38 npm WARN tar TAR_ENTRY_ERROR ENOENT: no such file or directory, lstat '/srv/service/node_modules/clarinet/test'
#13 10.38 npm WARN tar TAR_ENTRY_ERROR ENOENT: no such file or directory, lstat '/srv/service/node_modules/clarinet/test'
#13 10.41 npm notice 
#13 10.41 npm notice New major version of npm available! 8.15.0 -> 9.7.2
#13 10.41 npm notice Changelog: <https://github.com/npm/cli/releases/tag/v9.7.2>
#13 10.41 npm notice Run `npm install -g npm@9.7.2` to update!
#13 10.41 npm notice 
#13 10.41 npm ERR! code ENOSPC
#13 10.41 npm ERR! syscall mkdir
#13 10.41 npm ERR! path /home/somebody/.npm/_cacache/content-v2/sha512/b5/17
#13 10.41 npm ERR! errno -28
#13 10.41 npm ERR! nospc ENOSPC: no space left on device, mkdir '/home/somebody/.npm/_cacache/content-v2/sha512/b5/17'
#13 10.41 npm ERR! nospc There appears to be insufficient space on your system to finish.
#13 10.41 npm ERR! nospc Clear up some disk space and try again.
#13 10.42 
#13 10.42 npm ERR! A complete log of this run can be found in:
#13 10.42 npm ERR!     /home/somebody/.npm/_logs/2023-06-27T21_24_10_261Z-debug-0.log
#13 ERROR: process "/bin/sh -c npm install" did not complete successfully: exit code: 228
------
 > [6/7] RUN npm install:
#13 10.41 npm ERR! code ENOSPC
#13 10.41 npm ERR! syscall mkdir
#13 10.41 npm ERR! path /home/somebody/.npm/_cacache/content-v2/sha512/b5/17
#13 10.41 npm ERR! errno -28
#13 10.41 npm ERR! nospc ENOSPC: no space left on device, mkdir '/home/somebody/.npm/_cacache/content-v2/sha512/b5/17'
#13 10.41 npm ERR! nospc There appears to be insufficient space on your system to finish.
#13 10.41 npm ERR! nospc Clear up some disk space and try again.
#13 10.42 
#13 10.42 npm ERR! A complete log of this run can be found in:
#13 10.42 npm ERR!     /home/somebody/.npm/_logs/2023-06-27T21_24_10_261Z-debug-0.log
------
error: failed to solve: process "/bin/sh -c npm install" did not complete successfully: exit code: 228
2023-06-27 21:24:20,030 Command '['buildctl', '--timeout', '3600', '--wait-for-ready', '3600', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.16.0', '--local', 'context=.', '--local', 'dockerfile=.', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=test', '--opt', 'run-variant=true', '--opt', 'entrypoint-args=[]']' returned non-zero exit status 1.

Event Timeline

Do we need to add the auto-cleaner script to the GitLab runners like we do for the Jenkins agents?

hrm, we have the docker-resource-monitor service (which runs a gc of the local images) running on GitLab runners.

But I see the docker volume almost full on this runner:

thcipriani@runner-1026:/$ df -h -xtmpfs
Filesystem      Size  Used Avail Use% Mounted on
udev             12G     0   12G   0% /dev
/dev/sdb1        20G  8.6G   11G  46% /
/dev/sdb15      124M   11M  114M   9% /boot/efi
/dev/sda         40G   31G  6.3G  84% /var/lib/docker

However, the gc seems to do its job

thcipriani@runner-1026:/$ sudo docker images ls -a
REPOSITORY   TAG       IMAGE ID   CREATED   SIZE
thcipriani@runner-1026:/$ sudo docker ps -a
CONTAINER ID   IMAGE                                                                         COMMAND                  CREATED        STATUS        PORTS     NAMES
dcc8d3ff771c   docker-registry.wikimedia.org/repos/releng/buildkit:wmf-v0.11-6               "/usr/local/bin/entr…"   2 months ago   Up 2 months             buildkitd
fe3c173d1b68   docker-registry.wikimedia.org/repos/releng/docker-gc/resource-monitor:1.1.2   "./docker-resource-a…"   3 months ago   Up 3 months             docker-resource-monitor

EDIT: bah, sorry, that should have been docker image ls -a rather than docker images ls -a, because these UIs are great. Turns out there are lots of old images here:

root@runner-1026:/var/lib/docker/overlay2# #EDIT
root@runner-1026:/var/lib/docker/overlay2# docker image ls -a
REPOSITORY                                                              TAG               IMAGE ID       CREATED         SIZE
rustlang/rust                                                           nightly           6f552fe4b16d   17 hours ago    2.02GB
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper       x86_64-dcfb4b66   48974578ca1e   34 hours ago    64.6MB
rust                                                                    latest            bcf048a21536   2 weeks ago     1.42GB
docker-registry.wikimedia.org/repos/releng/kokkuri                      v1.6.0            d7e437679fd0   6 weeks ago     302MB
docker-registry.wikimedia.org/releng/node16-test-browser                0.2.0             f8093bcf8c88   2 months ago    1.49GB
docker-registry.wikimedia.org/releng/node16-test-browser                latest            f8093bcf8c88   2 months ago    1.49GB
docker-registry.wikimedia.org/repos/releng/docker-gc/resource-monitor   1.1.2             7e1a2968e43b   3 months ago    178MB
docker-registry.wikimedia.org/repos/releng/docker-gc/docker-gc          1.1.2             ffedf86f9eea   3 months ago    172MB
docker-registry.wikimedia.org/repos/releng/buildkit                     wmf-v0.11-6       d05e436b1ada   3 months ago    317MB
docker-registry.wikimedia.org/wikimedia-buster                          latest            d84c30836955   4 months ago    69.3MB
docker-registry.wikimedia.org/buster                                    <none>            d84c30836955   4 months ago    69.3MB
docker-registry.tools.wmflabs.org/cloud-cicd-py39bullseye-tox           latest            55b30359063f   5 months ago    315MB
docker-registry.wikimedia.org/releng/maven-java8                        1.0.0-s1          14aeb25a5250   21 months ago   339MB
docker-registry.wikimedia.org/releng/maven-java8                        latest            14aeb25a5250   21 months ago   339MB

But there is a whole lot left behind here:

root@runner-1026:/var/lib/docker# du -chs ./*
92K     ./buildkit
76K     ./containers
21M     ./image
16K     ./lost+found
92K     ./network
57G     ./overlay2
16K     ./plugins
4.0K    ./runtimes
4.0K    ./swarm
4.0K    ./tmp
4.0K    ./trust
220K    ./volumes
57G     total

Seems like some pruning is in order here. Also, I wonder how feasible it would be to take these runners offline once disk usage goes above a certain threshold, like we do with the Jenkins agents.
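
A rough sketch of what such a threshold check could look like. This is only an illustration, not what the Jenkins agents actually run: the function names, the 85% default threshold, and the default path are assumptions. The percentage is computed as used / (used + available), which is how df derives its Use% column.

```python
import shutil


def used_percent(used: int, free: int) -> float:
    """Return disk usage as a percentage, computed the way df does:
    used / (used + space available to unprivileged users)."""
    if used + free <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * used / (used + free)


def should_go_offline(path="/var/lib/docker",
                      threshold_percent=85.0,
                      disk_usage=shutil.disk_usage) -> bool:
    """Hypothetical check: True when `path` is at or above the threshold.

    `disk_usage` is injectable so the logic can be exercised without
    touching a real filesystem; it must return an object with
    `used` and `free` attributes, like shutil.disk_usage does.
    """
    usage = disk_usage(path)
    return used_percent(usage.used, usage.free) >= threshold_percent


if __name__ == "__main__":
    # Check the root filesystem so the example runs on any machine.
    print(should_go_offline("/"))
```

How a runner would actually be paused or unregistered once the check returns True is left out here, since that depends on how the runners are managed.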

/var/lib/docker/overlay2 also holds the BuildKit build cache, which is not exposed by the plain docker command (afaik). So one has to use sudo docker buildx du to inspect it, and the BuildKit cache can be pruned with: sudo docker buildx prune --force.

(I found out about it while investigating integration instances disk usage T338317#8909411 , docker buildx du --verbose gives details about the container).
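
If this ends up being scripted, a minimal sketch could wrap the two commands mentioned above. The function name and the injectable `run` parameter are assumptions made for illustration; only the docker buildx du and docker buildx prune --force invocations come from this task.

```python
import subprocess
from typing import List

# Commands taken from the comment above.
DU_CMD = ["sudo", "docker", "buildx", "du"]
PRUNE_CMD = ["sudo", "docker", "buildx", "prune", "--force"]


def prune_buildkit_cache(run=subprocess.run) -> List[List[str]]:
    """Hypothetical helper: report cache usage, prune, report again.

    `run` defaults to subprocess.run and is injectable so the sequence
    can be checked without a Docker daemon. With check=True a failing
    command raises CalledProcessError and stops the sequence.
    Returns the list of commands that were executed.
    """
    executed = []
    for cmd in (DU_CMD, PRUNE_CMD, DU_CMD):
        run(cmd, check=True)
        executed.append(list(cmd))
    return executed
```

Running du before and after the prune makes it easy to see from the output how much space was reclaimed.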

Ran into this several times today on runner-1026.gitlab-runners.eqiad1.wikimedia.cloud and runner-1028.gitlab-runners.eqiad1.wikimedia.cloud.

The Jenkins agents now have a 90G disk via the flavor g3.cores8.ram24.disk20.ephemeral90.4xiops. I rebuilt them all last week and they no longer suffer from disk space issues.

The 90G disk space is partitioned as:

sdb                        8:16   0   90G  0 disk 
├─vd-docker              254:0    0   45G  0 lvm  /var/lib/docker
└─vd-second--local--disk 254:1    0   45G  0 lvm  /srv

Looks like the GitLab runners are using 40G instances and could benefit from using the same larger ephemeral disk.