
Jenkins instances /srv suddenly becomes full causing it to be disconnected
Closed, Resolved · Public

Description

For at least a few days now, our maintenance-disconnect-full-disks script has sometimes put slaves offline because their /srv partition is full. It is definitely a temporary condition, since on investigation I have never seen the partition full.

An example is https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/63370/ which ran at Apr 13, 2019 2:10:00 PM:

Checking integration-slave-docker-1050...
maintenance-disconnect-full-disks build 63370 (/: 17%)
maintenance-disconnect-full-disks build 63370 (/srv: 100%)
maintenance-disconnect-full-disks build 63370 (/var/lib/docker: 57%)
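
Since the partition is only full for a short time, one way to spot the affected builds after the fact is to scan the job's console output. The following is a minimal sketch that parses lines in the format shown above; the function names and the 95% threshold are assumptions made for illustration, not how the maintenance script itself is implemented.

```python
import re

# Matches the "(<mount point>: <N>%)" suffix of the console lines shown above.
USAGE_RE = re.compile(r"\((/[^:()]*): (\d+)%\)")


def parse_usage(lines):
    """Return {mount point: percent used} for every line carrying a usage report."""
    usage = {}
    for line in lines:
        match = USAGE_RE.search(line)
        if match:
            usage[match.group(1)] = int(match.group(2))
    return usage


def full_mounts(usage, threshold=95):
    """List mount points at or above the threshold (95 is an assumed value)."""
    return sorted(mount for mount, pct in usage.items() if pct >= threshold)


# Console output copied from build 63370.
console = [
    "Checking integration-slave-docker-1050...",
    "maintenance-disconnect-full-disks build 63370 (/: 17%)",
    "maintenance-disconnect-full-disks build 63370 (/srv: 100%)",
    "maintenance-disconnect-full-disks build 63370 (/var/lib/docker: 57%)",
]

print(full_mounts(parse_usage(console)))  # prints ['/srv']
```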

Which shows up in Grafana (via https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?orgId=1&var-project=integration ):

srv_disk_full.png (474×984 px, 31 KB)

Event Timeline

The last few builds on that host:

2019-04-13 13:49:08,614 build:quibble-vendor-mysql-hhvm-docker {u'name': u'quibble-vendor-mysql-hhvm-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/44925/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 44925, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:08,622 build:quibble-vendor-mysql-hhvm-docker {u'name': u'quibble-vendor-mysql-hhvm-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/44925/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 44925, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:10,492 build:quibble-vendor-mysql-php72-docker {u'name': u'quibble-vendor-mysql-php72-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/8940/', u'worker': u'integration-slave-docker-1050_exec-3', u'number': 8940, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:10,493 build:quibble-vendor-mysql-php72-docker {u'name': u'quibble-vendor-mysql-php72-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/8940/', u'worker': u'integration-slave-docker-1050_exec-3', u'number': 8940, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:29,659 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11451/', u'worker': u'integration-slave-docker-1050_exec-1', u'number': 11451, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:29,660 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11451/', u'worker': u'integration-slave-docker-1050_exec-1', u'number': 11451, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:31,624 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11452/', u'worker': u'integration-slave-docker-1050_exec-0', u'number': 11452, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:31,625 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11452/', u'worker': u'integration-slave-docker-1050_exec-0', u'number': 11452, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:04:17,766 build:wmf-sonar-scanner-branch-non-voting {u'name': u'wmf-sonar-scanner-branch-non-voting', u'url': u'https://integration.wikimedia.org/ci/job/wmf-sonar-scanner-branch-non-voting/1407/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 1407, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:04:17,768 build:wmf-sonar-scanner-branch-non-voting {u'name': u'wmf-sonar-scanner-branch-non-voting', u'url': u'https://integration.wikimedia.org/ci/job/wmf-sonar-scanner-branch-non-voting/1407/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 1407, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}

What I have spotted is the job mwselenium-quibble-docker taking ages to synchronize a cache:

14:01:30 Syncing...
14:01:30 rsync: failed to set times on "/cache/.": Operation not permitted (1)
14:08:38 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [generator=3.1.2]
14:08:38 
14:08:38 Done

Which must be an insane amount of data, and indeed on integration-castor03:

$ sudo du -m -s /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker
8122	/srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker

Mentioned in SAL (#wikimedia-releng) [2019-04-15T08:51:17Z] <hashar> castor: nuked /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker # T220948

The synced cache directory is filled up by Chromium, which honors the XDG_CACHE_HOME=/cache setting we have in the Docker container. Each .org.chromium.Chromium.XXXX directory has:

Default/Code Cache    V8 cache
Default/Cache         Web cache / --disk-cache-dir

Even though chromedriver does pass --user-data-dir=/tmp/.org.chromium.Chromium.XXX, the cache directories are generated based on XDG_CACHE_HOME. They thus end up under /cache, which castor saves and restores between builds.

https://chromium.googlesource.com/chromium/src/+/HEAD/docs/user_data_dir.md

On Linux, the user cache dir is derived from the profile dir as follows:

  1. Determine the system config dir. This is ~/.config, unless overridden by $XDG_CONFIG_HOME. (This step ignores $CHROME_CONFIG_HOME.)
  2. Determine the system cache dir. This is ~/.cache, unless overridden by $XDG_CACHE_HOME.
  3. If the system config dir is an ancestor of the profile dir, the user cache dir is the system cache dir plus the relative path from the system config dir to the profile dir.
  4. Otherwise, the user cache dir is the same as the profile dir.

The container has:

XDG_CACHE_HOME=/cache
XDG_CONFIG_HOME=/tmp

Thus:

1. system config dir is /tmp
2. system cache dir is /cache

The profile dir is something like /tmp/.org.chromium.Chromium.XXX

From 3. in my previous comment:

If the system config dir (/tmp) is an ancestor of the profile dir (/tmp/.org.chromium.Chromium.XXX) => True

then the user cache dir is the system cache dir (/cache) plus the relative path from the system config dir (/tmp) to the profile dir (/tmp/.org.chromium.Chromium.XXX). It is thus /cache + .org.chromium.Chromium.XXX.

The workaround is thus to move the system config dir to a subdirectory so that it is no longer an ancestor of the profile dir. E.g.:

XDG_CONFIG_HOME=/tmp/xdg-config-home
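
To double check the reasoning, here is a small sketch of the four rules quoted from the Chromium documentation, applied to the container settings before and after the workaround. The function name user_cache_dir is hypothetical and this is only a model of the documented rules, not Chromium's actual code. Note that relative_to() also succeeds when both paths are equal, a case this sketch does not treat separately.

```python
from pathlib import PurePosixPath


def user_cache_dir(profile_dir, config_home, cache_home):
    """Model of the documented Linux rule for deriving the user cache dir."""
    profile = PurePosixPath(profile_dir)
    config = PurePosixPath(config_home)
    try:
        # Rule 3: the config dir is an ancestor of the profile dir, so the
        # cache dir is the system cache dir plus the relative path.
        relative = profile.relative_to(config)
    except ValueError:
        # Rule 4: otherwise the cache dir is the profile dir itself.
        return str(profile)
    return str(PurePosixPath(cache_home) / relative)


profile = "/tmp/.org.chromium.Chromium.XXX"

# Before: XDG_CONFIG_HOME=/tmp, the cache lands under /cache and is saved.
print(user_cache_dir(profile, "/tmp", "/cache"))
# prints /cache/.org.chromium.Chromium.XXX

# After: XDG_CONFIG_HOME=/tmp/xdg-config-home, the cache stays in the profile dir.
print(user_cache_dir(profile, "/tmp/xdg-config-home", "/cache"))
# prints /tmp/.org.chromium.Chromium.XXX
```

With the config dir moved to a subdirectory, the cache stays next to the throwaway profile under /tmp and is no longer written to /cache.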

Change 503973 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: prevent Chromium cache from being saved

https://gerrit.wikimedia.org/r/503973

Change 503973 merged by jenkins-bot:
[integration/config@master] docker: prevent Chromium cache from being saved

https://gerrit.wikimedia.org/r/503973

Change 503980 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Prevent Chromium to write its caches to XDG_CACHE_HOME

https://gerrit.wikimedia.org/r/503980

Change 503980 merged by jenkins-bot:
[integration/config@master] Prevent Chromium to write its caches to XDG_CACHE_HOME

https://gerrit.wikimedia.org/r/503980

hashar claimed this task.

Change 618079 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: point XDG_CONFIG_HOME to a subdirectory of /tmp

https://gerrit.wikimedia.org/r/618079

Change 618079 merged by jenkins-bot:
[integration/config@master] docker: point XDG_CONFIG_HOME to a subdirectory of /tmp

https://gerrit.wikimedia.org/r/618079

Change 618144 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: use image to prevent chromium cache pollution

https://gerrit.wikimedia.org/r/618144

Change 618144 merged by jenkins-bot:
[integration/config@master] jjb: use image to prevent chromium cache pollution

https://gerrit.wikimedia.org/r/618144