
Jenkins instances /srv suddenly becomes full causing it to be disconnected
Closed, Resolved (Public)

Description

For at least a few days now, our maintenance-disconnect-full-disks script has sometimes put slaves offline because the /srv partition was full. It is definitely a temporary condition: whenever I investigated afterwards, I never saw the partition full.
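As a rough illustration of the kind of check involved, here is a minimal sketch of how a per-mount usage percentage can be computed and compared against a limit. It is not the actual maintenance-disconnect-full-disks implementation; the function names and the 95% threshold are assumptions made for this example. The percentage is computed the way df reports it, as used / (used + available).

```python
import shutil


def percent_used(path):
    """Return the used space of the filesystem holding `path`, in percent.

    Mirrors df's "Use%" definition: used / (used + available to non-root).
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        # Some virtual filesystems report no blocks at all.
        return 0.0
    return 100.0 * usage.used / denominator


def full_mounts(paths, threshold=95.0):
    """Return (path, percent) pairs for mounts at or above `threshold`.

    The threshold value is hypothetical, not taken from the real job.
    """
    result = []
    for path in paths:
        pct = percent_used(path)
        if pct >= threshold:
            result.append((path, pct))
    return result


if __name__ == "__main__":
    # Paths taken from the job output in this task; adjust for the host.
    for mount, pct in full_mounts(["/", "/srv", "/var/lib/docker"], threshold=0.0):
        print(f"{mount}: {pct:.0f}%")
```

A check like this only sees the state at the instant it runs, which matches the symptom here: the partition can be at 100% when the job fires and back to normal by the time someone logs in.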

An example is https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/63370/ which ran at Apr 13, 2019 2:10:00 PM:

Checking integration-slave-docker-1050...
maintenance-disconnect-full-disks build 63370 (/: 17%)
maintenance-disconnect-full-disks build 63370 (/srv: 100%)
maintenance-disconnect-full-disks build 63370 (/var/lib/docker: 57%)

This also shows up in Grafana (via https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?orgId=1&var-project=integration).

Event Timeline

hashar created this task. Apr 15 2019, 8:34 AM
Restricted Application added a subscriber: Aklapper. Apr 15 2019, 8:34 AM

The last few builds on that host:

2019-04-13 13:49:08,614 build:quibble-vendor-mysql-hhvm-docker {u'name': u'quibble-vendor-mysql-hhvm-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/44925/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 44925, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:08,622 build:quibble-vendor-mysql-hhvm-docker {u'name': u'quibble-vendor-mysql-hhvm-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/44925/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 44925, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:10,492 build:quibble-vendor-mysql-php72-docker {u'name': u'quibble-vendor-mysql-php72-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/8940/', u'worker': u'integration-slave-docker-1050_exec-3', u'number': 8940, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 13:49:10,493 build:quibble-vendor-mysql-php72-docker {u'name': u'quibble-vendor-mysql-php72-docker', u'url': u'https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/8940/', u'worker': u'integration-slave-docker-1050_exec-3', u'number': 8940, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:29,659 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11451/', u'worker': u'integration-slave-docker-1050_exec-1', u'number': 11451, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:29,660 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11451/', u'worker': u'integration-slave-docker-1050_exec-1', u'number': 11451, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:31,624 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11452/', u'worker': u'integration-slave-docker-1050_exec-0', u'number': 11452, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:01:31,625 build:mwselenium-quibble-docker {u'name': u'mwselenium-quibble-docker', u'url': u'https://integration.wikimedia.org/ci/job/mwselenium-quibble-docker/11452/', u'worker': u'integration-slave-docker-1050_exec-0', u'number': 11452, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:04:17,766 build:wmf-sonar-scanner-branch-non-voting {u'name': u'wmf-sonar-scanner-branch-non-voting', u'url': u'https://integration.wikimedia.org/ci/job/wmf-sonar-scanner-branch-non-voting/1407/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 1407, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}
2019-04-13 14:04:17,768 build:wmf-sonar-scanner-branch-non-voting {u'name': u'wmf-sonar-scanner-branch-non-voting', u'url': u'https://integration.wikimedia.org/ci/job/wmf-sonar-scanner-branch-non-voting/1407/', u'worker': u'integration-slave-docker-1050_exec-2', u'number': 1407, u'node_name': u'', u'manager': u'172.17.0.1', u'node_labels': [u'master']}

What I have spotted is the job mwselenium-quibble-docker taking ages to synchronize a cache:

14:01:30 Syncing...
14:01:30 rsync: failed to set times on "/cache/.": Operation not permitted (1)
14:08:38 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [generator=3.1.2]
14:08:38 
14:08:38 Done

Which must be an insane amount of data, and indeed on integration-castor03:

$ sudo du -m -s /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker
8122	/srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker

Mentioned in SAL (#wikimedia-releng) [2019-04-15T08:51:17Z] <hashar> castor: nuked /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwselenium-quibble-docker # T220948

The synced cache directory is filled up by Chromium, which honors the XDG_CACHE_HOME=/cache that we have in the Docker container. Each .org.chromium.Chromium.XXXX directory has:

Default/Code Cache: V8 cache
Default/Cache: Web cache / --disk-cache-dir

Even if chromedriver does pass --user-data-dir=/tmp/.org.chromium.Chromium.XXX, the cache directories are generated based on XDG_CACHE_HOME and are thus shared between sessions and saved by castor.

https://chromium.googlesource.com/chromium/src/+/HEAD/docs/user_data_dir.md

On Linux, the user cache dir is derived from the profile dir as follows:

  1. Determine the system config dir. This is ~/.config, unless overridden by $XDG_CONFIG_HOME. (This step ignores $CHROME_CONFIG_HOME.)
  2. Determine the system cache dir. This is ~/.cache, unless overridden by $XDG_CACHE_HOME.
  3. If the system config dir is an ancestor of the profile dir, the user cache dir is the system cache dir plus the relative path from the system config dir to the profile dir.
  4. Otherwise, the user cache dir is the same as the profile dir.

The container has:

XDG_CACHE_HOME=/cache
XDG_CONFIG_HOME=/tmp

Thus:

1. system config dir is /tmp
2. system cache dir is /cache

The profile dir is something like /tmp/.org.chromium.Chromium.XXX

Applying rule 3 from my previous comment:

If the system config dir (/tmp) is an ancestor of the profile dir (/tmp/.org.chromium.Chromium.XXX) => True

then the user cache dir is the system cache dir (/cache) plus the relative path from the system config dir (/tmp) to the profile dir (/tmp/.org.chromium.Chromium.XXX). It is thus /cache + .org.chromium.Chromium.XXX.

The workaround is thus to move the system config dir to a subdirectory so that it is no longer an ancestor of the profile dir, e.g.:

XDG_CONFIG_HOME=/tmp/xdg-config-home
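
The derivation rules quoted from the Chromium documentation, and the effect of the workaround, can be checked with a small sketch. This is only a model of the four documented steps, not Chromium's actual code; the function name user_cache_dir and the default home directory are assumptions made for the example.

```python
from pathlib import PurePosixPath


def user_cache_dir(profile_dir, env, home="/root"):
    """Model of the documented Linux rules for Chromium's user cache dir.

    `home` is a hypothetical default used only when the XDG variables
    are not set.
    """
    # Step 1: system config dir is ~/.config unless XDG_CONFIG_HOME is set.
    config_home = PurePosixPath(
        env.get("XDG_CONFIG_HOME") or PurePosixPath(home) / ".config"
    )
    # Step 2: system cache dir is ~/.cache unless XDG_CACHE_HOME is set.
    cache_home = PurePosixPath(
        env.get("XDG_CACHE_HOME") or PurePosixPath(home) / ".cache"
    )
    profile = PurePosixPath(profile_dir)
    # Step 3: if the config dir is an ancestor of the profile dir, the cache
    # dir is the cache home plus the relative path to the profile dir.
    if config_home in profile.parents:
        return str(cache_home / profile.relative_to(config_home))
    # Step 4: otherwise the cache lives inside the profile dir itself.
    return str(profile)


profile = "/tmp/.org.chromium.Chromium.XXX"

# Container environment before the fix: cache ends up under /cache.
before = {"XDG_CACHE_HOME": "/cache", "XDG_CONFIG_HOME": "/tmp"}
print(user_cache_dir(profile, before))

# With the workaround: /tmp/xdg-config-home is not an ancestor of the profile.
after = {"XDG_CACHE_HOME": "/cache", "XDG_CONFIG_HOME": "/tmp/xdg-config-home"}
print(user_cache_dir(profile, after))
```

With the original environment the model yields /cache/.org.chromium.Chromium.XXX, which is what castor was saving. With the workaround it falls through to step 4 and the cache stays in the throwaway profile dir under /tmp.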

Change 503973 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: prevent Chromium cache from being saved

https://gerrit.wikimedia.org/r/503973

Change 503973 merged by jenkins-bot:
[integration/config@master] docker: prevent Chromium cache from being saved

https://gerrit.wikimedia.org/r/503973

Change 503980 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Prevent Chromium to write its caches to XDG_CACHE_HOME

https://gerrit.wikimedia.org/r/503980

Change 503980 merged by jenkins-bot:
[integration/config@master] Prevent Chromium to write its caches to XDG_CACHE_HOME

https://gerrit.wikimedia.org/r/503980

hashar closed this task as Resolved. Apr 15 2019, 12:22 PM
hashar claimed this task.