
Selenium quibble jobs have a huge cache overloading the CI agents' small disks
Closed, Resolved · Public

Description

For the past few weeks, we have noticed CI agents being disconnected due to /srv being full. @dancy added collection of each job workspace's disk consumption on all the agents, which is now stored in Graphite (6555facef5bdfb6817e9889cf1ce6b4a48005824, T258626).

The top offenders can be seen at https://graphite.wikimedia.org/S/p, which points to the Quibble selenium jobs. They run npm install for each extension/skin and fill up the cache. @dancy found out that XDG_CACHE_HOME gets filled with directories named lighthouse.XXXX which keep piling up.

We need to find out what that lighthouse system is and have it either stop caching or stop varying the cache directory name.

Event Timeline

The files are the browser's page/session cache. I guess they correspond to --user-data-dir. I need to reproduce it locally in order to investigate.

Turns out I had deleted a bunch of lighthouse files on July 3rd; from the root user's shell history:

389  2020-07-03 14:38:27 cd /srv/jenkins-workspace/caches/
390  2020-07-03 14:38:29 cd castor-mw-ext-and-skins/

395  2020-07-03 14:38:58 find . -maxdepth 3 -name 'lighthouse*' -delete

396  2020-07-03 14:40:53 rm -fR master/quibble-vendor-mysql-php72-docker/lighthouse*
399  2020-07-03 14:41:31 rm -fR master/quibble-vendor-selenium-docker/lighthouse*
401  2020-07-03 14:41:43 rm -fR ./master/wmf-quibble-selenium-php72-docker/lighthouse*
404  2020-07-03 14:42:11 rm -fR ./master/quibble-vendor-mysql-php74-docker/lighthouse*
406  2020-07-03 14:42:20 rm -fR ./master/quibble-vendor-mysql-php73-docker/lighthouse*
413  2020-07-03 14:43:49 rm -fR ./mediawiki-vendor/master/wmf-quibble-selenium-php72-docker/lighthouse.*
415  2020-07-03 14:44:00 rm -fR ./mediawiki-core/master/wmf-quibble-selenium-php72-docker/lighthouse*

Breakdown of the caches containing lighthouse* for the mediawiki extensions and skins master branches:

sudo find /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master -name 'lighthouse*'|cut -d/ -f-3|sort|uniq -c|sort -n
      1 castor-mw-ext-and-skins/master/selenium-daily-beta-Math
      1 castor-mw-ext-and-skins/master/selenium-daily-beta-Newsletter
      1 castor-mw-ext-and-skins/master/selenium-daily-beta-RelatedArticles
      2 castor-mw-ext-and-skins/master/selenium-daily-beta-Echo
     18 castor-mw-ext-and-skins/master/quibble-composer-selenium-docker
    148 castor-mw-ext-and-skins/master/quibble-vendor-mysql-php73-docker
    178 castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-docker
    192 castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-docker
   3639 castor-mw-ext-and-skins/master/quibble-vendor-selenium-docker
   5281 castor-mw-ext-and-skins/master/wmf-quibble-selenium-php72-docker

Crawling all the caches:

find /srv/jenkins-workspace/caches -maxdepth 4 -name 'lighthouse.*'|cut -d/  -f-7|uniq -c
   2056 /srv/jenkins-workspace/caches/mediawiki-core/master/wmf-quibble-selenium-php72-docker
    173 /srv/jenkins-workspace/caches/mediawiki-vendor/master/wmf-quibble-selenium-php72-docker
      2 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Echo
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Math
     18 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-composer-selenium-docker
    148 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php73-docker
    178 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-docker
   5298 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php72-docker
   3656 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-selenium-docker
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Newsletter
    192 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-docker
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-RelatedArticles

Nothing in the release branches; the obsolete wmf branch caches got deleted.

The lighthouse node module has been introduced as part of the upgrade to Webdriver.io 5/6, through @wdio/devtools-service.
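
A quick way to confirm that chain from a mediawiki/core checkout (output omitted here since it depends on the installed tree):

$ npm ls lighthouse @wdio/devtools-service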

Mentioned in SAL (#wikimedia-releng) [2020-07-28T09:41:54Z] <hashar> Removed bunch of chrome lighthouse cache files on castor instance # T258972

The lighthouse.<something> directories seem to be created via mktemp, and in mediawiki/core I find:

$ grep -R 'lighthouse\.XXXX' node_modules/
node_modules/chrome-launcher/dist/src/utils.js:    return child_process_1.execSync('mktemp -d -t lighthouse.XXXXXXX').toString().trim();
node_modules/chrome-launcher/dist/utils.js:    return child_process_1.execSync('mktemp -d -t lighthouse.XXXXXXX').toString().trim();
node_modules/lighthouse/node_modules/chrome-launcher/dist/utils.js:    return child_process_1.execSync('mktemp -d -t lighthouse.XXXXXXX').toString().trim();
node_modules/lighthouse/docs/headless-chrome.md:TMP_PROFILE_DIR=$(mktemp -d -t lighthouse.XXXXXXXXXX)

I have absolutely no idea how it ends up being invoked, but I found:

node_modules/chrome-launcher/dist/src/utils.js
function makeUnixTmpDir() {
    return child_process_1.execSync('mktemp -d -t lighthouse.XXXXXXX').toString().trim();
}

Which is the low-level Linux implementation of a generic makeTmpDir. It is invoked at:

node_modules/chrome-launcher/dist/src/chrome-launcher.js
class Launcher {
...
    prepare() {
        const platform = utils_1.getPlatform();
        if (!_SUPPORTED_PLATFORMS.has(platform)) {
            throw new utils_1.UnsupportedPlatformError();
        }
        this.userDataDir = this.userDataDir || this.makeTmpDir();
        this.outFile = this.fs.openSync(`${this.userDataDir}/chrome-out.log`, 'a');
        this.errFile = this.fs.openSync(`${this.userDataDir}/chrome-err.log`, 'a');
        // fix for Node4
        // you can't pass a fd to fs.writeFileSync
        this.pidFile = `${this.userDataDir}/chrome.pid`;
        log.verbose('ChromeLauncher', `created ${this.userDataDir}`);
        this.tmpDirandPidFileReady = true;
    }

So I assume it is a method performing the preliminary steps before running Chrome, notably creating a temporary directory for the Chrome user data dir. mktemp is supposed to create it under TMPDIR, yet that is clearly not where the directories end up, and that should be investigated.
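
For reference, GNU mktemp does honour TMPDIR for -t templates in a plain shell, which makes the behaviour above all the more puzzling. A quick local check (the TMPDIR value and the random suffixes are illustrative):

$ mkdir -p /tmp/quibble-tmp
$ TMPDIR=/tmp/quibble-tmp mktemp -d -t lighthouse.XXXXXXX
/tmp/quibble-tmp/lighthouse.kR2Xb9Q
$ env -u TMPDIR mktemp -d -t lighthouse.XXXXXXX
/tmp/lighthouse.Yw31sVp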

The directory is then supposed to be deleted:

node_modules/chrome-launcher/dist/src/chrome-launcher.js
class Launcher {
...
    kill() {
        return new Promise((resolve, reject) => {
            if (this.chrome) {
                this.chrome.on('close', () => {
                    delete this.chrome;
                    this.destroyTmp().then(resolve);
                });
                log.log('ChromeLauncher', `Killing Chrome instance ${this.chrome.pid}`);
...
    destroyTmp() { 
        return new Promise(resolve => {
            // Only clean up the tmp dir if we created it.
            if (this.userDataDir === undefined || this.opts.userDataDir !== undefined) {
                return resolve();
            }
            if (this.outFile) {
                this.fs.closeSync(this.outFile);
                delete this.outFile;
            }
            if (this.errFile) {
                this.fs.closeSync(this.errFile);
                delete this.errFile;
            }
            this.rimraf(this.userDataDir, () => resolve());
        });
    }

But that is not always the case. The temporary directory is kept around and piles up in castor :-\

There is a related issue, T220948#5111002:

When the profile is something like /tmp/.org.chromium.Chromium.XXX and the system configuration dir is an ancestor of it (eg /tmp), then the user cache dir is the system cache dir (/cache) plus the relative path from the system config dir (/tmp) to the profile dir (/tmp/.org.chromium.Chromium.XXX). It is thus /cache + .org.chromium.Chromium.XXX.

This got worked around by setting XDG_CONFIG_HOME=/tmp/xdg-config-home in our CI containers, which is not an ancestor of the temporary profile /tmp/.org.chromium.Chromium.XXX, so the user cache is stored there (and not under /cache).

I suspect the same behavior leads those lighthouse temporary user data profiles to end up under /cache, potentially because our XDG_CONFIG_HOME=/tmp/xdg-config-home is not properly passed down to Chromium somewhere in the stack.
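
A rough sketch of the path arithmetic described above (the profile name is made up):

$ profile=/tmp/.org.chromium.Chromium.AbCdEf
$ realpath --relative-to=/tmp "$profile"    # XDG_CONFIG_HOME=/tmp is an ancestor of the profile
.org.chromium.Chromium.AbCdEf

The user cache dir then becomes /cache/.org.chromium.Chromium.AbCdEf, which ends up in the castor caches. With XDG_CONFIG_HOME=/tmp/xdg-config-home the config dir is no longer an ancestor of the profile, so that relocation does not apply.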

Also it looks like T220948 (cache filled with .org.chromium.Chromium.XXXXXX) is occurring again and might be related:

$ (cd /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master; find -maxdepth 2 -name '.org.*' -execdir pwd \; |sort|uniq -c)
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-CirrusSearch
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-ContentTranslation
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-ElectronPdfService
      2 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-GrowthExperiments
      2 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-MobileFrontend
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-ORES
      2 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Popups
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-RevisionSlider
      7 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-TwoColConflict
      4 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Wikibase
     12 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-WikibaseLexeme

Lighthouse files since the last cleanup:

$ (cd /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master; find -maxdepth 2 -name '*lighthouse*' -execdir pwd \; |sort|uniq -c)
      6 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-composer-selenium-docker
    128 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php72-docker
    128 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php73-docker
    128 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-docker
    541 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-selenium-docker
      2 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Echo
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Math
     10 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Minerva
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-Newsletter
      1 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/selenium-daily-beta-RelatedArticles
   6403 /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php72-docker

wmf-quibble-selenium-php72-docker has 6403 occurrences and is one of the most frequently triggered jobs on our CI, hence the large number.

quibble-vendor-selenium-docker runs the Selenium jobs.

The jobs quibble-vendor-mysql-{php72,php73,php74}-docker all have 128 occurrences, so the issue is most probably reproducible.

Looking at the cache saved for quibble-vendor-mysql-php74-docker:

$ ls -rtla /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-docker|tail -n18
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 29 23:21 lighthouse.VBw1Xkd
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:06 lighthouse.Ou917L4
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:06 lighthouse.zBXk6da
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:07 lighthouse.xz6OeMp
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:07 lighthouse.rhBbEOZ
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:07 lighthouse.BleO2JX
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:07 lighthouse.Mn7w8xY
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:08 lighthouse.JHjNC3q
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:08 lighthouse.jvQfOl7
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:08 lighthouse.kToDYfc
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:08 lighthouse.Zu7Q8ws
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:10 lighthouse.Lpc5tb4
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:10 lighthouse.7Fv1ppU
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:11 lighthouse.BX2G1dg
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:11 lighthouse.5wF4TdZ
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:11 lighthouse.kG4LLCS
drwx--S---   3 jenkins-deploy wikidev  4096 Jul 30 10:11 lighthouse.UuPzVKF

That is 16 entries created within a few minutes on the morning of July 30. Looking at the build history for that job:

#6446
Jul 30, 2020 4:43 PM

Triggered by change: 617492,1
Project: mediawiki/skins/Vector
Branch: master
#6445
Jul 30, 2020 10:02 AM

Triggered by change: 615813,10
Project: mediawiki/skins/MinervaNeue
Branch: ...
#6444
Jul 29, 2020 11:28 PM

Triggered by change: 617287,1
Project: mediawiki/skins/Vector
Branch: master

Which seems to indicate the issue is somewhere in the MinervaNeue skin repository. Looking further back in the build history, the lighthouse directories' timestamps match builds originating from MinervaNeue.

That is exactly the same issue as T220948, which we have had for a while.

Our base container releng/ci-stretch has XDG_CONFIG_HOME=/tmp. Quibble creates a directory under /tmp and sets TMPDIR to point to it.

With webdriver.io 4, we used chromedriver, which sets the Chromium user data dir to /tmp/.org.chromium.Chromium.XXX. That led to T220948 and was worked around by setting XDG_CONFIG_HOME=/tmp/xdg-config-home, but for some reason that was only done for the releng/quibble-bundle container.

With webdriver.io 5/6, chromedriver is no longer used (the devtools protocol is used instead) and the chrome-launcher npm package sets the user data dir to a directory created under TMPDIR, which leads to the same issue.

I am not quite sure why it did not happen with webdriver.io 4; maybe it just never got caught until this point. The fix is to have the images set XDG_CONFIG_HOME.
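
Once the rebuilt images are in place, a quick sanity check could be (the image tag is a placeholder, and the expected value assumes the same /tmp/xdg-config-home location as the earlier workaround):

$ docker run --rm --entrypoint=env docker-registry.wikimedia.org/releng/quibble-stretch-php74:<tag> | grep XDG_CONFIG_HOME
XDG_CONFIG_HOME=/tmp/xdg-config-home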

Change 618079 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: point XDG_CONFIG_HOME to a subdirectory of /tmp

https://gerrit.wikimedia.org/r/618079

I have manually deleted all the lighthouse directories on castor. I will rebuild the images and update the Jenkins jobs tonight, and then I guess we can claim this is fixed.

Change 618079 merged by jenkins-bot:
[integration/config@master] docker: point XDG_CONFIG_HOME to a subdirectory of /tmp

https://gerrit.wikimedia.org/r/618079

Change 618144 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: use image to prevent chromium cache pollution

https://gerrit.wikimedia.org/r/618144

Change 618144 merged by jenkins-bot:
[integration/config@master] jjb: use image to prevent chromium cache pollution

https://gerrit.wikimedia.org/r/618144

Hopefully it should be good now. We can check tomorrow whether any lighthouse directories are still showing up and, if not, resolve this task.

The way I reproduced the issue:

mkdir T258972 && cd T258972

install -m 777 -d cover cache log src tmp xdg-config-home

rm -f src/LocalSettings.php
docker run \
	-e ZUUL_BRANCH=master \
	-e ZUUL_URL=https://gerrit.wikimedia.org/r \
	-e ZUUL_PROJECT=mediawiki/skins/MinervaNeue \
	-e CODEHEALTH=1 \
	-e TMPDIR=/tmp \
	-e XDG_CONFIG_HOME=/xdg-config-home \
	--volume /home/hashar/projects:/srv/git:ro \
	--volume "$(pwd)"/cache:/cache \
	--volume "$(pwd)"/cover:/workspace/cover \
	--volume "$(pwd)"/log:/workspace/log \
	--volume "$(pwd)"/src:/workspace/src \
	--volume "$(pwd)"/tmp:/tmp \
	--volume "$(pwd)"/xdg-config-home:/xdg-config-home \
	--security-opt seccomp=unconfined \
	--init --rm \
	docker-registry.wikimedia.org/releng/quibble-stretch-php74:0.0.44-s1 \
		--packages-source composer --git-parallel=8 \
		--run=selenium
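
After such a run against the old image, any leaked profile directories show up in the mounted cache directory; a rough check (names and counts will vary):

$ find cache -maxdepth 2 \( -name 'lighthouse.*' -o -name '.org.chromium.Chromium.*' \)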

I have deleted the remaining .org.chromium.Chromium.* and lighthouse.* files. Nothing new has appeared since the jobs got switched to the newer images.

This bug is exactly the same as T220948, whose fix did not get applied to all the images.

The big unknown is why it only started happening recently, which I assume is related to the switch to webdriver.io 5/6.