
Jenkins: lanthanum/gallium tmpfs are filling up with stale tmp files
Closed, ResolvedPublic

Description

Every few days it is going critical. The last two criticals were:

  • September 17 19:00
  • September 22 11:20

$ df -h
..
tmpfs 512M 505M 7.8M 99% /var/lib/jenkins-slave/tmpfs
..

Example contents:

[18:25 UTC] krinkle at lanthanum.eqiad.wmnet in /var/lib/jenkins-slave/tmpfs
$ l
mediawiki-core-extensions-integration/
mediawiki-core-install-sqlite/
mediawiki-core-phpunit-api/
mediawiki-core-phpunit-databaseless/
mediawiki-core-phpunit-misc/
mediawiki-core-regression-REL1_23/
mwext-Flow-qunit/
mwext-WikimediaEvents-testextension/
parsoidsvc-php-parsertests/

mediawiki-core-regression-master:
build7546.sqlite
build7547.sqlite
build7550.sqlite

mediawiki-vendor-integration:
total 21M
MW_PHPUnit_ExifRotationTest_pCtdaJ/
MW_PHPUnit_TextPassDumperTest_Cpe5LO/
MW_PHPUnit_TextPassDumperTest_J1dvmI
MW_PHPUnit_TextPassDumperTest_QmhN3D
MW_PHPUnit_TextPassDumperTest_rYX7QZ/
277K Sep 22 17:26 build1787.sqlite
291K Sep 22 17:35 build1788.sqlite
291K Sep 22 17:42 build1793.sqlite
291K Sep 22 17:54 build1798.sqlite
291K Sep 22 18:02 build1809.sqlite
265K Sep 22 18:03 build1812.sqlite
271K Sep 22 18:05 build1813.sqlite
291K Sep 22 18:12 build1814.sqlite
291K Sep 22 18:25 build1819.sqlite
4.1M Aug 20 15:08 mw-A8c7xJ
4.1M Aug 20 15:12 mw-BxZ3Ez
4.1M Aug 26 13:42 mw-R4MZCN
4.1M Aug 26 13:42 mw-tbdCvd
0 Sep 20 18:21 transform_f2c5b84944be-1.jpg

I've purged a bunch of files for now, possibly broke a few currently running builds.

Problems:

  • The tmpfs partition is way too small (~ 500MB).
  • Stuff isn't being purged.
  • These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
  • These are files only needed for the duration of the test and should be removed right after a test has run.

Version: wmf-deployment
Severity: normal
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=8582

Details

Reference
bz71128

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:45 AM
bzimport set Reference to bz71128.
  • The tmpfs partition is way too small (~ 500MB).

/var/lib/jenkins/tmpfs is only 512MB because that is a tmpfs, hence it consumes RAM.

  • Stuff isn't being purged.

At least sqlite files are purged since https://gerrit.wikimedia.org/r/#/c/102149/ :

mw-install-sqlite.sh:find "$SQLITE_DIR" -type f -name '*.sqlite' -mmin +60 -delete
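
A minimal sketch of how that age-based purge behaves, assuming GNU find and touch. The scratch directory and file names are made up for illustration and stand in for the real $SQLITE_DIR.

```shell
#!/bin/bash
set -eu

# Hypothetical scratch directory standing in for $SQLITE_DIR.
SQLITE_DIR="$(mktemp -d)"
trap 'rm -rf -- "$SQLITE_DIR"' EXIT

# One database last modified two hours ago, one modified just now.
touch -d '2 hours ago' "$SQLITE_DIR/build1.sqlite"
touch "$SQLITE_DIR/build2.sqlite"

# Same expression as in mw-install-sqlite.sh: delete *.sqlite files
# whose last modification is more than 60 minutes old.
find "$SQLITE_DIR" -type f -name '*.sqlite' -mmin +60 -delete

# Only the recently modified database is left.
ls "$SQLITE_DIR"
```

Note that the purge only runs when the script itself runs, so files from a burst of builds stay around until a later build triggers it.
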
  • These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
  • These are files only needed for the duration of the test and should be removed right after a test has run.

Seems that is covered by bug 68563 "Jenkins: point TMP/TEMP to workspace and delete it after build completion".
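
To illustrate the idea behind that bug (not what was actually implemented for it), here is a rough sketch of a wrapper that points TMPDIR at a fresh directory and removes it afterwards. The function name is hypothetical, and it assumes the tools under test honour TMPDIR.

```shell
#!/bin/bash
set -eu

# Run a command with TMPDIR pointing at a fresh directory, then remove
# that directory whether the command succeeded or failed.
run_with_scratch_tmpdir() {
    local base scratch status=0
    # WORKSPACE is exported by Jenkins; fall back to /tmp outside of it.
    base="${WORKSPACE:-/tmp}"
    scratch="$(mktemp -d "$base/scratch.XXXXXX")"
    TMPDIR="$scratch" "$@" || status=$?
    rm -rf -- "$scratch"
    return "$status"
}

# Example: the command sees the scratch directory through TMPDIR.
run_with_scratch_tmpdir sh -c 'touch "$TMPDIR/build.sqlite"; ls "$TMPDIR"'
```

The exit status of the wrapped command is passed through, so a failing test still fails the build after its temporary files are gone.
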

Looking at gallium, the main offenders are the qunit jobs: each consumes ~ 7MB and we have ten of them for mediawiki-core-qunit. It seems we had a surge of tests running over an hour.
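
For reference, a sketch of one way to find the biggest consumers in that directory. It assumes GNU find, du and sort; the function name is made up, and TMPFS_DIR defaults to the path shown in the description.

```shell
#!/bin/bash
set -eu

# Directory to inspect; defaults to the tmpfs mount from the description.
TMPFS_DIR="${TMPFS_DIR:-/var/lib/jenkins-slave/tmpfs}"

# Print up to ten first-level entries of a directory, largest first.
largest_entries() {
    local dir="$1"
    # du -sk: one total per entry, in KiB; sort numerically, biggest first.
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n 10
}

# Only report when the directory exists on this host.
if [ -d "$TMPFS_DIR" ]; then
    largest_entries "$TMPFS_DIR"
fi
```
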

The find -mmin +60 is pretty lame. Since then I found a way to have a task to run on build completion which is the 'postbuildscript' publisher. The qunit jobs already have such a macro qunit-cleanup (in macro.yaml), so we can just add a step that would delete the sqlite file.

And again...

ssh lanthanum.eqiad.wmnet
cd /var/lib/jenkins/tmpfs
sudo -su jenkins-slave
ll
rm -rf *@* mwext-*
ll

(In reply to Antoine "hashar" Musso from comment #1)

The find -mmin +60 is pretty lame. Since then I found a way to have a task
to run on build completion which is the 'postbuildscript' publisher. The
qunit jobs already have such a macro qunit-cleanup (in macro.yaml), so we
can just add a step that would delete the sqlite file.

Cool. Let's see if we can update our macros that create tmp dbs to use postbuildscript to clean them up.

I guess keeping it in tmpfs is useful for now. We can just do an 'rm -rf' of the containing directory, since it's tied to the workspace id (jobname[@concurrency]), so there are no parallel conflicts.
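
A rough sketch of what such a teardown step could look like when run from a post-build step. TMPFS_ROOT and the function name are hypothetical. It assumes the per-job directory is named after the basename of the WORKSPACE path Jenkins exports (jobname, or jobname@N for concurrent executors), as the listing in the description suggests.

```shell
#!/bin/bash
set -eu

# Hypothetical root of the per-job scratch directories.
TMPFS_ROOT="${TMPFS_ROOT:-/var/lib/jenkins-slave/tmpfs}"

# Remove the scratch directory belonging to one workspace.
# $1 is the workspace path, e.g. the WORKSPACE variable Jenkins exports.
teardown_job_tmpfs() {
    local workspace="$1"
    local workspace_id
    workspace_id="$(basename "$workspace")"

    # Refuse an empty or suspicious id so we never rm -rf the root itself.
    case "$workspace_id" in
        ''|.|..|/)
            echo "refusing to clean '$workspace_id'" >&2
            return 1
            ;;
    esac

    rm -rf -- "${TMPFS_ROOT:?}/$workspace_id"
}

# In a post-build step this would be called as:
#   teardown_job_tmpfs "$WORKSPACE"
```

Because each concurrent executor has its own workspace id, removing the whole directory does not touch files of a parallel run of the same job.
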

RT 8582 requests more RAM for those two machines.

We can use a postbuildscript publisher that executes a shell script to tear down the database. This should be done in jobs creating sqlite databases, such as the ones using the macro prepare-mediawiki-qunit (being renamed to prepare-mediawiki).

That is surely annoying but not that critical imho.

Now that I have completed the Zuul cloner sprint, I will adjust the Jenkins jobs to delete the sqlite file on completion (as suggested in comment #1).

gerritadmin wrote:

Change 167948 had a related patch set uploaded by Hashar:
mw-install-sqlite: clear sqlite DB after 20 mins

https://gerrit.wikimedia.org/r/167948

gerritadmin wrote:

Change 167948 merged by jenkins-bot:
mw-install-sqlite: clear sqlite DB after 20 mins

https://gerrit.wikimedia.org/r/167948

(In reply to Gerrit Notification Bot from comment #8)

Change 167948 merged by jenkins-bot:
mw-install-sqlite: clear sqlite DB after 20 mins

https://gerrit.wikimedia.org/r/167948

Deployed. But that is a lame workaround.

gerritadmin wrote:

Change 168558 had a related patch set uploaded by Hashar:
Refactor mw sqlite related env variables

https://gerrit.wikimedia.org/r/168558

gerritadmin wrote:

Change 168558 merged by jenkins-bot:
Refactor mw sqlite related env variables

https://gerrit.wikimedia.org/r/168558

gerritadmin wrote:

Change 168562 had a related patch set uploaded by Hashar:
mw-teardown.sh: to be run after mw jobs

https://gerrit.wikimedia.org/r/168562

gerritadmin wrote:

Change 168562 merged by jenkins-bot:
mw-teardown.sh: to be run after mw jobs

https://gerrit.wikimedia.org/r/168562

gerritadmin wrote:

Change 168566 had a related patch set uploaded by Hashar:
Mediawiki teardown publisher

https://gerrit.wikimedia.org/r/168566

gerritadmin wrote:

Change 168566 merged by jenkins-bot:
Mediawiki teardown publisher

https://gerrit.wikimedia.org/r/168566

The patch above causes jobs to delete the sqlite file on completion. That should keep tmpfs usage at a minimum level now.

Leaving the bug open for a while though.

Let's just assume this is fixed for now. Additionally, I have manually cleared the tmpfs partitions on both gallium and lanthanum.

If that occurs again one can still reopen the bug.