
gallium and lanthanum disks full (tracking)
Closed, Resolved, Public

Assigned To
Authored By
Krinkle
Mar 1 2015, 8:08 PM
Referenced Files
F108921: lanthanum-mem.png
Apr 5 2015, 10:01 AM
F108923: lanthanum-disk.png
Apr 5 2015, 10:01 AM
F108922: gallium-disk.png
Apr 5 2015, 10:01 AM
F51650: gallium-mem-year.png
Mar 1 2015, 8:08 PM
F51653: gallium-disk-year.png
Mar 1 2015, 8:08 PM

Description

Since July 2014, free disk space on gallium has been declining steadily. As of January, the situation is getting more dangerous.

The root disk is 75% full (108GB of 452GB available) and the SSD (used by Jenkins workspaces and Gerrit replication) is also 75% full (a mere 39GB of 149GB available).

[19:42 UTC] krinkle at gallium.wikimedia.org in ~
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        452G  321G  108G  75% /
udev            3.9G  4.0K  3.9G   1% /dev
tmpfs           798M   72M  726M  10% /run
none            5.0M     0  5.0M   0% /run/lock
none            3.9G     0  3.9G   0% /run/shm
/dev/sdb1       149G  111G   39G  75% /srv/ssd
tmpfs           512M     0  512M   0% /var/lib/jenkins-slave/tmpfs
[19:43 UTC] krinkle at gallium.wikimedia.org in /srv/ssd
$ du -sh *
7.6G    gerrit
0       jenkins
96G     jenkins-slave
6.9G    zuul
[19:45 UTC] krinkle at gallium.wikimedia.org in /srv/ssd/jenkins-slave
$ du -sh *
12K     maven..-interceptor.jar
432K    slave.jar
12M     tools
96G     workspace
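
To watch these numbers without reading df output by hand, here is a minimal sketch, assuming a POSIX shell and a df that supports -P. The function names and the 75 percent limit are illustrative assumptions, not something deployed on gallium.

```shell
#!/bin/sh
# usage_pct MOUNTPOINT
# Print the Use% of the filesystem holding MOUNTPOINT as a bare integer.
usage_pct() {
    # -P forces the POSIX format: one line per filesystem, capacity in column 5.
    df -P -- "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold MOUNTPOINT LIMIT
# Succeed if the filesystem is at or above LIMIT percent full.
over_threshold() {
    [ "$(usage_pct "$1")" -ge "$2" ]
}

# Example (mount points taken from the df output above; the limit is an assumption):
#   for m in / /srv/ssd; do
#       over_threshold "$m" 75 && echo "$m is at $(usage_pct "$m")%"
#   done
```

Parsing df -P rather than plain df avoids the case where a long device name wraps the record onto two lines.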

gallium-disk-year.png (912×2 px, 338 KB)

gallium-mem-year.png (520×806 px, 225 KB)

Related Objects

Status    Assigned
Resolved  Krinkle
Resolved  hashar
Declined  hashar
Resolved  hashar
Declined  hashar
Declined  hashar
Declined  hashar
Resolved  Andrew
Resolved  hashar
Declined  hashar
Resolved  Krinkle
Resolved  hashar
Resolved  Dzahn
Resolved  hashar
Resolved  hashar
Resolved  Andrew
Resolved  hashar
Resolved  Krinkle
Resolved  Krinkle
Resolved  Krinkle
Resolved  hashar
Resolved  Krinkle
Resolved  hashar
Resolved  hashar
Resolved  hashar
Resolved  hashar

Event Timeline

Krinkle raised the priority of this task to High.
Krinkle updated the task description.
Krinkle subscribed.

Running in a screen:

hashar@gallium:/var/lib/jenkins/jobs$ du -sm *|sort -n

The build history of some jobs probably needs to be rotated automatically.
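
To see what such a rotation would discard for one job before configuring anything, here is a minimal dry-run sketch. It assumes builds are stored as directories with purely numeric names under builds/ (this depends on the Jenkins version), and list_old_builds is a hypothetical helper name. Nothing is deleted. In practice Jenkins' own discard-old-builds setting is the place to enforce this.

```shell
#!/bin/sh
# list_old_builds BUILDS_DIR KEEP
# Print the numbered build directories in BUILDS_DIR except the KEEP newest,
# oldest first. Dry run only: nothing is removed.
list_old_builds() {
    builds_dir=$1
    keep=$2
    for d in "$builds_dir"/*/; do
        [ -d "$d" ] || continue          # also covers the "no match" case
        [ -L "${d%/}" ] && continue      # skip symlinks such as lastSuccessfulBuild
        n=$(basename "$d")
        case $n in
            ''|*[!0-9]*) continue ;;     # keep only purely numeric names
        esac
        echo "$n"
    done | sort -n | awk -v keep="$keep" '
        { a[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print a[i] }'
}

# Example (the job name comes from the list in this task; 100 is an assumption):
#   list_old_builds /var/lib/jenkins/jobs/mwext-Wikibase-lint/builds 100
```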

I have canceled that command. Instead, we can use stat to print the hard-link count of each builds directory, which reflects the number of subdirectories it contains. The top 30 offenders by number of hard links:

$ cd /var/lib/jenkins/jobs
$ stat --format '%h:%n' */builds|sort -rn|head -n30
13371:operations-puppet-tox-data_admin_lint/builds
12284:mwext-Wikibase-lint/builds
10051:mwext-MobileFrontend-lint/builds
8989:mwext-MobileFrontend-qunit/builds
8983:mwext-MobileFrontend-qunit-mobile/builds
8821:mwext-Flow-lint/builds
8713:operations-apache-config-lint/builds
7512:mwext-VisualEditor-lint/builds
7003:mwext-VisualEditor-npm/builds
7002:mwext-VisualEditor-qunit/builds
6991:mwext-VisualEditor-doc-test/builds
6932:operations-mw-config-tests/builds
5672:mwext-Wikibase-qunit/builds
5656:mwext-Wikibase-repo-tests/builds
5653:mwext-Wikibase-repo-api-tests/builds
5642:mwext-Wikibase-client-tests/builds
5618:mediawiki-core-doxygen-publish/builds
5463:mediawiki-gate/builds
5353:VisualEditor-npm/builds
5096:pywikibot-core-tox-flake8/builds
5040:mediawiki-core-phplint/builds
4863:operations-puppet-validate/builds
4777:VisualEditor-jsduck/builds
4638:mediawiki-core-bundle-rubocop/builds
4422:mwext-Flow-qunit/builds
4384:mwext-MobileFrontend-phpcs-HEAD/builds
4263:mediawiki-core-regression-phpcs-HEAD/builds
3953:pywikibot-core-tox-nose/builds
3920:mwext-MobileFrontend-jslint/builds
3808:operations-puppet-puppetlint-strict/builds

Those jobs are most probably not logrotated.
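
The stat shortcut relies on a filesystem detail: on ext-style filesystems a directory's hard-link count is 2 plus the number of its immediate subdirectories, so %h can be read without walking the directory. Some filesystems do not follow this (btrfs reports 1 for directories, for example), and plain files are never counted. Where that matters, a slower but filesystem-independent sketch is to count the subdirectories explicitly. The function names are hypothetical, and -mindepth/-maxdepth assume GNU or BSD find.

```shell
#!/bin/sh
# count_subdirs DIR
# Print the number of immediate subdirectories of DIR (symlinks are not counted).
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# rank_builds JOBS_DIR [N]
# Print "count:job/builds" lines for every job, largest first, top N (default 30).
rank_builds() {
    jobs_dir=$1
    top=${2:-30}
    for b in "$jobs_dir"/*/builds; do
        [ -d "$b" ] || continue
        job=$(basename "$(dirname "$b")")
        echo "$(count_subdirs "$b"):$job/builds"
    done | sort -t: -k1,1nr | head -n "$top"
}

# Example:
#   rank_builds /var/lib/jenkins/jobs 30
```

The output has the same count:path shape as the stat listing, except that each count is the number of subdirectories itself rather than that number plus 2.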

Related:

Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        452G  223G  206G  52% /

That looks good to me, doesn't it?

Resolved for now. Work is in progress to reduce the number of jobs being run, which will help keep disk usage at a sane level.

Krinkle renamed this task from gallium.wikimedia.org disk space running low to gallium and lanthanum disks full (tracking). Mar 24 2015, 5:50 AM
Krinkle reopened this task as Open.
Krinkle removed hashar as the assignee of this task.
Krinkle removed a project: acl*sre-team.
Krinkle set Security to None.
Krinkle added a subscriber: Legoktm.
Krinkle claimed this task.

The ongoing efforts on T86659 and T91396 have finally brought disk usage on gallium and lanthanum to stable levels.

lanthanum-mem.png (272×411 px, 76 KB)
gallium-disk.png (421×1 px, 149 KB)
lanthanum-disk.png (427×1 px, 144 KB)