
contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):
Closed, Resolved · Public

Description

There is an Icinga warning for contint1001 about its disk space:

DISK WARNING - free space: /srv 88397 MB (10% inode=94%):

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 2 2019, 5:56 AM

Mentioned in SAL (#wikimedia-operations) [2019-04-02T11:33:17Z] <hashar> contint1001: cleaning Docker containers #T219850
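The SAL entry does not record the exact commands; on a Docker 18.09 host a manual cleanup would typically look something like the following (an assumption, not a record of what was actually run):

$ # See what Docker itself is consuming (images, containers, local volumes)
$ docker system df
$ # Remove stopped containers and dangling images
$ docker container prune -f
$ docker image prune -f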

hashar reopened this task as Open. · Apr 2 2019, 11:41 AM
hashar claimed this task.
hashar triaged this task as Unbreak Now! priority.

This task is valid: it is about /srv/, whereas usually the issues are on / :/

Seems to be caused by Jenkins build archiving. Number of archived builds per job (the paths are /srv/jenkins/builds/<job>/<build>, so the fifth '/'-separated field is the job name):

$ find /srv/jenkins/builds -maxdepth 2 | cut -d/ -f5 | uniq -c | sort -n | tail -n10
   1334 mediawiki-quibble-vendor-mysql-php70-docker
   1359 mediawiki-core-jsduck-docker
   1645 mwext-php70-phan-docker
   2212 mediawiki-quibble-composertest-php70-docker
   2542 operations-puppet-tests-stretch-docker
   2860 mwext-php70-phan-seccheck-docker
   3470 publish-to-doc1001
   8480 castor-save-workspace-cache
   8647 maintenance-disconnect-full-disks
  10138 mwgate-npm-node-6-docker

In MB, the nine largest job directories and the overall total:

$ du -m -d1 /srv/jenkins/builds | sort -n | tail -n10
29867	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
30968	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
39495	/srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
40828	/srv/jenkins/builds/wmf-quibble-vendor-mysql-hhvm-docker
45393	/srv/jenkins/builds/quibble-vendor-mysql-hhvm-docker
52050	/srv/jenkins/builds/apps-android-wikipedia-test
61176	/srv/jenkins/builds/wmf-quibble-core-vendor-mysql-hhvm-docker
93877	/srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
94037	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker

738610	/srv/jenkins/builds

A thousand builds at 90 MB each ends up taking 90 GB, and we have several such jobs.
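
For a rough per-build figure, the two listings can be combined. A quick sketch, using one of the job names above purely as an example:

$ # Average archived size per build for one job, in MB
$ job=mediawiki-quibble-vendor-mysql-php70-docker
$ total_mb=$(du -sm "/srv/jenkins/builds/$job" | cut -f1)
$ builds=$(find "/srv/jenkins/builds/$job" -mindepth 1 -maxdepth 1 -type d | wc -l)
$ echo "$job: $builds builds, about $((total_mb / builds)) MB per build"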

Restricted Application added subscribers: Liuxinyu970226, TerraCodes.

Mentioned in SAL (#wikimedia-operations) [2019-04-02T12:07:23Z] <hashar> contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850

Change 500714 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki debug logs

https://gerrit.wikimedia.org/r/500714

Change 500714 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki debug logs

https://gerrit.wikimedia.org/r/500714

hashar closed this task as Resolved. · Apr 2 2019, 12:48 PM

The jobs running MediaWiki tests now gzip the huge debug logs. I am running a script to gzip the old ones.

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  655G  172G  80% /srv
                                                   ^^^^^

Looks better now :]
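
The compression script itself is not posted in the task. A minimal sketch of such a one-off pass, assuming the archived MediaWiki debug logs are uncompressed *.log files under the build directories (the name pattern is an assumption):

$ # Gzip archived logs older than a day; already-compressed .log.gz files no longer match
$ find /srv/jenkins/builds -type f -name '*.log' -mtime +1 -print0 | xargs -0 -r gzip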

hashar added a comment. · Apr 2 2019, 4:07 PM

With compression of the MediaWiki debug logs, disk usage went down to 287G/870G, or 35%:

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  287G  540G  35% /srv
                                                   ^^^^^

And those Jenkins jobs now gzip those log files so we should be fine for a while.

Thank you @Marostegui!

fsero added a subscriber: fsero. · Apr 15 2019, 2:55 PM

It seems it is almost full again. Did you consider setting up a periodic Docker image cleanup? cc @hashar

A naive, straightforward approach would be something like running docker image prune -a --force --filter "until=240h" (this works on 18.09, which is installed on contint1001).
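
A sketch of how that could be made periodic with a plain cron entry; the schedule and crontab location are illustrative only, and in this environment such a job would more likely be managed through Puppet:

# /etc/cron.d/docker-image-prune (sketch): daily at 04:00, drop images unused for 10 days
0 4 * * * root /usr/bin/docker image prune -a --force --filter "until=240h" > /dev/null 2>&1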

Note that this task is about /srv, while the current issue is with /.

fsero added a comment. · Apr 15 2019, 2:57 PM

Oh sorry @Marostegui, I misread :)