
contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):
Closed, Resolved · Public


There is an icinga warning for contint1001 about its disk space:

DISK WARNING - free space: /srv 88397 MB (10% inode=94%):

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 2 2019, 5:56 AM

Mentioned in SAL (#wikimedia-operations) [2019-04-02T11:33:17Z] <hashar> contint1001: cleaning Docker containers #T219850

hashar reopened this task as Open.Apr 2 2019, 11:41 AM
hashar claimed this task.
hashar triaged this task as Unbreak Now! priority.

This task is valid: it is about /srv/, whereas the issues are usually on /. :/

Seems to be caused by Jenkins build archiving. By number of builds:

$ find /srv/jenkins/builds -maxdepth 2|cut -d/  -f5|uniq -c|sort -n|tail -n10
   1334 mediawiki-quibble-vendor-mysql-php70-docker
   1359 mediawiki-core-jsduck-docker
   1645 mwext-php70-phan-docker
   2212 mediawiki-quibble-composertest-php70-docker
   2542 operations-puppet-tests-stretch-docker
   2860 mwext-php70-phan-seccheck-docker
   3470 publish-to-doc1001
   8480 castor-save-workspace-cache
   8647 maintenance-disconnect-full-disks
  10138 mwgate-npm-node-6-docker
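The pipeline above relies on find listing each job's build directories consecutively before the output reaches uniq -c; inserting a sort makes the count order-independent. A minimal sketch against a throwaway demo tree (the job names jobA/jobB are hypothetical, mirroring the /srv/jenkins/builds/<job>/<build> layout):

```shell
# Build a tiny demo tree shaped like /srv/jenkins/builds/<job>/<build>.
D=$(mktemp -d)
mkdir -p "$D/jobA/1" "$D/jobA/2" "$D/jobB/1"

# Count build directories per job; sorting before uniq -c makes the
# count correct regardless of the order in which find emits entries.
find "$D" -mindepth 2 -maxdepth 2 -type d \
  | awk -F/ '{print $(NF-1)}' \
  | sort | uniq -c | sort -n
```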

In MBytes, the nine largest and the total:

$ du /srv/jenkins/builds -m -d1|sort -n|tail -n10
29867	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
30968	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
39495	/srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
40828	/srv/jenkins/builds/wmf-quibble-vendor-mysql-hhvm-docker
45393	/srv/jenkins/builds/quibble-vendor-mysql-hhvm-docker
52050	/srv/jenkins/builds/apps-android-wikipedia-test
61176	/srv/jenkins/builds/wmf-quibble-core-vendor-mysql-hhvm-docker
93877	/srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
94037	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker

738610	/srv/jenkins/builds

A thousand builds at ~90 MBytes each take up ~90 GBytes, and we have several such jobs.
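The arithmetic checks out with shell integer math (1 GByte = 1024 MBytes):

```shell
# 1000 builds x 90 MB each, expressed in whole GB.
echo "$(( 1000 * 90 / 1024 )) GB"   # prints "87 GB", i.e. roughly 90 GB
```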

Restricted Application added subscribers: Liuxinyu970226, TerraCodes.

Mentioned in SAL (#wikimedia-operations) [2019-04-02T12:07:23Z] <hashar> contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850

Change 500714 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki debug logs

Change 500714 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki debug logs

hashar closed this task as Resolved.Apr 2 2019, 12:48 PM

The jobs running MediaWiki tests now gzip the huge debug logs. I am running a script to gzip the old ones.
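A minimal sketch of such a one-off compression pass, assuming the debug logs are plain *.log files under the build archive (the throwaway demo directory here stands in for /srv/jenkins/builds):

```shell
# Demo directory standing in for /srv/jenkins/builds (hypothetical layout).
BUILDS=$(mktemp -d)
printf 'wikidb: debug output\n' > "$BUILDS/mw-debug.log"

# Gzip every uncompressed .log file; "-exec ... +" batches the invocations.
find "$BUILDS" -type f -name '*.log' -exec gzip {} +
```

In practice one would also add an age filter such as -mtime +1 so only logs from finished builds are touched.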

$ ssh contint1001 df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  655G  172G  80% /srv

Looks better now :]

With compression of mediawiki debug logs, disk usage went down to 287G/870G or 35%:

$ ssh contint1001 df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  287G  540G  35% /srv

And those Jenkins jobs now gzip those log files so we should be fine for a while.
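For the record, the two df snapshots above imply the compression pass reclaimed about 655 − 287 = 368 G on /srv:

```shell
echo "$(( 655 - 287 )) GB freed"   # prints "368 GB freed"
```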

Thank you @Marostegui!

fsero added a subscriber: fsero.Apr 15 2019, 2:55 PM

It seems it is almost full again. Did you consider setting up a periodic Docker image cleanup? cc @hashar

A naive, straightforward approach would be something like executing docker image prune -a --force --filter "until=240h" (this works on Docker 18.09, which is installed on contint1001).
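One hedged way to make that cleanup periodic would be a root crontab entry; the schedule and binary path below are assumptions for illustration, not anything deployed on contint1001:

```shell
# Hypothetical daily cleanup at 04:00: remove images unused for 10 days (240h).
0 4 * * * /usr/bin/docker image prune -a --force --filter "until=240h"
```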

Note that this task is about /srv and the current issue is with /

fsero added a comment.Apr 15 2019, 2:57 PM

Oh sorry @Marostegui, I misread :)

ayounsi lowered the priority of this task from Unbreak Now! to High.May 30 2019, 6:53 PM
dduvall claimed this task.May 30 2019, 10:50 PM
dduvall added a subscriber: dduvall.

Slightly different as the alert is complaining about / and not /srv. Regardless, I'll take a look.

Mentioned in SAL (#wikimedia-operations) [2019-05-30T22:59:45Z] <marxarelli> deleted 95 docker images from contint1001, freeing ~ 8G on / cc: T219850

dduvall closed this task as Resolved.May 30 2019, 11:03 PM

Alert is back to OK.

The long term fix should still be T178663

ayounsi reopened this task as Open.Jun 5 2019, 7:20 PM

Alert is warning again.

DISK WARNING - free space: / 4594 MB (10% inode=57%):

hashar closed this task as Resolved.Jun 6 2019, 9:22 AM

I cleaned up some images yesterday:

19:57 <hashar> contint1001: docker container prune -f && docker image prune -f # reclaimed 166 MB and 3.4 GB

We had new disks added to the machine (T207707#5226746) so we will get plenty of space eventually.

Dzahn added a comment.Jul 3 2019, 5:44 PM

Can be closed together with T207707 once the docker images have moved to the new logical volume that can now be used.

Dzahn closed this task as Resolved.Jul 8 2019, 10:16 PM
Dzahn added a subscriber: thcipriani.

I merged @hashar's change to the Docker data dir, @thcipriani restarted it and pulled the images (see T207707#5315360), and now we got:

18:03 <+icinga-wm> RECOVERY - Disk space on contint1001 is OK: DISK OK