
contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):
Closed, Resolved · Public

Description

There is an icinga warning for contint1001 about its disk space:

DISK WARNING - free space: /srv 88397 MB (10% inode=94%):

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-04-02T11:33:17Z] <hashar> contint1001: cleaning Docker containers #T219850

hashar claimed this task.
hashar triaged this task as Unbreak Now! priority.

This task is valid: it is about /srv/, whereas the issues are usually on / :/

Seems to be caused by Jenkins build archiving. By number of builds:

$ find /srv/jenkins/builds -maxdepth 2|cut -d/  -f5|uniq -c|sort -n|tail -n10
   1334 mediawiki-quibble-vendor-mysql-php70-docker
   1359 mediawiki-core-jsduck-docker
   1645 mwext-php70-phan-docker
   2212 mediawiki-quibble-composertest-php70-docker
   2542 operations-puppet-tests-stretch-docker
   2860 mwext-php70-phan-seccheck-docker
   3470 publish-to-doc1001
   8480 castor-save-workspace-cache
   8647 maintenance-disconnect-full-disks
  10138 mwgate-npm-node-6-docker

In megabytes, the nine largest jobs and the total:

$ du /srv/jenkins/builds -m -d1|sort -n|tail -n10
29867	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
30968	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
39495	/srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
40828	/srv/jenkins/builds/wmf-quibble-vendor-mysql-hhvm-docker
45393	/srv/jenkins/builds/quibble-vendor-mysql-hhvm-docker
52050	/srv/jenkins/builds/apps-android-wikipedia-test
61176	/srv/jenkins/builds/wmf-quibble-core-vendor-mysql-hhvm-docker
93877	/srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
94037	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker

738610	/srv/jenkins/builds

A thousand builds at roughly 90 MB each take up 90 GB, and we have several such jobs.
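
As a rough cross-check of that arithmetic, the per-job totals above can be reproduced by combining the two commands; a minimal sketch, using one of the job names from the listing above as an example:

$ job=mediawiki-quibble-vendor-mysql-php70-docker
$ builds=$(find /srv/jenkins/builds/$job -mindepth 1 -maxdepth 1 -type d | wc -l)
$ size=$(du -sm /srv/jenkins/builds/$job | cut -f1)
$ echo "$job: $builds builds, $size MB, ~$((size / builds)) MB per build"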

Mentioned in SAL (#wikimedia-operations) [2019-04-02T12:07:23Z] <hashar> contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850

Change 500714 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki debug logs

https://gerrit.wikimedia.org/r/500714

Change 500714 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki debug logs

https://gerrit.wikimedia.org/r/500714

The jobs running MediaWiki tests now gzip the huge debug logs. I am running a script to gzip the old ones.
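
The script itself is not included here; a minimal sketch of that kind of one-off cleanup, assuming the archived logs are named mw-debug-*.log:

$ # gzip archived MediaWiki debug logs that are still uncompressed and reasonably large
$ find /srv/jenkins/builds -name 'mw-debug-*.log' -size +10M -exec gzip -v {} \;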

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  655G  172G  80% /srv
                                                   ^^^^^

Looks better now :]

With compression of the MediaWiki debug logs, disk usage went down to 287G/870G, or 35%:

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  287G  540G  35% /srv
                                                   ^^^^^

And those Jenkins jobs now gzip those log files so we should be fine for a while.

Thank you @Marostegui !

It seems it is almost full again. Did you consider setting up a periodic Docker image cleanup? cc @hashar

A naive, straightforward approach would be something like executing docker image prune -a --force --filter "until=240h" (this works on Docker 18.09, which is installed on contint1001).
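
For the periodic part, a minimal sketch would be a cron entry wrapping that command (on WMF hosts this would more realistically be managed through Puppet; the schedule is only an example):

# crontab entry (sketch): prune unused Docker images older than ten days, Sundays at 03:00
0 3 * * 0 /usr/bin/docker image prune -a --force --filter "until=240h" >/dev/null 2>&1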

Note that this task is about /srv, while the current issue is with /.

ayounsi lowered the priority of this task from Unbreak Now! to High. May 30 2019, 6:53 PM
dduvall added a subscriber: dduvall.

Slightly different as the alert is complaining about / and not /srv. Regardless, I'll take a look.

Mentioned in SAL (#wikimedia-operations) [2019-05-30T22:59:45Z] <marxarelli> deleted 95 docker images from contint1001, freeing ~ 8G on / cc: T219850

Alert is back to OK.

The long-term fix should still be T178663.

I cleaned up some images yesterday:

2019-06-05
19:57 <hashar> contint1001: docker container prune -f && docker image prune -f # reclaimed 166 MB and 3.4 GB

We had new disks added to the machine (T207707#5226746) so we will get plenty of space eventually.

Can be closed together with T207707 once the docker images have moved to the new logical volume that can now be used.

Dzahn added a subscriber: thcipriani.

I merged @hashar's change to the Docker data dir, @thcipriani restarted it and pulled the images (see T207707#5315360), and now we got:

18:03 <+icinga-wm> RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space

This is alerting again:

[09:05:40]  <+icinga-wm>	PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: /srv 50402 MB (5% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint1001&var-datasource=eqiad+prometheus/ops
root@contint1001:~# df -hT /srv
Filesystem                       Type  Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data ext4  870G  777G   50G  95% /srv

The builds are growing like crazy again. In megabytes:

contint1001:/srv$ du /srv/jenkins/builds -m -d1|sort -n|tail -n10
28852	/srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
29414	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
33491	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php71-docker
33613	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php73-docker
34559	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
42702	/srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php72-docker
56069	/srv/jenkins/builds/mediawiki-fresnel-patch-docker
134506	/srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
134525	/srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker
771464	/srv/jenkins/builds

There are a lot of mw-debug-cli.log files of about 130 MB each. They are generated by MediaWiki scripts such as maintenance/install.php or maintenance/update.php. Notably, we log every single query in the DBQuery log bucket, which accounts for 122 MB of the data and 600k queries :-]
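
A minimal sketch of how that per-channel breakdown can be measured, assuming each line of the debug log starts with its channel name in square brackets (e.g. "[DBQuery] SELECT ..."):

$ # number of DBQuery lines and the bytes they account for in one debug log
$ grep -c '^\[DBQuery\]' mw-debug-cli.log
$ grep '^\[DBQuery\]' mw-debug-cli.log | wc -c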

I guess CI should just gzip the debug log files on build completion, which we already did above via 397c36fc9201afc76e0be7e054b62db3a62b00f9. But some jobs do not have the compression enabled :-(
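
A quick way to spot which jobs still archive uncompressed debug logs; a sketch reusing the path layout from the find command above (field 5 is the job name):

$ # jobs with recent uncompressed mw-debug-cli.log files, by count
$ find /srv/jenkins/builds -name 'mw-debug-cli.log' -mtime -7 | cut -d/ -f5 | sort | uniq -c | sort -n | tail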

Mentioned in SAL (#wikimedia-operations) [2019-08-28T12:16:59Z] <hashar> contint1001: manually gzip a few mw-debug-cli.log.gz files # T219850

Change 532995 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki logs in all jobs

https://gerrit.wikimedia.org/r/532995

Change 532995 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki logs in all jobs

https://gerrit.wikimedia.org/r/532995

Change 533010 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Align mediawiki-quibble* jobs with other quibble jobs

https://gerrit.wikimedia.org/r/533010

Change 533010 merged by jenkins-bot:
[integration/config@master] Align mediawiki-quibble* jobs with other quibble jobs

https://gerrit.wikimedia.org/r/533010

Some of the jobs (mediawiki-quibble-*, Fresnel) lacked compression of the MediaWiki debug logs and also kept their artifacts for 15 days instead of the 7 days used by the other jobs. It is way better now:

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  497G  330G  61% /srv

Once cleanup has completed:

$ df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  370G  457G  45% /srv

So we should be fine now.