There is an icinga warning for contint1001 about its disk space:
DISK WARNING - free space: /srv 88397 MB (10% inode=94%):
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | | None | T207702 contint1001:/var/lib/docker growth |
| Resolved | | Dzahn | T178663 Switch CI Docker Storage Driver to its own partition and to use devicemapper |
| Resolved | | hashar | T207707 contint1001 store docker images on separate partition or disk |
| Resolved | | dduvall | T219850 contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): |
Mentioned in SAL (#wikimedia-operations) [2019-04-02T11:33:17Z] <hashar> contint1001: cleaning Docker containers #T219850
That task is valid: it is for /srv/, whereas the issues are usually on / :/
Seems to be caused by Jenkins build archiving. By number of builds:
$ find /srv/jenkins/builds -maxdepth 2|cut -d/ -f5|uniq -c|sort -n|tail -n10
   1334 mediawiki-quibble-vendor-mysql-php70-docker
   1359 mediawiki-core-jsduck-docker
   1645 mwext-php70-phan-docker
   2212 mediawiki-quibble-composertest-php70-docker
   2542 operations-puppet-tests-stretch-docker
   2860 mwext-php70-phan-seccheck-docker
   3470 publish-to-doc1001
   8480 castor-save-workspace-cache
   8647 maintenance-disconnect-full-disks
  10138 mwgate-npm-node-6-docker
In MBytes, the nine largest and the total:
$ du /srv/jenkins/builds -m -d1|sort -n|tail -n10
 29867  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
 30968  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
 39495  /srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
 40828  /srv/jenkins/builds/wmf-quibble-vendor-mysql-hhvm-docker
 45393  /srv/jenkins/builds/quibble-vendor-mysql-hhvm-docker
 52050  /srv/jenkins/builds/apps-android-wikipedia-test
 61176  /srv/jenkins/builds/wmf-quibble-core-vendor-mysql-hhvm-docker
 93877  /srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
 94037  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker
738610  /srv/jenkins/builds
A thousand builds at 90 MBytes each end up taking 90 GBytes, and we have several such jobs.
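For reference, the average per-build footprint of a single job can be checked with something like this (a sketch; the job name is just one example taken from the listing above):

job=mediawiki-quibble-vendor-mysql-php70-docker
# count the archived build directories for that job
builds=$(find /srv/jenkins/builds/$job -mindepth 1 -maxdepth 1 -type d | wc -l)
# total size of the job's builds, in MBytes
size=$(du -sm /srv/jenkins/builds/$job | cut -f1)
echo "$job: $builds builds, $((size / builds)) MB/build on average"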
Mentioned in SAL (#wikimedia-operations) [2019-04-02T12:07:23Z] <hashar> contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850
Change 500714 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki debug logs
Change 500714 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki debug logs
The jobs running MediaWiki tests now gzip the huge debug logs. I am running a script to gzip the old ones.
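The one-off compression is something along these lines (a sketch, not the exact script; the mw-debug-*.log pattern and the one-day age threshold are assumptions):

# gzip MediaWiki debug logs from earlier builds; already-compressed
# .gz files do not match the pattern, so the run is safe to repeat
find /srv/jenkins/builds -name 'mw-debug-*.log' -mtime +1 -exec gzip {} +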
$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  655G  172G  80% /srv
                                              ^^^^^
Looks better now :]
With compression of MediaWiki debug logs, disk usage went down to 287G/870G, or 35%:
$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  287G  540G  35% /srv
                                        ^^^^^
And those Jenkins jobs now gzip those log files so we should be fine for a while.
Thank you @Marostegui !
It seems it is almost full again. Have you considered setting up a periodic Docker image cleanup? cc @hashar
A naive but straightforward approach would be something like executing docker image prune -a --force --filter "until=240h" (this works on 18.09, which is installed on contint1001).
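Making that periodic could be as simple as a cron entry, e.g. (a sketch; the file name and schedule are arbitrary choices):

# /etc/cron.d/docker-image-prune (hypothetical)
# every day at 04:00, remove images unused for ten days
0 4 * * * root /usr/bin/docker image prune -a --force --filter "until=240h"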
Reopening as it alerted again: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=contint1001&service=Disk+space
Slightly different, as the alert is complaining about / and not /srv. Regardless, I'll take a look.
Mentioned in SAL (#wikimedia-operations) [2019-05-30T22:53:41Z] <marxarelli> deleting stale docker images from contint1001, cc: T207707 T219850
Mentioned in SAL (#wikimedia-operations) [2019-05-30T22:59:45Z] <marxarelli> deleted 95 docker images from contint1001, freeing ~ 8G on / cc: T219850
Alert is warning again.
DISK WARNING - free space: / 4594 MB (10% inode=57%):
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=contint1001&service=Disk+space
I cleaned up some images yesterday:
2019-06-05
19:57 <hashar> contint1001: docker container prune -f && docker image prune -f # reclaimed 166 MB and 3.4 GB
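For future cleanups, docker system df gives a quick view of how much of the usage is actually reclaimable before pruning:

# summarize disk usage of images, containers, local volumes and build cache
docker system df
# same, broken down per image and container
docker system df -v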
We had new disks added to the machine (T207707#5226746), so we will get plenty of space eventually.
Can be closed together with T207707 once the Docker images have moved to the new logical volume that can now be used.
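Moving the images essentially means pointing Docker's data-root at the new volume; a minimal sketch, assuming the new logical volume is mounted at /srv/docker (hypothetical mount point, not necessarily the path used here):

systemctl stop docker
# data-root replaces the default /var/lib/docker location
cat > /etc/docker/daemon.json <<'EOF'
{ "data-root": "/srv/docker" }
EOF
systemctl start docker
# images then have to be pulled again (or copied over beforehand)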
I merged @hashar's change to the Docker data dir, @thcipriani restarted it and pulled the images (see T207707#5315360), and now we got:
18:03 <+icinga-wm> RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
This is alerting again:
[09:05:40] <+icinga-wm> PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: /srv 50402 MB (5% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=contint1001&var-datasource=eqiad+prometheus/ops
root@contint1001:~# df -hT /srv
Filesystem                       Type  Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data ext4  870G  777G   50G  95% /srv
The builds are growing insanely again. In megabytes:
contint1001:/srv$ du /srv/jenkins/builds -m -d1|sort -n|tail -n10
 28852  /srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php70-docker
 29414  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-hhvm-docker
 33491  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php71-docker
 33613  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php73-docker
 34559  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php72-docker
 42702  /srv/jenkins/builds/mediawiki-quibble-vendor-postgres-php72-docker
 56069  /srv/jenkins/builds/mediawiki-fresnel-patch-docker
134506  /srv/jenkins/builds/mediawiki-quibble-composer-mysql-php70-docker
134525  /srv/jenkins/builds/mediawiki-quibble-vendor-mysql-php70-docker
771464  /srv/jenkins/builds
There are a lot of mw-debug-cli.log files of about 130 MBytes each. They are generated by MediaWiki scripts such as maintenance/install.php or maintenance/update.php. Notably we log every single query to the DBQuery log bucket, which accounts for 122 MBytes of the data and 600k queries :-]
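The DBQuery share can be double-checked against one of those logs; a hedged example (matching lines on the channel name is an assumption about the log format):

# number of lines mentioning the DBQuery channel
grep -c DBQuery mw-debug-cli.log
# bytes taken by those lines, reported in MBytes
grep DBQuery mw-debug-cli.log | wc -c | awk '{print $1 / 1024 / 1024, "MB"}'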
I guess CI should just gzip the debug log files on build completion, which we already did above via 397c36fc9201afc76e0be7e054b62db3a62b00f9. But some jobs do not have compression enabled :-(
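The compression itself amounts to a post-build shell step of this kind (a sketch of the idea, not the actual integration/config change; the log/ directory inside the workspace is an assumption):

# compress whatever debug logs the build left behind, before archiving
find "$WORKSPACE/log" -name '*.log' -type f -exec gzip -9 {} +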
Mentioned in SAL (#wikimedia-operations) [2019-08-28T12:16:59Z] <hashar> contint1001: manually gzip a few mw-debug-cli.log.gz files # T219850
Change 532995 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Compress MediaWiki logs in all jobs
Change 532995 merged by jenkins-bot:
[integration/config@master] Compress MediaWiki logs in all jobs
Change 533010 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Align mediawiki-quibble* jobs with other quibble jobs
Change 533010 merged by jenkins-bot:
[integration/config@master] Align mediawiki-quibble* jobs with other quibble jobs
Some of the jobs (mediawiki-quibble-*, Fresnel) lacked compression of the MediaWiki debug logs and also kept artifacts for 15 days instead of the 7 days used by the other jobs. It is way better now:
$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  497G  330G  61% /srv
Once cleanup has completed:
$ df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  370G  457G  45% /srv
So we should be fine now.