Disk space on the / partition of deployment-jobrunner01 triggers a >60% disk usage error from Shinken
deployment-jobrunner01:~$ df --human --type ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        18G   12G  5.3G  69% /
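For reference, the same threshold comparison can be reproduced by hand. This is only a minimal sketch, not how Shinken itself implements the check: the function name `usage_status` is hypothetical, and the default of 60 is simply taken from the >60% figure in the alert.

```shell
#!/bin/sh
# Hypothetical helper: compare a usage percentage against a threshold.
# The default of 60 mirrors the >60% figure in the alert (an assumption
# about the configured limit, not a value read from Shinken).
usage_status() {
    pct=$1
    threshold=${2:-60}
    if [ "$pct" -gt "$threshold" ]; then
        echo "CRITICAL: ${pct}% used (threshold ${threshold}%)"
    else
        echo "OK: ${pct}% used (threshold ${threshold}%)"
    fi
}

# Read the Use% column for / ; -P forces one line per filesystem.
pct=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')
usage_status "$pct"
```

With the 69% shown in the df output above, `usage_status 69` reports CRITICAL.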
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | hashar | T130179 deployment-jobrunner01/Free space - all mounts is CRITICAL
Resolved | | hashar | T130184 beta cluster 'labswiki' not referenced in all-labs.dblist causing jobrunner to error out
du -d 1 -m /var/log/*|sort -rn|head -n5
1472 /var/log/hhvm
1297 /var/log/apache2
1021 /var/log/mediawiki
690 /var/log/account
425 /var/log/atop
Cleaned up stack traces from /var/log/hhvm.
/var/log/hhvm/error.log is not rotated. I emptied it, since all the errors in it were from before Feb 23 and have since been fixed.
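A minimal sketch of emptying a log in place, assuming the goal is to free the space without deleting the file. Truncating keeps the same inode, so a process that still has the file open keeps writing to it, whereas a deleted file's space is only released once the process closes it. The helper name `empty_log` is hypothetical and the demo runs on a scratch file, not on the real error.log; the exact command used on the host is not recorded in this task.

```shell
#!/bin/sh
# Hypothetical helper: empty a file in place instead of removing it.
# ": > file" truncates the file to zero bytes and keeps its inode.
empty_log() {
    : > "$1"
}

# Demonstrate on a scratch file rather than a real log.
demo=$(mktemp)
printf 'old stack trace\n' > "$demo"
empty_log "$demo"
wc -c < "$demo"   # size in bytes after truncation
rm -f "$demo"
```

On the host this would be called as `empty_log /var/log/hhvm/error.log`, with sufficient privileges.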
Mentioned in SAL [2016-03-17T09:04:14Z] <hashar> Upgrading hhvm and related extensions on jobrunner01 T130179
Mentioned in SAL [2016-03-17T09:34:51Z] <hashar> deployment-jobrunner01 deleted /var/log/apache/*.gz T130179
Summary
The hhvm error.log is not rotated, so I trimmed it.
The Apache vhost log for the jobrunner RPC, as well as /var/log/mediawiki/jobrunner.log, was filling with errors because some jobs were still for labswiki. I deleted the related keys from Redis (T130184).
I think it is under control now :-}