
Process accounting routinely fill up /var on deployment-bastion
Closed, Resolved | Public

Description

We've had a few instances in the past month where /var on deployment-bastion filled to a critical point (see also T71604 for past trouble with pacct logs). The biggest consumers of space are the two /var/log/account/pacct logs that remain uncompressed following rotation, and the vast majority of the process accounting activity comes from deployments.

root@deployment-bastion:/var/log/account# sa -m | head -n 8
                                  3751547   19511.41re     496.85cp         0avio      2699k
mwdeploy                          3671814    3013.74re     342.33cp         0avio      2663k
l10nupdate                           1397     237.33re     105.13cp         0avio     22077k
root                                67315    3448.54re      46.97cp         0avio      3127k
udp2log                               238   10051.94re       0.84cp         0avio      3348k
www-data                             1395       9.10re       0.58cp         0avio      6734k
diamond                              2320       2.44re       0.43cp         0avio     21544k
nobody                                 11      18.96re       0.16cp         0avio      4123k

We should investigate whether we can safely remove delaycompress from the rotation config for this log to spare our tiny /var.

Event Timeline

dduvall raised the priority of this task from to Needs Triage.
dduvall updated the task description.
dduvall added subscribers: dduvall, hashar.
dduvall set Security to None.

We lowered the accounting retention with T71604. That is done by having puppet populate /etc/default/acct with ACCT_LOGGING="7" (keep 7 days). The log handling is done via a daily cron job, /etc/cron.daily/acct, which really just runs:

savelog -g adm -m 0640 -u root -c "${ACCT_LOGGING}" /var/log/account/pacct > /dev/null

When handling the rotation, it does not compress the most recently rotated file, so we end up with two uncompressed files: pacct (which is currently being written to) and pacct.0.
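As a rough illustration of one possible workaround, the sketch below compresses any rotated pacct file that is still uncompressed while leaving the active pacct alone. The helper name compress_rotated_pacct is hypothetical and not part of the acct package. Whether savelog copes with an already compressed pacct.0.gz on its next run has not been verified here, so this is an assumption to test before using it.

```shell
#!/usr/bin/env bash
# Sketch only: compress rotated process accounting logs that are still
# uncompressed (for example pacct.0), leaving the active pacct untouched.
# compress_rotated_pacct is a hypothetical helper name.
compress_rotated_pacct() {
    # $1 = log directory; defaults to the path used on deployment-bastion.
    local dir="${1:-/var/log/account}"
    local f
    for f in "$dir"/pacct.[0-9]*; do
        # Skip the literal pattern when nothing matches.
        [ -f "$f" ] || continue
        # Skip files that are already compressed.
        case "$f" in *.gz) continue ;; esac
        gzip -9 -- "$f"
    done
}
```

It would need to run as root after the daily rotation, for example: compress_rotated_pacct /var/log/account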

The real fix would be to recreate deployment-bastion.eqiad.wmflabs using the new labs images, which offer a different disk partitioning with a larger /var. Maybe migrate to Trusty as well.

It is probably easier to create a new bastion instance and migrate progressively.

Change 194549 had a related patch set uploaded (by coren):
Move sourceswiki special.dblist->wikisource.dblist

https://gerrit.wikimedia.org/r/194549

coren subscribed.

That patch was mistakenly assigned to the wrong task -- please ignore.

greg triaged this task as Medium priority. Mar 11 2015, 3:02 PM
greg moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.
greg renamed this task from Process accounting + deployments routinely fill up /var on deployment-bastion to Process accounting routinely fill up /var on deployment-bastion. Sep 24 2015, 6:06 AM
03:18 < shinken-w> PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: 
                   deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<33.33%)
gjg@deployment-bastion:~$ df -h
Filesystem                                                 Size  Used Avail Use% Mounted on
/dev/vda1                                                  7.6G  2.7G  4.5G  38% /
/dev/vda2                                                  1.9G  1.8G   57M  97% /var
gjg@deployment-bastion:/var/log/account$ ls -alh
total 617M
drwxr-xr-x  2 root root 4.0K Sep 23 06:25 .
drwxr-xr-x 25 root root 4.0K Sep 23 06:55 ..
-rw-r-----  1 root adm  286M Sep 24 06:03 pacct
-rw-r-----  1 root adm  217M Sep 23 05:55 pacct.0
-rw-r-----  1 root adm   26M Sep 22 06:25 pacct.1.gz
-rw-r-----  1 root adm   23M Sep 20 02:53 pacct.3.gz
-rw-r-----  1 root adm   23M Sep 19 03:42 pacct.4.gz
-rw-r-----  1 root adm   22M Sep 18 02:33 pacct.5.gz
-rw-r-----  1 root adm   23M Sep 17 06:25 pacct.6.gz
06:38 < shinken-w> RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
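For checking by hand how close /var is to the alert threshold, a minimal sketch follows. The helper names percent_free and is_critical are hypothetical and unrelated to how shinken computes byte_percentfree. The 33.33 threshold is taken from the alert text above.

```shell
#!/usr/bin/env bash
# Sketch only: derive percent free from POSIX-format df output and compare
# it against a threshold. Both helper names are hypothetical.

percent_free() {
    # Reads `df -P` output on stdin and prints the free percentage of the
    # first filesystem row (100 minus the Capacity column).
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

is_critical() {
    # $1 = percent free, $2 = threshold; succeeds when free is below it.
    awk -v free="$1" -v limit="$2" 'BEGIN { exit !(free < limit) }'
}
```

Example use: free=$(df -P /var | percent_free); is_critical "$free" 33.33 && echo "/var is below the alert threshold"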

I have a shell script on deployment-bastion now to clean up logs on /var: /home/bd808/cleanup-var-crap.sh. I needed it on 2015-11-04 and again on 2015-11-23 to recover from /var being nearly full. Here's what the script does:

#!/usr/bin/env bash
# Get rid of crap that clutters up /var which is a ridiculously tiny partition

df -h /var

sudo apt-get clean
sudo /etc/cron.daily/acct
sudo rm /var/log/account/pacct.?*
sudo rm /var/log/atop.log.?
sudo rm /var/log/*.??.gz
sudo sh -c "rm /var/log/apache2/*.??.gz"

df -h /var
hashar claimed this task.

This issue mostly affected deployment-bastion, which had only 2 GB for /var and, by design, a lot of accounted activity (scap, code updates, etc.). We have phased out that instance (replaced by deployment-tin), so it is no longer an issue.