
on labcontrol1001, /var/cache/salt has too many files!
Closed, Declined · Public

Description

labcontrol1001 just ran out of inodes. The vast majority of inode consumption was in /var/cache/salt.

Is this happening on all of our salt masters? What can we do about it?
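For reference, a rough way to confirm where the inodes are going (a sketch using standard tools; the paths are simply the ones from this task):

df -i /var
find /var/cache/salt -xdev -type f | wc -l
du --inodes -x /var/cache/salt | sort -n | tail   # du --inodes needs coreutils 8.22 or newer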

Event Timeline

neodymium is running the same version of salt, and yet it is purging its files properly

Never had this issue on palladium either. Do you know if they were in /var/cache/salt/master/jobs?

The fixes from the bug report mentioned above (10433) are both in our version of salt, in returners/local_cache.py: one for a missing function in Python 2.6 (which we don't run) and one for corrupted job entries.

Since the files are no longer around to look at (and rightly so, as they killed the system), I'm going to wait two days, then check whether we have cache entries older than a day and, if so, whether they have anything in common.
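As an aside (not yet verified against our master configs): the job cache lifetime on the master is controlled by the keep_jobs option, which defaults to 24 hours, so a correctly behaving master should have something like this in /etc/salt/master:

# illustrative; value is in hours
keep_jobs: 24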

Just a note: I think we do not check inode usage with our default check_disk params:

/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]"

We could add --icritical=10

-K, --icritical=PERCENT%

Exit with CRITICAL status if less than PERCENT of inode space is free
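So the adjusted check would presumably look like:

/usr/lib/nagios/plugins/check_disk --icritical=10 -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]"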

To test the alert, require that 100% of inode space be free (so the check always goes critical):

labcontrol1001:~# /usr/lib/nagios/plugins/check_disk --icritical=100 -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]"
DISK CRITICAL - free space: / 31987 MB (72% inode=91%); /sys/fs/cgroup 0 MB (100% inode=99%); /dev 24133 MB (99% inode=99%); /run 4828 MB (99% inode=99%); /run/lock 5 MB (100% inode=99%); /run/shm 24144 MB (100% inode=99%); /run/user 100 MB (100% inode=99%); /srv 480878 MB (71% inode=99%);| /=12387MB;43966;45369;0;46773 /sys/fs/cgroup=0MB;0;0;0;0 /dev=0MB;22685;23409;0;24133 /run=0MB;4538;4683;0;4828 /run/lock=0MB;4;4;0;5 /run/shm=0MB;22695;23419;0;24144 /run/user=0MB;94;97;0;100 /srv=195487MB;669830;691208;0;712586

> Never had this issue on palladium either. Do you know if they were in /var/cache/salt/master/jobs?

Yes, I think that's where they were.

Ori suggests tmpreaper::dir if we can't get salt to handle things itself.
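For the record, the underlying tmpreaper invocation would be something along the lines of the following (the age and flags are a guess, not anything agreed on yet; the puppet define would presumably just wrap this in a cron job):

tmpreaper --mtime --test 1d /var/cache/salt/master/jobs   # --test is a dry run; drop it to actually delete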

Thu Mar 10 08:45:59 UTC 2016 I have run a salt test.ping. I see jobs in the cache from Mar 9 15:01 and nothing earlier. Tonight I'll run another such command and see if those Mar 9 15:01 jobs are still there afterwards or if they are gone. That will give me a starting point for investigation.
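For anyone following along, the check itself is nothing fancy (the target in the first command is illustrative):

salt '*' test.ping
ls -lt /var/cache/salt/master/jobs/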

Just checked the job cache again:

root@labcontrol1001:/var/cache/salt/master# ls -lt jobs/
total 28
drwxr-xr-x 3 root root 4096 Mar 10 15:02 d6
drwxr-xr-x 3 root root 4096 Mar 10 15:02 d0
drwxr-xr-x 2 root root 4096 Mar 10 15:02 ad
drwxr-xr-x 2 root root 4096 Mar 10 15:02 89
drwxr-xr-x 2 root root 4096 Mar 10 15:01 5f
drwxr-xr-x 3 root root 4096 Mar 10 08:45 aa
drwxr-xr-x 3 root root 4096 Mar 10 08:45 a2
root@labcontrol1001:/var/cache/salt/master# date
Thu Mar 10 17:11:50 UTC 2016

So it's cleaning up now. This means the bug is triggered only under certain conditions and could be present on neodymium as well (i.e. it's not particular to conditions on labcontrol1001). I'll have to stare at the code closely.

ArielGlenn triaged this task as Medium priority. Mar 10 2016, 5:16 PM

This is good news and bad news, both at once.

ArielGlenn raised the priority of this task from Medium to High. Mar 29 2016, 1:14 PM