Saw this on alerts.wikimedia.org:
DISK WARNING - free space: / 3084 MB (6% inode=86%): /tmp 3084 MB (6% inode=86%): /var/tmp 3084 MB (6% inode=86%)
instance: an-coord1001
Confirmed with df:
razzi@an-coord1001:~$ df -h
df: /mnt/hdfs: Input/output error
Filesystem                           Size  Used Avail Use% Mounted on
udev                                  32G     0   32G   0% /dev
tmpfs                                6.3G  666M  5.7G  11% /run
/dev/md0                              46G   41G  3.1G  94% /
tmpfs                                 32G     0   32G   0% /dev/shm
tmpfs                                5.0M     0  5.0M   0% /run/lock
tmpfs                                 32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/an--coord1001--vg-srv    102G   46G   56G  46% /srv
/dev/mapper/an--coord1001--vg-mysql   59G   49G   11G  82% /var/lib/mysql
tmpfs                                6.3G     0  6.3G   0% /run/user/13926
tmpfs                                6.3G     0  6.3G   0% /run/user/124
tmpfs                                6.3G     0  6.3G   0% /run/user/2129
tmpfs                                6.3G     0  6.3G   0% /run/user/26051
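To check a single filesystem instead of reading the full `df -h` table, a minimal sketch like the following reports free space on one mount as a percentage and compares it to a threshold. The `check_free_pct` name and its threshold argument are hypothetical, it assumes GNU coreutils `df` (for `--output=pcent`), and it is not how the alerts.wikimedia.org check is actually implemented. GNU `df` rounds Use% up, so the computed free percentage is approximate.

```shell
# Hypothetical helper (not the real alert check): warn when free space on
# MOUNT falls below WARN_PCT percent. Assumes GNU coreutils df (--output).
check_free_pct() {
  local mount="$1" warn_pct="$2" used_pct free_pct
  # --output=pcent prints a header line, then a value such as " 94%";
  # keep only the digits of the last line.
  used_pct=$(df --output=pcent "$mount" | tail -n 1 | tr -dc '0-9')
  free_pct=$((100 - used_pct))
  if [ "$free_pct" -lt "$warn_pct" ]; then
    echo "WARNING: $mount has ${free_pct}% free (threshold ${warn_pct}%)"
    return 1
  fi
  echo "OK: $mount has ${free_pct}% free"
}
```

For example, `check_free_pct / 10` would have warned on an-coord1001, since `/` was at 94% used (6% free).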
Looking at the logs, /var/log/hive is the largest by far:
du -sh /var/log/* | sort -h
...
12M     /var/log/auth.log.1
80M     /var/log/account
2.1G    /var/log/oozie
14G     /var/log/hive
The logs there date back to February. That is strange, because log4j seems to be configured to keep only 2 log files of 256 MB each, which should cap them at roughly 512 MB rather than 14G: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/bigtop/templates/hive/hive-exec-log4j.properties.erb#16
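To confirm how far back the Hive logs go and how much each month contributes, a sketch like this sums the sizes of regular files under a directory, grouped by modification month. `bytes_by_month` is a hypothetical helper, and it relies on GNU `find -printf`.

```shell
# Hypothetical helper: print "YYYY-MM TOTAL_BYTES" for regular files under
# DIR, grouped by modification month. Assumes GNU find (for -printf).
bytes_by_month() {
  local dir="$1"
  find "$dir" -type f -printf '%TY-%Tm %s\n' \
    | awk '{ sum[$1] += $2 }
           END { for (m in sum) printf "%s %.0f\n", m, sum[m] }' \
    | sort
}
```

Usage would be something like `bytes_by_month /var/log/hive | numfmt --field=2 --to=iec` to get human-readable sizes per month. `%.0f` is used instead of `%d` so large byte totals print as plain integers in all common awk implementations.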
Removing the logs from February for now. We should find a long-term strategy to ensure old logs get deleted automatically.
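As a stopgap until there is a proper retention strategy, the manual cleanup could be expressed as an age-based prune. This is only a sketch: `prune_old_logs` is a hypothetical helper, it defaults to a dry run that just lists matching files, and the retention period is an assumption to be chosen, not something the log4j config defines.

```shell
# Hypothetical helper: list regular files under DIR whose last modification
# is more than DAYS whole days ago. Pass "--delete" as the third argument to
# actually remove them; the default is a dry run. Assumes a find with -delete
# (GNU or busybox).
prune_old_logs() {
  local dir="$1" days="$2" mode="${3:-}"
  if [ "$mode" = "--delete" ]; then
    find "$dir" -type f -mtime +"$days" -print -delete
  else
    find "$dir" -type f -mtime +"$days" -print
  fi
}
```

A typical run would be `prune_old_logs /var/log/hive 30` to review the list, then the same command with `--delete`. Note that `find -mtime +30` matches files at least 31 whole days old, because `find` truncates file age to whole days before comparing.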
We might want to look into the MySQL data size as well: /var/lib/mysql is at 82% (49G of 59G) on its own volume.