2018-01-02: labstore Tools and Misc share very full
Closed, Resolved, Public

Description

/dev/drbd1     ext4      9.8G  535M  8.7G   6% /srv/test
/dev/drbd4     ext4      8.0T  6.8T  815G  90% /srv/tools
/dev/drbd3     ext4      5.0T  3.9T  883G  82% /srv/misc

Event Timeline

running as of now

labstore1004:~# find /srv/tools -type f -size +100M -printf "%p %k KB\n" &> /root/tools_large_files_01022018.txt

This finished and I see the largest files are:

/srv/tools/shared/tools/project/bambots/CategoryWatchlistBot.out 120153380 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker3.err 403703740 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker2.err 427911920 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker1.err 475118796 KB

root@labstore1004:~# du -sh /srv/tools/shared/tools/project/iabot/Workers/Worker1.err
454G /srv/tools/shared/tools/project/iabot/Workers/Worker1.err

That's pretty huge. I think the iabot workers have something going wrong with them. But since we have breached 90%, I need to truncate these ASAP.

truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker1.err
truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker2.err
truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker3.err
truncate -s 0 /srv/tools/shared/tools/project/bambots/CategoryWatchlistBot.out

/dev/drbd4 ext4 8.0T 5.5T 2.2T 72% /srv/tools

cat tools_large_files_01022018.txt | sort -h -k 2
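A slightly more defensive variant of the listing above, as a sketch only. It prints the size first, so paths containing spaces cannot shift the sort key, and sorts numerically. The function name `largest_files` and its arguments are made up for illustration and were not used on labstore1004; `-printf` assumes GNU find.

```shell
# Hypothetical helper: list files under DIR larger than MIN_SIZE (a find -size
# value such as 100M), largest first. Assumes GNU find for -printf.
largest_files() {
    dir=$1
    min_size=$2
    count=${3:-20}
    # %k is disk usage in 1 KiB blocks, as in the original command;
    # the tab keeps the size and the path in separate fields.
    find "$dir" -type f -size "+$min_size" -printf '%k\t%p\n' \
        | sort -rn \
        | head -n "$count"
}
```

Something like `largest_files /srv/tools 100M 10` would then show the ten largest files over 100M.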

bd808 renamed this task from labstore Tools and Misc share very full to 2018-01-02: labstore Tools and Misc share very full. Jan 2 2018, 6:56 PM

Yea. Those workers are definitely having trouble somewhere. They’ve never had such an explosion of errors.

@madhuvishy could you review and potentially close this round of cleanup? :D

bd808 added a subscriber: madhuvishy.
chasemp added a subscriber: Bstorm.

Thanks @Bstorm. Take this over or close and make one for modern cleanup as you see fit :)

FWIW as of right now

root@labstore1004:~# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs   10M     0   10M   0% /dev
tmpfs          tmpfs      26G  2.6G   23G  11% /run
/dev/sda1      ext4      880G   16G  820G   2% /
tmpfs          tmpfs      63G     0   63G   0% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs      63G     0   63G   0% /sys/fs/cgroup
/dev/drbd1     ext4      9.8G  535M  8.7G   6% /srv/test
/dev/drbd4     ext4      8.0T  6.8T  775G  90% /srv/tools
/dev/drbd3     ext4      5.0T  2.4T  2.4T  50% /srv/misc
root@labstore1004:~#
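The 90% figure on /srv/tools is the kind of thing a small check could flag. A minimal sketch, assuming a POSIX `df` and a device name without spaces; the helper names `usage_pct` and `over_threshold` are hypothetical and not part of any existing labstore tooling.

```shell
# Hypothetical helpers; the names are illustrative.

# Print the Use% of the filesystem holding the given path as a bare integer.
usage_pct() {
    # -P forces the POSIX one-line-per-filesystem format, where the
    # fifth column is the capacity, e.g. "90%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the given percentage.
over_threshold() {
    [ "$(usage_pct "$1")" -ge "$2" ]
}
```

For example, `over_threshold /srv/tools 90 && echo "tools share needs cleanup"` would print the message only once usage reaches 90%.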

For the record, yesterday I ran:

ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%p %k KB\n" > /root/tools_large_files_20180606.txt

and found a nice set. I also ran:

cat tools_large_files_20180606.txt | sort -k 2,2 -h > sorted_tools_large_files_20180606.txt

From that and the dashboard at https://grafana.wikimedia.org/dashboard/db/labstore-nfs-directory-sizes?orgId=1 it looks like we need cleanup in templatetiger, wikidata-exports and videoconvert. Also, there is a list of *.err and *.out files that might be worth truncating.

I suppose these are the outputs of grid job commands. As far as I know, nothing deletes them by default. Can we imagine log rotation of all job outputs by default?

We are almost never short on imagination. We are almost always short on people and hardware to implement new solutions.

Since the videoconvert cleanup, we are at:

/dev/drbd4      8.0T  6.7T  960G  88% /srv/tools

Waiting on a couple of cleanup tickets, and I will probably try truncating the larger gridengine output files on Monday.
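A rough sketch of what truncating the larger gridengine output files could look like, with a dry run as the default. The function name `truncate_job_logs`, the `apply` switch and the size threshold are assumptions for illustration, not the procedure actually used on this task.

```shell
# Hypothetical helper: find *.err / *.out files under DIR larger than MIN_SIZE
# (a find -size value such as 1G). It only lists them by default and
# truncates only when the third argument is "apply".
truncate_job_logs() {
    dir=$1
    min_size=$2
    mode=${3:-dry-run}
    if [ "$mode" = "apply" ]; then
        # truncate keeps the inode, so jobs that still hold the file open
        # keep writing to the same file.
        find "$dir" -type f \( -name '*.err' -o -name '*.out' \) \
            -size "+$min_size" -print -exec truncate -s 0 {} +
    else
        find "$dir" -type f \( -name '*.err' -o -name '*.out' \) \
            -size "+$min_size" -print
    fi
}
```

Running `truncate_job_logs /srv/tools 1G` first and reviewing the list before repeating it with `apply` leaves room to announce the change or skip files that tool owners still need.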

I totally forgot to tell @Bstorm that we have announced the invasive parts of this in the past. Old example: https://lists.wikimedia.org/pipermail/labs-l/2016-May/004493.html

Following templatetiger cleanup:

/dev/drbd4      8.0T  5.6T  2.1T  74% /srv/tools

This seems pretty good at this point, so I'll close this task for now.