/dev/drbd1 ext4 9.8G 535M 8.7G  6% /srv/test
/dev/drbd4 ext4 8.0T 6.8T 815G 90% /srv/tools
/dev/drbd3 ext4 5.0T 3.9T 883G 82% /srv/misc
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Bstorm | T183920 2018-01-02: labstore Tools and Misc share very full
Resolved | | madhuvishy | T183953 tools.iabot is using 1.3T of 8T available tools nfs storage
Resolved | | Kolossos | T183954 templatetiger is using 827G of 8T available tools nfs storage
Resolved | | madhuvishy | T183970 wikidumpparse is using 1.2TB of 5T available NFS misc storage
Duplicate | | madhuvishy | T183971 dumps project is using 2T of 5T available NFS misc storage
Resolved | | madhuvishy | T174468 VPS Project dumps is using 1.7T at /data/project on NFS
Resolved | | Bstorm | T255628 VPS Project dumps is using 2.4 TB at /data/project on NFS
Resolved | | bd808 | T184958 tools.wikidata-exports using 369G of 8T tools NFS storage
Resolved | | Bstorm | T196660 Clean up videoconvert if there are temp files to do so with
Event Timeline
Running as of now:
labstore1004:~# find /srv/tools -type f -size +100M -printf "%p %k KB\n" &> /root/tools_large_files_01022018.txt
This finished and I see the largest files are:
/srv/tools/shared/tools/project/bambots/CategoryWatchlistBot.out 120153380 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker3.err 403703740 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker2.err 427911920 KB
/srv/tools/shared/tools/project/iabot/Workers/Worker1.err 475118796 KB
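As a side note, a report in this `path size KB` format can be ranked and totaled with standard tools. A minimal sketch (the report filename is an illustrative stand-in for the file produced by the `find` run above, and a demo line is written only if no report exists):

```shell
# Rank a "path KB-count KB" report by size and total the space it covers.
# REPORT is an assumption: point it at the find output file from above.
report="${REPORT:-tools_large_files.txt}"
# Demo data if the report is absent, so the commands run anywhere:
[ -f "$report" ] || printf '%s\n' '/a.err 100 KB' '/b.err 300 KB' '/c.out 200 KB' > "$report"
sort -k2,2 -n "$report" | tail -n 5                       # five largest files
awk '{s += $2} END {printf "total: %d KB\n", s}' "$report" # total size in KB
```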
root@labstore1004:~# du -sh /srv/tools/shared/tools/project/iabot/Workers/Worker1.err
454G /srv/tools/shared/tools/project/iabot/Workers/Worker1.err
That's pretty huge. I think the iabot workers have something going wrong with them. But since we have breached 90%, I need to truncate this ASAP.
truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker1.err
truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker2.err
truncate -s 0 /srv/tools/shared/tools/project/iabot/Workers/Worker3.err
truncate -s 0 /srv/tools/shared/tools/project/bambots/CategoryWatchlistBot.out
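Truncating rather than deleting matters here: a writer that still holds the file open would keep the blocks allocated after `rm` until it closes its handle, while `truncate -s 0` frees them immediately. A hedged sketch of doing this in bulk (the directory, glob, and 100M threshold are illustrative assumptions, not the exact cleanup above; a sparse demo file stands in for a runaway log so the commands are runnable anywhere):

```shell
# Zero out every *.err log over 100M under a directory, in place.
logdir="${LOGDIR:-./workers}"
mkdir -p "$logdir"
truncate -s 200M "$logdir/demo.err"   # sparse demo file standing in for a runaway log
find "$logdir" -type f -name '*.err' -size +100M -exec truncate -s 0 {} +
ls -l "$logdir"                       # demo.err is now 0 bytes
```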
/dev/drbd4 ext4 8.0T 5.5T 2.2T 72% /srv/tools
Yea. Those workers are definitely having trouble somewhere. They’ve never had such an explosion of errors.
Thanks @Bstorm. Take this over or close and make one for modern cleanup as you see fit :)
FWIW as of right now
root@labstore1004:~# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs   10M     0   10M   0% /dev
tmpfs          tmpfs      26G  2.6G   23G  11% /run
/dev/sda1      ext4      880G   16G  820G   2% /
tmpfs          tmpfs      63G     0   63G   0% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs      63G     0   63G   0% /sys/fs/cgroup
/dev/drbd1     ext4      9.8G  535M  8.7G   6% /srv/test
/dev/drbd4     ext4      8.0T  6.8T  775G  90% /srv/tools
/dev/drbd3     ext4      5.0T  2.4T  2.4T  50% /srv/misc
root@labstore1004:~#
For the record, yesterday I ran:
ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%p %k KB\n" > /root/tools_large_files_20180606.txt
and found a nice set. I also ran:
cat tools_large_files_20180606.txt | sort -k 2,2 -h > sorted_tools_large_files_20180606.txt
From that and the dashboard at https://grafana.wikimedia.org/dashboard/db/labstore-nfs-directory-sizes?orgId=1 it looks like we need cleanup in templatetiger, wikidata-exports and videoconvert. Also, there is a list of *.err and *.out files that might be worth truncating.
I suppose these are the outputs of grid job commands, and as far as I know nothing deletes them by default. Could we imagine log rotation for all job outputs by default?
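A hedged sketch of what default rotation might look like with logrotate; the glob, size limit, and rotation count are illustrative assumptions, not an existing config. `copytruncate` fits this case because the grid jobs keep their output files open:

```
# Hypothetical logrotate stanza for grid job output files (illustrative paths/limits).
/srv/tools/shared/tools/project/*/*.out /srv/tools/shared/tools/project/*/*.err {
    size 100M
    rotate 2
    compress
    copytruncate
    missingok
    notifempty
}
```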
We are almost never short on imagination. We are almost always short on people and hardware to implement new solutions.
Since the videoconvert cleanup, we are at:
/dev/drbd4 8.0T 6.7T 960G 88% /srv/tools
Waiting on a couple of cleanup tickets; I will probably try truncating larger gridengine output files on Monday.
I totally forgot to tell @Bstorm we have announced the invasive parts of this in the past, old example: https://lists.wikimedia.org/pipermail/labs-l/2016-May/004493.html