
labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):
Closed, Resolved (Public)

Description

Alert generated on 2019-02-25 01:31:05 but we did not receive any notifications about this.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=labstore1004&service=Disk+space

gtirloni@labstore1004:~$ date
Mon Feb 25 08:49:30 UTC 2019

gtirloni@labstore1004:~$ df -h /srv/tools/
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  7.5T  112G  99% /srv/tools

Event Timeline

iabot is consuming 1.6TB.

iabot# du -s --block-size=1G * | sort -nr
1516	Workers
29	MemoryFiles
16	error.log
4	access.log
1	vendor
1	service.manifest
1	service.log
1	run_guiworker.sh
1	replica.my.cnf
1	public_html
1	logs
1	IABot
1	deadCheck.out
1	deadCheck.err
1	dbMaint.out
1	dbMaint.err
1	cron-tools.iabot-1.err
1	crontab
1	cron-5.out
1	cron-5.err
1	composer.phar
1	composer.lock
1	composer.json
1	bigbrotherrc.old
1	bigbrother.log
1	archiveCheck.out
0	cron-tools.iabot-1.out
0	archiveCheck.err
root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# ls -lh
total 1.5T
-rw-rw---- 1 53156 53156  283 May 30  2018 dbScript.err
-rw-rw---- 1 53156 53156  639 Jun  1  2018 dbScript.out
-rw-r--r-- 1 53156 53156 393G Feb 25 08:56 Worker1.err
-rw-r--r-- 1 53156 53156 363M Feb 24 02:05 Worker1.out
-rw-r--r-- 1 53156 53156 723G Feb 25 08:56 Worker2.err
-rw-r--r-- 1 53156 53156 312M Feb 24 01:58 Worker2.out
-rw-r--r-- 1 53156 53156  33M Feb 25 08:55 Worker3.err
-rw-r--r-- 1 53156 53156 444M Feb 25 08:56 Worker3.out
-rw-rw---- 1 53156 53156  49M Feb 25 08:55 Worker4.err
-rw-rw---- 1 53156 53156 691M Feb 25 08:56 Worker4.out
-rw-rw---- 1 53156 53156 401G Feb 25 08:56 Worker5.err
-rw-rw---- 1 53156 53156 921M Feb 24 01:55 Worker5.out

Extracted the last 20k lines from each .err file into a .err.truncated file for later troubleshooting, then deleted the original .err files.
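For reference, a minimal sketch of how the tail extraction could be scripted. The exact command used here was not recorded, so the helper name keep_tail and the scratch file below are assumptions for illustration only; the demo runs against a temporary file, not the real Worker*.err logs.

```shell
#!/bin/sh
# Hypothetical helper (not from the task): keep the last N lines of each
# given log in a sibling "<name>.truncated" file. Originals are untouched.
keep_tail() {
    lines=$1
    shift
    for f in "$@"; do
        tail -n "$lines" -- "$f" > "$f.truncated"
    done
}

# Self-contained demo on a scratch file with 50 numbered lines.
demo_dir=$(mktemp -d)
seq 1 50 > "$demo_dir/Worker1.err"
keep_tail 20 "$demo_dir/Worker1.err"

kept_lines=$(wc -l < "$demo_dir/Worker1.err.truncated" | tr -d ' ')
first_kept=$(head -n 1 "$demo_dir/Worker1.err.truncated")
echo "kept $kept_lines lines, starting at line $first_kept"
```

For the logs above, the equivalent call would be something like keep_tail 20000 Worker*.err, run from the Workers directory.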

root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# ls -lh
total 2.7G
-rw-r--r-- 1 root  53156  283 Feb 25 09:00 dbScript.err.truncated
-rw-rw---- 1 53156 53156  639 Jun  1  2018 dbScript.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker1.err.truncated
-rw-r--r-- 1 53156 53156 363M Feb 24 02:05 Worker1.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker2.err.truncated
-rw-r--r-- 1 53156 53156 312M Feb 24 01:58 Worker2.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker3.err.truncated
-rw-r--r-- 1 53156 53156 444M Feb 25 08:56 Worker3.out
-rw-r--r-- 1 root  53156 1.5M Feb 25 09:00 Worker4.err.truncated
-rw-rw---- 1 53156 53156 691M Feb 25 08:56 Worker4.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker5.err.truncated
-rw-rw---- 1 53156 53156 921M Feb 24 01:55 Worker5.out

root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# du -hs
2.7G	.

Commented out the entries in the crontab, killed the worker* jobs and stopped the webservice in the new Stretch cluster.

tools.iabot@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 317074 0.26514 worker4    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317078 0.26514 worker3    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0940.tools.     1        
 317079 0.26514 worker1    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317084 0.26514 worker2    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0922.tools.     1        
 317085 0.26514 worker5    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317112 0.26512 lighttpd-i tools.iabot  r     02/24/2019 01:57:43 webgrid-lighttpd@tools-sgewebg     1  
       
tools.iabot@tools-sgebastion-07:~$ qdel 317074
tools.iabot has registered the job 317074 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317078
tools.iabot has registered the job 317078 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317079
tools.iabot has registered the job 317079 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317084
tools.iabot has registered the job 317084 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317085
tools.iabot has registered the job 317085 for deletion
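The five qdel calls above could also be driven from the qstat listing. A rough sketch, assuming the column layout shown in this task (job ID in column 1, job name in column 3, two header lines); the function name worker_job_ids is made up for illustration, and the sample rows are copied from the qstat output above.

```shell
#!/bin/sh
# Hypothetical filter: print the job IDs whose name starts with "worker".
# Assumes two header lines, job ID in field 1 and job name in field 3.
worker_job_ids() {
    awk 'NR > 2 && $3 ~ /^worker/ { print $1 }'
}

# Sample rows copied from the qstat output in this task.
sample='job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 317074 0.26514 worker4    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1
 317078 0.26514 worker3    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0940.tools.     1
 317112 0.26512 lighttpd-i tools.iabot  r     02/24/2019 01:57:43 webgrid-lighttpd@tools-sgewebg     1'

ids=$(printf '%s\n' "$sample" | worker_job_ids | paste -s -d ' ' -)
echo "$ids"

# On the bastion this could feed qdel one job at a time, for example:
#   qstat | worker_job_ids | xargs -n 1 qdel
```

The lighttpd job is deliberately left out by the name filter, since the webservice is stopped separately with webservice stop.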

tools.iabot@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 317112 0.26517 lighttpd-i tools.iabot  r     02/24/2019 01:57:43 webgrid-lighttpd@tools-sgewebg     1     

tools.iabot@tools-sgebastion-07:~$ webservice stop
Stopping webservice...........

Confirmed iabot is not running on the old Trusty cluster:

tools.iabot@tools-bastion-03:~$ crontab -l
no crontab for tools.iabot
tools.iabot@tools-bastion-03:~$ qstat
tools.iabot@tools-bastion-03:~$ webservice status
Your webservice is not running

Something is still holding the deleted files open, so the space has not been released:

root@labstore1004:~# df -h /srv/tools
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  7.5T  108G  99% /srv/tools
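The behaviour behind this can be reproduced locally. The sketch below uses a scratch file and a shell file descriptor as a stand-in for whatever was holding the Worker*.err files here; it only illustrates that removing a name does not free the data while a descriptor is open, and it is not a diagnosis of the NFS case.

```shell
#!/bin/sh
# Scratch demo: a deleted file stays alive while a descriptor is open.
demo_dir=$(mktemp -d)
printf 'still here\n' > "$demo_dir/Worker.err"

exec 3< "$demo_dir/Worker.err"   # hold the file open on descriptor 3
rm "$demo_dir/Worker.err"        # remove the only directory entry

if [ -e "$demo_dir/Worker.err" ]; then
    name_exists=yes
else
    name_exists=no
fi

read -r recovered <&3            # contents are still readable through fd 3
exec 3<&-                        # closing the descriptor lets the blocks go

echo "name exists: $name_exists, data via fd: $recovered"
```

On a local filesystem, lsof +L1 lists open files whose link count has dropped to zero. Whether it would have shown anything useful on the NFS server in this incident is not established by this task.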

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T09:31:09Z] <gtirloni> commented cronjobs, stop webservices and truncated Worker*.err files (T216988)

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T09:50:17Z] <gtirloni> rebooted tools-sgeexec-09{16,22,40} (T216988)

Mentioned in SAL (#wikimedia-operations) [2019-02-25T10:31:47Z] <gtirloni> labstore1004 restarted nfsd and killed stuck rpc.mountd.real processed (T216988)

root@labstore1004:~# df -h /srv/tools
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  5.9T  1.7T  78% /srv/tools

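Lesson re-learned: use truncate instead of rm. When a file that is still open is removed, the file handle is lost but the kernel keeps the data until the last handle is closed, so the disk space isn't released. Truncating the file in place frees the space immediately.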
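A small sketch of the truncate approach, again on a scratch file. Descriptor 3 opened in append mode stands in for a worker writing to its log; that the real workers open their logs in append mode is an assumption.

```shell
#!/bin/sh
# Scratch demo: truncate a log in place while a writer still has it open.
demo_dir=$(mktemp -d)
log="$demo_dir/Worker.err"

exec 3>> "$log"                  # stand-in for a worker appending to its log
printf 'noisy error line\n' >&3
size_before=$(wc -c < "$log" | tr -d ' ')

: > "$log"                       # truncate in place (GNU: truncate -s 0 FILE)
size_after=$(wc -c < "$log" | tr -d ' ')

printf 'new line\n' >&3          # the writer carries on with the same file
exec 3>&-
size_final=$(wc -c < "$log" | tr -d ' ')

echo "before=$size_before after=$size_after final=$size_final"
```

This works cleanly because the writer appends. A writer that opened the file without append mode would keep writing at its old offset, leaving a sparse file whose apparent size is large even though the truncated blocks were freed.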

Re-opening due to several tools having issues after NFS was restarted.