
labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):
Closed, Resolved · Public

Description

Alert generated on 2019-02-25 01:31:05 but we did not receive any notifications about this.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=labstore1004&service=Disk+space

gtirloni@labstore1004:~$ date
Mon Feb 25 08:49:30 UTC 2019

gtirloni@labstore1004:~$ df -h /srv/tools/
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  7.5T  112G  99% /srv/tools

Event Timeline

GTirloni created this task. Feb 25 2019, 8:50 AM
Restricted Application added a subscriber: Aklapper. Feb 25 2019, 8:50 AM
GTirloni triaged this task as High priority. Feb 25 2019, 9:15 AM

iabot is consuming 1.6TB.

iabot# du -s --block-size=1G * | sort -nr
1516	Workers
29	MemoryFiles
16	error.log
4	access.log
1	vendor
1	service.manifest
1	service.log
1	run_guiworker.sh
1	replica.my.cnf
1	public_html
1	logs
1	IABot
1	deadCheck.out
1	deadCheck.err
1	dbMaint.out
1	dbMaint.err
1	cron-tools.iabot-1.err
1	crontab
1	cron-5.out
1	cron-5.err
1	composer.phar
1	composer.lock
1	composer.json
1	bigbrotherrc.old
1	bigbrother.log
1	archiveCheck.out
0	cron-tools.iabot-1.out
0	archiveCheck.err
root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# ls -lh
total 1.5T
-rw-rw---- 1 53156 53156  283 May 30  2018 dbScript.err
-rw-rw---- 1 53156 53156  639 Jun  1  2018 dbScript.out
-rw-r--r-- 1 53156 53156 393G Feb 25 08:56 Worker1.err
-rw-r--r-- 1 53156 53156 363M Feb 24 02:05 Worker1.out
-rw-r--r-- 1 53156 53156 723G Feb 25 08:56 Worker2.err
-rw-r--r-- 1 53156 53156 312M Feb 24 01:58 Worker2.out
-rw-r--r-- 1 53156 53156  33M Feb 25 08:55 Worker3.err
-rw-r--r-- 1 53156 53156 444M Feb 25 08:56 Worker3.out
-rw-rw---- 1 53156 53156  49M Feb 25 08:55 Worker4.err
-rw-rw---- 1 53156 53156 691M Feb 25 08:56 Worker4.out
-rw-rw---- 1 53156 53156 401G Feb 25 08:56 Worker5.err
-rw-rw---- 1 53156 53156 921M Feb 24 01:55 Worker5.out

Extracted the last 20k lines of each .err file into a matching .err.truncated file for later troubleshooting, then deleted the original .err files.

root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# ls -lh
total 2.7G
-rw-r--r-- 1 root  53156  283 Feb 25 09:00 dbScript.err.truncated
-rw-rw---- 1 53156 53156  639 Jun  1  2018 dbScript.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker1.err.truncated
-rw-r--r-- 1 53156 53156 363M Feb 24 02:05 Worker1.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker2.err.truncated
-rw-r--r-- 1 53156 53156 312M Feb 24 01:58 Worker2.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker3.err.truncated
-rw-r--r-- 1 53156 53156 444M Feb 25 08:56 Worker3.out
-rw-r--r-- 1 root  53156 1.5M Feb 25 09:00 Worker4.err.truncated
-rw-rw---- 1 53156 53156 691M Feb 25 08:56 Worker4.out
-rw-r--r-- 1 root  53156 1.9M Feb 25 09:00 Worker5.err.truncated
-rw-rw---- 1 53156 53156 921M Feb 24 01:55 Worker5.out

root@labstore1004:/srv/tools/shared/tools/project/iabot/Workers# du -hs
2.7G	.
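The tail-extraction step described above could be sketched roughly as follows. The exact commands that were run are not recorded in this task, so the helper name save_log_tail and its arguments are hypothetical; only the 20k line count and the .truncated suffix come from the comment.

```shell
# Hypothetical helper, not the command actually run on labstore1004:
# keep the last N lines of each given log in "<file>.truncated".
save_log_tail() {
    lines="$1"
    shift
    for f in "$@"; do
        # Skip unmatched globs and anything that is not a regular file.
        [ -f "$f" ] || continue
        tail -n "$lines" -- "$f" > "$f.truncated"
    done
}

# Assumed invocation from inside the Workers directory:
#   save_log_tail 20000 *.err
```

The originals are left in place on purpose, so that freeing the space (by deleting or by truncating) stays a separate, deliberate step.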

Commented out the entries in the crontab, killed the worker* jobs and stopped the webservice in the new Stretch cluster.

tools.iabot@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 317074 0.26514 worker4    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317078 0.26514 worker3    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0940.tools.     1        
 317079 0.26514 worker1    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317084 0.26514 worker2    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0922.tools.     1        
 317085 0.26514 worker5    tools.iabot  r     02/24/2019 01:55:13 task@tools-sgeexec-0916.tools.     1        
 317112 0.26512 lighttpd-i tools.iabot  r     02/24/2019 01:57:43 webgrid-lighttpd@tools-sgewebg     1  
       
tools.iabot@tools-sgebastion-07:~$ qdel 317074
tools.iabot has registered the job 317074 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317078
tools.iabot has registered the job 317078 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317079
tools.iabot has registered the job 317079 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317084
tools.iabot has registered the job 317084 for deletion
tools.iabot@tools-sgebastion-07:~$ qdel 317085
tools.iabot has registered the job 317085 for deletion

tools.iabot@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 317112 0.26517 lighttpd-i tools.iabot  r     02/24/2019 01:57:43 webgrid-lighttpd@tools-sgewebg     1     

tools.iabot@tools-sgebastion-07:~$ webservice stop
Stopping webservice...........

Confirmed iabot is not running on the old Trusty cluster:

tools.iabot@tools-bastion-03:~$ crontab -l
no crontab for tools.iabot
tools.iabot@tools-bastion-03:~$ qstat
tools.iabot@tools-bastion-03:~$ webservice status
Your webservice is not running

Something is still holding onto those files:

root@labstore1004:~# df -h /srv/tools
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  7.5T  108G  99% /srv/tools
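When df does not drop after a delete, one common check on a Linux host is to look for files that are unlinked but still open. The sketch below scans /proc for such descriptors; the function name is hypothetical, it assumes Linux with a readable /proc, and it needs root to see other users' processes. On an NFS server like this one the files may be held by the kernel nfsd on behalf of remote clients, in which case they may not appear in a scan of this kind.

```shell
# Hypothetical helper: list open file descriptors whose target was deleted.
# Linux marks such symlinks in /proc/<pid>/fd with a " (deleted)" suffix.
list_deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Processes can exit mid-scan and some entries are unreadable; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

# Assumed usage, run as root on the host where the writers live:
#   list_deleted_open_files | grep Worker
```

Where lsof is installed, lsof +L1 reports the same class of files (open files with a link count below one).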

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T09:31:09Z] <gtirloni> commented cronjobs, stop webservices and truncated Worker*.err files (T216988)

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T09:50:17Z] <gtirloni> rebooted tools-sgeexec-09{16,22,40} (T216988)

Mentioned in SAL (#wikimedia-operations) [2019-02-25T10:31:47Z] <gtirloni> labstore1004 restarted nfsd and killed stuck rpc.mountd.real processed (T216988)

root@labstore1004:~# df -h /srv/tools
Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd4      8.0T  5.9T  1.7T  78% /srv/tools
GTirloni added a comment. Edited · Feb 25 2019, 1:19 PM

Lesson re-learned: use truncate instead of rm on a log file that is still open. With rm the name disappears, but the kernel keeps the file's blocks allocated until every open handle is closed, so the disk space isn't released.
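A small local demonstration of the truncate approach, as a sketch only: it uses a temporary directory and a 1 MiB stand-in file rather than the real Worker*.err logs, and file descriptor 3 plays the part of a worker that keeps its log open in append mode. It assumes the coreutils truncate command is available.

```shell
demo_dir=$(mktemp -d)
log="$demo_dir/Worker.err"                # stand-in name, not the real log

head -c 1048576 /dev/zero > "$log"        # 1 MiB placeholder for a runaway log
exec 3>>"$log"                            # fd 3 acts as the still-running writer

truncate -s 0 "$log"                      # blocks are freed now; path and inode stay
size_after_truncate=$(wc -c < "$log" | tr -d ' ')

echo "still logging" >&3                  # append-mode writes land at the new end of file
size_after_write=$(wc -c < "$log" | tr -d ' ')

exec 3>&-                                 # close the simulated writer
rm -r "$demo_dir"

echo "after truncate: $size_after_truncate bytes, after next write: $size_after_write bytes"
# prints "after truncate: 0 bytes, after next write: 14 bytes"
```

One caveat: a writer that did not open the file in append mode keeps its old offset, so its next write recreates the file at its previous apparent size as a sparse file, even though the freed blocks stay free.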

GTirloni closed this task as Resolved. Feb 25 2019, 1:19 PM
GTirloni reopened this task as Open. Feb 25 2019, 2:27 PM

Re-opening due to several tools having issues after NFS was restarted.

GTirloni closed this task as Resolved. Mar 2 2019, 12:44 PM