Three of my jobs are stuck on two different hosts.
Tool username: periodibot
Jobs: 203211 203212 203214
From what I have been told in #Wikimedia-cloud, the processes are apparently stuck in "uninterruptible sleep".
Description
Status | Assigned | Task
--- | --- | ---
Resolved | bd808 | T218038 NFS issue affecting Toolforge SGE master
Resolved | Bstorm | T218486 Grid jobs stuck on host
Event Timeline
I've done some initial investigation, but I'm unable to find the root cause. Some observations:
The SGE jobs are in state dRr (delete requested, restarted, running)
The periodibot processes are in state D:
valhallasw@tools-sgeexec-0937:~$ ps aux | grep "tools.p"
tools.p+  1422  0.0  0.0      0     0 ?  DNs  Feb25   0:00 [203214]
tools.p+  1425  0.0  0.0      0     0 ?  ZNl  Feb25  16:55 [mono-sgen] <defunct>
tools.p+  9767  0.0  0.0      0     0 ?  DNs  Feb25   0:00 [203211]
tools.p+  9768  0.0  0.0      0     0 ?  ZNl  Feb25  23:13 [mono-sgen] <defunct>
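Processes in state D are blocked inside the kernel (here, on NFS I/O) and cannot be killed, not even with SIGKILL. A quick way to survey every D-state process on a host (a generic sketch, not the exact command used above):

```shell
# List every process currently in uninterruptible sleep (state D).
# The first letter of the STAT column is the kernel process state;
# NR == 1 keeps the header row for readability.
ps -eo pid,stat,user,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On a healthy host this usually prints only the header; on a host with a dead NFS mount the stuck processes show up immediately.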
One of giftbot's processes is also stuck in state D, and there are a few lsof processes (plus a number of D-state processes of my own doing, which I have not copied in):
tools.p+  1422  0.0  0.0      0     0 ?  DNs  Feb25  0:00 [203214]
tools.p+  9767  0.0  0.0      0     0 ?  DNs  Feb25  0:00 [203211]
tools.g+ 10376  0.0  0.6 235924 49624 ?  DN   Mar13  1:07 /data/project/shared/tcl/bin/tclsh8.7 /mnt/nfs/labstore-secondary-tools-project/giftbot/gva.tcl
root     22350  0.0  0.0  33088   160 ?  D    Mar15  0:00 /usr/bin/lsof +c 15 -nXd DEL
root     23301  0.0  0.0  33088   156 ?  D    Mar15  0:00 /usr/bin/lsof +c 15 -nXd DEL
The labstore-secondary-tools-project mount is half-broken:
$ ls /mnt/nfs/labstore-secondary-tools-project/                 [works]
$ ls /mnt/nfs/labstore-secondary-tools-project/nlwikibots       [works]
$ ls /mnt/nfs/labstore-secondary-tools-project/periodibot/      [works]
$ ls /mnt/nfs/labstore-secondary-tools-project/giftbot/gva.out  [hangs]
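Paths like these can be probed without risking a hung shell by wrapping each check in timeout(1); on modern kernels the NFS client waits are usually killable, so timeout's SIGKILL gets through and a dead path fails after a bounded wait instead of joining the D-state pile. A minimal sketch (the path argument would be whichever mount is under suspicion):

```shell
# Probe a path that may sit on a dead NFS mount. timeout(1) kills
# stat(1) if it does not return within 5 seconds, so the probing
# shell does not block indefinitely on a hung mount.
probe() {
    if timeout 5 stat -- "$1" > /dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: hung or unreadable"
    fi
}

probe /tmp    # a healthy local path prints "/tmp: ok"
```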
All of these work on tools-sgebastion-07.
strace hangs at:
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=45, ws_col=192, ws_xpixel=0, ws_ypixel=0}) = 0
stat("/mnt/nfs/labstore-secondary-tools-project/giftbot/gva.out",
Oddly enough, process 1422 doesn't seem to have any file descriptors open:
valhallasw@tools-sgeexec-0937:~$ sudo ls -l /proc/1422/fd
total 0
The contents of /proc/1422 are weird in general:
total 0
dr-xr-xr-x 2 tools.periodibot tools.periodibot 0 Mar 16 17:14 attr
-rw-r--r-- 1 root             root             0 Mar 16 17:14 autogroup
-r-------- 1 root             root             0 Mar 16 17:14 auxv
-r--r--r-- 1 root             root             0 Mar 16 17:14 cgroup
--w------- 1 root             root             0 Mar 16 17:14 clear_refs
-r--r--r-- 1 root             root             0 Mar 11 20:30 cmdline
-rw-r--r-- 1 root             root             0 Mar 16 17:14 comm
-rw-r--r-- 1 root             root             0 Mar 16 17:14 coredump_filter
-r--r--r-- 1 root             root             0 Mar 16 17:14 cpuset
lrwxrwxrwx 1 root             root             0 Mar 11 22:53 cwd
-r-------- 1 root             root             0 Mar 16 17:14 environ
lrwxrwxrwx 1 root             root             0 Mar 11 20:39 exe
dr-x------ 2 root             root             0 Mar 11 22:53 fd
dr-x------ 2 root             root             0 Mar 11 22:53 fdinfo
etc. Almost everything is owned by root, and only directories are still owned by tools.periodibot. Compare this to a running process:
total 0
dr-xr-xr-x 2 valhallasw wikidev 0 Mar 16 17:47 attr
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 autogroup
-r-------- 1 valhallasw wikidev 0 Mar 16 17:47 auxv
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 cgroup
--w------- 1 valhallasw wikidev 0 Mar 16 17:47 clear_refs
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:36 cmdline
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 comm
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 coredump_filter
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 cpuset
lrwxrwxrwx 1 valhallasw wikidev 0 Mar 16 17:47 cwd -> /mnt/nfs/labstore-secondary-tools-home/valhallasw
-r-------- 1 valhallasw wikidev 0 Mar 16 17:47 environ
lrwxrwxrwx 1 valhallasw wikidev 0 Mar 16 17:39 exe -> /bin/bash
Note also that exe and cwd are undefined for process 1422:
valhallasw@tools-sgeexec-0937:~$ sudo ls -l /proc/1422/exe
ls: cannot read symbolic link '/proc/1422/exe': No such file or directory
lrwxrwxrwx 1 root root 0 Mar 11 20:39 /proc/1422/exe
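For contrast, a live process exposes exe and cwd as readable symlinks and has open file descriptors. A small sketch inspecting the current shell (via its own PID, $$) shows what the healthy layout looks like:

```shell
# A healthy process exposes exe and cwd as resolvable symlinks under
# /proc/<pid>/ and normally has at least one open file descriptor;
# $$ is the PID of the shell running this script.
readlink /proc/$$/exe      # path of the running binary
readlink /proc/$$/cwd      # current working directory
ls /proc/$$/fd | wc -l     # count of open file descriptors
```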
In kern.log:
Mar 15 10:23:26 tools-sgeexec-0937 kernel: [1552657.630779] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:26 tools-sgeexec-0937 kernel: [1552657.630784] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:35 tools-sgeexec-0937 kernel: [1552666.078369] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:35 tools-sgeexec-0937 kernel: [1552666.430394] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.098780] INFO: task gva.tcl:10376 blocked for more than 120 seconds.
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.099565]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.100202] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101033] gva.tcl         D    0 10376   1423 0x00000000
[... three lines of addresses / values removed ...]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101043] Call Trace:
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101051]  [<removed>] ? __schedule+0x239/0x6f0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101052]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101054]  [<removed>] ? schedule+0x32/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101055]  [<removed>] ? schedule_timeout+0x1dd/0x380
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101079]  [<removed>] ? xprt_release+0x19b/0x310 [sunrpc]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101082]  [<removed>] ? kvm_clock_get_cycles+0x1a/0x20
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101084]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101085]  [<removed>] ? io_schedule_timeout+0x9d/0x100
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101088]  [<removed>] ? prepare_to_wait+0x57/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101090]  [<removed>] ? bit_wait_io+0x17/0x60
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101092]  [<removed>] ? __wait_on_bit+0x55/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101093]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101095]  [<removed>] ? out_of_line_wait_on_bit+0x7e/0xa0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101096]  [<removed>] ? wake_atomic_t_function+0x60/0x60
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101110]  [<removed>] ? nfs_updatepage+0x154/0x910 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101115]  [<removed>] ? nfs_write_end+0x140/0x4f0 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101118]  [<removed>] ? generic_perform_write+0x122/0x1d0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101122]  [<removed>] ? nfs_file_write+0xd5/0x200 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101125]  [<removed>] ? new_sync_write+0xe0/0x130
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101126]  [<removed>] ? vfs_write+0xb0/0x190
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101127]  [<removed>] ? SyS_write+0x52/0xc0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101130]  [<removed>] ? do_syscall_64+0x8d/0xf0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101132]  [<removed>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
This supports the NFS hypothesis -- but the last of these is logged at Mar 15 10:51:09, which is 29 hours ago.
Attempted a force-unmount using sudo umount -fr /mnt/nfs/labstore-secondary-tools-project, but this only resulted in:
umount.nfs4: /mnt/nfs/labstore-secondary-tools-project: device is busy
@MarioFinale: as a workaround, I would suggest ignoring these jobs, and resubmitting them under a different name. Once the NFS issues are resolved (or the host is rebooted), they should disappear.
I have got a stretch webgrid webservice stuck in 'uninterruptible sleep'.
Tool username: bambots
Job: 859159
@valhallasw Thanks for your help. I created the new jobs under different names and they are working fine, but there is another job still stuck: it is running on tools-sgeexec-0938 under the name "efevid" (job number 203212). Maybe you want to check it.
We found a bunch of these last week after the NFS filesystem issue. Looks like there was another blip :(
I am seeing repeats of this error:
Mar 15 09:13:00 labstore1004 kernel: [17082818.262184] EXT4-fs warning: 20 callbacks suppressed
Mar 15 09:13:00 labstore1004 kernel: [17082818.262188] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271286] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271301] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271311] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271322] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284638] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284661] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284678] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284692] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284705] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Running a find for that inode number. It's always the same one and is likely a directory...
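Mapping an inode number back to a path is a find -inum job, with -xdev keeping the search on one filesystem. A self-contained sketch on a temporary file (on the server the same idea would run against the exported filesystem, with the inode number from the warnings above):

```shell
# Map an inode number back to its path(s) with find -inum.
# Demonstrated on a temporary file; -xdev stops find from crossing
# into other mounted filesystems.
tmp=$(mktemp -d)
touch "$tmp/example"
inum=$(stat -c %i "$tmp/example")
find "$tmp" -xdev -inum "$inum"
```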
@liangent /data/project/liangent-php/mw-log appears to have too many subdirectories. I am seeing repeated filesystem errors on a fairly regular schedule. Something is creating these, potentially a cron job?
I'm trying to get more information about the directory, but I cannot get a command to return from there. Since it could be causing problems for NFSd, I may have to delete it. I'm just hoping to check in first in case there's something important there.
  File: ‘mw-log’
  Size: 596557824   Blocks: 1165256   IO Block: 1048576   directory
Device: 26h/38d   Inode: 236585023   Links: 2
Access: (2755/drwxr-sr-x)   Uid: (51117/tools.liangent-php)   Gid: (51117/tools.liangent-php)
Access: 2016-11-14 19:29:20.165239239 +0000
Modify: 2019-03-16 22:40:21.263693712 +0000
Change: 2019-03-16 22:40:21.263693712 +0000
 Birth: -
I cannot get an actual count of the objects under the top level so far. There don't seem to be any subdirectories, just lots and lots of files.
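Counting a directory this large is itself expensive, because plain ls sorts the whole listing in memory first; ls -f (or find) skips the sort and streams entries instead. A small sketch on a throwaway directory (the real target being the mw-log directory above):

```shell
# Count directory entries without sorting them first: ls -f disables
# sorting (and implies -a, so "." and ".." are included in the count).
d=$(mktemp -d)
touch "$d/one" "$d/two" "$d/three"
ls -f "$d" | wc -l                     # 5: three files plus . and ..
find "$d" -maxdepth 1 -type f | wc -l  # 3: regular files only
```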
It looks like the php_cleanupILH_DOM_arwiki job in that project has been stuck since September. I'm not sure if that's the problem...
Trying to tar up that directory then recreate it. So far, it isn't looking good. I may have to just remove it.
Tar succeeds in building a file when run from the server, but I suspect the directory contents are changing underneath it, since it has been running for hours without finishing. I'm stopping the tar.
Emailed the tool owner. There are a lot of cron jobs and at least one service. I'll disable it all, delete the folder, and recreate it tomorrow, but I'm hoping the maintainer will be able to fix it in a more sensible way, since I cannot create a reasonable backup of the data there.
I've removed the folder contents and disabled that tool. We will be tracking NFS issues in general. The LDAP problems we are experiencing currently could be part of the problem as well, but I'm hoping that clearing up that NFS error will clear up some more things.
I have four unkillable jobs stuck on the stretch grid for two different tools:
tool: russbot
job-ID: 853908
tool: dplbot
job-IDs: 854031, 872071, 872102