
Grid jobs stuck on host
Closed, Resolved · Public

Description

Three of my jobs are stuck on two different hosts.
Tool username: periodibot
Jobs: 203211, 203212, 203214
From what I have been told in #Wikimedia-cloud, the processes are apparently stuck in "uninterruptible sleep".

Event Timeline

valhallasw subscribed.

I've done some initial investigation, but I'm unable to find the root cause. Some observations:

The SGE jobs are in state dRr (delete requested, restarted, running)
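
(For reference, the job state comes from the gridengine tooling; something along these lines, assuming the usual qstat on the bastion, shows the state column and the scheduler's view of a single job:)

$ qstat -u tools.periodibot      # per-job state column (dRr, Rr, ...)
$ qstat -j 203211                # scheduler details for one job, including the exec host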

The periodibot processes are in state D:

valhallasw@tools-sgeexec-0937:~$ ps aux | grep "tools.p"
tools.p+  1422  0.0  0.0      0     0 ?        DNs  Feb25   0:00 [203214]
tools.p+  1425  0.0  0.0      0     0 ?        ZNl  Feb25  16:55 [mono-sgen] <defunct>
tools.p+  9767  0.0  0.0      0     0 ?        DNs  Feb25   0:00 [203211]
tools.p+  9768  0.0  0.0      0     0 ?        ZNl  Feb25  23:13 [mono-sgen] <defunct>
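
(For the record, a way to see which kernel code path a D-state task is blocked in, assuming root on the exec host, would be something like:)

$ ps -o pid,stat,wchan:32,cmd -p 1422,9767   # wchan names the kernel function the task is sleeping in
$ sudo cat /proc/1422/stack                  # full kernel stack of the blocked task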

One of giftbot's processes is also stuck in state D, and there are a few lsof processes as well (plus a number of processes in state D caused by my own investigation, which I have not copied in):

tools.p+  1422  0.0  0.0      0     0 ?        DNs  Feb25   0:00 [203214]
tools.p+  9767  0.0  0.0      0     0 ?        DNs  Feb25   0:00 [203211]
tools.g+ 10376  0.0  0.6 235924 49624 ?        DN   Mar13   1:07 /data/project/shared/tcl/bin/tclsh8.7 /mnt/nfs/labstore-secondary-tools-project/giftbot/gva.tcl
root     22350  0.0  0.0  33088   160 ?        D    Mar15   0:00 /usr/bin/lsof +c 15 -nXd DEL
root     23301  0.0  0.0  33088   156 ?        D    Mar15   0:00 /usr/bin/lsof +c 15 -nXd DEL

The labstore-secondary-tools-project mount is half-broken:

$ ls /mnt/nfs/labstore-secondary-tools-project/
[works]

$ ls /mnt/nfs/labstore-secondary-tools-project/nlwikibots
[works]

$ ls /mnt/nfs/labstore-secondary-tools-project/periodibot/
[works]

$ ls /mnt/nfs/labstore-secondary-tools-project/giftbot/gva.out
[hangs]

All of these work on tools-sgebastion-07.

An strace of the hanging ls stops at:

ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=45, ws_col=192, ws_xpixel=0, ws_ypixel=0}) = 0
stat("/mnt/nfs/labstore-secondary-tools-project/giftbot/gva.out",

Oddly enough, process 1422 doesn't seem to have any file descriptors open:

valhallasw@tools-sgeexec-0937:~$ sudo ls -l /proc/1422/fd
total 0

The contents of /proc/1422 are weird in general:

total 0
dr-xr-xr-x 2 tools.periodibot tools.periodibot 0 Mar 16 17:14 attr
-rw-r--r-- 1 root             root             0 Mar 16 17:14 autogroup
-r-------- 1 root             root             0 Mar 16 17:14 auxv
-r--r--r-- 1 root             root             0 Mar 16 17:14 cgroup
--w------- 1 root             root             0 Mar 16 17:14 clear_refs
-r--r--r-- 1 root             root             0 Mar 11 20:30 cmdline
-rw-r--r-- 1 root             root             0 Mar 16 17:14 comm
-rw-r--r-- 1 root             root             0 Mar 16 17:14 coredump_filter
-r--r--r-- 1 root             root             0 Mar 16 17:14 cpuset
lrwxrwxrwx 1 root             root             0 Mar 11 22:53 cwd
-r-------- 1 root             root             0 Mar 16 17:14 environ
lrwxrwxrwx 1 root             root             0 Mar 11 20:39 exe
dr-x------ 2 root             root             0 Mar 11 22:53 fd
dr-x------ 2 root             root             0 Mar 11 22:53 fdinfo

etc. Almost everything is owned by root, and only directories are still owned by tools.periodibot. Compare this to a running process:

total 0
dr-xr-xr-x 2 valhallasw wikidev 0 Mar 16 17:47 attr
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 autogroup
-r-------- 1 valhallasw wikidev 0 Mar 16 17:47 auxv
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 cgroup
--w------- 1 valhallasw wikidev 0 Mar 16 17:47 clear_refs
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:36 cmdline
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 comm
-rw-r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 coredump_filter
-r--r--r-- 1 valhallasw wikidev 0 Mar 16 17:47 cpuset
lrwxrwxrwx 1 valhallasw wikidev 0 Mar 16 17:47 cwd -> /mnt/nfs/labstore-secondary-tools-home/valhallasw
-r-------- 1 valhallasw wikidev 0 Mar 16 17:47 environ
lrwxrwxrwx 1 valhallasw wikidev 0 Mar 16 17:39 exe -> /bin/bash

Note also that the exe and cwd symlinks cannot be resolved for process 1422:

valhallasw@tools-sgeexec-0937:~$ sudo ls -l /proc/1422/exe
ls: cannot read symbolic link '/proc/1422/exe': No such file or directory
lrwxrwxrwx 1 root root 0 Mar 11 20:39 /proc/1422/exe

In kern.log:

Mar 15 10:23:26 tools-sgeexec-0937 kernel: [1552657.630779] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:26 tools-sgeexec-0937 kernel: [1552657.630784] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:35 tools-sgeexec-0937 kernel: [1552666.078369] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:23:35 tools-sgeexec-0937 kernel: [1552666.430394] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.098780] INFO: task gva.tcl:10376 blocked for more than 120 seconds.
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.099565]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.100202] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101033] gva.tcl         D    0 10376   1423 0x00000000
[... three lines of addresses / values removed ..]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101043] Call Trace:
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101051]  [<removed>] ? __schedule+0x239/0x6f0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101052]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101054]  [<removed>] ? schedule+0x32/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101055]  [<removed>] ? schedule_timeout+0x1dd/0x380
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101079]  [<removed>] ? xprt_release+0x19b/0x310 [sunrpc]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101082]  [<removed>] ? kvm_clock_get_cycles+0x1a/0x20
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101084]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101085]  [<removed>] ? io_schedule_timeout+0x9d/0x100
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101088]  [<removed>] ? prepare_to_wait+0x57/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101090]  [<removed>] ? bit_wait_io+0x17/0x60
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101092]  [<removed>] ? __wait_on_bit+0x55/0x80
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101093]  [<removed>] ? bit_wait+0x50/0x50
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101095]  [<removed>] ? out_of_line_wait_on_bit+0x7e/0xa0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101096]  [<removed>] ? wake_atomic_t_function+0x60/0x60
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101110]  [<removed>] ? nfs_updatepage+0x154/0x910 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101115]  [<removed>] ? nfs_write_end+0x140/0x4f0 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101118]  [<removed>] ? generic_perform_write+0x122/0x1d0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101122]  [<removed>] ? nfs_file_write+0xd5/0x200 [nfs]
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101125]  [<removed>] ? new_sync_write+0xe0/0x130
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101126]  [<removed>] ? vfs_write+0xb0/0x190
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101127]  [<removed>] ? SyS_write+0x52/0xc0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101130]  [<removed>] ? do_syscall_64+0x8d/0xf0
Mar 15 10:33:02 tools-sgeexec-0937 kernel: [1553233.101132]  [<removed>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6

This supports the NFS hypothesis, but the last of these messages was logged at Mar 15 10:51:09, which is 29 hours ago.
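
(The mount options matter here: on a hard NFS mount, blocked I/O keeps processes in D until the server answers. A quick way to check how the share is mounted, as a sketch:)

$ grep labstore-secondary-tools-project /proc/mounts   # look for hard/soft, timeo and retrans in the options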

Attempted a force-umount using sudo umount -fr /mnt/nfs/labstore-secondary-tools-project, but this only resulted in umount.nfs4: /mnt/nfs/labstore-secondary-tools-project: device is busy.
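
(For completeness: a lazy unmount would detach the mount point even while it is busy, but it cannot unblock processes already stuck in D; a sketch, and not something this ticket records being tried:)

$ sudo umount -l /mnt/nfs/labstore-secondary-tools-project   # lazy: detach now, clean up once no longer busy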

@MarioFinale: as a workaround, I would suggest ignoring these jobs, and resubmitting them under a different name. Once the NFS issues are resolved (or the host is rebooted), they should disappear.
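
(Resubmitting just means running the same command with a new job name; a sketch with the usual jsub wrapper, where the name and command are placeholders:)

$ jsub -N periodibot-2 <command>   # "periodibot-2" and <command> are examples, pick any unused name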

I have a stretch webgrid webservice stuck in 'uninterruptible sleep'.

Tool username: bambots
Job: 859159

@valhallasw Thanks for your help. I created the new jobs under different names and they are working OK, but there is another stuck job running on tools-sgeexec-0938 under the name "efevid" (job number 203212). You may want to check it as well.

We found a bunch of these last week after the NFS filesystem issues. Looks like there was another blip :(

I am seeing repeats of this error:

Mar 15 09:13:00 labstore1004 kernel: [17082818.262184] EXT4-fs warning: 20 callbacks suppressed
Mar 15 09:13:00 labstore1004 kernel: [17082818.262188] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271286] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271301] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271311] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.271322] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284638] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284661] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284678] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284692] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 15 09:13:00 labstore1004 kernel: [17082818.284705] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!

Running a find for that inode number. It's always the same inode, and since the warning comes from ext4_dx_add_entry (adding a directory entry), it is likely a directory...
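
(Something along these lines; the exact invocation isn't recorded in the ticket, and /srv/tools is assumed to be the mount backed by drbd4:)

$ sudo find /srv/tools -xdev -inum 236585023   # -xdev keeps the search on one filesystem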

That inode is /srv/tools/shared/tools/project/liangent-php/mw-log

@liangent /data/project/liangent-php/mw-log appears to have too many subdirectories. I am seeing repeated filesystem errors on a fairly regular schedule. Something is creating these, potentially a cron job?

I'm trying to get more information about the directory, but I cannot get a command to return from there. Since it could be causing problems for NFSd, I may have to delete it. I'm just hoping to check in first in case there's something important there.

  File: ‘mw-log’
  Size: 596557824       Blocks: 1165256    IO Block: 1048576 directory
Device: 26h/38d Inode: 236585023   Links: 2
Access: (2755/drwxr-sr-x)  Uid: (51117/tools.liangent-php)   Gid: (51117/tools.liangent-php)
Access: 2016-11-14 19:29:20.165239239 +0000
Modify: 2019-03-16 22:40:21.263693712 +0000
Change: 2019-03-16 22:40:21.263693712 +0000
 Birth: -

I cannot get an actual count of objects under the top level so far. It doesn't seem like there are any subdirs, just lots and lots of files.
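
(A sorting-free listing would at least stream a count without ls stalling on the sort; a sketch, with no guarantee it returns quickly given the state of the directory:)

$ cd /srv/tools/shared/tools/project/liangent-php/mw-log
$ ls -f . | wc -l              # -f disables sorting (and implies -a), entries stream as they are read
$ find . -maxdepth 1 | wc -l   # alternative count of top-level entries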

It looks like the php_cleanupILH_DOM_arwiki job in that project has been stuck since September. I'm not sure if that's the problem...

Trying to tar up that directory then recreate it. So far, it isn't looking good. I may have to just remove it.

Tar succeeds in building a file when run from the server, but I suspect the directory contents are changing underneath it, since it has been running for hours without finishing. I'm stopping the tar.

Emailed the tool owner. There are a lot of cron jobs and at least one service. I'll disable them all, delete the folder and recreate it tomorrow, but I'm hoping the maintainer will be able to fix it in a more sensible way, since I cannot create a reasonable backup of the data there.

I've removed the folder contents and disabled that tool. We will be tracking NFS issues in general. The LDAP problems we are currently experiencing could be part of the problem as well, but I'm hoping that clearing up this NFS error will clear up some more things.

I have four unkillable jobs stuck on the stretch grid for two different tools:

tool: russbot
job-ID: 853908

tool: dplbot
job-IDs: 854031, 872071, 872102

As of this morning, all of the jobs listed in my previous comment are now gone.

Bstorm claimed this task.

Closing this ticket for the specific issue involved.