
Investigate why the new Son of Grid Engine grid landed in a worse state when NFS was filled than the old Sun Grid Engine grid did
Closed, Resolved · Public

Description

When the tools NFS filesystem filled up and needed restarting (see T217068 for the timing), the old Ubuntu Trusty grid running Sun Grid Engine 6 had few errors. The newer Debian Stretch Son of Grid Engine grid lost at least 19 nodes, all in a hung state that required a reboot from within Openstack to recover them (soft reboot was generally sufficient).

This task is to track the investigation and find out why the new grid is even more fragile in the face of NFS errors than the old one.

Event Timeline

Bstorm triaged this task as Medium priority. Feb 25 2019, 8:01 PM
Bstorm created this task.

First, checking mount options:
fstab-new:

nfs-tools-project.svc.eqiad.wmnet:/project/tools/home   /mnt/nfs/labstore-secondary-tools-home  nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard   0       0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project        /mnt/nfs/labstore-secondary-tools-project       nfs     vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard  0

fstab-old:

nfs-tools-project.svc.eqiad.wmnet:/project/tools/project        /mnt/nfs/labstore-secondary-tools-project       nfs     vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard  0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home   /mnt/nfs/labstore-secondary-tools-home  nfs     vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard   0       0

So the mount options called for are identical. However, what the kernel is actually using may not be the same (output from the mount command on two nodes):
new:

nfs-tools-project.svc.eqiad.wmnet:/project/tools/project on /mnt/nfs/labstore-secondary-tools-project type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.112,local_lock=none,addr=10.64.37.18)
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home on /mnt/nfs/labstore-secondary-tools-home type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.112,local_lock=none,addr=10.64.37.18)

old:

nfs-tools-project.svc.eqiad.wmnet:/project/tools/project on /mnt/nfs/labstore-secondary-tools-project type nfs (rw,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,hard,addr=10.64.37.18,clientaddr=10.68.18.3)
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home on /mnt/nfs/labstore-secondary-tools-home type nfs (rw,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,hard,addr=10.64.37.18,clientaddr=10.68.18.3)

So it seems the problem is also related to the fact that we went from kernel 3.13.0-147-generic to 4.9.0-8-amd64. The intr option is deprecated, so I'm not surprised it doesn't show up. It's interesting how much is actually different, though.
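
For reference, the requested vs. negotiated options can be compared directly on each node with something like this (a minimal sketch; the paths are the labstore mounts from the fstab above):

# Requested options, from fstab:
grep labstore-secondary /etc/fstab

# Effective options, as negotiated by the kernel:
findmnt -t nfs,nfs4 -o TARGET,FSTYPE,OPTIONS

# Per-mount NFS client details (rsize/wsize, timeo, retrans, protocol version):
nfsstat -m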

It is interesting to note that at least one server I checked that had problems shows no interruption of most services in its logs at all.

Same on another server. (tools-sgeexec-0905)

Feb 25 19:11:10 tools-sgeexec-0904 puppet-agent[29329]: (/Stage[main]/Profile::Toolforge::Grid::Hba/Exec[make-access]/onlyif) Check "/usr/bin/test -n \"$(/usr/bin/find /data/project/.system_sge/store -maxdepth 1 \\( -type d -or -type f -name submithost-\\* \\) -newer /etc/project.access)\" -o ! -s /etc/project.access" exceeded timeout
Feb 25 10:33:51 tools-sgeexec-0904 kernel: [3042432.250199] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Feb 25 10:33:52 tools-sgeexec-0904 kernel: [3042433.498050] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, timed out
Feb 25 10:33:52 tools-sgeexec-0904 kernel: [3042433.498073] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
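
A quick way to check any given node for these messages (a sketch; adjust the date to the incident window):

# Kernel messages only, from the incident window onward
journalctl -k --since '2019-02-25' | grep -E 'not responding|timed out'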

The "not responding" pattern is similar to the D-state hang that occurred on tools-worker-1015.
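
A sketch of how to spot the hung tasks on a node without digging through full logs (the wchan column is only a rough hint of what a task is blocked in, and the sysrq dump needs root with sysrq enabled):

# List processes in uninterruptible sleep (state D) with what they're waiting on
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# Kernel-side view: dump blocked task stacks into dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 100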

Well, NFS was actually thrashed at one point. That's not really an error per se; that should have happened. The problem is that afterward, scads of these nodes ended up in a hung state and never came back.

The HBA doesn't affect user logins, just grid functioning. Also, it doesn't need to be reconstructed on every puppet run to continue working.

Basically, I expect NFS to be knocked out for a while, but why didn't these nodes ever come back while the old Trusty grid ones did?

Just dug a bit again and confirmed that the NFS client was screwed right up to the moment I rebooted on tools-sgeexec-0905 (again, no surprise...except that it shouldn't have continued to be messed up). The thing about tools-worker-1015 is we've seen the k8s workers basically never let go of NFS handles. We presumed this was because of container weirdness, but now I'm wondering if it isn't the kernel on Debian handling NFS differently.
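
If it is the client never releasing handles, something along these lines should show who is still holding files under the mount; everything is wrapped in timeout because lsof and fuser can themselves hang on a dead NFS mount (a sketch, paths as above):

# Which processes still hold files open under the NFS mount?
timeout -k 5s 30s fuser -vm /mnt/nfs/labstore-secondary-tools-project
timeout -k 5s 30s lsof +f -- /mnt/nfs/labstore-secondary-tools-project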


It's an indication of a process waiting on NFS and failing. I don't know how root ssh would fail in this case (it's not accessing NFS or anything... unless sshd hung when something ssh-ed into it earlier?).

> ...but why didn't these ever come back while the old Trusty grid ones did?

For some reason I don't even see any relevant kernel messages on tools-bastion-02....

Strangely, I don't see those 'task blocked for more than 120 seconds' messages like the ones on tools-worker-1015:

[Feb20 15:55] INFO: task lighttpd:6573 blocked for more than 120 seconds.
[  +0.034371]       Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.88-1+deb9u1~bpo8+1
[  +0.002393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.008110] lighttpd        D    0  6573   6556 0x00000000
[  +0.000018]  ffff9015b185f400 ffff90162b00d400 ffff90147362d080 ffff90158d15e040
[  +0.000003]  ffff90163fd98ec0 ffffb8ab41e87b50 ffffffff9fe0e973 ffff9016345a93e8
[  +0.000002]  000000017ba09b00 00000000345a93e8 ffffffff9f8e839a ffff90158d15e040
[  +0.000002] Call Trace:
[  +0.000044]  [<ffffffff9fe0e973>] ? __schedule+0x243/0x6f0
[  +0.000021]  [<ffffffff9f8e839a>] ? internal_add_timer+0x1a/0x70
[  +0.000002]  [<ffffffff9fe0ee52>] ? schedule+0x32/0x80
[  +0.000003]  [<ffffffff9fe1233f>] ? schedule_timeout+0x1df/0x380
[  +0.000084]  [<ffffffffc061bfeb>] ? xprt_release+0x19b/0x310 [sunrpc]
[  +0.000016]  [<ffffffff9fb4f619>] ? list_del+0x9/0x20
[  +0.000009]  [<ffffffffc0619be4>] ? rpc_task_release_client+0x54/0x80 [sunrpc]
[  +0.000011]  [<ffffffff9f859eda>] ? kvm_clock_get_cycles+0x1a/0x20
[  +0.000002]  [<ffffffff9fe0f6e0>] ? bit_wait_timeout+0x90/0x90
[  +0.000002]  [<ffffffff9fe0e6b4>] ? io_schedule_timeout+0xb4/0x130
[  +0.000006]  [<ffffffff9f8bd0d7>] ? prepare_to_wait+0x57/0x80
[  +0.000002]  [<ffffffff9fe0f6f7>] ? bit_wait_io+0x17/0x60
[  +0.000002]  [<ffffffff9fe0f1de>] ? __wait_on_bit+0x5e/0x90
[  +0.000003]  [<ffffffff9fe0f6e0>] ? bit_wait_timeout+0x90/0x90
[  +0.000001]  [<ffffffff9fe0f34e>] ? out_of_line_wait_on_bit+0x7e/0xa0
[  +0.000005]  [<ffffffff9f8bd400>] ? autoremove_wake_function+0x40/0x40
[  +0.000052]  [<ffffffffc0682734>] ? nfs_updatepage+0x1b4/0x950 [nfs]
[  +0.000012]  [<ffffffff9fa02d3b>] ? __check_object_size+0x10b/0x1dc
[  +0.000008]  [<ffffffffc0672f6f>] ? nfs_write_end+0x16f/0x520 [nfs]
[  +0.000011]  [<ffffffff9f980eb2>] ? generic_perform_write+0x122/0x1e0
[  +0.000007]  [<ffffffffc0672a04>] ? nfs_file_write+0xd4/0x210 [nfs]
[  +0.000001]  [<ffffffff9fa06e74>] ? new_sync_write+0xe4/0x140
[  +0.000002]  [<ffffffff9fa079c3>] ? vfs_write+0xb3/0x1a0
[  +0.000001]  [<ffffffff9fa08e22>] ? SyS_write+0x52/0xc0
[  +0.000011]  [<ffffffff9f803bd1>] ? do_syscall_64+0x91/0x1a0
[  +0.000005]  [<ffffffff9fe137ce>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
Feb 25 10:46:57 tools-sgeexec-0905 puppet-agent[32403]: (/Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[tools-home-on-labstore-secondary]/Exec[create-/mnt/nfs/labstore-secondary-tools-home]/returns) /bin/mkdir: cannot create directory ‘/mnt/nfs/labstore-secondary-tools-home’: File exists
exec { "create-${mount_path}":
    command   => "/usr/bin/timeout -k 5s 20s /bin/mkdir ${mount_path}",
    unless    => "/usr/bin/timeout -k 5s 60s /usr/bin/test -d ${mount_path}",
    logoutput => true,
    require   => Mount[$mount_path],
}

o.O self-conflict?

Not exactly; it's about whether or not NFS is mounted. The test command timed out because of the broken NFS mount, so it fell through to the mkdir.

Overall, there is nothing I see wrong with the puppet stuff so far. It's doing what it should do if NFS is messed up (except that perhaps it was more confused than normal because the directory couldn't even report that it was a directory without timing out).
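
For clarity, the exec resource above boils down to roughly this shell logic, which is why a hung mount produces the confusing "File exists" error (a sketch using the tools-home path from the log):

MOUNT=/mnt/nfs/labstore-secondary-tools-home

# The 'unless' guard: on a hung mount the test hits its 60s timeout and exits non-zero...
if ! /usr/bin/timeout -k 5s 60s /usr/bin/test -d "$MOUNT"; then
    # ...so puppet runs the 'command', which then fails with "File exists",
    # because the directory does exist; it is just unreachable over NFS.
    /usr/bin/timeout -k 5s 20s /bin/mkdir "$MOUNT"
fi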

The console log of 0905 was unhelpful. From last boot to my reboot, the only thing on there was this (just failing jobs, it looks like, which isn't abnormal):

[4623987.681537] Out of memory: Kill process 24161 (update_items_fr) score 231 or sacrifice child
[4623987.682676] Killed process 24161 (update_items_fr) total-vm:8172060kB, anon-rss:7672824kB, file-rss:0kB, shmem-rss:0kB
[4677948.272155] Out of memory: Kill process 16669 (update_items_fr) score 227 or sacrifice child
[4677948.273353] Killed process 16669 (update_items_fr) total-vm:8047900kB, anon-rss:7546748kB, file-rss:0kB, shmem-rss:0kB
[4703133.475179] Out of memory: Kill process 26112 (update_items_fr) score 226 or sacrifice child
[4703133.476417] Killed process 26112 (update_items_fr) total-vm:8010188kB, anon-rss:7507680kB, file-rss:0kB, shmem-rss:0kB
[4865217.653260] Out of memory: Kill process 8620 (update_items_fr) score 229 or sacrifice child
[4865217.654398] Killed process 8620 (update_items_fr) total-vm:8120436kB, anon-rss:7622368kB, file-rss:0kB, shmem-rss:0kB
[4932263.950309] Out of memory: Kill process 26862 (python) score 224 or sacrifice child
[4932263.951616] Killed process 26862 (python) total-vm:7538980kB, anon-rss:7456044kB, file-rss:0kB, shmem-rss:0kB

Overall, we now know that the OS never recovered the mounts on some Stretch instances, while the Trusty instances did so more reliably.

We also know that the ones that didn't recover were more severely hung somehow. This did not happen on all Stretch grid hosts. I am willing to guess the ones affected were busier, but I don't know.

tools-sgeexec-0901 recovered itself, for instance.

> The console log of 0905 was unhelpful. From last boot to my reboot, the only thing on there was this (just failing jobs, it looks like, which isn't abnormal):

Looks like low-loglevel messages are not written to the console, then. There should be memory statistics along with the OOM messages in dmesg.
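
A sketch of how to confirm that on a node (the OOM killer's memory summary is typically printed at a lower severity than the "Kill process" line, so a restrictive console loglevel would hide it):

# Console loglevel thresholds: current, default, minimum, boot-time default
cat /proc/sys/kernel/printk

# The full OOM report, including the memory statistics, should still be in the ring buffer
dmesg | grep -B 5 -A 20 'Out of memory'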

I will see if I can reproduce this issue in some VMs.

Adding some related task mentions: T124133, T141673, T141017

These may have nothing to do with this event, but they are interesting. They also suggest that NFS served from a plain Linux box sucks, in case anyone here thought otherwise (I'm actually a fan of NFS served from far more specialized appliances, but I think vanilla Linux servers have never been very good at the task).

Change 497762 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] labstore: Increase nfs-exportd interval from 60 to 300s

https://gerrit.wikimedia.org/r/497762

Change 497762 merged by GTirloni:
[operations/puppet@production] labstore: Increase nfs-exportd interval from 60 to 300s

https://gerrit.wikimedia.org/r/497762

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T00:49:50Z] <gtirloni> nfs-exportd interval changes from 60 to 300s (T217086)

More mentions for linking things: T169290, T169281, T203254 -- so far the story runs like this: all Stretch kernels (whether acquired through a backport or an upgrade to Stretch) seem to hate our NFS setup.

Change 498328 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: Exclude NFS shares from wmf-auto-restarts

https://gerrit.wikimedia.org/r/498328

Change 498351 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] wmf-auto-restart: Exclude NFS mountpoints (alternate)

https://gerrit.wikimedia.org/r/498351

Change 498328 abandoned by GTirloni:
wmf-auto-restart: Exclude NFS mountpoints

Reason:
In favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/498351

https://gerrit.wikimedia.org/r/498328

Change 498351 merged by Jbond:
[operations/puppet@production] wmf-auto-restart: Exclude NFS mountpoints (alternate)

https://gerrit.wikimedia.org/r/498351
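
The general idea (a sketch of the approach, not the actual wmf-auto-restart code) is to skip NFS filesystems entirely when a check walks the mountpoints, so a dead NFS server can't wedge the check itself:

# Enumerate mounted filesystems, dropping NFS so the check never touches a dead mount
# (illustrative only; the real exclusion lives in the wmf-auto-restart puppet code)
findmnt -rn -o TARGET,FSTYPE | awk '$2 != "nfs" && $2 != "nfs4" {print $1}'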

I have found that jobs with the name update_items_fr are a repeat offender against the OOM killer. Searching for what job that actually is brings up:

tools.topicmatcher:12 * * * * jsub -mem 8g -once -quiet -N paper2taxon /data/project/topicmatcher/scripts/update_items_from_sparql.php paper2taxon
tools.topicmatcher:42 * * * * jsub -mem 8g -once -quiet -N paper2taxon /data/project/topicmatcher/scripts/update_items_from_sparql.php paper2taxon
tools.topicmatcher:22 * * * * jsub -mem 8g -once -quiet -N paper2taxon2 /data/project/topicmatcher/scripts/update_items_from_sparql.php paper2taxon2
tools.topicmatcher:52 * * * * jsub -mem 8g -once -quiet -N paper2taxon2 /data/project/topicmatcher/scripts/update_items_from_sparql.php paper2taxon2
tools.topicmatcher:31 * * * * jsub -mem 8g -once -quiet -N topic_query /data/project/topicmatcher/scripts/update_items_from_sparql.php topic_query
tools.topicmatcher:* 22 * * * jsub -mem 4g -once -quiet -N errata /data/project/topicmatcher/scripts/update_items_from_sparql.php errata
tools.topicmatcher:* 17 * * 3 jsub -mem 4g -once -quiet -N named_after /data/project/topicmatcher/scripts/update_items_from_sparql.php named_after
tools.topicmatcher:* 12 * * 3 jsub -mem 8g -once -quiet -N named_after /data/project/topicmatcher/scripts/update_items_from_sparql.php streets
tools.topicmatcher:0,20,40 * * * * jsub -mem 4g -once -quiet -N obituary /data/project/topicmatcher/scripts/update_items_from_sparql.php obituary

Many of those have -mem set at the entire memory of a grid instance (8g). The new grid may be interacting differently with such things. Checking into that.

I'll note that the new grid should specifically kill jobs before they ever hit the OOM killer, but I'm wondering if that arg is somehow getting around that.
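
If the grid is supposed to enforce the limit itself (h_vmem in gridengine terms), the request and the configured enforcement should be visible per job; a sketch, with an illustrative job ID:

# What the job requested (e.g. h_vmem=8G) and its current usage
qstat -j 12345 | grep -E 'hard resource_list|usage'

# Whether h_vmem is defined as a consumable/enforced complex on this grid
qconf -sc | grep -E '^h_vmem|^mem_free'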

I don't think the memory requests are the actual reason this happens, though. I think I'm getting sidetracked. LDAP errors seem better correlated on this host. No NFS errors on the one I'm looking at this time.

This may actually be fixed. No grid queues were offline over the weekend due to NFS. Between fixing the filesystem errors and making various tasks stop blocking on NFS forever, that may have done it. If so, the linked task T218649 may also be good. Keeping this open a bit.

Overall, NFS is doing well now with regard to grid nodes going offline. The problem appears to have been a result of a variety of monitoring tools doing their monitoring on the client side in Stretch.