
Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218
Closed, ResolvedPublic

Description

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
valhallasw renamed this task from Stale NFS handle breaks puppet on tools-exec-1215 to Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218. May 28 2016, 8:59 PM

Disabled queues on the three hosts, rescheduled continuous jobs, waiting for the rest to drain (a rough sketch of the commands follows the listing below):

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1205.eqiad.wmflabs lx26-amd64      4  2.34    7.8G  628.8M   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   6855617 0.30003 ASB        tools.james  r     05/28/2016 21:00:22 task@tools MASTER
tools-exec-1218.eqiad.wmflabs lx26-amd64      4  1.32    7.8G    1.8G   23.9G     0.0
   6111113 0.33744 unr        tools.antign r     05/08/2016 13:06:27 task@tools MASTER
   6612000 0.31183 oldidGA    tools.legobo r     05/22/2016 11:04:09 task@tools MASTER
   6803930 0.30267 run_cluste tools.wikida r     05/27/2016 10:35:04 task@tools MASTER
   6842270 0.30070 gallery    tools.ameen  r     05/28/2016 12:10:08 task@tools MASTER
   6855820 0.30002 start      aka          r     05/28/2016 21:07:34 task@tools MASTER
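
For the record, the disable / reschedule dance uses the standard gridengine tooling; roughly something like this (the host name is one of the affected hosts, the job ID is a placeholder, and the exact invocations may have differed):

qmod -d '*@tools-exec-1205.eqiad.wmflabs'      # disable every queue instance on the host so nothing new lands there
qmod -rj <job-id>                              # ask gridengine to reschedule a rerunnable (continuous) job elsewhere
qhost -j -h tools-exec-1205.eqiad.wmflabs      # watch the remaining task jobs drain (the listing above looks like this output)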

Mentioned in SAL [2016-05-28T21:21:15Z] <valhallasw`cloud> rebooting tools-exec-1204 (T136495)

tools-exec-1204 is back up and running; the other two hosts still need to drain before they can be rebooted.

tools-exec-1205 has also been rebooted and reenabled.

This is now also happening on tools-exec-1203:

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies
Notice: Finished catalog run in 72.19 seconds

I have disabled the queues and resubmitted jobs in the continuous queue. There are four jobs still running:

tools-exec-1203.eqiad.wmflabs lx26-amd64      4  1.54    7.8G    2.5G   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   4752695 0.41449 gpy        tools.gpy    r     03/28/2016 23:14:45 task@tools MASTER
   6915763 0.30008 enwp10-upd tools.enwp10 r     05/30/2016 12:10:07 task@tools MASTER
   6917304 0.30001 gallery    tools.ameen  r     05/30/2016 13:10:09 task@tools MASTER
   6917463 0.30000 inv-es-red tools.invadi r     05/30/2016 13:16:06 task@tools MASTER

Mentioned in SAL [2016-05-31T13:13:34Z] <chasemp> reboot of tools-exec-1203 see T136495 all jobs seem gone now


Thanks @valhallasw ... you rock.

So this is a weird thing and I'm not sure yet what to make of it. We did have a period previously when NFS /public/dumps was offline; AFAIK things were cleared up after that.

Along with this period of /public/dumps being inaccessible (which caused hangs and stale mounts), we also changed /public/dumps to be soft-mounted here: https://gerrit.wikimedia.org/r/#/c/289903/

That did function as described in the commit message: puppet choked on a reload, but things should be OK until the next natural reboot, after which the mount comes up in soft mode.

But maybe there is another bug on Precise.

Confirmed puppet has issues with /public/dumps:

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies
Notice: Finished catalog run in 67.57 seconds
rush@tools-exec-1203:~$

nfs mount is stale:

rush@tools-exec-1203:~$ timeout 10s ls /public/dumps
ls: cannot access /public/dumps: Stale NFS file handle

A hallmark of Precise NFS bugginess:

rush@tools-exec-1203:~$ sudo umount -f /public/dumps
/public/dumps was not found in /proc/mounts
/public/dumps was not found in /proc/mounts
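
umount -f says the entry isn't in /proc/mounts even though the mount still shows up below; on Precise the usual fallback for a stale handle is a lazy detach, a sketch rather than something run in this session:

sudo umount -f -l /public/dumps    # force + lazy: detach the stale mount point even if it can't be cleanly unmounted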

/public/dumps seems to be hard mounted... OK, it must not have been rebooted yet:

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc)
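
For reference, a quick way to compare the options the kernel is actually using against what /etc/fstab asks for (a sketch, not the exact commands from this session):

grep dumps /proc/mounts    # live options; hard vs soft shows up here
grep dumps /etc/fstab      # options the next mount (or boot) would use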

Now it gets really weird:

...reboot...

STILL in hard mode:

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,addr=10.64.4.10,clientaddr=10.68.16.133)

OK, let's confirm that /etc/fstab says we should be soft mounted:

labstore1003.eqiad.wmnet:/dumps	/public/dumps	nfs	vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3	0	0

I have never seen this before.

So trying for a manual soft mount:

sudo umount -f /public/dumps
sudo mount -t nfs -o vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 labstore1003.eqiad.wmnet:/dumps /public/dumps

now it's there:

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.16.133)

on subsequent reboot:

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.16.133)

So I have no explanation for this yet, but I would like to see if this issue crops up on tools-exec-1203 again. I did not look at /etc/fstab prior to the reboot (and I should have), so maybe it said hard mount and I have a race condition on my hands here with the initial puppet run from the first reboot. One thing is for sure: Precise + NFS = sadness.

Mentioned in SAL [2016-05-28T21:21:15Z] <valhallasw`cloud> rebooting tools-exec-1204 (T136495)

Here is the crazy part; this is from after that reboot:

mount -t nfs

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,addr=10.64.4.10,clientaddr=10.68.17.88)

cat /etc/fstab

labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 0 0

...reboot....

mount -t nfs

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.17.88)

cat /etc/fstab

labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 0 0

OK, I went into tools-exec-1218 looking for the race condition and I think I confirmed it.

mount -t nfs
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc)

So it is hard mounted, and puppet is hung.

fstab still says hard mode:

labstore1003.eqiad.wmnet:/dumps /public/dumps nfs ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc 0 0

i.e. the stale handle for /public/dumps is preventing puppet from running to update /etc/fstab, which in turn prevents the good state after a reboot, because the mounts happen before puppet runs.

If I reboot, immediately force a puppet run, and then reboot again, things now come back as advertised in /etc/fstab:

labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.18.19)
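
In other words, the recovery on an affected host boils down to something like this (a sketch; the exact invocations may have differed):

sudo reboot                  # clears the stale handle, but the mount still comes up hard from the old fstab
sudo puppet agent --test     # puppet can now run and rewrites /etc/fstab with the soft-mount options
sudo reboot                  # the second boot mounts /public/dumps soft, as intended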

chasemp claimed this task.
chasemp triaged this task as Medium priority.

So these hosts in particular are OK now. I'm suspicious of something with a delayed effect happening here. Let's see if this happens again, but for now this is resolved.