Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies
Disabled queues on the three hosts, rescheduled continuous jobs, waiting for the rest to drain:
HOSTNAME                       ARCH        NCPU  LOAD   MEMTOT   MEMUSE   SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                         -              -     -        -        -        -       -
tools-exec-1205.eqiad.wmflabs  lx26-amd64     4  2.34     7.8G   628.8M    23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   6855617 0.30003 ASB        tools.james  r     05/28/2016 21:00:22 task@tools MASTER
tools-exec-1218.eqiad.wmflabs  lx26-amd64     4  1.32     7.8G     1.8G    23.9G     0.0
   6111113 0.33744 unr        tools.antign r     05/08/2016 13:06:27 task@tools MASTER
   6612000 0.31183 oldidGA    tools.legobo r     05/22/2016 11:04:09 task@tools MASTER
   6803930 0.30267 run_cluste tools.wikida r     05/27/2016 10:35:04 task@tools MASTER
   6842270 0.30070 gallery    tools.ameen  r     05/28/2016 12:10:08 task@tools MASTER
   6855820 0.30002 start      aka          r     05/28/2016 21:07:34 task@tools MASTER
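(For the record, the drain roughly looked like the sketch below; the exact queue-instance names are an assumption based on the task@tools queue shown above.)

qmod -d 'task@tools-exec-1204.eqiad.wmflabs' 'task@tools-exec-1205.eqiad.wmflabs' 'task@tools-exec-1218.eqiad.wmflabs'   # stop new jobs landing on the hosts
qmod -rj <job-ID>                                                          # reschedule each continuous job onto another node
qhost -j -h tools-exec-1205.eqiad.wmflabs,tools-exec-1218.eqiad.wmflabs    # watch the remaining task jobs drain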
Mentioned in SAL [2016-05-28T21:21:15Z] <valhallasw`cloud> rebooting tools-exec-1204 (T136495)
tools-exec-1204 is back up and running, the other two hosts still need to drain before they can be rebooted.
This is now also happening on tools-exec-1203:
Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies
Notice: Finished catalog run in 72.19 seconds
I have disabled the queues and resubmitted jobs in the continuous queue. There are four jobs still running:
tools-exec-1203.eqiad.wmflabs  lx26-amd64     4  1.54     7.8G     2.5G    23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   4752695 0.41449 gpy        tools.gpy    r     03/28/2016 23:14:45 task@tools MASTER
   6915763 0.30008 enwp10-upd tools.enwp10 r     05/30/2016 12:10:07 task@tools MASTER
   6917304 0.30001 gallery    tools.ameen  r     05/30/2016 13:10:09 task@tools MASTER
   6917463 0.30000 inv-es-red tools.invadi r     05/30/2016 13:16:06 task@tools MASTER
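(Spelling out the "wait to drain" step: a rough loop like the one below, with an arbitrary poll interval, is enough to know when the host is safe to reboot.)

while qhost -j -h tools-exec-1203.eqiad.wmflabs | grep -q 'task@tools'; do sleep 300; done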
Mentioned in SAL [2016-05-31T13:13:34Z] <chasemp> reboot of tools-exec-1203 see T136495 all jobs seem gone now
Thanks @valhallasw ... you rock.
So this is a weird thing and I'm not sure yet what to make of it. We did have a period of NFS /public/dumps being offline previously. AFAIK things were cleared up from that.
Along with this period of /public/dumps being inaccessible (which froze things / caused stale mounts), we also changed /public/dumps to be soft mounted here: https://gerrit.wikimedia.org/r/#/c/289903/
That did function at the time as described in the commit message: Puppet choked on the reload, but things should be OK until the next natural reboot, after which the mount comes back in soft mode.
But maybe there is another bug on Precise.
Confirmed puppet has issues with /public/dumps:
Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/File[/public/dumps]: Could not evaluate: Stale NFS file handle - /public/dumps
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Dependency File[/public/dumps] has failures: true
Warning: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[dumps]/Mount[/public/dumps]: Skipping because of failed dependencies
Notice: Finished catalog run in 67.57 seconds
rush@tools-exec-1203:~$
nfs mount is stale:
rush@tools-exec-1203:~$ timeout 10s ls /public/dumps
ls: cannot access /public/dumps: Stale NFS file handle
A hallmark of Precise NFS bugginess.
rush@tools-exec-1203:~$ sudo umount -f /public/dumps
/public/dumps was not found in /proc/mounts
/public/dumps was not found in /proc/mounts
/public/dumps seems to be hard mounted... OK, it must not have been rebooted yet.
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc)
Now it gets really weird:
...reboot...
STILL hard mode
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,addr=10.64.4.10,clientaddr=10.68.16.133)
OK, let's confirm that /etc/fstab says we should be soft mounted:
labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 0 0
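(The confusing part is that the kernel's idea of the mount and /etc/fstab now disagree; two quick greps make that obvious.)

grep dumps /proc/mounts   # what is actually mounted right now (still hard)
grep dumps /etc/fstab     # what the next mount/boot should use (soft)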
I have never seen this before.
So trying for a manual soft mount:
sudo umount -f /public/dumps
sudo mount -t nfs -o vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 labstore1003.eqiad.wmnet:/dumps /public/dumps
now it's there:
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.16.133)
on subsequent reboot:
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.16.133)
So I have no explanation for this yet, but I would like to see if this issue crops up on tools-exec-1203 again. I did not look at /etc/fstab prior to the reboot (and I should have), so maybe it said hard mount and I have a race condition on my hands with the initial puppet run from the first reboot. One thing is for sure: Precise + NFS = sadness.
Here is the crazy part; this is after that reboot:
mount -t nfs
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,addr=10.64.4.10,clientaddr=10.68.17.88)
cat /etc/fstab
labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 0 0
...reboot....
mount -t nfs
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.17.88)
cat /etc/fstab
labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc,ro,soft,timeo=300,retrans=3 0 0
OK, I went into tools-exec-1218 looking for the race condition, and I think I confirmed it.
mount -t nfs
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc)
So: hard mount, and puppet is hung.
fstab still says hard mode
labstore1003.eqiad.wmnet:/dumps /public/dumps nfs ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc 0 0
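(A rough way to confirm the agent really is stuck; the lock file path below is the Puppet 3 default and is an assumption for this image.)

ps auxww | grep [p]uppet                               # agent process still sitting there
ls -l /var/lib/puppet/state/agent_catalog_run.lock     # old lock file => run never finished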
i.e. the stale handle for /public/dumps is preventing puppet from running to update /etc/fstab, which in turn prevents the good state on the first reboot, because the mounts happen before puppet runs.
If I reboot, immediately force a puppet run, and then reboot again, things come back as advertised in /etc/fstab:
labstore1003.eqiad.wmnet:/dumps on /public/dumps type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=none,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.18.19)
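(So the manual workaround boils down to this sequence, run as root on the affected exec node; a sketch of what I did, not an automated fix.)

reboot              # clears the stale handle, but the mount still comes up hard from the old fstab
puppet agent -tv    # now completes and rewrites /etc/fstab with soft,timeo=300,retrans=3
reboot              # second boot mounts /public/dumps with the soft options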
So these hosts in particular are OK now. I'm suspicious of something with a delayed effect happening here. Let's see if this happens again, but resolving for now.