Puppet flapping on mounting /mnt/nfs/labstore-secondary-project failures ("Device busy or already mounted")
Closed, Resolved · Public

Description

For the past two months, the instance cvn-apache8 (Ubuntu Trusty) has been flapping between failure and recovery every other hour or so.

I recently re-created the instance with Debian Stretch (cvn-apache9) in the hope that it would solve the issue, but alas, it didn't. Yet, the other Debian Stretch instances (cvn-app*) are not affected by this plague of Puppet errors.

There is no custom Puppet configuration on any of these instances, aside from the project-wide config at https://wikitech.wikimedia.org/wiki/Hiera:Cvn:

hiera/cvn.yaml
---
classes:
    - role::labs::lvm::srv
nfs_mounts:
    project: true
    home: false
    scratch: false
    dumps: false
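
As a sanity check (a minimal sketch; any equivalent command works), the resulting project mount can be verified on the instance itself:

shell
# Confirm that the project NFS share is mounted where Puppet expects it.
findmnt /mnt/nfs/labstore-secondary-project
# Or, equivalently, look it up in the kernel's mount table:
grep /mnt/nfs/labstore-secondary-project /proc/mounts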

Log sample:

syslog
Nov 23 02:55:29 cvn-apache9 puppet-agent[29986]: Retrieving pluginfacts
Nov 23 02:55:29 cvn-apache9 puppet-agent[29986]: Retrieving plugin
Nov 23 02:55:30 cvn-apache9 puppet-agent[29986]: Loading facts
Nov 23 02:55:36 cvn-apache9 puppet-agent[29986]: Caching catalog for cvn-apache9.cvn.eqiad.wmflabs
Nov 23 02:55:37 cvn-apache9 puppet-agent[29986]: Applying configuration version '1511405498'
Nov 23 02:55:37 cvn-apache9 crontab[30184]: (root) LIST (root)
Nov 23 02:55:39 cvn-apache9 nfs-mount-manager[30390]: /mnt/nfs/labstore-secondary-project is mounted.
Nov 23 02:56:09 cvn-apache9 nfs-mount-manager[30397]: /mnt/nfs/labstore-secondary-project does not seem healthy.
Nov 23 02:56:09 cvn-apache9 nfs-mount-manager[30403]: mounting /mnt/nfs/labstore-secondary-project
Nov 23 02:56:13 cvn-apache9 puppet-agent[29986]: (/Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[project-on-labstore-secondary]/Exec[ensure-nfs-project-on-labstore-secondary]/returns)
	mounting /mnt/nfs/labstore-secondary-project
	mount.nfs: mount(2): Device or resource busy
	mount.nfs: /mnt/nfs/labstore-secondary-project is busy or already mounted
	mount.nfs: trying text-based options 'bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,hard,vers=4,addr=10.64.37.18,clientaddr=10.68.22.159'
	> returned 32 instead of one of [0]
[..]
	 has failures: true
	 Skipping because of failed dependencies
[..]
Nov 23 02:56:14 cvn-apache9 puppet-agent[29986]: Finished catalog run in 37.38 seconds
[..]
Nov 23 03:25:08 cvn-apache9 puppet-agent[31067]: Retrieving pluginfacts
Nov 23 03:25:08 cvn-apache9 puppet-agent[31067]: Retrieving plugin
Nov 23 03:25:08 cvn-apache9 puppet-agent[31067]: Loading facts
Nov 23 03:25:14 cvn-apache9 puppet-agent[31067]: Caching catalog for cvn-apache9.cvn.eqiad.wmflabs
Nov 23 03:25:15 cvn-apache9 puppet-agent[31067]: Applying configuration version '1511406691'
Nov 23 03:25:16 cvn-apache9 crontab[31265]: (root) LIST (root)
Nov 23 03:25:17 cvn-apache9 nfs-mount-manager[31471]: /mnt/nfs/labstore-secondary-project is mounted.
Nov 23 03:25:18 cvn-apache9 nfs-mount-manager[31475]: /mnt/nfs/labstore-secondary-project seems healthy.
Nov 23 03:25:19 cvn-apache9 puppet-agent[31067]: Finished catalog run in 3.64 seconds
[..]
Nov 23 03:55:33 cvn-apache9 puppet-agent[32222]: Retrieving pluginfacts
Nov 23 03:55:33 cvn-apache9 puppet-agent[32222]: Retrieving plugin
Nov 23 03:55:33 cvn-apache9 puppet-agent[32222]: Loading facts
Nov 23 03:55:40 cvn-apache9 puppet-agent[32222]: Caching catalog for cvn-apache9.cvn.eqiad.wmflabs
Nov 23 03:55:41 cvn-apache9 puppet-agent[32222]: Applying configuration version '1511409250'
Nov 23 03:55:41 cvn-apache9 crontab[32420]: (root) LIST (root)
Nov 23 03:55:43 cvn-apache9 nfs-mount-manager[32626]: /mnt/nfs/labstore-secondary-project is mounted.
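
For what it's worth, the "returned 32 instead of one of [0]" line matches mount's documented behaviour: per mount(8), exit status 32 is a mount failure. A rough way to reproduce it by hand (a sketch only; it assumes the /etc/fstab entry that Puppet manages is present):

shell
# Re-run the mount by hand against the already-mounted path. mount.nfs refuses
# with "busy or already mounted" and exits 32, which Puppet's Exec then
# surfaces as "returned 32 instead of one of [0]".
mount /mnt/nfs/labstore-secondary-project
echo "exit status: $?"    # 32 in this already-mounted case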

Event Timeline

You seem to have NFS enabled with project: true, unless I'm wrong.

This is my config:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

You seem to have NFS enabled with project: true, unless I'm wrong.

Yes, that is how NFS is enabled. That is intentional, because we use it to persist long-term data and backups. To clarify: the NFS drive is working fine.

Every time Puppet runs, the code tries to mount the same NFS drive again. When that fails, nothing is actually broken, because the drive is already mounted. The only problem is that the Puppet run frequently fails due to this bug, which then causes confusing notifications in our IRC channel and in my e-mail inbox.
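
To illustrate what I would expect instead (a minimal sketch, not the actual nfs-mount-manager logic): the mount step could be guarded so it only fires when the path is not already a mountpoint.

shell
# Sketch of an idempotent mount step: skip the mount when the path is
# already a mountpoint, so an already-mounted share never produces an error.
MOUNTPOINT=/mnt/nfs/labstore-secondary-project
if mountpoint -q "$MOUNTPOINT"; then
    echo "$MOUNTPOINT is already mounted; nothing to do."
else
    mount "$MOUNTPOINT"    # assumes a matching /etc/fstab entry
fi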

We have a similar or possibly the same issue in Toolforge. I don't have an answer right now; historically we thought it was due to load on the exec/bastion hosts causing local load and outright NFS unavailability, but I don't believe that explains things anymore, since I have seen this happen on Toolforge instances with roles that don't see direct impact from user load.

I also don't think mounting-an-already-mounted-share is the entire story. It's true that in some of our failures that error surfaces, but AFAICT it is a legitimate response to a potentially false-negative trigger, i.e. the mount merely appears unhealthy; I don't know whether the fault is in the checking code or in the actual state of the mount. That is one aspect. We also see cases where Puppet has its own verification that NFS is working (it looks for specific content that exists only on the NFS mount, not on local storage), and those checks sometimes flake out inconsistently, causing similar flapping.

We are trying to find time to diagnose a few semi-related and overlapping issues here (T173526, etc.), but for now the cause is a mystery.
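
To make the false-negative theory concrete (a hypothetical sketch, not the real health check): if the health probe is something like a stat of the mountpoint under a short timeout, a momentary stall on the NFS server or local load can make a perfectly healthy mount look unhealthy, after which the redundant mount attempt fails with "busy or already mounted".

shell
# Hypothetical health probe illustrating the false-negative failure mode:
# a short timeout on a stat of the mountpoint. Under transient load the
# timeout fires, the mount is declared unhealthy, and a redundant mount
# attempt follows, failing with "busy or already mounted".
MOUNTPOINT=/mnt/nfs/labstore-secondary-project
if timeout 10s stat "$MOUNTPOINT" >/dev/null 2>&1; then
    echo "$MOUNTPOINT seems healthy."
else
    echo "$MOUNTPOINT does not seem healthy."   # may be a false negative
    mount "$MOUNTPOINT"                          # then fails if actually mounted
fi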

Andrew claimed this task.
Andrew subscribed.

Closing until someone reports this happening again.