Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted
Closed, Resolved · Public

Description

Since Jan 19th at 17:00 UTC, Puppet has been failing on integration instances:

[17:07:41] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:12:00] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:12:42] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:15:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:18:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:19:45] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:24:25] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]

On integration-slave-jessie-1001.integration.eqiad.wmflabs.:

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]
  returns: umount: /home: not mounted
Error: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]/returns:
  change from notrun to 0 failed: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Mount[/home]:
  Dependency Exec[cleanup-/home] has failures: true

Note that https://wikitech.wikimedia.org/wiki/Hiera:Integration has:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

The puppetmaster had a stale puppet.git repo and I rebased it shortly before this started happening. I don't know which change in puppet triggered it, but it seems to me that Exec[cleanup-/home] should skip when /home is not an NFS mount.
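A minimal sketch of such a guard (this is an illustration, not the actual Puppet change — the `is_nfs_mount` helper and the use of a mounts-table argument are assumptions for the example): only attempt the umount when the path really appears as an NFS mount in /proc/mounts.

```shell
#!/bin/sh
# Hypothetical guard, not the real puppet.git code: skip the cleanup
# umount unless the path is genuinely listed as an NFS mount.
is_nfs_mount() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found=1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

if is_nfs_mount /home; then
    /usr/local/sbin/nfs-mount-manager umount /home
else
    echo "/home is not an NFS mount, skipping cleanup"
fi
```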

hashar created this task. Jan 20 2017, 10:13 AM
Restricted Application added a subscriber: Aklapper. Jan 20 2017, 10:13 AM

Mentioned in SAL (#wikimedia-releng) [2017-01-20T10:14:14Z] <hashar> puppet fails on "integration" labs instances due to an attempt to unmount the non existing NFS /home. Filled T155820

Is this running the latest puppet code?

The puppetmaster was stale, with 2-3 days of lag, and I rebased yesterday just before this happened.

I did create a new instance hashar-nfs.integration.eqiad.wmflabs and it has no trouble.

Looking at puppet:

modules/role/manifests/labs/instance.pp
# Allows per-host overriding of NFS mounts
$mount_nfs = hiera('mount_nfs', true)
# No NFS on labs metal for now.
if $::virtual == 'kvm' and $mount_nfs {
    require role::labs::nfsclient
}

Yes, there's probably going to be a refactor at some point for that. Does puppet run fine on older hosts too now?

On integration-slave-jessie-1001

# /usr/local/sbin/nfs-mount-manager check /home
It's mounted.
It seems healthy.

Which is wrong: the grep pattern matches another mount point:

# mount|grep /home
none on /srv/home/jenkins-deploy/tmpfs type tmpfs (rw,noatime,size=262144k)
/usr/local/sbin/nfs-mount-manager:
check)

    if cat /proc/mounts | /usr/bin/awk '{print $2}' \
        | /bin/grep -qs $2; then
        echo "It's mounted."
    else
        echo "It's not mounted."
        exit 1
    fi
# echo /srv/home/jenkins-deploy/tmpfs|grep /home
/srv/home/jenkins-deploy/tmpfs
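A sketch of the fix, in the spirit of Gerrit change 333230 (not the verbatim patch — the `is_mounted` helper and its second argument are assumptions for the example): compare the mount-point field exactly instead of substring-grepping it.

```shell
#!/bin/sh
# Compare the mount-point field of /proc/mounts exactly, so that
# /srv/home/jenkins-deploy/tmpfs no longer matches a check for /home.
is_mounted() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp { found=1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Reproduce the bug against a fake mounts table:
printf 'none /srv/home/jenkins-deploy/tmpfs tmpfs rw,noatime 0 0\n' > /tmp/fake_mounts

# The old substring check wrongly reports /home as mounted...
awk '{print $2}' /tmp/fake_mounts | grep -qs /home && echo "substring check: mounted"
# ...while the exact comparison does not.
is_mounted /home /tmp/fake_mounts || echo "exact check: not mounted"
```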

Change 333230 had a related patch set uploaded (by Hashar):
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

That matches the hosts failing puppet:

# salt --output=text '*' cmd.run 'grep -q /srv/home/jenkins-deploy/tmpfs /proc/mounts && echo yes || echo no'|sort
buildlog.integration.eqiad.wmflabs: no
castor.integration.eqiad.wmflabs: no
hashar-nfs.integration.eqiad.wmflabs: no
integration-publisher.integration.eqiad.wmflabs: no
integration-puppetmaster01.integration.eqiad.wmflabs: no
integration-saltmaster.integration.eqiad.wmflabs: no
integration-slave-docker-1000.integration.eqiad.wmflabs: no
integration-slave-jessie-1001.integration.eqiad.wmflabs: yes
integration-slave-jessie-1002.integration.eqiad.wmflabs: yes
integration-slave-jessie-android.integration.eqiad.wmflabs: no
integration-slave-precise-1002.integration.eqiad.wmflabs: yes
integration-slave-precise-1011.integration.eqiad.wmflabs: yes
integration-slave-precise-1012.integration.eqiad.wmflabs: yes
integration-slave-trusty-1001.integration.eqiad.wmflabs: yes
integration-slave-trusty-1003.integration.eqiad.wmflabs: yes
integration-slave-trusty-1004.integration.eqiad.wmflabs: yes
integration-slave-trusty-1006.integration.eqiad.wmflabs: yes
integration-slave-trusty-1011.integration.eqiad.wmflabs: yes
repository.integration.eqiad.wmflabs: no

https://gerrit.wikimedia.org/r/#/c/333230/ cherry-picked on the CI puppetmaster, and that fixed it :-}

hashar moved this task from Triage to In Progress on the Labs board.
hashar triaged this task as High priority.

Change 333230 merged by Rush:
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

The fix here is actually a bit of a misnomer: while it's probably best to anchor that regex, the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc, i.e. /home is not actually a mount, only a symlink to one, and should never have been matched.

So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.
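The symlink situation described above can be demonstrated in a couple of lines (paths here are illustrative, not the real instance layout): a symlink into another filesystem is never itself listed in /proc/mounts, so a mount check on it can bail out early.

```shell
#!/bin/sh
# Illustrative demo: /home as a symlink into another filesystem is
# a symlink, not a mount point, so there is nothing to umount.
demo=$(mktemp -d)
mkdir -p "$demo/srv/home"
ln -s "$demo/srv/home" "$demo/home"

if [ -L "$demo/home" ]; then
    echo "$demo/home is a symlink, not a mount point -- nothing to umount"
fi
```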

> https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}

Can you un-cherry pick?

> The fix here is actually a bit of a misnomer, and while it's probably best to anchor that regex the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc. i.e. home is not actually a mount only a symlink to one and should never actually be matched.
>
> So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.

2791f51fda9b5d4e7252ae23034ce0822a39a5dc would not apply on projects that are not using NFS. That is the case for deployment-prep and integration at least. But yeah, looks like that fixes another oddity :-}

> https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}
>
> Can you un-cherry pick?

Not sure what you mean. The Puppet repository on the standalone puppetmaster is automatically rebased via a cronjob. If we want to test without https://gerrit.wikimedia.org/r/#/c/333230/ we can revert it on the puppetmaster and run puppet to see what happens.

Regardless, looks like the task is solved for good now.

chasemp closed this task as Resolved. Mar 16 2017, 7:04 PM
chasemp claimed this task.