Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted
Closed, Resolved · Public

Description

Since Jan 19th at 17:00 UTC, Puppet has been failing on integration instances:

[17:07:41] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:12:00] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:12:42] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:15:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:18:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:19:45] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:24:25] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]

On integration-slave-jessie-1001.integration.eqiad.wmflabs.:

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]
  returns: umount: /home: not mounted
Error: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]/returns:
  change from notrun to 0 failed: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Mount[/home]:
  Dependency Exec[cleanup-/home] has failures: true

Note that https://wikitech.wikimedia.org/wiki/Hiera:Integration has:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

The puppetmaster had a stale puppet.git repo and I rebased it shortly before this started happening. I don't know which change in puppet triggered it, but it seems to me that Exec[cleanup-/home] should skip when /home is not an NFS mount.
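A minimal sketch of such a guard (this is an illustration, not the actual Puppet change — the `is_nfs_mount` helper and the use of a mounts-table argument are assumptions for the example): only attempt the umount when the path really appears as an NFS mount in /proc/mounts.

```shell
#!/bin/sh
# Hypothetical guard, not the real puppet.git code: skip the cleanup
# umount unless the path is genuinely listed as an NFS mount.
is_nfs_mount() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found=1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

if is_nfs_mount /home; then
    /usr/local/sbin/nfs-mount-manager umount /home
else
    echo "/home is not an NFS mount, skipping cleanup"
fi
```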

hashar created this task. Jan 20 2017, 10:13 AM
Restricted Application added a subscriber: Aklapper. Jan 20 2017, 10:13 AM

Mentioned in SAL (#wikimedia-releng) [2017-01-20T10:14:14Z] <hashar> puppet fails on "integration" labs instances due to an attempt to unmount the non existing NFS /home. Filled T155820

Is this running the latest puppet code?

The puppetmaster was stale, with 2-3 days of lag, and I rebased yesterday just before this happened.

I did create a new instance hashar-nfs.integration.eqiad.wmflabs and it has no trouble.

Looking at puppet:

modules/role/manifests/labs/instance.pp
# Allows per-host overriding of NFS mounts
$mount_nfs = hiera('mount_nfs', true)
# No NFS on labs metal for now.
if $::virtual == 'kvm' and $mount_nfs {
    require role::labs::nfsclient
}

Yes, there's probably going to be a refactor at some point for that. Does puppet run fine on older hosts too now?

On integration-slave-jessie-1001

# /usr/local/sbin/nfs-mount-manager check /home
It's mounted.
It seems healthy.

Which is wrong: the grep pattern matches another mount point:

# mount|grep /home
none on /srv/home/jenkins-deploy/tmpfs type tmpfs (rw,noatime,size=262144k)
/usr/local/sbin/nfs-mount-manager:
check)

    if cat /proc/mounts | /usr/bin/awk '{print $2}' \
        | /bin/grep -qs $2; then
        echo "It's mounted."
    else
        echo "It's not mounted."
        exit 1
    fi
# echo /srv/home/jenkins-deploy/tmpfs|grep /home
/srv/home/jenkins-deploy/tmpfs
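A sketch of the fix, in the spirit of Gerrit change 333230 (not the verbatim patch — the `is_mounted` helper and its second argument are assumptions for the example): compare the mount-point field exactly instead of substring-grepping it.

```shell
#!/bin/sh
# Compare the mount-point field of /proc/mounts exactly, so that
# /srv/home/jenkins-deploy/tmpfs no longer matches a check for /home.
is_mounted() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp { found=1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Reproduce the bug against a fake mounts table:
printf 'none /srv/home/jenkins-deploy/tmpfs tmpfs rw,noatime 0 0\n' > /tmp/fake_mounts

# The old substring check wrongly reports /home as mounted...
awk '{print $2}' /tmp/fake_mounts | grep -qs /home && echo "substring check: mounted"
# ...while the exact comparison does not.
is_mounted /home /tmp/fake_mounts || echo "exact check: not mounted"
```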

Change 333230 had a related patch set uploaded (by Hashar):
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

That matches the hosts failing puppet:

# salt --output=text '*' cmd.run 'grep -q /srv/home/jenkins-deploy/tmpfs /proc/mounts && echo yes || echo no'|sort
buildlog.integration.eqiad.wmflabs: no
castor.integration.eqiad.wmflabs: no
hashar-nfs.integration.eqiad.wmflabs: no
integration-publisher.integration.eqiad.wmflabs: no
integration-puppetmaster01.integration.eqiad.wmflabs: no
integration-saltmaster.integration.eqiad.wmflabs: no
integration-slave-docker-1000.integration.eqiad.wmflabs: no
integration-slave-jessie-1001.integration.eqiad.wmflabs: yes
integration-slave-jessie-1002.integration.eqiad.wmflabs: yes
integration-slave-jessie-android.integration.eqiad.wmflabs: no
integration-slave-precise-1002.integration.eqiad.wmflabs: yes
integration-slave-precise-1011.integration.eqiad.wmflabs: yes
integration-slave-precise-1012.integration.eqiad.wmflabs: yes
integration-slave-trusty-1001.integration.eqiad.wmflabs: yes
integration-slave-trusty-1003.integration.eqiad.wmflabs: yes
integration-slave-trusty-1004.integration.eqiad.wmflabs: yes
integration-slave-trusty-1006.integration.eqiad.wmflabs: yes
integration-slave-trusty-1011.integration.eqiad.wmflabs: yes
repository.integration.eqiad.wmflabs: no

https://gerrit.wikimedia.org/r/#/c/333230/ cherry-picked on the CI puppetmaster, and that fixed it :-}

hashar moved this task from Triage to In Progress on the Labs board.
hashar triaged this task as High priority.

Change 333230 merged by Rush:
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

The fix here is actually a bit of a misnomer: while it's probably best to anchor that regex, the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc, i.e. /home is not actually a mount, only a symlink to one, and should never have been matched.

So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.
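The symlink situation described above can be demonstrated in a couple of lines (paths here are illustrative, not the real instance layout): a symlink into another filesystem is never itself listed in /proc/mounts, so a mount check on it can bail out early.

```shell
#!/bin/sh
# Illustrative demo: /home as a symlink into another filesystem is
# a symlink, not a mount point, so there is nothing to umount.
demo=$(mktemp -d)
mkdir -p "$demo/srv/home"
ln -s "$demo/srv/home" "$demo/home"

if [ -L "$demo/home" ]; then
    echo "$demo/home is a symlink, not a mount point -- nothing to umount"
fi
```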

> https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}

Can you un-cherry pick?

> The fix here is actually a bit of a misnomer, and while it's probably best to anchor that regex the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc. i.e. home is not actually a mount only a symlink to one and should never actually be matched.
>
> So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.

2791f51fda9b5d4e7252ae23034ce0822a39a5dc would not apply on projects that are not using NFS. That is the case for deployment-prep and integration at least. But yeah, looks like that fixes another oddity :-}

> https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}
>
> Can you un-cherry pick?

Not sure what you mean. The Puppet repository on the standalone puppetmaster is automatically rebased via a cronjob. If we want to test without https://gerrit.wikimedia.org/r/#/c/333230/ we can revert it on the puppetmaster and run puppet to see what happens.

Regardless, looks like the task is solved for good now.

chasemp closed this task as Resolved. Mar 16 2017, 7:04 PM
chasemp claimed this task.