Puppet fails on integration instances: nfs_mount[home-on-labstoresvc]: umount: /home: not mounted
Closed, Resolved (Public)

Description

Since Jan 19th at 17:00 UTC, puppet has been failing on integration instances:

[17:07:41] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:12:00] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:12:42] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:15:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:16:04] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[17:18:50] <shinken-wm>  PROBLEM - Puppet run on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:19:45] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[17:24:25] <shinken-wm>  PROBLEM - Puppet run on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]

On integration-slave-jessie-1001.integration.eqiad.wmflabs:

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]
  returns: umount: /home: not mounted
Error: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Error: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Exec[cleanup-/home]/returns:
  change from notrun to 0 failed: /usr/local/sbin/nfs-mount-manager umount /home returned 32 instead of one of [0]

Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[home-on-labstoresvc]/Mount[/home]:
  Dependency Exec[cleanup-/home] has failures: true

Note that https://wikitech.wikimedia.org/wiki/Hiera:Integration has:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

The puppetmaster had a stale puppet.git repo and I rebased it shortly before the failures started. I don't know which change in puppet triggered it, but it seems to me that Exec[cleanup-/home] should be skipped when /home is not an NFS mount?
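
The skip suggested above could be guarded by a check along these lines. This is only a sketch: `is_nfs_mount` is a hypothetical helper, not part of nfs-mount-manager, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options). The optional second argument is there only so the function can be tried against a sample file.

```shell
#!/bin/sh
# Hypothetical helper, not part of nfs-mount-manager.
# Succeeds only if the exact mount point $1 is listed with an NFS
# filesystem type (nfs or nfs4) in the mounts table $2.
# $2 defaults to /proc/mounts.
is_nfs_mount() {
    awk -v target="$1" '
        $2 == target && $3 ~ /^nfs4?$/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

if is_nfs_mount /home; then
    echo "/home is an NFS mount; cleanup may proceed."
else
    echo "/home is not an NFS mount; skipping cleanup."
fi
```

A guard like this could in principle back an `onlyif` on the Exec, but how the manifest would be wired is an assumption and is not shown here.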

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2017-01-20T10:14:14Z] <hashar> puppet fails on "integration" labs instances due to an attempt to unmount the non existing NFS /home. Filled T155820

Is this running the latest puppet code?

The puppetmaster was stale, with 2-3 days of lag, and I rebased it yesterday just before the failures started.

I created a new instance, hashar-nfs.integration.eqiad.wmflabs, and it has no trouble.

Looking at puppet:

modules/role/manifests/labs/instance.pp
# Allows per-host overriding of NFS mounts
$mount_nfs = hiera('mount_nfs', true)
# No NFS on labs metal for now.
if $::virtual == 'kvm' and $mount_nfs {
    require role::labs::nfsclient
}

Yes, there's probably going to be a refactor at some point for that. Does puppet run fine on older hosts too now?

On integration-slave-jessie-1001

# /usr/local/sbin/nfs-mount-manager check /home
It's mounted.
It seems healthy.

That is wrong: the grep pattern in the check matches another mount point as a substring:

# mount|grep /home
none on /srv/home/jenkins-deploy/tmpfs type tmpfs (rw,noatime,size=262144k)
/usr/local/sbin/nfs-mount-manager
check)

    if cat /proc/mounts | /usr/bin/awk '{print $2}' \
        | /bin/grep -qs $2; then
        echo "It's mounted."
    else
        echo "It's not mounted."
        exit 1
    fi
# echo /srv/home/jenkins-deploy/tmpfs|grep /home
/srv/home/jenkins-deploy/tmpfs
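
The difference between the substring match and an exact match can be reproduced on sample data. The following is a sketch only, not the patch uploaded as change 333230; `is_mounted_substring` and `is_mounted_exact` are hypothetical names, and the second argument (a mounts table, defaulting to /proc/mounts) exists only to make the functions testable.

```shell
#!/bin/sh
# Sketch only; not the patch from change 333230.
# $1 is the mount point to look for, $2 the mounts table to read
# (defaults to /proc/mounts).

# Substring match, as in the check) branch quoted above:
# /home also matches /srv/home/jenkins-deploy/tmpfs.
is_mounted_substring() {
    awk '{print $2}' "${2:-/proc/mounts}" | grep -qs "$1"
}

# Exact match: -F treats the pattern as a fixed string and
# -x requires it to match the whole line.
is_mounted_exact() {
    awk '{print $2}' "${2:-/proc/mounts}" | grep -qxF -- "$1"
}

sample=$(mktemp)
echo 'none /srv/home/jenkins-deploy/tmpfs tmpfs rw,noatime,size=262144k 0 0' > "$sample"

is_mounted_substring /home "$sample" && echo "substring: It's mounted."
is_mounted_exact /home "$sample" || echo "exact: It's not mounted."
rm -f "$sample"
```

On the sample line, the substring variant reports /home as mounted while the exact variant does not.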

Change 333230 had a related patch set uploaded (by Hashar):
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

That matches the hosts failing puppet:

# salt --output=text '*' cmd.run 'grep -q /srv/home/jenkins-deploy/tmpfs /proc/mounts && echo yes || echo no'|sort
buildlog.integration.eqiad.wmflabs: no
castor.integration.eqiad.wmflabs: no
hashar-nfs.integration.eqiad.wmflabs: no
integration-publisher.integration.eqiad.wmflabs: no
integration-puppetmaster01.integration.eqiad.wmflabs: no
integration-saltmaster.integration.eqiad.wmflabs: no
integration-slave-docker-1000.integration.eqiad.wmflabs: no
integration-slave-jessie-1001.integration.eqiad.wmflabs: yes
integration-slave-jessie-1002.integration.eqiad.wmflabs: yes
integration-slave-jessie-android.integration.eqiad.wmflabs: no
integration-slave-precise-1002.integration.eqiad.wmflabs: yes
integration-slave-precise-1011.integration.eqiad.wmflabs: yes
integration-slave-precise-1012.integration.eqiad.wmflabs: yes
integration-slave-trusty-1001.integration.eqiad.wmflabs: yes
integration-slave-trusty-1003.integration.eqiad.wmflabs: yes
integration-slave-trusty-1004.integration.eqiad.wmflabs: yes
integration-slave-trusty-1006.integration.eqiad.wmflabs: yes
integration-slave-trusty-1011.integration.eqiad.wmflabs: yes
repository.integration.eqiad.wmflabs: no

Change 333230 merged by Rush:
labstore: check should search for exact mount match

https://gerrit.wikimedia.org/r/333230

The fix here is actually a bit of a misnomer, and while it's probably best to anchor that regex the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc. i.e. home is not actually a mount only a symlink to one and should never actually be matched.

So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.
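
To illustrate the symlink point, a path can be inspected before it is treated as a mount point. This is a minimal sketch with a hypothetical helper, `describe_path`, run against a throwaway directory; it is not taken from 2791f51fda9b5d4e7252ae23034ce0822a39a5dc, and it assumes a `readlink` that supports `-f` (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: report whether a path is a symlink (and where it
# resolves to), a real directory, or missing.
describe_path() {
    if [ -L "$1" ]; then
        echo "$1 is a symlink to $(readlink -f "$1")"
    elif [ -d "$1" ]; then
        echo "$1 is a real directory"
    else
        echo "$1 does not exist"
    fi
}

# Throwaway layout standing in for a /home that is only a symlink.
demo=$(mktemp -d)
mkdir "$demo/target"
ln -s "$demo/target" "$demo/home"
describe_path "$demo/home"
describe_path "$demo/target"
rm -rf "$demo"
```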

https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}

Can you un-cherry pick?

The fix here is actually a bit of a misnomer, and while it's probably best to anchor that regex the actual resolution is in 2791f51fda9b5d4e7252ae23034ce0822a39a5dc. i.e. home is not actually a mount only a symlink to one and should never actually be matched.

So I merged https://gerrit.wikimedia.org/r/#/c/333230/ after running through a few test cases but this was 2 levels of weirdness to begin with.

2791f51fda9b5d4e7252ae23034ce0822a39a5dc would not apply to projects that are not using NFS. That is the case for deployment-prep and integration at least. But yeah, it looks like that fixes another oddity :-}

https://gerrit.wikimedia.org/r/#/c/333230/ cherry picked on CI puppet master and that fixed it :-}

Can you un-cherry pick?

Not sure what you mean. The Puppet repository on the standalone puppetmaster is automatically rebased via a cronjob. If we want to test without https://gerrit.wikimedia.org/r/#/c/333230/ we can revert it on the puppetmaster and run puppet to see what happens.

Regardless, looks like the task is solved for good now.

chasemp claimed this task.