New labs instance fails running `block-for-project-export` before running mount
Closed, Resolved (Public)

Description

I set up a new labs instance in deployment-prep (deployment-tin.deployment-prep.eqiad.wmflabs). On the initial puppet run, before applying any roles via wikitech, I get a failure on:

Error: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/deployment-prep 180 returned 2 instead of one of [0]
Error: /Stage[main]/Role::Labs::Instance/Exec[block-for-project-export]/returns: change from notrun to 0 failed: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/deployment-prep 180 returned 2 instead of one of [0]

This seems to happen before puppet tries to create the mount point for /data/project.
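As a rough illustration of what a blocking check like this does, the sketch below polls until an export becomes visible or a time budget runs out. It is not the actual block-for-export script: the function name, the polling interval, the injected `is_exported` callback, the reading of the third argument (180) as a timeout in seconds, and the reading of exit code 2 as "gave up waiting" are all assumptions.

```python
import time
from typing import Callable


def block_for_export(
    is_exported: Callable[[], bool],
    timeout: float,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Return 0 once is_exported() is true, or 2 if `timeout` seconds pass first.

    `is_exported` is a hypothetical callback; a real check might ask the NFS
    server for its export list. `sleep` and `clock` are injectable so the
    loop can be exercised without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_exported():
            return 0  # export is visible, the mount can proceed
        if clock() >= deadline:
            return 2  # assumed meaning of the exit code seen in the log
        sleep(interval)


if __name__ == "__main__":
    # An export that never shows up exhausts the (shortened) time budget.
    print(block_for_export(lambda: False, timeout=1, interval=0.2))
```

Under that assumption, a failing check holds up the puppet run for the full timeout before the dependent mount is skipped.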

Trying to edit /etc/fstab directly:

# /etc/fstab: static file system information.
# <file system>                                 <mount point>   <type>  <options>       <dump>  <pass>
proc                                            /proc           proc    defaults        0       0
/dev/vda1                                       /               ext4    defaults        0       0
/dev/vda2                                       swap            swap    defaults        0       0
labstore.svc.eqiad.wmnet:/project/deployment-prep/project       /data/project   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc  0       0

Then running mount -a gets me:

thcipriani@deployment-tin:~$ sudo mount -a
mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/deployment-prep/project failed, reason given by server: No such file or directory

This seems incorrect, since the same export is mounted on many of the other deployment-prep instances.
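To make it easier to see exactly which server and remote path the client asks for, here is a small sketch that splits the NFS fstab entry above into its six fields. The `FstabEntry` class and both helper names are made up for illustration, and the parser assumes a simple whitespace-separated line without escaped spaces.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FstabEntry:
    source: str
    mount_point: str
    fs_type: str
    options: Dict[str, Optional[str]]
    dump: int
    fsck_pass: int


def parse_fstab_line(line: str) -> FstabEntry:
    """Split one fstab line into its six whitespace-separated fields."""
    source, mount_point, fs_type, opts, dump, fsck_pass = line.split()
    options: Dict[str, Optional[str]] = {}
    for opt in opts.split(","):
        key, _, value = opt.partition("=")
        options[key] = value if value else None  # flags such as "bg" carry no value
    return FstabEntry(source, mount_point, fs_type, options, int(dump), int(fsck_pass))


def split_nfs_source(source: str) -> Tuple[str, str]:
    """Split 'server:/remote/path' into (server, remote_path)."""
    server, _, remote_path = source.partition(":")
    return server, remote_path


LINE = (
    "labstore.svc.eqiad.wmnet:/project/deployment-prep/project /data/project nfs "
    "rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=none,nofsc 0 0"
)

if __name__ == "__main__":
    entry = parse_fstab_line(LINE)
    print(split_nfs_source(entry.source))
    print(entry.options)
```

The remote path is the part the server reports as missing in the mount.nfs error.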

Event Timeline

thcipriani raised the priority of this task from to Needs Triage.
thcipriani updated the task description.
thcipriani added subscribers: thcipriani, yuvipanda.

Did @yuvipanda start making changes to remove NFS from the non-MediaWiki nodes in beta cluster? I can't find a task but remember an email thread about that.

I don't think that the deploy server should need /data/project for anything in the current setup.

Forgot to quote the relevant part from Yuvi:

The way to disable this would be to set mount_nfs hiera variable to false in Hiera:deployment-prep page on wikitech and turn it on only for the specific hosts it is needed in.

At least it is enabled on https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep

nfs_mounts:
    project: true  # /data/project enabled
    home: false
    scratch: false
    dumps: false

But hieradata/labs/deployment-prep/common.yaml has:

nfs_mounts:
  project: false
  home: false
  scratch: false
  dumps: false
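The mismatch between the two sources can be spotted mechanically. The sketch below copies the two `nfs_mounts` mappings quoted above into plain dictionaries and reports the keys that differ; the variable and function names are illustrative, and it says nothing about which source takes precedence.

```python
from typing import Dict, Tuple

# Values copied from the wikitech Hiera:Deployment-prep page quoted above.
WIKITECH_NFS_MOUNTS = {"project": True, "home": False, "scratch": False, "dumps": False}

# Values copied from hieradata/labs/deployment-prep/common.yaml quoted above.
REPO_NFS_MOUNTS = {"project": False, "home": False, "scratch": False, "dumps": False}


def diff_settings(a: Dict[str, bool], b: Dict[str, bool]) -> Dict[str, Tuple]:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    print(diff_settings(WIKITECH_NFS_MOUNTS, REPO_NFS_MOUNTS))
    # prints {'project': (True, False)}
```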

IIRC the NFS server whitelists the instances of a project on a per-IP basis, so deployment-tin (IP 10.68.17.240) might not be whitelisted. The whitelist is kept per instance to prevent a project directory from being mounted from another labs project.

Might want to look at the NFS export rules.
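One way to check this is to look for the instance's IP among the clients listed for the project's export. The sketch below assumes the rules follow the usual exports(5) layout of a path followed by `client(options)` entries. The export path and the two whitelisted addresses in the sample are hypothetical; only 10.68.17.240 comes from this task.

```python
from typing import Set

# Hypothetical export rules; the real path and client list on labstore may differ.
SAMPLE_EXPORTS = """\
# exports for deployment-prep (made-up sample)
/exp/project/deployment-prep 10.68.16.10(rw,sec=sys) 10.68.16.11(rw,sec=sys)
"""


def clients_for_export(exports_text: str, path: str) -> Set[str]:
    """Collect the client specs listed for `path` in exports(5)-style text."""
    clients: Set[str] = set()
    for raw_line in exports_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if fields[0] != path:
            continue
        for entry in fields[1:]:
            clients.add(entry.split("(", 1)[0])  # strip the "(options)" suffix
    return clients


if __name__ == "__main__":
    allowed = clients_for_export(SAMPLE_EXPORTS, "/exp/project/deployment-prep")
    print("10.68.17.240" in allowed)
```

If the new instance's address is missing from that set, the server would refuse the mount even though older instances in the same project work.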

greg triaged this task as Unbreak Now! priority. Feb 11 2016, 5:30 PM
greg subscribed.

UBN! because this is blocking Beta Cluster updates.

Change 270343 had a related patch set uploaded (by Thcipriani):
Beta: Move bastion server

https://gerrit.wikimedia.org/r/270343

This happens for toolsbeta-puppetmaster4.toolsbeta.eqiad.wmflabs as well:

scfc@toolsbeta-puppetmaster4:~$ sudo puppet agent -t
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Caching catalog for toolsbeta-puppetmaster4.toolsbeta.eqiad.wmflabs
Info: Applying configuration version '1455417234'
Error: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/toolsbeta 180 returned 2 instead of one of [0]
Error: /Stage[main]/Role::Labs::Instance/Exec[block-for-project-export]/returns: change from notrun to 0 failed: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/toolsbeta 180 returned 2 instead of one of [0]
Notice: /Stage[main]/Role::Labs::Instance/Mount[/data/project]: Dependency Exec[block-for-project-export] has failures: true
Warning: /Stage[main]/Role::Labs::Instance/Mount[/data/project]: Skipping because of failed dependencies
Error: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/toolsbeta 180 returned 2 instead of one of [0]
Error: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: change from notrun to 0 failed: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/toolsbeta 180 returned 2 instead of one of [0]
Notice: /Stage[main]/Role::Labs::Instance/Mount[/home]: Dependency Exec[block-for-home-export] has failures: true
Warning: /Stage[main]/Role::Labs::Instance/Mount[/home]: Skipping because of failed dependencies
Notice: Finished catalog run in 384.63 seconds
scfc@toolsbeta-puppetmaster4:~$

Wild guess: nfs-exports-daemon is not running on the active labstore*.

Cf. also T122250 where @chasemp increased the timeout for nfs-exports-daemon to five seconds. If I try to query the API for the largest project, it returns the answer almost immediately:

scfc@toolsbeta-puppetmaster4:~$ time curl -o /dev/null 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=tools&niregion=eqiad&format=json'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19820    0 19820    0     0   138k      0 --:--:-- --:--:-- --:--:--  139k

real    0m0.151s
user    0m0.009s
sys     0m0.009s
scfc@toolsbeta-puppetmaster4:~$
yuvipanda lowered the priority of this task from Unbreak Now! to Medium. Feb 17 2016, 7:55 AM

Is this still happening? I see that nfs-exports-daemon is running fine on labstore1001...

Resetting priority since Beta Cluster is no longer blocked on this (afaict) due to https://phabricator.wikimedia.org/T125624

I am assuming @chasemp fixed the script that generates the NFS export list by simply raising the timeout.

Change 270343 merged by Filippo Giunchedi:
Beta: Move deployment server

https://gerrit.wikimedia.org/r/270343