Instances without shared NFS storage suffer from a 3-minute boot delay
Closed, Resolved (Public)

Authored By

	hashar
	Jun 15 2015, 8:17 PM

Description

Instances in the integration project suffer from a 3-minute delay on initial boot because there is no NFS server available. Shared NFS has been disabled for integration (T90610#1344487).

It only happens on first boot, though. The code in our firstboot.sh has:

# Sleep until the nfs volumes we need are available.
#  Worst case, just time out after 3 minutes.
tries=18
for i in `seq 1 ${tries}`; do
    prod_domain=`echo $domain | sed 's/wmflabs/wmnet/'`
    nfs_server="labstore.svc.${prod_domain}"
    echo $(showmount -e ${nfs_server} | egrep ^/exp/project/${project}\\s), | fgrep -q $ip,
    if [ $? -eq 0 ];  then
        break
    fi
    sleep 10
done

A couple of console logs from T102108 (integration-t102108-jessie-new2 and integration-t102108-trusty-new2) show:

+ tries=18
++ seq 1 18
+ for i in '`seq 1 ${tries}`'
++ sed s/wmflabs/wmnet/
++ echo eqiad.wmflabs
+ prod_domain=eqiad.wmnet
+ nfs_server=labstore.svc.eqiad.wmnet
+ fgrep -q 10.68.17.6,
++ egrep '^/exp/project/integration\s'
++ showmount -e labstore.svc.eqiad.wmnet
+ echo /exp/project/integration 10.68.18.59,10.68.18.38,10.68.18.34,10.68.18.30,10.68.18.29,10.68.18.28,10.68.18.24,10.68.18.2,10.68.17.70,10.68.17.244,10.68.17.209,10.68.17.184,10.68.17.180,10.68.17.174,10.68.17.136,10.68.16.8,10.68.16.72,10.68.16.68,10.68.16.59,10.68.16.53,10.68.16.42,10.68.16.4,10.68.16.255,10.68.16.227,10.68.16.200,
+ '[' 1 -eq 0 ']'
+ sleep 10

...
+ echo 'Warning:  Timed out trying to detect NFS mounts.'

And indeed, newly created instances are not in the export list, since shared NFS is disabled.
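
To make the failing check easier to follow, here is a minimal sketch of the matching step from the loop above. It is not the firstboot.sh code: the function name `ip_in_exports` is made up for illustration, the `showmount` output is passed in as text so it runs without an NFS server, and the sample export list is a shortened copy of the one in the console log.

```shell
#!/bin/bash
# Sketch only: mirrors the matching step of the firstboot.sh loop.
# ip_in_exports is a hypothetical helper, not part of firstboot.sh.

# ip_in_exports EXPORTS PROJECT IP
# Returns 0 when IP appears in the client list of /exp/project/PROJECT.
ip_in_exports() {
    local exports=$1 project=$2 ip=$3
    local line
    # Keep only the export line for this project (path followed by whitespace).
    line=$(printf '%s\n' "$exports" | grep -E "^/exp/project/${project}[[:space:]]")
    # Append a comma so the last client is comma-terminated too; searching
    # for "IP," then keeps 10.68.17.6 from matching 10.68.17.60.
    printf '%s,\n' "$line" | grep -qF -- "${ip},"
}

# Sample data: shortened export list taken from the console log above.
sample='/exp/project/integration 10.68.18.59,10.68.18.38,10.68.17.70,10.68.16.200'

# 10.68.17.6 is the new instance from the log; it is not in the list.
if ip_in_exports "$sample" integration 10.68.17.6; then
    echo "exported"
else
    echo "not exported"
fi
```

With the instance IP missing from the export line, the check fails on every pass, so the loop runs all 18 tries of 10 seconds each before giving up.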

Maybe firstboot.sh could look up whether an NFS share is expected at all? I have no idea where that information is stored. I am guessing LDAP, but as I understand it we want to move away from LDAP.

Details

	Subject	Repo	Branch	Lines +/-
	Remove the wait-on-NFS code from labs instance firstboot.	operations/puppet	production	+0 -34
	Wait for a minute for NFS exports before trying to mount requested volumes.	operations/puppet	production	+59 -2


Related Objects

Mentioned In: rOPUPc77457a502e0: Remove the wait-on-NFS code from labs instance firstboot.
rOPUPede70dd9afc8: Wait for a minute for NFS exports before trying to mount requested volumes.
T90610: Continuous integration should not depend on labs NFS
Mentioned Here: T90610: Continuous integration should not depend on labs NFS
T102108: New jessie instance can't attach to puppet due to wrong certname

Event Timeline

hashar created this task. Jun 15 2015, 8:17 PM

hashar assigned this task to yuvipanda.

hashar raised the priority of this task to Needs Triage.

hashar updated the task description.

hashar added projects: Cloud-VPS, Cloud-Services.

hashar mentioned this in T90610: Continuous integration should not depend on labs NFS.

hashar added subscribers: Andrew, yuvipanda, hashar, Aklapper.

hashar removed a parent task: T102108: New jessie instance can't attach to puppet due to wrong certname. Jun 15 2015, 8:21 PM

hashar updated the task description.

hashar set Security to None.

faidon added a project: Labs-Sprint-103. Jun 22 2015, 5:55 PM

The root issue, of course, is the risk that puppet runs before the NFS server has been updated with the new exports. A plausible solution would be to move the guard/wait to puppet time instead of boot time.
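
As a rough illustration of such a guard, a bounded retry helper could look like the sketch below. This is an assumption about the general shape, not the code from any patch on this task: `wait_until` and `fake_check` are hypothetical names, and the fake check simply succeeds on its third call so the example runs without an NFS server.

```shell
#!/bin/bash
# Sketch only: a generic bounded wait. wait_until and fake_check are
# hypothetical names used for illustration.

# wait_until TRIES DELAY COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
    local tries=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= tries; i++)); do
        if "$@"; then
            return 0
        fi
        # No point sleeping after the final attempt.
        if ((i < tries)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Stand-in for a real export check: succeeds on the third call.
attempts=0
fake_check() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if wait_until 5 0 fake_check; then
    echo "ready after ${attempts} attempts"
else
    echo "Warning: timed out waiting for the export"
fi
```

In real use, the command passed in would be whatever test confirms the export exists, and the tries and delay would be chosen to cap the wait (for example, 6 tries at 10 seconds for about a minute). A project with no NFS volumes requested would never call it, so it would not pay the delay.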

yuvipanda removed yuvipanda as the assignee of this task. Jun 26 2015, 3:40 PM

+ Continuous-Integration-Scaling so I get it on my radar. That is not needed for that project though.

Change 221150 had a related patch set uploaded (by Andrew Bogott):
Wait for a minute for NFS exports before trying to mount requested volumes.

https://gerrit.wikimedia.org/r/221150

Change 221151 had a related patch set uploaded (by Andrew Bogott):
Remove the wait-on-NFS code from labs instance firstboot.

https://gerrit.wikimedia.org/r/221151

Andrew claimed this task. Jun 26 2015, 5:21 PM

Andrew moved this task from To Do to Code Review / Blocked on the Labs-Sprint-103 board.

hashar triaged this task as Medium priority. Jun 29 2015, 8:43 AM

hashar moved this task from Backlog to In-progress on the Continuous-Integration-Scaling board.

hashar moved this task from Triage to In Progress on the Cloud-Services board.

Andrew added a project: Labs-Sprint-104. Jun 29 2015, 5:40 PM

Andrew moved this task from To Do to Code Review / Blocked on the Labs-Sprint-104 board. Jun 29 2015, 5:43 PM

The puppet fix for this can be merged as soon as the export daemon is running again: https://gerrit.wikimedia.org/r/#/c/217861/

Andrew added a project: Labs-Sprint-105. Jul 6 2015, 5:42 PM

Andrew moved this task from To Do to Code Review / Blocked on the Labs-Sprint-105 board. Jul 6 2015, 5:45 PM

Change 221150 merged by Andrew Bogott:
Wait for a minute for NFS exports before trying to mount requested volumes.

https://gerrit.wikimedia.org/r/221150

Andrew mentioned this in rOPUPede70dd9afc8: Wait for a minute for NFS exports before trying to mount requested volumes. Jul 7 2015, 3:09 PM

Change 221151 merged by Andrew Bogott:
Remove the wait-on-NFS code from labs instance firstboot.

https://gerrit.wikimedia.org/r/221151

Andrew mentioned this in rOPUPc77457a502e0: Remove the wait-on-NFS code from labs instance firstboot. Jul 7 2015, 4:04 PM

The new trusty image now has all the updated changes and should start up fairly quickly. If all is well in a day or two I'll build a new jessie image as well.

Andrew moved this task from Code Review / Blocked to Doing on the Labs-Sprint-105 board. Jul 7 2015, 5:40 PM

Works for me with Trusty. I created a Trusty instance on the integration project and had it build quite fast. Well done!

I've built new images for Trusty and Jessie. Not bothering with Precise since it needs to die off anyway.

Now new builds /with/ NFS are sometimes very slow due to problems with something in the way exports are created. I don't know details yet.

Andrew closed this task as Resolved. Jul 8 2015, 9:00 PM

Andrew moved this task from Doing to Done on the Labs-Sprint-105 board.

hashar moved this task from In-progress to Done on the Continuous-Integration-Scaling board. Jul 10 2015, 3:02 PM

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:50 PM
