
Instances without shared NFS storage suffer from a 3-minute boot delay
Closed, Resolved | Public

Description

Instances in the integration project suffer from a 3-minute delay on initial boot because there is no NFS server available to them. Shared NFS has been disabled on integration: T90610#1344487

It only happens on first boot, though. The code in our firstboot.sh files has:

# Sleep until the nfs volumes we need are available.
#  Worst case, just time out after 3 minutes.
tries=18
for i in `seq 1 ${tries}`; do
    prod_domain=`echo $domain | sed 's/wmflabs/wmnet/'`
    nfs_server="labstore.svc.${prod_domain}"
    echo $(showmount -e ${nfs_server} | egrep ^/exp/project/${project}\\s), | fgrep -q $ip,
    if [ $? -eq 0 ];  then
        break
    fi
    sleep 10
done
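
The membership test in the loop above appends a comma to both the export list and `$ip`, so that 10.68.17.6 does not match 10.68.17.60, for example. Below is a minimal sketch of the same check as a standalone function. It assumes `showmount -e` prints the export path followed by a comma-separated client list, as seen in the console logs on this task. `ip_is_exported` is a hypothetical name, not something that exists in firstboot.sh. It compares each client exactly instead of by substring:

```shell
# Hypothetical helper (not part of firstboot.sh).
# Reads `showmount -e` output on stdin and succeeds only if the given IP
# is listed as a client of /exp/project/<project>.
#   $1 = project name, $2 = instance IP
ip_is_exported() {
    awk -v path="/exp/project/$1" -v ip="$2" '
        # Only look at the line for this project; the header line and
        # other exports are ignored.
        $1 == path {
            n = split($2, clients, ",")
            for (i = 1; i <= n; i++) {
                # Exact comparison: 10.68.17.6 must not match 10.68.17.60.
                if (clients[i] == ip) {
                    found = 1
                }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

It would be called as `showmount -e "${nfs_server}" | ip_is_exported "${project}" "${ip}"`, with the exit status used the same way as the `fgrep -q` result in the loop.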

A couple of console logs from T102108 (integration-t102108-jessie-new2 and integration-t102108-trusty-new2) show:

+ tries=18
++ seq 1 18
+ for i in '`seq 1 ${tries}`'
++ sed s/wmflabs/wmnet/
++ echo eqiad.wmflabs
+ prod_domain=eqiad.wmnet
+ nfs_server=labstore.svc.eqiad.wmnet
+ fgrep -q 10.68.17.6,
++ egrep '^/exp/project/integration\s'
++ showmount -e labstore.svc.eqiad.wmnet
+ echo /exp/project/integration 10.68.18.59,10.68.18.38,10.68.18.34,10.68.18.30,10.68.18.29,10.68.18.28,10.68.18.24,10.68.18.2,10.68.17.70,10.68.17.244,10.68.17.209,10.68.17.184,10.68.17.180,10.68.17.174,10.68.17.136,10.68.16.8,10.68.16.72,10.68.16.68,10.68.16.59,10.68.16.53,10.68.16.42,10.68.16.4,10.68.16.255,10.68.16.227,10.68.16.200,
+ '[' 1 -eq 0 ']'
+ sleep 10

...
+ echo 'Warning:  Timed out trying to detect NFS mounts.'

And indeed, newly created instances are not in the export list, since shared NFS is disabled.

Maybe firstboot.sh can find out that there is no NFS share to expect? I have no idea where that information is stored. I am guessing LDAP, but as I understand it we want to move away from LDAP.

Event Timeline

hashar created this task. Jun 15 2015, 8:17 PM
hashar assigned this task to yuvipanda.
hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added projects: Cloud-VPS, Cloud-Services.
hashar added subscribers: Andrew, yuvipanda, hashar, Aklapper.
coren added a subscriber: coren. Jun 23 2015, 3:23 PM

The root issue, of course, is the risk that puppet runs before the NFS server has been updated with the new exports. A plausible solution would be to move the guard/wait to puppet time instead of boot time.
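
Wherever the guard ends up, it is a bounded retry: the 18 tries with a 10 second sleep in firstboot.sh give the 3 minute worst case. A rough sketch of such a guard as a reusable shell function, with the try count and delay as parameters. `wait_for` is a hypothetical helper for illustration only, not the code from any actual patch:

```shell
# Hypothetical helper: run a check command until it succeeds or the
# attempts run out.
#   wait_for TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_for() {
    local tries="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        # Skip the sleep after the last failed attempt so a timeout
        # does not cost an extra delay.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, `wait_for 6 10 some_check` (where `some_check` stands in for whatever tests the export list) would give up after a bit under a minute of sleeping plus the time the checks themselves take, instead of three minutes.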

yuvipanda removed yuvipanda as the assignee of this task. Jun 26 2015, 3:40 PM

+ Continuous-Integration-Scaling so I get it on my radar. That is not needed for that project though.

Change 221150 had a related patch set uploaded (by Andrew Bogott):
Wait for a minute for NFS exports before trying to mount requested volumes.

https://gerrit.wikimedia.org/r/221150

Change 221151 had a related patch set uploaded (by Andrew Bogott):
Remove the wait-on-NFS code from labs instance firstboot.

https://gerrit.wikimedia.org/r/221151

Andrew claimed this task. Jun 26 2015, 5:21 PM
Andrew moved this task from To Do to Code Review / Blocked on the Labs-Sprint-103 board.
hashar triaged this task as Normal priority. Jun 29 2015, 8:43 AM
hashar moved this task from Backlog to In-progress on the Continuous-Integration-Scaling board.
hashar moved this task from Triage to In Progress on the Cloud-Services board.

The puppet fix for this can be merged as soon as the export daemon is running again: https://gerrit.wikimedia.org/r/#/c/217861/

Change 221150 merged by Andrew Bogott:
Wait for a minute for NFS exports before trying to mount requested volumes.

https://gerrit.wikimedia.org/r/221150

Change 221151 merged by Andrew Bogott:
Remove the wait-on-NFS code from labs instance firstboot.

https://gerrit.wikimedia.org/r/221151

Andrew added a comment. Jul 7 2015, 5:40 PM

The new trusty image now has all the updated changes and should start up fairly quickly. If all is well in a day or two I'll build a new jessie image as well.

hashar added a comment. Jul 8 2015, 8:00 AM

Works for me with Trusty. I created a Trusty instance on the integration project and had it build quite fast. Well done!

Andrew added a comment. Jul 8 2015, 8:58 PM

I've built new images for Trusty and Jessie. Not bothering with Precise since it needs to die off anyway.

Now new builds /with/ NFS are sometimes very slow, due to some problem in the way exports are created. I don't know the details yet.

Andrew closed this task as Resolved. Jul 8 2015, 9:00 PM
Andrew moved this task from Doing to Done on the Labs-Sprint-105 board.