
CI labs instances can't start on reboot: tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'
Closed, Resolved · Public

Description

integration-slave1001.eqiad.wmflabs won't start because of a DNS resolution error from mount.nfs.

https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000001bd.eqiad.wmflabs

I can restart it just fine from the wikitech page, but each time Get console output shows:

[    5.456372] tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'
rpcbind: Cannot open '/run/rpcbind/rpcbind.xdr' file for reading, errno 2 (No such file or directory)
rpcbind: Cannot open '/run/rpcbind/portmap.xdr' file for reading, errno 2 (No such file or directory)
mount.nfs: Failed to resolve server labstore1003.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /public/keys [336] terminated with status 32
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /public/backups [338] terminated with status 32
mountall: mount /public/dumps [340] terminated with status 32
mountall: mount /data/scratch [344] terminated with status 32
mountall: mount /home [346] terminated with status 32
mountall: mount /data/project [342] terminated with status 32
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /public/keys [585] terminated with status 32
mount.nfs: Failed to resolve server labstore1003.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /public/backups [587] terminated with status 32
mountall: mount /public/dumps [588] terminated with status 32
mountall: mount /data/project [592] terminated with status 32
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /data/scratch [594] terminated with status 32
mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /home [597] terminated with status 32
mount: wrong fs type, bad option, bad superblock on tmpfs,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

mountall: mount /mnt/home/jenkins-deploy/tmpfs [669] terminated with status 32
mountall: Filesystem could not be mounted: /mnt/home/jenkins-deploy/tmpfs
An error occurred while mounting /mnt/home/jenkins-deploy/tmpfs.
Press S to skip mounting or M for manual recovery
cloud-init start-local running: Fri, 28 Nov 2014 20:11:22 +0000. up 6.47 seconds
no instance data found in start-local
ci-info: lo    : 1 127.0.0.1       255.0.0.0       .
ci-info: eth0  : 1 10.68.16.171    255.255.248.0   fa:16:3e:cc:fa:c3
ci-info: route-0: 0.0.0.0         10.68.16.1      0.0.0.0         eth0   UG
ci-info: route-1: 10.68.16.0      0.0.0.0         255.255.248.0   eth0   U
cloud-init start running: Fri, 28 Nov 2014 20:11:25 +0000. up 9.86 seconds

Event Timeline

hashar created this task.Nov 28 2014, 8:15 PM
hashar raised the priority of this task to Needs Triage.
hashar updated the task description. (Show Details)
hashar added a project: Cloud-VPS.
hashar changed Security from none to None.
hashar added subscribers: hashar, Cloud-Services.

I have marked the associated Jenkins slave as offline. We will want to bring it back online whenever the instance is recovered.
https://integration.wikimedia.org/ci/computer/integration-slave1001/

hashar updated the task description. (Show Details)Nov 28 2014, 8:18 PM

@Andrew @yuvipanda @coren, any chance of fixing up that instance's resolv.conf? If it is not too much of a hassle, that would save us the time needed to add a new slave.

So I looked at this today - our documented method for getting an interactive console no longer seems to work, so I'm not really sure what exactly to do. The instance doesn't boot at all, so we can't ssh in with the root key to fix things. If saving time is the end goal, I'd suggest creating a new instance, sadly.

Release Engineering/SAL:

Dec 2 05:26 Krinkle: integration-slave1001 has been down since the failed reboot on 28 November 2014. Still unreachable over ssh and no Jenkins slave agent.

hashar claimed this task.Dec 16 2014, 1:56 PM

Thanks @yuvipanda.

I have deleted the old instance and recreated it (with IP 10.68.17.119). I have updated the Jenkins configuration for the node. Once puppet has completed the instance configuration, I will reenable it in Jenkins.

Puppet chokes on newly created Precise instances (T78661) :-(

So I deleted the old blocked instance and created a fresh one. Applied the manifests I needed and did a reboot from the command line. Result is the same:

mount.nfs: Failed to resolve server labstore.svc.eqiad.wmnet: Temporary failure in name resolution
mountall: mount /home [578] terminated with status 32
cloud-init start-local running: Tue, 16 Dec 2014 15:28:34 +0000. up 9.83 seconds
An error occurred while mounting /mnt/home/jenkins-deploy/tmpfs.
Press S to skip mounting or M for manual recovery
no instance data found in start-local
ci-info: lo    : 1 127.0.0.1       255.0.0.0       .
ci-info: eth0  : 1 10.68.17.119    255.255.248.0   fa:16:3e:75:4d:a9
ci-info: route-0: 0.0.0.0         10.68.16.1      0.0.0.0         eth0   UG
ci-info: route-1: 10.68.16.0      0.0.0.0         255.255.248.0   eth0   U
cloud-init start running: Tue, 16 Dec 2014 15:28:36 +0000. up 11.10 seconds
found data source: DataSourceEc2

:-(

From coren, the relevant bits are:

tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'
...
An error occurred while mounting /mnt/home/jenkins-deploy/tmpfs.
Press S to skip mounting or M for manual recovery

Which is caused by the cherry-picked patch https://gerrit.wikimedia.org/r/#/c/173512/1, which adds a tmpfs mount on labs while jenkins-deploy only exists in the LDAP server. The boot sequence performs the mount before network/LDAP is available, hence the failure :-(
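The failure can be sketched with a quick check: mount(8) resolves a symbolic uid= value through the passwd database (NSS), so an account that only exists in LDAP cannot be resolved until the network and nslcd are up. A minimal sketch, assuming jenkins-deploy is LDAP-only as described above:

```shell
#!/bin/sh
# mount translates uid=<name> into a numeric id via the passwd database
# (getent/NSS). If the name only exists in LDAP and the network is not up
# yet, the lookup fails and mount rejects the option, which is what
# "tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'" means.
check_uid_option() {
    user="$1"
    if getent passwd "$user" >/dev/null 2>&1; then
        echo "resolvable"    # mount -o uid=$user would be accepted
    else
        echo "unresolvable"  # mount -o uid=$user would fail, as at boot
    fi
}

check_uid_option root            # local account in /etc/passwd
check_uid_option jenkins-deploy  # LDAP-only account: depends on nslcd being up
```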

Krinkle closed this task as Declined.Dec 19 2014, 6:19 AM

Instance has been deleted and re-created. Can't reproduce this issue.

hashar reopened this task as Open.Dec 19 2014, 10:21 AM

Instance has been deleted and re-created. Can't reproduce this issue.

Have you rebooted the instance? It will end up being locked early on during the boot with:

tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'

Because network/LDAP is not available yet.

hashar renamed this task from integration-slave1001.eqiad.wmflabs can't start, mount.nfs yields failure in name resolution to CI labs instances can't start on reboot: tmpfs: Bad value 'jenkins-deploy' for mount option 'uid'.Dec 19 2014, 10:22 AM

Instance has been deleted and re-created. Can't reproduce this issue.

Have you rebooted the instance? It will end up being locked early on during the boot [..]

I didn't reboot it, but it worked fine when provisioning - both existing and new instances (integration-slave1001/UbuntuPrecise and integration-slave1005/UbuntuTrusty). So if this fails on reboot then it presumably happens on existing instances as well.

I didn't reboot it, but it worked fine when provisioning - both existing and new instances (integration-slave1001/UbuntuPrecise and integration-slave1005/UbuntuTrusty). So if this fails on reboot then it presumably happens on existing instances as well.

Indeed. So if all instances ever have to reboot (due to a labs issue, for example), CI is gone entirely and we would be forced to rebuild fresh instances.

We should probably remove the tmpfs from labs instances and have puppet ensure the mount is absent until the boot sequence issue is figured out.
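A hypothetical Puppet sketch of that idea (the resource title is taken from the mount point in the console output above; the actual contint manifests may declare the mount differently):

```puppet
# Hypothetical sketch: keep the tmpfs out of fstab until the boot
# ordering issue is resolved. Ensuring the mount absent removes the
# fstab entry so mountall no longer blocks the boot on it.
mount { '/mnt/home/jenkins-deploy/tmpfs':
  ensure => absent,
}
```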

hashar added a comment.Jan 5 2015, 1:53 PM

To work around the boot sequence not finding the jenkins-deploy user, we could have the tmpfs mounted just like /tmp: owned by root:root, world-writable, with restricted deletion (mode 1777).
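In fstab terms, the workaround would look something like this sketch (the exact mount options are assumptions; only the root-owned mode 1777 idea comes from the comment above):

```
# Before (assumed): a user lookup is required at mount time, which fails
# while LDAP is unreachable:
#   tmpfs  /mnt/home/jenkins-deploy/tmpfs  tmpfs  uid=jenkins-deploy  0  0
# After: owned by root, sticky and world-writable like /tmp, so no user
# lookup is needed during early boot:
tmpfs  /mnt/home/jenkins-deploy/tmpfs  tmpfs  mode=1777  0  0
```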

Change 173511 had a related patch set uploaded (by Hashar):
contint: tmpfs is now root:root and world writable

https://gerrit.wikimedia.org/r/173511

Patch-For-Review

hashar added a comment.Jan 5 2015, 2:56 PM

Applied on all labs CI slaves using:

umount /mnt/home/jenkins-deploy/tmpfs
rmdir /mnt/home/jenkins-deploy/tmpfs
puppet agent -tv

I have rebooted integration-slave1001 and confirmed it comes back just fine.

Krinkle closed this task as Resolved.Jan 5 2015, 4:38 PM

Thanks!

By the way, do we have a ticket to track LDAP not being available? Or is that by design and a wontfix for Labs?

hashar added a comment.Jan 5 2015, 8:09 PM

By the way, do we have a ticket to track LDAP not being available? Or is that by design and a wontfix for Labs?

There is none, and I have no idea how to alter the boot sequence to have network + nslcd (LDAP) started before the tmpfs mount :/ The fstab(5) man page mentions the mount option bootwait.

Using root:root works around it anyway :-]

Change 173511 merged by coren:
contint: tmpfs is now root:root and world writable

https://gerrit.wikimedia.org/r/173511

hashar added a comment.Jan 6 2015, 4:01 PM

I have also updated lanthanum and gallium so they now have /var/lib/jenkins-slave/tmpfs belonging to root:root and mode 1777.