
Continuous integration should not depend on labs NFS
Closed, ResolvedPublic

Description

I don't know which part of our infrastructure depends on it (perhaps ssh access for the Precise slaves?), but jobs seem to be stalling and failing for strange reasons while NFS is down.

While some instances are still working (Trusty?), our test pipeline requires a green light from all sub-jobs.

I can't debug at the moment since, for human accounts at least, ssh keys are stored on NFS for the Precise instances.

Event Timeline

Krinkle raised the priority of this task to Needs Triage.
Krinkle updated the task description.
Krinkle subscribed.
Krinkle set Security to None.
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.

During today's NFS outage, all running builds halted wherever they happened to be and eventually timed out after 30 minutes.

https://gerrit.wikimedia.org/r/#/c/203868/

mediawiki-extensions-zend FAILURE in 30m 01s

https://integration.wikimedia.org/ci/job/mediawiki-extensions-zend/11839/console

00:01:26.096 + /srv/deployment/integration/slave-scripts/bin/mw-run-update-script.sh
00:01:26.229 MediaWiki 1.26alpha Updater
00:01:26.230 Going to run database updates for jenkins_u3_mw
..
00:01:46.205 ...doing rc_id from 1 to 100
00:01:46.264 ...doing rc_id from 100 to 199
00:01:46.265 ...cu_changes table added and populated.
00:01:47.546 ...cu_log added
00:30:00.037 Build timed out (after 30 minutes). Marking the build as failed.
00:30:00.041 Build was aborted
00:30:00.041 Recording test results

So somehow NFS is able to disrupt us in three ways:

  • Unable to connect to labs instances.
  • Unable to spawn new jobs.
  • Running builds halt immediately and stop reporting with no recovery.
Krinkle raised the priority of this task from Low to High. (Apr 13 2015, 7:18 PM)

As a first step, I disabled "Shared project storage" (/data/project NFS mount) in the Nova Project management for integration instances.

We didn't use this anyway. Two existing directories were archived to /home/krinkle/integration-nfs-data-project/ in case they're still needed for something (owned by root, chmod 777).
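A rough sketch of how that archiving step could look (the directory names under /data/project are placeholders; the destination path and permissions are the ones noted above):

    # Archive the old shared-storage directories before dropping the /data/project mount.
    # "dir-a" and "dir-b" are hypothetical names.
    sudo mkdir -p /home/krinkle/integration-nfs-data-project
    sudo cp -a /data/project/dir-a /data/project/dir-b /home/krinkle/integration-nfs-data-project/
    # Match the ownership and permissions mentioned above (root-owned, chmod 777).
    sudo chown -R root:root /home/krinkle/integration-nfs-data-project
    sudo chmod -R 777 /home/krinkle/integration-nfs-data-project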

Next step is to disable shared home directories within the integration project. CI sysadmins like Antoine and myself will lose a little bit of convenience with regards to dotfiles, utility scripts, and document storage, but it increases reliability for the overall project. It also makes the instances behave more like production servers, and more like the temporary executors we will have with Nodepool.

We can stash our personal files on wmflabs-bastion instead (which has NFS-shared, backed-up home directories), or somewhere in production.
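For example, a minimal way to stash such files on the bastion; the hostname and the target directory here are assumptions on my part, not something we standardised on:

    # Copy dotfiles and utility scripts to the NFS-backed, backed-up home on the labs bastion.
    rsync -a ~/.bashrc ~/.vimrc ~/bin bastion.wmflabs.org:integration-stash/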

hashar subscribed.

As a side effect, this delays instance creation by 3 minutes, which is T102544.

The labs NFS server is currently down and the /home mounts prevent CI slaves from running. Going to disable NFS.

hashar added a subscriber: yuvipanda.

Disabling NFS on Hiera:Integration with:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

Devised by @yuvipanda

Need to (rough sketch after this list):

  • manually reboot instances and wait for them to time out
  • unmount
  • run puppet
  • verify /etc/fstab content
  • reboot
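A rough per-instance sketch of that sequence, assuming root access and the mount points listed in the fstab output later in this task:

    # Lazily unmount the NFS shares so hung processes don't block us.
    umount -l /data/project /home /data/scratch /public/dumps /public/keys
    # Let puppet rewrite /etc/fstab based on the nfs_mounts Hiera settings above.
    puppet agent --test
    # Verify no labstore entries remain, then reboot to start from a clean state.
    grep labstore /etc/fstab
    reboot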

The integration-slave-trusty* instances no longer depend on NFS.

All integration instances should now be NFS-independent. Keeping the bug open pending verification; we also have to make sure newly created Precise/Trusty/Jessie instances stay NFS-free.

hashar lowered the priority of this task from High to Medium. (Jun 18 2015, 1:24 PM)

Side effect: home directories are now local to each instance. If you want your favourite dotfiles, add them to:

operations/puppet.git
./modules/admin/files/home/<your username>/...
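For illustration, assuming a placeholder username "jdoe", the layout in a puppet.git checkout would be along these lines:

    # Inside a clone of operations/puppet.git; "jdoe" is a hypothetical username.
    mkdir -p modules/admin/files/home/jdoe
    cp ~/.bashrc ~/.gitconfig modules/admin/files/home/jdoe/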

Still keeping the bug open pending verification.

Unfortunately, that dotfiles approach (shipping them via operations/puppet.git) does not work for labs instances. See T102173: implement a simple way to share dotfiles across Cloud VPS project instances.

salt '*' cmd.run 'grep labstore /etc/fstab' yields:

integration-slave-jessie-1001.integration.eqiad.wmflabs:
    labstore.svc.eqiad.wmnet:/project/integration/project       /data/project   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/project/integration/home  /home   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore1003.eqiad.wmnet:/dumps     /public/dumps   nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/scratch   /data/scratch   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
integration-zuul-server.integration.eqiad.wmflabs:
    labstore.svc.eqiad.wmnet:/project/integration/project       /data/project   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/keys      /public/keys    nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/project/integration/home  /home   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore1003.eqiad.wmnet:/dumps     /public/dumps   nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/scratch   /data/scratch   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0

Fixed puppet on integration-zuul-server.

hashar raised the priority of this task from Medium to High. (Jun 22 2015, 12:51 PM)
hashar lowered the priority of this task from High to Medium.

The bulk of the work has been done. Now waiting on T103312.

labstore is gone from /etc/fstab:

root@integration-saltmaster:~# salt '*' cmd.run 'grep labstore /etc/fstab'
integration-dev.integration.eqiad.wmflabs:
integration-raita.integration.eqiad.wmflabs:
integration-slave-precise-1013.integration.eqiad.wmflabs:
integration-slave-precise-1011.integration.eqiad.wmflabs:
integration-t102108-trusty-new2.integration.eqiad.wmflabs:
integration-slave-trusty-1012.integration.eqiad.wmflabs:
integration-slave-trusty-1021.integration.eqiad.wmflabs:
integration-publisher.integration.eqiad.wmflabs:
integration-vmbuilder-trusty.integration.eqiad.wmflabs:
integration-slave-precise-1012.integration.eqiad.wmflabs:
integration-slave-trusty-1013.integration.eqiad.wmflabs:
integration-slave-precise-1014.integration.eqiad.wmflabs:
integration-lightslave-jessie-1002.integration.eqiad.wmflabs:
integration-slave-jessie-1001.integration.eqiad.wmflabs:
integration-slave-trusty-1011.integration.eqiad.wmflabs:
integration-slave-trusty-1016.integration.eqiad.wmflabs:
integration-slave-trusty-1015.integration.eqiad.wmflabs:
integration-slave-trusty-1014.integration.eqiad.wmflabs:
integration-labsvagrant.integration.eqiad.wmflabs:
integration-saltmaster.integration.eqiad.wmflabs:
root@integration-saltmaster:~#

On all instances I have unmounted /public/dumps, /data/scratch, and /public/keys.

Unmounted /data/project and /home NFS mounts from integration-raita and integration-vmbuilder-trusty and rebooted them.
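A minimal sketch of driving that from integration-saltmaster (lazy unmounts are my assumption; hostnames are those from the salt output above):

    # Unmount the dumps, scratch and keys shares everywhere.
    salt '*' cmd.run 'umount -l /public/dumps /data/scratch /public/keys'
    # The two instances that still had /data/project and /home mounted:
    salt -L 'integration-raita.integration.eqiad.wmflabs,integration-vmbuilder-trusty.integration.eqiad.wmflabs' cmd.run 'umount -l /data/project /home && reboot'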

hashar moved this task from In-progress to Done on the Continuous-Integration-Infrastructure board.

All fixed as far as I can tell. labstore is no longer mounted nor present in /etc/fstab.