
Continuous integration should not depend on labs NFS
Closed, ResolvedPublic

Description

I don't know which part of our infrastructure depends on it (perhaps ssh access for the Precise slaves?), but jobs seem to be stalling and failing for strange reasons while NFS is down.

While some instances are still working (Trusty?), our test pipeline requires a green light from all sub-jobs.

I can't debug at the moment since, for human accounts at least, ssh keys are stored on NFS for the Precise instances.

Event Timeline

Krinkle raised the priority of this task to Needs Triage.
Krinkle updated the task description.
Krinkle subscribed.
Krinkle set Security to None.
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.

During today's NFS outage, all running builds halted wherever they happened to be and eventually timed out after 30 minutes.

https://gerrit.wikimedia.org/r/#/c/203868/

mediawiki-extensions-zend FAILURE in 30m 01s

https://integration.wikimedia.org/ci/job/mediawiki-extensions-zend/11839/console

00:01:26.096 + /srv/deployment/integration/slave-scripts/bin/mw-run-update-script.sh
00:01:26.229 MediaWiki 1.26alpha Updater
00:01:26.230 Going to run database updates for jenkins_u3_mw
..
00:01:46.205 ...doing rc_id from 1 to 100
00:01:46.264 ...doing rc_id from 100 to 199
00:01:46.265 ...cu_changes table added and populated.
00:01:47.546 ...cu_log added
00:30:00.037 Build timed out (after 30 minutes). Marking the build as failed.
00:30:00.041 Build was aborted
00:30:00.041 Recording test results

So somehow NFS is able to disrupt us in three ways:

  • Unable to connect to labs instances.
  • Unable to spawn new jobs.
  • Running builds halt immediately and stop reporting with no recovery.
Krinkle raised the priority of this task from Low to High. (Apr 13 2015, 7:18 PM)

As a first step, I disabled "Shared project storage" (/data/project NFS mount) in the Nova Project management for integration instances.

We didn't use this anyway. Two existing directories were archived to /home/krinkle/integration-nfs-data-project/ in case they're still needed for something (owned by root, chmod 777).
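A rough sketch of how that archiving step could look (the directory names under /data/project are placeholders; the destination path and permissions are the ones noted above):

    # Archive the old shared-storage directories before dropping the /data/project mount.
    # "dir-a" and "dir-b" are hypothetical names.
    sudo mkdir -p /home/krinkle/integration-nfs-data-project
    sudo cp -a /data/project/dir-a /data/project/dir-b /home/krinkle/integration-nfs-data-project/
    # Match the ownership and permissions mentioned above (root-owned, chmod 777).
    sudo chown -R root:root /home/krinkle/integration-nfs-data-project
    sudo chmod -R 777 /home/krinkle/integration-nfs-data-project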

Next step is to disable shared home directories within the integration project. CI sysadmins like Antoine and myself will lose a little bit of convenience with regards to dotfiles, utility scripts, and document storage, but it increases reliability for the overall project. It also makes the instances behave more like production servers, and more like the temporary executors we will have with Nodepool.

We can stash our personal files on wmflabs-bastion instead (which has NFS-shared, backed-up home directories), or somewhere in production.
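For example, a minimal way to stash such files on the bastion; the hostname and the target directory here are assumptions on my part, not something we standardised on:

    # Copy dotfiles and utility scripts to the NFS-backed, backed-up home on the labs bastion.
    rsync -a ~/.bashrc ~/.vimrc ~/bin bastion.wmflabs.org:integration-stash/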

hashar subscribed.

As a side effect, this delays instance creation by 3 minutes, which is T102544.

The labs NFS server is currently down and the /home mounts prevent CI slaves from running. Going to disable NFS.

hashar added a subscriber: yuvipanda.

Disabling NFS on Hiera:Integration with:

nfs_mounts:
    project: false
    home: false
    scratch: false
    dumps: false

Devised by @yuvipanda

Need to (rough sketch after this list):

  • manually reboot instances and wait for them to time out
  • unmount
  • run puppet
  • verify /etc/fstab content
  • reboot
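A rough per-instance sketch of that sequence, assuming root access and the mount points listed in the fstab output later in this task:

    # Lazily unmount the NFS shares so hung processes don't block us.
    umount -l /data/project /home /data/scratch /public/dumps /public/keys
    # Let puppet rewrite /etc/fstab based on the nfs_mounts Hiera settings above.
    puppet agent --test
    # Verify no labstore entries remain, then reboot to start from a clean state.
    grep labstore /etc/fstab
    reboot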

The integration-slave-trusty* instances no longer depend on NFS.

All integration instances should now be NFS-independent. Keeping the bug open pending verification; we also have to make sure newly created Precise/Trusty/Jessie instances stay NFS-free.

hashar lowered the priority of this task from High to Medium. (Jun 18 2015, 1:24 PM)

Side effect: home directories are now local to each instance. If you want your favourite dotfiles, add them to:

operations/puppet.git
./modules/admin/files/home/<your username>/...
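For illustration, assuming a placeholder username "jdoe", the layout in a puppet.git checkout would be along these lines:

    # Inside a clone of operations/puppet.git; "jdoe" is a hypothetical username.
    mkdir -p modules/admin/files/home/jdoe
    cp ~/.bashrc ~/.gitconfig modules/admin/files/home/jdoe/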

Still keeping the bug open pending verification.

Unfortunately, that dotfiles approach (shipping them via operations/puppet.git) does not work for labs instances. See T102173: implement a simple way to share dotfiles across Cloud VPS project instances.

salt '*' cmd.run 'grep labstore /etc/fstab' yields:

integration-slave-jessie-1001.integration.eqiad.wmflabs:
    labstore.svc.eqiad.wmnet:/project/integration/project       /data/project   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/project/integration/home  /home   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore1003.eqiad.wmnet:/dumps     /public/dumps   nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/scratch   /data/scratch   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
integration-zuul-server.integration.eqiad.wmflabs:
    labstore.svc.eqiad.wmnet:/project/integration/project       /data/project   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/keys      /public/keys    nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/project/integration/home  /home   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore1003.eqiad.wmnet:/dumps     /public/dumps   nfs     ro,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0
    labstore.svc.eqiad.wmnet:/scratch   /data/scratch   nfs     rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc   0       0

Fixed puppet on integration-zuul-server.

hashar raised the priority of this task from Medium to High. (Jun 22 2015, 12:51 PM)
hashar lowered the priority of this task from High to Medium.

The bulk of the work has been done. Now waiting on T103312.

labstore is gone from /etc/fstab:

root@integration-saltmaster:~# salt '*' cmd.run 'grep labstore /etc/fstab'
integration-dev.integration.eqiad.wmflabs:
integration-raita.integration.eqiad.wmflabs:
integration-slave-precise-1013.integration.eqiad.wmflabs:
integration-slave-precise-1011.integration.eqiad.wmflabs:
integration-t102108-trusty-new2.integration.eqiad.wmflabs:
integration-slave-trusty-1012.integration.eqiad.wmflabs:
integration-slave-trusty-1021.integration.eqiad.wmflabs:
integration-publisher.integration.eqiad.wmflabs:
integration-vmbuilder-trusty.integration.eqiad.wmflabs:
integration-slave-precise-1012.integration.eqiad.wmflabs:
integration-slave-trusty-1013.integration.eqiad.wmflabs:
integration-slave-precise-1014.integration.eqiad.wmflabs:
integration-lightslave-jessie-1002.integration.eqiad.wmflabs:
integration-slave-jessie-1001.integration.eqiad.wmflabs:
integration-slave-trusty-1011.integration.eqiad.wmflabs:
integration-slave-trusty-1016.integration.eqiad.wmflabs:
integration-slave-trusty-1015.integration.eqiad.wmflabs:
integration-slave-trusty-1014.integration.eqiad.wmflabs:
integration-labsvagrant.integration.eqiad.wmflabs:
integration-saltmaster.integration.eqiad.wmflabs:
root@integration-saltmaster:~#

On all instances I have unmounted /public/dumps, /data/scratch, and /public/keys.

Unmounted /data/project and /home NFS mounts from integration-raita and integration-vmbuilder-trusty and rebooted them.
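A minimal sketch of driving that from integration-saltmaster (lazy unmounts are my assumption; hostnames are those from the salt output above):

    # Unmount the dumps, scratch and keys shares everywhere.
    salt '*' cmd.run 'umount -l /public/dumps /data/scratch /public/keys'
    # The two instances that still had /data/project and /home mounted:
    salt -L 'integration-raita.integration.eqiad.wmflabs,integration-vmbuilder-trusty.integration.eqiad.wmflabs' cmd.run 'umount -l /data/project /home && reboot'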

hashar moved this task from In-progress to Done on the Continuous-Integration-Infrastructure board.

All fixed as far as I can tell. labstore is no longer mounted nor present in /etc/fstab.