
Migrate misc to secondary labstore HA cluster
Closed, Resolved (Public)

Description

Misc (so far known as "others") comprises all labs project NFS shares that aren't tools or maps. This data is still served from labstore1001, and the next step is to move it to the secondary NFS cluster (labstore1004/5).

Prior to the maintenance window

  • Announce to labs-l, labs-announce, engineering/wikitech-l(?) tech-news(?)
  • Mount misc from labstore1001 and labstore-secondary simultaneously (https://gerrit.wikimedia.org/r/329711 - Tested on nfs-test.testlabs and works fine)
  • Sync others 24 hours before migration to labstore-secondary and labstore200*
  • Check up on these nodes with others
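A minimal sketch of how the dual mount could be checked on a node, assuming the mount table is in the usual /proc/mounts format and that the secondary mount points are the /mnt/nfs/labstore-secondary-* paths named later in this task. The helper name `is_nfs_mounted` is hypothetical and not part of any existing tooling.

```shell
# Succeeds if the mount table lists mount point $1 with an NFS filesystem
# type (nfs or nfs4). $2 is the mount table to read and defaults to
# /proc/mounts; passing a file makes the check testable offline.
is_nfs_mounted() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs4?$/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

On an instance this could be run as `is_nfs_mounted /mnt/nfs/labstore-secondary-project && echo mounted`.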

Puppet run issues in these nodes

  • dumps-3.dumps.eqiad.wmflabs -> Var is full? [FIXED]
  • relic-relics.testlabs.eqiad.wmflabs -> Var is full? [FIXED]
  • labs-dnsrecursor-test.openstack.eqiad.wmflabs -> Puppet errors
  • labs-dynamicproxy-test.openstack.eqiad.wmflabs -> Puppet errors

SSH auth fails for the following instances prior to migration:

  • fastcci-puppetmaster.fastcci.eqiad.wmflabs [Can log in as root, but Puppet Errors]
  • phab-test.contributors.eqiad.wmflabs [Yuvi can't log in either]
  • wmt-exec.wmt.eqiad.wmflabs (only instance in project) [Yuvi can't log in either]
  • wsexport.wikisource-tools.eqiad.wmflabs (only instance in project) [Yuvi can't log in either]
  • wtui-new.wikidata-topicmaps.eqiad.wmflabs (only instance in project) [Can log in as root, Puppet Errors]

During migration:

  • Update lists/irc channel on start of migration
  • Silence shinken alarms (on the shinken-01 instance, kill ircecho and disable puppet)
  • Silence icinga on labstore1001, and 1004
  • Disable puppet across misc
  • Make the others share read-only (export as ro) on both labstore1001 and labstore1004 (both, because of the dual mount)
    • service nfs-exportd stop
  • Staged a copy of /etc/export.d as of 1/17, with everything set to ro, at /root/export_d_ro
  • exportfs -ra (will activate a new config)
  • Keep puppet disabled
  • Snapshot and latest sync of the others/misc share (Last one took ~50 minutes)
  • Merge the gerrit patch that removes the mount from labstore1001 (removing the definition from nfs-mount.yaml) and symlinks the mount paths: /mnt/nfs/labstore-secondary-project to /data/project and /mnt/nfs/labstore-secondary-home to /home on tools
  • https://gerrit.wikimedia.org/r/#/c/332735/
  • Run nfs-exportd on labstore-secondary to make sure the mount is exported to all labs hosts <== comes w/ nfs-manage up
  • For each instance - enable and run puppet twice (If necessary, reboot any problematic nodes after running puppet)
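The read-only export step above can be sketched as a small filter, assuming the export files use the standard exports(5) layout of `/path client(opt,opt,...)`. This is only an illustration of the idea; the staged copy at /root/export_d_ro may well have been prepared differently, and the function name `exports_to_ro` is hypothetical.

```shell
# Read exports(5)-style lines on stdin and write them to stdout with the
# "rw" option replaced by "ro" in every client option list.
# Only a standalone "rw" between "(" or "," and "," or ")" is rewritten,
# so paths and other options are left untouched.
exports_to_ro() {
    sed -E 's/([(,])rw([,)])/\1ro\2/g'
}
```

The filter only transforms text; the rewritten files would still have to be put in place and activated with `exportfs -ra`, as in the list above.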

Post Migration

  • Follow up on lists/irc
  • Consider cleaning up the following files/directories under /srv/others:

volumes-without-projects
orphan-volumes
analytics --> not nfs anymore
cvresearch --> deleted project
fatg --> deleted project
deployment-prep --> not nfs anymore
gsoc2014-fonttailor-demo --> deleted project
maps-team --> only scratch currently mounted
wikidata-quality --> deleted project
wikispy --> deleted project
wikidata-query --> only scratch and dumps mounted
wikistats --> not nfs anymore
ircnotifier --> deleted project
megacron --> deleted project?

Success criteria:

  • All labs instances in the list below have /data/project and /home (if mounted) symlinked and mounted from the secondary NFS cluster
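A rough sketch of how the symlink half of this criterion might be verified on one instance. It assumes /data/project and /home are the symlinks and the /mnt/nfs/labstore-secondary-* paths are their targets; the helper name `symlink_points_to` is hypothetical.

```shell
# Succeeds if $1 is a symlink whose immediate target is exactly $2.
# A regular directory or a link to another path makes it fail.
symlink_points_to() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}
```

For example: `symlink_points_to /data/project /mnt/nfs/labstore-secondary-project || echo "not migrated"`.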

Rollback strategy:

  • Disable puppet across tools
  • Merge gerrit patch that will make mounts from labstore1001 present on nfsclient.pp
  • Export misc nfs share as rw on labstore1001
  • Run puppet to have the share from 1001 symlinked to /home and /data/project on clients

List of misc projects with nfs mounted

    • catgraph
    • account-creation-assistance
    • contributors
    • wikidata-topicmaps
    • sugarcrm
    • wikidumpparse
    • video
    • openstack
    • testlabs
    • quarry
    • huggle
    • editor-engagement
    • utrs
    • wmt
    • cvn
    • fastcci
    • toolsbeta
    • project-proxy
    • dumps
    • bots
    • snuggle
    • math
    • wikisource-tools

List of nodes with nfs mounted

  • mwdiffstuff.catgraph.eqiad.wmflabs
  • fridolin.catgraph.eqiad.wmflabs
  • fishbone.catgraph.eqiad.wmflabs
  • sylvester.catgraph.eqiad.wmflabs
  • accounts-dbslave.account-creation-assistance.eqiad.wmflabs
  • accounts-db3.account-creation-assistance.eqiad.wmflabs
  • accounts-mwoauth.account-creation-assistance.eqiad.wmflabs
  • accounts-appserver4.account-creation-assistance.eqiad.wmflabs
  • accounts-db2.account-creation-assistance.eqiad.wmflabs
  • phab-test.contributors.eqiad.wmflabs
  • contributors-metrics.contributors.eqiad.wmflabs
  • wtui-new.wikidata-topicmaps.eqiad.wmflabs
  • officetools.sugarcrm.eqiad.wmflabs
  • wigi.wikidumpparse.eqiad.wmflabs
  • videodev.video.eqiad.wmflabs
  • gfg01.video.eqiad.wmflabs
  • encoding01.video.eqiad.wmflabs
  • encoding02.video.eqiad.wmflabs
  • encoding03.video.eqiad.wmflabs
  • video-redis.video.eqiad.wmflabs
  • t104588-test.openstack.eqiad.wmflabs
  • labs-traefikbuild.openstack.eqiad.wmflabs
  • labs-dynamicproxy-new2.openstack.eqiad.wmflabs
  • labs-dnsrecursor-test.openstack.eqiad.wmflabs
  • labs-bootstrapvz-jessie.openstack.eqiad.wmflabs
  • labs-dynamicproxy-test.openstack.eqiad.wmflabs
  • labs-vmbuilder-trusty.openstack.eqiad.wmflabs
  • labs-vmbuilder-precise.openstack.eqiad.wmflabs
  • saltreactortest.testlabs.eqiad.wmflabs
  • nfs-test-02.testlabs.eqiad.wmflabs
  • util-abogott.testlabs.eqiad.wmflabs
  • relic-relics.testlabs.eqiad.wmflabs
  • quarry-runner-01.quarry.eqiad.wmflabs
  • quarry-runner-02.quarry.eqiad.wmflabs
  • quarry-main-01.quarry.eqiad.wmflabs
  • huggle-d2.huggle.eqiad.wmflabs
  • huggle-pg.huggle.eqiad.wmflabs
  • huggle-win32.huggle.eqiad.wmflabs
  • huggle.huggle.eqiad.wmflabs
  • deferred-changes.editor-engagement.eqiad.wmflabs
  • flow-tests.editor-engagement.eqiad.wmflabs
  • docs.editor-engagement.eqiad.wmflabs
  • mwui.editor-engagement.eqiad.wmflabs
  • utrs-secondary.utrs.eqiad.wmflabs
  • utrs-primary.utrs.eqiad.wmflabs
  • wmt-exec.wmt.eqiad.wmflabs
  • cvn-app7.cvn.eqiad.wmflabs
  • cvn-app6.cvn.eqiad.wmflabs
  • cvn-apache8.cvn.eqiad.wmflabs
  • fastcci-puppetmaster.fastcci.eqiad.wmflabs
  • fastcci-main.fastcci.eqiad.wmflabs
  • fastcci-master.fastcci.eqiad.wmflabs
  • fastcci-worker1.fastcci.eqiad.wmflabs
  • toolsbeta-puppetdb-03.toolsbeta.eqiad.wmflabs
  • toolsbeta-puppetdb-02.toolsbeta.eqiad.wmflabs
  • toolsbeta-valhallasw-puppet-compiler-4.toolsbeta.eqiad.wmflabs
  • toolsbeta-valhallasw-puppet-compiler-3.toolsbeta.eqiad.wmflabs
  • toolsbeta-puppetmaster7.toolsbeta.eqiad.wmflabs
  • test-spm-1.project-proxy.eqiad.wmflabs
  • novaproxy-02.project-proxy.eqiad.wmflabs
  • novaproxy-01.project-proxy.eqiad.wmflabs
  • bugzilla.dumps.eqiad.wmflabs
  • dumps-3.dumps.eqiad.wmflabs
  • dumps-2.dumps.eqiad.wmflabs
  • dumps-1.dumps.eqiad.wmflabs
  • dumps-stats.dumps.eqiad.wmflabs
  • botbot.bots.eqiad.wmflabs
  • wm-bot.bots.eqiad.wmflabs
  • snuggle-enwiki-01.snuggle.eqiad.wmflabs
  • snuggle-en.snuggle.eqiad.wmflabs
  • hadoop000.math.eqiad.wmflabs
  • hadoop002.math.eqiad.wmflabs
  • hadoop003.math.eqiad.wmflabs
  • drmf2017.math.eqiad.wmflabs
  • drmf-beta.math.eqiad.wmflabs
  • drmf2016.math.eqiad.wmflabs
  • math-de.math.eqiad.wmflabs
  • mathosphere.math.eqiad.wmflabs
  • mathoid.math.eqiad.wmflabs
  • math-ru.math.eqiad.wmflabs
  • drmf.math.eqiad.wmflabs
  • mathoid2.math.eqiad.wmflabs
  • wsexport.wikisource-tools.eqiad.wmflabs
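The project list and the node list above can be cross-checked against each other. The sketch below assumes every instance name has the form `<instance>.<project>.eqiad.wmflabs`, as all entries in the node list do; the function name `projects_from_fqdns` is hypothetical.

```shell
# Read instance FQDNs on stdin, one per line, and print the unique project
# names in sorted order. Lines that do not have exactly four dot-separated
# fields are ignored.
projects_from_fqdns() {
    awk -F. 'NF == 4 { print $2 }' | sort -u
}
```

Feeding the node list through this function (with the bullets stripped) should reproduce the project list, so a project that appears in only one of the two lists stands out.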

Event Timeline

Change 329711 had a related patch set uploaded (by Madhuvishy):
nfs: Dual mount misc projects from labstore-secondary cluster

https://gerrit.wikimedia.org/r/329711

@madhuvishy Do you know what editor-engagement is? Is it the project for http://ee-dashboards.wmflabs.org? I'm trying to figure out whether I'll need to do something here :)

@Neil_P._Quinn_WMF This is the link to the project - https://wikitech.wikimedia.org/wiki/Nova_Resource:Editor-engagement. It's described as a project for the Flow and Growth teams, and has the following instances:

deferred-changes.editor-engagement.eqiad.wmflabs
flow-tests.editor-engagement.eqiad.wmflabs
docs.editor-engagement.eqiad.wmflabs
ee-flow-extra.editor-engagement.eqiad.wmflabs
ee-flow.editor-engagement.eqiad.wmflabs
mwui.editor-engagement.eqiad.wmflabs

The proxies hosted from this project are:

mwui.wmflabs.org
ee-flow.wmflabs.org
pronunciationrecording.wmflabs.org
annotator.wmflabs.org
toro.wmflabs.org
piramido.wmflabs.org
proveit.wmflabs.org
e3doc.wmflabs.org
growthdoc.wmflabs.org
ee-flow-extra.wmflabs.org
editor-campaigns.wmflabs.org
editor-campaigns-wkimetrics.wmflabs.org
legoktm.wmflabs.org
ee-flow-mlitn.wmflabs.org
legoktm-de.wmflabs.org
legoktm-fr.wmflabs.org
flow-tests.wmflabs.org
mm-ch.wmflabs.org
ee-prototype.wmflabs.org
cirrustest-phantomcirrus.wmflabs.org
commons-phantomcirrus.wmflabs.org
mobile-phantomcirrus.wmflabs.org
deferred-changes.wmflabs.org

ee-dashboards doesn't seem to be in the list. Let me know if that helps, or if you need anything else :)

@madhuvishy: okay, thanks! None of that looks like it impacts my work. I'm not sure who else one could talk to about that project, but I would guess @Catrope because (1) he's still around and (2) he usually has an impressive grasp of infrastructure like this :)

All the data will be migrated over and shouldn't need any prior action. If any of the services that are writing to /home or /data/project don't seem to be running post migration, they will require manual restarts.

Change 329711 merged by Madhuvishy:
nfs: Dual mount misc projects from labstore-secondary cluster

https://gerrit.wikimedia.org/r/329711

Removed wikidata-dev from list of affected projects - It has nfs-mount turned off explicitly via hiera - https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-dev

Mentioned in SAL (#wikimedia-labs) [2017-01-18T17:09:20Z] <madhuvishy> Silencing shinken for nfs misc migration T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:11:28Z] <madhuvishy> Silenced shinken, and icinga on labstore1001 for misc nfs migration T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:13:01Z] <madhuvishy> Disabling puppet across labs instances with NFS (/home and/or /data/project) mounted for T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:25:26Z] <madhuvishy> Stopping nfs-exportd on labstore1001 T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:28:18Z] <madhuvishy> Exported all misc exports as RO on labstore1001 T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:30:19Z] <madhuvishy> Stopping nfs-exportd on labstore1004 T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:36:45Z] <madhuvishy> Exporting all misc shares from labstore1004 as RO T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:38:14Z] <madhuvishy> Disabling puppet on labstore1001 and 1004 to make sure nfs exports are not overridden T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:39:40Z] <madhuvishy> Starting final sync of latest diff from labstore1001 to labstore-secondary T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T17:54:50Z] <madhuvishy> Disabled (systemctl disable) nfs-export on labstore1001 and 1004 to prevent auto restart from bringing them back up T154336

Mentioned in SAL (#wikimedia-operations) [2017-01-18T22:46:19Z] <madhuvishy> Reenabled nfs-exportd and puppet on labstore1004. All of misc being exported as rw now. T154336

Mentioned in SAL (#wikimedia-labs) [2017-01-19T07:22:57Z] <madhuvishy> Bringing back shinken post T154336

Having trouble with Wikipedia Library Card platform suddenly. It wasn't on the planned list, but we're getting server errors:

Failure:
https://twl-test.wmflabs.org/oauth/login/?next=/users/test_permission/

Server:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Twlight-test.twl.eqiad.wmflabs

Github:
https://github.com/thatandromeda/TWLight

Any reason this would have affected us if we're not on the list anywhere?

Thanks,
Jake (Ocaasi)
Wikipedia Library

Apparently we were missed on the list and therefore not rebooted, so it's being recovered now. Should fix it most likely. Thanks! -Jake

Closing this now. Noting that https://wikitech.wikimedia.org/wiki/Incident_documentation/20170118-Labs happened during this migration, as an unintended side effect of merging https://gerrit.wikimedia.org/r/#/c/332735/ .