
Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking]
Closed, Resolved (Public)

Description

Blockers:

  • T144255 Migrate tools and misc data to labstore1004
  • T127567 revise/fix labstore replicate backup jobs
  • T146153 Performance test new secondary labstore HA cluster

Related:

  • T144633 Setup monitoring for secondary HA cluster

MIGRATION PLAN

Prior to the maintenance window:
  • Set up NFS kernel server on labstore1004/5 (DRBD Primary) [DONE]
  • Define tools as a labstore-secondary mount [DONE]
  • Mount tools from labstore1001 and labstore-secondary simultaneously (need to add tools as a new mount name in nfs-mounts.yaml)
  • Final sync of tools data 24 hours before migration
During migration:
  • Update lists/irc channel on start of migration
  • wall a message to both bastions
  • Silence shinken alarms (on the shinken-01 instance, kill ircecho and disable puppet)
  • Silence all of toolschecker (https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=checker.tools.wmflabs.org)
  • Silence icinga on labstore1001
  • Disable puppet across tools
  • K8S Master - Unmount NFS (/usr/local/sbin/nfs-mount-manager clean), stop processes that depend on NFS
  • Grid Master/Shadow - Stop the grid master process
  • Deploy webservice debian package that sends logs to /dev/null (T149946)
  • Restart all processes (running webservices) to apply new webservice package
  • Make the tools share on labstore1001 read-only (the remount failed, so the share was exported as ro instead)
  • Snapshot the tools share and run the last sync
  • Stop cron on tools-cron-01 (service cron stop)
  • Run nfs-exportd on labstore-secondary to make sure the mount is exported to all tools hosts (comes with nfs-manage up)
  • wall a message that the bastions are going offline from ro mode; set /etc/nologin (root only) on 03 for testing
  • Merge the gerrit patch that removes the mount from labstore1001 (removing its definition from nfs-mounts.yaml) and symlinks /data/project to /mnt/nfs/labstore-secondary-tools/project and /home to /mnt/nfs/labstore-secondary-tools/home on tools
  • Bring grid master back online
  • For each tools worker/exec host: depool/cordon the host, enable and run puppet, then repool/uncordon it (if necessary, reboot any problematic exec nodes after running puppet)
  • Mount NFS on k8s master, restart processes that depend on NFS
  • Start grid shadow process (open question: the shadow process doesn't start)
  • Start cron on tools-cron-01 (service cron start)
  • Bring shinken back up
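The per-host step above (depool/cordon, run puppet, repool/uncordon) can be sketched as a loop. This is only an illustration: `rolling_update` and its three callables are hypothetical names, and the actual depool, puppet, and repool commands are left to the caller because the plan does not spell them out. In this sketch a host where puppet fails is left depooled, so it can be inspected or rebooted by hand.

```python
from typing import Callable, Iterable, List


def rolling_update(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    run_puppet: Callable[[str], bool],
    repool: Callable[[str], None],
) -> List[str]:
    """Depool, run puppet on, and repool each host, one at a time.

    The callables are placeholders for whatever the operator uses
    (hypothetical; not taken from the task). run_puppet should return
    True on success. Hosts where puppet fails stay depooled and are
    returned so they can be handled manually.
    """
    failed = []
    for host in hosts:
        depool(host)
        if run_puppet(host):
            repool(host)
        else:
            # Leave the host out of rotation for manual follow-up.
            failed.append(host)
    return failed
```

Working through one host at a time keeps most of the grid or cluster in service while each node picks up the new mounts.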
Post migration:
  • Follow up on lists/irc
Success criteria:
Rollback strategy:
  • Disable puppet across tools
  • K8S Master - Unmount NFS, stop processes that depend on NFS
  • Grid Master/Shadow - Stop the grid master process
  • Merge the gerrit patch that makes the mounts from labstore1001 present in nfsclient.pp
  • Make the tools NFS share read-write on labstore1001
  • Run puppet so the share from labstore1001 is symlinked to /home and /data/project on clients
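Both the cutover and the rollback end with /home and /data/project symlinked to an NFS mount, so a client-side check might look like the sketch below. `check_symlinks` is a hypothetical helper, not part of the tooling named in this task; the expected targets are passed in by the caller, since the labstore1001 mount path is not stated here.

```python
import os
from typing import Dict, List


def check_symlinks(expected: Dict[str, str]) -> List[str]:
    """Return the link paths that do not resolve to their expected target.

    `expected` maps a symlink path to the directory it should point at.
    A path that is not a symlink at all is also reported.
    """
    wrong = []
    for link, target in expected.items():
        resolved_ok = os.path.realpath(link) == os.path.realpath(target)
        if not os.path.islink(link) or not resolved_ok:
            wrong.append(link)
    return wrong
```

After the cutover, the mapping would be {"/data/project": "/mnt/nfs/labstore-secondary-tools/project", "/home": "/mnt/nfs/labstore-secondary-tools/home"}; an empty result means both links resolve as planned.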

Notes

List of nodes with NFS mounted
  • Trusty grid exec nodes: tools-exec-1401 - 1420
  • Precise grid exec nodes: tools-exec-1201 - 1221
  • Trusty grid webgrid-lighttpd nodes: tools-webgrid-lighttpd-1401 - 1418
  • Precise grid webgrid-lighttpd nodes: tools-webgrid-lighttpd-1201 - 1210
  • Trusty grid webgrid-generic nodes: tools-webgrid-generic-1401 - 1404
  • K8S worker nodes: tools-worker-1001 - 1025
  • Tools static nodes: tools-static-10 - 11
  • Tools bastions: tools-bastion-02, tools-bastion-03, tools-bastion-05
  • Other grid submit nodes: tools-cron-01, tools-mail-01, tools-exec-gift, tools-exec-cyberbot, tools-mail
  • Grid master/shadow: tools-grid-master, tools-grid-shadow
  • K8S master: tools-k8s-master-01
  • Tools-checker: tools-checker-01, tools-checker-02
  • Tools services: tools-services-01, tools-services-02
  • Other: tools-precise-dev
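The ranges in the list above (for example "tools-exec-1401 - 1420") have to be expanded into individual hostnames before they can be fed to any per-host loop. A minimal sketch, assuming every range follows the "prefix-START - END" shape used here; `expand_range` is a hypothetical helper name.

```python
import re
from typing import List


def expand_range(spec: str) -> List[str]:
    """Expand 'prefix-START - END' into individual hostnames.

    A spec without a trailing 'START - END' range is returned unchanged
    as a single-element list.
    """
    spec = spec.strip()
    match = re.fullmatch(r"(.*?)(\d+)\s*-\s*(\d+)", spec)
    if not match:
        return [spec]
    prefix, start, end = match.groups()
    width = len(start)  # keep zero padding, if any
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]
```

For instance, expand_range("tools-static-10 - 11") gives tools-static-10 and tools-static-11, while a single name such as tools-bastion-02 passes through untouched.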
List of nodes with NFS not mounted
  • tools-puppetmaster-01
  • tools-puppetmaster-02
  • tools-prometheus-01
  • tools-prometheus-02
  • tools-logs-02
  • tools-docker-builder-01
  • tools-docker-builder-02
  • tools-docker-registry-01
  • tools-flannel-etcd-03
  • tools-flannel-etcd-02
  • tools-flannel-etcd-01
  • tools-k8s-etcd-03
  • tools-k8s-etcd-02
  • tools-k8s-etcd-01
  • tools-redis-1002
  • tools-redis-1001
  • tools-elastic-03
  • tools-elastic-02
  • tools-elastic-01
  • tools-proxy-02
  • tools-proxy-01

Event Timeline

madhuvishy renamed this task from Migrate tools and misc(others) to secondary labstore HA cluster to Migrate tools and misc(others) to secondary labstore HA cluster [tracking]. Sep 20 2016, 3:09 PM
madhuvishy renamed this task from Migrate tools and misc(others) to secondary labstore HA cluster [tracking] to Migrate tools to secondary labstore HA cluster (Scheduled on 11/2) [tracking]. Oct 28 2016, 3:10 PM
madhuvishy renamed this task from Migrate tools to secondary labstore HA cluster (Scheduled on 11/2) [tracking] to Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking]. Nov 1 2016, 7:01 PM

We're rescheduling this from 11/2 (previously announced) to 11/14 tentatively, since labstore2001 is still having issues.

Change 321001 had a related patch set uploaded (by Madhuvishy):
labstore: Dual mount tools from labstore1001 and labstore-secondary

https://gerrit.wikimedia.org/r/321001

Mentioned in SAL (#wikimedia-labs) [2016-11-11T19:29:04Z] <madhuvishy> Disabling puppet across tools to dual mount tools share from labstore-secondary T146154

Change 321001 merged by Madhuvishy:
labstore: Dual mount tools from labstore1001 and labstore-secondary

https://gerrit.wikimedia.org/r/321001

Mentioned in SAL (#wikimedia-labs) [2016-11-11T20:18:03Z] <madhuvishy> Rolling out dual mount of tools share across all hosts T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-11T20:49:18Z] <madhuvishy> Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-14T16:14:27Z] <madhuvishy> Disabling puppet across tools T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-14T16:18:33Z] <madhuvishy> Stopped irc-echo and puppet on shinken-01 for T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-14T16:30:58Z] <madhuvishy> Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-14T18:24:51Z] <madhuvishy> Tools NFS is read-only. /data/project and /home across tools are ro T146154

Mentioned in SAL (#wikimedia-labs) [2016-11-14T19:35:50Z] <madhuvishy> Stopped cron on tools-cron-01 (T146154)

Change 321556 had a related patch set uploaded (by Madhuvishy):
labstore: Symlink /data/project and /home on tools to mounts from labstore-secondary

https://gerrit.wikimedia.org/r/321556

Change 321556 merged by Madhuvishy:
labstore: Symlink /data/project and /home on tools to mounts from labstore-secondary

https://gerrit.wikimedia.org/r/321556

chasemp claimed this task.

Some fallout here: T150829 (Tools Docker Registry is Dead). I'm also looking at addressing an issue where the bind mount may decide it has a stale handle (https://gerrit.wikimedia.org/r/#/c/321786/), but overall this has crossed the point of no return.