Blockers:
- T144255 Migrate tools and misc data to labstore1004
- T127567 revise/fix labstore replicate backup jobs
- T146153 Performance test new secondary labstore HA cluster
Related:
- T144633 Setup monitoring for secondary HA cluster
MIGRATION PLAN
Prior to the maintenance window
- Set up NFS kernel server on labstore1004/5 (DRBD Primary) [DONE]
- Define tools as a labstore-secondary mount [DONE]
- Mount tools from labstore1001 and labstore-secondary simultaneously (need to add tools as a new mount name in nfs-mounts.yaml)
- Final sync of tools data 24 hours before migration
During migration:
- Notify the mailing lists and IRC channel at the start of the migration
- Send a wall message on both bastions
- Silence shinken alarms (on the shinken-01 instance: kill ircecho and disable puppet)
- Silence all of toolschecker (https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=checker.tools.wmflabs.org)
- Silence icinga on labstore1001
- Disable puppet across tools
- K8S Master - Unmount NFS (/usr/local/sbin/nfs-mount-manager clean), stop process that depends on NFS
- Grid Master/Shadow - Stop the grid master process
- Deploy webservice debian package that sends logs to /dev/null (T149946)
- Restart all processes (running webservices) to apply new webservice package
- Make the tools share on labstore1001 read-only (the remount failed, so the share was changed to export as ro instead)
- Take a snapshot and run the last sync of the tools share
- Stop cron on tools-cron-01 (service cron stop)
- Run nfs-exportd on labstore-secondary to make sure the mount is exported to all tools hosts (this comes with nfs-manage up)
- Send a wall message that the bastions are moving from ro mode to offline, and set /etc/nologin (root-only login) on tools-bastion-03 for testing
- Merge the gerrit patch that removes the mount from labstore1001 (removing the definition from nfs-mounts.yaml) and symlinks the mount paths on tools: /mnt/nfs/labstore-secondary-tools/project to /data/project and /mnt/nfs/labstore-secondary-tools/home to /home
- Bring grid master back online
- For each tools worker/exec host: depool/cordon the host, enable and run puppet, then repool/uncordon the host (if necessary, reboot any problematic exec nodes after running puppet)
- Mount NFS on k8s master, restart process that depends on NFS
- Start the grid shadow process (open issue: the shadow process doesn't start)
- Start cron on tools-cron-01 (service cron start)
- Bring shinken back up
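The per-host depool/puppet/repool step above is easy to get out of order when done by hand across many hosts. A minimal sketch of the ordering follows. It assumes that k8s workers are the hosts named tools-worker-* (and are cordoned), while every other host is a grid node (and is depooled). The action strings are hypothetical labels for a checklist, not real commands.

```python
from typing import Iterable, List, Tuple


def rolling_plan(hosts: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (host, action) pairs, one host fully finished before the next.

    Assumption: hosts named tools-worker-* are k8s workers (cordon/uncordon);
    everything else is treated as a grid node (depool/repool).
    The action strings are illustrative labels only.
    """
    plan: List[Tuple[str, str]] = []
    for host in hosts:
        is_k8s = host.startswith("tools-worker-")
        plan.append((host, "cordon" if is_k8s else "depool"))
        plan.append((host, "enable puppet"))
        plan.append((host, "run puppet"))
        plan.append((host, "uncordon" if is_k8s else "repool"))
    return plan


if __name__ == "__main__":
    for host, action in rolling_plan(["tools-worker-1001", "tools-exec-1401"]):
        print(f"{host}: {action}")
```

Handling one host at a time keeps the rest of the pool serving jobs while each node picks up the new mount.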
Post Migration
- Follow up on lists/irc
Success criteria:
- Submitting jobs using qsub succeeds (should set up a test job)
- Submitting a cron succeeds
- Submitting a job via mail succeeds https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Mail_to_tools (ignoring for now; we'll fix it if someone complains that it is broken)
- Submitting jobs to k8s succeeds (should set up a test job)
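Alongside the criteria above, a quick per-host sanity check can confirm that the symlinks from the migration step point where expected. This is a sketch only. It assumes /data/project and /home are the symlinks and the labstore-secondary-tools paths are their targets. The `root` parameter is a hypothetical prefix added so the check can be tried against a scratch directory.

```python
import os
from typing import Dict, List

# Assumed direction: link path -> expected target, per the migration step.
EXPECTED_LINKS: Dict[str, str] = {
    "/data/project": "/mnt/nfs/labstore-secondary-tools/project",
    "/home": "/mnt/nfs/labstore-secondary-tools/home",
}


def check_links(expected: Dict[str, str], root: str = "") -> List[str]:
    """Return a list of problems; an empty list means every link matches."""
    problems: List[str] = []
    for link, target in expected.items():
        path = root + link
        if not os.path.islink(path):
            problems.append(f"{link}: not a symlink")
            continue
        actual = os.readlink(path)
        if actual != target:
            problems.append(f"{link}: points to {actual}, expected {target}")
    return problems


if __name__ == "__main__":
    for problem in check_links(EXPECTED_LINKS):
        print(problem)
```

The check only compares link targets. It does not verify that the NFS mount behind the target is actually live.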
Rollback strategy:
- Disable puppet across tools
- K8S Master - Unmount NFS, stop process that depends on NFS
- Grid Master/Shadow - Stop the grid master process
- Merge gerrit patch that will make mounts from labstore1001 present on nfsclient.pp
- Make the tools NFS share read-write on labstore1001
- Run puppet to have the share from labstore1001 symlinked to /home and /data/project on clients
Notes
List of nodes with NFS mounted
- Trusty grid exec nodes: tools-exec-1401 - 1420
- Precise grid exec nodes: tools-exec-1201 - 1221
- Trusty grid webgrid-lighttpd nodes: tools-webgrid-lighttpd-1401 - 1418
- Precise grid webgrid-lighttpd nodes: tools-webgrid-lighttpd-1201 - 1210
- Trusty grid webgrid-generic nodes: tools-webgrid-generic-1401 - 1404
- K8S worker nodes: tools-worker-1001 - 1025
- Tools static nodes: tools-static-10 - 11
- Tools bastions: tools-bastion-02, tools-bastion-03, tools-bastion-05
- Other grid submit nodes: tools-cron-01, tools-mail-01, tools-exec-gift, tools-exec-cyberbot, tools-mail
- Grid master/shadow: tools-grid-master, tools-grid-shadow
- K8S master: tools-k8s-master-01
- Tools-checker: tools-checker-01, tools-checker-02
- Tools services: tools-services-01, tools-services-02
- Other: tools-precise-dev
List of nodes without NFS mounted
- tools-puppetmaster-01
- tools-puppetmaster-02
- tools-prometheus-01
- tools-prometheus-02
- tools-logs-02
- tools-docker-builder-01
- tools-docker-builder-02
- tools-docker-registry-01
- tools-flannel-etcd-03
- tools-flannel-etcd-02
- tools-flannel-etcd-01
- tools-k8s-etcd-03
- tools-k8s-etcd-02
- tools-k8s-etcd-01
- tools-redis-1002
- tools-redis-1001
- tools-elastic-03
- tools-elastic-02
- tools-elastic-01
- tools-proxy-02
- tools-proxy-01
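The ranges in the list of NFS-mounted nodes (for example "tools-exec-1401 - 1420") have to be expanded into individual host names before they can be fed to any per-host loop. A minimal sketch of that expansion follows. `expand_range` is a hypothetical helper, and it assumes a range is always written as a full host name, a dash surrounded by spaces, and the final number.

```python
import re
from typing import List

# A range looks like "<prefix><start> - <end>"; the spaces around the dash
# distinguish it from the hyphens inside host names.
_RANGE = re.compile(r"\s*(.*?)(\d+)\s+-\s+(\d+)\s*")


def expand_range(spec: str) -> List[str]:
    """Expand 'tools-exec-1401 - 1420' into individual host names.

    A spec that is not a range (a single host name) is returned unchanged.
    """
    match = _RANGE.fullmatch(spec)
    if match is None:
        return [spec.strip()]
    prefix, start, end = match.group(1), match.group(2), match.group(3)
    width = len(start)  # keep any zero padding used in the first number
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


if __name__ == "__main__":
    print(expand_range("tools-static-10 - 11"))
    # prints ['tools-static-10', 'tools-static-11']
```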