Paste P6912

Dumps migration

Authored by madhuvishy on Mar 28 2018, 6:36 AM.
labstore1006 -- Serves cloud VPS NFS traffic
labstore1007 -- Serves stat* NFS, web and rsync mirror traffic
# NFS migration
Starting at 14:00 UTC (7:00 PDT) April 2, 2018
Goals:
1. Migrate Cloud VPS users consuming dumps via /public/dumps on instances from labstore1003 to labstore1006
2. Migrate NFS mounts in stat1005 & 6 from dataset1001 to labstore1007
Migration Plan:
Pre:
* Announce to cloud and analytics/research mailing lists that the migration is happening in 24 hours
PART I Cloud VPS
During:
* Announcements
** Update wikimedia-cloud irc channel
** Last minute mailing list update
* Silence monitoring -- nothing identified to silence so far
* Make sure nfs-kernel-server is up and running on labstore1006 & 7
* Make sure the shares are being exported on both servers
* Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
** Disable puppet on instances with NFS
** Merge puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
* Kill any processes actively accessing /public/dumps
** nfs-mount-manager kill-active /public/dumps
* Absent NFS mount at /public/dumps (served from labstore1003)
** Disable puppet on instances with NFS
** Merge puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
* Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
** Disable puppet on instances with NFS
** Merge puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
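Each step above repeats the same cycle: disable puppet, merge, apply on canaries, roll out. A minimal sketch of that cycle follows. The ssh transport, the host names and all helper names are assumptions for illustration, not the tooling actually used. It relies on the fact that puppet agent --test implies --detailed-exitcodes, so exit code 2 means "changes applied" rather than failure.

```shell
#!/bin/sh
# Sketch only: helper names, host names and the ssh transport are hypothetical.

# remote HOST CMD...: run CMD on HOST, or just print it when DRY_RUN=1.
remote() {
    host=$1; shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "[$host] $*"
    else
        ssh "$host" "$@"
    fi
}

# puppet agent --test exit codes: 0 = no changes, 2 = changes applied.
# Anything else (1, 4, 6) indicates a failed run.
puppet_run_ok() {
    [ "$1" -eq 0 ] || [ "$1" -eq 2 ]
}

# disable_all HOSTS: stop agents from picking up the patch before it is tested.
disable_all() {
    for h in $1; do
        remote "$h" sudo puppet agent --disable dumps-nfs-migration
    done
}

# apply_patch HOST: re-enable the agent, run it once, fail on a bad run.
apply_patch() {
    remote "$1" sudo puppet agent --enable || return 1
    rc=0
    remote "$1" sudo puppet agent --test || rc=$?
    puppet_run_ok "$rc"
}

# rollout CANARIES REST: canaries first and stop on the first canary failure;
# failures in the wider fleet are reported but do not stop the loop.
rollout() {
    for h in $1; do
        apply_patch "$h" || { echo "canary $h failed, stopping" >&2; return 1; }
    done
    for h in $2; do
        apply_patch "$h" || echo "WARN: $h failed" >&2
    done
}
```

Usage, with placeholder host names: run disable_all "canary-1 host-2 host-3" before merging, then rollout "canary-1" "host-2 host-3" after the merge. In practice there should be a pause for manual testing between the canary loop and the wider rollout; the sketch leaves that to the operator.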
Success Criteria:
* Instances can successfully read from /public/dumps
** test across instances with dumps mounted
*** head -n 1 /public/dumps/public/liwiki/latest/liwiki-latest-abstract.xml
*** head -n 1 /public/dumps/pagecounts-raw/index.html
*** head -n 1 /public/dumps/incr/enwiki/20180327/status.txt
*** head -n 1 /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000
*** head -n 1 /public/dumps/pagecounts-all-sites/README.txt
* labstore1006 -- load on the server is normal (monitor for at least a couple of hours)
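The read tests above can be wrapped in one helper so the same check runs unchanged on every instance. This is a sketch; check_first_line is a hypothetical name, and the sample paths are the ones listed in the success criteria.

```shell
#!/bin/sh
# check_first_line PATH...: hypothetical helper. Tries to read the first line of
# each path, prints OK or FAIL per path, and returns 1 if any read failed.
check_first_line() {
    rc=0
    for path in "$@"; do
        if head -n 1 "$path" >/dev/null 2>&1; then
            echo "OK $path"
        else
            echo "FAIL $path"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on an instance with dumps mounted:
#   check_first_line \
#       /public/dumps/public/liwiki/latest/liwiki-latest-abstract.xml \
#       /public/dumps/pagecounts-raw/index.html \
#       /public/dumps/incr/enwiki/20180327/status.txt \
#       /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000 \
#       /public/dumps/pagecounts-all-sites/README.txt
```

A read against an unresponsive NFS server can block rather than fail, so a hang here is itself a signal that the mount needs attention.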
Post (if success):
* Announce all clear to mailing lists & IRC
* Remove the dumps export from labstore1003 --
* Clean up labstore1003 dumps mount code in nfsclient.pp --
* Stop dumps rsync jobs that sync to labstore1003 --
Rollback plan:
* Kill any new processes actively accessing /public/dumps
** nfs-mount-manager kill-active /public/dumps
* [UNDO] Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
** Disable puppet on instances with NFS
** Revert puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
* [POTENTIALLY UNDO] Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
** Disable puppet on instances with NFS
** Revert puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
* [UNDO] Absent NFS mount at /public/dumps (served from labstore1003)
** Disable puppet on instances with NFS
** Revert puppet patch -
** Apply patch on canaries and test
** Roll out to all instances
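Because the plan switches /public/dumps between a direct NFS mount and a symlink into /mnt/nfs, it helps during a rollback to confirm which state an instance is actually in. The sketch below reports that; path_state is a hypothetical helper name.

```shell
#!/bin/sh
# path_state PATH: hypothetical helper that reports how PATH is currently provided.
# The symlink test comes first so that a link pointing at a mount is reported as a link.
path_state() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif mountpoint -q "$1" 2>/dev/null; then
        echo "mountpoint"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "missing"
    fi
}

# Usage:
#   path_state /public/dumps
```

After a completed migration the expected answer is a symlink into /mnt/nfs; after a completed rollback it is a mountpoint again.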
PART II stat*
During:
* Announcements
** Last minute analytics mailing list update
* Silence monitoring -- need to check if there's anything set up on stat*
* Mount NFS shares from labstore1006 & 7 on stat1005|6 at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
** Disable puppet on stat1005|6
** Merge puppet patch -
** Apply patch on stat* and test
* Kill any processes actively accessing /mnt/data
** find with lsof +f -- /mnt/data
* Absent NFS mount at /mnt/data (served from dataset1001)
** Disable puppet on stat1005|6
** Merge puppet patch -
** Apply patch on stat* and test
* Set up /mnt/data as a symlink to the active NFS mount from labstore1007
** Disable puppet on stat1005|6
** Merge puppet patch -
** Apply patch on stat* and test
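For the "kill any processes" step on stat1005|6, the lsof call above can feed a small signalling helper. This is a sketch with hypothetical helper names; -t makes lsof print bare PIDs, and DRY_RUN=1 prints the kill commands instead of running them.

```shell
#!/bin/sh
# pids_using PATH: PIDs with files open on the file system that contains PATH.
# +f -- PATH makes lsof treat PATH as a file system, as in the step above.
pids_using() {
    lsof -t +f -- "$1" 2>/dev/null | sort -un
}

# signal_pids SIGNAL: read PIDs on stdin, one per line, and signal each one once.
# With DRY_RUN=1 the kill commands are printed instead of executed.
signal_pids() {
    sort -un | while read -r pid; do
        [ -n "$pid" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "kill -$1 $pid"
        else
            kill "-$1" "$pid" 2>/dev/null || echo "could not signal $pid" >&2
        fi
    done
}

# Usage: ask nicely first, then force whatever is still holding files open.
#   pids_using /mnt/data | signal_pids TERM
#   sleep 10
#   pids_using /mnt/data | signal_pids KILL
```

Running it once with DRY_RUN=1 first shows which processes would be affected before anything is signalled.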
Success Criteria:
* stat1005 & 6 can successfully read from /mnt/data
** head -n 1 /mnt/data/public/liwiki/latest/liwiki-latest-abstract.xml
* labstore1007 -- load on the server is normal (monitor for at least a couple of hours)
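The load criterion can be watched with a small loop run on the labstore server itself. This is a sketch: the helper names are hypothetical, and the threshold is an assumption, since the plan does not define what "normal" load is.

```shell
#!/bin/sh
# load_below MAX [FILE]: succeed when the 1-minute load average is below MAX.
# FILE defaults to /proc/loadavg (Linux); the argument exists so the check can
# be exercised against a saved sample.
load_below() {
    awk -v max="$1" '{ exit !($1 < max) }' "${2:-/proc/loadavg}"
}

# watch_load MAX INTERVAL COUNT: sample COUNT times, INTERVAL seconds apart,
# and report every sample at or above MAX. Returns 1 if any sample was high.
watch_load() {
    i=0
    bad=0
    while [ "$i" -lt "$3" ]; do
        if ! load_below "$1"; then
            echo "$(date -u +%H:%M:%S) load at or above $1: $(cut -d' ' -f1-3 /proc/loadavg)"
            bad=1
        fi
        i=$((i + 1))
        [ "$i" -lt "$3" ] && sleep "$2"
    done
    return "$bad"
}

# Usage: two hours of one-minute samples against a placeholder threshold of 8.
#   watch_load 8 60 120
```

The same loop applies to labstore1006 in Part I; a sensible threshold depends on each server's core count and its usual baseline.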
Post (if success):
* Announce all clear to mailing lists
* Clean up dataset1001 dumps mount code in statistics::dataset_mount --
* Remove the dumps NFS export from dataset1001 --
Rollback plan:
* Kill any processes actively accessing /mnt/data
** find with lsof +f -- /mnt/data
* [UNDO] Set up /mnt/data as a symlink to the active NFS mount from labstore1007
** Disable puppet on stat1005|6
** Revert puppet patch -
** Apply patch on stat* and test
* [POSSIBLY UNDO] Mount NFS shares from labstore1006 & 7 on stat1005|6 at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
** Disable puppet on stat1005|6
** Revert puppet patch -
** Apply patch on stat* and test
* [UNDO] Absent NFS mount at /mnt/data (served from dataset1001)
** Disable puppet on stat1005|6
** Revert puppet patch -
** Apply patch on stat* and test
