Task for actual migration
Description
Details
Status | Assigned | Task
---|---|---
Resolved | bd808 | T166402 Program 7 Outcome 3: data services
Resolved | ArielGlenn | T182540 get dataset1001, ms1001 ready for decommission
Resolved | madhuvishy | T168486 Migrate customer-facing Dumps endpoints to Cloud Services
Resolved | madhuvishy | T188643 Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7
Event Timeline
Initial PoC patch for nfsclient.pp changes https://gerrit.wikimedia.org/r/#/c/403767/1
Change 403767 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfsclient: Setup dumps mounts from new servers
Change 422848 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Absent /public/dumps mount served from labstore1003
Change 422867 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Set up symlinks on instances under /public/dumps
Mentioned in SAL (#wikimedia-operations) [2018-04-02T14:28:56Z] <madhuvishy> Disabling puppet across VPS instances with dumps mounted (https://phabricator.wikimedia.org/P6921) T188643
Change 403767 merged by Madhuvishy:
[operations/puppet@production] nfsclient: Setup dumps mounts from new servers
Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:06:38Z] <madhuvishy> Reenabled puppet and rolled out mounting new dumps NFS shares from labstore1006|7 on VPS instances T188643
Change 422848 merged by Madhuvishy:
[operations/puppet@production] dumps: Absent /public/dumps mount served from labstore1003
Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:59:49Z] <madhuvishy> Absenting /public/dumps mount from labstore1003 across the VPS fleet T188643
Change 422867 merged by Madhuvishy:
[operations/puppet@production] dumps: Set up symlinks on instances under /public/dumps
Mentioned in SAL (#wikimedia-operations) [2018-04-02T16:37:48Z] <madhuvishy> Rolling out new symlinks to /public/dumps for labstore1006 dumps nfs mount T188643
Notes from migration plan doc:
labstore1006 -- Serves cloud VPS NFS traffic
labstore1007 -- Serves stat* NFS, web and rsync mirror traffic
NFS migration
Starting at 14:00 UTC (7:00 PDT) April 2, 2018
Goals:
- Migrate Cloud VPS users consuming dumps via /public/dumps on instances from labstore1003 to labstore1006
- Migrate NFS mounts in stat1005 & 6 from dataset1001 to labstore1007
Migration Plan:
Pre:
- [DONE] Announce to cloud and analytics/research mailing lists that the migration is happening in 24 hours <- maybe announce Friday early, for folks who don't pay attention to mail on the weekends?
PART I Cloud VPS
Useful commands:
To generate list of hosts with dumps NFS share enabled
madhuvishy@labpuppetmaster1001:~$ sudo nfs-hostlist -s dumps -f dumps-hosts
To target with cumin hosts with dumps enabled
madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' 'sudo date'
Disable puppet
madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
Enable puppet
madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
Dumps enabled VPS instances - https://phabricator.wikimedia.org/P6921
Canary hosts
madhuvishy@labpuppetmaster1001:~$ cat > canaries
tools-worker-1026.tools.eqiad.wmflabs
toolsbeta-paws-worker-1003.toolsbeta.eqiad.wmflabs
tools-bastion-05.tools.eqiad.wmflabs
tools-exec-1442.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
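The `cat > canaries` step above reads the host list from the terminal. A minimal sketch of a non-interactive equivalent using a here-document follows; `write_canaries` is a hypothetical helper name, not part of the migration tooling, and the output path is a parameter so it can be tried in a scratch directory first.

```shell
# Write the canary host list from this plan to a file (default: ./canaries),
# so it can be referenced the same way the plan does with 'F{canaries}'.
# write_canaries is a hypothetical helper, not an existing tool.
write_canaries() {
    cat > "${1:-canaries}" <<'EOF'
tools-worker-1026.tools.eqiad.wmflabs
toolsbeta-paws-worker-1003.toolsbeta.eqiad.wmflabs
tools-bastion-05.tools.eqiad.wmflabs
tools-exec-1442.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
EOF
}
```

Calling `write_canaries` with no argument writes `canaries` in the current directory, matching the file name used by the commands in this plan.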
During:
- Announcements
- [DONE] Update wikimedia-cloud irc channel
- [DONE] Last minute mailing list update
- Silencing monitoring
- [DONE] dumps_read_check on tools-checker - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org
- [DONE] Make sure nfs-kernel-server is up and running on labstore1006 & 7
- [DONE] Make sure the shares are being exported on both servers
- Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps (options are in https://gerrit.wikimedia.org/r/c/403767/)
- [DONE] Disable puppet on instances with NFS
- sudo cumin -b 10 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/403767/
- [DONE] Apply patch on canaries and test
- sudo cumin -b 1 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 1 'F{canaries}' 'run-puppet-agent'
- df output
labstore1006.wikimedia.org:/dumps nfs4 66T 38T 25T 60% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/dumps nfs4 65T 38T 24T 62% /mnt/nfs/dumps-labstore1007.wikimedia.org
- [DONE] Roll out to all instances
- sudo cumin -b 10 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 10 'F{dumps-hosts}' "run-puppet-agent"
- sudo cumin -b 10 'F{dumps-hosts}' 'test -e /mnt/nfs/dumps-labstore1006.wikimedia.org/xmldatadumps/public/enwiki/latest/enwiki-latest-abstract11.xml.gz'
- sudo cumin -b 10 'F{dumps-hosts}' 'test -e /mnt/nfs/dumps-labstore1007.wikimedia.org/xmldatadumps/public/enwiki/latest/enwiki-latest-abstract11.xml.gz'
- wikidata-dev has NFS turned off explicitly through https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-dev; wikidata-lexeme.wikidata-dev still has the older mounts. Noting here for cleanup later.
- [DONE] Kill any processes actively accessing /public/dumps
- Possibly - nfs-mount-manager kill-active /public/dumps
- Only tools.yifeibot was accessing /public/dumps. Asked Yifei to stop the bot temporarily :)
<At this point all instances that have dumps are successfully mounting labstore1003/1006/1007>
</public/dumps is not currently a symlink so swap out the mount for symlink to new data>
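The note above says /public/dumps is a real mount at this stage and becomes a symlink later. A minimal sketch of how one might confirm which state an instance is in, before and after the swap, is shown here. `path_kind` is a hypothetical helper, not part of the migration tooling, and it assumes `mountpoint` from util-linux is available (it falls back to a plain directory check if not).

```shell
# Classify a path so the mount-to-symlink swap can be verified.
# Prints one of: symlink, missing, mountpoint, directory, other.
# path_kind is a hypothetical helper for illustration only.
path_kind() {
    if [ -L "$1" ]; then
        # Check symlinks first: a dangling symlink fails the -e test below.
        echo symlink
    elif [ ! -e "$1" ]; then
        echo missing
    elif command -v mountpoint >/dev/null 2>&1 && mountpoint -q "$1"; then
        echo mountpoint
    elif [ -d "$1" ]; then
        echo directory
    else
        echo other
    fi
}
```

On an instance, `path_kind /public/dumps` would be expected to report `mountpoint` while the labstore1003 share is still mounted and `symlink` once the symlink patch has been applied.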
- Absent NFS mount at /public/dumps (served from labstore1003)
- [DONE] Disable puppet on instances with NFS <-- do we know puppet is working on all instances? what about ssh in? this is typically a horror show. or do you mean something else here? (yup, i've checked the instance puppet runs)
- sudo cumin -b 20 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/422848/
- [DONE] Apply patch on canaries and test
- sudo cumin -b 5 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 5 'F{canaries}' 'run-puppet-agent'
- [DONE] Roll out to all instances
- sudo cumin -b 20 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 20 'F{dumps-hosts}' "run-puppet-agent"
- Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
- [DONE] Disable puppet on instances with NFS
- sudo cumin -b 20 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/422867/
- [DONE] Apply patch on canaries and test
- sudo cumin -b 5 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 5 'F{canaries}' 'run-puppet-agent'
- sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/public/liwiki/latest/liwiki-latest-pages-articles.xml.bz2-rss.xml'
- sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pagecounts-raw/index.html'
- sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/incr/enwiki/20180327/status.txt'
- sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000'
- sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pagecounts-all-sites/README.txt'
- [DONE] Roll out to all instances
- sudo cumin -b 20 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
- sudo cumin -b 20 'F{dumps-hosts}' "run-puppet-agent"
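Each of the three phases above follows the same sequence: disable puppet fleet-wide, merge the patch, re-enable and run puppet on the canaries, verify, then roll out to everything. A sketch of that sequence as a dry-run helper is given here. It only prints the commands already used in this plan and runs nothing; `print_phase` and its parameters are hypothetical names chosen for illustration.

```shell
# Print (do not run) the cumin commands for one rollout phase.
# $1: reason string for disable-puppet/enable-puppet
# $2: batch size for the whole fleet (e.g. 20)
# $3: batch size for the canaries (e.g. 5)
# print_phase is a hypothetical helper; it emits text only.
print_phase() {
    reason=$1
    fleet_batch=$2
    canary_batch=$3
    printf '%s\n' \
        "sudo cumin -b $fleet_batch 'F{dumps-hosts}' \"disable-puppet '$reason'\"" \
        "# merge the puppet patch for this phase here" \
        "sudo cumin -b $canary_batch 'F{canaries}' \"enable-puppet '$reason'\"" \
        "sudo cumin -b $canary_batch 'F{canaries}' 'run-puppet-agent'" \
        "# verify the canaries before continuing" \
        "sudo cumin -b $fleet_batch 'F{dumps-hosts}' \"enable-puppet '$reason'\"" \
        "sudo cumin -b $fleet_batch 'F{dumps-hosts}' 'run-puppet-agent'"
}
```

For example, `print_phase "Dumps migration in progress - T168486- ${USER}" 20 5` prints the command list for the absent-mount and symlink phases, which can be reviewed before anything is pasted into a shell on the puppetmaster.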
Success Criteria:
- [DONE] Instances can successfully read from /public/dumps
- test across instances with dumps mounted
- sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/public/liwiki/latest/liwiki-latest-pages-articles.xml.bz2-rss.xml'
- sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pagecounts-raw/index.html'
- sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/incr/enwiki/20180327/status.txt'
- sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000'
- sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pagecounts-all-sites/README.txt'
- [DONE] Dumps read check https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org is OK
- labstore1006 - load on the server is normal (monitor for at least a couple of hours) -- All good right now, nothing much is going on though
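The five `head -n 1` checks in the success criteria can also be run as one local loop on a single instance. The sketch below assumes the same sample paths listed above, which are dated to this migration and would need updating for later use. `check_dumps_root` is a hypothetical helper name, and the root directory is a parameter so it can be pointed at either /public/dumps or one of the /mnt/nfs mounts.

```shell
# Try to read the first line of each sample file under a dumps root.
# Prints "FAIL <path>" for each unreadable file and returns non-zero
# if any read failed. check_dumps_root is a hypothetical helper.
check_dumps_root() {
    root=${1:-/public/dumps}
    rc=0
    for rel in \
        public/liwiki/latest/liwiki-latest-pages-articles.xml.bz2-rss.xml \
        pagecounts-raw/index.html \
        incr/enwiki/20180327/status.txt \
        pageviews/2018/2018-03/projectviews-20180328-040000 \
        pagecounts-all-sites/README.txt
    do
        if ! head -n 1 "$root/$rel" >/dev/null 2>&1; then
            echo "FAIL $root/$rel"
            rc=1
        fi
    done
    return $rc
}
```

Because the function exits non-zero on any failure, it could be wrapped in a single remote command per host instead of five, which keeps the per-host result to one pass/fail line.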
Post (if success):
- [DONE] Announce all clear to mailing lists & IRC
- Remove the dumps export from labstore1003
- Clean up labstore1003 dumps mount code in nfsclient.pp
- Stop dumps rsync jobs that sync to labstore1003
Rollback plan:
- Kill any new processes actively accessing /public/dumps
- nfs-mount-manager kill-active /public/dumps
- [UNDO] Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
- Disable puppet on instances with NFS
- Revert puppet patch - https://gerrit.wikimedia.org/r/#/c/422867/
- Apply patch on canaries and test
- Roll out to all instances
- [POTENTIALLY UNDO] Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
- Disable puppet on instances with NFS
- Revert puppet patch - https://gerrit.wikimedia.org/r/#/c/403767/
- Apply patch on canaries and test
- Roll out to all instances
- [UNDO] Absent NFS mount at /public/dumps (served from labstore1003)
- Disable puppet on instances with NFS
- Revert puppet patch - https://gerrit.wikimedia.org/r/#/c/422848/
- Apply patch on canaries and test
- Roll out to all instances
This went pretty well! Todos for clean up:
- Remove the dumps export from labstore1003
- Clean up labstore1003 dumps mount code in nfsclient.pp
- Stop dumps rsync jobs that sync to labstore1003
- Stop managing nfs shares for wikidata-dev project
Change 423727 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfs: Stop exporting dumps from labstore1003
Change 423728 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfsclient: Cleanup absented dumps mount from labstore1003
Change 423731 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Turn off cron that rsyncs to labstore1003
Change 423732 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Clean up code that rsyncs to labstore1003
Change 423728 merged by Madhuvishy:
[operations/puppet@production] nfsclient: Cleanup absented dumps mount from labstore1003
Change 423731 merged by Madhuvishy:
[operations/puppet@production] dumps: Turn off cron that rsyncs to labstore1003
Change 423727 merged by Madhuvishy:
[operations/puppet@production] nfs: Stop exporting dumps from labstore1003
Change 423732 merged by Madhuvishy:
[operations/puppet@production] dumps: Clean up code that rsyncs to labstore1003
Change 426003 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] stop dumps-related cron jobs on labstore1003
Change 426003 merged by ArielGlenn:
[operations/puppet@production] stop dumps-related cron jobs on labstore1003