
Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7
Closed, Resolved, Public

Description

Task for actual migration

Event Timeline

madhuvishy created this task.

Change 403767 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfsclient: Setup dumps mounts from new servers

https://gerrit.wikimedia.org/r/403767

Change 422848 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Absent /public/dumps mount served from labstore1003

https://gerrit.wikimedia.org/r/422848

Change 422867 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Set up symlinks on instances under /public/dumps

https://gerrit.wikimedia.org/r/422867

Mentioned in SAL (#wikimedia-operations) [2018-04-02T14:28:56Z] <madhuvishy> Disabling puppet across VPS instances with dumps mounted (https://phabricator.wikimedia.org/P6921) T188643

Change 403767 merged by Madhuvishy:
[operations/puppet@production] nfsclient: Setup dumps mounts from new servers

https://gerrit.wikimedia.org/r/403767

Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:06:38Z] <madhuvishy> Reenabled puppet and rolled out mounting new dumps NFS shares from labstore1006|7 on VPS instances T188643

Change 422848 merged by Madhuvishy:
[operations/puppet@production] dumps: Absent /public/dumps mount served from labstore1003

https://gerrit.wikimedia.org/r/422848

Mentioned in SAL (#wikimedia-operations) [2018-04-02T15:59:49Z] <madhuvishy> Absenting /public/dumps mount from labstore1003 across the VPS fleet T188643

Change 422867 merged by Madhuvishy:
[operations/puppet@production] dumps: Set up symlinks on instances under /public/dumps

https://gerrit.wikimedia.org/r/422867

Mentioned in SAL (#wikimedia-operations) [2018-04-02T16:37:48Z] <madhuvishy> Rolling out new symlinks to /public/dumps for labstore1006 dumps nfs mount T188643

Notes from migration plan doc:

labstore1006 -- Serves cloud VPS NFS traffic
labstore1007 -- Serves stat* NFS, web and rsync mirror traffic

NFS migration

Starting at 14:00 UTC (7:00 PDT) April 2, 2018

Goals:

  1. Migrate Cloud VPS users consuming dumps via /public/dumps on instances from labstore1003 to labstore1006
  2. Migrate NFS mounts in stat1005 & 6 from dataset1001 to labstore1007

Migration Plan:

Pre:

  • [DONE] Announce to cloud and analytics/research mailing lists that the migration is happening in 24 hours <- maybe announce Friday early, for folks who don't pay attention to mail on the weekends?

PART I Cloud VPS

Useful commands:

To generate list of hosts with dumps NFS share enabled

madhuvishy@labpuppetmaster1001:~$ sudo nfs-hostlist -s dumps -f dumps-hosts

To target with cumin hosts with dumps enabled

madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' 'sudo date'

Disable puppet

madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"

Enable puppet

madhuvishy@labpuppetmaster1001:~$ sudo cumin 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"

Dumps enabled VPS instances - https://phabricator.wikimedia.org/P6921

Canary hosts

madhuvishy@labpuppetmaster1001:~$ cat > canaries
tools-worker-1026.tools.eqiad.wmflabs
toolsbeta-paws-worker-1003.toolsbeta.eqiad.wmflabs
tools-bastion-05.tools.eqiad.wmflabs
tools-exec-1442.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs

During:

  • Announcements
    • [DONE] Update wikimedia-cloud irc channel
    • [DONE] Last minute mailing list update
  • [DONE] Make sure nfs-kernel-server is up and running on labstore1006 & 7
  • [DONE] Make sure the shares are being exported on both servers
  • Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps (options are in https://gerrit.wikimedia.org/r/c/403767/)
    • [DONE] Disable puppet on instances with NFS
      • sudo cumin -b 10 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
    • [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/403767/
    • [DONE] Apply patch on canaries and test
      • sudo cumin -b 1 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
      • sudo cumin -b 1 'F{canaries}' 'run-puppet-agent'
      • df output

labstore1006.wikimedia.org:/dumps nfs4 66T 38T 25T 60% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/dumps nfs4 65T 38T 24T 62% /mnt/nfs/dumps-labstore1007.wikimedia.org

  • [DONE] Roll out to all instances
    • sudo cumin -b 10 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
    • sudo cumin -b 10 'F{dumps-hosts}' "run-puppet-agent"
    • sudo cumin -b 10 'F{dumps-hosts}' 'test -e /mnt/nfs/dumps-labstore1006.wikimedia.org/xmldatadumps/public/enwiki/latest/enwiki-latest-abstract11.xml.gz'
    • sudo cumin -b 10 'F{dumps-hosts}' 'test -e /mnt/nfs/dumps-labstore1007.wikimedia.org/xmldatadumps/public/enwiki/latest/enwiki-latest-abstract11.xml.gz'
    • wikidata-dev has NFS turned off explicitly through https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-dev; wikidata-lexeme.wikidata-dev still has the older mounts. Noting here for cleanup later.
  • [DONE] Kill any processes actively accessing /public/dumps
    • Possibly - nfs-mount-manager kill-active /public/dumps
    • Only tools.yifeibot was accessing /public/dumps. Asked Yifei to stop the bot temporarily :)
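
The disable, canary, then fleet sequence used in the steps above can be sketched as a small dry-run helper. This is only an illustration: plan_staged_rollout is a hypothetical name, it prints the cumin commands instead of running them, and the host-list files (dumps-hosts, canaries) and batch sizes are the ones from these notes.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the cumin commands for one staged puppet rollout.
# plan_staged_rollout is a hypothetical helper, not an existing tool.
# Nothing is executed; review the output and run each line by hand.

plan_staged_rollout() {
    local reason="$1"        # message given to disable-puppet / enable-puppet
    local canary_batch="$2"  # cumin -b value for the canaries file
    local fleet_batch="$3"   # cumin -b value for the dumps-hosts file

    # 1. Freeze puppet everywhere before the patch is merged.
    printf "sudo cumin -b %s 'F{dumps-hosts}' \"disable-puppet '%s'\"\n" "$fleet_batch" "$reason"
    # 2. After merging: re-enable and run puppet on the canaries only.
    printf "sudo cumin -b %s 'F{canaries}' \"enable-puppet '%s'\"\n" "$canary_batch" "$reason"
    printf "sudo cumin -b %s 'F{canaries}' 'run-puppet-agent'\n" "$canary_batch"
    # 3. Once the canaries look healthy: same two steps on the whole fleet.
    printf "sudo cumin -b %s 'F{dumps-hosts}' \"enable-puppet '%s'\"\n" "$fleet_batch" "$reason"
    printf "sudo cumin -b %s 'F{dumps-hosts}' 'run-puppet-agent'\n" "$fleet_batch"
}
```

For example, plan_staged_rollout "Dumps migration in progress - T168486- ${USER}" 1 10 prints commands equivalent to the five used for the first rollout, in the same order.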

<At this point all instances that have dumps are successfully mounting labstore1003/1006/1007>

</public/dumps is not currently a symlink so swap out the mount for symlink to new data>

  • Absent NFS mount at /public/dumps (served from labstore1003)
    • [DONE] Disable puppet on instances with NFS <-- do we know puppet is working on all instances? what about ssh in? this is typically a horror show. or do you mean something else here? (yup, i've checked the instance puppet runs)
      • sudo cumin -b 20 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
    • [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/422848/
    • [DONE] Apply patch on canaries and test
      • sudo cumin -b 5 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
      • sudo cumin -b 5 'F{canaries}' 'run-puppet-agent'
    • [DONE] Roll out to all instances
      • sudo cumin -b 20 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
      • sudo cumin -b 20 'F{dumps-hosts}' "run-puppet-agent"
  • Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
    • [DONE] Disable puppet on instances with NFS
      • sudo cumin -b 20 'F{dumps-hosts}' "disable-puppet 'Dumps migration in progress - T168486- ${USER}'"
    • [DONE] Merge puppet patch - https://gerrit.wikimedia.org/r/#/c/422867/
    • [DONE] Apply patch on canaries and test
      • sudo cumin -b 5 'F{canaries}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
      • sudo cumin -b 5 'F{canaries}' 'run-puppet-agent'
      • sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/public/liwiki/latest/liwiki-latest-pages-articles.xml.bz2-rss.xml'
      • sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pagecounts-raw/index.html'
      • sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/incr/enwiki/20180327/status.txt'
      • sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000'
      • sudo cumin -b 5 'F{canaries}' 'head -n 1 /public/dumps/pagecounts-all-sites/README.txt'
  • [DONE] Roll out to all instances
    • sudo cumin -b 20 'F{dumps-hosts}' "enable-puppet 'Dumps migration in progress - T168486- ${USER}'"
    • sudo cumin -b 20 'F{dumps-hosts}' "run-puppet-agent"
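
Besides reading a file through each path, one could also confirm on an instance that /public/dumps really is a symlink into the new mount and not a leftover directory or mount point. A minimal sketch, assuming the link target lives somewhere under the mount directory shown in the df output; symlink_points_under is a hypothetical helper, and the exact link targets are set by the puppet patch, not by this snippet.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if $1 is a symlink whose immediate target is
# $2 itself or a path below $2. symlink_points_under is a hypothetical name.

symlink_points_under() {
    local link="$1" mount_root="$2" target
    # A plain directory or an NFS mount point fails here.
    [ -L "$link" ] || return 1
    # readlink without options prints the link's immediate target.
    target="$(readlink "$link")" || return 1
    case "$target" in
        "$mount_root"|"$mount_root"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage on an instance might look like: symlink_points_under /public/dumps /mnt/nfs/dumps-labstore1006.wikimedia.org && echo "symlink OK".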

Success Criteria:

  • [DONE] Instances can successfully read from /public/dumps
    • test across instances with dumps mounted
      • sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/public/liwiki/latest/liwiki-latest-pages-articles.xml.bz2-rss.xml'
      • sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pagecounts-raw/index.html'
      • sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/incr/enwiki/20180327/status.txt'
      • sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pageviews/2018/2018-03/projectviews-20180328-040000'
      • sudo cumin -b 20 'F{dumps-hosts}' 'head -n 1 /public/dumps/pagecounts-all-sites/README.txt'
  • [DONE] Dumps read check https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org is OK
  • Labstore1006 - load on the server is normal (monitor for at least a couple of hours) -- All good right now, nothing much is going on though
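
The per-path head checks above can be folded into a single loop that reports every path and fails if any read fails. A minimal sketch for running locally on one instance (or wrapping in a single cumin call); check_dump_paths is a hypothetical helper, and the relative paths in the usage note are the sample files already used in these notes.

```shell
#!/usr/bin/env bash
# Sketch: try to read the first line of each path under a root directory.
# Prints OK/FAIL per path and returns non-zero if any read failed.
# check_dump_paths is a hypothetical name.

check_dump_paths() {
    local root="$1"; shift
    local rel failed=0
    for rel in "$@"; do
        # head exits non-zero when the file is missing or unreadable.
        if head -n 1 "$root/$rel" >/dev/null 2>&1; then
            echo "OK   $rel"
        else
            echo "FAIL $rel"
            failed=1
        fi
    done
    return "$failed"
}
```

For example: check_dump_paths /public/dumps pagecounts-raw/index.html incr/enwiki/20180327/status.txt pagecounts-all-sites/README.txt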

Post (if success):

  • [DONE] Announce all clear to mailing lists & IRC
  • Remove the dumps export from labstore1003 --
  • Clean up labstore1003 dumps mount code in nfsclient.pp --
  • Stop dumps rsync jobs that sync to labstore1003 --

Rollback plan:

  • Kill any new processes actively accessing /public/dumps
    • nfs-mount-manager kill-active /public/dumps
  • [UNDO] Set up symlinks on instances for /public/dumps, /public/dumps/pagecounts-all-sites, /public/dumps/pagecounts-raw, /public/dumps/pageviews and /public/dumps/incr from active mount on /mnt/nfs
  • [POTENTIALLY UNDO] Mount NFS shares from labstore1006 & 7 on instances at /mnt/nfs/labstore1006-dumps & /mnt/nfs/labstore1007-dumps
  • [UNDO] Absent NFS mount at /public/dumps (served from labstore1003)

This went pretty well! Todos for clean up:

  • Remove the dumps export from labstore1003
  • Clean up labstore1003 dumps mount code in nfsclient.pp
  • Stop dumps rsync jobs that sync to labstore1003
  • Stop managing nfs shares for wikidata-dev project

Change 423727 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfs: Stop exporting dumps from labstore1003

https://gerrit.wikimedia.org/r/423727

Change 423728 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] nfsclient: Cleanup absented dumps mount from labstore1003

https://gerrit.wikimedia.org/r/423728

Change 423731 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Turn off cron that rsyncs to labstore1003

https://gerrit.wikimedia.org/r/423731

Change 423732 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Clean up code that rsyncs to labstore1003

https://gerrit.wikimedia.org/r/423732

This went pretty well! Todos for clean up:

  • Remove the dumps export from labstore1003
  • Clean up labstore1003 dumps mount code in nfsclient.pp
  • Stop dumps rsync jobs that sync to labstore1003

Patches are up for these

  • Stop managing nfs shares for wikidata-dev project

Made task -- T191318

Change 423728 merged by Madhuvishy:
[operations/puppet@production] nfsclient: Cleanup absented dumps mount from labstore1003

https://gerrit.wikimedia.org/r/423728

Change 423731 merged by Madhuvishy:
[operations/puppet@production] dumps: Turn off cron that rsyncs to labstore1003

https://gerrit.wikimedia.org/r/423731

Change 423727 merged by Madhuvishy:
[operations/puppet@production] nfs: Stop exporting dumps from labstore1003

https://gerrit.wikimedia.org/r/423727

Change 423732 merged by Madhuvishy:
[operations/puppet@production] dumps: Clean up code that rsyncs to labstore1003

https://gerrit.wikimedia.org/r/423732

Change 426003 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] stop dumps-related cron jobs on labstore1003

https://gerrit.wikimedia.org/r/426003

Change 426003 merged by ArielGlenn:
[operations/puppet@production] stop dumps-related cron jobs on labstore1003

https://gerrit.wikimedia.org/r/426003