
Migrate dumpsdata hosts to Stretch/Buster
Closed, Resolved (Public)

Description

These are currently running jessie:

  • dumpsdata1001.eqiad.wmnet
  • dumpsdata1002.eqiad.wmnet

Event Timeline

Tentatively planning to move these straight to buster in the next quarter.

ArielGlenn triaged this task as Medium priority.

I can start on this once the new dumpsdata host is racked and has a base install.

ArielGlenn added a comment. Edited Nov 5 2019, 4:12 PM

Plan for migration:

  • base install of buster on new dumpsdata server with role::spare
  • make a mount point for the data filesystem, install rsync, copy over data from labstore1006 (or whichever is the web server) with bandwidth capping
  • make sure we have 'if buster, use something other than mailx' in the dumps server manifests
  • disable puppet on the active dumpsdata secondary, turn off the rsync pulls, make the new server the secondary
  • make the old secondary role::spare in puppet, unmount its raid filesystem, reimage with buster preserving all data
  • install rsync manually there, update from labstore1006 periodically during the next period
  • refresh or rewrite puppet patch to move 'misc dumps' into separate role
  • apply that patch to the old secondary during dead time for misc crons (late Friday, all day Saturday), and to the primary host so it no longer stores/rsyncs the misc dumps
  • wait for regular monthly xml/sql dumps run on primary to complete
  • re-image with buster, keep same role

At this point we should have three hosts on buster, one doing misc crons, one as fallback for the xml/sql dumps, and one as primary for the xml/sql dumps.
If the misc crons host fails, the xml/sql fallback server should be able to be used for it without issues by applying the right role.

  • Base install is ready thanks to Chris.
  • Resized the lvm and the filesystem for /data so that's ready to go.
  • rsync running in screen on dumpsdata1003 pulling last two good dumps from labstore1006:
rsync -v labstore1006.wikimedia.org::data/xmldatadumps/public/rsync-inc-last-2.txt .
rsync -av  --include '/*wik*/' --include-from=rsync-inc-last-2.txt --exclude='*'  labstore1006.wikimedia.org::data/xmldatadumps/public/ /data/xmldatadumps/public

This will bring over some older files from 2007 and 2009 but it's easier to clean those up later than try to get the rsync args right to exclude them.

ArielGlenn added a comment. Edited Nov 11 2019, 6:38 PM

The above rsync completed; I will be rerunning it from time to time. In the meantime I have now moved onto the 'misc' dumps:
rsync -av labstore1006.wikimedia.org::data/xmldatadumps/public/other/ /data/otherdumps

rsync -av --bwlimit=80000 dumpsdata1002.eqiad.wmnet::data/otherdumps/ /data/otherdumps

I see that I did not bwlimit the labstore rsync, though in my earlier 20 attempts to get the rsync args right, I did have that in there. It will be limited for any catchup runs.
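A catch-up run with the cap restored might look like the sketch below. It reuses the include/exclude arguments from the pull above and borrows the 80000 figure from the dumpsdata1002 pull; the function name and the DRY_RUN switch are illustrative and not part of any existing script. Without a suffix, rsync reads --bwlimit in units of 1024 bytes per second, so 80000 is roughly 78 MiB/s.

```shell
#!/bin/sh
# Sketch only: bandwidth-capped catch-up pull of the xml/sql dumps.
# catchup_pull and DRY_RUN are hypothetical names; the rsync arguments are
# the ones used earlier in this task, plus --bwlimit.

SRC='labstore1006.wikimedia.org::data/xmldatadumps/public/'
DEST='/data/xmldatadumps/public'
BWLIMIT=80000   # KiB/s, same cap as the dumpsdata1002 pull (an assumption)

catchup_pull() {
    # Build the command in the function's own positional parameters so the
    # glob patterns stay quoted and are passed to rsync untouched.
    set -- rsync -av "--bwlimit=${BWLIMIT}" \
        --include '/*wik*/' \
        --include-from=rsync-inc-last-2.txt \
        --exclude='*' \
        "$SRC" "$DEST"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print the command for inspection instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

catchup_pull
```

Run as is, this only prints the command; setting DRY_RUN=0 in the environment executes it. The include list (rsync-inc-last-2.txt) still has to be fetched first, as in the original pull.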

I have extended the rsync of xml/sql dumps to the last three good dumps and have been running a bandwidth-limited pull from labstore1006 to dumpsdata1003 in a screen session on dumpsdata1003. I've periodically been updating the misc/other dumps via pull from dumpsdata1002.

Rsync of both xmldatadumps/public and otherdumps from dumpsdata1002 to dumpsdata1003 is caught up as of earlier this evening. I'll be running these throughout the day tomorrow, waiting for the misc cron dumps to finish.

Change 551035 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make dumpsdata1003 another secondary dumps NFS server along with dumpsdata1002

https://gerrit.wikimedia.org/r/551035

Change 551035 merged by ArielGlenn:
[operations/puppet@production] make dumpsdata1003 another secondary dumps NFS server along with dumpsdata1002

https://gerrit.wikimedia.org/r/551035

Change 551038 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add per-host configs for new dumps fallback NFS server

https://gerrit.wikimedia.org/r/551038

Change 551038 merged by ArielGlenn:
[operations/puppet@production] add per-host configs for new dumps fallback NFS server

https://gerrit.wikimedia.org/r/551038

Change 551039 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] buster doesn't have mailx, replace with s-nail

https://gerrit.wikimedia.org/r/551039

Change 551039 merged by ArielGlenn:
[operations/puppet@production] buster doesn't have mailx, replace with s-nail

https://gerrit.wikimedia.org/r/551039

Change 551042 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] fix up dump stats script to use either mail or s-nail

https://gerrit.wikimedia.org/r/551042

Change 551042 merged by ArielGlenn:
[operations/puppet@production] fix up dump stats script to use either mail or s-nail

https://gerrit.wikimedia.org/r/551042

Change 551173 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make dumpsdata primary nfs server rsync to dumpsdata1003 now

https://gerrit.wikimedia.org/r/551173

Change 551173 merged by ArielGlenn:
[operations/puppet@production] make dumpsdata primary nfs server rsync to dumpsdata1003 now

https://gerrit.wikimedia.org/r/551173

dumpsdata1003 is now receiving all files from dumpsdata1001 via rsync. dumpsdata1002 can be turned into a spare and re-imaged with buster as the next step.

Change 551317 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make dumpsdata1002 spare before reimaging

https://gerrit.wikimedia.org/r/551317

Change 551317 merged by ArielGlenn:
[operations/puppet@production] make dumpsdata1002 spare before reimaging

https://gerrit.wikimedia.org/r/551317

Change 551319 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make dupsdata1002 install buster instead of jessie

https://gerrit.wikimedia.org/r/551319

Change 551319 merged by ArielGlenn:
[operations/puppet@production] make dumpsdata1002 install buster instead of jessie

https://gerrit.wikimedia.org/r/551319

Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts:

['dumpsdata1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911161927_ariel_118105.log.

Completed auto-reimage of hosts:

['dumpsdata1002.eqiad.wmnet']

and were ALL successful.

Expanded /data on dumpsdata1002, rsyncing copies of adds-changes dumps now from dumpsdata1003 in a screen session. After that I'll pick up the categoryrdf dumps, also via rsync from dumpsdata1003.

ArielGlenn updated the task description. Nov 16 2019, 9:06 PM

Change 551503 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add new partman recipe that skips format of /data partition for dumps servers

https://gerrit.wikimedia.org/r/551503

Change 551804 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] move misc crons to dumpsdata1002 nfs server

https://gerrit.wikimedia.org/r/551804

ArielGlenn added a comment. Edited Nov 19 2019, 12:41 PM

The schedule is now:

And then of course check that everything is running ok when xml dumps start on Dec 1st.

Change 551879 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add partman recipe that leaves /data on dump servers alone

https://gerrit.wikimedia.org/r/551879

Change 551879 abandoned by ArielGlenn:
add partman recipe that leaves /data on dump servers alone

Reason:
did testing without needing the commit.

https://gerrit.wikimedia.org/r/551879

Change 551503 merged by ArielGlenn:
[operations/puppet@production] add new partman recipe that skips format of /data partition for dumps servers

https://gerrit.wikimedia.org/r/551503

The patchset for tonight/tomorrow, moving misc cron storage to dumpsdata1002, is ready to go.

Given that the wikidata entity dumps are still finishing up the truthy gz files, and after that there will be bz2 recompression and the Lexemes, I'll be making the switchover tomorrow morning or mid-day EET.

Change 551804 merged by ArielGlenn:
[operations/puppet@production] move misc crons to dumpsdata1002 nfs server

https://gerrit.wikimedia.org/r/551804

snapshot1008 now uses dumpsdata1002 as its nfs server. I had to manually run systemctl stop nfs-mountd.service and then start it again for dumpsdata1002 to pick up the values (especially the port setting) in /etc/default/nfs-kernel-server, which is not great. Other than that, there were no problems with puppet's unmounting and remounting of the share.
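One way to spot this kind of stale registration is to compare the port the portmapper advertises for mountd with the configured one. The sketch below is only an illustration: it parses `rpcinfo -p` output, the helper names are hypothetical, and 32767 is a placeholder, since the task does not state the real port value.

```shell
#!/bin/sh
# Sketch only: check which port(s) mountd is registered on.
# mountd_ports and check_mountd_port are hypothetical helper names.

# Read `rpcinfo -p` output on stdin and print the unique mountd ports.
# Columns are: program, version, protocol, port, service.
mountd_ports() {
    awk '$5 == "mountd" { print $4 }' | sort -u
}

# Succeed only if mountd is registered on exactly the expected port.
check_mountd_port() {
    expected=$1
    actual=$(mountd_ports)
    if [ "$actual" = "$expected" ]; then
        echo "ok: mountd on port $expected"
    else
        # Unquoted on purpose: joins several ports onto one line.
        echo "mismatch: expected $expected, got:" $actual
        return 1
    fi
}

# Usage against a live server (placeholder port):
#   rpcinfo -p dumpsdata1002.eqiad.wmnet | check_mountd_port 32767
```

A mismatch right after changing /etc/default/nfs-kernel-server would suggest the daemon has not yet picked up the new settings.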

The next misc cron dump is already running (pagetitles) so I expect to see the files over on labstore1006 and labstore1007 in a little while.

And some of them are already on labstore1006, so rsyncs are working as expected.

Adds-changes dumps did not run properly; when I checked this afternoon, the Nov 23 job was hung indefinitely trying to get a lockfile on the first wiki to be processed (abwiki). I watched snapshot1008 attempt to connect to dumpsdata1002 for some nfs request and then try dumpsdata1003 when that failed (!). I rebooted snapshot1008, which no longer does this. It seems some port was still advertised wrongly on dumpsdata1002; a reboot took care of that.

However, locks over nfs in buster either behave differently or there is some other flag someplace I missed.

I've pushed changes to the adds-changes scripts to skip locking for now, since only one process runs at a time for a given date anyway. However, it needs to be fixed soon. I also need to see if the locking mechanism for xml/sql dumps works in buster as is, since that switchover is coming up very soon.

Changeset for skipping locks not yet merged, that will go tomorrow.

Backrunning the Nov 23 adds-changes now so they'll be complete in time for the Nov 24 run which kicks off around 9 pm UTC.

Change 552658 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] add ability to skip locking for adds-changes dumps

https://gerrit.wikimedia.org/r/552658

Change 552658 merged by ArielGlenn:
[operations/dumps@master] add ability to skip locking for adds-changes dumps

https://gerrit.wikimedia.org/r/552658

Change 552659 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] configure adds-changes dumps to skip locking for now

https://gerrit.wikimedia.org/r/552659

Change 552659 merged by ArielGlenn:
[operations/puppet@production] configure adds-changes dumps to skip locking for now

https://gerrit.wikimedia.org/r/552659

I have tested on snapshot1008, which mounts only the buster nfs share, that the dump_lock.py script with multiple instances works as it should; this is the locking mechanism for xml/sql dumps. This means that although the adds-changes dumps locking must still be investigated later, I can go ahead and re-image dumpsdata1001 now that the current xml/sql run has completed.
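A quick contention check of this general kind can be sketched in shell. This is not dump_lock.py, and it makes no claim about how that script or the adds-changes scripts take their locks; it is a generic probe using exclusive lockfile creation via the shell's noclobber option, with hypothetical function names. LOCKDIR is an assumption: point it at a directory on the NFS share to exercise the mount, otherwise a local temporary directory is used.

```shell
#!/bin/sh
# Sketch only: probe exclusive lockfile creation in a directory.
# acquire_lock and release_lock are hypothetical helpers, not dump_lock.py.

LOCKDIR=${LOCKDIR:-$(mktemp -d)}   # set LOCKDIR to a path on the NFS mount
LOCKFILE="$LOCKDIR/probe.lock"

# Create the lockfile only if it does not already exist.
# With noclobber (set -C) the redirection refuses to overwrite an
# existing file, so a second caller fails instead of taking the lock.
acquire_lock() {
    ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null
}

release_lock() {
    rm -f "$LOCKFILE"
}

if acquire_lock; then echo "first acquire: ok"; fi
if ! acquire_lock; then echo "second acquire: refused, as expected"; fi
release_lock
if acquire_lock; then echo "acquire after release: ok"; fi
release_lock
```

To check behaviour across clients rather than within one host, the same acquire step would need to be run from two machines that mount the share.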

Change 553324 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] dumpsdata1001 will install with buster now

https://gerrit.wikimedia.org/r/553324

Change 553324 merged by ArielGlenn:
[operations/puppet@production] dumpsdata1001 will install with buster now

https://gerrit.wikimedia.org/r/553324

Aaaaand dumpsdata1001 is reimaged. All the data is still there, available to snapshot hosts.

ArielGlenn updated the task description. Nov 27 2019, 12:20 PM
ArielGlenn closed this task as Resolved. Nov 27 2019, 12:44 PM

Closing, any followup issues can get their own tasks.