
Move xml/sql dumps to dumpsdata1001 from dataset1001
Closed, Resolved, Public

Description

Snapshot hosts should use the exported filesystem on dumpsdata1001 for the xml/sql dumps. The adds/changes dumps and misc datasets generated by cron can be moved over at a later date.

I would like this to happen before the next run (Nov 1), even if rsyncs of the generated dumps to dataset1001 aren't very smooth initially. I particularly want to avoid recurrences of T169680 by moving a bunch of work off of dataset1001.

Event Timeline

This move will be made much easier if all the wiki config files for xml/sql dumps are merged into one. This takes a bit of code and some changes to the puppet snapshot manifests.
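
For illustration, the shape of the merged setup is roughly: one INI-style config for all wikis, with a section of per-wiki overrides layered on top of the global defaults. This is a minimal sketch assuming Python's configparser; the section and key names ("wiki", "overrides", the colon-prefixed keys) are illustrative, not the actual operations/dumps config schema.

```
# Sketch: one merged dump config with per-wiki overrides.
# Section/key names are illustrative, not the real config schema.
from configparser import ConfigParser

SAMPLE = """
[wiki]
# defaults applied to every wiki
dblist = /srv/dumps/all.dblist
pagesPerChunkHistory = 20000

[overrides]
# hypothetical "wiki:setting = value" convention for per-wiki values
enwiki:pagesPerChunkHistory = 50000
wikidatawiki:pagesPerChunkHistory = 100000
"""

def settings_for(conf, wiki):
    """Merge the global section with any per-wiki overrides."""
    merged = dict(conf["wiki"])
    prefix = wiki + ":"
    for key, value in conf["overrides"].items():
        if key.startswith(prefix):
            merged[key[len(prefix):]] = value
    return merged

if __name__ == "__main__":
    # only "=" as delimiter, so ":" can stay inside the override keys
    conf = ConfigParser(delimiters=("=",))
    conf.optionxform = str  # preserve key case
    conf.read_string(SAMPLE)
    print(settings_for(conf, "enwiki"))
    print(settings_for(conf, "elwiki"))  # falls back to the defaults
```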

Change 386388 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] generate one config file for xml/sql dumps for wikis

https://gerrit.wikimedia.org/r/386388

Change 386389 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] Make a few more config settings parseable per-project.

https://gerrit.wikimedia.org/r/386389

Change 386389 abandoned by ArielGlenn:
Make a few more config settings parseable per-project.

Reason:
wrong way to do this, right way forthcoming

https://gerrit.wikimedia.org/r/386389

Change 387022 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] Permit overrides section in dump config files and more per proj settings

https://gerrit.wikimedia.org/r/387022

Change 387022 merged by ArielGlenn:
[operations/dumps@master] Permit overrides section in dump config files and more per proj settings

https://gerrit.wikimedia.org/r/387022

I'm not at all sure I'll get this done by the Nov 1 run, because I forgot a small detail: the last two dumps (fulls and partials) need to be copied to the dumpsdata1001 host for use as prefetch data. The copy is running now; I'll soon have an estimate of when it will complete.

I'm more hopeful now about this rsync completing on time. Wikidatawiki and enwiki for the Oct 1 run (full history content) have already made it over; that's about 1.8T of the 5.4T to copy. I've got a script going that picks up all files for all (public) wikis for the last two runs; I expect it to complete sometime before tomorrow evening.

Additionally, I will likely delay the monthly run by a few days and skip the second monthly run for November, so that I have plenty of time to get all issues straightened out: everything from fixing up the user id for these dump runs (at last!) to the various rsyncs between all these hosts.

Mail sent to: analytics, research-wmf, wikitech-l, xmldatadumps-l

If the mail doesn't show up on one of these lists, please let me know so I can nag a moderator; I'm not subscribed to all of them.

I'd been rsyncing one wiki and date at a time (via script), but it turns out it takes about 2 minutes to generate the file list whether it's for one wiki or for 20, which would have taken a very long time for 800 wikis and 2 dates. I've restarted with smarter exclusions, doing one letter of the alphabet at a time; this should finish much sooner.
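
The batching idea is roughly the following. This is a minimal sketch, not the actual script: the source path, destination, push direction (dataset host pushing to dumpsdata1001) and date list are assumptions based on this task, but it shows how grouping wikis by first letter means the file-list cost is paid once per letter rather than once per wiki and date.

```
#!/usr/bin/env python3
"""Sketch: batch the prefetch rsync by first letter of the wiki name.
Paths, destination and dates are assumptions, not the real script."""
import os
import subprocess

SRC_ROOT = "/data/xmldatadumps/public"  # assumed source path on the dataset host
DEST = "dumpsdata1001.eqiad.wmnet:/data/xmldatadumps/public/"
DATES = ["20171001", "20171020"]        # the last full and partial runs

def wikis_by_letter(root):
    """Group wiki directory names by their first letter."""
    groups = {}
    for name in sorted(os.listdir(root)):
        if os.path.isdir(os.path.join(root, name)):
            groups.setdefault(name[0], []).append(name)
    return groups

def rsync_letter(wikis):
    """One rsync call per letter: many sources, one file-list pass."""
    sources = [
        f"{SRC_ROOT}/./{wiki}/{date}"
        for wiki in wikis
        for date in DATES
        if os.path.isdir(os.path.join(SRC_ROOT, wiki, date))
    ]
    if sources:
        # -R (--relative) plus the "/./" marker keeps wiki/date under DEST
        subprocess.run(["rsync", "-aR"] + sources + [DEST], check=True)

if __name__ == "__main__":
    for letter, wikis in wikis_by_letter(SRC_ROOT).items():
        rsync_letter(wikis)
```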

A clean copy of all wiki dumps for 20171001 and 20171020 is now on dumpsdata1001. I found a bunch of old deleted and/or moved-to-incubator wikis, which I removed from there; they remain on the datasets host, since those dumps are of some historical interest. Everything has been chowned/chgrped to the dumpsgen user.

Still to do: thoroughly test and merge several Gerrit changesets, add the dumpsgen user to the snapshot hosts, set up the NFS mount of dumpsdata1001 on the snapshot hosts, create a second xml/sql config file with the new NFS mount path for writing, and do some manual testing. Then take stock again of what's left.

Change 386388 merged by ArielGlenn:
[operations/puppet@production] generate one config file for xml/sql dumps for wikis

https://gerrit.wikimedia.org/r/386388

The job that lists the last n good dumps for rsyncers broke last night due to the change in config file setting handling. It doesn't use the dumps library (should it?), and while I had changed the config file argument that gets passed in, I had not fixed up the handling of it. That's done now, tested and deployed: https://gerrit.wikimedia.org/r/#/c/387781/

Luckily this job would have produced the same files last night as the day before, since we're in between dump runs, so no harm done.
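
For context, what that job does is conceptually simple: walk each wiki's run directories newest-first and report the most recent runs that completed. A rough sketch only; the marker file name ("status.html") and the "Dump complete" check here are assumptions, not the real job's logic.

```
#!/usr/bin/env python3
"""Sketch: list the last n good dump runs per wiki, for rsyncers.
The status file name and completion check are assumptions."""
import os
import sys

def last_good_runs(wikidir, howmany):
    """Return the newest `howmany` run dates that look complete."""
    good = []
    for date in sorted(os.listdir(wikidir), reverse=True):
        status = os.path.join(wikidir, date, "status.html")
        try:
            with open(status, encoding="utf-8") as fh:
                if "Dump complete" in fh.read():
                    good.append(date)
        except OSError:
            continue
        if len(good) == howmany:
            break
    return good

if __name__ == "__main__":
    # usage: lastdumps.py <dump root dir> <number of runs to keep>
    dumproot, count = sys.argv[1], int(sys.argv[2])
    for wiki in sorted(os.listdir(dumproot)):
        wikipath = os.path.join(dumproot, wiki)
        if os.path.isdir(wikipath):
            for date in last_good_runs(wikipath, count):
                print(os.path.join(wiki, date))
```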

Change 387834 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add new dumpsgen user to dataset1001 and ms1001

https://gerrit.wikimedia.org/r/387834

Change 387834 merged by ArielGlenn:
[operations/puppet@production] add new dumpsgen user to dataset1001 and ms1001

https://gerrit.wikimedia.org/r/387834

Change 388048 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] generate separate config for dumps jobs on dumpsdata hosts

https://gerrit.wikimedia.org/r/388048

Change 388048 merged by ArielGlenn:
[operations/puppet@production] generate separate config for dumps jobs on dumpsdata hosts

https://gerrit.wikimedia.org/r/388048

Change 388142 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] run xml/sql dumps on dumpsdata host

https://gerrit.wikimedia.org/r/388142

Change 388142 merged by ArielGlenn:
[operations/puppet@production] run xml/sql dumps on dumpsdata host

https://gerrit.wikimedia.org/r/388142

Tomorrow's dump run will write files to the dumpsdata host. These files will not be automatically rsynced anywhere at first. I will be watching resource use and making sure that (even after thorough testing) everything runs as it should. I will manually push out files later in the day, assuming the run has no issues.

Dumps seem to be running fine, though folks can't see them on the web server right now. Load and resource usage on the snapshots and on the dumpsdata host look fine, as expected.

Next up is a script that cleans up old dumps on the datasets (web server) hosts before any rsync of new dumps takes place. Previously the dumps generation script cleaned up old dumps as it ran, since it wrote dumps directly to the web server; now that it no longer does, cleanup has to be a separate cron job. And since we need to remove old dumps before all of the new ones are rsynced over, rather than a minute before the new ones start to be created, we will be keeping one less dump of each type.
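
The cleanup logic is essentially "for each wiki, sort the dated run directories and remove everything older than the newest N". A minimal sketch under those assumptions, not the deployed script: the root path and keep count are placeholders.

```
#!/usr/bin/env python3
"""Sketch: clean up old dump runs on the web servers.
Not the deployed script; DUMP_ROOT and KEEP are placeholders."""
import os
import re
import shutil

DUMP_ROOT = "/data/xmldatadumps/public"  # assumed path on the web server
KEEP = 2                                 # one less than previously kept
DATE_DIR = re.compile(r"^\d{8}$")        # run dirs are named YYYYMMDD

def cleanup_wiki(wikipath):
    """Remove all but the newest KEEP dated run directories."""
    rundirs = sorted(
        d for d in os.listdir(wikipath)
        if DATE_DIR.match(d) and os.path.isdir(os.path.join(wikipath, d))
    )
    for old in rundirs[:-KEEP]:
        shutil.rmtree(os.path.join(wikipath, old))

if __name__ == "__main__":
    for wiki in sorted(os.listdir(DUMP_ROOT)):
        path = os.path.join(DUMP_ROOT, wiki)
        if os.path.isdir(path):
            cleanup_wiki(path)
```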

Change 388467 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] cleanup old dumps on dumps web servers, part one

https://gerrit.wikimedia.org/r/388467

Change 388467 merged by ArielGlenn:
[operations/puppet@production] cleanup old dumps on dumps web servers, part one

https://gerrit.wikimedia.org/r/388467

Change 388541 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] script for cleaning up old dumps on the web servers

https://gerrit.wikimedia.org/r/388541

Change 388541 merged by ArielGlenn:
[operations/puppet@production] script for cleaning up old dumps on the web servers

https://gerrit.wikimedia.org/r/388541

I have manually run the cron job for cleanup of oldest wiki dumps past the amount we want to keep. The job will run daily from now on.

This clears up plenty of space for rsync of the dumps in progress. I will likely do a manual rsync of those tonight, so that whatever files are available can be picked up by dumps users.

The first manual rsync completed a few minutes ago. That's it for tonight. Note that it's possible some index.html files link to files that aren't there yet, if a file was produced after the rsync had already passed over its directory but before the rsync reached the index.html for that dump.

Change 389025 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] rsync xml/sql dumps on an ongoing basis to fallback nfs server

https://gerrit.wikimedia.org/r/389025

I'm doing a manual rsync to dumpsdata1002 from ms1001 of the 20171001, 20171020 and 20171103 runs so that we have the data in place before I start rolling rsyncs to it from dumpsdata1001. That will run for most of a day, for sure.

We're pretty caught up both on the copy of the data on dataset1001 and on the copy on dumpsdata1002. Next up is to automate this very crude rsync so that the data stays up to date; the pending Gerrit changeset will do just that. Refinements will come later, e.g. deleting already-rsynced old files from jobs that were re-run, making sure the index.html and other status files on the web server reflect only the files that are actually there and rsynced for download, and providing a link on the web server to a copy of the index.html that reflects the current run status, without download links.
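
To give an idea of the index.html refinement mentioned above: rewrite each rsynced index.html so it only links to files that actually made it to the web server. This is purely an illustrative sketch of the idea; the eventual solution (tracked separately) may look quite different.

```
#!/usr/bin/env python3
"""Sketch: strip index.html links to files not yet on the web server.
Illustrative only; not what will actually be deployed."""
import os
import re
import sys

HREF = re.compile(r'<a href="([^"]+)">')

def present_links_only(indexpath):
    rundir = os.path.dirname(indexpath)
    with open(indexpath, encoding="utf-8") as fh:
        html = fh.read()

    def keep_or_strip(match):
        target = match.group(1)
        # leave absolute/external links alone
        if target.startswith(("http://", "https://", "/")):
            return match.group(0)
        if os.path.exists(os.path.join(rundir, target)):
            return match.group(0)
        # drop the href so the entry renders as plain text, not a dead link
        return "<a>"

    return HREF.sub(keep_or_strip, html)

if __name__ == "__main__":
    sys.stdout.write(present_links_only(sys.argv[1]))
```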

Change 389025 merged by ArielGlenn:
[operations/puppet@production] rsync xml/sql dumps on an ongoing basis from primary dump nfs host

https://gerrit.wikimedia.org/r/389025

Rolling rsyncs to dataset1001 and dumpsdata1002 are in progress (one after the other, not both at the same time), done by script now.
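
Conceptually the rolling rsync is just a loop over destinations, strictly one at a time so dumpsdata1001 only serves one full-tree copy at once. A rough sketch under those assumptions; the destination paths, the bandwidth limit and the sleep interval are illustrative, not what the puppetized script actually uses.

```
#!/usr/bin/env python3
"""Sketch: rolling rsync from dumpsdata1001 to the web server and the
fallback NFS host, one destination after the other.
Paths, flags and timings are assumptions, not the real script."""
import subprocess
import time

SRC = "/data/xmldatadumps/public/"
DESTS = [
    "dataset1001.wikimedia.org:/data/xmldatadumps/public/",   # web server
    "dumpsdata1002.eqiad.wmnet:/data/xmldatadumps/public/",   # fallback NFS host
]

def rsync_once(dest):
    # --bwlimit keeps the copy from starving the dump jobs writing to SRC
    subprocess.run(
        ["rsync", "-a", "--bwlimit=40000", SRC, dest],
        check=False,  # keep rolling even if one pass hits a transient error
    )

if __name__ == "__main__":
    while True:
        for dest in DESTS:    # strictly one after the other
            rsync_once(dest)
        time.sleep(600)       # pause between passes
```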

That pretty much covers what needed to be done on this ticket. Nice-to-haves around the rsync will go on a new task.