
Architecture and puppetize setup for dumpsdata boxes
Closed, ResolvedPublic

Description

These two boxes will serve nfs filesystems for dumps generation; they will be mounted only on the snapshot hosts. Files must be kept for at least two runs (the current run and the previous full run); new files must be copied over to the dataset host after generation, so they can be rsynced elsewhere and served to the public. The dump monitoring service expects live status information; this will have to be changed.

Architecture for the above needs to be worked out, then implemented in puppet manifests and deployed.

Event Timeline

The moving parts are as follows:

dumpsdata:

  • filesystem definition, nfs export to snapshot hosts
  • clean up of old dump run output
  • save completed revision content files (article/meta* bz2, flow current/history) for prefetch -- in the usual way for now
  • cleanup of older prefetch files; we must always have at least two prefetch files for a given wiki and dump step -- as part of the usual dumps cleanup for now
  • [x] rsync job which rsyncs, for each current dump directory: completed files, then dump status files, then index.html; then the main index.html, to the dataset hosts, every ten minutes (see the sketch after this list)
  • rsync dumps that land in 'other' (the non-xml dumps) as well
  • rsync adds-changes dumps; clean up old ones but keep some around (the previous ones are needed to generate the current ones), and keep a few extra beyond that, because sometimes runs break
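
A minimal sketch of the ordering that rsync job could follow, assuming a list of completed files per run is available; the host name, module, paths and file names below are illustrative, not the deployed values:

```
#!/usr/bin/python3
# Illustrative sketch only: the intended ordering for one dump run directory
# (completed data files, then status/hash files, then the run's index.html),
# followed by the top-level index.html. Host, module and file names are made up.
import subprocess
import tempfile

RSYNC = ["/usr/bin/rsync", "-a", "--relative"]
DEST = "dataset1001.wikimedia.org::dumps/"   # hypothetical rsync destination

def rsync_list(srcroot, relpaths):
    """Rsync a list of paths, given relative to srcroot, preserving layout."""
    if not relpaths:
        return
    with tempfile.NamedTemporaryFile("w+", prefix="rsync-list-") as listfile:
        listfile.write("\n".join(relpaths) + "\n")
        listfile.flush()
        subprocess.run(RSYNC + ["--files-from", listfile.name, srcroot, DEST],
                       check=True)

def push_run(srcroot, wiki, rundate, completed_files):
    rundir = "{}/{}".format(wiki, rundate)
    # 1. data files that are fully written
    rsync_list(srcroot, ["{}/{}".format(rundir, fname) for fname in completed_files])
    # 2. status and hash files describing those data files
    rsync_list(srcroot, ["{}/{}".format(rundir, fname)
                         for fname in ("dumpstatus.json", "md5sums.txt", "sha1sums.txt")])
    # 3. the per-run index.html last, so it never links to missing files
    rsync_list(srcroot, ["{}/index.html".format(rundir)])

def push_main_index(srcroot):
    # 4. finally the top-level index.html covering all wikis
    rsync_list(srcroot, ["index.html"])
```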

datasets (until the new labstore boxes come online):

  • accept rsync of files from dumpsdata hosts
  • nfs export to stats hosts only
  • rsync to mirrors, labstore, etc as before
  • must check that rsync input filelist generation works properly; list-last-n-good-dumps must run here now, not on a snapshot
  • web service as before
  • expiring old dumps will have to be done from here; it can't rely on the dumps config, so this needs some generic, simple cleanup script (see the sketch after this list)
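
A rough sketch of what such a generic cleanup could look like: keep the newest few dated run directories per wiki, without reading the dumps config at all. The root path and keep count are assumptions:

```
#!/usr/bin/python3
# Illustrative sketch: remove all but the newest KEEP dated run directories
# under each wiki directory, without reading the dumps config.
import os
import re
import shutil

DUMPROOT = "/data/xmldatadumps/public"   # assumed path on the web server host
KEEP = 3                                 # assumed number of runs to retain
DATEDIR = re.compile(r"^\d{8}$")         # run directories are named YYYYMMDD

def cleanup_wiki(wikidir):
    rundirs = sorted(entry for entry in os.listdir(wikidir)
                     if DATEDIR.match(entry)
                     and os.path.isdir(os.path.join(wikidir, entry)))
    for old in rundirs[:-KEEP]:
        shutil.rmtree(os.path.join(wikidir, old))

def main():
    for wiki in os.listdir(DUMPROOT):
        wikidir = os.path.join(DUMPROOT, wiki)
        if os.path.isdir(wikidir):
            cleanup_wiki(wikidir)

if __name__ == "__main__":
    main()
```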

snapshots:

  • get prefetch files from a given directory, if specified by config -- deferred for now, if space permits
  • perhaps use 7z for full history prefetch? do 7z/bz2 decompression timing tests, also look at cpu/memory usage -- deferred for now
  • perhaps write a nicely formatted list of which files are complete for a run, so that the dumpsdata server can use it as an rsync filelist? or can it use the json dump status files? -- json status files are fine (see the sketch after this list)
  • in dumprun status, add list of "special" files (hash files, latest links, status files etc) that a downloader may also want; this can also be used by snapshot rsyncers
  • dump monitor runs on one of these, writing to the dumpsdata filesystem, to generate index.html file covering all wikis
  • run adds-changes dumps on snapshot hosts as usual
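
For the json option, a minimal sketch of how the rsyncer could derive its file list from a run's dumpstatus.json; the key names ("jobs", "status", "files", "done") are assumed to match the layout of the published status files and should be verified against a real one:

```
#!/usr/bin/python3
# Illustrative sketch: read a run's dumpstatus.json and list the files
# belonging to jobs that have finished, as input for the rsync file list.
# Key names ("jobs", "status", "files", "done") are assumptions about the
# dumpstatus.json layout; check them against an actual file first.
import json

def completed_files(statuspath):
    with open(statuspath) as fhandle:
        status = json.load(fhandle)
    filenames = []
    for jobinfo in status.get("jobs", {}).values():
        if jobinfo.get("status") == "done":
            filenames.extend(jobinfo.get("files", {}).keys())
    return sorted(filenames)

if __name__ == "__main__":
    for name in completed_files("dumpstatus.json"):
        print(name)
```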

Due to dump scheduling, the code that will take the longest to deploy will be the changesets to the dumps repo; these can only be pushed between dump runs.

That makes the item "in dumprun status, add list of 'special' files (hash files, latest links, status files etc) that a downloader may also want; this can also be used by snapshot rsyncers" first on the list to test and get done, so it's ready as soon as the current dump run completes. Accordingly, I've been working on that and have an untested draft patchset.

Change 364729 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] write out a list of special dump files per dump run that downloaders may want

https://gerrit.wikimedia.org/r/364729

Change 364729 merged by ArielGlenn:
[operations/dumps@master] write out a list of special dump files per dump run that downloaders may want

https://gerrit.wikimedia.org/r/364729

Change 366308 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] setup for dumpsdata hosts to serve dumps work area via nfs to snapshots

https://gerrit.wikimedia.org/r/366308

That puppet patchset above is missing a lot, has lots of duplicate code, and is guaranteed not to pass jenkins either, but it has a draft of many of the pieces.

For the rolling rsync to be effective for the larger files (revision history content, primarily), the dumps runner should be notified as each file covering a specific pagerange is completed, so that it can mark it as such in the dump status file. This can then be used by the rsyncer to select files for rsync, as we don't want to rsync files that are still being written (a waste of resources). Currently the runner waits for all revision content files to be completed before adding them to the dump status file.

This will take some reworking of the revision content dump steps, and the code that manages parallel processes. It would be good to get this done and deployed before the next run starts on Aug 1.

https://gerrit.wikimedia.org/r/#/c/368744/ is a draft of the callback plus file content check for dump output files as they are produced by each series of commands. All dump steps will need to incorporate the new logic.
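
Not the patchset itself, just a rough sketch of the shape such a per-file completion callback could take, assuming the parallel command runner can call a hook as each output file finishes; all class and function names here are hypothetical:

```
#!/usr/bin/python3
# Illustrative sketch: a per-file completion callback that the parallel
# command runner could invoke as each pagerange output file finishes,
# so the dump status file can list it before the whole step is done.
# CommandRunner, looks_complete and the status file object are hypothetical.
import os

def looks_complete(path):
    """Cheap sanity check on a finished output file before advertising it."""
    # a real check would verify e.g. that the bz2/gz stream terminates cleanly
    return os.path.exists(path) and os.path.getsize(path) > 0

class PerFileNotifier:
    def __init__(self, statusfile):
        self.statusfile = statusfile   # object that rewrites the dump status file

    def file_done(self, path):
        """Callback fired by the runner when one pagerange file is written."""
        if looks_complete(path):
            self.statusfile.add_completed_file(path)
            self.statusfile.write()

# usage sketch:
# runner = CommandRunner(commands, on_file_complete=PerFileNotifier(status).file_done)
# runner.run_all()
```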

Change 373117 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] start of setup of dumpsdata hosts

https://gerrit.wikimedia.org/r/373117

This is now deployed. The Sept 1 dumps will use this code.

Change 373117 merged by ArielGlenn:
[operations/puppet@production] start of setup of dumpsdata hosts

https://gerrit.wikimedia.org/r/373117

Change 374242 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add user and directory setup to dumpsdata hosts

https://gerrit.wikimedia.org/r/374242

Notes from today's irc chat with @madhuvishy about the rsync that will happen from dumpsdata hosts to labstore hosts:

There are several open questions:

  • How often should the rsyncs run? I'd been thinking often, say every 20 minutes, to shovel over files as they become complete. Madhumitha has been thinking every 6 hours, to keep the operational/maintenance burden lower.
  • Does the dumpsdata host rsync to both labstores as well as to its fallback host, or does it rsync to one labstore host, which syncs to its fallback in turn? It could rsync to its fallback host, which could push out to the secondary labstore box, splitting up the network and disk I/O load, but then the next item becomes even more of a problem:
  • Consistency of dumps data across hosts: if there's a delay between rsyncs, new files may be picked up and sent to one host but not yet be present on the other
  • Consistency of index.html with the files that are rsynced: we will rsync only completed files, so we want the index.html file to include only links to files that are part of the rsync. The same goes for hash file lists, status files and so on.
  • What sort of monitoring do we do on the labstore end to make sure the rsyncs are working? Currently we do none; failed cron jobs will notify via email. Checking that copies arrived intact (md5sums) is expensive; is it even worth it? The remedy would be to run another rsync after fixing any underlying (disk/network/host) issue anyway.
  • Retrying the rsync once on failure, in case of a short network flap, might be worth it (see the sketch after this list).
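
A minimal sketch of that retry-once idea, wrapping whatever rsync invocation the cron job ends up using; the pause length and example destination are placeholders:

```
#!/usr/bin/python3
# Illustrative sketch: run an rsync command and retry exactly once after a
# short pause if it fails, to ride out a brief network flap.
import subprocess
import time

def rsync_with_one_retry(args, pause=60):
    """Run rsync once; on nonzero exit, wait and try one more time."""
    for attempt in (1, 2):
        result = subprocess.run(["/usr/bin/rsync"] + args)
        if result.returncode == 0:
            return True
        if attempt == 1:
            time.sleep(pause)
    return False

# usage sketch (source and destination are hypothetical):
# rsync_with_one_retry(["-a", "/data/dumps/", "labstore1006.wikimedia.org::dumps/"])
```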

Please add/clarify anything I left out!

@ArielGlenn Thanks for the summary! Looks right - one note is that I would prefer that the dumpsdata host (primary or secondary) is the pristine source for both labstore1006 and 7, rather than the labstores trying to sync between each other.

A few more thoughts.

I should stop thinking of this as an rsync and instead think of it as a copy of files that don't exist or need updating on the remote host(s), a copy that just happens to use rsync as the transport.

I could stash the dump special files in a temp dir, then copy the dump output files that are ready, one at a time, to each remote host in turn, using rsync. Then I could grab the stashed special files (which include the hash lists and index.html) and rsync those to each remote host in turn. Then I'd want to clean up the remote host dirs, to make sure we don't have broken files left from a previous run, for example. Finally I would clean up the temp dir. There should be a daily cron to clean up that temp dir in case of broken copy jobs.
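
A rough sketch of that sequence, with every host, path and helper name being illustrative rather than the eventual implementation:

```
#!/usr/bin/python3
# Illustrative sketch of the sequence described above:
# 1. snapshot the "special" files (hash lists, status files, index.html) to a temp dir
# 2. copy each completed output file, one at a time, to each remote host in turn
# 3. rsync the snapshotted special files to each remote host
# 4. clean up stale files in the remote run dirs (omitted here)
# 5. remove the temp dir; a daily cron would catch leftovers from broken jobs
# Hosts, module names and the flat destination layout are all simplifications.
import os
import shutil
import subprocess
import tempfile

REMOTES = ["dumpsdata1002.eqiad.wmnet::dumps/",      # hypothetical fallback dumpsdata host
           "labstore1006.wikimedia.org::dumps/",
           "labstore1007.wikimedia.org::dumps/"]

def push_run(rundir, completed_files, special_files):
    tempdir = tempfile.mkdtemp(prefix="dumpscopy-")
    try:
        # 1. stash the special files so they stay consistent during the copy
        for name in special_files:
            shutil.copy2(os.path.join(rundir, name), tempdir)
        # 2. completed output files, one at a time, to each remote in turn
        for name in completed_files:
            for remote in REMOTES:
                subprocess.run(["/usr/bin/rsync", "-a",
                                os.path.join(rundir, name), remote], check=True)
        # 3. then the (stashed, mutually consistent) special files
        for remote in REMOTES:
            subprocess.run(["/usr/bin/rsync", "-a"] +
                           [os.path.join(tempdir, name) for name in special_files] +
                           [remote], check=True)
        # 4. cleanup of broken/stale files on the remotes would go here
    finally:
        # 5. remove the temp dir
        shutil.rmtree(tempdir, ignore_errors=True)
```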

The largest file we have right now is 52GB (the en wiki stubs gz file). Copying it over a 1Gb link takes around 11 minutes if the link is doing nothing else. We'll want QoS or something similar so that rsync uses spare bandwidth that the dumps' nfs writes aren't using. I think we'll have bonded ethernet for these boxes, so we can get close to that bandwidth. However...

Given that I have to copy this file to three places (the dumpsdata failover and both labstore hosts), this means 45 minutes or an hour between rsyncs is a better target than the 10 minutes I'd hoped for.
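
Back-of-the-envelope arithmetic behind those numbers; the usable share of the link is a rough assumption:

```
#!/usr/bin/python3
# Rough arithmetic only: time to push the largest current file everywhere.
FILE_GB = 52          # en wiki stubs gz, the largest file right now
LINK_GBIT = 1.0       # link speed in gigabits/sec
EFFICIENCY = 0.65     # assumed share of the link left after nfs dump traffic

seconds_per_copy = FILE_GB * 8 / (LINK_GBIT * EFFICIENCY)
copies = 3            # dumpsdata failover plus both labstore hosts
total_minutes = seconds_per_copy * copies / 60
print("one copy: %.0f min, all three: %.0f min"
      % (seconds_per_copy / 60, total_minutes))
# about 11 minutes per copy and over half an hour for all three, which is
# why 45-60 minutes between rsyncs is a safer target than 10 minutes
```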

Change 374242 merged by ArielGlenn:
[operations/puppet@production] add user and directory setup to dumpsdata hosts

https://gerrit.wikimedia.org/r/374242

@ArielGlenn Sounds good, I would push towards a larger window of at least 2 hours; 45 minutes to an hour for 3 rsyncs plus some cleanup seems like cutting it close.

As for rsync bandwidth capping, it does have a --bwlimit option that does just that.

Change 374606 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] copy of completed dump files plus metadata from dumpsdata to web server

https://gerrit.wikimedia.org/r/374606

I'm hoping to avoid the --bwlimit option; I use it in our current setup, but it's a hard cap even when nothing else is using the interface.

There should be some time to play around with rsync timing when we're set up to rsync to the existing web servers. Hopefully that will happen before your hosts are ready :-)

Change 375768 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] move dumps other than the xml/sql dumps to new path on dumpsdata hosts

https://gerrit.wikimedia.org/r/375768

Change 375768 merged by ArielGlenn:
[operations/puppet@production] move dumps other than the xml/sql dumps to new path on dumpsdata hosts

https://gerrit.wikimedia.org/r/375768

This is done now. While moving the misc cron dump jobs to write to the filesystem on the dumpsdata box, along with some rsync niceties, still remains to be dealt with, the main 'architecture and puppetize' task is complete.

Change 374606 abandoned by ArielGlenn:
copy of completed dump files plus metadata from dumpsdata to web server

Reason:
Obsoleted at last, by 6e9ccd3e2761c277d1cfa30596dcb8310672e453 and related changes.

https://gerrit.wikimedia.org/r/374606

Change 366308 abandoned by ArielGlenn:
setup for dumpsdata hosts to serve dumps work area via nfs to snapshots

Reason:
Obsoleted at last by 165c6bf520d9a6934edb4d326140ab77c22adc3f and related changes

https://gerrit.wikimedia.org/r/366308