
Retry failures of runs of checkpoint files immediately instead of waiting for related jobs to complete
Closed, Resolved, Public

Description

We're going to see dbs pulled out for maintenance or schema/index changes regularly, and retries need to happen at the subjob level immediately. Currently we wait for a group of jobs to fail, which may mean that others run to completion (a few days) before we retry. Dewiki is lagging for this very reason.

We need to finish up the pages-meta-history step manually, running several checkpoint files simultaneously so it completes well before the next dump run is scheduled to start.
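For context, "retry at the subjob level" boils down to something like the following sketch (hypothetical names such as run_subjob and MAX_RETRIES, not the actual dumps code): each checkpoint piece that fails is retried right away, and only pieces that never succeed mark the step as failed.

```
import subprocess

MAX_RETRIES = 2  # hypothetical retry limit

def run_subjob(command):
    """Run one checkpoint piece (e.g. one page range of pages-meta-history)."""
    return subprocess.call(command) == 0

def run_with_immediate_retries(subjob_commands):
    """Retry each failed piece right away instead of waiting for the whole group."""
    never_succeeded = []
    for command in subjob_commands:
        for _attempt in range(1 + MAX_RETRIES):
            if run_subjob(command):
                break
        else:
            never_succeeded.append(command)
    return never_succeeded  # only these mark the step as failed
```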

Event Timeline

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

For cleanup, I have a python script ready to go and the bash wrapper script is in testing now; I should be able to get this going a little later today, finishing up this step by late tomorrow, so 7zs can be generated the following day.

Change 342846 had a related patch set uploaded (by ArielGlenn):
[operations/dumps] scripts to generate a series of checkpoint files for a dump run manually

https://gerrit.wikimedia.org/r/342846

Because these things always take longer than expected, the jobs for dewiki and wikidatawiki (which also had a piece fail) have just been started. However, they should run relatively quickly. I'll report progress late tomorrow, or earlier if they get done sooner, of course.

Wikidatawiki still has a few files going, I expect them to be done within a few hours. After that, I'll clean up and start 7zs on these wikis.

In spite of what I said yesterday, there are still a few wikidatawiki files running tonight. I can't keep my eyes open any longer so I'll be checking tomorrow morning and kicking off the 7z phase as soon as possible.

Jobs ran and completed as expected. I am in the middle of several patchsets that will supplant the whole "checkpoint file" mechanism, splitting up the output files ahead of time into small enough page ranges, running them in batches and retrying immediately on failure. This entailed some cleanup of tech debt (more and more methods added over time which deal with... what, exactly? A path, an open file, pieces of a dumps directory/filename, dump file contents manipulation? No args documented, nothing clear from the names, etc.), some removal of duplicate code, and so on.
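To illustrate the "page ranges ahead of time" idea (a minimal sketch with made-up numbers, not the actual patchset): each part's page id span is cut into small fixed-size ranges up front, so every output file covers a known range and any failed piece can be redone on its own.

```
def page_ranges(first_page, last_page, pages_per_piece):
    """Cut a part's page id span into small fixed-size ranges up front."""
    start = first_page
    while start <= last_page:
        end = min(start + pages_per_piece - 1, last_page)
        yield (start, end)
        start = end + 1

# e.g. list(page_ranges(1, 100000, 30000))
# -> [(1, 30000), (30001, 60000), (60001, 90000), (90001, 100000)]
```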

Plan to test over the next two days. Current dump run should finish late tomorrow or early Wednesday, I plan to be ready to deploy Wednesday afternoon.

Change 343542 had a related patch set uploaded (by ArielGlenn):
[operations/dumps@master] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis

https://gerrit.wikimedia.org/r/343542

Change 343542 merged by ArielGlenn:
[operations/dumps@master] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis

https://gerrit.wikimedia.org/r/343542

Change 345985 had a related patch set uploaded (by ArielGlenn):
[operations/dumps@master] retry failed page content pieces immediately after page content step completes

https://gerrit.wikimedia.org/r/345985

TL;DR: dumps will start a day late.

Although I've tested these changesets quite a bit over the last few days (don't be deceived by the commit dates, they've been through repeated reordering, squashes and rebases), I'd rather double-check a bit more and deploy tomorrow, rather than deploy Right Now and hustle to re-enable cron and not be around to baby-sit. These changes (plus a puppet change to add config settings) will go live tomorrow.

Change 345985 merged by ArielGlenn:
[operations/dumps@master] retry failed page content pieces immediately after page content step completes

https://gerrit.wikimedia.org/r/345985

I see that the stub recombine step got borked somewhere in the bowels of this or the previous cleanup changesets. The big giveaway was enwiki recombined stubs approaching 0.5T and not being done yet :)

After poking around, I found that we now somehow generate a recombine list of stubs1, stubs1, stubs2, stubs1, stubs2, stubs3, stubs1, stubs2, stubs3, stubs4 (for dumps where the stubs and page content steps are broken up into subjobs). For enwiki that pattern repeats up through all 27 parts... I'll have to clean these up and let that step rerun later; it's not needed for generation of any later content, and folks can download the individual parts for now if they are in a hurry.
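To make the shape of the bug concrete, here's a hypothetical illustration (not the actual dumps code) of how re-appending the whole prefix list on every iteration produces exactly that stubs1, stubs1, stubs2, stubs1, stubs2, stubs3, ... sequence:

```
parts = ["stubs1", "stubs2", "stubs3", "stubs4"]

# buggy: each iteration re-appends every part seen so far, producing
# stubs1, stubs1, stubs2, stubs1, stubs2, stubs3, stubs1, stubs2, stubs3, stubs4
buggy = []
for i in range(len(parts)):
    buggy.extend(parts[: i + 1])

# intended: each part appears exactly once, in order
intended = list(parts)
```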

There is a similar problem with https://dumps.wikimedia.org/frwiki/20170401/
frwiki-20170401-stub-meta-history.xml.gz is 22.2 GB whereas it should be around 9 GB. A simple script using this file to count user contributions reports largely inflated numbers. After digging a little, it turns out the same data appears several times in the file; for example, rev id 100000000 appears 4 times with identical content.
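A quick way to confirm the duplication from the outside is something like this sketch (not the reporter's script; note that the needle could in principle also match a page or contributor id, so it's only a rough check):

```
import gzip

# Count how many times one revision id shows up in the recombined stub file;
# in a correct dump it should appear exactly once.
needle = "<id>100000000</id>"
count = 0
with gzip.open("frwiki-20170401-stub-meta-history.xml.gz", "rt", encoding="utf-8") as stubs:
    for line in stubs:
        if needle in line:
            count += 1
print(needle, "seen", count, "times")
```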

The bad recombined stubs have been removed on all wikis except commons, which is busy running another job. When it's not actively running I can clean up there, and these jobs will all get redone automagically a bit later in the run.

This affected the recombined articles dumps, and therefore the article bz2 stream dumps as well. That was enough for me to shoot everything, fail those steps for the big wikis and toss the relevant files. The dumps should resume in about an hour, and they'll pick up with the failed jobs first.

Stubs recombines now seem to run OK; the enwiki stubs recombine, for example, ran to completion properly. The page range code needs some fixups; running on a set of pages with hundreds of thousands of revisions always shakes out more bugs. I'm doing some testing now on our canary host.

The run on the canary host looks good. I've merged the fix and will deploy shortly. All other wikis proceeding fine.

This has been merged and deployed: https://gerrit.wikimedia.org/r/#/c/346951/
This will be merged and deployed today: https://gerrit.wikimedia.org/r/#/c/347182/
Currently working on an issue where the last page range for a content dump "part" may contain too many revisions, because we rely on the row count estimate from 'show explain' as though it were accurate, when it isn't.
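The rough idea of the fix is to stop trusting the optimizer's estimate and cap each piece by an actual revision count instead; a sketch, with get_revision_count() as a hypothetical stand-in for a query against the revision table:

```
MAX_REVS_PER_PIECE = 500000  # hypothetical cap on revisions per output piece
PAGE_STEP = 1000             # hypothetical granularity for extending a range

def get_revision_count(start_page, end_page):
    # placeholder for a real query, roughly:
    # SELECT COUNT(*) FROM revision WHERE rev_page BETWEEN start_page AND end_page
    raise NotImplementedError

def end_of_piece(start_page, last_page):
    """Grow the range in small steps and stop before the real revision count
    exceeds the cap, instead of trusting the optimizer's row estimate."""
    end = start_page
    while end < last_page:
        candidate = min(end + PAGE_STEP, last_page)
        if get_revision_count(start_page, candidate) > MAX_REVS_PER_PIECE:
            break
        end = candidate
    return end
```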

I had to set up a new testbed which meant some cleanup to old import tools (see https://gerrit.wikimedia.org/r/#/c/347625/ and https://gerrit.wikimedia.org/r/#/c/347626/). I've fixed up https://gerrit.wikimedia.org/r/#/c/347627/ a bit, have tested it on this new testbed and am now doing some tests on one of the production hosts, not generating dumps but only the page ranges that would be used for small dump jobs. So far so good but it needs several more test runs.

With the addition of https://gerrit.wikimedia.org/r/#/c/348138/ my downloaded elwikivoyage files convert and import and dump out to the same content, so that's looking really good. Page ranges look good when I generate just the numbers for parts of enwiki or wikidata; tomorrow I'll likely set up some manual jobs to run a bunch of those so we can get these dumps completed before ~~hell freezes over~~ the next run starts on the 20th. Generally, lots of cleanup to do.

Running manually on snapshot1005 and snapshot1007 as of earlier this morning:

  • wikidata pages-meta-history 4, 20 jobs at a time
  • enwiki pages-meta-history 24-27, 30 jobs at a time

When these finish up I'll run more enwiki jobs. I'll also shoot the regular wikidata run shortly so we don't get a bunch of extra output files. Cleanup will get done later.
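For anyone curious, "20 jobs at a time" amounts to something like the following sketch (the real work uses the bash wrapper around the dumps scripts; the command and ranges below are placeholders):

```
import subprocess
from concurrent.futures import ThreadPoolExecutor

JOBS_AT_A_TIME = 20

def run_piece(page_range):
    start, end = page_range
    # placeholder command; the real invocation is the dumps worker script
    return subprocess.call(["echo", "pages-meta-history", str(start), str(end)])

# ranges would come from the precomputed page ranges for the part being redone
ranges = [(1, 30000), (30001, 60000), (60001, 90000)]
with ThreadPoolExecutor(max_workers=JOBS_AT_A_TIME) as pool:
    results = list(pool.map(run_piece, ranges))
```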

  • enwiki pages-meta-history 22-23 running on snapshot1007, 20 jobs at a time (by the way, that '30 jobs at a time' above was a typo; it was 20)
  • wikidatawiki 7z dumps running; this will conclude that dump.
  • renamed files that were badly named due to https://gerrit.wikimedia.org/r/#/c/347182/ for commonswiki, jawiki and eswiki, and ran a no-op job to clean up the hash sums, index.html and latest links for those wikis

As the enwiki jobs finish up I'll be starting new ones until we have full coverage for meta-history dumps; then will be cleanup of overlapping files and start of the 7z job.

As the enwiki pages-meta-history 22-23 generation is almost complete, I have started 20-21 going on snapshot1006. The wikidata 7z job is still running so there's nothing to be done there.

Just in case anyone wonders why ms1001 disk activity has just increased, it's only me, doing the 7z compression on some of the anomalously large files produced for this month's enwp full run. Doing them over there means we don't impact i/o on dataset1001 for ongoing dump generation and service. I'll have to pay the piper at some point (late on the 19th, most likely) by rsyncing those files back across to dataset1001 but that's better than the run finishing late.

The thing about ms1001 running a bunch of 7zs is that it only has 1 poor little quad core cpu. I've paused a bunch of the processes and it's much happier. When all that backlog is finally completed I'll mount the filesystem on one of the snapshots and do some jobs from there instead, as those boxes have plenty of cores.

Waiting for the regular enwiki run to finish up pages-meta-history{17,19}, then I will kick off 18 by hand, 20 jobs at a time; once that is complete I can start some 7zs going that write to dataset1001. Gonna get this run done before the 20th if it kills me :-P

The wikidata run is complete.

OK, finished up all 7zs, rsynced over the ones from ms1001, fixed up ownership, checked that the page ids in the names corresponded to the contents (for 12 files they did not; these were dumped on code predating https://gerrit.wikimedia.org/r/#/c/347182/) and renamed them as appropriate, checked page range continuity (only a few pages missing between files here and there, due to deletions), and checked that none of the bz2 outputs were truncated. Manual checking sucks, no two ways about it.
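The page range continuity part of that manual checking is roughly this kind of thing (a sketch; the filename pattern is assumed, not taken from the dumps code):

```
import re

def check_continuity(filenames):
    """Parse pNNNNpNNNN ranges from checkpoint filenames and flag gaps/overlaps."""
    ranges = []
    for name in filenames:
        match = re.search(r"p(\d+)p(\d+)", name)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2)), name))
    ranges.sort()
    for (_, end1, name1), (start2, _, name2) in zip(ranges, ranges[1:]):
        if start2 > end1 + 1:
            print("gap between", name1, "and", name2)  # a few missing pages (deletions) is fine
        elif start2 <= end1:
            print("overlap between", name1, "and", name2)

check_continuity([
    "enwiki-20170401-pages-meta-history27.xml-p1p2000.bz2",
    "enwiki-20170401-pages-meta-history27.xml-p2005p4000.bz2",
])
```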

The stubs recombine job that I had thought to be broken ran successfully, it seems, so the noop job is now running to fix up index.html, latest links, md5/sha1 sums, and json status files. This will take several hours, precisely because of those checksums.

Once that's done, all bug fixes will be deployed; the hosts have been live-patched until now.

Enwiki run is complete at last!

Change 342846 merged by ArielGlenn:
[operations/dumps@master] scripts to generate a series of checkpoint files for a dump run manually

https://gerrit.wikimedia.org/r/342846

Current run (Apr 20) is proceeding nicely. Closing this as complete.