
Produce stub dumps for all wikis as soon as a new month starts, then generate all other dumps on second round-robin cycle
Closed, Resolved · Public

Description

Both WMF staff and the community would really like to see dump-based metrics for the previous month within days after the month ends. Wikistats only requires stub dumps (with a few exceptions, which can be mitigated). These stub dumps can be generated relatively quickly.

Suggested approach, which leaves current processing mostly intact and thus can hopefully be implemented without too much overhaul (a rough sketch follows the list below):

  1. Early in the month, run every dump job only through step 29, then exit the job and continue with the next wiki in the normal round-robin fashion. These early steps are relatively quick, so a full cycle of steps 1-29 for all wikis could be completed within days.
  2. After all stub dumps are done, on the next round-robin cycle run steps 30-39 (which are much heavier), or run the entire job from the start if that's easier.
  3. Go idle after all jobs are fully done, so that all servers are ready to start a new cycle as soon as the month ends.
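
A minimal sketch of that schedule, assuming a hypothetical run_steps() helper in place of the real dump scripts (the step numbers, wiki list, and idle logic are illustrative only, not the actual tooling):

```python
import datetime
import time

WIKIS = ["dewiki", "enwiki", "wikidatawiki"]  # in reality, all 800+ wikis

def run_steps(wiki, first, last):
    """Hypothetical placeholder for running dump steps first..last on one wiki."""
    print(f"{wiki}: running steps {first}-{last}")

def monthly_cycle():
    # Phase 1: the quick stub steps (1-29) for every wiki, round-robin.
    for wiki in WIKIS:
        run_steps(wiki, 1, 29)
    # Phase 2: the heavy remainder (30-39), again round-robin.
    for wiki in WIKIS:
        run_steps(wiki, 30, 39)
    # Then idle until the first of the next month, so servers are free
    # to start the new cycle as soon as the month begins.
    now = datetime.datetime.now(datetime.timezone.utc)
    first_of_next = (now.replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
    time.sleep((first_of_next - now).total_seconds())
```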

Here are run times for the largest dumps:

enwiki: stub dumps are done after 2 days and 6 hours; the remainder of the job takes 13 days
dewiki: stubs take 15 hours; the remainder takes 8 days and 16 hours
wikidata: stubs take 2 days; the remainder takes 13 days (as of Nov 2014, now more)

Here is an example of a full dump job:
http://dumps.wikimedia.org/dewiki/20150106/

Step 1: 2015-01-06 17:41:30 done User account data. (private)
[13 hours pass]
Step 29: 2015-01-07 07:06:33 done Recombine first-pass for page XML data dumps
[8 days and 16 hours pass]
Step 39: 2015-01-16 02:47:15 done Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream
[Now job is complete]

Background:

Currently, after all dumps are done for a month, a new cycle starts right away (which for Wikistats is totally redundant if the dump starts in the same month). That would have to change: rather have the servers idle and ready to start a new cycle when the new month begins. I can't imagine many other users depend heavily on having a new dump every 20-25 days rather than once a month.

Of course, autonomous growth of the wikis makes job run time more of an issue every month.

Event Timeline

ezachte raised the priority of this task from to Needs Triage.
ezachte updated the task description.
ezachte added subscribers: ezachte, ArielGlenn, mark and 2 others.

Ideally, except for en wikipedia, we want to get out (at least) two runs a month. I have had complaints in the past from community members, in particular people generating lists of pages for bot editing or manual updates, when these runs start to take too long. We could float the once-a-month plan and see what the community reaction is, but I'd be reluctant to "just do it".

On doing the dumps in two phases, with stubs in the first run and everything else in the second: I've been trying to sketch out how that would work in the current rolling-dumps scenario. Not quite there yet.

Thanks for looking into this. Two full runs inside one month would be totally compatible with this proposal (and then still go into sleep mode for the remaining days, so as to be ready for the next month). One and a half runs probably would not, as all servers could be occupied for days in the lengthy part of phase 2.

Well, I've started one process on a manual run, discovered a buglet (it won't affect the current run but prevents me from starting up more processes), and I'll get to work on that right away.

Obviously this month's run being late is an anomaly; next month I'll want to run near the beginning of the month to give you plenty of time.

Full stubs of all but large wikis are done; I'll be looking at the large wikis shortly (Monday).

Thanks, Ariel

I updated Wikistats, which didn't detect the new stub dumps (the expected input in index.html changed slightly: no previous steps were 'done').
I also switched to stub dumps for all wikis (previously that was the default for the largest wikis only).

Here's the new month: all stubs are ready except en wp (probably done tomorrow), commonswiki (same), and wikidatawiki (same).

Doesn't this approach increase the staleness of the dumps that are later in the cycle?

I thought that all the dump processes ran from the same snapshot of a given wiki.

Can the stub dumps be extended from step 29 [1] to either step 30 [2] or even 31 [3]? This would generate usable dumps early in the cycle and help mitigate the staleness factor that DCDuring brings up.

For the cited dewiki example above, extending to steps 30 / 31 only adds 1 hr 39 min and 1 hr 6 min respectively. Extending the first job's 13 hours to about 16 hours doesn't seem like much of a tradeoff, especially since steps 32 through the end still take 8 days 13 hours.

Even enwiki [4] only adds about 3 hrs 51 min and 3 hrs 11 min. These additional ~7 hours still compare favorably to the initial job duration of 2 days and 6 hours and a remainder of 13 days (a quick check of these figures follows the notes below).

[1]: AKA "Recombine first-pass for page XML data dumps"
[2]: AKA "Articles, templates, media/file descriptions, and primary meta-pages"
[3]: AKA "Recombine articles, templates, media/file descriptions, and primary meta-pages."
[4]: based on http://dumps.wikimedia.org/enwiki/20150112/
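
For a quick check of those figures, here's a small computation; the durations are the ones quoted above, and nothing else comes from the dump tooling:

```python
from datetime import timedelta

# dewiki: the first job is ~13 hours today; steps 30 and 31 add 1:39 and 1:06.
dewiki_first = timedelta(hours=13) + timedelta(hours=1, minutes=39) + timedelta(hours=1, minutes=6)
print(dewiki_first)  # 15:45:00, i.e. roughly 16 hours

# enwiki: steps 30 and 31 add 3:51 and 3:11 on top of the 2-day-6-hour stub job.
enwiki_extra = timedelta(hours=3, minutes=51) + timedelta(hours=3, minutes=11)
print(enwiki_extra)  # 7:02:00, about 7 additional hours
```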

Thanks for weighing in. The order of the run is by no means settled, except that stubs must come first; that's so the monthly stats can use them right away.

@DCDuring There are no "snapshots" of wikis per se. Each step of a dump is run against the live database. Content dumps are generated from the list of pages in the stub dumps, so there is at least that consistency, but that's as far as it goes. The dumps are meant to be a good starting point for import, or in the case of bots and other tools, to provide the content needed. They are not a perfect copy.

@gnosygnu I'd rather run the articles dumps as a separate pass after the stubs. But maybe @ezachte can weigh in. My thought would be: stubs, tables, articles, articles + metadata pages, "the rest", doing each of these as one stage. We want the tables phase right after stubs so that the inconsistency isn't huge (as opposed to moving it to the end, let's say).
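
To make that ordering concrete, here's a minimal sketch of the five proposed stages. The stage names paraphrase the comment above; the list structure and run_stage() helper are hypothetical, not the actual dump scripts or job names:

```python
# Each stage is run across all wikis before the next stage starts.
STAGES = [
    "stubs",                    # first, so monthly stats can use them right away
    "tables",                   # right after stubs, to keep the inconsistency small
    "articles",
    "articles_plus_metadata",
    "the_rest",                 # history dumps and everything else
]

def run_cycle(wikis, run_stage):
    for stage in STAGES:
        for wiki in wikis:      # round-robin over all wikis within each stage
            run_stage(wiki, stage)
```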

@ArielGlenn Thanks. I think 5 rounds would be better, especially in comparison to the original 2 round proposal.

I am slightly worried that it makes the dump process a little more fragile. It means the dumper would need to make it through all wikis for stubs, then tables, before it can start on articles. If something breaks on any of the 800+ wikis in the first two rounds, then articles won't be generated (sort of like what happened in May). In contrast, the 2-round model would mean that some wikis could still get working dumps (though those wikis would be random, so that may be only marginally better).

That said, I think 5 rounds is fine. Hopefully it doesn't increase the complexity too much. ;)

@gnosygnu: 5 rounds is no more complex than two with this setup; it just means a script that has 5 lines in it instead of two :-)

May was an exception because I knew we couldn't get through the rest of the rounds in time to be ready for the beginning of the month for stubs. But in general, if something goes wrong on a wiki it will get the chance for retries later in the cycle. I'll explain how that works in detail a bit later (hmm, ought to update the docs :-))

For everyone following along: the other change I'd like to make is that we do two runs per month, but the second run does not contain the history dumps. We should be able to make that timetable easily enough, and I don't expect the full history dumps are in as much demand as the dumps of the current revisions. Thoughts?
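
A hypothetical sketch of that monthly plan, reusing the illustrative stage names from the sketch above (which stage actually contains the full-history dumps is an assumption):

```python
# Two runs per month; the second run omits the full-history dumps.
FULL_RUN = ["stubs", "tables", "articles", "articles_plus_metadata", "full_history"]
RUNS_PER_MONTH = [
    FULL_RUN,                                        # run 1: everything
    [s for s in FULL_RUN if s != "full_history"],    # run 2: no history dumps
]
```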

@ArielGlenn Cool. Didn't know that it would just be 3 more lines. If only all changes could scale as nicely as that. :)

Yeah, docs would be great. I remember reading one of your brain dumps linked from the offline-l mailing list and thinking it was a good insight into the process.

Personally, I think history dumps once per month should be fine, though honestly I don't use them. I would like to use them, but they are... well, pretty large. Also, I really don't know whether each month I can download incremental changes (only the new revisions) or need to pull down the full set. My impression is the latter, though I've been too lazy to investigate.

Re: "ought to update the docs"

Please. If you could do that, you might see fewer signs of ignorance (like mine) in discussions like these. You might also indicate there what seems likely to be in flux or, at least, direct users to some topical discussions. I was chastised for not grasping the connection between different Phabricator threads as a good software practitioner would have. As a mere non-technical user affected by down-the-line consequences of Phabricator-documented matters, I cannot always grasp such connections. No one pays me for the time I would have to spend on the unwelcome amount of self-education and research needed to find such things.

Thanks for the substantive work you've done and the patience you've shown. I hope that any work you can find the time to do on documentation pays dividends to you in better quality and broader participation in discussions.

BTW, the only dump I use or am likely to use is enwikt's "Articles, templates, media/file descriptions, and primary meta-pages." I have my hands full trying to keep up with the work in my area that it generates, so my concern is largely with not working from lists that would be rendered partially obsolete by an imminent new dump.

Here is what happened with stub dumps this June:
Ariel ran these as a separate job for the first time this month, and all were available around June 8.
By June 10 all stub dumps had been processed by Wikistats and reports were published, which is a huge improvement! Thanks again, Ariel.

I expect comScore reports to be released in less than two days, after which we can update the Monthly Report Card sooner than ever, with plenty of time to prepare the Quarterly Report before the end of June.

With this early release we also now have a slack/buffer period to fix mishaps, in case something goes wrong in any phase.

ArielGlenn claimed this task.
ArielGlenn set Security to None.

These have been running for some time now; I venture to say we can close this :-)