Both WMF staff and the community would really like to see dump-based metrics for the previous month within days after the month ends. Wikistats requires only stub dumps (with a few exceptions, which can be mitigated), and these stub dumps can be generated relatively quickly.
Suggested approach, which leaves current processing mostly intact and can therefore hopefully be implemented without too much overhaul:
- Early in the month, run every dump job only through step 29, then exit the job and continue with the next wiki in the normal round-robin fashion. These early steps in the dump job are relatively quick, so a full cycle of steps 1-29 for all wikis could be done within days.
- After all stub dumps are done, on the next round-robin cycle, run steps 30-39 (which are much heavier), or run the entire job from the start if that's easier.
- Go idle after all jobs are fully done, so that all servers are ready to start a new cycle as soon as the month ends.
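The two-pass ordering above can be sketched roughly as follows. This is only an illustration of the proposed order of work, not the actual dump scheduler: `run_steps` and `monthly_cycle` are hypothetical names, and the step boundaries (29 and 39) are taken from the example job in this note.

```python
STUB_LAST_STEP = 29   # steps 1-29 produce the stub dumps (per this note)
FINAL_STEP = 39       # steps 30-39 are the much heavier remainder


def run_steps(wiki, first, last, log):
    """Hypothetical stand-in for running dump steps first..last for one wiki."""
    log.append((wiki, first, last))


def monthly_cycle(wikis):
    """Stub pass for every wiki first, then the heavy pass, then stop."""
    log = []
    # Pass 1: quick steps only, round-robin over all wikis.
    for wiki in wikis:
        run_steps(wiki, 1, STUB_LAST_STEP, log)
    # Pass 2: heavy steps, in the same round-robin order.
    for wiki in wikis:
        run_steps(wiki, STUB_LAST_STEP + 1, FINAL_STEP, log)
    # No new cycle is started here: servers stay idle until the month ends.
    return log


if __name__ == "__main__":
    for entry in monthly_cycle(["enwiki", "dewiki", "wikidata"]):
        print(entry)
```

The point of the sketch is only that every wiki gets its stub dumps before any wiki starts the heavy steps.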
Here are run times for the largest dumps:
enwiki: stub dumps are done after 2 days and 6 hours, remainder of job takes 13 days
dewiki: stub takes 15 hours, remainder takes 8 days and 16 hours
wikidata: stubs take 2 days, remainder takes 13 days (Nov 2014, now more)
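To put these figures in proportion, here is a small calculation of the total job time per wiki and the share taken by the stub phase. It uses only the durations quoted above; the names `RUN_TIMES` and `stub_share` are made up for this illustration.

```python
from datetime import timedelta

# Run times as quoted above: (stub phase, remainder of job).
RUN_TIMES = {
    "enwiki": (timedelta(days=2, hours=6), timedelta(days=13)),
    "dewiki": (timedelta(hours=15), timedelta(days=8, hours=16)),
    "wikidata": (timedelta(days=2), timedelta(days=13)),
}


def stub_share(stub, remainder):
    """Fraction of the total job time spent on the stub phase."""
    return stub / (stub + remainder)


if __name__ == "__main__":
    for wiki, (stub, rest) in RUN_TIMES.items():
        total = stub + rest
        print(f"{wiki}: total {total}, stub share {stub_share(stub, rest):.0%}")
```

With these numbers the stub phase comes out at roughly 7 to 15 percent of the total job time for the three wikis.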
Here is an example of a full dump job:
Step 1: 2015-01-06 17:41:30 done User account data. (private)
[13 hours pass]
Step 29: 2015-01-07 07:06:33 done Recombine first-pass for page XML data dumps
[8 days and 16 hours pass]
Step 39: 2015-01-16 02:47:15 done Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream
[Now job is complete]
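The elapsed times between steps can also be derived directly from the timestamps in the log above. A minimal sketch, assuming the timestamps are in the same time zone; the helper name `elapsed` is made up for this illustration. The bracketed durations in the log are approximate, so the computed values may differ from them by a few hours.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the log above


def elapsed(start, end):
    """Time between two log timestamps, as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timestamps copied from the example job above.
step1 = "2015-01-06 17:41:30"
step29 = "2015-01-07 07:06:33"
step39 = "2015-01-16 02:47:15"

if __name__ == "__main__":
    print("steps 1-29: ", elapsed(step1, step29))
    print("steps 29-39:", elapsed(step29, step39))
```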
Currently, after all dumps for a month are done, a new cycle starts right away (which for Wikistats is totally redundant if that dump starts in the same month). That behavior would have to be changed: rather, have the servers idle and ready to start a new cycle when the new month begins. I can't imagine that any other users depend heavily on having a new dump, say, every 20-25 days rather than once a month.
Of course, organic growth of the wikis makes job run time more of an issue every month.