
Better progress indication and/or notification of completion of dump files for specific wikis and files
Open, Needs TriagePublic


The recent changes to the dump process have reduced my ability to predict when the dump file I need will become, or has become, available.


  1. It is not obvious in what sequence the dump process for each file or file group is initiated.
  2. It would be helpful to get notice when a specific file (or file group) in the dump is complete. This would effectively decrease the gap between completion of the dump and the taking of action based on the dump to make corrections and additions.
  3. Some kind of indication of ETA for a dump file would be helpful if it were reliable. Notification of the initiation of the dump process for a given wiki would facilitate making one's own estimate of the completion of a dump file of interest.

Event Timeline

DCDuring raised the priority of this task from to Needs Triage.
DCDuring updated the task description.
DCDuring added a subscriber: DCDuring.

@DCDuring: Which "recent changes to the dump process"? As no steps to reproduce are given, which project is this task actually about? Datasets-General-or-Unknown? Associating a project is welcome.

See T89273, which led to the changes. They were intended to be, and apparently are, an improvement in the reliability of the dump process.

These three points make sense to me. Obviously, one would need to figure out more specifically what these notifications and indicators should look like in detail, and how they could be generated.

Trying to provide answers to your points based on my limited knowledge, please correct me if I am wrong:

  1. The dumps will be generated in the same sequence for all dumps of the same wiki, but the sequence may be slightly different depending on whether the wiki is small, big or huge. Puppet has a rough indication of what the sequence is like (perhaps this can be properly documented on Wikitech once things are stable).
  2. I am not exactly sure what kind of "notice" you would like. There are RSS links available and the "dumpruninfo.txt" file is available for parsing to know if a particular file is ready, in progress or has failed.
  3. An ETA is provided in the dump progress page for most dumps (especially those that are frequently used). Although it may not work well for parallel dumps (see T29124 for details), it should be sufficient for your needs. Of course, the dump stage must have been started before the ETA appears. If it is still "waiting" then such information will not be available.
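
As a rough illustration of point 2, here is a minimal Python sketch of checking one job's state from the contents of "dumpruninfo.txt". It assumes each line has the form `name:<job>; status:<state>; updated:<timestamp>`; that layout, the job names, and the sample timestamps below are assumptions made for illustration and should be checked against a real file before relying on this.

```python
from typing import Dict

# Hypothetical sample contents; the line layout and job names are assumed.
SAMPLE = """\
name:xmlstubsdump; status:done; updated:2015-09-02 04:10:22
name:articlesdump; status:in-progress; updated:2015-09-02 09:41:07
name:metahistorybz2dump; status:waiting; updated:2015-09-01 23:00:00
"""


def parse_run_info(text: str) -> Dict[str, Dict[str, str]]:
    """Map each job name to its key/value fields."""
    jobs: Dict[str, Dict[str, str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields: Dict[str, str] = {}
        for part in line.split(";"):
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = part.strip().partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "name" in fields:
            jobs[fields["name"]] = fields
    return jobs


def job_status(text: str, job_name: str) -> str:
    """Return the status of one job, or "unknown" if it is not listed."""
    return parse_run_info(text).get(job_name, {}).get("status", "unknown")


if __name__ == "__main__":
    for name in ("xmlstubsdump", "articlesdump", "nosuchjob"):
        print(name, job_status(SAMPLE, name))
```

In practice the text would be fetched from the run directory of the wiki and date of interest and polled until the job of interest reports that it is finished.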

Generally, if you are looking for a better system that informs you when a new dump is ready (e.g. push notifications, emails, etc.), perhaps T92966 is quite similar to what you need. If you are looking for a stable URL to query for new dumps, the same task aims to achieve that as well.

I already use an RSS feed service to let me know when the dump of interest to me IS available.

Much of what I contribute to English Wiktionary is derived from my own dump-derived reports of what is missing in my areas of interest. My work planning benefits from knowing, say, two weeks in advance when the run of interest to me is likely to be available. (Supposedly we are to have dumps every two weeks, but that hope is usually disappointed.) I realize that the dump process is not stable and that it has a long history of not being stable and I blame no one for that. This item could be viewed as just a wish for more stability, which would enable me to make my own weekly ETAs for the dump file of interest. I don't really have time to get into the entire who-shot-John of why the next dump's ETA has changed when there is so much alteration and failure of the processes. If we were at the point of being able to have some kind of Total Quality metrics for the dump process, the reliability of arrival time and the average time between arrivals would figure prominently.

Now it looks like we're getting the dumps processed TOO FAST.

But seriously folks, I think the apparent improvement in speed and reliability is astonishing. I hope it holds up. If it does hold up, it might be possible to forecast when a given file will be available based solely on the start date and time of the dump cycle which would eliminate the need for anything other than notice of something taking longer than expected and an estimate of the range of resulting delay.
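
To make the forecasting idea concrete, here is a small Python sketch, assuming one has recorded how many hours after the start of each past dump cycle the file of interest became available. The function names and the offsets in the usage example are hypothetical, not measurements from real runs.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Sequence, Tuple


def forecast_window(
    cycle_start: datetime, past_offsets_hours: Sequence[float]
) -> Tuple[datetime, datetime, datetime]:
    """Return (earliest, typical, latest) availability for a new cycle.

    Each offset is the number of hours between the start of a past cycle
    and the moment the file of interest became available in that cycle.
    """
    if not past_offsets_hours:
        raise ValueError("need at least one past observation")

    def at(hours: float) -> datetime:
        return cycle_start + timedelta(hours=hours)

    return (
        at(min(past_offsets_hours)),
        at(median(past_offsets_hours)),
        at(max(past_offsets_hours)),
    )


def is_overdue(
    now: datetime, cycle_start: datetime, past_offsets_hours: Sequence[float]
) -> bool:
    """True once the file is later than it has ever been in past cycles."""
    _, _, latest = forecast_window(cycle_start, past_offsets_hours)
    return now > latest


if __name__ == "__main__":
    start = datetime(2015, 10, 1, 0, 0)
    offsets = [30, 36, 42]  # hypothetical hours from cycle start to file
    print(forecast_window(start, offsets))
    print(is_overdue(datetime(2015, 10, 3, 0, 0), start, offsets))
```

With a reliable process, only the `is_overdue` case would need a notice; everything else follows from the cycle start time.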

In the meantime thanks to Ariel G.

This month we won't have a second run because the servers that produce the dumps are being upgraded to Ubuntu Trusty and to use HHVM instead of PHP. We'll be back on schedule for October 1 though.

So going back to the subject of this task: if the RSS feed files are not sufficient, are you wanting guesstimates about how long each dump stage (e.g. "all stubs", "all page-article bz2 files") takes to complete? Or something else?

Projected availability date for each file would be ideal, based on all
known contributing factors, as soon as they are known. When something goes
wrong, projecting availability "no sooner than" a date would be even more