Provide a way to check if a dump has been generated
Closed, Resolved (Public)

Description

Use case: I have a script that performs processing on dumps, and I want that data to be updated automatically when a new dump is generated. I'm fine with running a cronjob once a day to check for updates, but as far as I (and @Halfak) know, there is no Good Way to do this currently. He will hopefully elaborate with his thoughts on this.

note from Dan: triaging High because this is blocking the multimedia team in the near future.

Event Timeline

MarkTraceur raised the priority of this task from to Needs Triage.
MarkTraceur updated the task description.
MarkTraceur added subscribers: MarkTraceur, Halfak.
Milimetric updated the task description.
Milimetric set Security to None.
Milimetric lowered the priority of this task from High to Medium. Feb 18 2016, 6:17 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

A new dump for any project, or for a specific one, or... ?

Pinging again since it's been many months with no input.

Hey! So, usually I want to know when a specific project gets a new dump. E.g. I want to process the "pages-meta-history" dump for enwiki, frwiki or ruwiki as soon as it is available in bz2 format.

My apologies for the long delay in answering, I didn't see that you'd replied.

You can check a JSON file, dumpstatus.json, for the run date of the wiki in question. E.g. for enwiki 20170401, you would grab https://dumps.wikimedia.org/enwiki/20170401/dumpstatus.json If the file does not yet exist, that's a good indication that your files are not ready :-D Once it exists, it lists each step, whether the run for that step is complete, and a list of files, giving sha1/md5 sums and the URL (relative to the docroot) of each file if available.

This is a relatively new service, so do let me know if you run into any problems.
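
For anyone scripting against this, here is a minimal Python sketch of the check described above. The URL pattern comes from the comment. The JSON field names (`jobs`, `status`, `files`, `url`), the `"done"` value, and the helper names are assumptions for illustration and should be checked against a real dumpstatus.json before relying on them.

```python
import json
import urllib.error
import urllib.request

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_status_url(wiki, date):
    """Build the dumpstatus.json URL for one wiki and run date (YYYYMMDD)."""
    return f"{DUMPS_ROOT}/{wiki}/{date}/dumpstatus.json"


def fetch_dump_status(wiki, date):
    """Return the parsed status file, or None if it does not exist yet."""
    try:
        with urllib.request.urlopen(dump_status_url(wiki, date)) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no status file yet, so the files are not ready
        raise


def finished_files(status, job):
    """List absolute file URLs for a job, or [] if the job is absent or not done.

    Assumed layout (not confirmed by this thread):
    {"jobs": {job: {"status": "done", "files": {name: {"url": "/..."}}}}}
    """
    entry = status.get("jobs", {}).get(job, {})
    if entry.get("status") != "done":
        return []
    # URLs in the file are relative to the docroot, so prefix the host.
    return [DUMPS_ROOT + info["url"]
            for info in entry.get("files", {}).values()
            if "url" in info]
```

A daily cron job could call `fetch_dump_status("enwiki", "20170401")`, skip the run if it returns `None`, and otherwise pass the result to `finished_files` with the name of the step it cares about (the step name itself would have to be read from the real file).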

Getting that file presumes I know the date of the dump ahead of time. I could make a request for every date between the last dump and the current date until I find a file, but that seems messy. I guess I'm imagining something at the base of a directory that would report on the status of all dumps, e.g. at https://dumps.wikimedia.org/dumpstatus/2017041.json, which would have a record for all dumps.
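
To make the "messy" date-probing workaround concrete, here is a rough Python sketch. The function names are hypothetical, and `exists` stands in for whatever HTTP check the caller uses to see whether a dumpstatus.json is present for a given date.

```python
from datetime import date, timedelta


def candidate_dates(last_dump, today):
    """Yield YYYYMMDD stamps for each day after last_dump, up to and including today."""
    day = last_dump + timedelta(days=1)
    while day <= today:
        yield day.strftime("%Y%m%d")
        day += timedelta(days=1)


def find_new_dump(last_dump, today, exists):
    """Return the first date stamp for which exists(stamp) is true, else None.

    `exists` is a caller-supplied callable (hypothetical) that would issue
    one request per candidate date.
    """
    for stamp in candidate_dates(last_dump, today):
        if exists(stamp):
            return stamp
    return None
```

Each call costs up to one request per day since the last known dump, which is the overhead the comment is objecting to.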

Actually, now that I think of it, you could probably keep the same format for the json file and have
https://dumps.wikimedia.org/dumpstatus/2017041/enwiki.json

The presence of a new date folder in /dumpstatus/ would signify that new dumps are in progress or available for that date, and then I could read the individual files (e.g. enwiki.json) for the wikis I care about.

Well.... there is also the aggregate file at the docroot: https://dumps.wikimedia.org/index.json which aggregates the json status info for the latest run for all wikis. Does that cover your needs?
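
A daily check against the aggregate file might look roughly like the Python sketch below. Only the index.json URL comes from this thread. The layout assumed here (`wikis`, `jobs`, `status`, `updated`), the `"done"` value, and the function names are assumptions, so inspect the real file before using it.

```python
import json
import urllib.request

INDEX_URL = "https://dumps.wikimedia.org/index.json"


def fetch_index():
    """Download and parse the aggregate status file (several MB)."""
    with urllib.request.urlopen(INDEX_URL) as resp:
        return json.load(resp)


def job_state(index, wiki, job):
    """Return (status, updated) for one job of one wiki, or None if absent.

    Assumed layout (not confirmed by this thread):
    {"wikis": {wiki: {"jobs": {job: {"status": ..., "updated": ...}}}}}
    """
    entry = index.get("wikis", {}).get(wiki, {}).get("jobs", {}).get(job)
    if entry is None:
        return None
    return entry.get("status"), entry.get("updated")


def is_new_finished_dump(index, wiki, job, last_seen_updated):
    """True if the job is done and its timestamp differs from the one stored last time."""
    state = job_state(index, wiki, job)
    if state is None:
        return False
    status, updated = state
    return status == "done" and updated is not None and updated != last_seen_updated
```

The script would persist the `updated` value it last processed for each wiki and compare against it on the next run, so the run date never has to be known in advance.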

Hmm.. It might work just fine. It seems like 9MB is a lot, but then again, this is on dumps.wikimedia.org, so 9MB isn't really that big. :)

ArielGlenn claimed this task.

OK, I'ma gonna close this then. If it turns out you want extra features, we can have a new ticket for that.