Create a Python Pywikibot script to download Wikimedia database dump
Closed, Resolved, Public

Description

Pywikibot is a Python-based framework for writing bots for MediaWiki.

It would be useful to have a "download dump" script that fetches a Wikimedia database dump from http://dumps.wikimedia.org/ and places it in a predictable directory for semi-automated use by other scripts and tests.

Create a simple script that downloads a file (for example, one of the stored files for French Wikipedia). The script should be stored in the scripts folder of Pywikibot. The user should be able to choose which file they want by providing the filename, the wiki (language/sister project) and the directory where it should be saved, using command-line arguments.
Useful link: https://meta.wikimedia.org/wiki/Data_dumps/Download_tools
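A minimal sketch of the command-line interface described above, using only the Python standard library rather than Pywikibot's own argument handling. The option names (`--wiki`, `--filename`, `--storepath`), the default directory and the helper names are hypothetical, chosen for illustration; the URL layout assumes the `latest` directory on dumps.wikimedia.org.

```python
import argparse
import os
import shutil
import urllib.request

DUMPS_BASE_URL = "https://dumps.wikimedia.org"


def build_dump_url(wiki, filename, date="latest"):
    """Return the URL of one dump file, e.g. .../frwiki/latest/<filename>."""
    return "{}/{}/{}/{}".format(DUMPS_BASE_URL, wiki, date, filename)


def parse_args(argv=None):
    """Parse the three user choices: wiki, filename and target directory."""
    parser = argparse.ArgumentParser(
        description="Download one Wikimedia database dump file.")
    parser.add_argument("--wiki", required=True,
                        help="database name of the wiki, e.g. frwiki")
    parser.add_argument("--filename", required=True,
                        help="dump file, e.g. frwiki-latest-abstract.xml")
    # Hypothetical default: a fixed folder under the user's home directory,
    # so that other scripts and tests can find the dump predictably.
    parser.add_argument("--storepath",
                        default=os.path.join(os.path.expanduser("~"), "dumps"),
                        help="directory where the file is saved")
    return parser.parse_args(argv)


def download(url, target_path):
    """Stream the file at url to target_path, creating the directory."""
    os.makedirs(os.path.dirname(os.path.abspath(target_path)), exist_ok=True)
    with urllib.request.urlopen(url) as response, \
            open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)


def main(argv=None):
    args = parse_args(argv)
    url = build_dump_url(args.wiki, args.filename)
    target = os.path.join(args.storepath, args.filename)
    download(url, target)
    print("Saved {} to {}".format(url, target))

# To use this as a command-line script, call main() from an
# `if __name__ == "__main__":` guard.
```

With such a guard in place, a call could look like `python download_dump_sketch.py --wiki frwiki --filename frwiki-latest-abstract.xml --storepath ./dumps` (the script name is likewise made up).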

You are expected to provide a patch in Wikimedia Gerrit. Documentation on Gerrit is available.

Event Timeline

jayvdb raised the priority of this task from to Low.
jayvdb updated the task description.
jayvdb added subscribers: Beta16, Aklapper, Rubin16 and 2 others.
Framawiki added a subscriber: Framawiki.

I propose to mentor this task for Google-Code-in-2017.

@Framawiki: Just explicitly asking, would you mentor this in GCI 2017? Maybe together with @jayvdb ?

I'd be happy to mentor it with @jayvdb .

Aklapper renamed this task from Pywikibot Wikimedia dump fetch to Make it easier in Pywikibot to fetch a Wikimedia database dump. Nov 29 2017, 2:05 PM

@jayvdb: Also in for mentoring? (Just explicitly asking.)

I created a draft in https://codein.withgoogle.com/tasks/6721462075392000/ (not published yet).

Ya, nice idea.
(published)

Framawiki renamed this task from Make it easier in Pywikibot to fetch a Wikimedia database dump to Create a Python Pywikibot script to download Wikimedia database dump. Dec 14 2017, 6:19 PM
Framawiki updated the task description.

I wrote a related script. Unfortunately it is hardcoded for Hungarian Wikipedia and therefore contains a lot of Hungarian text (at the time I didn't think it would be useful for others).
https://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/frissdump.py
This script does not download the dump; it updates a template with the date of the latest dump. Users interested in the dump watch this template. So the first part of the task, watching for a new version, is solved. It needs to be scheduled with cron.
Somebody may find it useful.

Thanks, it can be useful to parse the site.

Is this script something like https://github.com/WikiTeam/wikiteam/blob/master/wikipediadownloader.py?

Also, does "choose what file they want by providing the filename" mean that the user will have to provide an argument like frwiki-latest-abstract.xml, or just abstract?
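To illustrate the two options in this question, here is a small sketch of a helper that would accept either form. The function name `resolve_filename` and the rule it applies (prepend `<wiki>-<date>-` unless the name already starts with the wiki name) are assumptions for illustration, not something the task specifies.

```python
def resolve_filename(wiki, name, date="latest"):
    """Expand a short dump name to the full dump filename.

    A name that already starts with "<wiki>-" is treated as a full
    filename and returned unchanged (hypothetical convention).
    """
    if name.startswith(wiki + "-"):
        return name
    return "{}-{}-{}".format(wiki, date, name)


# Both spellings end up as the same file name.
short_form = resolve_filename("frwiki", "abstract.xml")
full_form = resolve_filename("frwiki", "frwiki-latest-abstract.xml")
```

Note that the short form here still carries the file extension (`abstract.xml`); mapping a bare `abstract` to a file would need extra knowledge of which extensions each dump type uses.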

And does the user have to provide a "predictable directory" as an argument?

So the script will download a new (updated) file from http://dumps.wikimedia.org/ once in a while and replace the previously downloaded one, right?

Sorry for so many questions 😅. I'm new to this.

If the script finds that it is running on Toolforge, it should also use the dumps from the already-mounted dumps directory rather than downloading another copy from dumps.wm.o.
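A sketch of that check, under stated assumptions: the mount point `/public/dumps/public` and the `<wiki>/<date>/<filename>` layout below it are assumed here and should be verified on Toolforge; the function name `find_local_dump` is made up for illustration.

```python
import os

# Assumed mount point of the dumps on Toolforge; verify before relying on it.
TOOLFORGE_DUMPS_ROOT = "/public/dumps/public"


def find_local_dump(wiki, filename, date="latest",
                    dumps_root=TOOLFORGE_DUMPS_ROOT):
    """Return the path of an already-mounted dump file, or None.

    None means the file is not available locally (for example when not
    running on Toolforge), so the caller should fall back to downloading.
    """
    candidate = os.path.join(dumps_root, wiki, date, filename)
    if os.path.isfile(candidate):
        return candidate
    return None
```

A caller could then link or copy the returned path into the target directory, and only download from dumps.wikimedia.org when the result is `None`.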

Change 399179 had a related patch set uploaded (by Eflyjason; owner: Eflyjason):
[pywikibot/core@master] Create a Python Pywikibot script to download Wikimedia database dump

https://gerrit.wikimedia.org/r/399179

Change 399179 merged by jenkins-bot:
[pywikibot/core@master] Create a maintenance script to download Wikimedia database dump

https://gerrit.wikimedia.org/r/399179

Framawiki assigned this task to eflyjason.