download_dump.py: Verify the file using the checksum
Open, Normal, Public

Description

Pywikibot is a Python-based framework to write bots for MediaWiki (more information).

Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/, and places the dump in a predictable directory for semi-automated use by other scripts and tests.

We should check that the file is not corrupted: compare the md5 of the downloaded file with the expected one and, if they differ, delete the corrupted file and retry the download, up to a limit set by a maxretries parameter.
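A minimal sketch of that check-and-retry loop, not the actual download_dump.py code. The names md5_of_file and download_verified, and the download callable, are hypothetical and used only for illustration; the default of 3 retries is likewise an assumption.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so that large dumps are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def download_verified(download, path, expected_md5, max_retries=3):
    """Call download(path) until the file's MD5 matches expected_md5.

    `download` is a hypothetical callable that writes the dump to `path`.
    Returns True on success, False if every attempt gave a corrupted file.
    """
    for _attempt in range(max_retries):
        download(path)
        if md5_of_file(path) == expected_md5:
            return True
        os.remove(path)  # delete the corrupted file before retrying
    return False
```

Here max_retries counts download attempts in total, so max_retries=3 means at most three downloads before giving up.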

File where md5 can be found: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-md5sums.txt (and this for each project).

If the md5 of the current filename cannot be found in the list, simply use pywikibot.warning('text') to show a warning and skip the verification.
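One way the lookup could work, as a sketch only. It assumes each line of the md5sums file has the usual md5sum layout (checksum, whitespace, filename). parse_md5sums and expected_md5 are hypothetical names, and logging.warning stands in for pywikibot.warning so that the snippet runs without Pywikibot installed.

```python
import logging


def parse_md5sums(text):
    """Map each filename in an md5sums listing to its checksum."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed layout: "<md5> <filename>"; anything else is ignored.
        if len(parts) == 2:
            md5, filename = parts
            checksums[filename] = md5
    return checksums


def expected_md5(checksums, filename, warn=logging.warning):
    """Return the checksum for filename, or None after emitting a warning."""
    md5 = checksums.get(filename)
    if md5 is None:
        warn('No md5 found for %s; skipping verification.' % filename)
    return md5
```

In the script itself, warn=pywikibot.warning could be passed instead, and a None result would mean the verification step is skipped.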

You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.

Event Timeline

Framawiki triaged this task as Normal priority. Dec 24 2017, 5:18 PM
Framawiki created this task.
Framawiki updated the task description.
Aklapper updated the task description. Dec 24 2017, 6:39 PM

It seems that files like enwiki-latest-abstract.xml do not have an md5 checksum in https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-md5sums.txt. How can we check those files?

Interesting, perhaps a bug in the dump generator. So the script shouldn't break if it can't find the filename in the list, but should show a warning (pywikibot.warning('text')). Note that all filenames in the list use the date format, so latest will need to be converted.

Just checking whether the filename the user entered matches the text after the last - on each line of the md5 file would be fine.
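A sketch of a slightly stricter variant of that idea: instead of comparing only the text after the last -, it drops the date (or latest) field and compares everything else, since dump names such as pages-articles.xml.bz2 contain hyphens themselves. It assumes filenames have the form <wiki>-<date or latest>-<rest> and that the wiki name contains no hyphen. strip_date and same_dump_file are hypothetical names.

```python
def strip_date(filename):
    """Drop the second '-'-separated field (a date or 'latest')."""
    parts = filename.split('-', 2)
    if len(parts) < 3:
        # Not in the assumed <wiki>-<date>-<rest> form; leave it unchanged.
        return filename
    return parts[0] + '-' + parts[2]


def same_dump_file(requested, listed):
    """Return True if both names refer to the same dump, ignoring the date."""
    return strip_date(requested) == strip_date(listed)
```

For example, same_dump_file('enwiki-latest-abstract.xml', 'enwiki-20171220-abstract.xml') is true, while a name with a different suffix after the date does not match.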

eflyjason updated the task description. Dec 26 2017, 9:00 AM

Can the maxretries parameter take its default from config2.py? (Though the default there, 15 retries, would be too many...)

# Maximum number of times to retry an API request before quitting.
max_retries = 15
# Minimum time to wait before resubmitting a failed API request.
retry_wait = 5

Yes, but 15 is too much IMO. So, no :)