
download_dump.py: Handle cases when the dump file already exists
Open, Normal, Public

Description

Pywikibot is a Python-based framework to write bots for MediaWiki (more information).

Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/ and places the dump in a predictable directory for semi-automated use by other scripts and tests.

If the same file already exists in the folder (see the sketch below):

  • If the filename doesn't contain latest, it shouldn't be downloaded again.
  • Otherwise, add the current date as a suffix to the name.
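
A minimal sketch of the behaviour described above, assuming the dump is saved into a storepath directory; the helper name and the suffix format are made up, and the actual download step is omitted:

```python
import os
from datetime import date

def target_path(filename, storepath):
    """Return where to save the dump, or None if it should not be downloaded again."""
    path = os.path.join(storepath, filename)
    if not os.path.exists(path):
        return path
    if 'latest' not in filename:
        # A dated dump never changes server-side: keep the existing copy.
        return None
    # A "latest" dump may have been refreshed: keep copies apart by suffixing
    # today's date (appended naively here; the real script would probably
    # place it before the file extension).
    return '{}-{}'.format(path, date.today().strftime('%Y%m%d'))
```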

You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.

Event Timeline

Framawiki triaged this task as Normal priority. Dec 24 2017, 4:55 PM
Framawiki created this task.
Restricted Application added a subscriber: pywikibot-bugs-list. Dec 24 2017, 4:58 PM
Aklapper renamed this task from download_dump.py: If the file already exists to download_dump.py: Handle cases when the dump file already exists. Dec 24 2017, 6:41 PM
Aklapper updated the task description.

Would timezone settings on bot user's computer be a problem to this?

> Would timezone settings on bot user's computer be a problem to this?

It's not a real problem: the file will not leave the computer, and we can assume it will stay consistent, since the machine will not change its timezone every day :) It's more user-friendly to keep using a local timezone instead of UTC.

> Would timezone settings on bot user's computer be a problem to this?

> It's more user-friendly to keep using a local timezone instead of UTC.

Why not add a UNIX timestamp instead, which would be independent from timezones?
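
For illustration only (not the script's actual naming): a date suffix depends on the machine's local timezone, whereas a UNIX timestamp does not:

```python
import time
from datetime import date

filename = 'idwiki-latest-abstract.xml'

local_date = date.today().strftime('%Y%m%d')  # e.g. '20171224', depends on local timezone
unix_stamp = str(int(time.time()))            # e.g. '1514127300', timezone-independent

print('{}-{}'.format(filename, local_date))
print('{}-{}'.format(filename, unix_stamp))
```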

rafidaslam added a subscriber: rafidaslam.

I'll work on this

Sorry for not working on this until now, @Framawiki.

> If the filename doesn't contain latest, it shouldn't be downloaded again.

But I think the filename will always have latest in it ( https://github.com/wikimedia/pywikibot/blob/ca7c0ce89f2b2e96ebc5bb7b5b8aef2ccd04c2c3/scripts/maintenance/download_dump.py#L65 ). Should I create a new task for this, so that this script can download from a date directory (f.ex https://dumps.wikimedia.org/idwiki/20170801/ ) instead of just https://dumps.wikimedia.org/latest/...

eflyjason added a comment. Edited Dec 30 2017, 11:22 AM

I assume that with

> If the same file already exists in the folder:
>
>   • If the filename doesn't contain latest, it shouldn't be downloaded again.
>   • Otherwise, add the current date as a suffix to the name.

the dump file can be re-downloaded every day?

However, as stated on https://dumps.wikimedia.org,

> These snapshots are provided at the very least monthly and usually twice a month.

which means the user will usually still be downloading the same file the next day.


One solution is T183789: download_dump.py: Support for "date specified" dumps. But we will have to make sure that, when the user doesn't specify -revision, the script first finds the latest date from https://dumps.wikimedia.org/frwiki/, then downloads the latest file in that folder (e.g. https://dumps.wikimedia.org/frwiki/20171220/frwiki-20171220-abstract.xml.gz).

After the download, the script will have to link a -latest file (e.g. frwiki-latest-abstract.xml.gz) to the downloaded file (e.g. frwiki-20171220-abstract.xml.gz) for other scripts' automated use.

Then, every time the user runs the script without -revision, check whether the file that frwiki-latest-abstract.xml.gz links to has the same date as the latest date on the website.

If the dates are not equal, download the new file and relink the -latest file to it.

However, one problem would be how we can (or whether we have to) manage all the non-latest files downloaded before.

Or is there any simpler solution?
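
For reference, a rough sketch of the flow proposed in the comment above, using only the standard library; the URL layout follows the example links, and the helper names are hypothetical:

```python
import os
import re
from urllib.request import urlopen, urlretrieve

DUMPS = 'https://dumps.wikimedia.org'

def latest_dated_revision(project):
    """Find the newest dated directory for a project, e.g. '20171220' for frwiki."""
    index = urlopen('{}/{}/'.format(DUMPS, project)).read().decode('utf-8')
    return max(re.findall(r'href="(\d{8})/"', index))

def download_latest(project, suffix, storepath):
    """Download the latest dated dump and point a -latest name at it."""
    rev = latest_dated_revision(project)
    dated = '{}-{}-{}'.format(project, rev, suffix)       # frwiki-20171220-abstract.xml.gz
    latest = '{}-latest-{}'.format(project, suffix)       # frwiki-latest-abstract.xml.gz
    dated_path = os.path.join(storepath, dated)
    latest_path = os.path.join(storepath, latest)
    if not os.path.exists(dated_path):                    # already current: nothing to fetch
        urlretrieve('{}/{}/{}/{}'.format(DUMPS, project, rev, dated), dated_path)
    if os.path.islink(latest_path) or os.path.exists(latest_path):
        os.remove(latest_path)                            # relink -latest to the new file
    os.symlink(dated, latest_path)
    return dated_path
```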

@eflyjason

I think I didn't write the task description clearly, sorry about that. I want to clarify everything I've written, based on what I'm thinking now.

I meant: if the user doesn't specify the -revision parameter, the script will download from the latest dump, f.ex (https://dumps.wikimedia.org/idwiki/latest/).
Then the script will add the current date as a suffix to the filename, f.ex (idwiki-latest-abstract-30122017.xml).

If the user runs the script again with the same filename and without the -revision parameter (or with the revision manually set to latest), and a file already exists whose date suffix equals the current date, the script will not download the file again; if the suffix differs from the current date, the script will download the file again and use the current date as its suffix.
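
A minimal sketch of that rule, assuming the date suffix is placed before the file extension as in the example above; the helper name and return shape are made up:

```python
import os
from datetime import date

def needs_download(filename, storepath):
    """Return (should_download, dated_filename) following the rule above."""
    today = date.today().strftime('%d%m%Y')        # matches the 30122017 example
    root, ext = os.path.splitext(filename)         # 'idwiki-latest-abstract', '.xml'
    dated = '{}-{}{}'.format(root, today, ext)     # 'idwiki-latest-abstract-30122017.xml'
    return not os.path.exists(os.path.join(storepath, dated)), dated
```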

Framawiki added a comment. Edited Dec 30 2017, 10:51 PM

I think solving first T183789: download_dump.py: Support for "date specified" dumps is the best solution.
Note also that T183675: download_dump.py: Make download process atomic was merged, and most of the code was updated: https://gerrit.wikimedia.org/r/#/c/400616/12/scripts/maintenance/download_dump.py

> Then the script will add the current date as a suffix to the filename, f.ex (idwiki-latest-abstract-30122017.xml).

You mean that the script will download the latest file from the server, and then rename it with a date?

> If the user runs the script again with the same filename and without the -revision parameter (or with the revision manually set to latest), and a file already exists whose date suffix equals the current date, the script will not download the file again; if the suffix differs from the current date, the script will download the file again and use the current date as its suffix.

But if the user downloads the file on another day and this file has the same name and the same content server-side, two files with exactly the same content will exist on the client computer, no?

> You mean that the script will download the latest file from the server, and then rename it with a date?

Yup

> But if the user downloads the file on another day and this file has the same name and the same content server-side, two files with exactly the same content will exist on the client computer, no?

Yeah, I think you're right.

Change 401191 had a related patch set uploaded (by Rafidaslam; owner: rafid):
[pywikibot/core@master] download_dump: Handle cases when the dump file already exists

https://gerrit.wikimedia.org/r/401191

(Sorry I wasn't reading the discussion here)
Copying my comment on the patch:

From a dump consumer's perspective, it would be great if the filename were as predictable as possible. With the current implementation, the consumer will have to check whether the file was downloaded today, yesterday, the day before yesterday, etc., until it finds one that exists.

Honestly, T183667#3864150 would IMO be a better implementation, but yes, it risks old dumps wasting all the storage. Perhaps we could add functionality to this script that removes old dumps, only when a flag is specified (or not specified) and/or after asking for permission before removing (except when -always is specified), and only when the revision is set to latest.

eflyjason added a comment. Edited Jan 1 2018, 3:29 AM

Perhaps we could do something like the following (see the sketch below the list):

  • download_dump -filename:... -storepath=... -> Download the latest. If we have a local file and the local one is not the latest one, prompt the user to choose Y/N on deleting all old revisions.
  • download_dump -filename:... -storepath=... -keep_old=False -> Download the latest. If the local one is not the latest one, delete all old revisions.
  • download_dump -filename:... -storepath=... -keep_old=True -> Download the latest. Keep the old files too.
  • download_dump -remove_outdated_files -> Delete all old revisions.
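
A hypothetical sketch of the cleanup part of those options; the -keep_old semantics follow the list above, while the helper name and the file-name pattern are made up:

```python
import glob
import os

def handle_old_revisions(storepath, project, suffix, latest_rev, keep_old=None):
    """Apply the proposed -keep_old behaviour to dated dumps older than latest_rev."""
    pattern = os.path.join(storepath, '{}-*-{}'.format(project, suffix))
    outdated = [f for f in glob.glob(pattern)
                if latest_rev not in f and 'latest' not in f]
    if keep_old is True or not outdated:
        return                                    # -keep_old=True: keep everything
    if keep_old is None:                          # no flag given: ask first
        answer = input('Delete {} old dump file(s)? [y/N] '.format(len(outdated)))
        if answer.strip().lower() != 'y':
            return
    for path in outdated:                         # -keep_old=False, or the user agreed
        os.remove(path)
```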

Thanks for the ideas @eflyjason @zhuyifei1999

I agree with the -keep_old parameter.

BTW I wanna ask: is latest actually just a pointer to a dated revision (f.ex 20171220)? I've found that every file in the latest revision has a -rss.xml file that contains metadata linking the file to a dated revision. I think we can use that (see the sketch below).

So by using that, I think the implementation will be:

  • A file downloaded with the latest revision is renamed to its "real" dated revision, f.ex idwiki-latest-abstract.xml -> idwiki-20171103-abstract.xml, based on its -rss.xml file.
  • Then check whether a file with the same name already exists; if it does, the script will not download the dump again, and vice versa.
  • Then, if -keep_old is False, delete all previous dumps with the same name.
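
A sketch of the -rss.xml idea; the exact layout of the RSS metadata is an assumption here, and the helper names are made up, so treat this as an illustration only:

```python
import re
from urllib.request import urlopen

def real_revision(project, filename):
    """Return the dated revision the 'latest' file points to, e.g. '20171103'."""
    rss_url = 'https://dumps.wikimedia.org/{}/latest/{}-rss.xml'.format(project, filename)
    rss = urlopen(rss_url).read().decode('utf-8')
    match = re.search(r'/{}/(\d{{8}})'.format(project), rss)   # first dated link in the feed
    return match.group(1) if match else None

def dated_name(filename, revision):
    """idwiki-latest-abstract.xml -> idwiki-20171103-abstract.xml."""
    return filename.replace('latest', revision)
```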

D3r1ck01 moved this task from Backlog to Needs Review on the Pywikibot board. Nov 5 2018, 11:27 AM