download_dump.py: Make download process atomic
Closed, Resolved · Public

Description

Pywikibot is a Python-based framework for writing bots for MediaWiki.

Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/ and places the dump in a predictable directory for semi-automated use by other scripts and tests. The download process is, however, not atomic:

(computing) Of an operation: guaranteed to complete either fully or not at all while waiting in a pause, and running synchronously when called by multiple asynchronous threads.

This task shall include two parts:

  1. Make sure the target file is always a complete file: commit it if and only if the file is completely downloaded (and verified with a checksum), or discard it if anything goes wrong.

Consider a scenario where, on one machine, a bot that processes the dump starts right after the download script starts, and the dump processor is faster than the downloader. In the current implementation, the processor will eventually find the dump incomplete, and its behavior will often be undefined (e.g. a crash).

One implementation is to perform the download to a temporary file; in the wild, '<filename>.part' is often used. Move the temporary file over the target file if and only if the download is complete and verified, or delete the temporary file if anything fails. On *nix implementations, currently running processors will not be disrupted: the file descriptors of their opened dump file will continue to point to the old, now-unlinked file, and they will happily read from it without ever knowing a new version has arrived, until the dump file is reopened for another run.
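
A minimal sketch of that approach (for illustration only: urlopen stands in for Pywikibot's own HTTP layer, and the function name is hypothetical):

import os
import shutil
from urllib.request import urlopen  # stand-in; the real script downloads via Pywikibot

def download_atomically(url, target):
    """Download url into target, committing only a complete file."""
    part = target + '.part'
    try:
        with urlopen(url) as response, open(part, 'wb') as f:
            shutil.copyfileobj(response, f)
        # checksum verification would go here, before the commit
        os.replace(part, target)  # atomic commit; see the py2 caveats below
    except BaseException:
        # discard the partial file on any failure, including Ctrl-C
        if os.path.exists(part):
            os.remove(part)
        raise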

  2. Make sure two downloaders do not write to the same partially-downloaded file at the same time.

Consider a scenario where two downloaders run at the same time. If their file descriptors point to the same inode, their write(2) calls will go to the same file, corrupting it and doubling its size. If they instead point to different inodes under the same filename, the implementation suggested in part 1 will fail, because moving a file with rename(2) operates on the filename, not the inode: the early-finishing downloader will consider the dump being downloaded by the late-starting downloader complete and commit it. If the downloaders are different processes, the file will end up in a partially complete state, and dump consumers hit undefined behavior.

One implementation is to perform file locking on whichever file is being written, and exit with an error if the to-be-written file is locked. Another is to avoid identical filenames entirely.
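
For the locking option, a POSIX-only sketch using fcntl (an assumption for illustration; Windows would need msvcrt.locking instead, which is part of why cross-platform locking is painful, as discussed below):

import errno
import fcntl  # POSIX-only

def lock_or_exit(fileobj):
    """Take an exclusive advisory lock, or bail out if another downloader holds one."""
    try:
        fcntl.flock(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError) as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):
            raise SystemExit('another downloader is writing this file')
        raise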

You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.

Event Timeline

zhuyifei1999 triaged this task as Normal priority. Dec 25 2017, 1:09 AM
zhuyifei1999 created this task.
zhuyifei1999 updated the task description. Dec 25 2017, 1:15 AM
eflyjason updated the task description. Dec 25 2017, 1:19 AM

(clarification: by iff I meant if and only if)

eflyjason updated the task description. Dec 25 2017, 1:22 AM

> (clarification: by iff I meant if and only if)

Sorry 😅

Thanks for proofreading though :)

> One implementation is to perform file locking on whichever file is being written, and exit with an error if the to-be-written file is locked. Another is to avoid identical filenames entirely.

Based on https://stackoverflow.com/q/489861/2603230, it seems that locking a file is really complicated, especially when this script has to be cross-platform. I guess downloading the file under a name like '<filename>-2017-12-25-09-27-23.part' would be better if we want to avoid identical filenames entirely.
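
A sketch of such a naming scheme (the helper name is hypothetical; mixing in the PID is an extra guard, beyond the proposal above, against two downloaders starting within the same second):

import datetime
import os

def unique_part_name(filename):
    """Return a per-download temp name like '<filename>-2017-12-25-09-27-23-1234.part'."""
    stamp = datetime.datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S')
    return '{}-{}-{}.part'.format(filename, stamp, os.getpid())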

I just realized that os.rename fails on Windows if the target exists, and os.replace does not exist on Python 2. One might have to use os.unlink / os.remove and then os.rename on Windows. I guess it's safe to assume that the time difference between the calls is small enough that no dump consumers are started in the window. Example import mess:

import os
import sys

try:
    from os import replace
except ImportError:  # py2 has no os.replace
    if sys.platform == 'win32':
        def replace(src, dst):
            """Rename src to dst, removing an existing dst first."""
            try:
                os.rename(src, dst)
            except OSError:
                os.remove(dst)
                os.rename(src, dst)
    else:
        # on POSIX, rename() already overwrites the destination atomically
        from os import rename as replace

os.remove will still cause an error on Windows if the file is in use, but I do not know of a way to work around that.

eflyjason added a comment. Edited Dec 26 2017, 8:44 AM

The two parts can actually be combined:

  1. Download/link the source dump file to a temporary file named '<filename>-<time>.part'.
  2. After the download/link finishes and the md5 checksum is verified, use the replace function provided by @zhuyifei1999 to replace the current local file with the temporary file (see the sketch below).
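
Putting the two together (download_to and verify_md5 are hypothetical placeholders; unique_part_name and replace are the sketches from the comments above):

part = unique_part_name(file_final_storepath)
download_to(url, part)               # hypothetical: stream the dump into the temp file
verify_md5(part)                     # hypothetical: raise if the checksum does not match
replace(part, file_final_storepath)  # commit the completed download atomically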

Ryan10145 added a subscriber: Ryan10145. Edited Dec 28 2017, 10:28 PM

I'm trying to run this script on my computer so I can do this task, but whenever I try to download the abstract.xml file, I get an HTTP response status 404 error. I am using a computer running Windows 7.
This is my console output without -v:

python scripts/maintenance/download_dump.py -filename
abstract.xml

Enter the filename: Downloading dump from enwiki
Downloading file from https://dumps.wikimedia.org/enwiki\latest\enwiki-latest-abstract.xml
WARNING: Http response status 404

This is my console output with -v:

python scripts/maintenance/download_dump.py -filename -v
abstract.xml

WARNING: C:\Users\Chang\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pywikibot\version.py:115: DeprecationWarning: pywikibot.version.getversion_svn is deprecated; use getversion_svn_setuptools instead.
  (tag, rev, date, hsh) = vcs_func(_program_dir)

WARNING: C:\Users\Chang\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pywikibot\version.py:266: DeprecationWarning: pywikibot.version.svn_rev_info is deprecated; use getversion_svn_setuptools instead.
  tag, rev, date = svn_rev_info(_program_dir)

Pywikibot rede13695872939f4cedbb139bd03ead318ef64cb
Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 17:26:49) [MSC v.1900 32 bit (Intel)]
Enter the filename: Downloading dump from enwiki
Downloading file from https://dumps.wikimedia.org/enwiki\latest\enwiki-latest-abstract.xml
WARNING: Http response status 404
Found 1 wikipedia:en processes running, including this one.
Dropped throttle(s).
Closing network session.
Network session closed.
Dalba added a subscriber: Dalba. Edited Dec 28 2017, 10:42 PM

> python scripts/maintenance/download_dump.py -filename
> abstract.xml
> Enter the filename: Downloading dump from enwiki
> Downloading file from https://dumps.wikimedia.org/enwiki\latest\enwiki-latest-abstract.xml
> WARNING: Http response status 404

Well, https://dumps.wikimedia.org/enwiki\latest\enwiki-latest-abstract.xml returns a 404 response. Replace those backslashes with forward slashes and it should work. We should not use os.path.join to create a URL. This is a bug and should be easy to fix.

Thank you for the help. I didn't see this before because my web browser automatically corrected the slashes when I clicked on it.

> Well, https://dumps.wikimedia.org/enwiki\latest\enwiki-latest-abstract.xml returns a 404 response. Replace those backslashes with forward slashes and it should work. We should not use os.path.join to create a URL. This is a bug and should be easy to fix.

/me hates windows using backslashes...

Anyways, we could perhaps use urljoin (py2, py3), or str.format? I personally dislike a ton of string concatenation.
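
For illustration, a sketch of the urljoin route (a hypothetical fix, not necessarily what the patch will do), which avoids the platform-dependent separator that os.path.join introduces:

try:
    from urllib.parse import urljoin  # py3
except ImportError:
    from urlparse import urljoin      # py2

# os.path.join uses '\' on Windows, producing the broken URL above;
# urljoin always joins with '/':
url = urljoin('https://dumps.wikimedia.org/',
              'enwiki/latest/enwiki-latest-abstract.xml')
# -> 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml'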

Change 400616 had a related patch set uploaded (by Ryan10145; owner: Ryan10145):
[pywikibot/core@master] Made download_dump.py download process atomic

https://gerrit.wikimedia.org/r/400616

I just submitted a patch, but I'll try to make the change.

A remark for a future patch: do you think it's possible to remove this big loop, which makes the code difficult to understand? Perhaps by using a function? @zhuyifei1999 @Ryan10145

Change 400616 merged by jenkins-bot:
[pywikibot/core@master] Made download_dump.py download process atomic

https://gerrit.wikimedia.org/r/400616

Ryan10145 closed this task as Resolved. Dec 31 2017, 2:11 AM

@Framawiki I can work on it, but I'm not sure if it would be right to do it at this moment, since the patches for T183664 and T183667 would be affected. I'll try to get this done when there aren't any open patches affecting download_dump.py.

> A remark for a future patch: do you think it's possible to remove this big loop, which makes the code difficult to understand? Perhaps by using a function? @zhuyifei1999 @Ryan10145

I personally find a loop the 'intuitive' way to write repeating code. Do you have an example of how you would do it? I'd say functions can make the process more 'functional', at the cost that all the environment has to be passed in.

@zhuyifei1999 I was thinking of something along the lines of

def download_dump(filepath, is_atomic):
    try:
        ...  # put the download code here
    except (OSError, IOError):
        ...  # exception handling / remove the partial file
        if is_atomic:
            # print an error, then retry non-atomically
            return download_dump(file_final_storepath, False)
        else:
            return False
    return True

# in the caller:
if download_dump(file_temp_storepath, True):
    # print that the file was stored
    return
else:
    return False

However, if you don't think this would be a good idea or if there are any flaws with this, that's fine.

^ is recursion? Instead of

if download_dump(file_temp_storepath, True):
    # print that the file was stored
    return
else:
    return False

and recursing inside you could do something like:

return (download_dump(file_temp_storepath, True) or
        download_dump(file_final_storepath, False))

The current looping implementation is like:

filepath = file_temp_storepath
for non_atomic in range(2):
    try:
        ...  # put the download code here
    except (OSError, IOError):
        ...  # exception handling / remove the partial file
        if not non_atomic:
            # print an error
            filepath = file_final_storepath
        else:
            return False
    else:
        return

The implementation in my head is:

filepath = file_temp_storepath
while True:
    try:
        ...  # put the download code here
    except (OSError, IOError):
        ...  # exception handling / remove the partial file
        if filepath == file_final_storepath:
            return False
        filepath = file_final_storepath
    else:
        return

Good point, it is probably best to avoid recursion and instead use or. However, I feel that the while loop implementation is still somewhat difficult to understand, especially for a newer programmer. @Framawiki, what are your thoughts on which one to use?