Pywikibot is a Python-based framework for writing bots for MediaWiki.
Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/ and places it in a predictable directory for semi-automated use by other scripts and tests. The download is, however, not atomic:
(computing) Of an operation: guaranteed to either complete fully or have no effect at all, and to appear indivisible to concurrently running threads or processes.
This task consists of two parts:
- Make sure the target file is always a complete file: commit the file only if it has been completely downloaded (and verified with a checksum); discard it if anything goes wrong.
Consider a scenario where, on one machine, a bot that processes the dump starts right after the download script, and the processor is faster than the downloader. In the current implementation, the processor will eventually reach the end of the incomplete dump, and its behavior will often be undefined (e.g. a crash).
One implementation is to perform the download into a temporary file; in the wild, '<filename>.part' is often used. Move the temporary file over the target file if and only if the download is complete and verified, or delete it on failure. On *nix systems, currently running processors will not be disrupted: the file descriptors of their already-opened dump file will continue to point to the old, now-unlinked file, and they will happily read it without ever knowing a new version has arrived, until the dump file is reopened for another run.
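A minimal sketch of this approach (the helper name, URL handling, and checksum parameter are illustrative, not the actual download_dump.py code):

```python
import hashlib
import os
import urllib.request


def download_atomically(url, target, expected_sha1=None):
    """Download url into '<target>.part', then atomically move it into place.

    The partial file is deleted on any failure, so the target path only
    ever holds a complete, verified file (assumes a POSIX rename).
    """
    tmp = target + '.part'
    try:
        with urllib.request.urlopen(url) as response, open(tmp, 'wb') as f:
            sha1 = hashlib.sha1()
            while True:
                chunk = response.read(64 * 1024)
                if not chunk:
                    break
                f.write(chunk)
                sha1.update(chunk)
        if expected_sha1 is not None and sha1.hexdigest() != expected_sha1:
            raise IOError('checksum mismatch for %s' % url)
    except BaseException:
        # Discard the partial download on any error, then re-raise.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    # Atomic on POSIX: readers that already opened the old target keep
    # reading the old inode; new opens see the new file.
    os.replace(tmp, target)
```

`os.replace()` (rather than `shutil.move()`) is the key design choice: on the same filesystem it is a single rename(2), so no reader can ever observe a half-written target.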
- Make sure two downloaders do not write on the same partially-downloaded file at the same time.
Consider a scenario where two downloaders run at the same time. If their file descriptors point to the same inode, their write(2) calls will interleave in the same file, corrupting it and roughly doubling its size. If they instead point to different inodes under the same filename, the implementation suggested in part 1 will fail, because rename(2) operates on filenames, not inodes: the downloader that finishes first will move the other downloader's still-partial file into place and consider it committed. The target file then ends up in a partially complete state, and the behavior of dump consumers is again undefined.
One implementation is to take a lock on whichever file is being downloaded, and exit with an error if the will-be-written file is already locked. Another is to avoid reusing the same temporary filename entirely.
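The locking variant could be sketched as follows with flock(2), assuming a *nix system; the helper name is hypothetical, and the lock is advisory, so it only guards against other cooperating downloaders:

```python
import errno
import fcntl
import sys


def acquire_download_lock(part_path):
    """Take an exclusive, non-blocking flock(2) lock on the partial file.

    Returns the open file object, which must be kept open for as long as
    the lock is needed. Exits with an error if another downloader
    already holds the lock on the same file.
    """
    f = open(part_path, 'ab')
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            sys.exit('Another downloader is already writing %s' % part_path)
        raise
    return f
```

Because flock locks belong to the open file description, the lock is released automatically when the process exits or closes the file, so a crashed downloader cannot leave a stale lock behind.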
You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.