
download_dump.py: Add a progress bar
Closed, ResolvedPublic

Description

Pywikibot is a Python-based framework to write bots for MediaWiki (more information).

Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/, and places the dump in a predictable directory for semi-automated use by other scripts and tests.

Since this script downloads big files, a progress bar or another progress indicator would be really useful.

You must avoid using external libraries for this. For reference, please read:

It would be even better if something like 105.2M/2.1G can be shown alongside the progress bar. (Not required)
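The "105.2M/2.1G" display suggested above can be produced with the standard library alone. Below is a minimal sketch; `human_size` is an illustrative name, not a function from the actual script:

```python
def human_size(num_bytes):
    """Return a short human-readable size string, e.g. '105.2M'."""
    for unit in ('B', 'K', 'M', 'G', 'T'):
        if num_bytes < 1024 or unit == 'T':
            return '%.1f%s' % (num_bytes, unit)
        num_bytes /= 1024.0

# e.g. '%s/%s' % (human_size(downloaded), human_size(total))
print('%s/%s' % (human_size(110310195), human_size(2254857830)))
```

Printing `human_size(downloaded)` next to `human_size(total)` after each chunk gives the suggested readout.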

You are expected to provide a patch in Wikimedia Gerrit. Documentation on Gerrit is available.

Event Timeline

Framawiki triaged this task as Normal priority. Dec 24 2017, 4:44 PM
Framawiki created this task.
Framawiki updated the task description.
Restricted Application added a subscriber: pywikibot-bugs-list. Dec 24 2017, 4:58 PM
Aklapper updated the task description. Dec 24 2017, 6:41 PM
Aklapper moved this task from Proposed tasks to Information needed on the Google-Code-in-2017 board.

Is there any existing implementation of a progress bar? Is some library recommended? I'd like to avoid reinventing a wheel in this task.

There's also a library that provides progress bars and other command-line helpers in Python: https://github.com/kennethreitz/clint

eflyjason added a comment.EditedDec 25 2017, 12:46 AM

If we are to use an external library, does that mean we have to add this library to requirements.txt?

We have to do task T183666: download_dump.py: Use response.iter_content first, as we cannot implement a progress bar without using iter_content.

I think it's better to avoid using an external library, as this can be done fairly easily without one.
https://stackoverflow.com/a/15645088/2603230 is a great example; see https://stackoverflow.com/questions/3160699/python-progress-bar for a more general approach.
Note that the first step is to get the file length.

eflyjason updated the task description. Dec 28 2017, 4:26 AM
eflyjason updated the task description. Dec 28 2017, 4:29 AM
Aklapper updated the task description. Dec 29 2017, 4:33 PM

Thanks everyone for the clarifications! Published as https://codein.withgoogle.com/tasks/6247319068475392/

I'm running into a problem with this task. From the line response = fetch(url, stream=True), I assumed that the file is not downloaded on that line; rather, the download begins only when the data is accessed. Therefore, I thought that I could put any download bar code inside of

for data in response.data.iter_content(100 * 1024):
    result_file.write(data)

To test this, I put pywikibot.output('test') right after result_file.write(data).
However, when I ran the script, the console paused for ~10 seconds and then printed my test message many times, extremely rapidly.

After this, I tested the line response = fetch(url, stream=True) by putting a print statement before it and right after it. The result was that the first one would print, then there would be a ~10 second wait, and then the second one would print.

All of this leads me to believe that response = fetch(url, stream=True) is actually downloading the whole file, and

for data in response.data.iter_content(100 * 1024):
    result_file.write(data)

is just writing the already-downloaded data to the file. As a result, I am confused about how to make a download bar. How can I check the download progress and report it to the user while the download is still occurring, if I can't put the code in the loop above?

@Ryan10145 can you try and see if https://stackoverflow.com/a/15645088/2603230 works first? If not, perhaps pywikibot.comms.http is not behaving the same way as requests and we'll need to look into that.

Yeah, I think pywikibot.comms.http.fetch(url, stream=True) is not behaving like requests.get(url, stream=True). Filed as https://phabricator.wikimedia.org/T183830

@eflyjason I tried it, and none of the messages printed until the download completed.

For now, I will just write the download bar as if the pywikibot.comms.http.fetch(url, stream=True) behaves like requests.get(url, stream=True), and then submit the patch when the issue is resolved.

Ryan10145 added a comment.EditedDec 31 2017, 9:14 PM

I downloaded the patch that was supposed to fix the streaming issue, but I am still running into issues with the streaming. First of all, when I test the functionality of pywikibot.comms.http.fetch(url, stream=True) from the pywikibot shell, it works as intended. I tested it using

def test_fetch():
    resp = pywikibot.comms.http.fetch('https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-abstract.xml', stream=True)
    for data in resp.data.iter_content(100 * 1024):
        sys.stdout.write('asd')
        sys.stdout.flush()

This produced the string 'asd' gradually as it began downloading the file.
However, when I go into download_dump.py, and I have the below code, it does not work.

response = fetch(url, stream=True)
for data in response.data.iter_content(100 * 1024):
    sys.stdout.write('asd')
    sys.stdout.flush()

What happens instead is that there is a ~10 second pause, and then the string 'asd' is printed many times, nearly instantaneously.
This leads me to believe that for some strange reason, fetch(url, stream=True) is still not streaming the data.
I don't know why this is occurring, and help would be greatly appreciated.

I was experimenting with the location of a test script, and I found something interesting. I used the following test script.

import sys
import pywikibot

sys.stdout.write('begin')
sys.stdout.flush()
response = pywikibot.comms.http.fetch('https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-abstract.xml', stream=True)
for data in response.data.iter_content(100 * 1024):
    sys.stdout.write('asd')
    sys.stdout.flush()

When I put it in the core directory of pywikibot, the script worked as intended. As the extremely large file downloaded, the console would constantly print 'asd'.

However, when I moved this test script into the scripts folder, the scripts/maintenance folder, or any other folder, it stopped working as intended. No errors appeared; the script would just print 'begin' and then pause while it downloaded the extremely large file. Since the file being downloaded is large, I did not wait for the download to finish, but I believe that pywikibot.comms.http.fetch('https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-abstract.xml', stream=True) breaks when the script is not placed in the core folder.

I discovered the error: I was running the script incorrectly. I was using python scripts/maintenance/download_dump.py -filename instead of python pwb.py maintenance/download_dump.py.
pywikibot.comms.http.fetch('https://dumps.wikimedia.org/idwiki/latest/idwiki-latest-abstract.xml', stream=True) works as intended when run correctly.

Ignore my comments above.

Change 401199 had a related patch set uploaded (by Ryan10145; owner: Ryan10145):
[pywikibot/core@master] Added Progress Bar for download_dump.py

https://gerrit.wikimedia.org/r/401199

Change 401199 merged by jenkins-bot:
[pywikibot/core@master] download_dump: Add a progress bar

https://gerrit.wikimedia.org/r/401199

Framawiki closed this task as Resolved.Jan 5 2018, 7:31 PM