
Pywikibot: copy upload calls hang (ish) rather than time out
Open, Needs Triage · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Attempt to copy-upload a file from the Internet Archive using Pywikibot:
site.upload(filepage, source_url=file_url,
            comment=desc)
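
For context, a minimal self-contained version of the reproduction step might look like the sketch below; the wiki, file title, and source URL are placeholders, not the actual values from the report.

# Minimal reproduction sketch (hedged): the site, target title and source
# URL are hypothetical; only the site.upload() call mirrors the report.
import pywikibot

site = pywikibot.Site('commons', 'commons')
filepage = pywikibot.FilePage(site, 'File:Example scan.djvu')   # hypothetical target
file_url = 'https://archive.org/download/example/example.djvu'  # hypothetical source
desc = 'Copy upload from the Internet Archive'

# For a large file this call can sit in Pywikibot's retry/sleep loop for a
# very long time when the server answers with a 504.
site.upload(filepage, source_url=file_url, comment=desc)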

What happens?:

  • Even if the upload succeeds, the call still hangs:
File "./script.py", line 105, in upload
  self.site.upload(filepage, source_url=file_url,
File "/usr/lib/python3.9/site-packages/pywikibot/tools/_deprecate.py", line 404, in wrapper
  return obj(*__args, **__kw)
File "/usr/lib/python3.9/site-packages/pywikibot/site/_decorators.py", line 92, in callee
  return fn(self, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/pywikibot/site/_apisite.py", line 2869, in upload
  result = final_request.submit()
File "/usr/lib/python3.9/site-packages/pywikibot/data/api.py", line 1757, in submit
  response, use_get = self._http_request(use_get, uri, body, headers,
File "/usr/lib/python3.9/site-packages/pywikibot/data/api.py", line 1510, in _http_request
  self.wait()
File "/usr/lib/python3.9/site-packages/pywikibot/data/api.py", line 1892, in wait
  pywikibot.sleep(delay)
File "/usr/lib/python3.9/site-packages/pywikibot/__init__.py", line 1305, in sleep

It looks like the handler is catching the 504 that is often returned and retrying, which for a large file can take an extremely long time. However, during a copy upload a 504 is very common even though the file may still have uploaded successfully.

What should have happened instead?:

  • The call either succeeds, or times out and allows the script to continue.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.
Inductiveload renamed this task from "Pywikibot: copy upload calls hang rather than time out" to "Pywikibot: copy upload calls hang (ish) rather than time out". · Oct 15 2021, 7:47 AM
Inductiveload updated the task description.

Any idea how it could be validated that the upload was successful even if the status response was 504?

@Inductiveload: You may decrease the retry count by changing the max_retries config variable. This is also possible as a command line argument, either if your script uses the handle_args() function or with the pwb.py wrapper script, like

pwb.py -max_retries:2 <yourscript> [<youroptions>]

but pwb.py is not shipped with the Pywikibot PyPI package (yet).
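
If pwb.py is not available, a minimal sketch of the same idea from inside the script (the values are only illustrative, not recommendations):

# max_retries and retry_wait are standard pywikibot.config settings;
# setting them before making requests limits how long a failing call
# can spend in the retry/sleep loop.
from pywikibot import config

config.max_retries = 2   # give up after two retries instead of the default
config.retry_wait = 5    # seconds to wait before the first retry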

Does it help?

It helped a bit, though I have to handle errors very carefully around such calls.

I know the overall "fault" is at the MediaWiki end for returning a 504 for this at all. Perhaps there's a way to catch the 504 and "poll" for an existing file? Or maybe that should just be done in user code?
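
A rough sketch of that idea in user code, assuming Pywikibot eventually raises a server error once its retries are exhausted; the exception class, poll count, and interval here are assumptions, not a tested recipe.

import time
import pywikibot
from pywikibot.exceptions import ServerError

def upload_with_poll(site, filepage, file_url, desc, polls=6, interval=60):
    """Try a copy upload; if the server errors out, poll for the file."""
    try:
        return site.upload(filepage, source_url=file_url, comment=desc)
    except ServerError:
        # The copy upload may still finish server-side after a 504,
        # so check whether the page appears before giving up.
        for _ in range(polls):
            time.sleep(interval)
            # Re-create the page object to avoid any cached "missing" state.
            if pywikibot.FilePage(site, filepage.title()).exists():
                return True
        raise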

The problem is the 504 timeout, which occurs prior to other warnings like

upload_warnings = {
    # map API warning codes to user error messages
    # {msg} will be replaced by message string from API response
    'duplicate-archive':
        'The file is a duplicate of a deleted file {msg}.',
    'was-deleted': 'The file {msg} was previously deleted.',
    'emptyfile': 'File {msg} is empty.',
    'exists': 'File {msg} already exists.',
    'duplicate': 'Uploaded file is a duplicate of {msg}.',
    'badfilename': 'Target filename is invalid.',
    'filetype-unwanted-type': 'File {msg} type is unwanted type.',
    'exists-normalized': 'File exists with different extension as '
                         '"{msg}".',
    'bad-prefix': 'Target filename has a bad prefix {msg}.',
    'page-exists':
        'Target filename exists but with a different file {msg}.',

    # API-returned message string will be timestamps, not much use here
    'nochange': 'The upload is an exact duplicate of the current '
                'version of this file.',
    'duplicateversions': 'The upload is an exact duplicate of older '
                         'version(s) of this file.',
}

And retrying is the usual behaviour for this status code. But maybe we could check whether the file exists afterwards.

I notice that the Internet-Archive project tag was automatically added to this task. Does this bug happen only on archive.org links, or does it happen in other cases as well?

@Harej Indeed it is not exclusive to the IA; it's any file whose upload times out. The IA is, however, an excellent source of files that do that! T292954 is making progress, but it would be good not to have such a code path if avoidable.