Page MenuHomePhabricator

Impossible to upload a file that ends with the '\r' byte using Pywikibot
Open, LowPublic

Description

The file in question: https://v2c.wmflabs.org/temp.webm, using pywikibot chunked uploading (stash=true, async=false):

1$ python debug.py -log -verbose -debug
2WARNING: /usr/local/lib/python2.7/dist-packages/pywikibot/version.py:101: DeprecationWarning: pywikibot.version.getversion_svn is deprecated; use getversion_svn_setuptools instead.
3 (tag, rev, date, hsh) = vcs_func(_program_dir)
4
5WARNING: /usr/local/lib/python2.7/dist-packages/pywikibot/version.py:252: DeprecationWarning: pywikibot.version.svn_rev_info is deprecated; use getversion_svn_setuptools instead.
6 tag, rev, date = svn_rev_info(_program_dir)
7
8Pywikibot rc80c9e184da41f3a73769e71d4a6b19d62078229
9Python 2.7.9 (default, Mar 1 2015, 12:57:24)
10[GCC 4.9.2]
11
12Found 1 commons:commons processes running, including this one.
13
14Sleeping for 9.8 seconds, 2016-04-14 11:53:42
15Submitting unthrottled action 'upload'.
16Submitting unthrottled action 'upload'.
17Submitting unthrottled action 'upload'.
18Submitting unthrottled action 'upload'.
19Submitting unthrottled action 'upload'.
20Submitting unthrottled action 'upload'.
21Submitting unthrottled action 'upload'.
22Submitting unthrottled action 'upload'.
23Submitting unthrottled action 'upload'.
24Submitting unthrottled action 'upload'.
25Submitting unthrottled action 'upload'.
26Submitting unthrottled action 'upload'.
27Submitting unthrottled action 'upload'.
28Submitting unthrottled action 'upload'.
29Submitting unthrottled action 'upload'.
30Submitting unthrottled action 'upload'.
31Submitting unthrottled action 'upload'.
32Submitting unthrottled action 'upload'.
33Submitting unthrottled action 'upload'.
34Submitting unthrottled action 'upload'.
35Submitting unthrottled action 'upload'.
36Submitting unthrottled action 'upload'.
37Submitting unthrottled action 'upload'.
38Submitting unthrottled action 'upload'.
39Submitting unthrottled action 'upload'.
40Submitting unthrottled action 'upload'.
41Submitting unthrottled action 'upload'.
42Submitting unthrottled action 'upload'.
43Submitting unthrottled action 'upload'.
44Old offset: 121634816; Returned offset: 125164627; Chunk size: 3529812
45WARNING: Unexpected offset.
46Submitting unthrottled action 'upload'.
47WARNING: API error chunk-too-small: Minimum chunk size is 1024 bytes for non-final chunks
48API Error: query=
49u"{u'ignorewarnings': [False], u'maxlag': ['5'], u'format': [u'json'], u'filekey': [u'13y76lyal8og.e9uyal.1803702.webm'], u'stash': [True], u'filename': [u'Bernie Supporters at the Bernie Sanders rally in Philadelphia, PA, April 2016.webm'], u'assert': [u'user'], u'token': [u'0d2cab<REDACTED>570f84c6+\\\\'], u'filesize': [125164628], u'offset': [125164627], u'action': [u'upload']}"
50 response=
51{u'servedby': u'mw1206', u'error': {u'info': u'Minimum chunk size is 1024 bytes for non-final chunks', u'code': u'chunk-too-small', u'help': u'See https://commons.wikimedia.org/w/api.php for API usage'}}
52Traceback (most recent call last):
53 File "debug.py", line 38, in <module>
54 site.upload(page, source_filename=filename, comment=comment, text=filedesc, chunk_size=chunked, ignore_warnings=['exists-normalized'])
55 File "/usr/local/lib/python2.7/dist-packages/pywikibot/tools/__init__.py", line 1370, in wrapper
56 return obj(*__args, **__kw)
57 File "/usr/local/lib/python2.7/dist-packages/pywikibot/site.py", line 5871, in upload
58 raise error
59pywikibot.data.api.APIError: chunk-too-small: Minimum chunk size is 1024 bytes for non-final chunks [help:See https://commons.wikimedia.org/w/api.php for API usage]
60Dropped throttle(s).
61<class 'pywikibot.data.api.APIError'>
62CRITICAL: Closing network session.
63Network session closed.

12016-04-14 11:55:26 api.py, 1980 in submit: DEBUG API response received from commons:commons:
2{"upload":{"result":"Continue","offset":125164627,"filekey":"13y76lyal8og.e9uyal.1803702.webm"}}
32016-04-14 11:55:26 site.py, 5896 in upload: VERBOSE Old offset: 121634816; Returned offset: 125164627; Chunk size: 3529812
42016-04-14 11:55:26 site.py, 5897 in upload: WARNING Unexpected offset.
52016-04-14 11:55:26 api.py, 1473 in __init__: DEBUG Adding user assertion
62016-04-14 11:55:26 api.py, 1925 in submit: VERBOSE Submitting unthrottled action 'upload'.
72016-04-14 11:55:26 api.py, 1947 in submit: DEBUG API request to commons:commons (uses get: False):
8Headers: {'Content-Type': u'multipart/form-data; boundary="===============6100145239360526832=="', 'MIME-Version': '1.0'}
9URI: u'/w/api.php'
10Body: '--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="maxlag"\n\n5\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="format"\n\njson\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="filekey"\n\n13y76lyal8og.e9uyal.1803702.webm\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="stash"\n\n\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="filename"\n\nBernie Supporters at the Bernie Sanders rally in Philadelphia, PA, April 2016.webm\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="assert"\n\nuser\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="token"\n\n0d2ca<REDACTED>84c6+\\\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="filesize"\n\n125164628\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="offset"\n\n125164627\n--===============6100145239360526832==\nContent-Type: text/plain\nMIME-Version: 1.0\nContent-disposition: form-data; name="action"\n\nupload\n--===============6100145239360526832==\nContent-Type: application/octet-stream\nMIME-Version: 1.0\nContent-disposition: form-data; name="chunk"; filename="FAKE-NAME"\nContent-Transfer-Encoding: binary\n\n\r\n--===============6100145239360526832==--\n'
112016-04-14 11:55:26 api.py, 1980 in submit: DEBUG API response received from commons:commons:
12{"servedby":"mw1206","error":{"code":"chunk-too-small","info":"Minimum chunk size is 1024 bytes for non-final chunks","*":"See https://commons.wikimedia.org/w/api.php for API usage"}}
132016-04-14 11:55:26 api.py, 2087 in submit: WARNING API error chunk-too-small: Minimum chunk size is 1024 bytes for non-final chunks
142016-04-14 11:55:26 api.py, 2177 in submit: VERBOSE API Error: query=
15u"{u'ignorewarnings': [False], u'maxlag': ['5'], u'format': [u'json'], u'filekey': [u'13y76lyal8og.e9uyal.1803702.webm'], u'stash': [True], u'filename': [u'Bernie Supporters at the Bernie Sanders rally in Philadelphia, PA, April 2016.webm'], u'assert': [u'user'], u'token': [u'0d2ca<REDACTED>84c6+\\\\'], u'filesize': [125164628], u'offset': [125164627], u'action': [u'upload']}"
162016-04-14 11:55:26 api.py, 2179 in submit: VERBOSE response=
17{u'servedby': u'mw1206', u'error': {u'info': u'Minimum chunk size is 1024 bytes for non-final chunks', u'code': u'chunk-too-small', u'help': u'See https://commons.wikimedia.org/w/api.php for API usage'}}
182016-04-14 11:55:26 __init__.py, 886 in _flush: DEBUG _flush() called
192016-04-14 11:55:26 __init__.py, 921 in _flush: VERBOSE Dropped throttle(s).
202016-04-14 11:55:26 http.py, 97 in _flush: CRITICAL Closing network session.
212016-04-14 11:55:26 http.py, 101 in _flush: VERBOSE Network session closed.

Event Timeline

Restricted Application added subscribers: Steinsplitter, Aklapper. · View Herald Transcript

This error seems to be constantly reproducible for this specific file (none of the upload attempts succeeded), but it was not previously encountered for other files.

I downloaded it from https://v2c.wmflabs.org/temp.webm and the file is 125164628 bytes for me, not 125164627.

$ du --bytes temp.webm
125164628       temp.webm

$ sha1sum temp.webm
e25042092c55f6ab12b178323e75d471cbc009c5 *temp.webm

Pywikibot seems to also think the file is 125164628 bytes (u'filesize': [125164628]), but it can't read/upload the last byte? First it sends a chunk one byte smaller than it claims (resulting in 'Unexpected offset' log warning from Pywikibot), then tries to send the last byte again but sends zero bytes (resulting in 'chunk-too-small' error from the API).

I don't have Pywikibot set up, but I was able to upload the file to stash to Commons with UploadWizard.

Indeed:

01:45:56 0 ✓ zhuyifei1999@encoding01: ~$ ls -al /srv/v2c/ssu/temp.webm
-rw-r--r-- 1 root root 125164628 Apr 14 12:10 /srv/v2c/ssu/temp.webm

I guess that was my error copy-pasting.

zhuyifei1999 renamed this task from Uploading 125164627-byte file using 4MiB chunked uploading results in unexpected offset on final chunk (1 byte less than file size) then chunk-too-small error to Uploading 125164628-byte file using 4MiB chunked uploading results in unexpected offset on final chunk (1 byte less than file size) then chunk-too-small error.Apr 14 2016, 1:48 PM

The interesting thing is, the last byte is a \r character:

$ cat /srv/v2c/ssu/temp.webm | xxd -b | tail
775dc1c: 10001101 11110111 10000001 00000001 11110001 10000100  ......
775dc22: 00000110 00000110 01111011 00111111 11110000 10000010  ..{?..
775dc28: 00000010 00100011 10111011 10010011 10110011 10000010  .#....
775dc2e: 11011000 11100101 10110111 10001101 11110111 10000001  ......
775dc34: 00000001 11110001 10000100 00000110 10001011 00010100  ......
775dc3a: 11011111 11110000 10000010 00000010 00001110 10111011  ......
775dc40: 10010011 10110011 10000010 11100111 10100000 10110111  ......
775dc46: 10001101 11110111 10000001 00000001 11110001 10000100  ......
775dc4c: 00000110 11111111 11011000 11101101 11110000 10000010  ......
775dc52: 00000010 00001101

I guess this is a pywikibot bug (or a python bug).

It could potentially be a WMF severs' misconfiguration, too.

The log you pasted above seems to have it handled correctly:

Content-disposition: form-data; name="chunk"; filename="FAKE-NAME"\nContent-Transfer-Encoding: binary\n\n\r\n

Note the '\r'.

Oooh, wait. I'm pretty sure that according to the spec, the request body should be using \r\n (CRLF) for line breaks, not just \n (LF). I guess the server is permissive and accepts it anyway, and in this case it can't tell that the '\r' wasn't meant to be part of the line break.

http://stackoverflow.com/questions/3086860/how-do-i-generate-a-multipart-mime-message-with-correct-crlf-in-python

jayvdb triaged this task as High priority.Apr 14 2016, 6:09 PM
matmarex renamed this task from Uploading 125164628-byte file using 4MiB chunked uploading results in unexpected offset on final chunk (1 byte less than file size) then chunk-too-small error to Impossible to upload a file that ends with the '\r' byte using Pywikibot.Apr 14 2016, 6:12 PM

I thought of some ways to fix/workaround this issue:

  • WONTFIX for python 2 and move on to python 3 - really? Tool labs still have python 2 by default.
  • Use the hack mentioned in StackOverflow (replace every \n with \r\n) - since we are working with binary data, this isn't desirable either.
  • Code our own MIMEMultipart - I'm not an expert on how http transfers file data, so idk how difficult this would be.
  • Base64 encode file contents - This would mean an increase in the bytes that are transferred for each upload, so users might feel a longer upload time.

Base64 encode file contents - This would mean an increase in the bytes that are transferred for each upload, so users might feel a longer upload time.

Since this is only a problem for files/chunks ending with '\r', I think it'd be a reasonable workaround if applied only in this case.

@Anomie is base64 encode supported by api action=upload?

Change 636164 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] [bugfix] upload a file that ends with the '\r' byte

https://gerrit.wikimedia.org/r/636164

Change 636164 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] [bugfix] upload a file that ends with the '\r' byte

https://gerrit.wikimedia.org/r/636164

Added to my site.py, but did not seem to stop the bug. Not clear to me why, perhaps initial conditions not met. Instead my work-around of removing the last byte was still triggered by receiving the error.

If you want examples to test out, there are over 200 PDFs now listed at https://commons.wikimedia.org/wiki/Category:Work_around_of_byte_trimmed_for_chunk-too-small_API_error where the original PDFs are not uploaded. The target could be the same files, just try to overwrite with the true original PDFs.

@Fae, strange, it worked for me, I tried this file and the byte was added.
https://commons.wikimedia.org/w/index.php?title=File%3ABritish_Transport_Commission_Act_1949_%28UKLA_1949-29%29.pdf&type=revision&diff=500900036&oldid=500838588

site = pywikibot.Site('commons', 'commons')
page = pywikibot.FilePage(site, 'File:British_Transport_Commission_Act_1949_(UKLA_1949-29).pdf')
site.loadimageinfo(page, history=True)

{'timestamp': '2020-10-25T13:52:37Z',  'user': 'Mpaa', 'size': 4873007,  'width': 1239,  'height': 1754,  'pagecount': 44, ...
{'timestamp': '2020-10-25T13:00:08Z',  'user': 'Fæ',  'size': 4873006,  'width': 1239,  'height': 1754,  'pagecount': 44, ...

Change 636164 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] upload a file that ends with the '\r' byte

https://gerrit.wikimedia.org/r/636164

Xqt lowered the priority of this task from High to Low.Oct 25 2020, 7:15 PM

@Anomie is base64 encode supported by api action=upload?

Not in itself, but you can use MIME's Content-Transfer-Encoding to encode the data as base64 rather than binary.

Since I no longer work for the WMF, I'm not inclined to do much to do their work for them at this time.

@Anomie is base64 encode supported by api action=upload?

Not in itself, but you can use MIME's Content-Transfer-Encoding to encode the data as base64 rather than binary.

Since I no longer work for the WMF, I'm not inclined to do much to do their work for them at this time.

OK, thanks.
As info for future readers, I tried but if I set base64 there is a mismatch with filesize. At the end it gives an error saying that filesize does not match (as if it is adding the size of the base64-encoded chunks, maybe, and then it gets more data than declared in filesize).