
Large PDF upload issue
Closed, Resolved · Public

Description

For the past day I have had issues with Internet Archive uploads getting stuck in a retry loop. I say 'stuck' because letting the process run overnight to see what happens did not resolve anything. However, some files were actually uploaded, yet for an unknown reason the API never returned a success message.

The upload uses the standard Pywikibot site.upload call, and the API returns retry messages like the following (this example, an upload of roughly 190 MB, has been attempted several times without any success):

Catalog of Copyright Entries, 1978 Serials & Periodicals (IA catalogofcopyri12libr).pdf 
 https://archive.org/download/catalogofcopyri12libr/catalogofcopyri12libr.pdf 189.8M 
Sleeping for 15.4 seconds, 2020-06-04 11:41:39
WARNING: Waiting 5 seconds before retrying.
WARNING: Http response status 503
WARNING: Non-JSON response received from server commons:commons; the server may be down.
WARNING: Waiting 10 seconds before retrying.
WARNING: Waiting 20 seconds before retrying.
WARNING: Waiting 40 seconds before retrying.
WARNING: Waiting 80 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.

This may be a previously reported error, but I am unaware of any workaround or fix.

Also raised for community comment at the Commons technical Village Pump, along with further example files.

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

If the script sends at least one request to Wikidata, then this is caused by Wikidata lag (last week no bot was able to edit Wikidata with correct bot settings).

These are simple uploads to Commons, so they don't touch Wikidata; in fact, they don't even use Wikidata links.

I tried copying the files locally to my creaky laptop and then uploading, which did not solve the problem; it returned "MemoryError" messages instead. I also tried limiting uploads to <80 MB and then <50 MB, and though uploads succeed more often, the process still gets stuck in retry loops on some files, for example https://archive.org/download/catalogofcopyri2112libr/catalogofcopyri2112libr.pdf, which is reported as 41.3 MB.
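
For what it's worth, the MemoryError during the local-copy step can usually be avoided by streaming the download to disk instead of reading the whole response into memory. A minimal sketch using requests; the function name and paths are illustrative placeholders, not part of the actual script:

    import requests

    def stream_to_disk(url, path, chunk_size=1 << 20):
        # Download a large archive.org PDF in 1 MiB pieces so the whole
        # file is never held in memory at once.
        with requests.get(url, stream=True, timeout=120) as r:
            r.raise_for_status()
            with open(path, 'wb') as f:
                for block in r.iter_content(chunk_size=chunk_size):
                    f.write(block)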

Then I would say this is a classic Commons rate-limit issue with large files. Not sure what to do about it.

Some of these files do eventually upload, but they are not flagged as such, and the API appears to keep the 'retrying' loop going. The example below is IA catalogoftitleen33018libr, which seems to have taken 48 minutes to actually upload; the process then keeps running for another hour before generating a (technically correct) error that the file already exists on the WMF servers, because the same process had in fact completed the upload some time before.

Wikimedia Commons is effectively broken if uploading simple PDFs is unreliable, and even a "successful" upload might take 2 hours for a 43 MB file.

669/674 Catalog of Title Entries of Books Etc. July 3-Dec 30 1893 5, Nos. 105-130 12:19:43 
 Catalog of Title Entries of Books Etc. July 3-Dec 30 1893 5, Nos. 105-130 (IA catalogoftitleen33018libr).pdf 
 https://archive.org/download/catalogoftitleen33018libr/catalogoftitleen33018libr.pdf 43.1M 
Sleeping for 7.5 seconds, 2020-06-07 12:19:48
WARNING: Waiting 5 seconds before retrying.
WARNING: Waiting 10 seconds before retrying.
WARNING: Waiting 20 seconds before retrying.
WARNING: Waiting 40 seconds before retrying.
WARNING: Waiting 80 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Http response status 503
WARNING: Non-JSON response received from server commons:commons; the server may be down.
WARNING: Waiting 120 seconds before retrying.
 Maximum retries attempted without success. 
 Maximum retries attempted without success. 
WARNING: Waiting 5 seconds before retrying.
WARNING: Waiting 10 seconds before retrying.
WARNING: Waiting 20 seconds before retrying.
WARNING: Waiting 40 seconds before retrying.
WARNING: Waiting 80 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: Waiting 120 seconds before retrying.
WARNING: ./scripts/ia_copyrightrecords.py:100: UserWarning: The upload returned 2 warnings: nochange, exists
  ignore_warnings = False,

 nochange: The upload is an exact duplicate of the current version of this file.

Wrapping the upload script in an infinite loop and letting each attempt break on the first API error is a practical but crude brute-force workaround. However, it is incredibly slow, wasteful of processing time and bandwidth, and not a solution for the vast majority of Commons contributors.
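
For illustration, a minimal sketch of that brute-force loop, using the same upload parameters as the call shown further below; the function name, retry cap and sleep interval are my own placeholders rather than the actual script:

    import time
    import pywikibot

    site = pywikibot.Site('commons', 'commons')

    def upload_until_present(pagetitle, source_url, desc, comment, max_attempts=10):
        # Keep re-trying the upload until the target FilePage exists on Commons,
        # treating any API/HTTP error as a cue to pause and then check whether
        # the file actually arrived despite the error.
        page = pywikibot.FilePage(site, 'File:' + pagetitle)
        for attempt in range(1, max_attempts + 1):
            if page.exists():  # an earlier "failed" attempt may have completed
                return True
            try:
                site.upload(page,
                            source_url=source_url,
                            comment=comment,
                            text=desc,
                            ignore_warnings=False)
            except Exception as err:  # broad catch: 503s, APIError, duplicate warnings, ...
                pywikibot.warning('attempt {}: {}'.format(attempt, err))
                time.sleep(60)
        return page.exists()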

Here's what I found:

  1. The most common pattern is a long series of retries which, after a little over an hour, gives up with "Maximum retries attempted without success." In reality the file may have been uploaded during the retries, but the API failed to return a success message, and then also failed to return a 'duplicate' message when the script re-attempted the upload of the file it had just uploaded (a post-failure check like the sketch after this list can detect this case).
  2. Second most likely is a successful upload on the first run, but this becomes less likely as the file size grows. In the example of "catalogofcopyri12libr" there have been around 10 attempts to upload the 185 MB file, and it has yet to succeed (most of these attempts take an hour to time out).
  3. Third most likely: after a series of retries, the API returns a message like duplicate: Uploaded file is a duplicate of [u'Catalog_of_Copyright_Entries_1977_Books_and_Pamphlets_Jan-June.pdf']. Though technically correct, this message is a symptom of the earlier failure to report a successful upload.
  4. Lastly, an http-curl-error, which appears to be the Internet Archive falling over under these repeated mass requests.
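
As a follow-up to pattern 1, a post-failure check can distinguish "genuinely failed" from "uploaded but unreported". A minimal sketch, assuming a local copy of the file is available for hashing; the helper name is mine, not part of any script above:

    import hashlib
    import pywikibot

    site = pywikibot.Site('commons', 'commons')

    def landed_despite_error(pagetitle, local_path):
        # True if the file already exists on Commons and matches the local copy,
        # i.e. an earlier "failed" attempt actually completed.
        page = pywikibot.FilePage(site, 'File:' + pagetitle)
        if not page.exists():
            return False
        sha1 = hashlib.sha1()
        with open(local_path, 'rb') as f:
            for block in iter(lambda: f.read(1 << 20), b''):
                sha1.update(block)
        # latest_file_info.sha1 holds the SHA-1 of the current file version on the wiki
        return page.latest_file_info.sha1 == sha1.hexdigest()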

Here's an example of one of the parallel looping tasks we are running. This report of Internet Archive identifiers was taken after the second round of attempts, so the files in bold have timed out twice, while the others succeeded on the first or second attempt:

  • 2 catalogofcopyri12libr 189.8M
  • 9 catalogofcopyrig17libr 173.9M
  • 10 catalogofco11libr 217.7M
  • 11 catalogofcop13libr 136.6M
  • 13 catalogofcopyr11libr 199.4M
  • 17 1977booksandpamp33112libr 143.1M
  • 18 1977musicjanjune33152libr 106.8M
  • 19 1977booksandpamp33112library 128.1M
  • 29 1977booksandpamphle33111libr 128.8M
  • 30 1976worksofartja330711libr 83.3M
  • 33 catalogofcopyrig33051libr 91.6M
  • 37 catalogofcopyrig33012library 134.4M
  • 38 catalogofcopyrig33011library 122.1M
  • 39 catalogofcopyrig33012libr 139.5M
  • 40 catalogofcopyrig33011libr 128.2M
  • 41 catalogofcopyrig33051library 88.3M
  • 43 catalogofcopyrig33052libr 114.5M
  • 44 catalogofcopyrig33052library 109.9M
  • 45 catalogofcopyrig32912lib 131.4M
  • 49 catalogofcopyrig32952libr 103.3M

One conclusion would be that PDFs over 125 MB are highly unlikely to be successfully uploaded by ordinary volunteers, and significantly larger PDFs cannot be batch uploaded in any practical way by anyone at this time.

Let's ask the API team about the success/duplicate messages not being returned when expected.

A slight addition: File:Catalog of Copyright Entries 1937 Musical Compositions New Series Vol 32 Pt 3 For the Year 1937 (IA catalogofcopyrig323libr).pdf uploaded successfully on the first run and is 238 MB.

So significantly larger PDFs can be uploaded, but the stats support the hypothesis that the larger the PDF, the less likely it is that any given upload attempt will complete successfully.

This issue, at least the 503s, is reproducible with Special:Upload (not just the API).

Request from *** via cp1085 frontend, Varnish XID 40264702
Error: 503, Backend fetch failed at Sat, 13 Jun 2020 23:35:18 GMT
Response Headers
HTTP/2 503 Service Unavailable
date: Sat, 13 Jun 2020 23:35:18 GMT
server: Varnish
content-type: text/html; charset=utf-8
age: 0
x-cache: cp1085 int
x-cache-status: int-front
server-timing: cache;desc="int-front"
strict-transport-security: max-age=106384710; includeSubDomains; preload
x-client-ip: ****
cache-control: private, s-maxage=0, max-age=0, must-revalidate
content-length: 1792
X-Firefox-Spdy: h2
Request headers
Host: commons.wikimedia.org
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://commons.wikimedia.org/wiki/Special:Upload
Content-Type: multipart/form-data; boundary=---------------------------239181092836737455363224853218
Content-Length: 2924
Origin: https://commons.wikimedia.org
Connection: keep-alive
Cookie: ****
Upgrade-Insecure-Requests: 1
DNT: 1
Pragma: no-cache
Cache-Control: no-cache
TE: Trailers

Related:
It has taken multiple attempts to get https://archive.org/details/catalogofcopyrig263libr
The PDF just doesn't want to upload reliably, even when I use a chunked upload...
It fails repeatedly...

This is a big disincentive to encouraging GLAM contributors who have public domain resources...

eprodromou subscribed.

So, I think we can take a look at the API error, which seems to be malformed, but we probably can't address the underlying issues. At this size of upload, it may make sense for smart clients to chunk the upload and for us to re-assemble it on the server side...? Some thoughts.

My understanding is that, since Pywikibot uses chunked upload automatically as a default, "smart clients ... chunk the upload, and we re-assemble it on the server side" is already how this works. Pywikibot's site.py module links to https://www.mediawiki.org/wiki/API:Upload#Chunked_uploading to explain the mechanism.
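
For reference, the protocol on that page boils down to repeated action=upload calls with stash=1 and a byte offset, followed by a final commit using the returned filekey. A rough sketch of that sequence with plain requests, assuming an already logged-in session and CSRF token and with error handling omitted (this is not how Pywikibot implements it internally):

    import os
    import requests

    API = 'https://commons.wikimedia.org/w/api.php'
    CHUNK = 1 << 20  # 1 MiB per chunk

    def chunked_upload(session, csrf_token, path, filename, text, comment):
        # Stash the file in chunks, then commit it from the stash via its filekey,
        # per https://www.mediawiki.org/wiki/API:Upload#Chunked_uploading
        size = os.path.getsize(path)
        filekey, offset = None, 0
        with open(path, 'rb') as f:
            while offset < size:
                chunk = f.read(CHUNK)
                data = {'action': 'upload', 'format': 'json', 'stash': 1,
                        'filename': filename, 'filesize': size,
                        'offset': offset, 'token': csrf_token}
                if filekey:
                    data['filekey'] = filekey
                r = session.post(API, data=data,
                                 files={'chunk': ('chunk', chunk)}).json()
                filekey = r['upload']['filekey']
                offset += len(chunk)
        # Final request re-assembles the stashed chunks into the target file page.
        return session.post(API, data={
            'action': 'upload', 'format': 'json', 'filename': filename,
            'filekey': filekey, 'comment': comment, 'text': text,
            'token': csrf_token}).json()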

If there is some other process, rather than these API calls, that works for large PDFs, or if we are missing something in the site.upload call, please provide a link to that workaround.

FYI, the upload call in Pywikibot looks like this, but it has normally been run with chunk_size and async commented out, as they appear to make no difference when set (the site.py defaults being async=False, chunk_size=0):

    site.upload(pywikibot.FilePage(site, 'File:' + pagetitle),
                source_filename=source_filename,
                source_url=source_url,
                comment=comment,
                text=desc,
                ignore_warnings=False,
                chunk_size=1048576,
                async=True,
                )

As a separate experiment, here are the results of locally saving a 264 MB PDF from archive.org and then running the upload from the command line (via pwb.py upload) instead of from within my own script. Local directory names are trimmed for privacy but have no logical effect. This was a meaningful experiment because a possible workaround could have been to cache the largest files locally and URL-upload the rest; however, it appears to make no difference to the outcome.

A low chunk size of 500k was used to avoid local memory problems. This attempt failed, producing some interesting DBQueryErrors along the way before being aborted with stashfailed.

python pwb.py upload -filename:"Motion Picture News (Jul-Oct 1915) (IA motionpicturenew121unse).pdf" -chunked:500k -summary:"Manual upload to test a point about chunked uploading" motionpicturenew121unse.pdf "Source = https://archive.org/stream/motionpicturenew121unse\n{{PD-USGov}}"
The filename on the target wiki will default to: Motion Picture News (Jul-Oct 1915) (IA motionpicturenew121unse).pdf
Enter a better name, or press enter to accept: 
The suggested description is:
Source = https://archive.org/stream/motionpicturenew121unse\n{{PD-USGov}}
Do you want to change this description? ([y]es, [N]o, [q]uit): n
Uploading file to commons:commons...
Sleeping for 24.7 seconds, 2020-06-19 17:23:20
WARNING: Unexpected offset.
WARNING: Unexpected offset.
WARNING: Unexpected offset.
WARNING: Waiting 5 seconds before retrying.
WARNING: API error internal_api_error_DBQueryError: [d9e6fdd1-8af1-49ad-a89b-fca1f4e06198] Caught exception of type Wikimedia\Rdbms\DBQueryError
ERROR: Detected MediaWiki API exception internal_api_error_DBQueryError: [d9e6fdd1-8af1-49ad-a89b-fca1f4e06198] Caught exception of type Wikimedia\Rdbms\DBQueryError [errorclass:Wikimedia\Rdbms\DBQueryError; servedby:mw1394]; retrying
WARNING: Waiting 10 seconds before retrying.
WARNING: API error internal_api_error_DBQueryError: [341ea9fe-880c-462b-925e-6898043bdefb] Caught exception of type Wikimedia\Rdbms\DBQueryError
ERROR: Detected MediaWiki API exception internal_api_error_DBQueryError: [341ea9fe-880c-462b-925e-6898043bdefb] Caught exception of type Wikimedia\Rdbms\DBQueryError [errorclass:Wikimedia\Rdbms\DBQueryError; servedby:mw1358]; retrying
WARNING: Waiting 20 seconds before retrying.
WARNING: API error internal_api_error_DBQueryError: [15f88480-ee34-4c83-8f51-4e0777fc08d0] Caught exception of type Wikimedia\Rdbms\DBQueryError
ERROR: Detected MediaWiki API exception internal_api_error_DBQueryError: [15f88480-ee34-4c83-8f51-4e0777fc08d0] Caught exception of type Wikimedia\Rdbms\DBQueryError [errorclass:Wikimedia\Rdbms\DBQueryError; servedby:mw1402]; retrying
WARNING: Waiting 40 seconds before retrying.
WARNING: Http response status 500
WARNING: Non-JSON response received from server commons:commons; the server may be down.
WARNING: Waiting 80 seconds before retrying.
WARNING: API error stashfailed: Chunked upload is already completed, check status for details.
ERROR: Upload error: 
Traceback (most recent call last):
  File "/home/.../core/pywikibot/specialbots.py", line 430, in upload_file
    comment=self.summary)
  File "/home/.../core/pywikibot/page.py", line 2674, in upload
    **kwargs)
  File "/home/.../core/pywikibot/tools/__init__.py", line 1488, in wrapper
    return obj(*__args, **__kw)
  File "/home/.../core/pywikibot/site.py", line 6397, in upload
    raise error
APIError: stashfailed: Chunked upload is already completed, check status for details. [help:See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.]
1 pages read
0 pages written
Script terminated successfully.

Possibly relevant to keep in mind when debugging this: PDF and DjVu files are not treated as just blobs of binary data on disk with a little metadata in the database. Last I heard, the hidden text layer gets extracted from the file, wrapped in an XML structure, and stuffed into a database field that is sized for metadata and can overflow. The more pages in a PDF, the more text, and the more likely that is to start playing a role. If this happens before the "upload" is considered complete, it might be a factor.
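
If that hypothesis needs testing, the size of the hidden text layer can be estimated locally before upload. A sketch using poppler's pdftotext command-line tool (my own tool choice; the actual database field and its limit are not exposed here):

    import subprocess

    def text_layer_bytes(pdf_path):
        # Extract the hidden text layer to stdout and measure its size,
        # as a rough proxy for how much text the wiki would have to store.
        result = subprocess.run(['pdftotext', pdf_path, '-'],
                                capture_output=True, check=True)
        return len(result.stdout)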

It is also a known issue that thumbnail generation for PDFs (which uses Ghostscript, IIRC) is very slow (10+ seconds for a single page). Depending on where in the process this happens, it may be a relevant factor (the file information page shows at least two thumbnails, though these are possibly only generated on demand when the information page is actually loaded / those thumbs are requested).

Thumbnail generation is independent of the upload process. Thumbnails are generated on demand by Thumbor after a client request, and MediaWiki is largely uninvolved in thumbnail generation on Wikimedia sites.

With regard to debugging, these examples may be of interest:

  1. IA TheOakLeaf1949January1June25, 59 MB, took 23 upload attempts before succeeding
  2. IA asagraycorrespo00grayg, 1.44 GB, uploaded on the first attempt

It seems likely that the internal format of these two PDFs, rather than size alone, is key to understanding why some files repeatedly fail to upload while others do not.

Fae: I've also noted a quality issue with some PDF uploads,
https://en.wikisource.org/wiki/File:Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_(IA_catalogoftitleen11118libr).pdf
being one, where the page image displayed in the interface is of lower quality than that in the actual PDF (compared to a direct display of the page in Adobe Reader, which was also unable to export a usable image to older image-viewer tools).

This may be related to an issue I've had in the past where some tools cannot convert 'progressive JPEG' (or JPEG2000) as quickly or easily as earlier JPEG variants, or PDFs that use 'layered' images for each page. It may be worth checking with the upstream vendor (of Ghostscript) what level of support it has for more recent PDF variants (such as progressive JPEG or 'layered' pages). (See T256848.)
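
One quick way to see whether a problem PDF uses JPEG2000 or other newer encodings is poppler's pdfimages listing; its 'enc' column shows the compression of each embedded page image. A minimal sketch (again my own tool choice, not something the uploads currently do):

    import subprocess

    def list_image_encodings(pdf_path):
        # Print one row per embedded image; inspect the 'enc' column for
        # JPEG2000 or other encodings that downstream tools may handle badly.
        subprocess.run(['pdfimages', '-list', pdf_path], check=True)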

It would be tedious, and I'm not sure whether 'free' software to do it exists, but ideally all PDFs on Commons should be 'flattened' on upload, with progressive or layered pages converted back to single-image pages using appropriate tools.
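
As a sketch of what 'flattening' could look like with free tools, Ghostscript can rasterise each page and img2pdf can rebuild a plain one-image-per-page PDF. Note that this discards the text layer, and the tool choice and settings below are assumptions rather than an existing Commons process:

    import glob
    import subprocess

    def flatten_pdf(src, dst, dpi=300):
        # Rasterise every page to JPEG with Ghostscript...
        subprocess.run(['gs', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeg',
                        '-r{}'.format(dpi), '-sOutputFile=page-%04d.jpg', src],
                       check=True)
        # ...then rebuild a flat, single-image-per-page PDF with img2pdf.
        pages = sorted(glob.glob('page-*.jpg'))
        subprocess.run(['img2pdf', '-o', dst] + pages, check=True)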

The sheer size of some PDFs may also be confusing the tools that attempt to extract the text layer from them. (Fully solving the issue of text-layer extraction for PDFs is very much a long-term issue, as it would involve a substantial redesign of parts of the extension concerned.)

Please re-open if it is still an issue.

PDF files as big as 2.13 GB have been uploaded to Commons with some tweaks to API requests. This was never imaginable before. Thanks!