
Frequent chunk-too-small errors
Closed, Duplicate · Public

Description

During the UK Legislation batch upload project, a significant number of files are being rejected by the API with 'chunk-too-small'. Is there a work-around or a fix that could be applied for this Pywikibot-based mass upload?

This error has not been a problem for uploads of image mimetypes, but appears quite likely to occur for document mimetypes.

Event Timeline

I think it is related to T132676; I checked the first file and it ends in '\r'.
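
For anyone wanting to repeat that check, a minimal sketch (the filename is the test file linked further down in this task):

import os

# Read the last byte of the local PDF; the affected files end in b'\r'.
with open('ukpga_18440061_en.pdf', 'rb') as f:
    f.seek(-1, os.SEEK_END)
    print(repr(f.read(1)))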

As my 'personal' work-around, just for UK Legislation PDFs that the API flags with chunk-too-small and that fail on a second upload attempt, the PDF is trimmed of its final byte and the upload is re-attempted. In my view this is a terrible hack rather than a fix.

However, this initially appears to work, with the files both uploading and displaying successfully, though it may later cause unpredictable errors as it is hardly an intelligent fix. See this category for examples.

Code snippet:

import os

# uptry() is the uploader helper used throughout this batch job; it
# returns the API error code when an upload fails.
rec = uptry(local, fn, dd, comment, False)
if rec == 'chunk-too-small':
    print "Chunk-too-small, so trying trimming off 1 byte"
    # Trim the final byte of the local file in place, then retry with a
    # tracking category added to the file description.
    with open(local, 'rb+') as filehandle:
        filehandle.seek(-1, os.SEEK_END)
        filehandle.truncate()
    rec = uptry(local, fn, dd + "\n[[Category:Work around of byte trimmed for chunk-too-small API error]]", comment, False)
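
A slightly safer variant of the same hack would be to trim a temporary copy rather than the original file; a sketch, using the same uptry() helper as above:

import os
import shutil
import tempfile

def upload_trimmed(local, fn, dd, comment):
    # Copy the file, trim the copy's final byte and retry the upload,
    # leaving the original file untouched.
    handle, tmp = tempfile.mkstemp(suffix='.pdf')
    os.close(handle)
    shutil.copyfile(local, tmp)
    with open(tmp, 'rb+') as filehandle:
        filehandle.seek(-1, os.SEEK_END)
        filehandle.truncate()
    try:
        return uptry(tmp, fn, dd, comment, False)
    finally:
        os.remove(tmp)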

Could not reproduce. Please provide at least the following information:

  • Operating system
  • Python environment and version (import sys; print(sys.version))
  • Pywikibot version (import pywikibot; print(pywikibot.__version__))
  • Relevant code or command used, including the chunk size configuration
  • Complete logs of upload attempt (VERBOSE-level or lower preferred)
  • Hash of the file that could not be uploaded

It would also be very useful if you could provide the request and response headers for the failed chunk upload, including the exact size of the chunk. Information about if the file appears in Special:UploadStash would also be helpful.
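
For the hash requested above: MediaWiki stores files by SHA-1, so that is the most useful digest to report. A minimal sketch:

import hashlib

def file_sha1(path, blocksize=1 << 20):
    # Stream the file in 1 MiB blocks so large PDFs need not fit in memory.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            digest.update(block)
    return digest.hexdigest()

print(file_sha1('ukpga_18440061_en.pdf'))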

Could not reproduce. Please provide at least the following information:

  • Operating system

Ubuntu Release 18.04.5 LTS (Bionic Beaver) 64-bit

  • Python environment and version (import sys; print(sys.version))

2.7.17 (default, Sep 30 2020, 13:38:04) [GCC 7.5.0]

  • Pywikibot version (import pywikibot; print(pywikibot.__version__))

3.1.dev0

  • Relevant code or command used, including the chunk size configuration

site.upload(pywikibot.FilePage(site, 'File:' + pagetitle),
            source_filename=source_filename,
            source_url=source_url,
            comment=comment,
            text=desc,
            ignore_warnings=False,
            chunk_size=400000,  # 1048576
            # async=True,
            )

  • Complete logs of upload attempt (VERBOSE-level or lower preferred)

pywikibot.data.api.APIError: chunk-too-small: Minimum chunk size is 1,024 bytes for non-final chunks. [help:See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
  • Hash of the file that could not be uploaded

Test file is https://www.legislation.gov.uk/ukpga/1844/61/pdfs/ukpga_18440061_en.pdf
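
A side note for reproduction: the error above says the 1,024-byte minimum applies to non-final chunks only, so a small last chunk should normally be accepted. A quick sketch to compute the expected final chunk size for the configured chunk_size:

import os

chunk_size = 400000  # as configured in the snippet above

size = os.path.getsize('ukpga_18440061_en.pdf')
final_chunk = size % chunk_size or chunk_size
print('%d bytes total, expected final chunk of %d bytes' % (size, final_chunk))
# If this final chunk is well under 1,024 bytes and is still rejected, the
# server is apparently not recognising it as the final chunk.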

It would also be very useful if you could provide the request and response headers for the failed chunk upload, including the exact size of the chunk. Information about if the file appears in Special:UploadStash would also be helpful.

Can't recall how to dig out the WMF server stash log. This upload was under User:Fæ and would have been at 2020-10-25 10:37 UK time.

@Fae, I proposed a fix/hack at T132676.
As you have several cases, I would appreciate it if you could try it and provide feedback.

I think a better workaround, if it works, is to use source_url in site.upload().
It delegates the task of fetching the file to the API; if it works, the file is hopefully the original.
See https://en.wikisource.org/w/api.php?action=help&modules=upload
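
For reference, a minimal sketch of that approach, reusing pagetitle, comment and desc from the earlier snippet, and assuming upload-by-URL is permitted for the target wiki and the source domain:

import pywikibot

site = pywikibot.Site('commons', 'commons')
page = pywikibot.FilePage(site, 'File:' + pagetitle)
# Hand the URL to the server and let it fetch the file itself, so no
# client-side chunking is involved:
site.upload(page,
            source_url='https://www.legislation.gov.uk/ukpga/1844/61/pdfs/ukpga_18440061_en.pdf',
            comment=comment,
            text=desc,
            ignore_warnings=False)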

That can only happen when T265690 is complete.