Fixable vs. unfixable IA Upload failures: overview
Open, Needs Triage · Public

Description

Browsing through 44 IA Upload failures, I found 23 cases that are probably fixable and 20 cases that are unfixable.

Among the former I found:

  • 14 cases where a _jp2.zip file exists, but its prefix is different from the IA ID;
  • 7 cases where there's no _jp2.zip file, but there's a _tif.zip file;
  • 1 case where there's no _jp2.zip file, but there's a _jp2.tar file;
  • 1 case where a _jp2.zip file exists, but the uploader exits with no result (it is a very large item with 1010 pages).

Among the latter (unfixable) I found a variety of abnormal uploads: loose files (.jpg, .png, .mp3, .ogg...), or folders or zip files with a name structure different from the allowed one (_images.zip), and lacking a _djvu.xml file.

My suggestions (unfortunately I can't fix the code myself) are:

  • to test for the existence of a _djvu.xml file as the first step,
    • if it exists
      • to test for the existence of a _jp2.zip file and to use it even if its prefix is different from the IA ID
      • if no _jp2.zip file exists
        • to test for a _tif.zip file and to use it after a tif to jpg conversion
        • to test for a _jp2.tar file and to use it after splitting the tar

This approach should avoid most fixable IA Uploader failures.
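
To make the proposed order of checks concrete, here is a minimal sketch in Python (not the uploader's actual code). The function name pick_scan_archive and the step labels are hypothetical; it only looks at file names and assumes that matching by suffix is enough, so that a prefix different from the IA ID is accepted.

```python
from typing import List, Optional, Tuple

# Suffixes to try, in the order suggested above, with the extra step each needs.
# The step labels are illustrative only.
CANDIDATES = [
    ("_jp2.zip", "use as-is"),
    ("_tif.zip", "convert tif to jpg"),
    ("_jp2.tar", "split tar"),
]


def pick_scan_archive(file_names: List[str]) -> Optional[Tuple[str, str]]:
    """Return (file_name, required_step), or None if the item looks unfixable."""
    # First step: without a *_djvu.xml file there is nothing to work with.
    if not any(name.endswith("_djvu.xml") for name in file_names):
        return None
    # Match on the suffix only, so the prefix may differ from the IA ID.
    for suffix, step in CANDIDATES:
        for name in sorted(file_names):
            if name.endswith(suffix):
                return name, step
    return None
```

For an item whose files are foo_djvu.xml and bar_jp2.zip, this picks bar_jp2.zip even though the prefixes differ; an item with only .jpg or .mp3 files returns None.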

Event Timeline

Is it the case that the zip files we want are identified by format = 'Abbyy GZ' in the files' list? That seems to identify the jp2 and tif zip files in the items I've looked at.

No. The most robust way is probably to find the file with format "Djvu XML" (typically <IAID>_djvu.xml). We aren't going anywhere without that for now (maybe someday we can incorporate our own OCR, etc.). Then follow its original field back to the previous file, and keep chasing that field until you reach an original source file (one that does not have an original field). The derivative scans will almost always be the file just before the end of that chain, with a format starting with "Single Page Processed", i.e. "Single Page Processed <imageformat> <archiveformat>"; the combination you are probably most familiar with is "JP2 ZIP" (typically <IAID>_jp2.zip).
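
A rough sketch of that chain walk in Python, for illustration only. It assumes the item's file list is already available as a list of dicts with name, format and (for derivatives) original keys holding plain strings; the function name find_page_scans and the sample entries are made up, not taken from a real item.

```python
from typing import Dict, List, Optional

# Hypothetical file list; names and the source file's format are illustrative.
SAMPLE_FILES = [
    {"name": "example.pdf", "format": "Image Container PDF"},
    {"name": "example_jp2.zip", "format": "Single Page Processed JP2 ZIP",
     "original": "example.pdf"},
    {"name": "example_abbyy.gz", "format": "Abbyy GZ",
     "original": "example_jp2.zip"},
    {"name": "example_djvu.xml", "format": "Djvu XML",
     "original": "example_abbyy.gz"},
]


def find_page_scans(files: List[Dict[str, str]]) -> Optional[str]:
    """Walk back from the Djvu XML file via 'original' to the page scans."""
    by_name = {entry["name"]: entry for entry in files}
    # Start from the file whose format is "Djvu XML"; give up if there is none.
    current = next((e for e in files if e.get("format") == "Djvu XML"), None)
    seen = set()  # guards against a malformed, circular chain
    while current is not None and current["name"] not in seen:
        seen.add(current["name"])
        if current.get("format", "").startswith("Single Page Processed"):
            return current["name"]
        # A file without an 'original' field is a source file: the chain ends.
        current = by_name.get(current.get("original", ""))
    return None
```

Because the lookup goes by format and the original field rather than by file name, it does not care whether the archive's prefix matches the IA ID, and it does not depend on an "Abbyy GZ" file being present.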

FYI: Since IA's move to Tesseract OCR (~2021+), it is now quite common to find IA items without any file with format = "Abbyy GZ".