
IA Uploader fails to recognize the first page of a book
Open, Needs Triage | Public | Bug Report

Description

I uploaded the book https://commons.wikimedia.org/w/index.php?title=File:Zawis_and_Kunigunde_(1895).djvu&page=1 using IA Uploader. The tool showed me the page with the title Zawis and Kunigunde as the first page of the book and asked whether I wanted to remove it. I ticked "No", as it was part of the book. However, after the upload finished, a completely different page appeared in the first position, and the one I had been asked about ended up as the 2nd page. As the uploaded first page is completely useless, I would have had it removed if I had been asked about it.

To see which page IA Uploader originally showed as the 1st page of the book, have a look at the screenshot: https://drive.google.com/file/d/16j-KROmg7tpHxqqp1koEV1vnHsBSctAf/view?usp=sharing

Event Timeline

BTW, one of the problems this creates is that when a thumbnail of the uploaded book is displayed somewhere, this useless first page is shown instead of the book's real cover.

I think what's happening here is that the JP2 zip file used as the source for building the DjVu contains some JP2 files that should not be included.

From the scandata.xml file, we see:

<page leafNum="0">
    <handSide>LEFT</handSide>
    <pageType>Color Card</pageType>
    <addToAccessFormats>false</addToAccessFormats>
    ...
</page>
<page leafNum="1">
    <handSide>RIGHT</handSide>
    <pageType>Cover</pageType>
    <addToAccessFormats>true</addToAccessFormats>
    ...
</page>
...

Here addToAccessFormats=false denotes that the image should *not* be included in "access formats", which (AFAIK) are things like the BookReader interface, PDFs and DjVus (when available).

So, IA-Upload should read the scan data and omit any addToAccessFormats=false pages. In this case, that means "leaf numbers" 0 and 329.
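To make the proposal concrete, here is a minimal sketch of the filtering step. It is written in Python for illustration only (IA-Upload itself is PHP), and it is not the tool's actual code. The function name `excluded_leaves` and the `<book><pageData>` wrapper in the sample are assumptions; only the `<page>` elements mirror the excerpt quoted above. It also assumes that a page without an addToAccessFormats element should be kept.

```python
import xml.etree.ElementTree as ET
from typing import List

# Sample modelled on the excerpt above; the outer wrapper elements are assumed.
SAMPLE = """<book><pageData>
<page leafNum="0">
    <handSide>LEFT</handSide>
    <pageType>Color Card</pageType>
    <addToAccessFormats>false</addToAccessFormats>
</page>
<page leafNum="1">
    <handSide>RIGHT</handSide>
    <pageType>Cover</pageType>
    <addToAccessFormats>true</addToAccessFormats>
</page>
</pageData></book>"""


def excluded_leaves(scandata_xml: str) -> List[int]:
    """Return the leaf numbers whose addToAccessFormats flag is 'false'."""
    root = ET.fromstring(scandata_xml)
    excluded = []
    # iter() finds <page> elements at any depth, so the wrapper names do not matter.
    for page in root.iter("page"):
        # Assumption: a missing flag means the page is kept.
        flag = (page.findtext("addToAccessFormats") or "true").strip().lower()
        if flag == "false":
            excluded.append(int(page.get("leafNum")))
    return excluded


print(excluded_leaves(SAMPLE))
```

The JP2 files for the returned leaf numbers would then be skipped when building the DjVu. Note that the decision is driven by the flag, not by the leaf number or file name.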

Including the non-access-format images also causes problems with the OCR, because the OCR text page offsets assume these pages are not included.
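The offset problem can be illustrated with a small Python sketch. It assumes, as described above, that OCR pages are numbered by counting only the pages included in access formats. The function name `leaf_to_access_index` and the example flags are hypothetical.

```python
from typing import Dict, List, Tuple


def leaf_to_access_index(flags: List[Tuple[int, bool]]) -> Dict[int, int]:
    """Map each included leaf number to its 0-based position among included pages.

    `flags` is a list of (leaf_num, include) pairs in leaf order.
    """
    mapping = {}
    next_index = 0
    for leaf_num, include in flags:
        if include:
            mapping[leaf_num] = next_index
            next_index += 1
    return mapping


# Hypothetical book: leaves 0 and 3 are flagged addToAccessFormats=false.
flags = [(0, False), (1, True), (2, True), (3, False), (4, True)]
print(leaf_to_access_index(flags))
```

In this example leaf 1 becomes OCR page 0. If leaf 0 is kept in the DjVu anyway, every later page ends up paired with the text of its neighbour.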

I just wanted to note here that not all 0000 numbered pages at IA are useless. Some are whole covers and not the scanner bed or the spine (and the spine might be useful and/or informative).

Please do not consider the elimination of 0000 images as an easy fix....

That entire scandata.xml file is currently not taken into account. You can see how the tool uses the zip files here:
https://github.com/wikisource/ia-upload/blob/1ba22eb9083f53c1118175648941a702e80b2a15/src/DjvuMaker/Jp2DjvuMaker.php#L100

If an automated process is too hard, then maybe we can have an advanced feature: either something where we can exclude a page by name, or an intermediary page that lists all the pages with checkboxes, so pages can be manually deselected for exclusion. It is problematic at this time that many works are pretty much excluded. I have deleted half a dozen uploads due to this issue.
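One way the checkbox idea could sit alongside an automated default is sketched below in Python. This is purely illustrative and not part of IA-Upload. The name `final_selection` and both dictionaries are hypothetical: `defaults` stands for whatever the tool pre-ticks, and `overrides` for the boxes the uploader changed.

```python
from typing import Dict, List


def final_selection(defaults: Dict[int, bool], overrides: Dict[int, bool]) -> List[int]:
    """Return the leaf numbers to include, after applying manual overrides."""
    selection = dict(defaults)
    for leaf, include in overrides.items():
        # Ignore overrides for leaves that are not in the book.
        if leaf in selection:
            selection[leaf] = include
    return [leaf for leaf in sorted(selection) if selection[leaf]]


# Hypothetical case: leaf 0 is excluded by default, but it is a real cover,
# so the uploader ticks it back in.
defaults = {0: False, 1: True, 2: True}
print(final_selection(defaults, {0: True}))
```

This keeps a manual escape hatch for books where a page numbered 0000 is a real cover rather than a color card.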

TheresNoTime changed the subtype of this task from "Task" to "Bug Report". Aug 1 2022, 2:32 PM

Looks like the same bug as reported in T243163.

I too am looking forward to scandata.xml addToAccessFormats page filtering. That would get rid of the irritating color card and white card pages often included at the end of many scans (but I have seen them in the middle of book scans too).

Meaning this is also the same bug as: T367491: Extra pages after upload.