IA Uploader fails to recognize the first page of a book
Open, Needs Triage · Public · Bug Report

Assigned To

None

Authored By

	Jan.Kamenicek
	Nov 19 2020, 4:02 PM

Description

I uploaded the book https://commons.wikimedia.org/w/index.php?title=File:Zawis_and_Kunigunde_(1895).djvu&page=1 using IA Uploader. IA Uploader showed me the page with the title Zawis and Kunigunde as the first page of the book and asked whether I wanted to remove it. I ticked "No", as it was part of the book. However, after the upload finished, a completely different page appeared in the position of the first page, and the page I had been asked about appeared in the position of the 2nd page. As the uploaded first page is completely useless, I would have asked for it to be removed had I been asked about it.

To see which page IA Uploader originally showed as the 1st page of the book, have a look at the screenshot: https://drive.google.com/file/d/16j-KROmg7tpHxqqp1koEV1vnHsBSctAf/view?usp=sharing

Related Objects

Mentioned In: T363619: Remove option for PDF → DjVu conversion (phetools)
Mentioned Here: T367491: Extra pages after upload
T243163: 'First page' thumbnail isn't always of the first page

Event Timeline

Jan.Kamenicek created this task. Nov 19 2020, 4:02 PM

Restricted Application added a project: Community-Tech. Nov 19 2020, 4:02 PM

BTW, one of the problems this creates is that when a thumbnail of the uploaded book is displayed somewhere, this useless first page is shown instead of the book's real cover.

I think what's happening here is that the JP2 zip file used as the source for building the DjVu contains some JP2 files that should not be included.

From the scandata.xml file, we see:

<page leafNum="0">
    <handSide>LEFT</handSide>
    <pageType>Color Card</pageType>
    <addToAccessFormats>false</addToAccessFormats>
    ...
</page>
<page leafNum="1">
    <handSide>RIGHT</handSide>
    <pageType>Cover</pageType>
    <addToAccessFormats>true</addToAccessFormats>
    ...
</page>
...

Here, addToAccessFormats=false denotes that the image should *not* be included in "access formats", which (AFAIK) are things like the BookReader interface, PDFs and DjVus (when available).

So, IA-Upload should read the scan data and omit any addToAccessFormats=false pages. In this case, those are "leaf numbers" 0 and 329.

This also causes problems with the OCR when the non-access format images are included, because the OCR text page offsets assume these pages are not included.
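A minimal sketch of that filtering step, in Python purely for illustration (IA-Upload itself is PHP, and this is not its code). It assumes scandata.xml has no XML namespace and that every `page` element carries a `leafNum` attribute; the `book`/`pageData` wrapper elements in the sample and the function name `excluded_leaves` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Simplified sample modelled on the excerpt above; the wrapper elements
# are an assumption, only the <page> entries matter here.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0">
      <handSide>LEFT</handSide>
      <pageType>Color Card</pageType>
      <addToAccessFormats>false</addToAccessFormats>
    </page>
    <page leafNum="1">
      <handSide>RIGHT</handSide>
      <pageType>Cover</pageType>
      <addToAccessFormats>true</addToAccessFormats>
    </page>
    <page leafNum="329">
      <addToAccessFormats>false</addToAccessFormats>
    </page>
  </pageData>
</book>"""


def excluded_leaves(scandata_xml: str) -> list[int]:
    """Return leaf numbers whose addToAccessFormats flag is 'false'."""
    root = ET.fromstring(scandata_xml)
    excluded = []
    for page in root.iter("page"):
        # A page without the flag is treated as included (an assumption).
        flag = page.findtext("addToAccessFormats", default="true")
        if flag.strip().lower() == "false":
            excluded.append(int(page.get("leafNum")))
    return excluded


print(excluded_leaves(SAMPLE_SCANDATA))  # [0, 329]
```

The leaf numbers returned this way would be the ones to skip when collecting JP2 images for the DjVu.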

I just wanted to note here that not all 0000 numbered pages at IA are useless. Some are whole covers and not the scanner bed or the spine (and the spine might be useful and/or informative).

Please do not consider the elimination of 0000 images as an easy fix....

That entire scandata.xml file is currently not taken into account. You can see how the tool uses the zip files here:
https://github.com/wikisource/ia-upload/blob/1ba22eb9083f53c1118175648941a702e80b2a15/src/DjvuMaker/Jp2DjvuMaker.php#L100

If it is too hard to do this automatically, then maybe we could have an advanced option: either something where we can exclude a page by name, or an intermediate page that lists all the pages with checkboxes, so that individual pages can be manually deselected. It is problematic that, at this time, many works are pretty much excluded. I have deleted half a dozen uploads due to this issue.
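One way the automatic and manual approaches could be combined, sketched in Python for illustration only (not IA-Upload code): start from the leaf numbers flagged addToAccessFormats=false in scandata.xml, then let the user override them in either direction. The function `select_pages`, its parameters, and the assumption that page images are named with a zero-padded four-digit leaf number before `.jp2` are all hypothetical.

```python
import re

# Assumed naming: "<identifier>_0000.jp2", "<identifier>_0001.jp2", ...
LEAF_PATTERN = re.compile(r"_(\d{4})\.jp2$")


def select_pages(names, auto_excluded, force_include=(), force_exclude=()):
    """Pick the JP2 names to keep.

    auto_excluded: leaf numbers flagged in scandata.xml.
    force_include: leaves the user ticked to keep despite the flag.
    force_exclude: leaves the user unticked by hand.
    """
    skip = (set(auto_excluded) - set(force_include)) | set(force_exclude)
    kept = []
    for name in sorted(names):
        match = LEAF_PATTERN.search(name)
        if match is None:
            continue  # not a page image, ignore it
        if int(match.group(1)) not in skip:
            kept.append(name)
    return kept


EXAMPLE_NAMES = [
    "book_jp2/book_0000.jp2",
    "book_jp2/book_0001.jp2",
    "book_jp2/book_0002.jp2",
]

# Default: drop the flagged leaf 0.
print(select_pages(EXAMPLE_NAMES, auto_excluded=[0]))
# Override: leaf 0 is a real cover, so keep it anyway.
print(select_pages(EXAMPLE_NAMES, auto_excluded=[0], force_include=[0]))
```

Keeping the override separate from the automatic list means a flagged leaf that is in fact a whole cover can still be kept.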

TheresNoTime changed the subtype of this task from "Task" to "Bug Report". Aug 1 2022, 2:32 PM

KSiebert removed a project: Community-Tech. Dec 5 2022, 3:23 PM

Looks like the same bug as reported in T243163.

I too am looking forward to scandata.xml addToAccessFormats page filtering. That would get rid of the irritating color card and white card pages often included at the end of many scans (but I have seen them in the middle of book scans too).

Meaning this is also the same bug as: T367491: Extra pages after upload.

Uzume mentioned this in T363619: Remove option for PDF → DjVu conversion (phetools). Aug 11 2024, 12:29 PM
