
Failures of merging text layer into djvu file ("Failed to get specified page")
Open, Needs Triage · Public

Description

Browsing the log of the first "possibly failed" IA Upload output (030PoloIlMilioneSi203), you can see that:

  1. the image layer of the djvu is good and can be downloaded;
  2. the djvu has no text layer;
  3. the last log output says: [2017-12-05 23:09:27] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/030PoloIlMilioneSi203/030_Polo_Il_milione_si203_djvu.xml_new.xml" 2>&1" exited with code 1: * [1-16201] Failed to get specified page. * (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)' [] []

The log messages show that the individual generated djvu pages have names that differ from the original jp2 images:
030_Polo_Il_milione_si203_0305.jp2 -> 030PoloIlMilioneSi203_p307.jpg -> 030PoloIlMilioneSi203_p307.djvu

The same log message also appears when the original image names contain no underscores; in that case the only difference between the file names is the suffix:

item BarriliDallarupe: BarriliDallarupe_0002.jp2 -> BarriliDallarupe_p4.jpg -> BarriliDallarupe_p4.djvu

I suspect the problem is that djvuxmlparser cannot find the page names it expects; this would explain the log message "Failed to get specified page".
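One way to test this suspicion is to list the page names the XML refers to and compare them with the component names actually present in the djvu. The following is only a minimal sketch: it assumes each page is an OBJECT element carrying a PARAM named "PAGE", the helper names are hypothetical, and the sample XML is illustrative rather than taken from the failing item. The djvu component names would have to be obtained separately (for example from the output of `djvused -e ls`).

```python
import xml.etree.ElementTree as ET


def page_names_in_xml(xml_text):
    """Return the PAGE parameter of every OBJECT element, in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for obj in root.iter("OBJECT"):
        for param in obj.iter("PARAM"):
            # Assumption: the page name is stored in <PARAM name="PAGE" value="..."/>.
            if param.get("name") == "PAGE":
                names.append(param.get("value"))
    return names


def missing_pages(xml_text, djvu_component_names):
    """Return the page names used in the XML that the djvu does not contain."""
    known = set(djvu_component_names)
    return [name for name in page_names_in_xml(xml_text) if name not in known]


# Illustrative sample only: the real _djvu.xml may name its pages differently.
SAMPLE_XML = """<DjVuXML><BODY>
<OBJECT data="file://localhost/x.djvu" height="100" width="50">
<PARAM name="PAGE" value="030_Polo_Il_milione_si203_0305.djvu"/>
</OBJECT>
</BODY></DjVuXML>"""

if __name__ == "__main__":
    print(missing_pages(SAMPLE_XML, ["030PoloIlMilioneSi203_p307.djvu"]))
```

If the returned list is non-empty, the XML points at pages the djvu does not have, which would be consistent with the error above.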

In any case, it.wikisource can use a Python script to merge the _djvu.xml text into the djvu files left by these failures: https://it.wikisource.org/wiki/Progetto:Bot/Programmi_in_Python_per_i_bot/xml2dsed.py (it's rough, but it works).

Event Timeline

Restricted Application added a project: Community-Tech. · Dec 13 2017, 3:05 PM
Alex_brollo updated the task description. · Dec 15 2017, 7:24 AM
Samwilson renamed this task from IA Upload: failures of merging text layer into djvu file to Failures of merging text layer into djvu file ("Failed to get specified page"). · Dec 21 2017, 6:51 AM

Most of the recent failures are due to this very same bug. See the log of any failed upload that offers a djvu download link.

I moved from using _djvu.xml directly to the more involved _djvu.xml -> dsed conversion, because manipulating dsed is much simpler; the only hard step is the coordinate conversion. Once you have the OCR layer in dsed format, you can use djvused, which is much faster and more flexible. I also found that some IA _djvu.xml files are buggy at the source, but these bugs can be fixed. I think IA uses _djvu.xml only to get the text coordinates needed for word search and highlighting in its viewer, so IA is not very concerned with whether a _djvu.xml file can be used to build a text layer in a djvu file.
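The coordinate conversion mentioned above can be sketched roughly as follows. This is not the xml2dsed.py script linked in the description, just an illustration under stated assumptions: WORD elements carry coords as "left,bottom,right,top" measured from the top-left corner of the page, dsed zones are "xmin ymin xmax ymax" measured from the bottom-left corner, and the words are placed directly under the page zone for brevity. The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def escape(text):
    """Escape backslashes and double quotes for a dsed string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def object_to_dsed(obj):
    """Convert one OBJECT (page) element into a flat dsed page zone."""
    width = int(obj.get("width"))
    height = int(obj.get("height"))
    words = []
    for word in obj.iter("WORD"):
        # Assumption: coords = left,bottom,right,top with y growing downwards;
        # any extra fields after the first four are ignored.
        left, bottom, right, top = [int(v) for v in word.get("coords").split(",")[:4]]
        # Flip the y axis: dsed measures y from the bottom edge of the page.
        words.append('(word %d %d %d %d "%s")'
                     % (left, height - bottom, right, height - top,
                        escape(word.text or "")))
    if not words:
        return '(page 0 0 %d %d "")' % (width, height)
    return "(page 0 0 %d %d\n %s)" % (width, height, "\n ".join(words))


# Illustrative sample only, not taken from a real IA file.
SAMPLE_PAGE = ET.fromstring(
    '<OBJECT width="50" height="100">'
    '<WORD coords="10,90,40,70">Polo</WORD>'
    '</OBJECT>'
)

if __name__ == "__main__":
    print(object_to_dsed(SAMPLE_PAGE))
```

The resulting text could then be written to a file and applied to the matching page with djvused's set-txt command; a real conversion would likely keep the paragraph and line zones as well.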

Samwilson removed Samwilson as the assignee of this task. · Jan 16 2019, 3:44 PM
Samwilson added a subscriber: Samwilson.

I'm not working on this at the moment.