
Failures of merging text layer into djvu file ("Failed to get specified page")
Open, Needs Triage, Public

Description

Browsing the log of the first "possibly failed" IA Upload output (030PoloIlMilioneSi203), you can see:

  1. the image layer of djvu is good and it can be downloaded;
  2. the djvu has no text layer;
  3. the last log output says: [2017-12-05 23:09:27] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/030PoloIlMilioneSi203/030_Polo_Il_milione_si203_djvu.xml_new.xml" 2>&1" exited with code 1: * [1-16201] Failed to get specified page. * (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)' [] []

I see from the log messages that the individual generated djvu pages have names that differ from the original jp2 images:
030_Polo_Il_milione_si203_0305.jp2 -> 030PoloIlMilioneSi203_p307.jpg -> 030PoloIlMilioneSi203_p307.djvu

The same log message also appears in cases where the original image names contain no underscores, the only difference being the suffix of the image file names:

item BarriliDallarupe: BarriliDallarupe_0002.jp2 -> BarriliDallarupe_p4.jpg -> BarriliDallarupe_p4.djvu

I suspect the problem is that the djvu xml parser can't find the page names it expects; this would explain the log message "Failed to get specified page".
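One way to test this hypothesis would be to compare the page names the XML asks for with the component names actually present in the djvu. Below is a minimal sketch in Python, not a verified diagnosis. It assumes that each page in the _djvu.xml carries a `<PARAM name="PAGE" value="..."/>` element, and that the component names of the djvu have been listed separately (for example with `djvused -e ls`). The function name `missing_pages` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def missing_pages(xml_text, djvu_components):
    """Return the PAGE names referenced in the XML that the djvu does not contain."""
    root = ET.fromstring(xml_text)
    # Assumption: the page name lives in <PARAM name="PAGE" value="...">.
    wanted = [
        param.get("value")
        for param in root.iter("PARAM")
        if param.get("name") == "PAGE"
    ]
    available = set(djvu_components)
    return [name for name in wanted if name not in available]


# Illustrative sample only; real _djvu.xml files are much larger.
sample = """<DjVuXML><BODY>
<OBJECT data="file://localhost/x.djvu" height="100" width="80">
<PARAM name="PAGE" value="BarriliDallarupe_0002.djvu"/>
</OBJECT>
</BODY></DjVuXML>"""

print(missing_pages(sample, ["BarriliDallarupe_p4.djvu"]))
```

If the returned list is non-empty, the XML refers to pages that the djvu does not know under that name, which would be consistent with the error above.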

In any case, it.source can use a Python script to merge the _djvu.xml text into the djvu files left by these failures: https://it.wikisource.org/wiki/Progetto:Bot/Programmi_in_Python_per_i_bot/xml2dsed.py (it's rough, but it runs).

Event Timeline

Samwilson renamed this task from IA Upload: failures of merging text layer into djvu file to Failures of merging text layer into djvu file ("Failed to get specified page"). Dec 21 2017, 6:51 AM

Most of the recent failures are due to this very same bug. See the log for any upload failure with a link for DJVU download.

I moved from using _djvu.xml directly to the more complex _djvu.xml -> dsed conversion, since manipulating dsed is much simpler; the only hard step is the coordinate conversion. Once you have the OCR layer in dsed format, you can use djvused, which is much faster and more flexible. I also found that some IA _djvu.xml files are already broken at the source, but these bugs can be fixed. I think IA uses _djvu.xml only to get the text coordinates needed for word search and highlighting in its viewer, so IA isn't much concerned with whether the _djvu.xml file is usable for building a text layer in a djvu file.
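To illustrate the coordinate conversion step, here is a minimal sketch in Python. It is not the xml2dsed.py script linked above, and it rests on assumptions that should be checked against real files: that the WORD coords are ordered left,bottom,right,top with y measured downwards from the top of the page, that djvused measures y upwards from the bottom, and that no off-by-one correction is needed. The function name `word_to_dsed` is hypothetical; the page height would presumably come from the page's OBJECT element.

```python
def word_to_dsed(coords, text, page_height):
    """Convert one WORD's coords attribute into a djvused (word ...) expression."""
    # Only the first four values are used; a fifth value, if present, is ignored.
    left, bottom, right, top = [int(v) for v in coords.split(",")[:4]]
    # Flip the vertical axis: XML rows are assumed to count down from the
    # top edge, djvused rows to count up from the bottom edge.
    ymin = page_height - bottom
    ymax = page_height - top
    # Escape backslashes and double quotes; other escaping is not handled here.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'(word {left} {ymin} {right} {ymax} "{escaped}")'


print(word_to_dsed("100,250,180,200,248", "Milione", 1000))
# (word 100 750 180 800 "Milione")
```

The same flip would apply to the enclosing line, paragraph and page boxes when building the full dsed structure.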

Samwilson subscribed.

I'm not working on this at the moment.

@Alex_brollo I think this might be due to things like this in the djvu_xml file:

<WORD coords="x1,y1,x2,y2,dontcare">1</WORD>

where x1, x2, y1, y2 are all integers. If x1 == x2 || y1 == y2, the text layer will be corrupted according to djvused. It should be fairly easy to spot that and just drop the WORD if it happens, or even set the co-ordinates to be the same as the prior/following WORD.
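A minimal sketch of the "just drop the WORD" option in Python, assuming the coords attribute holds at least four comma-separated integers as described above. The function name `drop_degenerate_words` and the sample line are illustrative only, and the alternative of borrowing coordinates from a neighbouring WORD is not shown.

```python
import xml.etree.ElementTree as ET


def drop_degenerate_words(root):
    """Remove WORD elements whose box has zero width or zero height."""
    removed = 0
    # ElementTree has no parent pointers, so visit every element and
    # inspect its direct WORD children.
    for parent in list(root.iter()):
        for word in list(parent.findall("WORD")):
            x1, y1, x2, y2 = [int(v) for v in word.get("coords").split(",")[:4]]
            if x1 == x2 or y1 == y2:
                parent.remove(word)
                removed += 1
    return removed


sample = (
    '<LINE><WORD coords="10,40,10,20,38">1</WORD>'
    '<WORD coords="20,40,60,20,38">ok</WORD></LINE>'
)
line = ET.fromstring(sample)
print(drop_degenerate_words(line), [w.text for w in line.iter("WORD")])
```

Here the first WORD has x1 == x2 and is dropped, while the second is kept.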

IA Upload failures come from a variety of _djvu.xml IA files, I suppose that xml comes from _abbyy.gz. I noted two common mismatches in corrupted xml files:

  1. pages sometimes don't match with _jp2.zip list of images, since some images are discarded;
  2. xml file contains tags with no text or empty strings; such tags must be removed.
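The second point could be handled with a small cleanup pass before the text layer is merged. The following Python sketch is only one possible approach: it assumes the usual DjVuXML text tags (WORD, LINE, PARAGRAPH, REGION, PAGECOLUMN) and removes any of them that end up containing no text at all, working from the innermost tags outwards. The function name `prune_empty` and the sample fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the text-layer tags worth pruning.
TEXT_TAGS = {"WORD", "LINE", "PARAGRAPH", "REGION", "PAGECOLUMN"}


def prune_empty(element):
    """Recursively drop text-layer elements that contain no text at all."""
    for child in list(element):
        # Clean the child's own subtree first, so that a LINE whose WORDs
        # were all removed is itself seen as empty.
        prune_empty(child)
        if child.tag in TEXT_TAGS and not "".join(child.itertext()).strip():
            element.remove(child)


sample = (
    "<HIDDENTEXT><PARAGRAPH>"
    '<LINE><WORD coords="1,9,5,2"></WORD><WORD coords="6,9,9,2"> </WORD></LINE>'
    '<LINE><WORD coords="1,19,5,12">testo</WORD></LINE>'
    "</PARAGRAPH></HIDDENTEXT>"
)
hidden = ET.fromstring(sample)
prune_empty(hidden)
print(len(list(hidden.iter("LINE"))), [w.text for w in hidden.iter("WORD")])
```

In the sample, the first LINE holds only an empty and a whitespace-only WORD, so the whole LINE disappears and only the second one remains.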

Presently I'm not maintaining the script mentioned by Inductiveload; I simply use Abbyy FineReader when IA Upload fails.

Alex brollo

(Reply sent by email to Inductiveload's comment of Thu, 11 Jun 2020, 18:07, on https://phabricator.wikimedia.org/T182778.)