Failures of merging text layer into djvu file ("Failed to get specified page")
Open, Needs Triage, Public

Assigned To

None

Authored By

	Alex_brollo
	Dec 13 2017, 3:05 PM

Description

Browsing the log of the first "possibly failed" IA Upload output (030PoloIlMilioneSi203), you can see that:

the image layer of djvu is good and it can be downloaded;
the djvu has no text layer;
the last log output says: [2017-12-05 23:09:27] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/030PoloIlMilioneSi203/030_Polo_Il_milione_si203_djvu.xml_new.xml" 2>&1" exited with code 1: * [1-16201] Failed to get specified page. * (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)' [] []

I see from the log messages that the individual generated djvu pages are named differently from the original jp2 images:
030_Polo_Il_milione_si203_0305.jp2 -> 030PoloIlMilioneSi203_p307.jpg -> 030PoloIlMilioneSi203_p307.djvu

The same log message also appears in cases where the original image names contain no underscore, the only difference being the suffix of the file names:

item BarriliDallarupe: BarriliDallarupe_0002.jp2 -> BarriliDallarupe_p4.jpg -> BarriliDallarupe_p4.djvu

I suspect the problem is that djvuxmlparser can't find the page names it expects; this would explain the log message "Failed to get specified page".
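One way to test this hypothesis is to compare the page names the XML asks for with the component names actually present in the djvu. The following is a minimal sketch, not IA Upload's code: it assumes the XML names each page in a `<PARAM name="PAGE" value="...">` element, and the function name `missing_pages`, the sample XML and the page name `BarriliDallarupe_0002.djvu` inside it are made up for illustration.

```python
import xml.etree.ElementTree as ET


def missing_pages(xml_text, component_ids):
    """Return PAGE names referenced by the XML that the djvu does not contain.

    Assumption: each page is named by <PARAM name="PAGE" value="...">.
    component_ids is the list of page/component names in the djvu file,
    obtained by whatever means is available.
    """
    root = ET.fromstring(xml_text)
    wanted = [p.get("value") for p in root.iter("PARAM") if p.get("name") == "PAGE"]
    known = set(component_ids)
    return [name for name in wanted if name not in known]


# Hypothetical sample: the XML still refers to a jp2-style page name,
# while the generated djvu uses the "_pN" naming seen in the log.
sample = """<DjVuXML><BODY>
<OBJECT data="file://localhost/book.djvu" height="100" width="80">
<PARAM name="PAGE" value="BarriliDallarupe_0002.djvu"/>
</OBJECT>
</BODY></DjVuXML>"""

print(missing_pages(sample, ["BarriliDallarupe_p4.djvu"]))
```

A non-empty result would support the naming-mismatch explanation; an empty one would point to a different cause.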

In any case, it.wikisource can use a Python script to merge the _djvu.xml text into the djvu files left by these failures: https://it.wikisource.org/wiki/Progetto:Bot/Programmi_in_Python_per_i_bot/xml2dsed.py (it's rough, but it works).

Related Objects

Mentioned In: T183338: Add a flag "Can be removed" to unsuccessful uploads of IA Upload

Event Timeline

Alex_brollo created this task. Dec 13 2017, 3:05 PM

Restricted Application added a project: Community-Tech. Dec 13 2017, 3:05 PM

Alex_brollo updated the task description. Dec 15 2017, 7:24 AM

Here are two recent examples of IA Upload failures, recovered with xml2dsed:
https://commons.wikimedia.org/wiki/File:Caterina_da_Siena_%E2%80%93_Libro_della_divina_dottrina,_1912_%E2%80%93_BEIC_1785736.djvu

https://commons.wikimedia.org/wiki/File:Guidiccioni,_Giovanni_%E2%80%93_Rime,_1912_%E2%80%93_BEIC_1850335.djvu

Alex_brollo assigned this task to Samwilson. Dec 20 2017, 8:57 AM

Samwilson renamed this task from "IA Upload: failures of merging text layer into djvu file" to "Failures of merging text layer into djvu file ("Failed to get specified page")". Dec 21 2017, 6:51 AM

Most of the recent failures are due to this very same bug. See the log of any failed upload that offers a link for downloading the djvu.

I moved from using _djvu.xml directly to the more complex _djvu.xml -> dsed conversion because manipulating dsed is much simpler; the only hard step is the coordinate conversion. Once you have the OCR layer in dsed format, you can use djvused, which is much faster and more flexible. I also found that some IA _djvu.xml files are faulty at the source, but these faults can be fixed. I think IA uses _djvu.xml only to get the text coordinates needed for word search and highlighting in its viewer, so IA isn't much concerned with whether the _djvu.xml file is usable for building a text layer in a djvu file.
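To illustrate the coordinate conversion step, here is a minimal sketch; it is not taken from xml2dsed.py. It assumes that WORD coords in _djvu.xml are ordered left,bottom,right,top with y measured downward from the top of the page, that dsed measures y upward from the bottom, and that the page height in pixels is known. The function name `word_to_dsed` is hypothetical, and whether an off-by-one correction is needed has not been checked here.

```python
def word_to_dsed(coords, text, page_height):
    """Convert one XML WORD into a dsed-style (word ...) expression.

    Assumptions (not confirmed by the thread): coords is
    "left,bottom,right,top[,baseline]" with a top-left origin, and dsed
    expects "xmin ymin xmax ymax" with a bottom-left origin.
    """
    left, bottom, right, top = (int(v) for v in coords.split(",")[:4])
    # Flip the vertical axis: a large y from the top is a small y from the bottom.
    y_min = page_height - bottom
    y_max = page_height - top
    # Escape backslashes and double quotes inside the quoted string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'(word {left} {y_min} {right} {y_max} "{escaped}")'


print(word_to_dsed("10,90,50,70,88", "Milione", 100))
```

The same flip would apply to the enclosing line, paragraph and page boxes, which are omitted here to keep the sketch short.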

TBolliger removed a project: Community-Tech. Mar 27 2018, 12:37 AM

I'm not working on this at the moment.

@Alex_brollo I think this might be due to things like this in the djvu_xml file:

<WORD coords="x1,y1,x2,y2,dontcare">1</WORD>

where x1, y1, x2, y2 are all integers. If x1 == x2 || y1 == y2, the text layer will be corrupted according to djvused. It should be fairly easy to spot that and just drop the WORD when it happens, or even set the coordinates to be the same as those of the preceding or following WORD.
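The "spot and drop" idea could be sketched as follows. This is only an illustration under assumptions: the function names `is_degenerate` and `drop_degenerate_words` are hypothetical, the first four comma-separated values of `coords` are taken to be the box corners, and a WORD whose coords cannot be parsed is treated as degenerate too.

```python
import xml.etree.ElementTree as ET


def is_degenerate(coords):
    """True if the box has zero width or height, or cannot be parsed."""
    parts = coords.split(",")
    if len(parts) < 4:
        return True
    try:
        x1, y1, x2, y2 = (int(p) for p in parts[:4])
    except ValueError:
        return True
    return x1 == x2 or y1 == y2


def drop_degenerate_words(root):
    """Remove WORD elements with a degenerate box; return how many were removed."""
    removed = 0
    # Snapshot both levels first, because removing while iterating skips elements.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "WORD" and is_degenerate(child.get("coords", "")):
                parent.remove(child)
                removed += 1
    return removed


sample = ET.fromstring(
    "<LINE>"
    '<WORD coords="10,90,50,70,88">Il</WORD>'
    '<WORD coords="60,90,60,70,88">1</WORD>'
    '<WORD coords="70,90,120,90,88">x</WORD>'
    "</LINE>"
)
print(drop_degenerate_words(sample), [w.text for w in sample.iter("WORD")])
```

Dropping is the simpler of the two options mentioned above; copying the box of a neighbouring WORD would keep the text searchable at the cost of a slightly wrong highlight.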

IA Upload failures come from a variety of IA _djvu.xml files; I suppose that the xml is derived from _abbyy.gz. I noted two common mismatches in corrupted xml files:

pages sometimes don't match the list of images in _jp2.zip, since some images are discarded;

the xml file contains tags with no text or with empty strings; such tags must be removed.
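Removing the empty tags could look like the sketch below. The comment does not say which tags are affected, so this assumes the WORD elements are the ones concerned; the function name `drop_empty_words` is hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_empty_words(root):
    """Remove WORD elements with no text or whitespace-only text.

    Assumption: WORD is the affected tag. Returns the number removed.
    """
    removed = 0
    # Snapshot both levels first, because removing while iterating skips elements.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "WORD" and not (child.text or "").strip():
                parent.remove(child)
                removed += 1
    return removed


sample = ET.fromstring(
    "<LINE>"
    '<WORD coords="1,9,5,2">a</WORD>'
    '<WORD coords="6,9,8,2"></WORD>'
    '<WORD coords="9,9,12,2">  </WORD>'
    "</LINE>"
)
print(drop_empty_words(sample), [w.text for w in sample.iter("WORD")])
```

Lines or paragraphs left without any WORD after this pass may need the same treatment; that is not covered here.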

Presently I'm not maintaining the script mentioned by Inductiveload; I simply use Abbyy FineReader when IA Upload fails.

Alex brollo

(Sent by email in reply to Inductiveload's comment of Thu 11 Jun 2020, 18:07, on https://phabricator.wikimedia.org/T182778)

Samwilson mentioned this in T183338: Add a flag "Can be removed" to unsuccessful uploads of IA Upload. Jul 15 2021, 6:59 AM
