IA Upload: "Failed to get specified page" DjVu generation error
Open, Low · Public · 3 Estimated Story Points

Assigned To

None

Authored By

	Samwilson
	Mar 7 2017, 6:11 AM

Description

When uploading seispersonajesen00pira to seis_personaje_en_busca_de_autor.djvu:

[2017-03-06 14:10:31] LOG.INFO: Merging modified XML into full DjVu file [] []
[2017-03-06 14:10:44] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/seispersonajesen00pira/seispersonajesen00pira_djvu.xml_new.xml" 2>&1" exited with code 1: *** [1-16201] Failed to get specified page. *** (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'  [] []

Related Objects

Mentioned Here: T161396: Memory issues for IA-upload when converting large files to djvu

Event Timeline

Samwilson created this task. Mar 7 2017, 6:11 AM

Restricted Application added a subscriber: Aklapper. Mar 7 2017, 6:11 AM

Samwilson triaged this task as Medium priority. Mar 7 2017, 6:11 AM

Samwilson moved this task from Backlog to IA Upload on the All-and-every-Wikisource board.

Another: icolloquiliriche00gozzuoft

DannyH moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. Apr 4 2017, 11:25 PM

kaldari lowered the priority of this task from Medium to Low. Apr 4 2017, 11:37 PM

kaldari set the point value for this task to 5.

kaldari moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech. Apr 11 2017, 10:24 PM

Another: misteridipolizia00niceuoft

[2017-04-18 09:33:56] LOG.INFO: Merging modified XML into full DjVu file [] []
[2017-04-18 09:35:39] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/misteridipolizia00niceuoft/misteridipolizia00niceuoft_djvu.xml_new.xml" 2>&1" exited with code 1: *** [1-16201] Failed to get specified page. *** (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'  [] []

Samwilson moved this task from Ready to In Development on the Community-Tech-Sprint board. Apr 26 2017, 2:42 AM

Samwilson claimed this task. Apr 26 2017, 7:46 AM

An upstream bug report hasn't yielded anything useful yet.

However, it looks like the Internet Archive simply does not always include every page image in its DjVu XML. For example:

in web1990gard_djvu.xml there are 176 pages represented, numbered 1–176,
but in the JP2 zip file there are 180, numbered 0–179,
and in the item's metadata XML the imagecount value is 178.

So I'm going to look at solving this and T161396 at the same time by processing individual pages' OCR before merging the whole book.
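The page-count mismatch described above can be checked mechanically. The following is a minimal sketch, not part of IA Upload: the function names are hypothetical, and it assumes that each page in a _djvu.xml file is one `<OBJECT>` element and that the JP2 archive is an ordinary zip whose page images end in `.jp2`.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def count_xml_pages(djvu_xml_text):
    """Count <OBJECT> elements (assumed: one per page) in DjVu XML text."""
    root = ET.fromstring(djvu_xml_text)
    return sum(1 for _ in root.iter("OBJECT"))


def count_jp2_images(zip_bytes):
    """Count the .jp2 members of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(1 for name in archive.namelist() if name.lower().endswith(".jp2"))


def page_count_report(djvu_xml_text, zip_bytes, imagecount):
    """Collect the three page counts and flag whether they all agree."""
    counts = {
        "djvu_xml": count_xml_pages(djvu_xml_text),
        "jp2_zip": count_jp2_images(zip_bytes),
        "imagecount": imagecount,
    }
    # Computed before the flag itself is added, so only the three counts are compared.
    counts["consistent"] = len(set(counts.values())) == 1
    return counts
```

For an item like web1990gard, such a report would show three different numbers, which is the situation in which the merge step cannot find every page it is asked for.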

Switching to GraphicsMagick seems to have helped (I guess because the total memory usage of the executing user is taken into account?).

Certainly, the three remaining files suffering from this have been generated correctly now (and either uploaded correctly, or been blocked from doing so as they were queued by a now-blocked user).

I'm rerunning all pending jobs; will investigate further any that fail.

All pending jobs have run, and although some have failed none were for this reason. I'm calling this done (until it crops up again...).

Samwilson moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board. May 15 2017, 1:04 AM

DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint. Jun 6 2017, 9:24 PM

DannyH moved this task from Up Next (June 3-21) to Archive on the Community-Tech board. Jun 6 2017, 9:27 PM

This has started coming up again. e.g. https://tools.wmflabs.org/ia-upload/log/b28146220_0002

Did you try naming the derived DjVu pages with the names that _djvu.xml expects, i.e. the name of the JP2 file with only the extension changed?

I'd like to take a look at the IA Upload scripts, even if there's a high probability that I won't understand them :-( . Where can I find them?

TBolliger moved this task from Archive to New & TBD Tickets on the Community-Tech board. Jan 31 2018, 12:49 AM

TBolliger removed Samwilson as the assignee of this task. Feb 6 2018, 10:48 PM

TBolliger removed the point value for this task.

TBolliger added a project: IA Upload. Feb 6 2018, 10:57 PM

Assigned to... none?
:-(

Samwilson merged a task: T186972: DJVU generated from JP2 seem to be failing. Feb 11 2018, 5:36 AM

Samwilson added a subscriber: Peteforsyth.

Yes, I'm afraid I'm not actively working on this right now. Mainly because I'm not quite sure of the fix! (And I'll not be doing it with my comm-tech hat on.)

The individual djvu files are created with internal filenames pointing to the newly-named JP2 files... but that should be fine, because the names in the _djvu.xml are also changed to the same names. Which maybe is wrong, but the weird thing is that it works in many cases. :-( I'd expect it to be an off-by-one sort of problem, and not work at all...

As perhaps I told you, I'm exploring a different approach:

to convert _djvu.xml into "dsed" format, i.e. the Lisp-like OCR structure produced by djvused output-txt;
to manipulate the dsed file if needed;
to use the resulting dsed file to load the OCR into the DjVu using djvused again; this is both simpler and faster than loading it from the XML file (I presume that the dsed structure is much closer to the internal DjVu text structure; pages can be referred to by their order number in the bundled DjVu file, ignoring their individual names)

Following this approach, it.source obtained excellent DjVu files from the "failed" DjVu files produced by IA Upload (image-only DjVu) by merging into them the content of the IA _djvu.xml files.

The problem is that the Python script is of "amateur quality", i.e. far from sufficient for a decent implementation in a shared tool. Nevertheless, the XML->dsed conversion trick seems to work, and IMHO is promising. A very similar hOCR->dsed conversion is also possible, which allows the hOCR output of Tesseract to be loaded into a DjVu file.
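To give an idea of the dsed text structure mentioned above, here is a minimal sketch of the serialisation half only. It is not the script used on it.source. The function names and the tuple layout are hypothetical. It assumes the coordinates have already been converted to djvused's convention, and it escapes only backslashes and double quotes, leaving non-ASCII handling aside.

```python
def quote_dsed(text):
    """Quote a string for a djvused text expression (escapes \\ and ")."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def zone_to_dsed(zone):
    """Serialise a (kind, xmin, ymin, xmax, ymax, payload) tuple recursively.

    payload is either a string (the zone's text) or a list of child zones.
    """
    kind, xmin, ymin, xmax, ymax, payload = zone
    head = "(%s %d %d %d %d" % (kind, xmin, ymin, xmax, ymax)
    if isinstance(payload, str):
        return head + " " + quote_dsed(payload) + ")"
    return head + "".join(" " + zone_to_dsed(child) for child in payload) + ")"


def document_to_dsed(pages):
    """Emit one 'select N' / 'set-txt' block per page, addressed by order number."""
    lines = []
    for number, zone in enumerate(pages, start=1):
        # A line holding a single period ends the data read by set-txt.
        lines += ["select %d" % number, "set-txt", zone_to_dsed(zone), "."]
    return "\n".join(lines) + "\n"
```

Because each block selects its page by position, the resulting script does not depend on the component file names inside the bundled DjVu, which is the property that makes this route attractive here.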

I inserted some debug printout in XMLParser.cpp and tested it on https://tools.wmflabs.org/ia-upload/log/toda1

See output below.
The +2 offset generated during *jpg/*djvu file conversion & renumbering seems to be to blame.

toda1_p0.djvu is not found in doc->get_id_list(): [toda1_p2.djvu, ..., toda1_p52.djvu]

References in toda1_djvu.xml_new.xml are not consistent with the files used to build toda1.djvu (those available in the /build directory).
IMHO, it would be better to use uniform naming everywhere.

<OBJECT data="file://localhost/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/toda1/toda1.djvu" height="7017" type="image/x.djvu" usemap="toda1_0000.djvu" width="4992">
<PARAM name="PAGE" value="toda1_p0.djvu"/>

mpaa@tools-bastion-03:~/iaupload/toda1$ ~/iaupload/djvu-djvulibre-git/xmltools/djvuxmlparser toda1_djvu.xml_new.xml -o test.djvu
debug: XMLParser.cpp parse(): get page toda1_p0.djvu
debug: item NOT found: toda1_p0.djvu
debug: item toda1_p2.djvu
debug: item toda1_p3.djvu
...
...
debug: item toda1_p51.djvu
debug: item toda1_p52.djvu
*** [1-16201] Failed to get specified page.
*** (XMLParser.cpp:601)
*** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'

The following steps are done:

1. toda1_djvu.xml_new.xml is parsed
2. the first object found is "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/toda1/toda1.djvu", page toda1_p0.djvu
3. page toda1_p0.djvu is searched in toda1.djvu
4. but toda1.djvu has been built with files from 2->52, so toda1_p0.djvu is not recognised as part of toda1.djvu
5. Error
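The lookup in steps 2 to 4 can be reproduced outside djvuxmlparser. This is a minimal sketch with hypothetical function names: it assumes the page references are the `<PARAM name="PAGE">` values in the XML, and that the list of component IDs in the bundled DjVu has been obtained separately.

```python
import xml.etree.ElementTree as ET


def referenced_pages(djvu_xml_text):
    """Return every PAGE value referenced by the DjVu XML, in document order."""
    root = ET.fromstring(djvu_xml_text)
    return [p.get("value") for p in root.iter("PARAM") if p.get("name") == "PAGE"]


def missing_pages(djvu_xml_text, bundled_ids):
    """Return the referenced pages that are not components of the bundled DjVu."""
    known = set(bundled_ids)
    return [page for page in referenced_pages(djvu_xml_text) if page not in known]
```

With the toda1 data, the XML starts at toda1_p0.djvu while the bundle holds toda1_p2.djvu to toda1_p52.djvu, so the very first reference would already be reported as missing.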

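The +2 offset comes from here: scandir() returns "." and ".." as the first two entries, and preg_grep() preserves the keys of the input array, so the first .jp2 file keeps key 2.

<?php
// Print each array key next to its file name to expose the numbering.
$jp2Files = preg_grep( '/^.*\.jp2$/', scandir( './toda1_jp2' ) );
foreach ( $jp2Files as $jp2FileNum => $jp2FileName ) {
    print $jp2FileNum . ' ' . $jp2FileName . ' ' . PHP_EOL;
}
?>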

gives

mpaa@tools-bastion-03:~/iaupload/toda1$ php -q test.php 
2 toda1_0000.jp2 
3 toda1_0001.jp2 
...
52 toda1_0050.jp2

This should fix it, since array_values() reindexes the result from 0:

$jp2Files = array_values(preg_grep( '/^.*\.jp2$/', scandir("./toda1_jp2")));
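For readers who do not use PHP, the same effect can be modelled in Python. This is only an illustration of the key-preserving behaviour, not the tool's code, and the three helper names are hypothetical stand-ins for the PHP functions.

```python
import re


def scandir_like(names):
    """Mimic PHP scandir(): '.' and '..' are included and the list is sorted."""
    return sorted([".", ".."] + list(names))


def preg_grep_like(pattern, items):
    """Mimic PHP preg_grep(): matching entries keep their original index as key."""
    return {index: item for index, item in enumerate(items) if re.search(pattern, item)}


def array_values_like(mapping):
    """Mimic PHP array_values(): discard the keys, so positions restart at 0."""
    return list(mapping.values())
```

Filtering a listing of toda1_0000.jp2 to toda1_0050.jp2 this way leaves keys 2 to 52. Using those keys as page numbers gives the toda1_p2.djvu to toda1_p52.djvu names seen in the debug output, while reindexing brings the first page back to 0.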

Happy to see that this annoying issue has been analyzed and hopefully solved, even if I can't understand the code. I can't wait to test the fixed IA Upload version.

Thank you @Mpaa! I am very glad you have delved into this. :-)

I'll get your fix deployed today.

I've updated the site, and am running the updated code on the remaining hung jobs. Seems to be working:
https://commons.wikimedia.org/wiki/Special:RecentChanges?hidebots=1&translations=filter&hidecategorization=1&hideWikibase=1&tagfilter=OAuth+CID%3A+772&limit=50&days=7&urlversion=2

Processed, though I think that they are losing their text layer
https://en.wikisource.org/w/index.php?title=Page:Gloucestershire_notes_and_queries,_volume_1.djvu/276&action=edit&redlink=1

I took a good look at the DjVu linked by Billinghurst; unfortunately, I have to confirm that there's no text layer :-(

I sort of replicated the process of your program and I get a text layer.
Some corrupted pages were filtered but at least a few survived.
It looks like the original DjVu file is left untouched when djvuxmlparser runs, instead of being modified.
Is it possible to access the log files after a job is completed?

I tried again with this: https://tools.wmflabs.org/ia-upload/log/CapuanaGiacinta
It failed as the file is already available at Commons but the djvu in the tool dir has no text layer.
So it is a good test case (although I do not understand why the produced DjVu has one page fewer, 248 pages vs. 249 images; odd, as that should have produced an error in the logs).
Anyhow, once I removed the missing page from the xml file, djvuxmlparser produced a djvu with text layer in my local directory.

So I have no clue why djvuxmlparser in the tool environment does not add a text layer.
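The manual step of removing the missing page from the XML file could be automated along these lines. This is a minimal sketch with a hypothetical function name, not what was actually run: it assumes each page is an `<OBJECT>` element containing a `<PARAM name="PAGE">` child, it does not touch any other element associated with the dropped page, and ElementTree does not preserve a DOCTYPE declaration when writing the result back out.

```python
import xml.etree.ElementTree as ET


def drop_unknown_pages(djvu_xml_text, bundled_ids):
    """Remove <OBJECT> elements whose PAGE is not a component of the bundle.

    Returns the cleaned XML text and the list of page names that were dropped.
    """
    known = set(bundled_ids)
    root = ET.fromstring(djvu_xml_text)
    removed = []
    # Snapshot the elements first, because the tree is modified while walking it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "OBJECT":
                continue
            pages = [p.get("value") for p in child.iter("PARAM") if p.get("name") == "PAGE"]
            if any(page not in known for page in pages):
                parent.remove(child)
                removed.extend(pages)
    return ET.tostring(root, encoding="unicode"), removed
```

Returning the dropped names alongside the cleaned XML makes it possible to log which pages will end up without a text layer.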

Niharika moved this task from Q1 2018-19 to In Development on the Community-Tech-Sprint board. Mar 13 2018, 9:54 PM

Samwilson set the point value for this task to 3. Mar 13 2018, 11:15 PM

Samwilson moved this task from In Development to Ready on the Community-Tech-Sprint board. Mar 27 2018, 11:10 PM

TBolliger removed a project: Community-Tech-Sprint. Apr 11 2018, 11:58 PM

Removing myself as I'm not actually working on this at the moment.
