IA Upload: "Failed to get specified page" DjVu generation error
Open, Low | Public | 3 Estimated Story Points

Description

When uploading seispersonajesen00pira to seis_personaje_en_busca_de_autor.djvu:

[2017-03-06 14:10:31] LOG.INFO: Merging modified XML into full DjVu file [] []
[2017-03-06 14:10:44] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/seispersonajesen00pira/seispersonajesen00pira_djvu.xml_new.xml" 2>&1" exited with code 1: *** [1-16201] Failed to get specified page. *** (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'  [] []

Event Timeline

Samwilson triaged this task as Medium priority. (Mar 7 2017, 6:11 AM)
Samwilson moved this task from Backlog to IA Upload on the All-and-every-Wikisource board.
kaldari lowered the priority of this task from Medium to Low. (Apr 4 2017, 11:37 PM)
kaldari set the point value for this task to 5.
kaldari moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Another: misteridipolizia00niceuoft

[2017-04-18 09:33:56] LOG.INFO: Merging modified XML into full DjVu file [] []
[2017-04-18 09:35:39] LOG.CRITICAL: Command "djvuxmlparser "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/misteridipolizia00niceuoft/misteridipolizia00niceuoft_djvu.xml_new.xml" 2>&1" exited with code 1: *** [1-16201] Failed to get specified page. *** (XMLParser.cpp:581) *** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'  [] []

An upstream bug report hasn't yielded anything useful yet.

However, it's looking like it's simply a matter of the Internet Archive not always including all page images in their DjVu XML. For example:

So I'm going to look at solving this and T161396 at the same time by processing individual pages' OCR before merging the whole book.

Switching to GraphicsMagick seems to have helped (I guess because the total memory usage of the executing user is taken into account?).

Certainly, the three remaining files suffering from this have been generated correctly now (and either uploaded correctly, or been blocked from doing so as they were queued by a now-blocked user).

I'm rerunning all pending jobs; will investigate further any that fail.

All pending jobs have run, and although some have failed none were for this reason. I'm calling this done (until it crops up again...).

Did you try naming the derived DjVu pages with the names that _djvu.xml expects in its code, i.e. the name of the JP2 file with only the extension changed?

I'd like to take a look at the IA Upload scripts, even if there's a high probability that I won't understand them :-( . Where can I find them?

TBolliger removed the point value for this task.

Yes, I'm afraid I'm not actively working on this right now. Mainly because I'm not quite sure of the fix! (And I'll not be doing it with my comm-tech hat on.)

The individual djvu files are created with internal filenames pointing to the newly-named JP2 files... but that should be fine, because the names in the _djvu.xml are also changed to the same names. Which maybe is wrong, but the weird thing is that it works in many cases. :-( I'd expect it to be an off-by-one sort of problem, and not work at all...

As perhaps I told you, I'm exploring a different approach:

  • to convert _djvu.xml into "dsed" format, i.e. the Lisp-like OCR structure that djvused produces with output-txt;
  • to manipulate the dsed file if needed;
  • to use the resulting dsed file to load the OCR into the DjVu file with djvused again. This is both simpler and faster than loading it from the XML file (I presume that the dsed structure is much closer to the internal DjVu text structure; pages can be referred to by their order number in the bundled DjVu file, ignoring their individual names).
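
A minimal sketch of the XML->dsed step described above, not the actual it.source script. It assumes one OBJECT element per page, in the same order as the pages of the bundled DjVu file, and WORD elements whose coords attribute holds two x values and two y values measured from the top edge of the page (dsed measures y from the bottom edge, so the y values are flipped). The function names are hypothetical, and only backslashes and double quotes are escaped.

```python
import xml.etree.ElementTree as ET


def dsed_escape(text):
    """Escape backslashes and double quotes for a quoted dsed string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def xml_to_dsed(xml_text):
    """Build djvused commands from DjVu XML, addressing pages by position."""
    root = ET.fromstring(xml_text)
    commands = []
    # Pages are numbered by their order in the XML, not by component name.
    for page_no, obj in enumerate(root.iter("OBJECT"), start=1):
        width, height = int(obj.get("width")), int(obj.get("height"))
        line_exprs = []
        for line in obj.iter("LINE"):
            boxes = []
            for word in line.iter("WORD"):
                text = (word.text or "").strip()
                if not text:
                    continue
                # Assumption: the first four values are x, y, x, y (top-left origin).
                x1, ya, x2, yb = (int(v) for v in word.get("coords").split(",")[:4])
                # Flip y so that it is measured from the bottom edge, as in dsed.
                boxes.append((min(x1, x2), height - max(ya, yb),
                              max(x1, x2), height - min(ya, yb), text))
            if not boxes:
                continue
            words = " ".join(
                f'(word {a} {b} {c} {d} "{dsed_escape(t)}")' for a, b, c, d, t in boxes
            )
            # The line box is the bounding box of its words.
            xmin, ymin = min(b[0] for b in boxes), min(b[1] for b in boxes)
            xmax, ymax = max(b[2] for b in boxes), max(b[3] for b in boxes)
            line_exprs.append(f"(line {xmin} {ymin} {xmax} {ymax} {words})")
        if line_exprs:
            page = f"(page 0 0 {width} {height} " + " ".join(line_exprs) + ")"
            # "select N" picks the page by number; a lone "." ends the set-txt data.
            commands += [f"select {page_no}", "set-txt", page, "."]
    return "\n".join(commands)
```

The returned text is meant to be saved as a djvused script and run against the image-only DjVu file. Note that if the XML is missing a page, the positional numbering drifts from that point on, so the page count should be compared first.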

Following this approach, it.source got excellent DjVu files from the "failed" DjVu files produced by IA Upload (image-only DjVu) by merging the content of the IA _djvu.xml files into them.

The problem is that the Python script is of "amateur quality", i.e. far from good enough for a decent implementation in a shared tool. Nevertheless, the XML->dsed conversion trick seems to work, and IMHO is promising. A very similar hOCR->dsed conversion is also possible, which would allow the hOCR output of Tesseract to be loaded into a DjVu file.

I inserted some debug printout in XMLParser.cpp and tested it on https://tools.wmflabs.org/ia-upload/log/toda1

See output below.
The +2 offset generated during *jpg/*djvu file conversion & renumbering seems to be to blame.

toda1_p0.djvu is not found in doc->get_id_list(): [toda1_p2.djvu, ..., toda1_p52.djvu]

References in toda1_djvu.xml_new.xml are not coherent with the files used to make toda1.djvu (those available in /build directory).
IMHO, it would be better to use a uniform naming everywhere.

<OBJECT data="file://localhost/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/toda1/toda1.djvu" height="7017" type="image/x.djvu" usemap="toda1_0000.djvu" width="4992">
<PARAM name="PAGE" value="toda1_p0.djvu"/>

mpaa@tools-bastion-03:~/iaupload/toda1$ ~/iaupload/djvu-djvulibre-git/xmltools/djvuxmlparser toda1_djvu.xml_new.xml -o test.djvu
debug: XMLParser.cpp parse(): get page toda1_p0.djvu
debug: item NOT found: toda1_p0.djvu
debug: item toda1_p2.djvu
debug: item toda1_p3.djvu
...
...
debug: item toda1_p51.djvu
debug: item toda1_p52.djvu
*** [1-16201] Failed to get specified page.
*** (XMLParser.cpp:601)
*** 'DJVU::GP<DJVU::DjVuFile> DJVU::lt_XMLParser::Impl::get_file(const DJVU::GURL&, DJVU::GUTF8String)'
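
The failing lookup above could be caught before djvuxmlparser runs. This is only a sketch, not part of IA Upload: missing_pages is a hypothetical helper, and the list of component IDs is assumed to have been collected separately from the bundled DjVu file (for example from a component listing made with djvused).

```python
import xml.etree.ElementTree as ET


def missing_pages(xml_text, component_ids):
    """Return the PAGE names referenced in the XML that the DjVu file lacks."""
    known = set(component_ids)
    root = ET.fromstring(xml_text)
    # Each page is named by a <PARAM name="PAGE" value="..."/> element.
    referenced = [
        param.get("value")
        for param in root.iter("PARAM")
        if param.get("name") == "PAGE"
    ]
    # Keep document order so the first mismatch is reported first.
    return [name for name in referenced if name not in known]
```

For the toda1 job this would report toda1_p0.djvu (and toda1_p1.djvu, if the XML references it), because the bundled file only contains toda1_p2.djvu to toda1_p52.djvu.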

The following steps are done:

1. toda1_djvu.xml_new.xml is parsed
2. the first object found is "/mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobqueue/toda1/toda1.djvu", page toda1_p0.djvu
3. page toda1_p0.djvu is searched in toda1.djvu
4. but toda1.djvu has been built with files from 2->52, so toda1_p0.djvu is not recognised as part of toda1.djvu
5. Error

The +2 offset comes from here:

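<?php
// scandir() also returns "." and "..", and preg_grep() preserves the keys
// of its input array, so the first .jp2 file ends up at key 2.
$jp2Files = preg_grep( '/^.*\.jp2$/', scandir( './toda1_jp2' ) );
foreach ( $jp2Files as $jp2FileNum => $jp2FileName ) {
    print $jp2FileNum . " " . $jp2FileName . " " . PHP_EOL;
}
?>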

gives

mpaa@tools-bastion-03:~/iaupload/toda1$ php -q test.php 
2 toda1_0000.jp2 
3 toda1_0001.jp2 
...
52 toda1_0050.jp2

This should fix it:

$jp2Files = array_values(preg_grep( '/^.*\.jp2$/', scandir("./toda1_jp2")));
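
For readers who don't know PHP, here is a rough Python analogue of why the keys start at 2 and what array_values() changes. It only illustrates the behaviour and is not IA Upload code; both function names are made up, and a plain suffix test stands in for the regular expression.

```python
def grep_keep_keys(entries, suffix):
    """Mimic PHP preg_grep(): matching entries keep their original indexes."""
    return {i: name for i, name in enumerate(entries) if name.endswith(suffix)}


def reindex(filtered):
    """Mimic PHP array_values(): renumber the kept entries from 0."""
    return dict(enumerate(filtered.values()))


# A scandir-style listing: "." and ".." come before the page images.
listing = [".", "..", "toda1_0000.jp2", "toda1_0001.jp2"]
kept = grep_keep_keys(listing, ".jp2")
print(kept)           # {2: 'toda1_0000.jp2', 3: 'toda1_0001.jp2'}
print(reindex(kept))  # {0: 'toda1_0000.jp2', 1: 'toda1_0001.jp2'}
```

With the keys renumbered, page names built from the key (toda1_p0.djvu and so on) start at 0 again.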

Happy to see that this annoying issue has been analyzed and hopefully solved, even if I can't understand the code. I can't wait to test the fixed IA Upload version.

Samwilson edited projects, added Community-Tech-Sprint; removed Community-Tech.

Thank you @Mpaa! I am very glad you have delved into this. :-)

I'll get your fix deployed today.

I took a good look at the DjVu linked by Billinghurst; unfortunately I have to confirm that there's no text layer :-(

I sort of replicated the process of your program and I get a text layer.
Some corrupted pages were filtered but at least a few survived.
It looks like the original djvu file is untouched when djvuxmlparser runs, instead of being modified.
Is it possible to access the log files after a job is completed?

I tried again with this: https://tools.wmflabs.org/ia-upload/log/CapuanaGiacinta
It failed as the file is already available at Commons but the djvu in the tool dir has no text layer.
So it is a good test case (given that I do not understand why the produced djvu has one page less, 248 pages vs 249 images, weird as it should have been an error in the logs ....).
Anyhow, once I removed the missing page from the xml file, djvuxmlparser produced a djvu with text layer in my local directory.

So I have no clue why djvuxmlparser in the tool environment does not add a text layer.

Samwilson set the point value for this task to 3. (Mar 13 2018, 11:15 PM)

Removing myself as I'm not actually working on this at the moment.