
File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu from 2015-10-28T22:38:30 fails to parse DjVu metadata for lack of the LIBXML_PARSEHUGE flag
Closed, Resolved, Public

Description

Split off from the parent task.

File Толковый_словарь._Том_2(1)_(Даль_1905).djvu version 2015-10-28T22:38:30 won't render.

Locally it causes an OOM, apparently during UTF-8 normalization of the OCR text. It's unclear whether that's what is happening on production.


For ease of reference, I uploaded the problematic version of the file to testwiki: https://test.wikipedia.org/wiki/File:T107664.djvu

To test the problematic version against the ProofreadPage extension, the file was also uploaded to test2wiki:
https://test2.wikipedia.org/wiki/File:T107664.djvu

Event Timeline

Bawolff raised the priority of this task to Low.
Bawolff updated the task description. (Show Details)
Bawolff added subscribers: Tgr, Hinote, Vladis13 and 11 others.
Tgr set Security to None.

Error creating thumbnail: No path supplied in thumbnail object

canRender() is returning false, probably due to getWidth() being 0 (which is probably due to lack of img_metadata on a multipage file). This causes MediaTransformOutput::hasFile() to return false, which is where the "Error creating thumbnail: No path supplied in thumbnail object" error comes from.
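A quick way to confirm that chain on a local wiki (a hedged debugging sketch for the eval.php maintenance shell; wfFindFile() and the File accessors are the usual calls of this era, but this is only a debugging aid, not the actual rendering path):

$file = wfFindFile( 'T107664.djvu' );       // or whatever the local upload is named
var_dump( strlen( $file->getMetadata() ) ); // near zero when img_metadata extraction failed
var_dump( $file->getWidth() );              // 0 without multipage metadata
var_dump( $file->canRender() );             // false, so the transform object ends up with no file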

The real question is why img_metadata is failing to be extracted.

I think the out-of-memory error I'm getting locally is due to the massive amount of output in the debug log when I was reading the metadata for that image.

So locally, it appears my issue was that I had DBO_DEBUG set, which was causing enough log spam to push my wiki over the memory_limit (128M). That's unlikely to be the problem on production (although I'll note that even without DBO_DEBUG, there is still about 10 MB worth of error logging due to the @trigger_error( '' ) on line 2893 of GlobalFunctions.php).
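For anyone else hitting the same local OOM while debugging, one blunt workaround (just a sketch; it does nothing about the underlying log volume) is to give PHP more headroom in LocalSettings.php:

ini_set( 'memory_limit', '256M' ); // up from the 128M limit mentioned above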

It should be noted that once I disable DBO_DEBUG and no longer hit the OOM, I can view the file fine on my local wiki.

I think my local out-of-memory issues are unrelated. The metadata is being extracted intact: the md5 hash of img_metadata locally matches the md5 hash of the metadata on Commons. So I guess I was barking up the wrong tree earlier.
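For reference, a sketch of how that comparison can be done from eval.php (wfGetDB() and DB_SLAVE are the 1.26-era names; the Commons-side value comes from hashing the same field there):

$dbr = wfGetDB( DB_SLAVE );
$meta = $dbr->selectField( 'image', 'img_metadata',
	array( 'img_name' => 'T107664.djvu' ) ); // or the local name of the file
echo md5( $meta ) . ' (' . strlen( $meta ) . " bytes)\n";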

Maybe line 312 of DjVu.php should be:

$tree = new SimpleXMLElement( $metadata, LIBXML_PARSEHUGE );

I think 10 MB is about where the libxml restrictions start to kick in for newer libxml (my local wiki has an old version of libxml without that restriction, which might explain why it's working for me locally but not on production).
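If it helps anyone reproduce this outside of MediaWiki, here is a self-contained sketch of the limit in question (libxml2's cap of roughly 10,000,000 bytes on a single text node, which LIBXML_PARSEHUGE disables); the element names are made up and have nothing to do with the real DjVu metadata structure:

<?php
// Build a document whose single text node exceeds the ~10 MB cap.
$big = str_repeat( 'x', 11000000 );
$xml = "<doc><text>$big</text></doc>";

try {
	new SimpleXMLElement( $xml ); // default flags: should fail on newer libxml
	echo "default flags: parsed\n";
} catch ( Exception $e ) {
	echo "default flags: " . $e->getMessage() . "\n";
}

$tree = new SimpleXMLElement( $xml, LIBXML_PARSEHUGE ); // the proposed fix
echo "LIBXML_PARSEHUGE: parsed, " . strlen( (string)$tree->text ) . " bytes of text\n";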


If anyone can reproduce this bug locally, it would be great if they could test to see if the above fix works.

Uploading this file with UploadWizard on a default installation is still problematic because "This result was truncated because it would otherwise be larger than the limit of 8,388,608 bytes".

In the WMF installation we have 12 MB (temporarily, lol):

'wgAPIMaxResultSize' => array(
	'default' => 12582912, // 12 MB; temporary while I figure out what the deal with those overlarge revisions is --Roan
),
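On a default installation the same knob can be raised in LocalSettings.php until the oversized metadata is dealt with (a workaround sketch, not a recommendation to leave it that way):

$wgAPIMaxResultSize = 12582912; // 12 MB, up from the default 8,388,608 bytes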

So there are a lot of bugs around huge metadata.

P2257 is what happens for me. Note that this takes about 30s to process.

And after changing the code to

- $tree = new SimpleXMLElement( $metadata );
+ $tree = new SimpleXMLElement( $metadata, LIBXML_PARSEHUGE );

It, indeed, WORKS (see P2258). Thanks Bawolff!

rillke@rillke-VirtualBox:/var/www/mw$ xmllint --version
xmllint: using libxml version 20901
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma

Change 249724 had a related patch set uploaded (by Rillke):
Parse huge XML metadata from DjVu images

https://gerrit.wikimedia.org/r/249724

@Bawolff, if you would like some changes to the patch set, it's probably better if you make them yourself, because I'll be busy for the next few days and it's always a hassle running my VMs side-by-side with my normal work. Feel free to upload a new patch set or override the commits in the set. I just uploaded them to be helpful; if they're okay, they can just be merged...

Change 249724 had a related patch set uploaded (by Rillke):
Parse huge XML metadata from DjVu images

https://gerrit.wikimedia.org/r/249724

Please get Gerrit to merge this... the sooner it gets rolled out, the sooner we'll know if all the mentioned cases are resolved or not.

> Please get Gerrit to merge this... the sooner it gets rolled out, the sooner we'll know if all the mentioned cases are resolved or not.

This might not be without security implications, which is why I guess people are hesitating to +2 it. They might want to come up with a better solution, like using a SAX parser (PHP's XMLReader, for example), but that in turn needs some code rewriting.
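For the record, a rough sketch of what the XMLReader route might look like; the element name and per-page handling are assumptions about the DjVu XML dump, not actual MediaWiki code:

$reader = new XMLReader();
// The flag may still be needed for oversized single text nodes, but the whole
// tree never has to be held in memory at once.
$reader->open( 'metadata.xml', null, LIBXML_PARSEHUGE ); // hypothetical dump of the extracted XML

$pages = 0;
while ( $reader->read() ) {
	if ( $reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'OBJECT' ) {
		$pages++;
		// Handle one page's worth of data here and discard it, so peak memory
		// stays bounded no matter how large the OCR text layer is.
	}
}
$reader->close();
echo "pages seen: $pages\n";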

Change 249724 had a related patch set uploaded (by Rillke):
Parse huge XML metadata from DjVu images

https://gerrit.wikimedia.org/r/249724

> Please get Gerrit to merge this... the sooner it gets rolled out, the sooner we'll know if all the mentioned cases are resolved or not.

It won't go live until code is updated on Wednesday.

> So locally, it appears my issue was that I had DBO_DEBUG set, which was causing enough log spam to push my wiki over the memory_limit (128M). That's unlikely to be the problem on production (although I'll note that even without DBO_DEBUG, there is still about 10 MB worth of error logging due to the @trigger_error( '' ) on line 2893 of GlobalFunctions.php).

See T103671

Bawolff renamed this task from File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu from 2015-10-28T22:38:30 fails to extract metadata/render. Possibly OOM on utf-8 normalization to File:Толковый_словарь._Том_2(1)_(Даль_1905).djvu from 2015-10-28T22:38:30 fails to parse DjVu metadata for lack of the LIBXML_PARSEHUGE flag. Oct 29 2015, 10:17 PM

Change 249724 merged by jenkins-bot:
Parse huge XML metadata from DjVu images

https://gerrit.wikimedia.org/r/249724

> It won't go live until code is updated on Wednesday.

I'm well aware of that, but unless it's merged by then, Wednesday can come and go before we know it (can't take anything for granted around here, so 'the squeaky wheel gets the grease').

@Bawolff, unless I'm assuming too much, won't this fix indirectly resolve T104056 in the process? If so, please merge/close that task as needed.

It does sound similar. Let's just wait until the patch goes live and see if the issue is fixed.

> To test the problematic version against the ProofreadPage extension, the file was also uploaded to test2wiki:
> https://test2.wikipedia.org/wiki/File:T107664.djvu

Well, the pages, dimensions AND thumbnails came in just fine on the above. So it does seem like tomorrow's rollout to the rest of the wikis should fix this pain-in-the-azz bug once and for all!

Bawolff claimed this task.