Text extraction fails on seemingly bog-standard DjVu file
Closed, ResolvedPublicBUG REPORT

Description

On File:Atalanta - Vol. 2.djvu (Wikisource index), MediaWiki-DjVu fails to extract the text layer; exemplified at this page.

Properties of the file that might conceivably trigger a problem: it is moderately large (~320 MB), has many pages (838), and contains a lot of text (~3 MB). None of these values are unique or extreme for the range processed by Commons/Wikisource, but they all sit toward the high end. The sheer amount of text might be worth looking into: iirc, it gets stored in a database field that isn't really designed to hold huge amounts of text (a field meant for metadata? The details escape me, but I seem to recall that overshooting this size makes MediaWiki fall down in interesting ways, quite possibly because truncated, and thus invalid, XML is saved in a text field), and the total size of the text in the DjVu is inflated by being wrapped in XML for storage (iirc and aiui; I could very well be wrong on all counts there).

The underlying DjVu file does have a text layer, and all manual checks of the file suggest that it is a perfectly normal and valid text layer. No relevant error messages are visible in the web browser's JavaScript console, and the problem is reproducible across browsers (Safari, Firefox, and Chrome tested), OS platforms (macOS and Windows tested), and users (including logged out), so it is unlikely to be a client or front-end issue.

Steps to Reproduce: Open https://en.wikisource.org/w/index.php?title=Page:Atalanta_-_Vol._2.djvu/18&action=edit&redlink=1 in a browser
Actual Results: No OCR text layer is preloaded into the text field
Expected Results: The OCR text layer from the DjVu should be preloaded into the text field

Event Timeline

It looks like this could possibly be a manifestation of T192866. In my testing, removing the text layer from a sufficient number of pages makes the problem disappear, but removing any single (potentially problematic) page appears to have no effect.

And looking at T192866 and the referenced code (DjVuImage.php), it looks like both the raw page count and the actual amount of text in the text layer contribute to the size of the data stashed in img_metadata: each page is described by a set of XML tags (DPI, gamma, etc.), and then the entirety of the text layer is added (wrapped in yet more XML). So this work (Atalanta) is probably tripping the limit both by being long (838 pages) and by legitimately having a lot of text on each page.
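To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The per-page overhead and XML inflation factor are assumptions for illustration only, not MediaWiki's actual serialization:

```python
# Hypothetical sketch (NOT MediaWiki's actual format): estimate how the
# stashed img_metadata grows with both page count and text volume.

def estimate_metadata_size(pages: int, avg_text_bytes: int,
                           per_page_overhead: int = 120) -> int:
    """Rough size of the stashed XML: a fixed per-page tag block
    (DPI, gamma, dimensions, ...) plus the text layer itself,
    inflated by an assumed ~1.2x XML-wrapping overhead."""
    # integer arithmetic: 12/10 stands in for the assumed 1.2x inflation
    return pages * per_page_overhead + (pages * avg_text_bytes * 12) // 10

# Atalanta - Vol. 2: 838 pages, ~3 MB of text total (~3750 bytes/page)
print(estimate_metadata_size(838, 3750))  # → 3871560, i.e. ~3.9 MB
```

Even with conservative assumed constants, the total lands in the megabytes, well past what a metadata-sized field is typically expected to hold.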

However, I saw similar symptoms on c:File:Scenes in Southern India.djvu, which is 424 pages with less text per page. On that file, it looks as if a single pathological page triggers the problem. In…

…there is one normal page with 10 detected words in 6 detected lines in 335 bytes of (compressed) text; and one pathological page with 1390 words (each containing only a single space character) in 1385 lines in 8684 bytes of (compressed) text. The fully expanded sexpr markup for the two pages is 1259 and 201508 bytes, respectively.
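The asymmetry between the compressed and expanded sizes is easy to reproduce: a text layer of near-identical one-space "words" is extremely repetitive, so it compresses very well, while the parser must handle the fully expanded form. A minimal sketch, using a hypothetical per-word record loosely modelled on sexpr output (the real djvutxt records differ):

```python
import zlib

# Hypothetical per-word sexpr-style record: coordinates plus a
# single-space "word", as on the pathological page.
entry = '(word 100 200 110 210 " ")\n'

# 1390 near-identical entries, wrapped in a page form
expanded = "(page\n" + entry * 1390 + ")"
compressed = zlib.compress(expanded.encode())

print(len(expanded), len(compressed))
print(len(expanded) // len(compressed))  # expansion ratio well over 30x
```

This is why a page that looks small in the DjVu (a few kB compressed) can still dominate the expanded markup that gets stashed in the database.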

If those roughly 200 kB are overrunning the field size of this database table (or an attendant query limit), then the parsing and persistence code is even more pathological than is immediately obvious from inspection, which makes it very likely that an additional issue is at work here.

One possibility for that issue is that the current code runs a regex replace on \n, \035, and \037, and the pathological page contains 1386 of each of these characters. Depending on the limits of the regex engine, stack-use limits, and so forth, this might be hitting one such limit. That's impossible to tell without logs or debug output from MediaWiki. The attached file should be a good test case for that.
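A minimal Python stand-in for that kind of replacement (the actual code is PHP in DjVuImage.php; octal \035 and \037 are \x1d and \x1f) shows the match volume such a page would generate, which is the quantity any per-match or backtracking limit would be measured against:

```python
import re

# Stand-in for the described replace: strip newlines and the two
# control characters (octal \035 = \x1d, \037 = \x1f).
CONTROL_CHARS = re.compile(r"[\n\x1d\x1f]")

def scrub(text: str) -> str:
    return CONTROL_CHARS.sub(" ", text)

# Synthetic input with 1386 of each character, mimicking the
# pathological page's text layer.
sample = ("word\n\x1d\x1f") * 1386
print(len(CONTROL_CHARS.findall(sample)))  # → 4158 (3 per repetition)
```

Over 4000 matches is not inherently problematic for a regex engine, but combined with a ~200 kB input it is exactly the sort of workload where PCRE backtracking or recursion limits start to bite, which is why logs or debug output would be needed to confirm.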

@Ladsgroup Testing suggests gerrit:738638 resolves this issue as well.

Ladsgroup claimed this task.

Awesome!