
retrieveMetaData() in DjVuImage.php creates knock-on error when a page has invalid text layer
Open, Needs Triage, Public

Description

retrieveMetaData() in DjVuImage.php dumps a DjVu file and then parses the output with a regex looking for (essentially) "(page …" to extract the text layers for the various pages. This approach is position-dependent, so that for each page that the regex for whatever reason misses, the following pages will have their text layer offset by one (which sucks pretty bad for the Wikisources).

For example, if the first two pages of a DjVu file have an invalid text layer, djvutxt --detail=page will spit out…

failed
failed
(page 0 0 2049 3296 …

retrieveMetaData() will (in effect) silently ignore the two failed lines, match the (page … string that belongs to page 3, and assume that it is the text layer for page 1.
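To illustrate the offset, here is a minimal Python sketch of a position-dependent parser. It is not the actual code or regex from DjVuImage.php; the pattern, the function name naive_text_layers and the sample text string are simplified stand-ins made up for illustration.

```python
import re

# Simplified stand-in for the dump shown above. The quoted string is a
# made-up placeholder for the (elided) text layer of the third page.
dump = """failed
failed
(page 0 0 2049 3296 "text of the third page")
"""

# Hypothetical simplification: only lines starting with "(page" are matched,
# so the two "failed" lines leave no trace in the result.
page_re = re.compile(r'^\(page\s[^\n]*', re.MULTILINE)

def naive_text_layers(dump_text):
    """Number the matches sequentially, as a position-dependent parser would."""
    matches = page_re.finditer(dump_text)
    return {number: m.group(0) for number, m in enumerate(matches, start=1)}

layers = naive_text_layers(dump)
print(layers)
```

The only match is the text layer of the third page, but because matches are numbered by position it ends up filed under page 1.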

Invalid text layers can happen for any number of reasons, but in this particular case I discovered that DjVuLibre will happily accept a command to set the text layer for a page even if the syntax of the sexpr is sufficiently invalid that it will later fail to parse it with djvutxt.

One possibly more robust algorithm would be to iterate over the file one page at a time to extract the text layer (djvused -e 'n' will give you the page count; djvutxt --page=<pagenum> the text for each page). The net effect would be that pages with invalid text layers simply do not show any text layer on Wikisource et al., so the problem affects only those pages for which the source DjVu has an invalid text layer to begin with. This is presumably a lot less efficient, but given the relatively low volume of DjVu files uploaded I would imagine it would be within acceptable bounds.
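The per-page approach could look roughly like the following Python sketch (the real fix would be in PHP inside retrieveMetaData()). The helper names run_tool and extract_text_layers are hypothetical. How djvutxt signals a bad text layer for a single page is also an assumption here: the sketch treats a non-zero exit status, or any output that does not start with "(page", as "no usable text layer".

```python
import subprocess

def run_tool(args):
    """Run a DjVuLibre command; return its stdout, or None on a non-zero exit."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.stdout if proc.returncode == 0 else None

def extract_text_layers(path, run=run_tool):
    """Return {page_number: text or None}, fetching one page at a time."""
    count_out = run(["djvused", "-e", "n", path])
    if count_out is None:
        raise RuntimeError("could not read the page count")
    page_count = int(count_out.strip())

    layers = {}
    for page in range(1, page_count + 1):
        out = run(["djvutxt", f"--page={page}", "--detail=page", path])
        # Assumption: anything other than a "(page ..." expression
        # (e.g. "failed", "()", or an error exit) means no usable layer.
        if out is None or not out.lstrip().startswith("(page"):
            layers[page] = None
        else:
            layers[page] = out
    return layers
```

Because each page is queried by its own number, a broken page only produces a None entry for itself and cannot shift the text of the pages after it. The run parameter exists only so that the tool invocation can be swapped out, for example in tests.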

If you need to reproduce this, the sexpr (page 0 0 123 456 (line 0 0 0 42 (word 0 0 0 42 ))) should do it (not tested; reduced from an actual test case). The syntax error there is that the word is missing its attendant text string (it should be (word 0 0 0 42 "foo")). Stick it in a text file and use djvused -e 'select 1; set-txt file.sexpr; save' djvufile.djvu to set the text layer for page 1 to the invalid expression.

An actual file exhibiting the problem is in the 11:22, 27 March 2019 version of https://commons.wikimedia.org/wiki/File:Henry_VI_Part_3_(1923)_Yale.djvu (which I intend to overwrite with a fixed version imminently).

Addendum: simply removing the invalid text layer (djvused -e 'select 1; remove-txt; save') seems to work around this. Pages without a text layer will then be spit out as:

()
()
(page 0 0 123 456 …