Bug in Djvu text layer extraction
Closed, ResolvedPublic

Description

Author: simon.lipp

Description:
Bug has been encountered on fr.wikisource :

MediaWiki 1.16alpha-wmf (r58524)
PHP 5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL 4.0.40-wikimedia-log

When the text layer of the Djvu file contains the sequence « \") » (an escaped double quote followed by a closing parenthesis), the MediaWiki parser produces an empty page, and from that point on the text layer is shifted by one page relative to the image. An example of a problematic Djvu file can be found here:

http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251

In particular, on page 80 we can find the following text (poor scan quality): « La quatrième année (.\"),*)()) ». The problem can be seen in the proofread version of this scan:

http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit : the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit : no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit : the text layer and the image no longer match

I have been able to track down and fix the bug in my local MediaWiki installation (same branch, same revision as fr.wikisource). The problem is located in DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular expression treats any ") as the end-of-page marker, but a \ before the double quote should prevent that interpretation.

I replaced the current regular expression with the following one, and the problem is now fixed:

$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );

Note on the regular expression: this is an adaptation of the usual regular expression for matching text between double quotes with backslash as the escape character, which in Perl would be: "((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").) corresponds to the trivial [^"\\], but the problem is that [^\"] and [^"] are not really the same thing…
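To illustrate the difference, here is a rough Python sketch of the same idea. The sample s-expression is made up for the demo, and the escape-aware pattern is a non-possessive analogue of the patched PCRE, not the patch itself:

```python
import re

# Hypothetical fragment of djvused-style output: the page text is a
# double-quoted string in which a literal '"' is escaped as '\"'.
txt = r'(page 0 0 2550 3300 "La quatrieme annee (.\"),*)()")'

# Pre-patch behaviour: any '"' followed by ')' ends the page text,
# so the match stops at the escaped quote.
naive = re.search(r'"(.*?)"\s*\)', txt)

# Escape-aware analogue of the patched expression: a backslash
# escapes the next character, so '\"' no longer terminates the string.
aware = re.search(r'"((?:\\.|[^"\\])*)"\s*\)', txt)

print(naive.group(1))  # truncated at the escaped quote
print(aware.group(1))  # full page text
```

The first pattern captures only up to the escaped quote; the second captures the whole page text, which is the behaviour the patch restores.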


Version: 1.16.x
Severity: normal

Details

Reference
bz21526
bzimport set Reference to bz21526.
bzimport created this task. Nov 15 2009, 9:21 PM

lars wrote:

In the Djvu file File:Post- och Inrikes Tidningar 1836-01-27.djvu, page 4 contains the two-character sequence "), properly escaped. After this, on the same page, is the word "Eskilstuna", which you can search for and find in djview if you download the djvu file.

But text extraction for the Wikisource ProofreadPage extension stops at the "). To verify this, go to
http://en.wikisource.org/wiki/Page:Post-_och_Inrikes_Tidningar_1836-01-27.djvu/4
and click "create". (But don't create that page on the English Wikisource; it already exists on the Swedish Wikisource.)

lars wrote:

To extract the OCR text (without pixel coordinates for each word) for page NNN, this command should do:

djvused -e 'select NNN; print-pure-txt' FILENAME.djvu

lars wrote:

Page /66 of commons:File:Östgötars_minne.djvu contains the two-character sequence "), and that is where the extracted text ends.

For /67 the extracted text is empty.

For /68, the extracted text is the one that belongs to the /67 image. All subsequent pages have the text layer off by one or more pages.

The OCR quality is low (it comes from Google), so a new OCR should be generated before proofreading. But until then, this file is another test case for this bug.

http://sv.wikisource.org/wiki/Index:%C3%96stg%C3%B6tars_minne.djvu

thomasV1 wrote:

The proposed patch is a Perl-compatible regexp. I am not familiar with that syntax, which is why I have not committed it.

Could someone have a look at it, or provide a POSIX regexp?

simon.lipp wrote:

or provide a posix regexp ?

That’s not possible: matching C-like quoted strings needs look-ahead and possessive operators, which are not available in POSIX syntax. But if you have any questions, feel free to contact me (I’m Sloonz on fr.wikisource).

thomasV1 wrote:

I tested your patch on this djvu file:
http://fr.wikisource.org/wiki/Livre:Revue_des_Romans_%281839%29.djvu

The file does not have the bug; djvu text extraction works without the patch. With the patch, pages are no longer aligned with the text.

simon.lipp wrote:

With the patch, pages are no longer aligned with the text.

Strange; when I made the patch, I didn’t see this problem. I’ll look at it during the week.

simon.lipp wrote:

Patch

Found the problem (I had dropped the empty-page case). Attached is an updated patch that fixes it. By applying htmlspecialchars after the matching phase, it is possible to get rid of the unreadable look-ahead. I also commented the regexp using the /x modifier of PCRE. But it is still not possible to convert this into a POSIX regexp, since ereg_* has no equivalent of preg_replace_callback.

Also, your file has a problem on page 8 (http://fr.wikisource.org/w/index.php?title=Page:Revue_des_Romans_%281839%29.djvu/8&action=edit). As a side effect, the patch fixes that too ;)

Attached: fix-djvu-ocr.patch

thomasV1 wrote:

Thanks for the patch and the detailed explanation.
I committed it (r69139).

It would be nice if this bug fix could be considered for deployment to the Wikisource sites ahead of the scheduled updates (the next full application review).

It is a minor bug with major consequences for affected works: it leaves a blank page, misaligns the text, and requires every subsequent page in a work to be moved forward.

Simple arithmetic: even if we have only 20 broken works, with DjVu files typically 200-500 pages in size, that already equates to somewhere between 2000 and 8000 page moves.

Thanks for any consideration that could be made to this request.

simon.lipp wrote:

Well, in the meantime, it’s still possible to fix the broken djvu files manually; my own PDF-to-djvu converter has these lines:

  # Workaround for MediaWiki bug #21526
  # see https://bugzilla.wikimedia.org/show_bug.cgi?id=21526
  $text =~ s/"(?=\s*\))//g;

A quick look at man djvused gives me this simple command to fix a djvu file (untested):

cp thefile.djvu thefile-fixed.djvu; djvused thefile.djvu -e output-all | perl -pe 's/"(?=\s*\))//g' | djvused thefile-fixed.djvu -s
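For reference, the converter-side substitution can be sketched in Python; the sample string is hypothetical, and this targets the raw OCR text before it is quoted into the s-expression (applied to a full djvused dump, as in the untested pipeline above, the same pattern would likely also strip each page string's legitimate closing quote):

```python
import re

# Hypothetical raw OCR text containing the troublesome sequence:
# a double quote directly before a closing parenthesis.
ocr_text = 'La quatrieme annee (.") fin'

# Same substitution as the Perl one-liner: delete any '"' that is
# followed (after optional whitespace) by ')'.
fixed = re.sub(r'"(?=\s*\))', '', ocr_text)
print(fixed)
```

After the substitution, the OCR text no longer contains a quote before a parenthesis, so the extraction bug can no longer be triggered by this file.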

This was reported as fixed quite some time ago, and even after asking nicely for it to be given some priority for the Wikisource sites, there is neither action nor evidence of it being noticed. Something somewhere somehow would be nice; even a rough indication of who needs to sleep with whom, and where we have to send the photographs, would be helpful. :-)

Deployed now.

Note that the effect of create_function() is to create a global function with a random name and to return the name. Calling it in a loop will eventually use up all memory, because there is no way to delete global functions once they are created. For this reason alone, it shouldn't be used. But it is also slow, requiring a parse operation that is uncached by APC, and it's insecure in the sense that eval() is insecure: construction of PHP code can easily lead to arbitrary execution if user input is included in the code.

Many thanks to all.

As a side note to Wikisourcerers, the files need to be purged at Commons to get them to reload the text layer properly.

simon.lipp wrote:

@Tim Starling
I wasn’t aware of the performance issues of using create_function, sorry.
But since the created function is static, it should be trivial to factor it out; I used create_function only because I’m used to using blocks in Ruby. The corresponding function should just be:

function convert_page_to_xml( $matches ) {
	return '<PAGE value="' . htmlspecialchars( $matches[1] ) . '" />';
}

Anyway, since the text layer is computed only once and then cached, I don’t think that’s a big issue.
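As an illustration only (not the committed code): the match-then-escape-in-a-callback shape can be sketched in Python, with html.escape standing in for htmlspecialchars, re.sub's callable argument standing in for preg_replace_callback, and a simplified stand-in for the page pattern in DjvuImage.php:

```python
import html
import re

# Simplified page pattern: inside the quoted page text, a backslash
# escapes the following character.
PAGE_RE = re.compile(r'\(page\s+[\d\s-]*"((?:\\.|[^"\\])*)"\s*\)')

def convert_page_to_xml(match):
    # Escape only after matching, so the pattern itself never has to
    # deal with &quot; entities.
    return '<PAGE value="%s" />' % html.escape(match.group(1))

txt = '(page 0 0 10 10 "a < b")'
print(PAGE_RE.sub(convert_page_to_xml, txt))
```

Because the callback is a named function rather than a dynamically created one, nothing is parsed at runtime and no per-call function object leaks, which is the point of Tim's remark about create_function.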

(In reply to comment #16)

@Tim Starling
I wasn’t aware of the performance issues of using create_function, sorry.
But since the created function is static, it should be trivial to factor it out
; I used create_function only because I’m used to use blocks in Ruby. The
corresponding function should just be:

function convert_page_to_xml($matches) {
return '<PAGE value="'.htmlspecialchars($matches[1]).'" />';
}

Anyway, since the text layer is computed only once and then cached, I don’t fix
that’s a big issue.

Tim fixed the issue in r78046. The two revisions were then merged from trunk in r78047.

GOIII reopened this task as "Open". Edited Feb 22 2015, 3:46 AM
GOIII added a subscriber: GOIII.
Mpaa added a subscriber: Mpaa. Feb 22 2015, 10:29 AM

@GOIII: Opening a new ticket and referring to the existing old one instead of reopening a four year old one might be more effective to get attention.

Aklapper lowered the priority of this task from "High" to "Low". Apr 19 2015, 4:28 PM
AuFCL added a subscriber: AuFCL. May 24 2015, 2:09 AM
jayvdb added a comment. Oct 9 2015, 1:28 AM

The issue is still on the scriptorium, but the URL #section has changed to https://en.wikisource.org/wiki/Wikisource:Scriptorium#EB11.2C_vol._XXVI
I suspect this is a new bug, rather than an old bug.

Billinghurst added a comment. Edited Oct 9 2015, 2:59 AM

Definitely sounds like a different bug. Can we replicate the process outside of PrP to see what the API returns in the standard sense and what it gives in this example?

Bawolff closed this task as "Resolved". Nov 4 2015, 1:11 AM
Bawolff claimed this task.
Bawolff added a subscriber: Bawolff.

Reclosing this bug. The symptoms you describe for EB1911_-_Volume_26.djvu are completely different from the original bug this task is about (that issue is probably T117013).
