Bug in Djvu text layer extraction
Open · Public

Assigned To
None
Priority
Low
Author
bzimport
Subscribers
AuFCL, Aklapper, Mpaa and 6 others
Projects
Reference
bz21526
Description

Author: simon.lipp

Description:
Bug has been encountered on fr.wikisource :

MediaWiki 1.16alpha-wmf (r58524)
PHP 5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL 4.0.40-wikimedia-log

When the text layer of the Djvu file contains « ") », the MediaWiki parser produces an empty page, and the text layer is then shifted by one page relative to the image. An example of a problematic Djvu file can be found here:

http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251

In particular, on page 80 we find the following text (the scan quality is poor): « La quatrième année (.\"),*)()) ». The problem can be seen in the proofread version of this scan:

http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit (the end of the text is missing)
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit (no text layer)
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit (the text layer and the image no longer match)

I have been able to track down and fix the bug in my local MediaWiki installation (same branch, same revision as fr.wikisource). The problem is located in DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular expression considers any ") as the end-of-page marker, but a backslash before the double quote should prevent that interpretation.

I replaced the current regular expression with this one, and the problem is now fixed:

$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );

Note on the regular expression: it is an adaptation of the usual regular expression for matching text between double quotes with backslash as the escape character, which in Perl would be:
"((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").) corresponds to the trivial [^"\\], but the problem is that [^\"] and [^"] are not really the same thing…
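To make the failure mode concrete, here is a minimal sketch in Python rather than PHP (the page record is a made-up example; the escape-aware pattern follows the same idea as the fix above):

```python
import re

# A djvused-style page record whose OCR text contains the escaped
# double quote followed by a parenthesis -- the « \") » sequence
# described in this report. The record itself is a made-up example.
txt = r'(page 0 0 2550 3300 "before \") after")'

# Naive pattern: the first ") it sees ends the page record, so the
# captured text is truncated at the escaped quote.
naive = re.search(r'\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"(.*?)"\s*\)', txt, re.S)

# Escape-aware pattern (same idea as the proposed fix): a backslash
# escapes the next character, so \" cannot terminate the quoted text.
fixed = re.search(r'\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"((?:\\.|[^"\\])*)"\s*\)', txt, re.S)

print(naive.group(1))  # truncated: before \
print(fixed.group(1))  # full text: before \") after
```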


Version: 1.16.x
Severity: normal

bzimport added a project: MediaWiki-DjVu. Via Conduit · Nov 21 2014, 10:56 PM
bzimport set Reference to bz21526.
bzimport created this task. Via Legacy · Nov 15 2009, 9:21 PM
bzimport added a comment. Via Conduit · Apr 29 2010, 7:26 PM

lars wrote:

In the Djvu file File:Post- och Inrikes Tidningar 1836-01-27.djvu, page 4 contains the two-character sequence "), properly escaped. After it, on the same page, is the word "Eskilstuna", which you can search for and find in djview if you download the djvu file.

But text extraction for the Wikisource ProofreadPage extension stops at the "). To verify this, go to http://en.wikisource.org/wiki/Page:Post-_och_Inrikes_Tidningar_1836-01-27.djvu/4 and click "create". (But don't create that page on the English Wikisource; it already exists on the Swedish Wikisource.)

bzimport added a comment. Via Conduit · May 1 2010, 1:12 AM

lars wrote:

To extract the OCR text (without pixel coordinates for each word) for page NNN, this command should do:

djvused -e 'select NNN; print-pure-txt' FILENAME.djvu

bzimport added a comment. Via Conduit · May 1 2010, 1:26 AM

lars wrote:

Page /66 of commons:File:Östgötars_minne.djvu contains the two-character sequence "), and that is where the extracted text ends.

For /67, the extracted text is empty.

For /68, the extracted text is the one that belongs to the /67 image. All subsequent pages have the text layer off by one or more pages.

The OCR quality is low (coming from Google), so a new OCR should be generated before proofreading. But until then, this file is another test case for this bug.

http://sv.wikisource.org/wiki/Index:%C3%96stg%C3%B6tars_minne.djvu

bzimport added a comment. Via Conduit · Jul 6 2010, 8:12 AM

thomasV1 wrote:

The proposed patch is a Perl-compatible regexp. I am not familiar with that syntax, which is why I have not committed it.

Could someone have a look at it, or provide a POSIX regexp?

bzimport added a comment. Via Conduit · Jul 6 2010, 8:32 AM

simon.lipp wrote:

or provide a posix regexp ?

That's not possible: matching C-like quoted strings needs look-ahead and possessive operators, which are not available in POSIX syntax. But if you have any questions, feel free to contact me (I'm Sloonz on fr.wikisource).

bzimport added a comment. Via Conduit · Jul 6 2010, 8:58 AM

thomasV1 wrote:

I tested your patch on this djvu file:
http://fr.wikisource.org/wiki/Livre:Revue_des_Romans_%281839%29.djvu

The file does not have the bug; djvu text extraction works without the patch. With the patch, pages are no longer aligned with the text.

bzimport added a comment. Via Conduit · Jul 6 2010, 4:10 PM

simon.lipp wrote:

With the patch, pages are no longer aligned with the text.

Strange; when I made the patch, I did not see this problem. I'll look into it this week.

bzimport added a comment. Via Conduit · Jul 7 2010, 10:15 AM

simon.lipp wrote:

Patch

Found the problem (I had dropped the empty-page case). Attached is an updated patch that fixes it. By applying htmlspecialchars after the matching phase, it gets rid of the unreadable look-ahead, and I commented the regexp using PCRE's /x modifier. But it is still not possible to convert this into a POSIX regexp, since ereg_* has no equivalent of preg_replace_callback.

Also, your file has a problem on page 8 (http://fr.wikisource.org/w/index.php?title=Page:Revue_des_Romans_%281839%29.djvu/8&action=edit). As a side effect, the patch fixes that too ;)

Attached: fix-djvu-ocr.patch
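The approach described above (match first, escape afterwards in a callback) can be sketched in Python; the pattern and the sample records are illustrative, not copied from the attached patch:

```python
import html
import re

# Escape-aware pattern: a backslash escapes the next character inside
# the quoted text, and an empty text layer ("") is also matched.
page_re = re.compile(
    r'\(page\s+[\d-]+\s+[\d-]+\s+[\d-]+\s+[\d-]+\s*'
    r'"((?:\\.|[^"\\])*)"\s*\)',
    re.S,
)

def page_to_xml(match):
    # Escaping happens here, after matching, so the regex itself never
    # has to reason about &quot; entities (the role preg_replace_callback
    # plus htmlspecialchars plays in the patch).
    return '<PAGE value="%s" />' % html.escape(match.group(1))

# Two made-up records: one with the problematic \") sequence, one empty.
txt = r'(page 0 0 100 100 "a < b \") c") (page 0 0 100 100 "")'
print(page_re.sub(page_to_xml, txt))
```

Both records survive the substitution, including the empty page, so pages stay aligned with their images.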

bzimport added a comment. Via Conduit · Jul 7 2010, 11:07 AM

thomasV1 wrote:

Thanks for the patch and the detailed explanation.
I committed it (r69139).

Billinghurst added a comment. Via Conduit · Jul 25 2010, 2:44 PM

It would be nice if this bug fix could be considered for deployment to the Wikisource sites out of session, ahead of the scheduled updates (the next full application review).

It is a minor bug with major consequences for the works affected: it leaves a blank page, misaligns the text, and requires every subsequent page in a work to be moved incrementally forward.

Some simple arithmetic: even if only 20 works are broken, with DjVu files typically 200-500 pages in size, that already equates to somewhere between 2000 and 8000 page moves.

Thanks for any consideration that could be made to this request.

bzimport added a comment. Via Conduit · Jul 25 2010, 3:17 PM

simon.lipp wrote:

Well, in the meantime, it's still possible to manually fix broken djvu files; my own PDF-to-djvu converter has these lines:

# Workaround for MediaWiki bug #21526
# see https://bugzilla.wikimedia.org/show_bug.cgi?id=21526
$text =~ s/"(?=\s*\))//g;

A quick look at man djvused gives me this simple command to fix a djvu file (untested):

cp thefile.djvu thefile-fixed.djvu; djvused thefile.djvu -e output-all | perl -pe 's/"(?=\s*\))//g' | djvused thefile-fixed.djvu -s
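The effect of that substitution on a page's raw OCR text can be sketched in Python (the sample line is invented):

```python
import re

# Python equivalent of the Perl workaround s/"(?=\s*\))//g: delete any
# double quote that is followed, modulo whitespace, by a closing paren,
# so the problematic « ") » sequence never reaches the text layer.
ocr = 'La quatrieme annee (.")'
cleaned = re.sub(r'"(?=\s*\))', '', ocr)
print(cleaned)  # -> La quatrieme annee (.)
```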

Billinghurst added a comment. Via Conduit · Nov 3 2010, 10:55 AM

This has been reported as fixed for some time now, and even after asking nicely for it to be given some priority for the Wikisource sites there is neither action nor evidence of it being noticed. Something somewhere somehow would be nice; even a rough indication of who needs to sleep with whom, and where we have to send the photographs, would be helpful. :-)

tstarling added a comment. Via Conduit · Dec 8 2010, 6:06 AM

Deployed now.

Note that the effect of create_function() is to create a global function with a random name and to return the name. Calling it in a loop will eventually use up all memory, because there is no way to delete global functions once they are created. For this reason alone, it shouldn't be used. But it is also slow, requiring a parse operation that is uncached by APC, and it's insecure in the sense that eval() is insecure: construction of PHP code can easily lead to arbitrary execution if user input is included in the code.

Billinghurst added a comment. Via Conduit · Dec 8 2010, 11:28 AM

Many thanks to all.

As a side note to Wikisourcerers: the files need to be purged at Commons to get them to reload the text layer properly.

bzimport added a comment. Via Conduit · Dec 8 2010, 11:41 AM

simon.lipp wrote:

@Tim Starling
I wasn't aware of the performance issues of using create_function, sorry. But since the created function is static, it should be trivial to factor it out; I used create_function only because I'm used to using blocks in Ruby. The corresponding function should just be:

function convert_page_to_xml( $matches ) {
	return '<PAGE value="' . htmlspecialchars( $matches[1] ) . '" />';
}

Anyway, since the text layer is computed only once and then cached, I don't think that's a big issue.

MZMcBride added a comment. Via Conduit · Dec 8 2010, 2:36 PM

(In reply to comment #16)

Tim fixed the issue in r78046. The two revisions were then merged from trunk in r78047.

GOIII reopened this task as "Open". Via Web · Feb 22 2015, 3:46 AM
GOIII added a subscriber: GOIII.
Mpaa added a subscriber: Mpaa. Via Web · Feb 22 2015, 10:29 AM
Aklapper added a subscriber: Aklapper. Via Web · Mar 3 2015, 5:29 PM

@GOIII: Opening a new ticket that refers to this old one, rather than reopening a four-year-old task, might be more effective for getting attention.

Aklapper lowered the priority of this task from "High" to "Low". Via Web · Apr 19 2015, 4:28 PM
AuFCL added a subscriber: AuFCL. Via Web · Sun, May 24, 2:09 AM
