
Stop ignoring paragraph and region separators in DjVu file OCR text layer
Open, Needs Triage, Public

Description

In line 277 of DjVuImage.php, the code…

$txt = preg_replace( "/[\013\035\037]/", "", $txt );

…removes various control characters from the djvutxt output that forms the OCR text layer, including \035 (ASCII 0x1D, "GS", group separator) and \037 (ASCII 0x1F, "US", unit separator). (\036 (ASCII 0x1E, "RS", record separator) is not used by DjVuLibre as far as I can tell.)

These characters are the markers djvutxt uses to signal the presence of a paragraph break (or other OCR page area break), so removing them (ignoring them) leads to consecutive paragraphs or regions of text being smushed together instead of separated by a blank line.

The practical consequence of this is that proofreaders on the Wikisources have to identify every paragraph break in the source text manually, by spotting it in the scanned page image, locating the equivalent point in the wikitext, and inserting an extra line break. Multiply this by 5–10 paragraphs per page, for several hundred pages per book, and even after just a few hundred books the amount of wasted volunteer effort is staggering (and English Wikisource alone currently hosts around a million proofread pages).

The code should therefore be replaced with something like…

$txt = preg_replace( "/[\013]/", "", $txt ); // Ignore carriage returns
$txt = preg_replace( "/[\035\037]+/", "\n", $txt ); // Replace runs of OCR region separators with a single extra line break

…to instead insert an extra newline (i.e. MediaWiki's syntax for indicating a paragraph break) at that position.
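For illustration, here's a minimal before/after sketch of the effect on a paragraph boundary (the sample string is made up; "\x1F" is the same byte as \037, the US paragraph separator):

$raw = "First paragraph.\n\x1FSecond paragraph.\n";

// Current behaviour: the separator is dropped, so the two paragraphs run together.
echo preg_replace( "/[\013\035\037]/", "", $raw );
// "First paragraph.\nSecond paragraph.\n"

// Proposed behaviour: runs of separators become one extra newline, i.e. a blank line,
// which MediaWiki renders as a paragraph break.
echo preg_replace( "/[\035\037]+/", "\n", $raw );
// "First paragraph.\n\nSecond paragraph.\n"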

And the good news is that, since retrieveMetaData() is, as best I can tell, called on demand, this fix will retroactively apply to all not-yet-proofread pages of all DjVu files without needing a massive regeneration run of the kind that image thumbnails or similar would require.

Event Timeline

@Xover: Thanks for taking a look at the code! You are very welcome to use developer access to submit the proposed code changes as a Git branch directly into Gerrit which makes it easier to review and provide feedback. If you don't want to set up Git/Gerrit, you can also use the Gerrit Patch Uploader. Thanks again!

Sigh. The Gerrit Patch Uploader is either broken, or the correct module to pick there isn't mediawiki/core…

Here's the pseudocode diff above converted to unified diff format:

--- DjVuImage.php	2020-07-03 18:27:30.000000000 +0200
+++ DjVuImage new.php	2020-07-03 18:31:36.000000000 +0200
@@ -277,7 +277,8 @@
 			$txt = wfShellExec( $cmd, $retval, [], [ 'memory' => self::DJVUTXT_MEMORY_LIMIT ] );
 			if ( $retval == 0 ) {
 				# Strip some control characters
-				$txt = preg_replace( "/[\013\035\037]/", "", $txt );
+				$txt = preg_replace( "/[\013]/", "", $txt ); // Ignore carriage returns
+				$txt = preg_replace( "/[\035\037]+/", "\n", $txt ); // Replace runs of OCR region separators with a single extra line break
 				$reg = <<<EOR
 					/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"
 					((?>    # Text to match is composed of atoms of either:

@Xover Gerrit change here: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/634781

I don't have an email for you, or I'd have set you as the author.

I also can't seem to provoke Jenkins into running on the changeset.

Change 634781 had a related patch set uploaded (by Aklapper; owner: Inductiveload):
[mediawiki/core@master] Stop ignoring paragraph and region separators in DjVu file OCR text layer

https://gerrit.wikimedia.org/r/634781

Change 634781 merged by jenkins-bot:
[mediawiki/core@master] Stop ignoring paragraph and region separators in DjVu file OCR text layer

https://gerrit.wikimedia.org/r/634781

@Xover: doesn't seem to be working on the pages I've tried. Is there some cache that needs to cycle, or is it something else, do you think?

Hmm. I didn't think there'd be any caching of this, but I may have misunderstood. It might also be that retrieveMetaData() is called once on upload rather than on demand as I'd assumed. And we need to check what $wgDjvuTxt is set to, since this whole block is only executed if that config var isset().
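For reference, here's a rough sketch of that code path, reconstructed only from the diff quoted above plus the isset() check mentioned here (paraphrased, not the exact core source):

if ( isset( $wgDjvuTxt ) ) {
	// $cmd invokes the configured djvutxt binary on the DjVu file
	$txt = wfShellExec( $cmd, $retval, [], [ 'memory' => self::DJVUTXT_MEMORY_LIMIT ] );
	if ( $retval == 0 ) {
		// ...the control-character handling patched above, followed by a larger
		// regex that extracts the per-page text from $txt...
	}
}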

Hmm. $wgDjvuTxt is set in CommonSettings.php, so that should be ok.

retrieveMetaData() is called from getMetadata() in DjVuHandler.php, which is called somewhere in core as a specific image format handler. It's also called from ApiQueryImageInfo.php in getInfo(). And while DjVuHandler.php does talk about caching, I don't really see any caching-related code there. As best I can tell, this code should be getting called on demand for each page, so I'm out of ideas.

@Xover - Have you tried it with a newly uploaded file? That should let us know if it's a caching issue or just not working at all.

It definitely isn't working. On this page the paragraphs run together, but the output from djvutxt thefile.djvu -page=17 -detail=page is:

(page 0 0 2211 3549 
  "Pe OW Bot) OLR RT SLAM OOOR \n\037\013The seizure by the pirate of so consider- \nable a person as that of the Queen of Kish- \nmoor, and of the enormous treasure that \nhe found aboard her ship, would alone have \nbeen sufficient to have established his fame. \nBut the capture of so extraordinary a prize \nas that of the ruby — which was, in itself, \nworth the value of an entire Oriental king- \ndom—exalted him at once to the very high- \nest pinnacle of renown. \n\037Having achieved the capture of this in- \ncredible prize, our captain scuttled the great \nship and left her to sink with all on board. \nThree Lascars of the crew alone escaped \nto bear the news of this tremendous disaster \nto an astounded world. \n\037As may readily be supposed, it was now \nno longer possible for Captain Keitt to hope \nto live in such comparative obscurity as he \nhad before enjoyed. His was now too re- \nmarkable a figure in the eyes of the world. \nSeveral expeditions from various parts were \nimmediately fitted out against him, and it \npresently became no longer compatible with \n\037\013his safety to remain thus clearly outlined \n[3] \n\037\013" )

Oh, no, wait… I think I'm just being a dummy!

retrieveMetaData() isn't actually processing raw setext data, but djvutxt output. djvutxt outputs the setext with metacharacters backslash escaped rather than as raw bytes, but this regex is looking for the raw code points:

# Strip some control characters
# Ignore carriage returns
$txt = preg_replace( "/[\013]/", "", $txt );
# Replace runs of OCR region separators with a single extra line break
$txt = preg_replace( "/[\035\037]+/", "\n", $txt );

It needs to be something like this instead:

# Replace runs of OCR region separators with a single extra line break
$txt = preg_replace( "/\\(013|035|037)+/", "\n", $txt );

Since \013 (ASCII 0x0B, "VT", vertical tab) is being used as a column separator (the other two are for regions and paragraphs), we can treat them all the same. (Note that \013 isn't actually a carriage return as the comment implies; CR is \015.)
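Applied to a fragment of the -detail=page output quoted above, where the separators appear as the literal text "\037" and "\013", that regex would behave roughly like this (a quick sketch, not part of any patch):

$sample = 'pinnacle of renown. \n\037Having achieved the capture';
echo preg_replace( "/(\\\\(013|035|037))+/", "\n", $sample );
// prints "pinnacle of renown. \n", then a real newline, then "Having achieved the capture"
// (the literal "\n" text is untouched; only the "\037" escape sequence is converted)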

But this begs the question: did this code actually do anything to begin with? It was added by ThomasV in this commit in 2009, along with a call to iconv() seemingly intended to nuke invalid UTF-8 characters (which probably stopped working soon after due to a long-standing PHP bug).

Ironically, it looks like @Bawolff came this close to discovering the problem in 2013 when they fixed a similar problem in closely related code (pageTextCallback() is invoked 10 lines later).

In any case… Provided I'm not just being a dummy again, I'm inclined to conclude that the original code was never functional (but did no harm), and thus my original proposed change did nothing either. The correct fix is to look for the escape sequences output by djvutxt rather than the raw characters. I can't think of any way that actual raw VT, GS, US characters should end up in the text layer (unless it's manually generated by buggy tools), so the original code should not be needed.

Anybody want to double-check my thinking / logic on this before we start futzing around with another patch?

Change 652762 had a related patch set uploaded (by Inductiveload; owner: Inductiveload):
[mediawiki/core@master] Fix replacement of control chars in DJVU text output

https://gerrit.wikimedia.org/r/652762

I actually installed MW and ProofreadPage and it does now seem to work with the patch above.

However, it might be that all the text layers are generated on upload? This is how the function is called for me:

1	0.0402	424424	{main}( )	.../index.php:0
2	0.0830	3129776	wfIndexMain( )	.../index.php:46
3	0.0830	3131904	MediaWiki->run( )	.../index.php:53
4	0.0832	3131904	MediaWiki->main( )	.../MediaWiki.php:548
5	0.0839	3146600	MediaWiki->performRequest( )	.../MediaWiki.php:945
6	0.0907	3859672	MediaWiki\SpecialPage\SpecialPageFactory->executePath( )	.../MediaWiki.php:310
7	0.0909	3861440	SpecialUpload->run( )	.../SpecialPageFactory.php:1404
8	0.0909	3861440	SpecialUpload->execute( )	.../SpecialPage.php:645
9	0.1126	5338352	SpecialUpload->processUpload( )	.../SpecialUpload.php:244
10	0.2073	5751808	SpecialUpload->showUploadWarning( )	.../SpecialUpload.php:573
11	0.2073	5751808	UploadFromFile->tryStashFile( )	.../SpecialUpload.php:428
12	0.2073	5751808	UploadFromFile->doStashFile( )	.../UploadBase.php:1158
13	0.2074	5753360	UploadStash->stashFile( )	.../UploadBase.php:1214
14	0.2074	5753416	MWFileProps->getPropsFromPath( )	.../UploadStash.php:217
15	0.2083	5754320	DjVuHandler->getMetadata( )	.../MWFileProps.php:87
16	0.2083	5754752	DjVuImage->retrieveMetaData( )	.../DjVuHandler.php:374

So all our text layers will be stale?

The text layer is generally generated on upload. It used to be refreshable by action=purge, but I don't think that is the case anymore. There is a maintenance script to refresh all files of a specific file type.

In any case… Provided I'm not just being a dummy again, I'm inclined to conclude that the original code was never functional (but did no harm), and thus my original proposed change did nothing either

Yeah, I don't think your change would change anything. It's just the difference between whether the PHP string is interpreting the escape sequence or the regex engine is, but either way the escape sequence gets turned into the raw character.

Maybe the safest approach is to match both the raw character and the escaped version, to cover all bases.
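To illustrate the point about where the escape gets interpreted (a minimal sketch, not from the thread):

$vt = "\x0B";  // a raw vertical-tab byte (\013)

// PHP expands the octal escape, so PCRE receives the raw byte inside the class:
var_dump( preg_match( "/[\013]/", $vt ) );  // int(1)

// With single quotes PHP passes the backslash through, and PCRE itself treats
// \013 inside a character class as octal:
var_dump( preg_match( '/[\013]/', $vt ) );  // int(1)

// Neither form matches the four-character literal text "\013":
var_dump( preg_match( "/[\013]/", '\013' ) );  // int(0)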

@Bawolff djvutxt is not called with -escape, so the control codes will not be emitted as ASCII octal escape sequences.

@kaldari / @Tpt I don't mean to hassle, but could this please be reviewed again at some point? The lack of paragraphs in the DjVu OCR is infuriating.

Change 652762 merged by jenkins-bot:
[mediawiki/core@master] Fix replacement of control chars in DJVU text output

https://gerrit.wikimedia.org/r/652762