
Stop ignoring paragraph and region separators in DjVu file OCR text layer
Open, Needs Triage, Public

Description

In line 277 of DjVuImage.php, the code…

$txt = preg_replace( "/[\013\035\037]/", "", $txt );

…removes various control characters from the OCR text layer produced by djvutxt, including \035 (ASCII 0x1D, "GS", group separator) and \037 (ASCII 0x1F, "US", unit separator). (\036 (ASCII 0x1E, "RS", record separator) is not used by DjVuLibre, as far as I can tell.)
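For anyone who wants to double-check which bytes those octal escapes stand for, here is a quick sketch. Nothing in it comes from DjVuImage.php; the $escapes array is made up purely for illustration.

```php
<?php
// Octal escapes in a double-quoted PHP string each denote a single byte.
// $escapes is a hypothetical name used only for this illustration.
$escapes = [
    'VT' => "\013", // vertical tab
    'GS' => "\035", // group separator
    'RS' => "\036", // record separator
    'US' => "\037", // unit separator
];

foreach ( $escapes as $name => $char ) {
    // Print each control character's name next to its byte value in hex.
    printf( "%s = 0x%02X\n", $name, ord( $char ) );
}
// Prints:
// VT = 0x0B
// GS = 0x1D
// RS = 0x1E
// US = 0x1F
```

Note that \013 comes out as 0x0B (vertical tab); a carriage return would be \015 (0x0D).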

These characters are the markers djvutxt uses to signal the presence of a paragraph break (or other OCR page area break), so removing them (ignoring them) leads to consecutive paragraphs or regions of text being smushed together instead of separated by a blank line.

The practical consequence of this is that proofreaders on the Wikisources have to manually identify all paragraph breaks in the source text by visually identifying them in the scanned page image, locating the equivalent point in the wikitext, and inserting an extra line break. Multiply this by 5–10 paragraphs per page, for several hundred pages per book, and even after just a few hundred books the amount of wasted manual effort for volunteers becomes relatively staggering (and English Wikisource alone currently hosts around a million proofread pages).

The code should therefore be replaced with something like…

$txt = preg_replace( "/[\013]/", "", $txt ); // Strip vertical tabs (\013 = VT)
$txt = preg_replace( "/[\035\037]+/", "\n", $txt ); // Replace runs of OCR region separators with a single extra line break

…to instead insert an extra newline at that position, producing the blank line that MediaWiki's wikitext syntax uses to indicate a paragraph break.
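To make the difference concrete, here is a minimal standalone sketch comparing the existing and the proposed replacements. The $sample string is a made-up stand-in and an assumption about how the separators are laid out, not captured djvutxt output; $current and $proposed are likewise just illustrative names.

```php
<?php
// Hypothetical sample, NOT real djvutxt output: two paragraphs, each ended
// by US (\037), with a GS (\035) closing the region after the second one.
$sample = "First paragraph,\nsecond line.\n\037Second paragraph.\n\037\035";

// Existing behaviour: the separators are deleted outright, so the two
// paragraphs end up on consecutive lines with no blank line between them.
$current = preg_replace( "/[\013\035\037]/", "", $sample );

// Proposed behaviour: drop \013, then turn each run of GS/US separators
// into a single newline, which yields a blank line after each paragraph.
$proposed = preg_replace( "/[\013]/", "", $sample );
$proposed = preg_replace( "/[\035\037]+/", "\n", $proposed );

var_dump( $current === "First paragraph,\nsecond line.\nSecond paragraph.\n" );
var_dump( $proposed === "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n" );
// Both lines print bool(true).
```

The `+` quantifier is what keeps the adjacent \037\035 pair at the end of the sample from turning into two extra newlines instead of one.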

And the good news is that, since retrieveMetaData(), as best I can tell, is called on-demand, this fix will retroactively apply to all not-previously-proofread pages of all DjVu files without needing a massive re-generate run as would be needed for image thumbnails or similar.

Event Timeline

Xover created this task. Aug 13 2019, 1:52 PM
Restricted Application added a subscriber: Aklapper. Aug 13 2019, 1:52 PM

@Xover: Thanks for taking a look at the code! You are very welcome to use developer access to submit the proposed code changes as a Git branch directly into Gerrit which makes it easier to review and provide feedback. If you don't want to set up Git/Gerrit, you can also use the Gerrit Patch Uploader. Thanks again!

Xover added a comment. Jul 3 2020, 4:44 PM

Sigh. Since GPU (the Gerrit Patch Uploader) is either broken, or the correct module to pick there isn't mediawiki/core…

Here's the pseudocode diff above converted to unified diff format:

--- DjVuImage.php	2020-07-03 18:27:30.000000000 +0200
+++ DjVuImage new.php	2020-07-03 18:31:36.000000000 +0200
@@ -277,7 +277,8 @@
 			$txt = wfShellExec( $cmd, $retval, [], [ 'memory' => self::DJVUTXT_MEMORY_LIMIT ] );
 			if ( $retval == 0 ) {
 				# Strip some control characters
-				$txt = preg_replace( "/[\013\035\037]/", "", $txt );
+				$txt = preg_replace( "/[\013]/", "", $txt ); // Strip vertical tabs (\013 = VT)
+				$txt = preg_replace( "/[\035\037]+/", "\n", $txt ); // Replace runs of OCR region separators with a single extra line break
 				$reg = <<<EOR
 					/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"
 					((?>    # Text to match is composed of atoms of either:

@Xover Gerrit change here: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/634781

I don't have an email for you, or I'd have set you as the author.

I also can't seem to provoke Jenkins into running on the changeset.

Thanks for the patch. For future reference please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines

Change 634781 had a related patch set uploaded (by Aklapper; owner: Inductiveload):
[mediawiki/core@master] Stop ignoring paragraph and region separators in DjVu file OCR text layer

https://gerrit.wikimedia.org/r/634781