
Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension
Open, Needs Triage, Public

Description

Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.

For this file: https://commons.wikimedia.org/wiki/File:Philosophical_Transactions_-_Volume_053.djvu
parsing of some pages fails when the text is loaded from the file into the metadata.

The metadata of a DjVu file contains the text layer of its pages.
See https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Philosophical_Transactions_-_Volume_053.djvu

The format of the metadata is as follows:

<?xml version=\"1.0\" ?>
<!DOCTYPE DjVuXML PUBLIC \"-//W3C//DTD DjVuXML 1.1//EN\" \"pubtext/DjVuXML-s.dtd\">
<mw-djvu>
  <DjVuXML>
    <HEAD></HEAD>
    <BODY><OBJECT height=\"1500\" width=\"1201\">
        <PARAM name=\"DPI\" value=\"300\" />  <-- One OBJECT per page: there are as many as there are pages
        <PARAM name=\"GAMMA\" value=\"2.2\" />
      </OBJECT>
      ...
      <OBJECT height=\"1500\" width=\"1026\">
        <PARAM name=\"DPI\" value=\"300\" />
        <PARAM name=\"GAMMA\" value=\"2.2\" />
      </OBJECT>
    </BODY>
  </DjVuXML>
  
  <DjVuTxt>
    <HEAD></HEAD>
    <BODY>
      <PAGE value=\"\" /> <-- One per page when OK, fewer if parsing fails.
      ...
      <PAGE value=\"\" />
    </BODY>
  </DjVuTxt>
</mw-djvu>

Evidence of the parsing failure is that, instead of a PAGE element, the API returns a bare "failed" line:

<PAGE value=\"[ 30 ] ...... &#10;\" />
failed                                                                   <--- ERROR! this is supposed to be page 51
<PAGE value=\"1 &#10;\u2666 &#10;\" />

So one page is lost and the text goes out of sync in the ProofreadPage extension.

This file has 7 such failures.
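
The mismatch described above can be checked by comparing the number of OBJECT elements with the number of PAGE elements in the metadata. The following is only a sketch: it assumes the `<mw-djvu>` XML has been saved to a local file, and `count_tag` and `check_page_sync` are hypothetical helper names, not part of MediaWiki or DjVuLibre.

```shell
# Count opening tags of a given name (e.g. OBJECT or PAGE) in XML read from stdin.
# Hypothetical helper for illustration only.
count_tag() {
    grep -o "<$1[ >]" | wc -l | tr -d ' '
}

# Report whether the text layer has one PAGE element per OBJECT element.
# $1: path to a file holding the <mw-djvu> metadata XML (assumed saved locally).
check_page_sync() {
    objects=$(count_tag OBJECT < "$1")
    pages=$(count_tag PAGE < "$1")
    if [ "$objects" -eq "$pages" ]; then
        echo "in sync: $pages pages"
    else
        echo "out of sync: $objects OBJECT elements, $pages PAGE elements"
    fi
}
```

For the file in this report, the difference between the two counts would be expected to equal the number of bare "failed" lines.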

I checked the XML of page 51 and I got no error regarding tag structure.
Maybe an encoding error?!

Event Timeline

It is djvutxt itself that fails for page 51:

$ djvutxt philosophicaltra5317roya.djvu --page=51 --detail=page
failed

And also djvused fails:

$ djvused philosophicaltra5317roya.djvu -e 'select 51; print-txt' 
*** [1-10100] Text layer hierarchy is corrupt"
*** (DjVuText.cpp:287)
*** 'void DJVU::DjVuTXT::Zone::decode(const DJVU::GP<DJVU::ByteStream>&, int, const DJVU::DjVuTXT::Zone*, const DJVU::DjVuTXT::Zone*)'
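
To find every affected page rather than only page 51, the same per-page djvutxt call can be repeated over the whole document. This is a sketch under assumptions: it relies on the `n` command of djvused printing the page count, and on a corrupt page producing the single line "failed" as in the session above. `list_failed_pages` is a hypothetical name.

```shell
# List the pages whose text layer djvutxt cannot decode, one page number per line.
# $1: path to a .djvu file. Assumes djvused and djvutxt (DjVuLibre) are on PATH.
list_failed_pages() {
    total=$(djvused "$1" -e 'n')   # 'n' is assumed to print the number of pages
    page=1
    while [ "$page" -le "$total" ]; do
        # A page with a corrupt text layer yields the single line "failed".
        if [ "$(djvutxt "$1" --page="$page" --detail=page 2>/dev/null)" = "failed" ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}

# Usage (not run here): list_failed_pages philosophicaltra5317roya.djvu
```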

While there is nothing to be done in MW to fetch the correct text, a better solution would be to at least keep the PAGE tag count correct.
That is, instead of just writing

failed

write something like this:

<PAGE value=\"failed\" />

That way the text would not go out of sync.

Anomie subscribed.

This does not seem to have anything to do with MediaWiki-Action-API; the API is just returning the data given to it.

The improvement should be made in DjVuImage.php, in function retrieveMetaData():
https://doc.wikimedia.org/mediawiki-core/master/php/DjVuImage_8php_source.html#l00246

# Text layer
         if ( isset( $wgDjvuTxt ) ) {
             $cmd = wfEscapeShellArg( $wgDjvuTxt ) . ' --detail=page ' . wfEscapeShellArg( $this->mFilename );
             wfDebug( __METHOD__ . ": $cmd\n" );
             $retval = '';
             $txt = wfShellExec( $cmd, $retval, [], [ 'memory' => self::DJVUTXT_MEMORY_LIMIT ] );
             if ( $retval == 0 ) {
                 # Strip some control characters
                 $txt = preg_replace( "/[\013\035\037]/", "", $txt );
                 $reg = <<<EOR
                     /\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"
                     ((?>    # Text to match is composed of atoms of either:
                         \\\\. # - any escaped character
                         |     # - any character different from " and \
                         [^"\\\\]+
                     )*?)
                     "\s*\)
                     | # Or page can be empty ; in this case, djvutxt dumps ()
                     \(\s*()\)/sx
 EOR;
                 $txt = preg_replace_callback( $reg, [ $this, 'pageTextCallback' ], $txt );
                 $txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";
                 $xml = preg_replace( "/<DjVuXML>/", "<mw-djvu><DjVuXML>", $xml, 1 );
                 $xml = $xml . $txt . '</mw-djvu>';
             }
  }

$retval is 0 even if there are errors.

So the error, which is not processed by the callback because it does not match the regex, is just kept in $txt as is.

This can be seen by running:

$ djvutxt --detail=page philosophicaltra5317roya.djvu | grep failed

At least it should be checked if "failed" matches at the beginning of a line (a bit hackish).

Unfortunately my PHP knowledge ends here. Help would be appreciated.
Thanks

Something like this.

...
$txt = preg_replace_callback( $reg, [ $this, 'pageTextCallback' ], $txt );
# Wrap each bare "failed" line left by djvutxt in a PAGE element so the page count stays correct
$reg_failed = '/(?m)^failed$/';
$txt = preg_replace_callback( $reg_failed, [ $this, 'pageTextCallbackFailed' ], $txt );
$txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";

...

function pageTextCallbackFailed( $matches ) {
	# $matches[0] is the literal "failed" line emitted by djvutxt
	$val = $matches[0];
	return '<PAGE value="' . $val . '" />';
}
...
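
The effect of the proposed replacement can be previewed on the command line without patching MediaWiki. The sketch below applies the same whole-line match with sed to text that already contains PAGE elements. `tag_failed_pages` is a hypothetical name, and the sample input in the usage comment is made up for illustration.

```shell
# Replace each line consisting solely of "failed" with a placeholder PAGE element,
# so that the number of PAGE elements matches the number of pages.
# Reads from stdin and writes to stdout. Hypothetical helper for illustration only.
tag_failed_pages() {
    sed 's|^failed$|<PAGE value="failed" />|'
}

# Usage (illustrative input):
#   printf '%s\n' '<PAGE value="[ 30 ]" />' 'failed' '<PAGE value="1" />' | tag_failed_pages
```

Because the pattern is anchored at both ends of the line, a line that merely contains the word "failed" is left untouched.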