Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension
Open, Needs Triage, Public

Description

Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.

For this file: https://commons.wikimedia.org/wiki/File:Philosophical_Transactions_-_Volume_053.djvu
parsing fails for some pages when the text is loaded from the file into the metadata.

The metadata of djvu files contains the text layer of the pages.
See https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Philosophical_Transactions_-_Volume_053.djvu

The format of the metadata is as follows:

<?xml version=\"1.0\" ?>
<!DOCTYPE DjVuXML PUBLIC \"-//W3C//DTD DjVuXML 1.1//EN\" \"pubtext/DjVuXML-s.dtd\">
<mw-djvu>
  <DjVuXML>
    <HEAD></HEAD>
    <BODY><OBJECT height=\"1500\" width=\"1201\">
        <PARAM name=\"DPI\" value=\"300\" />  <-- One OBJECT per page; there are as many OBJECTs as pages
        <PARAM name=\"GAMMA\" value=\"2.2\" />
      </OBJECT>
      ...
      <OBJECT height=\"1500\" width=\"1026\">
        <PARAM name=\"DPI\" value=\"300\" />
        <PARAM name=\"GAMMA\" value=\"2.2\" />
      </OBJECT>
    </BODY>
  </DjVuXML>
  
  <DjVuTxt>
    <HEAD></HEAD>
    <BODY>
      <PAGE value=\"\" /> <-- One per page when OK, fewer if parsing fails.
      ...
      <PAGE value=\"\" />
    </BODY>
  </DjVuTxt>
</mw-djvu>

Evidence of the parsing failure is that, instead of a page, the API returns this:

<PAGE value=\"[ 30 ] ...... &#10;\" />
failed                                                                   <--- ERROR! this is supposed to be page 51
<PAGE value=\"1 &#10;\u2666 &#10;\" />

So one page is lost and the text goes out of sync in the ProofreadPage extension.

This file contains 7 such failures.
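One way to spot an affected file is to compare the number of OBJECT elements (one per page image) with the number of PAGE elements (one per text page). The following is only a minimal sketch: `countDjvuPages()` is a hypothetical helper, and the sample XML is a shortened, hand-written imitation of the metadata format shown above, not real API output.

```php
<?php
// Hypothetical helper: count page images (<OBJECT>) and text pages (<PAGE>)
// in an <mw-djvu> metadata document.
function countDjvuPages( string $metadataXml ): array {
    $root = new SimpleXMLElement( $metadataXml );
    return [
        'objects' => count( $root->xpath( '/mw-djvu/DjVuXML/BODY/OBJECT' ) ),
        'pages' => count( $root->xpath( '/mw-djvu/DjVuTxt/BODY/PAGE' ) ),
    ];
}

// Hand-written sample: three page images, but only two text pages,
// with a bare "failed" line where the third PAGE element should be.
$sample = <<<'XML'
<mw-djvu>
  <DjVuXML><HEAD></HEAD><BODY>
    <OBJECT height="1500" width="1201"/>
    <OBJECT height="1500" width="1201"/>
    <OBJECT height="1500" width="1026"/>
  </BODY></DjVuXML>
  <DjVuTxt><HEAD></HEAD><BODY>
    <PAGE value="[ 30 ]" />
failed
    <PAGE value="1" />
  </BODY></DjVuTxt>
</mw-djvu>
XML;

$counts = countDjvuPages( $sample );
if ( $counts['objects'] !== $counts['pages'] ) {
    echo "Text layer out of sync: {$counts['objects']} images, {$counts['pages']} text pages\n";
}
```

Because the stray "failed" line is plain text inside BODY, the document is still well-formed XML, which would be consistent with no tag-structure error being reported.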

I checked the XML of page 51 and I got no error regarding tag structure.
Maybe an encoding error?!

Event Timeline

It is djvutxt itself that fails for page 51:

$ djvutxt philosophicaltra5317roya.djvu --page=51 --detail=page
failed

And also djvused fails:

$ djvused philosophicaltra5317roya.djvu -e 'select 51; print-txt' 
*** [1-10100] Text layer hierarchy is corrupt"
*** (DjVuText.cpp:287)
*** 'void DJVU::DjVuTXT::Zone::decode(const DJVU::GP<DJVU::ByteStream>&, int, const DJVU::DjVuTXT::Zone*, const DJVU::DjVuTXT::Zone*)'

While nothing can be done in MW to fetch the correct text, a better solution would be to at least keep the page numbering correct.
That is, instead of just writing

failed

something like this:

<PAGE value=\"failed\" />

So text would not go out of sync.

Anomie subscribed.

This does not seem to have anything to do with MediaWiki-Action-API; the API is just returning the data given to it.

The improvement is to be done in DjVuImage.php: function retrieveMetaData()
https://doc.wikimedia.org/mediawiki-core/master/php/DjVuImage_8php_source.html#l00246

# Text layer
         if ( isset( $wgDjvuTxt ) ) {
             $cmd = wfEscapeShellArg( $wgDjvuTxt ) . ' --detail=page ' . wfEscapeShellArg( $this->mFilename );
             wfDebug( __METHOD__ . ": $cmd\n" );
             $retval = '';
             $txt = wfShellExec( $cmd, $retval, [], [ 'memory' => self::DJVUTXT_MEMORY_LIMIT ] );
             if ( $retval == 0 ) {
                 # Strip some control characters
                 $txt = preg_replace( "/[\013\035\037]/", "", $txt );
                 $reg = <<<EOR
                     /\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"
                     ((?>    # Text to match is composed of atoms of either:
                         \\\\. # - any escaped character
                         |     # - any character different from " and \
                         [^"\\\\]+
                     )*?)
                     "\s*\)
                     | # Or page can be empty ; in this case, djvutxt dumps ()
                     \(\s*()\)/sx
 EOR;
                 $txt = preg_replace_callback( $reg, [ $this, 'pageTextCallback' ], $txt );
                 $txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";
                 $xml = preg_replace( "/<DjVuXML>/", "<mw-djvu><DjVuXML>", $xml, 1 );
                 $xml = $xml . $txt . '</mw-djvu>';
             }
         }

$retval is 0 even if there are errors.

So the "failed" line, which the callback never processes because it does not match the regex, is simply kept in $txt as is.

This can be seen by running:

$ djvutxt --detail=page philosophicaltra5317roya.djvu | grep failed

At the very least, the code should check whether "failed" matches at the beginning of a line (a bit hackish).
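The behaviour described above can be reproduced in isolation. This is only a sketch under assumptions: the sample djvutxt output is hand-written for illustration, the closure is a simplified stand-in for pageTextCallback(), and the pattern is the first alternative of the regex in retrieveMetaData() written on one line.

```php
<?php
// Page pattern from retrieveMetaData(), written on one line
// (the empty-page alternative is omitted for brevity).
$reg = '/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"((?>\\\\.|[^"\\\\]+)*?)"\s*\)/s';

// Illustrative djvutxt --detail=page output; the middle page failed to decode.
$txt = <<<'TXT'
(page 0 0 1201 1500 "[ 30 ] \n")
failed
(page 0 0 1026 1500 "1 \n")
TXT;

// Simplified stand-in for pageTextCallback(): wrap the captured text in a PAGE element.
$out = preg_replace_callback(
    $reg,
    static function ( array $m ): string {
        return '<PAGE value="' . htmlspecialchars( $m[1] ) . '" />';
    },
    $txt
);

// Lines that are exactly "failed" did not match the pattern and remain in $out.
$failures = preg_match_all( '/^failed$/m', $out );
echo $out, "\n";
echo "Unconverted failure lines: $failures\n";
```

preg_replace_callback() only rewrites the parts of the subject that match, so anything djvutxt prints outside a (page ...) expression passes through unchanged.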

Unfortunately my PHP knowledge ends here. Help would be appreciated.
Thanks

Something like this.

...
$txt = preg_replace_callback( $reg, [ $this, 'pageTextCallback' ], $txt );
$reg_failed = '/(?m)^failed$/';
$txt = preg_replace_callback( $reg_failed, [ $this, 'pageTextCallbackFailed' ], $txt );
$txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";

...

function pageTextCallbackFailed( $matches ) {
	// Wrap the bare "failed" line in a PAGE element so the page count stays correct.
	$val = $matches[0];
	return '<PAGE value="' . $val . '" />';
}
...
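To check that this second pass keeps the page count intact, the idea can be tried outside MediaWiki. This is a rough sketch only: the input string is a hand-written example of what $txt might look like after the first preg_replace_callback() pass, and the free function below is a stand-in for the proposed method, with htmlspecialchars() added as a precaution.

```php
<?php
// Stand-in for the proposed pageTextCallbackFailed() method.
function pageTextCallbackFailed( array $matches ): string {
    return '<PAGE value="' . htmlspecialchars( $matches[0] ) . '" />';
}

// Hand-written example of $txt after the first pass: two converted pages
// with a bare "failed" line between them.
$txt = "<PAGE value=\"[ 30 ]\" />\nfailed\n<PAGE value=\"1\" />\n";

// Second pass: turn each line that is exactly "failed" into a PAGE element.
$txt = preg_replace_callback( '/^failed$/m', 'pageTextCallbackFailed', $txt );

echo $txt;
echo substr_count( $txt, '<PAGE ' ), " pages\n";
```

Anchoring the pattern with ^ and $ in multiline mode limits the replacement to lines consisting solely of "failed"; page text that has already been wrapped in a PAGE element starts with "<PAGE" and is therefore left alone.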

Any chance for this to be solved eventually? It's a very common problem, and it looks like it'd be rather simple to solve.

If you find it so simple to solve please do solve it. Thank you.

I'm sorry if I sounded dismissive; that was not what I meant. I was asking whether someone planned to do this. I've tested at the other end: replacing (in the file) the invalid page text with an empty page, in the way outlined above by Mpaa, does prevent the OCR offset. So if we replace it on the ProofreadPage side, it should also work (see also T219376). I know very little PHP myself, so I'm afraid that if I try to write it up I will break something.

Then I cannot say I understand your post. I have little doubt this issue will eventually be solved. As far as I know there is no particular priority (its current priority is "Needs Triage"), much less a plan. You did not even mention the word "plan" in your original post. Asking about plans is typically of little value: if there were a plan, someone would have posted about it. If there is no such post, then there is likely no such plan.