Error while parsing the DjVu text layer into metadata leads to a page offset in the ProofreadPage extension.
For this file: https://commons.wikimedia.org/wiki/File:Philosophical_Transactions_-_Volume_053.djvu
parsing of some pages fails when the text is loaded from the file into the metadata.
The metadata of a DjVu file contains the text layer of its pages.
See https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Philosophical_Transactions_-_Volume_053.djvu
The format of the metadata is as follows:
<?xml version="1.0" ?>
<!DOCTYPE DjVuXML PUBLIC "-//W3C//DTD DjVuXML 1.1//EN" "pubtext/DjVuXML-s.dtd">
<mw-djvu>
  <DjVuXML>
    <HEAD></HEAD>
    <BODY>
      <OBJECT height="1500" width="1201">    <-- One per page; there are actually as many as there are pages
        <PARAM name="DPI" value="300" />
        <PARAM name="GAMMA" value="2.2" />
      </OBJECT>
      ...
      <OBJECT height="1500" width="1026">
        <PARAM name="DPI" value="300" />
        <PARAM name="GAMMA" value="2.2" />
      </OBJECT>
    </BODY>
  </DjVuXML>
  <DjVuTxt>
    <HEAD></HEAD>
    <BODY>
      <PAGE value="" />    <-- One per page when OK, fewer if parsing fails
      ...
      <PAGE value="" />
    </BODY>
  </DjVuTxt>
</mw-djvu>
Evidence of the parsing failure is that, instead of a page, the API returns this:
<PAGE value=\"[ 30 ] ...... \" />
failed    <--- ERROR! This is supposed to be page 51
<PAGE value=\"1 \u2666 \" />
So one page is lost and the text goes out of sync in the ProofreadPage extension.
In this file there are 7 such failures.
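A rough way to locate such failures is to compare the number of OBJECT elements with the number of PAGE elements and to look for the literal text "failed" between PAGE elements. The following is only a sketch, not how MediaWiki or ProofreadPage does it. It assumes the metadata has already been extracted from the API response as a well-formed XML string, and that each "failed" marker stands for exactly one lost page. The function name find_failed_pages and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_failed_pages(metadata_xml):
    """Return (image_count, text_count, failed_positions) for an <mw-djvu> document.

    failed_positions holds the 1-based page numbers whose text is missing,
    assuming every "failed" marker replaces exactly one <PAGE> element.
    """
    root = ET.fromstring(metadata_xml)

    # One <OBJECT> per page image.
    image_count = len(root.findall("./DjVuXML/BODY/OBJECT"))

    body = root.find("./DjVuTxt/BODY")
    if body is None:
        return image_count, 0, []

    text_count = 0
    failed_positions = []
    position = 0  # running 1-based page number

    # Markers before the first <PAGE> show up as the text of <BODY>.
    for _ in range((body.text or "").split().count("failed")):
        position += 1
        failed_positions.append(position)

    for page in body.findall("PAGE"):
        position += 1
        text_count += 1
        # Markers after a <PAGE> show up as that element's tail text.
        for _ in range((page.tail or "").split().count("failed")):
            position += 1
            failed_positions.append(position)

    return image_count, text_count, failed_positions


# Hypothetical three-page document whose second page failed to parse.
sample = (
    "<mw-djvu><DjVuXML><HEAD></HEAD><BODY>"
    '<OBJECT height="1500" width="1201"></OBJECT>'
    '<OBJECT height="1500" width="1201"></OBJECT>'
    '<OBJECT height="1500" width="1026"></OBJECT>'
    "</BODY></DjVuXML><DjVuTxt><HEAD></HEAD><BODY>"
    '<PAGE value="first" /> failed <PAGE value="third" />'
    "</BODY></DjVuTxt></mw-djvu>"
)

print(find_failed_pages(sample))  # (3, 2, [2])
```

If image_count and text_count differ, every page after the first reported position will show the wrong text in ProofreadPage.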
I checked the XML of page 51 and found no errors in the tag structure.
Maybe an encoding error?!