OCR extracted from DjVu files is incorrectly assigned to pages
Open, Needs Triage · Public

Description

For some DjVu files, the OCR text layer extracted from the file is assigned to the wrong pages when it is shown in the Page namespace on Wikisource. A few examples:

  1. File:Ossendowski - Ázsiai titkok, ázsiai emberek.djvu
  2. File:Маркъ Чертванъ - Мирные завоеватели.djvu

The page numbers refer to DjVu page numbering (not the page numbers printed in the books). Standard DjVu software shows the OCR layer for every page in the right place, so the problem is in either MediaWiki's DjVu handling or ProofreadPage.
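To check a file by hand, the text layer of each DjVu page can be pulled out with DjVuLibre's djvutxt tool and compared with what the Page namespace displays. The following is only a rough sketch: the helper names are made up for illustration, it assumes djvutxt is installed and on PATH, and find_shift assumes the misassignment is a single constant offset, which this task does not state.

```python
import subprocess
from typing import List, Optional


def djvutxt_command(path: str, page: int) -> List[str]:
    """Build the djvutxt call for one page (1-based DjVu page number)."""
    return ["djvutxt", f"--page={page}", path]


def extract_page_text(path: str, page: int) -> str:
    """Run djvutxt and return the hidden text layer of a single page."""
    result = subprocess.run(
        djvutxt_command(path, page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def find_shift(
    reference: List[str], displayed: List[str], max_shift: int = 5
) -> Optional[int]:
    """Return s such that displayed[i] == reference[i + s] wherever both
    sides exist, trying small shifts first; None if no single shift fits.

    Hypothetical helper: it only detects a constant page offset.
    """
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        pairs = [
            (displayed[i], reference[i + shift])
            for i in range(len(displayed))
            if 0 <= i + shift < len(reference)
        ]
        if pairs and all(shown == ref for shown, ref in pairs):
            return shift
    return None
```

Here reference would be the list built from extract_page_text for pages 1..N, and displayed the text copied from the corresponding Page: pages. A result of None means the pages do not line up under any single offset in the tested range.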

Event Timeline

Restricted Application added a subscriber: Aklapper.

OK, this seems to be a problem with XML data validation in DjVuLibre itself.
So it cannot be fixed in MediaWiki unless other DjVu software is used...
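The comment does not say what exactly fails validation. Purely as an illustration, and assuming the XML export of the text layer contains characters that XML 1.0 does not allow (such as stray control characters from OCR), a consumer could strip them before parsing. The function names and the sample fragment below are hypothetical, and this is not how MediaWiki handles the data.

```python
import re
import xml.etree.ElementTree as ET

# Matches every character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(raw: str) -> str:
    """Drop characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", raw)


def parse_hidden_text(raw: str) -> ET.Element:
    """Parse a text-layer XML fragment after removing invalid characters."""
    return ET.fromstring(sanitize_xml_text(raw))
```

With a fragment such as "<LINE><WORD>abc\x0bdef</WORD></LINE>", calling ET.fromstring directly raises ParseError because of the vertical-tab character, while parse_hidden_text parses it. Dropping characters silently changes the text, so this is a workaround rather than a fix for the producer of the XML.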

Cherkash added a subscriber: Cherkash.

Shouldn’t this be reported upstream to the DjVuLibre developers? I’ve seen this done with other libraries: once it is fixed upstream, the fix naturally flows down to MediaWiki when it is upgraded to the latest patched version.

This is the same issue as T219376, which contains some possible approaches to fixing it.