
OCR extracted from DjVu files is incorrectly assigned to pages
Open, Needs Triage, Public


For some DjVu files, the internal OCR layer is assigned to the wrong pages after extraction when it is displayed in the Page namespace on Wikisource projects. A few examples:

  1. File:Ossendowski - Ázsiai titkok, ázsiai emberek.djvu
  2. File:Маркъ Чертванъ - Мирные завоеватели.djvu

Page numbers refer to DjVu page numbering (not the page numbers printed in the books). Standard DjVu software shows the OCR layer for all pages in the right place, so the problem is either in MediaWiki's DjVu handling or in ProofreadPage.
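One way to check the claim about standard DjVu software is to extract the text layer page by page with DjVuLibre's `djvutxt` tool and compare it with what the wiki shows for the same DjVu page number. The following is only a rough sketch, not how MediaWiki or ProofreadPage actually does it: the helper names (`djvutxt_command`, `page_text`, `first_mismatch`) are made up for illustration, and it assumes `djvutxt` is on the PATH and accepts the `--page=` option in this form.

```python
import subprocess
from typing import List, Optional


def djvutxt_command(djvu_path: str, page: int) -> List[str]:
    """Build a djvutxt call for one page (1-based DjVu page number).

    Assumption: djvutxt accepts --page=N to restrict output to one page.
    """
    if page < 1:
        raise ValueError("DjVu page numbers start at 1")
    return ["djvutxt", f"--page={page}", djvu_path]


def page_text(djvu_path: str, page: int) -> str:
    """Run djvutxt and return the hidden text of a single page."""
    result = subprocess.run(
        djvutxt_command(djvu_path, page), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def first_mismatch(expected: List[str], actual: List[str]) -> Optional[int]:
    """Return the 1-based number of the first page whose text differs.

    `expected` holds the per-page text from djvutxt, `actual` the text
    shown for the same pages in the wiki (collected by hand or otherwise).
    Returns None when every page matches.
    """
    for number, (want, got) in enumerate(zip(expected, actual), start=1):
        if want.strip() != got.strip():
            return number
    if len(expected) != len(actual):
        # One list ran out early: the first unpaired page is the mismatch.
        return min(len(expected), len(actual)) + 1
    return None
```

If the text for page N in the wiki matches `page_text(path, N + 1)` or `page_text(path, N - 1)` from some page onward, that would point to an offset introduced during extraction rather than to a defect in the file's own text layer.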

Event Timeline

Ankry created this task. Sep 11 2018, 6:31 AM
Restricted Application added a project: Multimedia. Sep 11 2018, 6:31 AM
Restricted Application added a subscriber: Aklapper.
Ankry updated the task description.
Ankry updated the task description. Sep 11 2018, 6:37 AM
Ankry added a comment. Sep 11 2018, 6:39 AM

T194861 is possibly related

Ankry added a comment. Sep 25 2018, 7:29 AM

OK, this seems to be a problem with XML data validation in DjVuLibre itself.
So it cannot be fixed in MediaWiki unless other DjVu software is used...

Ankry closed this task as Invalid. Sep 25 2018, 7:29 AM
Cherkash reopened this task as Open. Mar 11 2019, 3:53 PM
Cherkash added a subscriber: Cherkash.

Shouldn’t this be reported upstream to the developers of DjVuLibre? I’ve seen this done with other libraries: once it’s fixed upstream, the fix naturally flows down to MediaWiki when it is upgraded to the latest patched version.

Xover added a subscriber: Xover. Dec 12 2019, 11:10 AM

This is the same issue as T219376, which contains some possible approaches to fixing it.