Maniphest T204020

OCR extracted from DjVu files is incorrectly assigned to pages
Closed, Duplicate · Public

Assigned To

None

Authored By

	Ankry
	Sep 11 2018, 6:31 AM

Description

For some DjVu files, the embedded OCR layer is assigned to the wrong pages after extraction when it is displayed in the Page namespace on Wikisource wikis. A few examples:

File:Ossendowski - Ázsiai titkok, ázsiai emberek.djvu
- OCR for pages 1-13 is placed correctly
- page 14 (https://hu.wikisource.org/w/index.php?title=Oldal:Ossendowski_-_Ázsiai_titkok,_ázsiai_emberek.djvu/14&action=edit&redlink=1) shows OCR layer from page 16
- page 15 shows OCR layer from page 17
- for all subsequent pages the OCR is shifted by a few pages; the shift increases gradually, reaching 22 pages at the end of the book: page 250 shows the OCR layer from page 272.
File:Маркъ Чертванъ - Мирные завоеватели.djvu
- the shift starts at page 9 (pages 1-8 show correct OCR), reaching 4 pages at the end of the book; e.g. page 70 (https://hu.wikisource.org/w/index.php?title=Oldal:%D0%9C%D0%B0%D1%80%D0%BA%D1%8A_%D0%A7%D0%B5%D1%80%D1%82%D0%B2%D0%B0%D0%BD%D1%8A_-_%D0%9C%D0%B8%D1%80%D0%BD%D1%8B%D0%B5_%D0%B7%D0%B0%D0%B2%D0%BE%D0%B5%D0%B2%D0%B0%D1%82%D0%B5%D0%BB%D0%B8.djvu/70&action=edit&redlink=1) shows the OCR layer from page 74

Page numbers refer to DjVu page numbering (not the page numbers printed in the books). Standard DjVu software shows the OCR layer for all pages in the right place, so the problem is either in MediaWiki's DjVu handling or in ProofreadPage.
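
To quantify the shift for a given file, the per-page text from a standard DjVu tool can be compared with what the Page namespace shows. A minimal sketch in Python, assuming both are already available as plain lists of strings indexed by DjVu page; the function name `measure_shift` and the sample data are hypothetical, and every page is assumed to have distinct text:

```python
def measure_shift(reference, displayed):
    """Return {page_number: shift} for pages whose displayed text comes from another page.

    reference[i] is the correct text of DjVu page i + 1;
    displayed[i] is the text shown for page i + 1 in the wiki.
    """
    # Map each reference text to its 1-based page number (assumes texts are unique).
    page_of_text = {text: i + 1 for i, text in enumerate(reference)}
    shifts = {}
    for i, text in enumerate(displayed):
        source_page = page_of_text.get(text)
        if source_page is not None and source_page != i + 1:
            shifts[i + 1] = source_page - (i + 1)
    return shifts


# Hypothetical 20-page file in which the wiki skips the text of pages 14 and 15.
reference = [f"text of page {n}" for n in range(1, 21)]
displayed = reference[:13] + reference[15:]
print(measure_shift(reference, displayed))
```

A shift that first appears at some page and then grows in steps further into the book, as in both examples, would show up here as a dictionary whose values increase with the page number.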

Related Objects

Mentioned In: T237848: "success is not a function" JS exception on certain DjVu files
Mentioned Here: T219376: retrieveMetaData() in DjVuImage.php creates knock-on error when a page has invalid text layer
T194861: Text is offset by one page

Event Timeline

Ankry created this task. Sep 11 2018, 6:31 AM

Restricted Application added a project: Multimedia. Sep 11 2018, 6:31 AM

Restricted Application added a subscriber: Aklapper.

Ankry added a project: All-and-every-Wikisource. Sep 11 2018, 6:31 AM

Ankry updated the task description.

Ankry updated the task description. Sep 11 2018, 6:37 AM

T194861 is possibly related

OK, this seems to be a problem with XML data validation in DjVuLibre itself.
So it cannot be fixed in MediaWiki unless other DjVu software is used...
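
A minimal sketch of one way an invalid text layer could produce the reported pattern, assuming (this is an illustration, not the actual DjVuLibre or MediaWiki code) that per-page text is collected into a flat list and that pages failing validation are dropped rather than kept as empty entries. All names and data below are hypothetical:

```python
def collect_dropping_invalid(layers):
    """Flatten per-page text, silently skipping pages with an invalid layer (None)."""
    return [text for text in layers if text is not None]


def collect_with_placeholders(layers):
    """Flatten per-page text, keeping an empty string for each invalid page."""
    return ["" if text is None else text for text in layers]


# Hypothetical 20-page file whose text layers for pages 14 and 15 fail validation.
layers = [None if n in (14, 15) else f"text of page {n}" for n in range(1, 21)]

dropped = collect_dropping_invalid(layers)
kept = collect_with_placeholders(layers)

print(dropped[13])  # text shown for page 14 when invalid pages are dropped
print(kept[15])     # text shown for page 16 when placeholders are kept
```

Under this assumption, every dropped page moves the text of all later pages one position earlier, which would match a shift that starts partway through a book and grows toward the end. Keeping a placeholder per page preserves the alignment.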

Ankry closed this task as Invalid. Sep 25 2018, 7:29 AM

Shouldn’t this be reported upstream to the developers of DjVuLibre? I’ve seen it done with other libraries: once it’s fixed upstream, the fix naturally flows down to MediaWiki when it gets upgraded to the latest patched version.

Tpt mentioned this in T237848: "success is not a function" JS exception on certain DjVu files. Dec 12 2019, 8:39 AM

This is the same issue as T219376 which contains some possible approaches to fix this.

Xover closed this task as a duplicate of T219376: retrieveMetaData() in DjVuImage.php creates knock-on error when a page has invalid text layer. Jan 2 2023, 3:30 PM
