**Author:** `vladjohn2013`
**Description:**
Merge proofread text back into DjVu files
Wikisource, the free library, has an enormous collection of DjVu files and proofread texts based on those scans. However, while the DjVu files contain a text layer, this text is the original computer-generated (OCR) text and not the volunteer-proofread text. There is some previous work on merging the proofread text as a blob into pages, and also on finding similar words to use as anchors for text re-mapping. The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.
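The anchor-based re-mapping idea can be illustrated with a minimal sketch. It assumes the OCR output is already available as a list of `(word, bbox)` pairs and the proofread page as a plain list of words; the function names `remap` and `union` are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever matching strategy a real tool would use. Words that match exactly act as anchors; corrected words between anchors inherit the coordinates of the OCR words they replace.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def union(boxes: List[BBox]) -> BBox:
    """Smallest box that covers all the given boxes."""
    x0, y0, x1, y1 = zip(*boxes)
    return (min(x0), min(y0), max(x1), max(y1))


def remap(ocr: List[Tuple[str, BBox]],
          proofread: List[str]) -> List[Tuple[str, Optional[BBox]]]:
    """Attach OCR coordinates to proofread words (illustrative only)."""
    ocr_words = [word for word, _ in ocr]
    matcher = SequenceMatcher(a=ocr_words, b=proofread, autojunk=False)
    result: List[Tuple[str, Optional[BBox]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # OCR noise with no proofread counterpart: drop it.
            continue
        if tag == "insert":
            # Proofread words with no OCR counterpart: position unknown.
            result.extend((word, None) for word in proofread[j1:j2])
        elif i2 - i1 == j2 - j1:
            # Anchors ("equal") or one-to-one corrections: keep each box.
            for k in range(j2 - j1):
                result.append((proofread[j1 + k], ocr[i1 + k][1]))
        else:
            # Words were split or merged: share the box of the whole span.
            box = union([bbox for _, bbox in ocr[i1:i2]])
            result.extend((word, box) for word in proofread[j1:j2])
    return result


ocr_page = [("Tbe", (0, 0, 10, 5)), ("quick", (12, 0, 30, 5)),
            ("brovvn", (32, 0, 50, 5)), ("fox", (52, 0, 60, 5))]
fixed = remap(ocr_page, ["The", "quick", "brown", "fox"])
```

Here "quick" and "fox" serve as anchors, while "The" and "brown" take over the boxes of the misrecognised "Tbe" and "brovvn". A real tool would probably need fuzzier matching and per-line handling, which this sketch leaves out.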
[w:DJVU](https://en.wikipedia.org/wiki/DJVU) files include a text layer. Typically a DjVu file begins with a text layer that consists of [w:OCR](https://en.wikipedia.org/wiki/OCR) text, which Wikisource uses as the initial version of the transcription. Wikisource contributors then 'fix' the OCR errors and save the corrections onto the Wikisource project as wikitext, and eventually the transcription is accurate and complete. A tool is needed to create a new DjVu file with the accurate and complete Wikisource transcription.
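As a rough illustration of what writing a corrected text layer could involve, the sketch below serialises positioned words into the parenthesised hidden-text syntax that the DjVuLibre tool `djvused` reads. Several things here are assumptions rather than part of the proposal: the helper names `escape` and `page_sexpr` are hypothetical, word zones are placed directly under the page zone without line or paragraph zones, the input boxes are taken to use a top-left origin (as in hOCR), and DjVu is taken to measure from the bottom-left corner, so the y values are flipped.

```python
from typing import List, Tuple

# (text, (xmin, ymin, xmax, ymax)) with the origin at the top-left corner
Word = Tuple[str, Tuple[int, int, int, int]]


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a quoted string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def page_sexpr(words: List[Word], page_w: int, page_h: int) -> str:
    """Build a one-page text layer expression (illustrative only)."""
    lines = [f"(page 0 0 {page_w} {page_h}"]
    for text, (x0, y0, x1, y1) in words:
        # Flip the vertical axis: the assumed DjVu origin is bottom-left.
        lines.append(
            f' (word {x0} {page_h - y1} {x1} {page_h - y0} "{escape(text)}")'
        )
    return "\n".join(lines) + ")"


layer = page_sexpr([("The", (10, 20, 50, 40))], page_w=100, page_h=200)
```

The resulting text could then, under the same assumptions, be saved to a file and applied to a page with something like `djvused book.djvu -e 'select 1; set-txt page1.txt' -s`; the exact invocation should be checked against the `djvused` manual.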
There are existing tools being worked on that extract the accurate and complete Wikisource transcription, typically exporting it as EPUB. However, they likely discard a lot of useful information that is needed to recreate a DjVu file, most importantly the (x,y) positions of each piece of text. They may also discard the page numbers.
Aubrey can be a mentor, providing assistance regarding Wikisource and some past history of this issue (not much, but glad to help if needed).
@Rtdwivedi is willing to be a mentor.
Tools exist which work with the [w:hOCR data](https://en.wikipedia.org/wiki/hOCR), for instance [hOCR.js](https://en.wikisource.org/wiki/it:MediaWiki:Gadget-hOCR.js) by @Alex_brollo (the gadget author who worked most with the DjVu layers), and #pywikibot-core's [djvutext.py](https://www.mediawiki.org/wiki/Manual:Pywikibot/djvutext.py).
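To show the kind of information hOCR data carries, here is a minimal sketch that pulls word text, bounding boxes and confidence values out of an hOCR fragment using only the Python standard library. It assumes the layout Tesseract typically emits, where each word is a `span` of class `ocrx_word` whose `title` attribute holds `bbox x0 y0 x1 y1; x_wconf N`. The class name `HocrWords` and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox, confidence) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._title = None  # title of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._title = attr.get("title") or ""
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            box = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", self._title)
            conf = re.search(r"x_wconf (\d+(?:\.\d+)?)", self._title)
            self.words.append((
                "".join(self._text).strip(),
                tuple(int(n) for n in box.groups()) if box else None,
                float(conf.group(1)) if conf else None,
            ))
            self._title = None


sample = (
    "<span class='ocr_line' title='bbox 36 92 150 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 100 92 150 116; x_wconf 61'>wor1d</span>"
    "</span>"
)
parser = HocrWords()
parser.feed(sample)
words = parser.words
```

Low-confidence words such as the second one are natural candidates for correction from the proofread text, while high-confidence words are more likely to be usable as anchors.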
Project proposed by Micru. I have found an external mentor who could give a hand with Tesseract; now I'm looking for a mentor who would provide assistance with MediaWiki.
URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
Skills: Good knowledge of the DjVu file format is desirable, as is knowledge of EPUB.
Mentors: @jayvdb, @aubrey
--------------------------
**Version**: unspecified
**Severity**: enhancement