Change Details

**Author:** `vladjohn2013`

**Description:**
Merge proofread text back into DjVu files

Wikisource, the free library, has an enormous collection of DjVu files and proofread texts based on those scans. However, while the DjVu files contain a text layer, this text is the original computer-generated (OCR) text and not the volunteer-proofread text. There is some previous work on merging the proofread text as a blob into pages, and also on finding similar words to be used as anchors for text re-mapping.

The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.

Project proposed by Micru. I have found an external mentor who could give a hand with Tesseract; now I'm looking for a mentor who would provide assistance with MediaWiki.

Aubrey can be a mentor providing assistance regarding Wikisource and some past history of this issue. Not much, but glad to help if needed.

@Rtdwivedi is willing to be a mentor.

URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files

Mentor: @jayvdb

--------------------------

**Version**: unspecified
**Severity**: enhancement
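
As a rough illustration of the anchor-based re-mapping mentioned in the description, the sketch below aligns OCR words with proofread words and carries the OCR word coordinates over to the proofread text. It is not taken from any existing tool: the function name `remap_words`, the `(xmin, ymin, xmax, ymax)` box layout, and the assumption that word boxes have already been extracted from Tesseract output are all hypothetical, and the alignment uses Python's standard `difflib` rather than whatever the previous work used.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical box layout: (xmin, ymin, xmax, ymax) in page pixels.
Box = Tuple[int, int, int, int]


def remap_words(
    ocr_words: List[Tuple[str, Box]],
    proofread_words: List[str],
) -> List[Tuple[str, Optional[Box]]]:
    """Attach OCR word boxes to proofread words.

    Words that are identical in both sequences act as anchors. Runs of
    differing words between two anchors are paired one-to-one when both
    sides have the same length (assumed to be simple OCR corrections).
    Any other proofread word gets no box (None).
    """
    ocr_texts = [text.lower() for text, _ in ocr_words]
    proof_texts = [text.lower() for text in proofread_words]
    matcher = SequenceMatcher(None, ocr_texts, proof_texts, autojunk=False)

    result: List[Tuple[str, Optional[Box]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Anchors, or equally long runs between anchors: keep coordinates.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((proofread_words[j], ocr_words[i][1]))
        elif tag in ("replace", "insert"):
            # No reliable pairing: keep the proofread word without a box.
            for j in range(j1, j2):
                result.append((proofread_words[j], None))
        # tag == "delete": OCR words with no proofread counterpart are dropped.
    return result


if __name__ == "__main__":
    ocr = [
        ("Tbe", (0, 0, 10, 10)),
        ("quick", (12, 0, 30, 10)),
        ("brovvn", (32, 0, 50, 10)),
        ("fox", (52, 0, 60, 10)),
    ]
    for word, box in remap_words(ocr, ["The", "quick", "brown", "fox"]):
        print(word, box)
```

Words left with `None` would still need a position before being written into a DjVu text layer, for example by interpolating between the neighbouring anchors; that step is left out here.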