Merge proofread text back into Djvu files
Open, Low, Public

Assigned To

None

Authored By

	• bzimport
	Dec 1 2013, 4:09 PM

Description

Author: vladjohn2013

Wikisource, the free library, has an enormous collection of DjVu files and proofread texts based on those scans.

DjVu files include a text layer. Typically a DjVu file starts out with a text layer consisting of OCR text, which Wikisource uses as the initial version of the transcription. Wikisource contributors then fix the OCR errors and save the corrections on the Wikisource project as wikitext, until the transcription is accurate and complete. A tool is needed to create a new DjVu file that contains the accurate and complete Wikisource transcription.

There are existing tools being worked on that extract the accurate and complete Wikisource transcription, typically exporting it as EPUB. However, they likely discard a lot of information that is needed to recreate a DjVu file, most importantly the (x,y) positions of each piece of text. They may also discard the page numbers.

There is some previous work on merging the proofread text as a blob into pages, and also on finding similar words to be used as anchors for text re-mapping. Tools exist which work with hOCR data, for instance hOCR.js by @Alex_brollo (the gadget author who has worked most with the DjVu layers) and Pywikibot's djvutext.py.

The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.
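
As a rough illustration of the "word positions and confidence levels" step, here is a minimal Python sketch that reads word boxes from hOCR markup and prints them as word zones in the s-expression style used by djvused for hidden text. The sample markup, the page height and the function names are made up for the example. It assumes Tesseract-style `ocrx_word` spans with `bbox` and `x_wconf` in the `title` attribute, and it assumes hOCR measures y from the top of the page while the DjVu text layer measures it from the bottom. Escaping covers only backslashes and double quotes, so treat it as a starting point rather than a finished exporter.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")

# Hypothetical hOCR fragment, shaped like Tesseract output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 400 200'>
 <span class='ocr_line' title='bbox 36 92 210 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 61'>"world"</span>
 </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collects text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            title = attrs.get("title") or ""
            box = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            if box:
                self._current = {
                    "bbox": tuple(int(v) for v in box.groups()),
                    "conf": float(conf.group(1)) if conf else None,
                    "text": "",
                }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


def to_djvused_word(word, page_height):
    """Format one word as '(word xmin ymin xmax ymax "text")', flipping the y axis."""
    x0, y0, x1, y1 = word["bbox"]
    text = word["text"].replace("\\", "\\\\").replace('"', '\\"')
    return f'(word {x0} {page_height - y1} {x1} {page_height - y0} "{text}")'


if __name__ == "__main__":
    for w in parse_hocr_words(SAMPLE_HOCR):
        print(w["conf"], to_djvused_word(w, page_height=200))
```

The confidence value is kept next to each word so that a later re-mapping step could, for example, trust high-confidence words as anchors.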

Project proposed by Micru. I have found an external mentor who could give a hand with Tesseract; now I'm looking for a mentor who can provide assistance with MediaWiki.

URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
Skills: knowledge of the DjVu file type desirable, knowledge of how to build a web API on Unix, knowledge of Python, knowledge of the hOCR file format.
Mentors:

Details

Reference: bz57807

Related Objects

Mentioned In: T277768: Wikisource: Investigate adding support for bulk OCR to Wikimedia OCR [16H]
T70638: phpunit: SKIPPED: djvu binaries do not exist or are not executable
T89123: Import transcription into DjVu file
T59351: To add some basic djvulibre calls to API
Mentioned Here: T6198: Suggestion to implement page-level links into uploaded PDF files [[media:foo.pdf|page=n]]
T13259: Djvu thumbnailing should dynamically select PNG or JPG
T25831: Better error presentation for high res PDFs
T30868: display page count (of the multi-paged media file) in dimensions
T38594: Switch from GhostScript to MuPDF for thumbnailing PDFs
T46531: Download a single page from a PDF at the original (or high) resolution as JPEG
T60331: dumpHtml does not create correct link for file using [[:Media:file.pdf]]
T78060: importImages.php does not run input file names through UTF8 normalization functions
T87318: TIFF images acquiring an extra page

Event Timeline

• bzimport raised the priority of this task to Low. Nov 22 2014, 2:34 AM

• bzimport added a project: ProofreadPage.

• bzimport set Reference to bz57807.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Dec 1 2013, 4:09 PM

vladjohn2013 wrote:

This proposal has been listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects and we are filing a report to gather community feedback and share updates.

CCing Micru and Aarti, who are proposing this project. I'm not sure who Aubrey is:

"Aubrey can be a mentor providing assistance regarding Wikisource"

Created attachment 16531
simple DjVu test file

A simple .DjVu test file where the embedded text-layer is LINE based instead of WORD based. Provided for XML file generation illustration (testme.xml)

Attached:

testme.djvu (61 KB)

Created attachment 16532
resulting XML file

Command line used to generate XML file

C:\Program Files (x86)\DjVuLibre>djvutoxml.exe testme.djvu testme.xml

Attached:

testme.xml (5 KB)

(In reply to vladjohn2013 from comment #0)

Merge proofread text back into Djvu files

. . . The idea is to create an
export tool that will get word positions and confidence levels using
Tesseract and then re-map the text layer back into the DjVu file. If
possible, word coordinates should be kept.

Isn't some of that already possible using DjVuLibre's built in DjVu-to-XML scheme? (See attachments)

As far as I can tell, this method was once feasible & pursued then "abandoned" some 7+ years ago for the current 'plain-text' dump approach we have now due to some resource(?) issues at the time. Most of the related bits seem (to me) to still be in place if you go by what is found in https://git.wikimedia.org/tree/mediawiki%2Fcore

/includes/media/DjVu.php and;
/includes/media/DjVuImage.php

It seems (again, to me) that the first step toward making the proposal a reality is to see if it's still possible to actually generate an XML file from a DjVu file using MediaWiki et al. as it stands today. I know this is possible on a vanilla x86 local install of the DjVuLibre software package (refer to the attachments again)... but all that online server, Linux, Debian, Ubuntobama stuff is beyond me, and something along those lines is what is in play here.

So: Can anyone successfully generate the DjVuLibre-defined XML derivative from a .DjVu file using just the MediaWiki setup currently in place?
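
For anyone who wants to experiment with such an XML derivative once it exists, here is a small Python sketch that pulls the text zones out of a DjVuXML string. The sample document is hand-written for illustration and is not the attached testme.xml. It assumes the nesting OBJECT > HIDDENTEXT > PAGECOLUMN > REGION > PARAGRAPH > LINE > WORD with a comma-separated `coords` attribute, and it assumes a line-based layer stores its text directly in LINE. The coordinates are returned as found, without interpreting their order.

```python
import xml.etree.ElementTree as ET

# Hypothetical DjVuXML: one word-based line and one line-based line.
SAMPLE_XML = """<DjVuXML>
<BODY>
<OBJECT data="file://localhost/testme.djvu" type="image/x.djvu" height="200" width="400">
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="36,116,96,92">Hello</WORD> <WORD coords="110,116,210,92">world</WORD></LINE>
<LINE coords="36,150,210,126">a line without word zones</LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def extract_zones(xml_text):
    """Return one list per page (OBJECT) of (tag, coords, text) tuples."""
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):
        zones = []
        for line in obj.iter("LINE"):
            words = line.findall("WORD")
            # Word-based layer: one entry per WORD. Line-based: the LINE itself.
            for element in (words or [line]):
                text = (element.text or "").strip()
                if not text:
                    continue
                coords = element.get("coords")
                box = tuple(int(v) for v in coords.split(",")) if coords else None
                zones.append((element.tag, box, text))
        pages.append(zones)
    return pages


if __name__ == "__main__":
    for page_number, zones in enumerate(extract_zones(SAMPLE_XML), start=1):
        for tag, box, text in zones:
            print(page_number, tag, box, text)
```

Falling back to LINE when no WORD children are present is meant to cover files like the attached test file, whose text layer is line based.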

GOIII mentioned this in T59351: To add some basic djvulibre calls to API. Nov 27 2014, 12:47 PM

Qgil added a project: Possible-Tech-Projects. Feb 10 2015, 2:54 PM

jayvdb subscribed. Feb 10 2015, 8:49 PM

jayvdb mentioned this in T89123: Import transcription into DjVu file. Feb 10 2015, 8:53 PM

Wikimedia will apply to Google Summer of Code and Outreachy on Tuesday, February 17. If you want this task to become a featured project idea, please follow these instructions.

@Micru, @Aubrey, are you still interested in pursuing this task as a GSoC/Outreachy project?

Qgil moved this task from Backlog to Need Discussion on the Possible-Tech-Projects board. Feb 17 2015, 9:45 AM

Jdforrester-WMF added a project: Contributors-Team. Mar 5 2015, 6:28 PM

Niharika moved this task from Need Discussion to Re-check in September 2015 on the Possible-Tech-Projects board. Mar 8 2015, 2:32 PM

Ineuw unsubscribed. Mar 8 2015, 3:38 PM

Aklapper added a project: All-and-every-Wikisource. Mar 10 2015, 4:15 PM

Accurimbono subscribed. May 10 2015, 10:51 AM

jayantanth subscribed. Sep 19 2015, 7:38 PM

Restricted Application added a subscriber: Aklapper. Sep 19 2015, 7:38 PM

This is a message posted to all tasks under "Re-check in September 2015" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

Aklapper moved this task from Re-check in September 2015 to Need Mentors on the Possible-Tech-Projects board. Oct 4 2015, 9:06 AM

jayvdb updated the task description. Oct 4 2015, 9:25 AM

jayvdb set Security to None.

jayvdb updated the task description.

jayvdb updated the task description. Oct 9 2015, 1:19 AM

jayvdb merged a task: T89123: Import transcription into DjVu file.

jayvdb added a subscriber: Alex_brollo.

jayvdb added subscribers: Shrutika719, Bawolff, Niharika.

jayvdb updated the task description. Oct 9 2015, 2:04 AM

This is the last call for Possible-Tech-Projects missing mentors. The application deadline for Outreachy-Round-11 is 2015-11-02. If this proposal doesn't have two mentors assigned by the end of Thursday, October 22, it will be moved as a candidate for the next round.

Interested in mentoring? Check the documentation for possible mentors.

Just to let you know that I'm presently working on a different, but related, problem: building a "wikisource-like" djvu text editor. First results are very encouraging; here is a screenshot of my "djvu python ajax editor".

The first djvu, with a completely edited text layer by this draft tool, is this one: Atlantide (Mario Rapisardi).djvu.

At present the goal of this tool is to edit the djvu text layer before it is used by wikisource, but its engine can edit the djvu text layer while preserving its structure, so I expect it will be possible to use it to re-upload text coming from wikisource. I'll explore this as soon as I've added some needed properties to the editor. It's only a matter of matching the djvu text with $(".pagetext").text() using some "aligning" tool.
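
One possible shape for such an "aligning" tool, sketched in Python with the standard library's difflib rather than the editor's own engine (which is not published here): the OCR words keep their boxes, the proofread words come in as plain strings, and each proofread word inherits the box of the OCR word it lines up with. The function names and the sample data are invented for the example. When the word counts in a changed run differ, the sketch falls back to one merged box for the whole run, which is a simplification.

```python
from difflib import SequenceMatcher


def union(boxes):
    """Smallest box (xmin, ymin, xmax, ymax) covering all the given boxes."""
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def remap(ocr_words, proofread_words):
    """ocr_words: list of (text, box); proofread_words: list of str.

    Returns a list of (proofread_text, box) pairs.
    """
    ocr_text = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_text, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        boxes = [box for _, box in ocr_words[i1:i2]]
        new_words = proofread_words[j1:j2]
        if tag == "delete":
            # OCR noise with no counterpart in the proofread text.
            continue
        if len(boxes) == len(new_words):
            # Unchanged words, or a one-to-one correction: keep each box.
            result.extend(zip(new_words, boxes))
        else:
            # Split, joined or inserted words: one merged box for the run,
            # or None when there is no OCR word to borrow a box from.
            box = union(boxes) if boxes else None
            result.append((" ".join(new_words), box))
    return result


if __name__ == "__main__":
    ocr = [("Tbe", (10, 10, 40, 30)), ("quick", (50, 10, 100, 30)),
           ("fox", (110, 10, 140, 30))]
    for text, box in remap(ocr, ["The", "quick", "fox"]):
        print(text, box)
```

A real tool would probably want to split a merged box proportionally among its words instead of emitting one phrase-sized zone.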

As previously mentioned, this task is moved to 'Recheck in February 2016' as it doesn't have two mentors assigned to it as of today, October 23, 2015. The project will be included in the discussion of the next iteration of GSoC/Outreachy, and is excluded from #Outreachy-11. Potential candidates are discouraged from submitting proposals to this task for #Outreachy-11 as it lacks mentors in this round.

01tonythomas moved this task from Need Mentors to Re-check in February 2016 on the Possible-Tech-Projects board. Oct 23 2015, 10:09 AM

GOIII mentioned this in T70638: phpunit: SKIPPED: djvu binaries do not exist or are not executable. Nov 1 2015, 2:19 PM

NOTE: This task is a proposed project for Google-Summer-of-Code (2016) and Outreachy-Round-12: GSoC 2016 and Outreachy round 12 are around the corner, and this task is listed as one of the Possible-Tech-Projects for them. Projects listed for the internship programs should have a well-defined scope within the timeline of the event, a minimum of two mentors, and should take about 2 weeks for a senior developer to complete. Interested in mentoring? Please add your details to the task description, if not done yet. Prospective interns should go through the Life of a successful project doc to find out how to come up with a strong proposal.

Core code for this task already exists at https://github.com/wikisource/ocr-tools; how to use it still needs to be documented. It will need to be glued inside a container which can run on Tool Labs and provide a web interface and an API to start a task. The core code uses a Needleman-Wunsch algorithm to do a global alignment between an existing hOCR-like text layer in a DjVu file and the proofread text available on Wikisource. One of the tasks needed is to generate such a text layer using Tesseract if needed.
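
For readers unfamiliar with the technique, a minimal word-level Needleman-Wunsch global alignment looks roughly like the Python sketch below. This is not the code from ocr-tools. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and a practical scorer would likely compare words by similarity rather than strict equality.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists.

    Returns (score, pairs); pairs holds (word_a, word_b) tuples,
    with None standing for a gap on either side.
    """
    n, m = len(a), len(b)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align both words
                score[i - 1][j] + gap,                     # word of a unpaired
                score[i][j - 1] + gap,                     # word of b unpaired
            )

    # Trace back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    ocr = ["the", "qnick", "brown", "fox"]
    proofread = ["the", "quick", "brown", "fox"]
    print(needleman_wunsch(ocr, proofread))
```

Each aligned pair tells the export step which OCR word, and therefore which box, a proofread word should inherit; pairs with None mark words that exist on one side only.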

I see that the required skills include EPUB knowledge, and that many micro-tasks are linked to this task; all of that seems unrelated to this task. For the skills needed, knowledge of the DjVu format, some knowledge of how to build a web API on Unix, and knowledge of Python should be enough.

Thank you a lot! Please fix the description accordingly. :)

Phe updated the task description. Feb 18 2016, 7:08 PM

Phe updated the task description. Feb 18 2016, 7:12 PM

@jayvdb, @Aubrey, you were listed as mentors of this project for Outreachy-11. A new round of GSoC '16 and Outreachy 12 has started. Are you willing to mentor the project in this round?

Qgil unsubscribed. Mar 2 2016, 10:01 AM

Just to let you know briefly the "state of the art" of my attempts:

I have a rough but running "djvu editor" (based on a local server-client python application; editing is done in a simple html page, with js tools, somewhat similar to the wikisource nsPage edit environment);
I'm trying some DIY tricks to align the djvu text layer with the wikisource edited text, using the same "djvu editor" GUI and base scripts;
I'm also testing something deeply different, i.e. uploading wikisource code (raw or parsed into html) into a metadata text field of the djvu page.

Please consider that my skills are very limited: I can't publish code on GitHub, nor can I implement the server-client trick on Tool Labs. I'd be glad to zip and send the whole thing to anyone interested.

Niharika unsubscribed. Mar 2 2016, 6:38 PM

Yann subscribed. Mar 3 2016, 9:38 AM

Per Brewster, the Internet Archive also has a routine to extract the OCR:

get .txt files from the _djvu.xml (which come from the abbyy.xml)

Worth checking if this code is available on their public git repositories.

P.s.: https://github.com/rajbot/AbbyyToDjvuXml is linked from http://raj.blog.archive.org/2011/03/17/how-to-serve-ia-style-books-from-your-own-cluster/#comment-17944

Sumit updated the task description. Mar 25 2016, 9:04 PM

Sumit moved this task from Re-check in February 2016 to Need Mentors on the Possible-Tech-Projects board.

Blahma subscribed. Apr 17 2016, 10:00 PM

Is anyone willing to push this project for the current round of Outreachy internships (Dec 6 to March 6)? The application period has already started and remains open until Oct 17.

Bodhisattwa subscribed. Sep 13 2016, 4:14 AM

Liuxinyu970226 removed a parent task: T37925: [DO NOT USE. Please use the Wikisource project] Wikisource related bugs and enhancements (tracking). Dec 23 2016, 12:09 PM

BethNaught subscribed. Mar 5 2017, 9:39 AM