
Import transcription into DjVu file
Closed, Duplicate (Public)

Description

DjVu files include a text layer. Typically, a DjVu file starts out with a text layer of OCR text, which Wikisource uses as the initial version of the transcription. Wikisource contributors then fix the OCR errors and save the corrections on the Wikisource project as wikitext, until the transcription is accurate and complete. A tool is needed to create a new DjVu file whose text layer contains the accurate and complete Wikisource transcription.
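As a rough illustration of the output side of such a tool, the sketch below renders one page of corrected words as a hidden-text S-expression of the kind `djvused` reads with `set-txt`. The function names and the input layout (lines of `(box, text)` pairs) are assumptions made for this example, not part of any existing tool; DjVu text coordinates are taken to have their origin at the bottom-left of the page, and non-ASCII escaping is not addressed.

```python
def escape_djvu_text(text):
    """Escape backslashes and double quotes for a quoted S-expression string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def union_box(boxes):
    """Smallest (xmin, ymin, xmax, ymax) box that covers all given boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def page_to_sexpr(page_box, lines):
    """Render one page as a (page (line (word ...))) S-expression.

    page_box: (xmin, ymin, xmax, ymax) of the whole page.
    lines: list of lines; each line is a list of (box, text) pairs,
           where box is (xmin, ymin, xmax, ymax) for one word.
    (Hypothetical input layout, chosen for this sketch.)
    """
    out = ["(page {} {} {} {}".format(*page_box)]
    for line in lines:
        line_box = union_box([box for box, _ in line])
        words = " ".join(
            '(word {} {} {} {} "{}")'.format(*box, escape_djvu_text(text))
            for box, text in line
        )
        out.append(" (line {} {} {} {} {})".format(*line_box, words))
    return "\n".join(out) + ")"


if __name__ == "__main__":
    demo = [[((10, 150, 40, 170), "Hello"), ((45, 150, 90, 170), '"World"')]]
    print(page_to_sexpr((0, 0, 100, 200), demo))
```

If the result were saved as `page1.txt`, something like `djvused book.djvu -e 'select 1; set-txt page1.txt' -s` should write it into page 1 of a (hypothetical) `book.djvu`; the file names are placeholders.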

There are existing tools in development that extract the accurate and complete Wikisource transcription, typically exporting it as EPUB. However, they likely discard much of the information needed to recreate a DjVu file, most importantly the (x, y) positions of each piece of text. They may also discard the page numbers.
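One possible way to keep the positions is to align the corrected words against the original OCR words and carry each bounding box across. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; `transfer_boxes` and its input layout are assumptions made for illustration, and words that cannot be matched are simply left without a box.

```python
from difflib import SequenceMatcher


def transfer_boxes(ocr_words, corrected_tokens):
    """Attach OCR bounding boxes to corrected words where they line up.

    ocr_words: list of (box, text) pairs from the original text layer.
    corrected_tokens: list of words from the proofread transcription.
    Returns a list of (box_or_None, text) for the corrected words.
    (Hypothetical helper for this sketch.)
    """
    ocr_texts = [text for _, text in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=corrected_tokens, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Unchanged words, or an equal-length run of corrections:
            # assume a one-to-one mapping and reuse the OCR boxes.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((ocr_words[i][0], corrected_tokens[j]))
        elif tag in ("replace", "insert"):
            # No reliable mapping: keep the word but leave its box unknown.
            for j in range(j1, j2):
                result.append((None, corrected_tokens[j]))
        # "delete": OCR words with no corrected counterpart are dropped.
    return result


if __name__ == "__main__":
    ocr = [((0, 0, 10, 10), "Tbe"), ((12, 0, 20, 10), "quick"), ((22, 0, 30, 10), "fox")]
    for box, word in transfer_boxes(ocr, ["The", "quick", "brown", "fox"]):
        print(box, word)
```

A real tool would still need a strategy for the words left without a box, for example interpolating between neighbouring boxes; that is outside this sketch.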

Tools exist that work with the hOCR data, for instance hOCR.js by Alex Brollo (the gadget author who has worked most with the DjVu layers) and djvutext.py.
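For orientation, hOCR stores each word's position in a `title` attribute such as `bbox 10 20 50 40` on an element with class `ocrx_word`. The sketch below pulls those boxes out with Python's standard `html.parser` and flips them to a bottom-left origin; it is not how hOCR.js or djvutext.py work, the class and function names are invented for this example, and it assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (box, text) pairs for every ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None   # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._box = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._box is not None:
            self.words.append((self._box, "".join(self._text).strip()))
            self._box = None


def hocr_to_djvu_box(box, page_height):
    """Flip a top-left-origin hOCR box to a bottom-left-origin box."""
    x0, y0, x1, y1 = box
    return (x0, page_height - y1, x1, page_height - y0)


if __name__ == "__main__":
    parser = HocrWordParser()
    parser.feed(
        "<span class='ocr_line' title='bbox 0 0 100 30'>"
        "<span class='ocrx_word' title='bbox 10 20 50 40; x_wconf 90'>Hello</span> "
        "<span class='ocrx_word' title='bbox 60 20 95 40'>world</span></span>"
    )
    for box, text in parser.words:
        print(hocr_to_djvu_box(box, page_height=100), text)
```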

Skills: Good knowledge of the DjVu file format is desirable, as is familiarity with EPUB.
Mentors: John Vandenberg, ?.

Event Timeline

Niharika raised the priority of this task from to Needs Triage.
Niharika updated the task description.
Niharika added subscribers: Aklapper, Niharika.

@jayvdb, this is a project you proposed. If you think it is a duplicate of T59807 then you can just merge it.

Qgil triaged this task as Lowest priority. Feb 11 2015, 12:56 PM

Wikimedia will apply to Google Summer of Code and Outreachy on Tuesday, February 17. If you want this task to become a featured project idea, please follow these instructions.

@jayvdb, is there interest in pushing this for the upcoming GSoC/Outreachy round? If yes, would you be willing to mentor?

This is a message posted to all tasks under "Re-check in September 2015" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.