Import transcription into DjVu file
Closed, Duplicate · Public

Assigned To

None

Authored By

	• Niharika
	Feb 10 2015, 1:07 PM

Description

DjVu files include a text layer. Typically a DjVu file starts out with a text layer consisting of OCR text, which Wikisource uses as the initial version of the transcription. Wikisource contributors then fix the OCR errors and save the corrections to the Wikisource project as wikitext, until the transcription is accurate and complete. A tool is needed to create a new DjVu file containing the accurate and complete Wikisource transcription.

Existing tools under development extract the accurate and complete Wikisource transcription, typically exporting it as EPUB. However, they likely discard much of the information needed to recreate a DjVu file, most importantly the (x, y) position of each piece of text. They may also discard the page numbers.
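
To illustrate why the (x, y) positions matter, here is a minimal sketch of carrying OCR word boxes over to corrected words. It is not taken from any of the tools mentioned in this task. The function names (`merge_boxes`, `transfer_boxes`, `to_sexpr`) and the sample data are hypothetical, the whole page is treated as a single line, and the output only approximates the `(page ... (line ... (word ...)))` form that `djvused` uses for hidden text; the string escaping is simplified.

```python
from difflib import SequenceMatcher


def merge_boxes(boxes):
    """Smallest rectangle covering all (xmin, ymin, xmax, ymax) boxes."""
    x0, y0, x1, y1 = zip(*boxes)
    return (min(x0), min(y0), max(x1), max(y1))


def transfer_boxes(ocr_words, corrected):
    """Attach OCR bounding boxes to proofread words.

    ocr_words: list of (text, (xmin, ymin, xmax, ymax)) from the OCR layer.
    corrected: list of proofread words (plain strings).
    Returns a list of (text, box) pairs.
    """
    if not ocr_words:
        raise ValueError("need at least one OCR word box")
    ocr_texts = [text for text, _ in ocr_words]
    boxes = [box for _, box in ocr_words]
    result = []
    matcher = SequenceMatcher(a=ocr_texts, b=corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # OCR word removed by the proofreader
        if i2 - i1 == j2 - j1:
            # Unchanged words, or one-for-one corrections: reuse each box.
            result.extend(zip(corrected[j1:j2], boxes[i1:i2]))
        else:
            # Words were split, joined, or inserted: emit one entry that
            # covers the affected OCR boxes (or the neighbouring ones).
            span = boxes[i1:i2] or boxes[max(i1 - 1, 0):i1 + 1]
            result.append((" ".join(corrected[j1:j2]), merge_boxes(span)))
    return result


def to_sexpr(words, page_box):
    """Render (text, box) pairs as a single-line page S-expression."""
    def quote(text):
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    line_box = merge_boxes([box for _, box in words])
    word_items = " ".join(
        "(word %d %d %d %d %s)" % (*box, quote(text)) for text, box in words
    )
    return "(page %d %d %d %d (line %d %d %d %d %s))" % (
        *page_box, *line_box, word_items
    )


if __name__ == "__main__":
    # Hypothetical OCR output for one short line.
    ocr = [("Tbe", (10, 10, 40, 30)),
           ("qu1ck", (50, 10, 100, 30)),
           ("fox", (110, 10, 140, 30))]
    print(to_sexpr(transfer_boxes(ocr, ["The", "quick", "fox"]), (0, 0, 200, 50)))
```

A real tool would also need to keep line and paragraph structure and handle wikitext markup; this sketch only shows that the positions must survive extraction for the text layer to be rebuilt.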

Tools exist that work with the hOCR data, for instance hOCR.js by Alex Brollo (the gadget author who has worked most with the DjVu layers) and djvutext.py.

Skills: Good knowledge of the DjVu and EPUB file formats is desirable.
Mentors: John Vandenberg, ?.

Related Objects

Mentioned Here: T59807: Merge proofread text back into Djvu files

Event Timeline

• Niharika created this task. Feb 10 2015, 1:07 PM

• Niharika raised the priority of this task to Needs Triage.

• Niharika updated the task description.

• Niharika added a project: Possible-Tech-Projects.

• Niharika added subscribers: Aklapper, • Niharika.

Qgil added a project: MediaWiki-DjVu. Feb 10 2015, 2:26 PM

Qgil subscribed.

This looks like a duplicate of T59807.

@jayvdb, this is a project you proposed. If you think it is a duplicate of T59807 then you can just merge it.

Qgil triaged this task as Lowest priority. Feb 11 2015, 12:56 PM

Wikimedia will apply to Google Summer of Code and Outreachy on Tuesday, February 17. If you want this task to become a featured project idea, please follow these instructions.

@jayvdb, is there interest in pushing this for the upcoming GSoC/Outreachy round? If yes, are you willing to mentor?

Bawolff subscribed. Mar 4 2015, 3:25 PM

• Niharika moved this task from Backlog to Re-check in September 2015 on the Possible-Tech-Projects board. Mar 16 2015, 6:30 PM

Ricordisamoa subscribed. Apr 29 2015, 6:13 PM

jayantanth subscribed. Sep 19 2015, 7:37 PM

This is a message posted to all tasks under "Re-check in September 2015" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

Qgil moved this task from Re-check in September 2015 to Need Discussion on the Possible-Tech-Projects board. Oct 7 2015, 3:48 PM

Shrutika719 subscribed. Oct 8 2015, 3:15 PM

jayvdb closed this task as a duplicate of T59807: Merge proofread text back into Djvu files. Oct 9 2015, 1:20 AM
