
Outreachy 30: Bulk OCR for Wikisource
Open, Needs Triage, Public

Description

Project title
Bulk OCR on Wikisource

Description of project
Wikisource is an online wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. The Bulk OCR feature aims to provide an easy way for volunteers to OCR multiple pages or, say, an entire book on Wikisource. However, the ability to perform bulk OCR on any work should be restricted only to certain groups of users. To this end, there is a need to add features to the Wikisource extension to allow authorized users to OCR multiple pages at once and insert the OCRed text back into the relevant text layer of the corresponding pages of the book on Wikisource.

Expected outcomes
By the end of the project, contributors will have written well-documented code that enables a functional workflow allowing authorized users to perform bulk OCR of the pages of a particular work on Wikisource. Contributors are also expected to document the new workflow on any existing documentation pages.

Preferred skills
JavaScript, HTML, CSS, and familiarity with object-oriented programming; experience with PHP and MediaWiki is a bonus

Mentor(s)
Parthiv Menon (@theprotonade), Satdeep Gill (@SGill)

Size
350 hours

Difficulty
Medium

Microtasks

Additional information

IMPORTANT: GSoC / Outreachy candidates are required to complete micro-tasks during the application period to prove their ability to work on a three-month-long project.

Event Timeline

Restricted Application added a subscriber: Aklapper.
theprotonade renamed this task from GSoC 2025: [add project title] to GSoC 2025: Bulk OCR for Wikisource. Feb 26 2025, 3:39 AM

Hi @theprotonade, thanks for submitting this proposal! Please be sure to add the microtasks, such as any Phabricator links or specific examples that could help a potential contributor.

LGoto renamed this task from GSoC 2025: Bulk OCR for Wikisource to Outreachy 30: Bulk OCR for Wikisource. Feb 28 2025, 4:50 PM
LGoto edited projects, added Outreachy (Round 30); removed Google-Summer-of-Code-2025.
LGoto changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)".
LGoto moved this task from Backlog to Project Proposals on the Outreachy (Round 30) board.
LGoto changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 14 2025, 5:27 PM

@theprotonade @SGill I have set up the codebase locally and completed the microtasks. Getting familiar with the code helped me understand the problem better and how we could tackle it, so I wanted to share my thoughts and check whether this is the right way to approach it.

The current problem is that OCR has to be run manually on each page individually, which is repetitive and slows down the proofreading workflow. The previous OCR tool, OCR4wikisource, has been non-functional for some time, which makes the issue worse.
This lack of automation is particularly challenging during editing campaigns, where new editors can struggle to select the correct OCR engine and language instead of focusing on proofreading.

To resolve this, we can create a dedicated API endpoint that handles bulk OCR requests, processing multiple pages at once. This endpoint will integrate with the existing OCR engines (Tesseract, Google Cloud Vision, Transkribus) and support caching to prevent redundant processing.
In addition, a JavaScript module will add a "Bulk OCR" button to the Index page. This button will allow authorized users to trigger OCR processing for all pages in a given Index. To ensure high-quality OCR output, the tool will show a preview of a few pages before running the bulk job, so users can review the sample OCR results and confirm before proceeding with the full batch.
To avoid flooding recent changes, this feature will be restricted to authorized bot accounts. An access control mechanism will let administrators configure permissions via a bot authorization service, with an admin interface for granting or revoking bulk OCR privileges as needed.
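
As a rough illustration of the caching idea mentioned for the endpoint, here is a minimal in-memory sketch. Everything in it is an assumption for discussion: `createCachedOcr` and `runOcr` are hypothetical names, and a real endpoint would more likely use a server-side cache than a `Map` in the browser.

```javascript
// Hypothetical cache wrapper, not part of any existing Wikisource or
// Wikimedia OCR code. `runOcr(page, engine, lang)` is an assumed async
// function that performs one OCR request and resolves to the text.
function createCachedOcr(runOcr) {
  const cache = new Map();
  return function cachedOcr(page, engine, lang) {
    // The same page with a different engine or language is a different result.
    const key = JSON.stringify([page, engine, lang]);
    if (!cache.has(key)) {
      // Cache the promise itself so concurrent callers share one request.
      const pending = Promise.resolve()
        .then(() => runOcr(page, engine, lang))
        .catch((err) => {
          // Do not keep failures in the cache, so a later retry can succeed.
          cache.delete(key);
          throw err;
        });
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}
```

With this shape, requesting the same page twice with the same engine and language triggers only one OCR run, while a failed request is retried on the next call.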

Implementation would look something like this:
Pages will be processed in small batches (e.g., 3 at a time) to avoid overloading servers.
Progress indicators will show real-time updates on OCR processing.
The implementation will work seamlessly with EditInSequence and other existing tools.
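
To make the batching and progress points above more concrete, here is a minimal sketch of how the client side could process pages a few at a time. The names `bulkOcr`, `ocrPage`, and `onProgress` are hypothetical and only for illustration; `ocrPage` stands in for whatever request the real endpoint ends up exposing.

```javascript
// Hypothetical helper: OCR a list of pages in small batches.
// `ocrPage` is an assumed async function (page name -> OCR text).
async function bulkOcr(pages, ocrPage, { batchSize = 3, onProgress = () => {} } = {}) {
  const results = new Map();
  let done = 0;
  for (let i = 0; i < pages.length; i += batchSize) {
    const batch = pages.slice(i, i + batchSize);
    // At most `batchSize` requests are in flight at once. allSettled keeps
    // one failed page from aborting the rest of the batch.
    const settled = await Promise.allSettled(batch.map(async (p) => ocrPage(p)));
    settled.forEach((outcome, j) => {
      results.set(
        batch[j],
        outcome.status === 'fulfilled'
          ? { ok: true, text: outcome.value }
          : { ok: false, error: String(outcome.reason) }
      );
      done += 1;
      // Report progress after every page so a progress bar can update.
      onProgress(done, pages.length);
    });
  }
  return results;
}
```

The next batch only starts once the previous one has finished, which is simpler than a sliding window of requests but can leave a slot idle while one slow page completes. Failed pages are recorded rather than thrown, so they could be listed for a manual retry.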

The implementation will include configurable options in LocalSettings.php:

$wgBulkOcrEnabled = true; // Enable or disable bulk OCR functionality
$wgBulkOcrAllowBots = true; // Restrict usage to bot accounts
$wgBulkOcrRestrictedMode = true; // Limit feature to explicitly authorized users
$wgBulkOcrMaxConcurrent = 3; // Set maximum concurrent OCR processes
$wgBulkOcrPreviewCount = 5; // Define number of pages for preview before bulk OCR

Let me know whether this direction is aligned with the goal and whether I can start implementing and testing it step by step.
Thank you!

Hello @Aklapper @theprotonade @SGill, I'm Victor Akoh, an Outreachy applicant. Just to be clear, are https://phabricator.wikimedia.org/T316182 and https://phabricator.wikimedia.org/T359010 the only micro-tasks required for this Outreachy project? Thanks.

Hi @Vicolas11, there were two other tasks as well, related to Wikimedia OCR, as you can see above. They have been closed since their patches have been merged.

@theprotonade since the other two patches have been merged, should I just work on https://phabricator.wikimedia.org/T316182 and https://phabricator.wikimedia.org/T359010? Or is there something else I should do? I really want to contribute to this project. Thanks

@theprotonade @SGill
Are there any community-specific questions you would like us to answer when filling out our final application?
Also, for the project timeline, is there any template you would like us to follow or should we structure the timeline as we deem appropriate ...