
GSoC 2026: Bulk OCR Improvements
Open, Needs Triage, Public

Description

Project title
Bulk OCR on Wikisource

Brief summary
Wikisource is an online, wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. The Bulk OCR feature aims to give volunteers an easy way to OCR multiple pages, or even an entire book, on Wikisource. However, the ability to run bulk OCR on a work should be restricted to certain user groups. To this end, the Wikisource extension needs features that allow authorized users to OCR multiple pages at once and insert the OCRed text back into the text layer of the corresponding pages of the book on Wikisource.

Expected outcomes
By the end of the project, contributors will have written well-documented code that improves the existing bulk OCR workflow, which allows authorized users to OCR the pages of a particular work on Wikisource. They are expected to add a way to select between different OCR configurations and a feedback loop that runs before the write-to-wiki process is triggered. They are also expected to update any existing documentation pages to describe the new workflow.
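As a rough illustration of the feedback loop described above, the following JavaScript sketch separates "build a review summary" from "decide what gets written". It is only a sketch under stated assumptions: the function names (`buildReview`, `pagesToWrite`), the result shape (`{ page, text, error }`), the engine label, and the page titles are all hypothetical and do not come from the Wikisource extension or the Wikimedia OCR project.

```javascript
// Hypothetical sketch: none of these names exist in the Wikisource extension.

// A result is usable when OCR reported no error and returned non-blank text.
function isUsable(result) {
  return !result.error &&
    typeof result.text === 'string' &&
    result.text.trim() !== '';
}

// Build the summary a reviewer would see before anything is written.
// `engine` is an opaque label for the selected OCR configuration (assumed).
// `results` is an array of { page, text, error } objects (assumed shape).
function buildReview(engine, results) {
  return {
    engine,
    total: results.length,
    ok: results.filter(isUsable),
    failed: results.filter((r) => !isUsable(r)),
  };
}

// Gate the write step: nothing is returned for writing unless the
// reviewer explicitly approved the batch.
function pagesToWrite(review, approved) {
  if (!approved) {
    return [];
  }
  return review.ok.map((r) => ({ page: r.page, text: r.text }));
}

// Example batch with one good page, one blank page, and one failed page.
const review = buildReview('engine-a', [
  { page: 'Page:Example.djvu/1', text: 'First page text' },
  { page: 'Page:Example.djvu/2', text: '   ' },
  { page: 'Page:Example.djvu/3', error: 'timeout' },
]);

const rejected = pagesToWrite(review, false);
const approved = pagesToWrite(review, true);
```

Keeping the approval gate as a separate step means the write to the wiki cannot happen as a side effect of running OCR. A feedback modal would only need to render `review` and pass the user's decision on.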

Skills required/preferred
JavaScript, HTML, CSS, familiarity with object-oriented programming, and familiarity with the Wikimedia OCR project. Experience with PHP and MediaWiki is a bonus.

Possible mentors
Parthiv Menon (@theprotonade), Satdeep Gill (@SGill)

Expected size of the project
350 hours

Rating
Medium

Microtasks

Any additional information for contributors:

Why are you proposing this project?
Benefits the Wikisource community

What is the expected impact?
Users who can perform bulk OCR will see feedback on the pages being OCRed before approving the entire OCR process. This helps reduce the amount of faulty OCRed text inserted into the wiki. Success would typically look like proper UI elements for choosing between different OCR engines and a feedback modal for approving or rejecting the entire OCR process.

IMPORTANT: GSoC / Outreachy candidates are required to complete micro-tasks during the application period to prove their ability to work on a three-month-long project.