[GSoC 2026 Proposal] Bulk OCR Improvements
Closed, Declined · Public

Description

Profile Information

Name: Okereke Chinweotito
Zulip: okereke chinweotito
Web Profile: https://www.okereke.me
Email: okerekechinweotito@gmail.com
Resume: https://drive.google.com/file/d/1ZvcAMV65V77ol0p4xVDIi5ZsFzT2K5-3/view
Location: Lagos, Nigeria
Typical working hours: 8:00 AM - 10:00 PM (UTC +1)
GitHub: https://github.com/okerekechinweotito
LinkedIn: https://www.linkedin.com/in/chinweotito-okereke-9185941ba/
Project: T415145

Synopsis

Wikisource is an online wiki-based free digital library of free-content textual sources operated by the Wikimedia Foundation. It houses thousands of scanned books and historical documents that volunteers manually transcribe. One of the most tedious parts of that process is OCR: right now, a volunteer working through a 300-page scanned book has to trigger OCR page by page, with no safe way to do it in bulk.
Bulk OCR already exists in the Wikisource extension, but it is not production ready. There is no access control, although the ability to perform bulk OCR on any work should be restricted to certain groups of users.
There is also no review step or feedback loop: OCR output goes directly to the text layer of each page without the user reviewing it first. In addition, the success and error notifications have bugs.
To this end, the Wikisource extension needs features that let authorized users OCR multiple pages at once and insert the OCRed text into the text layer of the corresponding pages of the book on Wikisource. By the end of the project, I will have written well-documented code that improves the existing bulk OCR workflow so that authorized users can OCR the pages of a particular work in bulk. This includes option selection for different OCR configurations and a feedback loop before the write-to-wiki step is triggered, along with updates to any existing documentation pages that describe the workflow. The result is a workflow trustworthy enough that Wikisource communities can rely on it for OCR work.

  • Mentor(s): @theprotonade, @SGill
  • Have you contacted your mentors already?: Yes, and I participate in the relevant discussions
  • Project Size: 350 hours
  • Rating: Medium

Deliverables

GSoC 2026 — Bulk OCR Improvements: Project Timeline

| Phase | Dates | Activities | Milestone |
| --- | --- | --- | --- |
| Community Bonding & Setup | May 1 – May 24 | Address review comments on open patches: T411157 (success message shown with errors), the rotate option for the OCR tool, and direct models.json loading (Wikisource extension patch; companion OCR tool patch already merged); deep-read BulkOcrWidget.js, OcrController.php, and EngineFactory.php; study OOUI Dialog, Booklet, SelectWidget, and ProgressBar APIs; review T359703 and all related Gerrit patches; discuss implementation approach and communication cadence with mentors (@theprotonade & @SGill) | All open pre-GSoC patches iterated on and merge-ready; codebase familiarity and mentor-approved implementation plan ready before coding begins |
| Week 1–2: Bug Fixes | May 25 – June 7 | Get T411157 patch and rotate option PR merged by addressing any remaining reviewer feedback; write or complete unit tests for both fixes; verify rotate option works end-to-end against real rotated scan examples; begin investigation for user group restriction work | Both patches merged; rotate option functional and tested |
| Week 3–4: Direct models.json Loading | June 8 – June 21 | OCR tool patch already merged; focus on getting the Wikisource extension patch merged by addressing reviewer feedback; verify page-load performance improvement and confirm no regressions in language dropdowns across all three engines | Both patches merged; models.json loaded directly from the OCR tool with no ResourceLoader proxy |
| Week 5–6: User Group Restrictions | June 22 – July 7 | Add a custom MediaWiki right (wikisource-bulk-ocr) and wire it into $wgGroupPermissions so administrators can grant it to trusted user groups; hide the bulk OCR trigger from unauthorised users at the UI level; add a server-side permission check in OcrController.php using User::isAllowed() before any OCR API calls are made (UI gating alone is not sufficient); write tests covering both authorised and unauthorised access paths | Bulk OCR trigger accessible only to authorised user groups; unit and integration tests pass for both access paths; feature demoed live to mentors |
| Week 7: OCR Configuration UI | July 8 – July 14 | Verify and harden existing engine and language selection dropdowns; wire engine, language, PSM mode, and rotate config into the bulk OCR API request payload; add an OOUI indeterminate progress bar during the bulk OCR processing phase; handle partial batch failures with per-page error surfacing; write integration tests. Success criteria: all OCR configuration options pass through correctly in tests; progress bar visible and dismisses correctly on completion | All OCR configuration options functional and passing tests end-to-end |
| Week 8–9: Feedback & Approval Dialog | July 15 – July 28 | Design and implement OOUI Process dialog with split-panel layout: page image thumbnail (left) + editable OCR text (right); use OOUI Booklet/Pages to navigate previews for at least the first 5 pages of a batch; add warning banner ("This action will modify multiple pages: only proceed if you have reviewed the results"); add per-page approve/reject controls; add approve-all and reject-all batch actions; page images loaded via the MediaWiki Action API (prop=imageinfo) to avoid additional infrastructure. Success criteria: dialog renders correctly for batches of varying sizes; per-page and batch controls confirmed working in manual testing | Feedback dialog renders OCR results and allows per-page and batch-level decisions |
| Week 10: Write-back & Edge Cases | July 29 – August 4 | Wire the "Approve" action to write OCR text into each page's text layer via the MediaWiki action=edit API; allow inline text edits in the dialog before approval; handle edge cases: empty OCR text, network failure mid-batch, partial approvals; for large batches, writes will be sequential with a small delay to stay within MediaWiki's rate limits rather than firing parallel requests; test the full flow end-to-end (bulk OCR → dialog preview → approve → wiki write-back) | Complete feedback loop working; approved text saved to wiki pages reliably, including edge case handling |
| Week 11–12: Testing & Review | August 5 – August 16 | Full regression testing on a staging Wikisource instance; accessibility pass (keyboard navigation, screen-reader labels on all OOUI dialogs); address all pending Gerrit reviewer comments; ensure all patches pass CI; fix any bugs uncovered during testing | Full test coverage for all new features; all patches passing CI and ready for merge |
| Week 13–14: Documentation & Final Submission | August 17 – August 24 | Update existing Wikisource and MediaWiki documentation pages for the new bulk OCR workflow; write user guide for engine, language, and rotate configuration; write configuration guide for user group restrictions; final code cleanup; respond to any last review comments; submit final work product report to GSoC | All patches merged or merge-ready; documentation complete; 350h of work delivered |
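The Week 5–6 plan pairs UI gating with a server-side permission check that runs before any OCR call is made. The following is a minimal Python model of that ordering, not MediaWiki code: the real check would live in PHP in OcrController.php. The group table, `rights_for`, `run_bulk_ocr`, `PermissionDenied`, and the `ocr_page` callback are all hypothetical names introduced for illustration; only the right name `wikisource-bulk-ocr` comes from the timeline.

```python
from typing import Callable, Dict, Iterable, List, Set

# Right name taken from the timeline above.
BULK_OCR_RIGHT = "wikisource-bulk-ocr"

# Hypothetical group configuration, loosely modelled on the idea behind
# $wgGroupPermissions: each group maps to the set of rights it grants.
GROUP_PERMISSIONS: Dict[str, Set[str]] = {
    "user": {"edit"},
    "sysop": {"edit", BULK_OCR_RIGHT},
}


class PermissionDenied(Exception):
    """Raised when the caller lacks the bulk OCR right."""


def rights_for(groups: Iterable[str]) -> Set[str]:
    """Union of the rights granted by every group the user belongs to."""
    rights: Set[str] = set()
    for group in groups:
        rights |= GROUP_PERMISSIONS.get(group, set())
    return rights


def run_bulk_ocr(
    groups: Iterable[str],
    page_titles: List[str],
    ocr_page: Callable[[str], str],
) -> Dict[str, str]:
    """Check the right first, then OCR each page.

    The check happens before the first OCR call, so an unauthorised
    request never reaches the OCR backend even if the UI was bypassed.
    """
    if BULK_OCR_RIGHT not in rights_for(groups):
        raise PermissionDenied(f"missing right: {BULK_OCR_RIGHT}")
    return {title: ocr_page(title) for title in page_titles}
```

Calling `run_bulk_ocr(["user"], titles, ocr_page)` raises `PermissionDenied` without invoking `ocr_page`, which is the property the planned tests for the unauthorised access path would assert.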

Related tasks: T411157, T413556, T359703, T394130
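The Week 10 write-back step (sequential saves with a small delay, plus handling of empty OCR text, mid-batch failures, and partial approvals) can be sketched as follows. This is an illustrative Python model under stated assumptions rather than the extension's implementation, which would be JavaScript calling the action=edit API. `PageResult`, `write_back`, the `save_page` callback, and the one-second default delay are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageResult:
    title: str      # page title on the wiki
    text: str       # OCR text, possibly edited in the review dialog
    approved: bool  # reviewer's per-page decision


def write_back(
    pages: List[PageResult],
    save_page: Callable[[str, str], None],
    delay_seconds: float = 1.0,  # assumed value, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[str]]:
    """Save approved pages one at a time and report the outcome per page."""
    report: Dict[str, List[str]] = {"saved": [], "skipped": [], "failed": []}
    attempted = 0
    for page in pages:
        # Rejected pages and pages with empty OCR text are never written.
        if not page.approved or not page.text.strip():
            report["skipped"].append(page.title)
            continue
        # Pause between consecutive writes instead of firing them in parallel.
        if attempted > 0:
            sleep(delay_seconds)
        attempted += 1
        try:
            save_page(page.title, page.text)
        except Exception:
            # One failed save must not abort the rest of the batch.
            report["failed"].append(page.title)
        else:
            report["saved"].append(page.title)
    return report
```

Returning a per-page report, rather than a single success flag, is what lets the dialog surface exactly which pages were saved, skipped, or failed, so the success message is never shown for a batch that contained errors.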

Participation

About Me

I hold a BSc in Computer Science from Imo State University. I am a full-stack engineer and I frequently use PHP, JavaScript, TypeScript, HTML/CSS, and Rust. I am a Wikimedian and a member of the Wikisource Tech Telegram group and the Wikisource global community.
I am fully available to focus on this project during and after the GSoC period, and I have no other commitments.
Wikisource does something quietly important: keeping old texts accessible to everyone. That is a mission I genuinely care about, not just one I discovered while browsing GSoC projects.
The problem itself drew me in too. Bulk OCR already exists but the workflow is not trustworthy yet: no access control, no review step before text gets written to hundreds of pages at once. That's exactly the kind of gap I find worth closing.
The fact that I already had patches open before the program began reflects how I actually feel about it.

Past Experience

Bulk OCR contributions

Merged

Under Review

I am an active Wikimedian with experience contributing to open source projects. I participated in Outreachy round 30 with the Wikimedia Foundation, where I contributed to the Book Uploader Bot 2 project. That gave me hands-on experience with Wikimedia infrastructure and introduced me to the wider Wikimedia community.
Additionally, I have also contributed to multiple open source projects like -

Any Other Info

Useful References:

T359703: Related bulk OCR task
Wikisource extension: https://www.mediawiki.org/wiki/Extension:Wikisource
Wikimedia OCR tool: https://ocr.wmcloud.org
Wikimedia OCR source: https://github.com/wikimedia/wikimedia-ocr
Wikimedia OCR documentation: https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR
Merged bulk OCR patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1153779
Legacy bulk OCR user script: https://en.wikisource.org/wiki/User:Kirtisikka972/common.js

Event Timeline

Hi, thanks for submitting your GSoC 2026 project proposal with Wikimedia!

Please make sure you’ve also submitted your proposal on the official Summer of Code website: https://summerofcode.withgoogle.com. The deadline for both submission and any edits is the same, so ensure everything is finalized before March 31, 18:00 UTC, as changes won’t be possible after that.

We strongly recommend completing any updates at least 30 minutes before the deadline to avoid last-minute glitches or unexpected technical issues.

Wishing you all the best for your application. Hope to see you as part of the program soon! 🚀

Hi, thank you for your submission and the effort you put into your proposal. This year we received over 380 strong applications, and unfortunately we were not able to offer you a slot. This was a very competitive process, and many high quality proposals could not be selected. We truly encourage you to stay engaged and continue contributing to Wikimedia projects. Over the years, many contributors who were not selected for Google Summer of Code have gone on to make impactful contributions and become long term members of the community. Please do not see this as a failure, but as a step forward in your journey. We would love to stay in touch and support your continued involvement.

If you would like guidance on how to contribute to our projects outside GSoC, feel free to reach out to any of the mentors or org admins; they will be happy to help you get started.

You can get started or continue contributing here:

We hope to see your contributions in our community soon.