Profile Information
Name: Okereke Chinweotito
Zulip: okereke chinweotito
Web Profile: https://www.okereke.me
Email: okerekechinweotito@gmail.com
Resume: https://drive.google.com/file/d/1ZvcAMV65V77ol0p4xVDIi5ZsFzT2K5-3/view
Location: Lagos, Nigeria
Typical working hours: 8:00 AM - 10:00 PM (UTC +1)
GitHub: https://github.com/okerekechinweotito
LinkedIn: https://www.linkedin.com/in/chinweotito-okereke-9185941ba/
Project: T415145
Synopsis
Wikisource is a wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. It houses thousands of scanned books and historical documents that volunteers transcribe manually. One of the most tedious parts of that process is OCR: right now, a volunteer working through a 300-page scanned book has to trigger OCR page by page, with no safe way to do it in bulk.
Bulk OCR already exists in the Wikisource extension, but it is not production-ready. There is no access control, even though the ability to run bulk OCR on a work should be restricted to certain user groups.
There is also no review step or feedback loop: OCR output goes directly into the text layer of each page without the user ever reviewing it. In addition, there are bugs in the success and error notifications; for example, a success message is shown even when errors occurred (T411157).
To this end, the Wikisource extension needs features that let authorized users OCR multiple pages at once and insert the OCRed text into the text layer of the corresponding pages of the book on Wikisource. By the end of the project, I will have written well-documented code that improves the existing bulk OCR workflow so that authorized users can run bulk OCR on the pages of a particular work. I will also add option selection for the different OCR configurations, add a feedback loop before the write-to-wiki step, and update the existing documentation pages to describe the new workflow. The result is a workflow trustworthy enough that Wikisource communities can actually rely on it for OCR work.
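As a rough illustration of the feedback loop described above, the sketch below models per-page review decisions in TypeScript. Everything in it is hypothetical and for illustration only: the `Decision` and `PageResult` types and the `decideAll` and `pagesToWrite` helpers do not exist in the Wikisource extension, and the real dialog may track state differently. The one property it is meant to show is that nothing becomes eligible for write-back unless it has been approved and its text is non-empty.

```typescript
// Hypothetical types for illustration; none of these names exist in the extension.
type Decision = 'pending' | 'approved' | 'rejected';

interface PageResult {
  title: string;   // wiki page title, e.g. "Page:Example.pdf/3"
  ocrText: string; // OCR output, possibly edited by the reviewer
  decision: Decision;
}

// Batch action (approve-all or reject-all): only pages that are still
// pending are changed, so explicit per-page decisions are preserved.
function decideAll(pages: PageResult[], decision: Decision): PageResult[] {
  return pages.map((p) => (p.decision === 'pending' ? { ...p, decision } : p));
}

// Only approved pages with non-empty text are eligible for write-back.
function pagesToWrite(pages: PageResult[]): PageResult[] {
  return pages.filter((p) => p.decision === 'approved' && p.ocrText.trim() !== '');
}

// Usage: one page rejected by hand, one blank page, one normal page.
const batch: PageResult[] = [
  { title: 'Page:Example.pdf/1', ocrText: 'Chapter I', decision: 'rejected' },
  { title: 'Page:Example.pdf/2', ocrText: '   ', decision: 'pending' },
  { title: 'Page:Example.pdf/3', ocrText: 'It was a dark night.', decision: 'pending' },
];
const eligible = pagesToWrite(decideAll(batch, 'approved'));
console.log(eligible.map((p) => p.title)); // only Page:Example.pdf/3 remains
```

In this sketch a batch action never overrides an explicit per-page rejection, and blank OCR output is filtered out before any write is attempted.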
- Mentor(s): @theprotonade, @SGill
- Have you contacted your mentors already?: Yes, I have, and I participate in the relevant discussions
- Project Size: 350 hours
- Rating: Medium
Deliverables
GSoC 2026 — Bulk OCR Improvements: Project Timeline
| Phase | Dates | Activities | Milestone |
|---|---|---|---|
| Community Bonding & Setup | May 1 – May 24 | Address review comments on open patches: T411157 (success message shown when there are errors), T413556 (rotate option for the OCR tool), and T411447 (direct models.json loading; Wikisource extension patch, companion OCR tool patch already merged); deep-read BulkOcrWidget.js, OcrController.php, and EngineFactory.php; study OOUI Dialog, Booklet, SelectWidget, and ProgressBar APIs; review T359703 and all related Gerrit patches; discuss implementation approach and communication cadence with mentors (@theprotonade & @SGill) | All open pre-GSoC patches iterated on and merge-ready; codebase familiarity and mentor-approved implementation plan ready before coding begins |
| Week 1–2: Bug Fixes | May 25 – June 7 | Get T411157 patch and rotate option PR merged by addressing any remaining reviewer feedback; write or complete unit tests for both fixes; verify rotate option works end-to-end against real rotated scan examples; begin investigation for user group restriction work | Both patches merged; rotate option functional and tested |
| Week 3–4: Direct models.json Loading | June 8 – June 21 | OCR tool patch already merged; focus on getting the Wikisource extension patch merged by addressing reviewer feedback; verify page-load performance improvement and confirm no regressions in language dropdowns across all three engines | Both patches merged; models.json loaded directly from the OCR tool with no ResourceLoader proxy |
| Week 5–6: User Group Restrictions | June 22 – July 7 | Add a custom MediaWiki right (wikisource-bulk-ocr) and wire it into $wgGroupPermissions so administrators can grant it to trusted user groups; hide the bulk OCR trigger from unauthorised users at the UI level; add a server-side permission check in OcrController.php using User::isAllowed() before any OCR API calls are made — UI gating alone is not sufficient; write tests covering both authorised and unauthorised access paths | Bulk OCR trigger accessible only to authorised user groups; unit and integration tests pass for both access paths; feature demoed live to mentors |
| Week 7: OCR Configuration UI | July 8 – July 14 | Verify and harden existing engine and language selection dropdowns; wire engine, language, PSM mode, and rotate config into the bulk OCR API request payload; add an OOUI indeterminate progress bar during the bulk OCR processing phase; handle partial batch failures with per-page error surfacing; write integration tests. Success criteria: all OCR configuration options pass through correctly in tests; progress bar visible and dismisses correctly on completion | All OCR configuration options functional and passing tests end-to-end |
| Week 8–9: Feedback & Approval Dialog | July 15 – July 28 | Design and implement OOUI Process dialog with split-panel layout: page image thumbnail (left) + editable OCR text (right); use OOUI Booklet/Pages to navigate previews for at least the first 5 pages of a batch; add warning banner ("This action will modify multiple pages — only proceed if you have reviewed the results"); add per-page approve/reject controls; add approve-all and reject-all batch actions; page images loaded via the MediaWiki Action API (prop=imageinfo) to avoid additional infrastructure. Success criteria: dialog renders correctly for batches of varying sizes; per-page and batch controls confirmed working in manual testing | Feedback dialog renders OCR results and allows per-page and batch-level decisions |
| Week 10: Write-back & Edge Cases | July 29 – August 4 | Wire the "Approve" action to write OCR text into each page's text layer via the MediaWiki action=edit API; allow inline text edits in the dialog before approval; handle edge cases: empty OCR text, network failure mid-batch, partial approvals; for large batches, writes will be sequential with a small delay to stay within MediaWiki's rate limits rather than firing parallel requests; test the full flow end-to-end (bulk OCR → dialog preview → approve → wiki write-back) | Complete feedback loop working; approved text saved to wiki pages reliably, including edge case handling |
| Week 11–12: Testing & Review | August 5 – August 16 | Full regression testing on a staging Wikisource instance; accessibility pass (keyboard navigation, screen-reader labels on all OOUI dialogs); address all pending Gerrit reviewer comments; ensure all patches pass CI; fix any bugs uncovered during testing | Full test coverage for all new features; all patches passing CI and ready for merge |
| Week 13–14: Documentation & Final Submission | August 17 – August 24 | Update existing Wikisource and MediaWiki documentation pages for the new bulk OCR workflow; write user guide for engine, language, and rotate configuration; write configuration guide for user group restrictions; final code cleanup; respond to any last review comments; submit final work product report to GSoC | All patches merged or merge-ready; documentation complete; 350h of work delivered |
Related tasks: T411157, T413556, T359703, T394130
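The sequential write-back planned for Week 10 could look roughly like the sketch below. It is a sketch under assumptions, not the final implementation: `savePage` is a hypothetical stand-in for whatever ends up issuing the `action=edit` request, and `SavePage`, `WriteReport`, `writeSequentially`, and the delay value are names and parameters invented for illustration. It shows pages being saved one at a time with a pause in between, and a failure on one page being recorded without aborting the rest of the batch.

```typescript
// Hypothetical callback: stands in for the code that performs the action=edit request.
type SavePage = (title: string, text: string) => Promise<void>;

interface WriteReport {
  saved: string[];
  failed: { title: string; reason: string }[];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => { setTimeout(resolve, ms); });

// Save pages strictly one after another, pausing between requests.
// A failed page is recorded and the loop continues with the next page.
async function writeSequentially(
  pages: { title: string; text: string }[],
  savePage: SavePage,
  delayMs: number
): Promise<WriteReport> {
  const report: WriteReport = { saved: [], failed: [] };
  let first = true;
  for (const page of pages) {
    if (!first) {
      await sleep(delayMs); // assumed delay; the real value depends on the wiki's rate limits
    }
    first = false;
    try {
      await savePage(page.title, page.text);
      report.saved.push(page.title);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      report.failed.push({ title: page.title, reason });
    }
  }
  return report;
}

// Usage with a fake saver that fails on the second page.
const demoSave: SavePage = async (title) => {
  if (title === 'Page:Example.pdf/2') {
    throw new Error('network failure');
  }
};
writeSequentially(
  [
    { title: 'Page:Example.pdf/1', text: 'one' },
    { title: 'Page:Example.pdf/2', text: 'two' },
    { title: 'Page:Example.pdf/3', text: 'three' },
  ],
  demoSave,
  10
).then((report) => console.log(report));
```

Returning a report instead of throwing on the first error is one way to support the partial-failure handling listed in the timeline: the dialog can show which pages were saved and offer a retry for the ones that failed.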
Participation
- Weekly written updates posted to the relevant Phabricator tasks
- Weekly video demo to mentors (@theprotonade & @SGill)
- Use Zulip, Phabricator, and Gerrit for relevant correspondence
- Source code on https://gerrit.wikimedia.org/r/mediawiki/extensions/Wikisource and https://github.com/wikimedia/wikimedia-ocr
About Me
I hold a BSc in Computer Science from Imo State University. I am a full-stack engineer, and I frequently use PHP, JavaScript, TypeScript, HTML/CSS, and Rust. I am a Wikimedian and a member of the Wikisource Tech Telegram group and the global Wikisource community.
I am fully available to focus on this project during and after the GSoC period, and I have no other commitments.
Wikisource does something quietly important: it keeps old texts accessible to everyone. That is a mission I genuinely care about, not just one I discovered while browsing GSoC projects.
The problem itself drew me in too. Bulk OCR already exists but the workflow is not trustworthy yet: no access control, no review step before text gets written to hundreds of pages at once. That's exactly the kind of gap I find worth closing.
The fact that I already had patches open before the program began reflects how I actually feel about it.
Past Experience
Bulk OCR contributions
Merged
- Allow to change user interface language of ocr.wmcloud.org T365940 - https://github.com/wikimedia/wikimedia-ocr/pull/150
- Fix the CORS error when fetching ocr.wmcloud.org/models.json T411447 - https://github.com/wikimedia/wikimedia-ocr/pull/166
- Fix no-unused-vars error T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/174
- Enforce eslint semi-style rule T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/176
- Satisfy eslint no-new rule T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/175
- Enforce eslint yml/no-empty-mapping-value T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/178
- Fix yml/no-empty-document error T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/177
Under Review
- Add rotate option to OCR tool T413556 - https://github.com/wikimedia/wikimedia-ocr/pull/158
- Success message is shown when there are errors T411157 - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1229805
- Load models' config directly rather than proxying via MediaWiki T411447 - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1247993
- Implement eslint no-jquery/no-global-selector rule T421660 - https://github.com/wikimedia/wikimedia-ocr/pull/179
I am an active Wikimedian with experience contributing to open source projects. I participated in Outreachy round 30 with the Wikimedia Foundation, where I contributed to the Book Uploader Bot 2 project. That gave me hands-on experience with Wikimedia infrastructure and introduced me to the wider Wikimedia community.
I have also contributed to multiple open source projects, including:
- Book Uploader Bot 2 - https://github.com/coderwassananmol/BUB2/pulls?q=is%3Apr+is%3Aclosed+author%3Aokerekechinweotito
- campwiz - https://github.com/Open-Knowledge-Impact-Foundation/campwiz-nxt-frontend2/pulls?q=is%3Apr+author%3Aokerekechinweotito+is%3Aclosed
- OSCSA - https://github.com/Open-Science-Community-Saudi-Arabia/OSCSA-en-blog/pulls?q=is%3Apr+is%3Aclosed+author%3Aokerekechinweotito
- OSCSA - https://github.com/Open-Science-Community-Saudi-Arabia/MOOCs/pulls?q=is%3Apr+is%3Aclosed+author%3Aokerekechinweotito
- Piximi - https://github.com/piximi/piximi/pulls?q=is%3Apr+is%3Aclosed+author%3Aokerekechinweotito
Any Other Info
Useful References:
T359703: Related bulk OCR task
Wikisource extension: https://www.mediawiki.org/wiki/Extension:Wikisource
Wikimedia OCR tool: https://ocr.wmcloud.org
Wikimedia OCR source: https://github.com/wikimedia/wikimedia-ocr
Wikimedia OCR documentation: https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR
Merged bulk OCR patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1153779
Legacy bulk OCR user script: https://en.wikisource.org/wiki/User:Kirtisikka972/common.js