GSoC 2026: Bulk OCR Improvements
Open, Needs Triage, Public

Description

Project title
Bulk OCR on Wikisource

Brief summary
Wikisource is an online wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. The Bulk OCR feature aims to provide an easy way for volunteers to OCR multiple pages or, say, an entire book on Wikisource. However, the ability to perform bulk OCR on any work should be restricted only to certain groups of users. To this end, there is a need to add features to the Wikisource extension to allow authorized users to OCR multiple pages at once and insert the OCRed text back into the relevant text layer of the corresponding pages of the book on Wikisource.

Expected outcomes
By the end of the project, contributors will have written well-documented code that improves the existing bulk OCR workflow, which allows authorized users to OCR the pages of a particular work on Wikisource. They are expected to add option selection for different OCR configurations and a feedback loop that runs before the write-to-wiki process is triggered. They are also expected to update any existing documentation pages to describe the new workflow.

Skills required/preferred
JavaScript, HTML, CSS, familiarity with object-oriented programming (experience with PHP and MediaWiki is a bonus), familiarity with the Wikimedia OCR project

Possible mentors
Parthiv Menon (@theprotonade), Satdeep Gill (@SGill)

Expected size of the project
350 hours

Rating
Medium

Microtasks

Any other additional information for contributors:

Why are you proposing this project?
Benefits the Wikisource community

What is the expected impact?
Users who can perform bulk OCR can see feedback related to pages that are being OCRed before approving the entire OCR process. This helps reduce faulty OCRed texts being inserted into the wiki. Success would typically look like having proper UI elements to choose from different OCR engines and having a feedback modal to approve or disapprove of the entire OCR process.

Event Timeline

Hello, my name is Rythm. I’m interested in working on the Bulk OCR Improvements project for Google Summer of Code. I have set up MediaWiki locally and located the bulk OCR module in extensions/Wikisource/modules/ext.wikisource.bulkocr, along with several other modules. Right now I am exploring the notification logic for OCR results and looking into this issue.

Hi,
I have been setting up the MediaWiki development environment for about a week and finally managed to get it running locally after quite a bit of effort. During this process I started exploring the ProofreadPage extension and the tasks related to the Bulk OCR Improvements project.

While going through the project description and some related Phabricator tasks, I found the OCR + ProofreadPage workflow interesting, especially the idea of running OCR on multiple pages and integrating the results back into the text layer.

I’m currently trying to understand the architecture and the existing implementation. @theprotonade and @SGill, could you suggest a good first task or bug in this area that would help me get familiar with the codebase?
Thank you!

While exploring the Bulk OCR feature, I noticed that multiple OCR engines (like Tesseract, Kraken, and Transkribus) can be selected.
I was wondering how the engine selection is currently handled in the system — is it configured per request from the UI, or is there some default engine configuration at the backend service level?

Did you check the linked microtasks in the task description?

Thanks! I’ll go through the linked microtasks in the task description and start exploring them. I’ll update here once I understand which one I can start working on.

Hi @Aklapper, I explored the linked microtasks (T411157 and T413556) to understand the OCR workflow and the Wikisource extension.
It looks like patches are already under review for those tasks, so I’ve been going through the related code (especially BulkOcrWidget.js) to better understand how bulk OCR notifications and the UI logic work.
I’ll continue exploring the OCR tool and related areas to see where I can contribute next. If there are any beginner-friendly issues around the OCR workflow or Wikisource extension, I’d be happy to take a look.

Hi everyone,
My name is Hussein Mmbaga (@Ssein). I am a full-stack developer and a Swahili Wikipedia administrator from Arusha, Tanzania.

I am interested in working on this task as part of my preparation for GSoC 2026. I have already set up MediaWiki locally on Linux with the Wikisource-related extensions and have been exploring the Bulk OCR codebase. I would like to work on it and submit a patch for review.

I have submitted my first microtask patch:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1252657

Bug: T411157 - Fix success message shown when there are errors

Hi, I am interested in working on the Bulk OCR Improvements project for GSoC 2026.

I have started exploring the codebase and would like to begin with small tasks and submit patches.

Looking forward to contributing.

Hello moderator,

I am interested in working on the Bulk OCR Improvements project for GSoC 2026.

I explored the current task and also the microtasks:
T411157
T413556
T365940
T359703

I noticed:

  • OCR is mostly manual and page-by-page
  • users must provide an image link for each page, choose an OCR engine, optionally crop, and trigger transcription
  • there is no integrated Bulk OCR option, so everything is manual
  • there is no rotation support before OCR
  • a UI bug exists where a success message is shown even when errors occur during OCR actions
  • the current solution runs client-side and takes longer to process data (especially for rotated pages)

A better alternative is to build a backend-driven bulk OCR system that:

  • fetches all pages from an Index
  • processes OCR asynchronously in batches or page-by-page using the MediaWiki job queue
  • allows engine and language selection
  • supports rotation options (rotate by 90 deg on rotate icon/button click)
  • shows preview before saving results
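
As a rough illustration of the batching and preview-before-save steps listed above, here is a minimal Python sketch. It is not code from the Wikisource extension: `batches`, `ocr_in_batches`, and `fake_ocr` are hypothetical names, the batch size is arbitrary, and a real implementation would call the OCR service from a job handler instead of a stub.

```python
from typing import Callable, Dict, Iterator, List


def batches(pages: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `pages` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def ocr_in_batches(
    pages: List[str],
    ocr_page: Callable[[str], str],
    size: int = 10,
) -> Dict[str, str]:
    """Run `ocr_page` over every page, one batch at a time.

    Nothing is saved here: the returned mapping is what a preview step
    could show to the user before any text is written to the wiki.
    """
    preview: Dict[str, str] = {}
    for batch in batches(pages, size):
        for title in batch:
            preview[title] = ocr_page(title)
    return preview


def fake_ocr(title: str) -> str:
    """Hypothetical stand-in for a real OCR call."""
    return f"text of {title}"


page_titles = [f"Page:Book.pdf/{n}" for n in range(1, 6)]
preview = ocr_in_batches(page_titles, fake_ocr, size=2)
```

Keeping the OCR pass separate from the save step is what makes a review stage possible: the `preview` mapping can be shown, approved, or discarded without touching any wiki page.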

While testing OCR, I got to know that Google Cloud Vision provides the best accuracy and automatic orientation detection, so it can be used as the default engine while still allowing Tesseract and Transkribus as options.

I would be happy to receive feedback on whether this understanding aligns with the intended direction for the task.

Hi, I am Samuel Agbozo from KNUST, Ghana. I am interested in the Bulk OCR Improvements GSoC project for 2026. I have read the Phabricator task and its project description, and I have set up my local MediaWiki environment. My Wikimedia username is Samuel Agbozo.
Could you please let me know what the next step is?

Hi, I’ve submitted a patch to improve Bulk OCR notifications:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1257369
It handles partial failures better by showing the failed pages and keeping the notification visible.

Feedback welcome

Hello, I am Ntifo Benjamin Nana Buabeng from KNUST, Ghana. I am interested in the Bulk OCR Improvements for GSoC 2026. I am currently going through the task description and would like to begin with the small tasks.
Looking forward to contributing.
I will reach out if I have any questions about the implementation.

Hi mentors, I’m Victor, a recent Python backend diploma graduate from Nigeria. I am strong with FastAPI/REST APIs and SQL. I love Open World Holidays Framework. Here’s my GitHub [https://github.com/VictorJatto-altschool]. Can we discuss my proposal? I am happy to fix a small issue first.

Change #1260768 had a related patch set uploaded (by Apurvaanand51; author: Apurvaanand51):

[mediawiki/extensions/ProofreadPage@master] Improve UI label clarity for better user understanding

https://gerrit.wikimedia.org/r/1260768

Change #1260772 had a related patch set uploaded (by Apurvaanand51; author: Apurvaanand51):

[mediawiki/extensions/ProofreadPage@master] Improve image tool label clarity for better user experience

https://gerrit.wikimedia.org/r/1260772

Hi Parthiv / SGill,

My name is Muhannad Mahmoud, and I’m a Frontend developer interested in contributing to the Wikisource OCR project for GSoC 2026.

I’ve prepared a Frontend prototype focusing only on the user interface: page selection, displaying converted text (using mock data), and basic UX improvements. The prototype does not include OCR backend or MediaWiki API integration yet, as I’m focusing on design and interaction first.

You can view the prototype here:
https://github.com/mohnedmahmoud772-tech/wikisource-ocr-frontend-prototype

You can also view my portfolio here:
https://mohnedmahmoud772-tech.github.io/Muhannad-s-Portfolio/

I’m planning to submit a formal GSoC proposal soon based on this prototype. Could you please advise me on the best way to start and suggest any good first frontend tasks I could work on?

Thank you very much for your guidance.

Best regards,
Muhannad Mahmoud

Hi, I am Manvi, interested in the Bulk OCR Improvements project for GSoC 2026.

I have started exploring the microtasks and Wikimedia OCR tools.
Could you please guide me on which microtask would be best to begin with?

Also, are there any recommended resources for understanding the OCR workflow in Wikisource?

Thank you!

@Manvikesarwani09 Hi and welcome! Regarding your questions, have you seen the ticket description? Thanks.

Hi Mentors,

I have been studying the Bulk OCR problem description in depth, exploring the existing single-page OCR workflow on Wikisource, and reviewing the Wikimedia OCR API endpoints. Before I finalize my architecture and proposed solution, I had a few important clarifying questions:

  1. Crop Option in Bulk OCR

The existing single-page OCR tool allows users to manually draw a crop rectangle to transcribe only a specific area of a page (e.g., a single column). Since this requires direct user interaction per page, it seems impractical to include in a bulk workflow. Should the crop option be excluded from Bulk OCR entirely, or is there an expectation to support some form of it, for example by applying a fixed crop region uniformly across all selected pages?

  2. OCR Engine Switching Mid-Batch

Should the user be allowed to assign different OCR engines to different pages within the same bulk job? For instance, using Tesseract for most pages but Transkribus for a few manuscript pages. Or should the engine selection be a single global config applied uniformly to the entire batch?

  3. Transkribus Credit Consumption

Since Transkribus operates on limited free credits, should the system estimate and display the number of credits a bulk job would consume before the user triggers it? And should there be a hard cap or a warning threshold to prevent accidental exhaustion of credits in a single bulk run?

  4. Concurrency Limit

Does the Wikimedia OCR API enforce any documented rate limits or maximum concurrent requests? This would directly inform the concurrency limit we set in the bulk scheduler to avoid overloading the service.

  5. Feedback Loop Granularity

In the review step before writing to the wiki, should users be able to edit the OCR output inline before approving it? Or is the feedback loop intended only for approve/reject decisions, with editing left to the standard wiki editor afterward?
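
To make the concurrency-limit question above more concrete, here is a minimal Python sketch of a scheduler that caps the number of simultaneous OCR requests. Everything in it is an assumption made for illustration: `run_with_limit` and `FakeOcr` are hypothetical names, the limit of 2 is arbitrary, and the actual limits of the Wikimedia OCR service are exactly what the question asks about.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(
    pages: List[str],
    ocr_page: Callable[[str], Awaitable[str]],
    max_concurrent: int,
) -> List[str]:
    """OCR all pages, never running more than `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(title: str) -> str:
        # Only `max_concurrent` coroutines can be inside this block together.
        async with semaphore:
            return await ocr_page(title)

    # gather() returns results in the same order as the input pages.
    return await asyncio.gather(*(guarded(title) for title in pages))


class FakeOcr:
    """Hypothetical OCR stub that records the peak number of overlapping calls."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def __call__(self, title: str) -> str:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulate network latency
        self.in_flight -= 1
        return title.upper()


fake = FakeOcr()
results = asyncio.run(run_with_limit(["a", "b", "c", "d", "e"], fake, 2))
```

With a cap like this, the value of `max_concurrent` becomes a single configuration knob that can be set once the service's real limits are known.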

Clarifying these points would be very helpful in ensuring my proposed architecture aligns well with the project's expectations before I finalize my GSoC'26 proposal.

Thank you for your time and guidance.

Best regards
Soumyadeep Dutta
[SimpleMan05]

Hi @theprotonade and @SGill,

I am Shubham Solanki, a GSoC 2026 applicant for the Bulk OCR Improvements
project. I have submitted my proposal as a subtask and have also started
exploring the wikimedia-ocr codebase.

I submitted a small contribution adding i18n messages for the bulk OCR
feature and rate limiting documentation:
https://github.com/wikimedia/wikimedia-ocr/pull/171

I have a couple of questions about the project architecture:

  1. Should the bulk OCR queue be handled via MediaWiki's JobQueue or synchronously?
  2. Are there existing rate-limiting utilities in the Wikisource extension I should reuse?

Looking forward to your guidance!

Hello mentors,
My name is Ahmad Ali, a Computer Science student interested in applying for this project under GSoC 2026. I have been exploring the existing Wikisource OCR workflow and the ProofreadPage extension codebase this week, and I find the problem of bulk digitization genuinely important.
I have drafted a proposal covering: asynchronous job queue design using the MediaWiki-native JobQueue, a side-by-side OCR preview interface with low-confidence highlighting, role-based access control with a new bulkocr user right, and an optional Python NLP post-processing microservice designed for graceful degradation.
My technical question: When a bulk OCR job is processing 200+ pages asynchronously, what is the recommended strategy for rate-limiting calls to the Wikimedia OCR service — is there an existing throttling mechanism in the extension I should build on, or would I need to implement a token-bucket style limiter in the job handler?
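
For reference, the token-bucket idea mentioned in the question can be sketched in a few lines of Python. This is only an illustration of the general technique, not an existing utility in the extension: `TokenBucket`, `FakeClock`, and the rate and capacity values are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; False means the caller should wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class FakeClock:
    """Hypothetical manual clock so the demo does not depend on real time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=2.0, capacity=3.0, clock=clock)
burst = [bucket.try_acquire() for _ in range(4)]  # fourth call exceeds the burst
clock.now += 0.5  # half a second at 2 tokens/s refills one token
after_wait = bucket.try_acquire()
```

In a job handler, a `False` result would typically mean re-queueing or delaying the page instead of calling the OCR service immediately.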
I would appreciate any guidance or feedback from mentors. I am also available on Wikimedia Zulip as AhmadAli.
Thank you for your time.

Hi @theprotonade and @SGill! I'm Kaung Sithu Tun, currently a third-year IT student at Stamford International University, Thailand. I’m a frontend developer with experience in the required skills and in building commercial web platforms. I am very interested in the "Bulk OCR Improvements" project and am currently drafting my proposal. I've already started looking into the microtasks, specifically T411157, to better understand the Wikisource extension's UI logic. Looking forward to contributing!

Hi everyone, I'm Abel Angkyier, a Computer Science student at the University for Development Studies in Ghana. I'm preparing a proposal for the Bulk OCR Improvements for Wikisource project (T415145, 350 hours). So far I have:

  • Set up a local MediaWiki instance with the Wikisource extension
  • Explored the Bulk OCR code and the Wikimedia OCR tool
  • Reviewed the microtasks (T411157, T413556, T365940)

I've drafted a proposal that focuses on permission controls, configurable OCR options (engine + language), and a feedback/preview loop before writing text to wiki pages. I would really appreciate any early feedback from the mentors (@theprotonade and @SGill) or anyone familiar with the project. Happy to share the full proposal (Google Doc or PDF) here. Thank you!

Change #1260768 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Improve UI label clarity for better user understanding

https://gerrit.wikimedia.org/r/1260768

Change #1260772 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Improve image tool label clarity for better user experience

https://gerrit.wikimedia.org/r/1260772

@theprotonade @SGill @Samwilson @LGoto

Good evening. I just received an email from Google informing me that my proposal was not accepted.
I know the mentors are very busy, but I was wondering what the selection criteria were. Is the Wikimedia Selection Guide (outlined here) still being followed?
I ask this given that I successfully fixed 2 out of the 4 microtasks and have working solutions awaiting merge for the other 2. In total, during the contribution stage, across the OCR tool and Wikisource, I have 7 merged pull requests and 4 pull requests and Gerrit patches awaiting review, as outlined in my proposal - https://phabricator.wikimedia.org/T420680
No other candidate had similar impact, and the selected contributor did not complete any microtask; the single patch he listed as a contribution did not align with the task requirement @Samwilson wanted.

Please, I do not want this to come across the wrong way, and I am not demanding an explanation. I only humbly request clarification of the criteria to better inform myself and future contributors.

Relevant Links

My Proposal - T420680
Accepted Proposal - T420281
Accepted Proposal's Submitted Patch - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisource/+/1252657 (an attempt to fix T411157)

More Context on why submitted patch is wrong

Thanks @Okerekechinweotito!

I think cancelling the whole batch when there's any error is probably safer, and generally more likely to be what people want. If there's an error with one page, it's probably something like the thumbnail not being generated or an API being down, and so retrying with other pages will probably also fail. And if not, then the user can just try again. But if they're just getting a string of red warnings it won't inspire confidence.

The warning and error notifications should be persistent, and stay on the screen for the user's reference.
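
The cancel-on-first-error behaviour described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the extension's actual code: `OcrError`, `ocr_all_or_nothing`, and `flaky_ocr` are hypothetical names, and the error text is made up for the demo.

```python
from typing import Callable, Dict, List, Optional, Tuple


class OcrError(Exception):
    """Hypothetical error type raised by the OCR stub below."""


def ocr_all_or_nothing(
    pages: List[str],
    ocr_page: Callable[[str], str],
) -> Tuple[Optional[Dict[str, str]], Optional[str]]:
    """Return (results, None) if every page succeeds, else (None, message).

    The first failure stops the batch, so later pages are not attempted
    and no partial results are passed on to be saved.
    """
    results: Dict[str, str] = {}
    for title in pages:
        try:
            results[title] = ocr_page(title)
        except OcrError as error:
            return None, f"Bulk OCR cancelled at {title}: {error}"
    return results, None


attempted: List[str] = []


def flaky_ocr(title: str) -> str:
    """Hypothetical stub that fails on one page and records every attempt."""
    attempted.append(title)
    if title == "p2":
        raise OcrError("thumbnail not generated")
    return f"text of {title}"


results, message = ocr_all_or_nothing(["p1", "p2", "p3"], flaky_ocr)
```

The single `message` is what a persistent error notification could display, in place of a string of per-page warnings.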

@Okerekechinweotito Thank you so much for your continued interest in the Wikimedia movement and this project. You did score the highest according to the criteria; however, it was also noted that you have been an Outreachy intern with the Wikimedia Foundation and a mentor for another project, and that you are working as a full-stack engineer. You were therefore considered highly overqualified for the internship, and it was decided to go with someone with less experience, since the goal behind GSoC is to help newcomers develop into professionals. Per the FAQ, the goal of GSoC is, "To bring new contributors into open source communities to foster long-term involvement in the open source ecosystem."

Also, according to the GSoC FAQs, "While open to various backgrounds, GSoC targets newcomers. Professionals often find the required time commitment difficult to balance."

I hope you understand. I am sorry it wasn't clear in advance. You have been an amazing contributor to our projects and we hope you continue to be engaged. Thank you and I hope to see you around on the Wikimedia projects.

Hi mentors (@theprotonade, @SGill)

I am very happy to have the opportunity to contribute to the Wikisource Bulk OCR Improvements project this summer. Thank you!

I am looking forward to the Community Bonding phase and learning more about the community's practices. I am ready to start refining the project plan and milestones whenever you are available to discuss them.

Happy to be part of the open source team!

@SGill I appreciate your detailed response. Thank you for the clarification.