
Outreachy 30: Bulk OCR for Wikisource
Closed, Resolved · Public

Description

Project title
Bulk OCR on Wikisource

Description of project
Wikisource is an online wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. The Bulk OCR feature aims to provide an easy way for volunteers to OCR multiple pages or, say, an entire book on Wikisource. However, the ability to perform bulk OCR on any work should be restricted only to certain groups of users. To this end, there is a need to add features to the Wikisource extension to allow authorized users to OCR multiple pages at once and insert the OCRed text back into the relevant text layer of the corresponding pages of the book on Wikisource.

Expected outcomes
By the end of the project, contributors will have written well-documented code to enable a functional workflow that allows authorized users to perform bulk OCR of pages of a particular work on Wikisource. Contributors are also expected to document the new workflow on any existing documentation pages.

Preferred skills
JavaScript, HTML, CSS, and familiarity with object-oriented programming. Experience with PHP and MediaWiki is a bonus.

Mentor(s)
Parthiv Menon (@theprotonade), Satdeep Gill (@SGill)

Size
350 hours

Difficulty
Medium

Microtasks

Additional information

IMPORTANT: GSoC / Outreachy candidates are required to complete microtasks during the application period to prove their ability to work on a three-month-long project.

Event Timeline

Restricted Application added a subscriber: Aklapper.
theprotonade renamed this task from GSoC 2025: [add project title] to GSoC 2025: Bulk OCR for Wikisource. Feb 26 2025, 3:39 AM

Hi @theprotonade, thanks for submitting this proposal! Please be sure to add the microtasks, such as any Phabricator links or specific examples that could help a potential contributor.

LGoto renamed this task from GSoC 2025: Bulk OCR for Wikisource to Outreachy 30: Bulk OCR for Wikisource. Feb 28 2025, 4:50 PM
LGoto edited projects, added Outreachy (Round 30); removed Google-Summer-of-Code-2025.
LGoto changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)".
LGoto moved this task from Backlog to Project Proposals on the Outreachy (Round 30) board.
LGoto changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 14 2025, 5:27 PM

@theprotonade @SGill I have set up the codebase locally and completed the microtasks. Getting familiar with the code helped me understand the problem better and how we might tackle it, so I wanted to share my thoughts and check whether this is the right approach.

The current problem is that OCR must be run manually on each page individually, which adds repetitive work and slows down the proofreading workflow. The previous OCR tool, OCR4wikisource, has been non-functional for some time, which makes the problem worse.
This lack of automation is particularly challenging during editing campaigns, where new editors might struggle with selecting the correct OCR engine and language instead of focusing on proofreading.

To resolve this, a dedicated API endpoint will be created to handle bulk OCR requests, processing multiple pages at once. This endpoint will integrate with the existing OCR engines (Tesseract, Google Cloud Vision, Transkribus) and support caching to prevent redundant processing.
A JavaScript module will add a "Bulk OCR" button to the Index page, allowing authorized users to trigger OCR processing for all pages in a given Index. To ensure high-quality OCR output, the tool will show a preview of a few pages before bulk processing, so users can review the sample OCR results and confirm before proceeding with the full batch.
To avoid flooding recent changes, this feature will be restricted to authorized bot accounts. An access control mechanism will let administrators configure permissions via a bot authorization service, with an admin interface for granting or revoking bulk OCR privileges as needed.

Implementation would look something like this:
Pages will be processed in small batches (e.g., 3 at a time) to avoid overloading servers.
Progress indicators will show real-time updates on OCR processing.
The implementation will work seamlessly with EditInSequence and other existing tools.
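
The batching and progress points above can be sketched roughly as follows. This is only an illustration under assumptions: `runInBatches`, `ocrPage` and `onProgress` are hypothetical names, not functions from the Wikisource extension, and `ocrPage` stands in for whatever call actually sends one page to the OCR service.

```javascript
// Hypothetical helper: process pages in fixed-size batches so that at most
// `batchSize` OCR requests are in flight at any one time.
// `ocrPage(page)` is an assumed function returning a Promise for the page text.
async function runInBatches(pages, batchSize, ocrPage, onProgress) {
  const results = [];
  for (let start = 0; start < pages.length; start += batchSize) {
    const batch = pages.slice(start, start + batchSize);
    // allSettled waits for every request in the batch, so one failed page
    // does not abort the rest of the run.
    const settled = await Promise.allSettled(batch.map((page) => ocrPage(page)));
    settled.forEach((outcome, i) => {
      const ok = outcome.status === 'fulfilled';
      results.push({
        page: batch[i],
        ok: ok,
        text: ok ? outcome.value : null,
        error: ok ? null : String(outcome.reason)
      });
    });
    // Report progress after each batch: pages handled so far, total pages.
    if (onProgress) {
      onProgress(results.length, pages.length);
    }
  }
  return results;
}
```

With a batch size of 3, a 10-page index would be handled as batches of 3, 3, 3 and 1, and the progress callback would fire four times.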

The implementation will include configurable options in LocalSettings.php:

$wgBulkOcrEnabled = true; // Enable or disable bulk OCR functionality
$wgBulkOcrAllowBots = true; // Restrict usage to bot accounts
$wgBulkOcrRestrictedMode = true; // Limit feature to explicitly authorized users
$wgBulkOcrMaxConcurrent = 3; // Set maximum concurrent OCR processes
$wgBulkOcrPreviewCount = 5; // Define number of pages for preview before bulk OCR
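
As a rough idea of how the preview count above might be used on the client side, here is a small sketch that picks sample pages spread evenly across the book. The even spacing is my assumption (the proposal only says "a few pages"), and `pickPreviewPages` is a hypothetical name.

```javascript
// Hypothetical helper: choose `previewCount` pages spread evenly over the
// index, always including the first and last page when more than one is asked for.
function pickPreviewPages(pages, previewCount) {
  if (previewCount <= 0 || pages.length === 0) {
    return [];
  }
  if (previewCount >= pages.length) {
    return pages.slice(); // fewer pages than requested: preview all of them
  }
  if (previewCount === 1) {
    return [pages[0]];
  }
  const picked = [];
  // Distance between sampled positions; greater than 1 here, so no duplicates.
  const step = (pages.length - 1) / (previewCount - 1);
  for (let i = 0; i < previewCount; i++) {
    picked.push(pages[Math.round(i * step)]);
  }
  return picked;
}
```

Sampling across the whole work rather than taking the first few pages avoids previewing only title pages and front matter, which often OCR differently from the body text.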

Let me know if this approach is aligned with the goal and whether I can start implementing and testing it step by step.
Thank you!

Hello @Aklapper @theprotonade @SGill, I'm Victor Akoh, an Outreachy applicant. Just to be clear, are https://phabricator.wikimedia.org/T316182 and https://phabricator.wikimedia.org/T359010 the only microtasks required for this Outreachy project? Thanks.

Hi @Vicolas11, there were two other tasks as well, related to Wikimedia OCR, as you can see above. They have been closed since their patches have been merged.

@theprotonade since the other two patches have been merged, should I just do these two: https://phabricator.wikimedia.org/T316182 and https://phabricator.wikimedia.org/T359010 ? Or is there anything else I should do? I really want to contribute to this project. Thanks

@theprotonade @SGill
Are there any community-specific questions you would like us to answer when filling out our final application?
Also, for the project timeline, is there a template you would like us to follow, or should we structure the timeline as we see fit?

Gopavasanth subscribed.

Congratulations @Osuji_pius on being selected for Outreachy! 🎉
Wishing you a great journey ahead—happy coding and best of luck with the program!

As you move through the community bonding period, feel free to refine your project timeline and finalize the steps leading up to the coding phase. If you have any questions, don’t hesitate to reach out—whether on Zulip, via email, or directly on this ticket.

Thank you so much @Gopavasanth!
I'm truly grateful for this opportunity

I’ll be sure to reach out if I have any questions or need guidance along the way.

@Osuji_pius congratulations chief 🎉. You deserve it. I wish you all the best.

Thanks man @Vicolas11

Weekly Internship Report
Week 1 : June 2 - June 6

Task progress

  • Had an onboarding session with my mentors @satdeep_gill and @theprotonade
  • Wrote a blog post introducing myself. Blog prompt: "Introduce yourself"
  • Worked on two open patches from the contribution phase that needed changes. One has now been merged: Patch. The other is still under review, but all requested changes have been addressed.
  • Got started on the first subtask - T359703

Challenges Faced

  • Issues with getting the Bulk OCR button to display above the pagelist in the index namespace due to how the ext.wikisource.OCR is loaded

Weekly Internship Report
Week 2: June 9 – June 14

Task progress

  • Wrapped up the first subtask, which triggers bulk OCR for pages in the index namespace using Google as the default engine.
  • Used mw.notify for error handling and progress tracking
  • The Bulk OCR button is only visible to admin accounts on the wiki
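
For readers following along, a minimal sketch of the kind of visibility check described above. It assumes "admin" maps to the `sysop` user group; the function name and the allowed-group list are illustrative and not taken from the actual patch. On a wiki page the groups would typically come from `mw.config.get( 'wgUserGroups' )`; they are passed in as an argument here so the snippet runs on its own.

```javascript
// Assumption: only members of these groups should see the button.
const ALLOWED_GROUPS = ['sysop'];

// Hypothetical helper: decide whether to render the Bulk OCR button.
// `userGroups` is an array of group names such as ['*', 'user', 'sysop'].
function canSeeBulkOcrButton(userGroups) {
  if (!Array.isArray(userGroups)) {
    return false; // anonymous or missing data: hide the button
  }
  return userGroups.some((group) => ALLOWED_GROUPS.includes(group));
}
```

A client-side check like this only controls what is shown. Any edit made through the feature still has to be permitted by the server.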

Challenges Faced

  • None this week

Learnings and Skills Gained

  • Learned how to use OOUI components
  • Improved how I work with documentation

Weekly Report

Week 3 (June 16 - 20)

  • Wrote my second blog post, "Everybody struggles"
  • Pushed a patch for the first subtask of the Bulk OCR, which achieved the following:
    • Adds a Bulk OCR button that is only visible to admin accounts on the wiki
    • Clicking the button triggers a Bulk OCR for pages in the Index namespace, using Google as the default engine.
    • Error handling and progress tracking with mw.notify
    • Update wiki pages with text when OCR is complete
    • Update Page status to Not proofread (pages marked with red)
  • Got feedback from my mentors, and I'm now making improvements to my patch.

Key Takeaways

  • Improved my understanding of Wikisource in general.
  • Created a new module - ext.wikisource.bulkocr.

Weekly Report

Week 4 (June 23 - 27)

Task Progress:

Challenges Faced:

  • The MediaWiki API returns all pages of an index in a single flat array, without regard to multiple page lists or sections.
  • To make sectioning work, requests to the API must be made immediately when the BulkOCR buttons are rendered, even if the user hasn’t clicked the button yet.
  • DOM Manipulation Complexity
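
To illustrate the first challenge, here is a sketch of regrouping a flat page array into sections. It assumes each section's start and end position is already known (for example, read from the rendered pagelists); the data shape and the name `groupPagesBySection` are hypothetical and not how the patch necessarily does it.

```javascript
// Hypothetical helper: split a flat list of pages into named sections.
// `sections` is an array of { name, from, to } where `from` and `to` are
// 1-based, inclusive positions into `flatPages`.
function groupPagesBySection(flatPages, sections) {
  return sections.map((section) => ({
    name: section.name,
    // slice() takes a 0-based start and an exclusive end, hence `from - 1`.
    pages: flatPages.slice(section.from - 1, section.to)
  }));
}
```

For example, five pages with sections `{ from: 1, to: 2 }` and `{ from: 3, to: 5 }` would yield one group of two pages and one group of three.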

Weekly Report

Week 5 (June 30 - July 4)

Task Progress

  • The build error I was getting on the Gerrit patch has been resolved
  • Working on improvements based on comments and feedback from my mentors - almost done with that
  • Already developing a plan to tackle the second subtask

Challenges Faced

  • None

Weekly Report

Week 6 (July 7 - July 11)

  • This week I refactored the code for my Bulk OCR patch to meet all the requirements raised in my mentors' comments
  • My patch is approved and 99% ready to be merged - just need to update a tiny piece of code
  • I plan to start working on the second subtask - Task 2 immediately.

Challenges Faced:

  • None

Weekly Report

Week 7 (July 14 - July 18)

Tasks completed

  • Completed this task and got the core functionality of the Bulk OCR working - T394129
  • Had a check-in meeting with my mentors to discuss the next steps and how to proceed with my second subtask - T394130
  • Talked about handling announcements for the new feature to Wikisource communities
  • Looking into improving the OCR documentation to include the new feature

Challenges faced

  • None

Weekly Report

Week 8 (July 21 - July 25)

Tasks Completed:

  • Fixed an issue where Bulk OCR is not sending full image URLs - Patch
  • Started work on my second subtask - Add popup to allow engine and model selection before bulk OCR happens. I already have a Patch for this, currently looking at the comments from my mentors
  • Preparing a draft documentation of the new Bulk OCR feature
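
As background on the image URL fix mentioned above, here is a sketch of the sort of normalisation involved. It assumes the image sources were relative or protocol-relative; the report does not say what the actual patch changed. `toFullImageUrl` is a hypothetical name, and it relies only on the standard `URL` constructor.

```javascript
// Hypothetical helper: resolve an image source against the page URL so the
// OCR service always receives an absolute URL with a scheme and host.
function toFullImageUrl(src, baseUrl) {
  // new URL(relative, base) handles '//host/path', '/path' and absolute inputs.
  return new URL(src, baseUrl).href;
}
```

With a base of `https://en.wikisource.org/wiki/Index:X`, a source such as `//upload.wikimedia.org/a/b.jpg` would resolve to `https://upload.wikimedia.org/a/b.jpg`, while an already absolute URL passes through unchanged.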

Challenges Faced:

  • None

Weekly Internship Report

Week 9: July 28 - August 1

Tasks Completed:

  • Created a first draft of the documentation for the new Bulk OCR feature
  • Updated my patch for the second subtask to address comments from my mentors
  • Scheduled a demo for the new feature during the next Wikisource community meeting

Challenges Faced:

  • None

Weekly Internship Report

Week 10 (August 4 - August 8)

Task Progress:

  • In progress Patch - Add pop-up to allow engine and model selection before bulk OCR
  • Gearing up for a demo this Sunday at the Wikisource community meeting

Weekly Internship Report

Week 11: August 11 – August 15

Task Progress:

  • In progress Patch - Add pop-up to allow engine and model selection before bulk OCR
  • Presented the new Bulk OCR feature during the monthly Wikisource community meeting
  • Preparing a mockup for how to handle the last subtask - T394131