Profile Information
- Name: Lakshita Jain
- GitHub Profile: https://github.com/lakshita10341
- Resume:
- LinkedIn: Lakshita
- Location: Roorkee, Uttarakhand, India (IST, UTC+5:30)
- Typical working hours: weekday evenings and 9:00 AM–6:00 PM IST on weekends from mid-May to mid-July; 1:00 PM–9:00 PM IST daily after mid-July
Synopsis
Summary
Wikisource hosts thousands of digitized works (books, manuscripts, historical documents) that volunteers painstakingly transcribe page by page. OCR (Optical Character Recognition) already exists for individual pages via the Wikimedia OCR tool, but volunteers who want to OCR an entire book today must trigger that process one page at a time. There is no mechanism to say: *"OCR all 300 pages of this work, let me review the results, and then commit them to the wiki."*
This project delivers a robust, production-grade Bulk OCR pipeline for Wikisource by modernizing the existing workflow (building upon the foundations in T359703) and integrating features across two key repositories: the Wikisource extension (MediaWiki) and the Wikimedia OCR tool (backend).
The project focuses on three interconnected layers:
- Selection & Configuration (UI): Enhancing the Index-page UI in the Wikisource extension to allow authorized users to select page ranges and configure OCR parameters - including a new "rotate" option (T413556) to handle misaligned scans.
- Safety-First Feedback Loop: Implementing a human-in-the-loop review modal that surfaces OCR results page-by-page before any wiki writes occur, preventing garbled text from polluting the wiki layer.
- Resilient Execution & Error Handling: Fixing the "silent failure" issue where success messages are shown despite per-page errors, and building a controlled write-back pipeline that respects MediaWiki's authorization and rate-limiting constraints.
The result is a workflow that transforms a tedious, one-page-at-a-time process into a structured, reviewable, and high-quality mass-digitization pipeline.
Possible Mentors
@theprotonade (Parthiv Menon), @SGill (Satdeep Gill)
Have you contacted your mentors already?
I have been actively following the discussions on Phabricator (T411157, T413556, T365940) and have explored the codebases of both the Wikisource extension and the Wikimedia OCR tool to understand the integration points. I have introduced myself on Wikimedia Zulip and am awaiting a response from Parthiv Menon and Satdeep Gill to discuss this technical plan and ensure my approach aligns with their vision.
Deliverables
Project Size: 350 Hours (Large)
The architecture can be summarized as four interconnected stages. The pipeline enforces a strict no-write-without-review contract - OCR results are never committed to the wiki until the user has explicitly approved them in the feedback modal.
Phase 1 : Weeks 1–2: Research, Audit, and Design
Goal: Establish a thorough technical baseline before writing any production code.
- Audit the existing Wikisource extension (PHP/JS) codebase: understand how single-page OCR is currently triggered, how results are inserted into the text layer, and how user-group permissions are enforced.
- Study the Wikimedia OCR tool's API - available endpoints, OCR engine options, rate limits, error responses - and the legacy bulk OCR user script (T359703) to identify patterns worth preserving.
- Review T411157 and T413556 in Phabricator to understand the full scope of already-identified requirements and any design decisions already made by mentors.
- Identify the primary integration points: OcrController and assets/app.js in wikimedia-ocr for backend changes; the Index-page JS modules and PHP hooks in the Wikisource extension for the frontend.
- Draft a detailed technical design document: UI wireframes for the page-selection panel and feedback modal, the data flow between the extension and the OCR API, and the write-back sequence.
- Share the design doc with mentors for early feedback before implementation begins.
Key Deliverable: Approved technical design document (wireframes + data flow + identified PHP hooks and JS modules) shared with mentors; no production code written until design is signed off.
Phase 2 : Weeks 3–4: Authorization Layer and Page Selection UI
Goal: Build the gating mechanism and the entry point for the bulk OCR workflow.
- Implement user-group authorization checks - both client-side in JS and server-side in PHP - within the Wikisource extension to restrict bulk OCR access to permitted groups (e.g., autopatrolled, proofreadpage-autopatrol, or a dedicated bulkocr group - exact groups to be confirmed with mentors).
- Build the page selection UI using MediaWiki's OOUI (Object-Oriented User Interface) library to ensure a native look-and-feel consistent with the rest of the extension:
  - A panel accessible from the Index page of a work on Wikisource.
  - Allows selection of individual pages, a page range (e.g., pages 10–50), or the entire work.
  - Displays the current status of each page (without text, with text, proofread), fetched through the ProofreadPage query API (action=query&prop=proofread), so users can target only pages that still lack text.
  - Disabled or hidden entirely for users who do not meet the authorization criteria.
    // Conceptual structure of the BulkOcrController module.
    // Placeholder list: the exact groups are to be confirmed with mentors.
    const PERMITTED_GROUPS = ['autopatrolled', 'proofreadpage-autopatrol', 'bulkocr'];
    const BulkOcrController = {
        init() {
            // Guard: exit early for unauthorized users.
            if (!this.isUserAuthorized()) {
                return;
            }
            this.renderPageSelectionPanel();
        },
        isUserAuthorized() {
            // Check mw.config user groups against the permitted bulk OCR groups.
            const userGroups = mw.config.get('wgUserGroups') || [];
            return PERMITTED_GROUPS.some(g => userGroups.includes(g));
        },
        renderPageSelectionPanel() {
            /* build OOUI panel here */
        }
    };
- Write QUnit unit tests for the authorization check and the page selection logic.
Key Deliverable: Authorized users see the page selection panel on Index pages; unauthorized users see nothing; unit tests pass.
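The selection rules in this phase (individual pages, a range, or the entire work, optionally narrowed by page status) can be sketched as two pure functions. This is only an illustration under stated assumptions, not the extension's code: the names `parsePageSelection` and `selectPages`, the `"10-50, 72"` input syntax, and the `{ index, status }` page shape are all hypothetical.

```javascript
// Hypothetical helper: turn a spec such as "10-50, 72" into sorted page numbers.
// The input syntax is an assumption made for this sketch.
function parsePageSelection(spec, totalPages) {
  const selected = new Set();
  for (const part of spec.split(',')) {
    const token = part.trim();
    if (token === '') continue;
    // Accept "N" or "N-M" (hyphen or en dash between the numbers).
    const match = /^(\d+)(?:\s*[-–]\s*(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page selection: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end > totalPages || start > end) {
      throw new Error(`Page range out of bounds: "${token}"`);
    }
    for (let n = start; n <= end; n++) selected.add(n);
  }
  return [...selected].sort((a, b) => a - b);
}

// Hypothetical helper: keep only the selected pages whose status is wanted.
// `pages` is assumed to look like [{ index: 1, status: 'without-text' }, ...].
function selectPages(pages, spec, allowedStatuses) {
  const wanted = new Set(parsePageSelection(spec, pages.length));
  return pages.filter(
    (p) => wanted.has(p.index) && allowedStatuses.includes(p.status)
  );
}

const demoPages = [
  { index: 1, status: 'proofread' },
  { index: 2, status: 'without-text' },
  { index: 3, status: 'without-text' },
  { index: 4, status: 'with-text' },
];
const blankOnly = selectPages(demoPages, '1-3', ['without-text']);
console.log(blankOnly.map((p) => p.index)); // [ 2, 3 ]
```

Keeping this logic free of DOM and `mw` dependencies would let the QUnit tests cover it directly; selecting the entire work is the spec `1-N`, where N is the page count.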
Phase 3 : Weeks 5–6: OCR Engine Configuration and Job Execution
Goal: Allow users to choose their OCR configuration and execute the bulk OCR job against the Wikimedia OCR API.
- Add an OCR configuration section to the page selection panel using OOUI components:
  - A dropdown for the OCR engine (Tesseract, Google Vision, Transkribus).
  - New Rotate Option (T413556): A rotation selector (0°, 90°, 180°, 270°) passed as a parameter to the wikimedia-ocr API, allowing users to fix scan orientation before processing.
  - A language selector (pre-populated via mw.config from the work's language metadata where available).
  - Clear explanations surfaced via OOUI Tooltips.
- Implement the OCR job runner and fix T411157:
  - Update the wikimedia-ocr frontend / OcrController to correctly distinguish between partial success and total failure; success messages must appear only when all requested pages complete without error.
  - Implement sequential job execution in the extension, calling the wikimedia-ocr API one page at a time (or in small controlled batches to respect rate limits).
  - Show a live progress indicator ("OCR-ing page 34 of 120…").
  - Capture specific API error codes per page and store them in a local result object; nothing is written to the wiki yet.
  - A single failed page does not abort the entire job; it is flagged and processing continues.
    // Conceptual OCR runner - config carries engine, language, and rotation (T413556)
    async function runBulkOcr(pages, config) {
        const results = {};
        for (const page of pages) {
            try {
                results[page.title] = await callWikimediaOcrApi(page, config);
            } catch (err) {
                // A failed page is recorded and the loop continues.
                results[page.title] = { error: err.message };
            }
            updateProgressIndicator(page, pages.length);
        }
        return results;
    }
- Write QUnit tests for the job runner covering success, partial failure, and complete failure scenarios.
Key Deliverable: OCR job runs end-to-end for a set of selected pages; T411157 fixed; progress is shown; results are stored locally; no wiki writes occur yet.
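The distinction at the heart of T411157 (success only when every page succeeds) can be expressed as a small classification step over the per-page result object. The sketch below is an assumption about how that might look, not the tool's existing code: `summarizeOcrResults` is a hypothetical name, and the `{ text }` shape for a successful page is assumed, while `{ error }` follows the conceptual runner in this phase.

```javascript
// Hypothetical summary step: derive one overall outcome from per-page results.
// Assumed shapes: { [title]: { text } } on success, { [title]: { error } } on failure.
function summarizeOcrResults(results) {
  const succeeded = [];
  const failed = [];
  for (const [title, result] of Object.entries(results)) {
    if (result && result.error === undefined) {
      succeeded.push(title);
    } else {
      failed.push(title);
    }
  }
  let outcome;
  if (succeeded.length === 0) {
    outcome = 'failure'; // also covers an empty job: nothing succeeded
  } else if (failed.length === 0) {
    outcome = 'success';
  } else {
    outcome = 'partial';
  }
  return { outcome, succeeded, failed };
}

const summary = summarizeOcrResults({
  'Page:Book.pdf/1': { text: 'Chapter I' },
  'Page:Book.pdf/2': { error: 'HTTP 429' },
});
console.log(summary.outcome); // partial
```

With a three-way outcome, the UI can reserve the success message for the `'success'` case and list the failed titles otherwise.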
Phase 4 : Weeks 7–9: Feedback and Review Modal
Goal: Give users full visibility into OCR results before any text is committed to the wiki - the most safety-critical part of the workflow.
- Build the feedback modal (T365940) that opens automatically after the OCR job completes, using OOUI for accessibility and localization support:
  - Displays results page by page (or in a paginated list for large works).
  - For each page: shows a thumbnail of the original page image alongside the OCR-extracted text in an editable textarea.
  - Per-page action: Approve, Reject, or Edit then Approve.
  - A summary header: "X pages OCR-ed successfully, Y pages failed, Z pages pending review."
  - An "Approve All" shortcut for users who want to batch-approve after a quick scan.
  - Failed pages (API errors) are clearly flagged and excluded from the write-back by default.
- Ensure the modal is fully accessible: keyboard-navigable, ARIA roles on interactive elements, screen-reader-compatible labels - all handled via OOUI's built-in accessibility infrastructure.
- Ensure the modal is responsive and usable on tablet-sized viewports as well as desktop.
- Ensure all new UI strings are i18n-ready (added to en.json and qqq.json in both the Wikisource extension and the i18n/ directory of the OCR tool).
- Write QUnit tests for all modal states: loading, results populated, partial failure, all approved, all rejected, mixed.
Key Deliverable: The feedback modal correctly renders OCR results, allows per-page decisions, and produces a final approved/rejected page map ready for write-back - with no wiki writes yet.
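The modal's bookkeeping (per-page decisions, the summary header counts, and the final approved-page map) can be kept separate from the OOUI widgets as plain data transformations. The following is a minimal sketch under assumptions: the function names, the `'pending' | 'approved' | 'rejected'` decision values, and the state shape are all hypothetical.

```javascript
// Hypothetical review state: one entry per page, created from OCR results.
// Failed pages start as 'rejected' so they are excluded from write-back by default.
function createReviewState(results) {
  const state = {};
  for (const [title, result] of Object.entries(results)) {
    state[title] = result.error
      ? { error: result.error, decision: 'rejected' }
      : { text: result.text, decision: 'pending' };
  }
  return state;
}

// Record a decision; passing editedText models "Edit then Approve".
function decide(state, title, decision, editedText) {
  const entry = state[title];
  if (!entry) throw new Error(`Unknown page: ${title}`);
  if (entry.error && decision === 'approved') {
    throw new Error(`Cannot approve failed page: ${title}`);
  }
  const text = editedText === undefined ? entry.text : editedText;
  return { ...state, [title]: { ...entry, text, decision } };
}

// Counts for the summary header.
function reviewSummary(state) {
  const entries = Object.values(state);
  return {
    succeeded: entries.filter((e) => !e.error).length,
    failed: entries.filter((e) => e.error).length,
    pending: entries.filter((e) => e.decision === 'pending').length,
  };
}

// Only approved pages without errors are eligible for write-back.
function approvedPageMap(state) {
  const approved = {};
  for (const [title, entry] of Object.entries(state)) {
    if (entry.decision === 'approved' && !entry.error) {
      approved[title] = entry.text;
    }
  }
  return approved;
}

let state = createReviewState({
  'Page:Book.pdf/1': { text: 'Chaptcr I' },
  'Page:Book.pdf/2': { text: 'It was a dark night.' },
  'Page:Book.pdf/3': { error: 'HTTP 500' },
});
state = decide(state, 'Page:Book.pdf/1', 'approved', 'Chapter I');
console.log(reviewSummary(state)); // { succeeded: 2, failed: 1, pending: 1 }
console.log(approvedPageMap(state)); // { 'Page:Book.pdf/1': 'Chapter I' }
```

Because `decide` returns a new object instead of mutating the old one, each modal state listed in the QUnit bullet (all approved, all rejected, mixed) can be built and asserted on in isolation.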
Phase 5 : Weeks 10–11: Write-Back Pipeline
Goal: Commit approved OCR text into the correct text layer of each page on Wikisource.
The write-back pipeline handles three outcomes for every page: success, edit conflict (retry once), and hard API failure (log and continue). No silent failures.
- Implement the write-back pipeline:
  - Triggered only after the user clicks "Confirm and Save" in the feedback modal.
  - Iterates over the approved-page map, making a MediaWiki API edit (action=edit) for each page to insert OCR text into the page's text layer, following conventions already established by the single-page OCR flow.
  - Sends a CSRF edit token with every write; on an edit conflict, retries once after re-fetching the token and the page's latest revision, then surfaces a clear per-page error if the retry also fails.
  - Edits are tagged bulk-ocr for easier tracking and potential rollback by admins.
  - Shows a final progress indicator ("Saving page 12 of 47…") and a completion summary with links to any failed pages.
    // Conceptual write-back - uses constructor form new mw.Api()
    async function writeApprovedPages(approvedPages) {
        const api = new mw.Api();
        for (const [pageTitle, ocrText] of Object.entries(approvedPages)) {
            try {
                await api.postWithEditToken({
                    action: 'edit',
                    title: pageTitle,
                    text: ocrText,
                    summary: 'Bulk OCR via Wikisource extension (GSoC)',
                    tags: 'bulk-ocr'
                });
            } catch (err) {
                // mw.Api rejects with the API error code as its first value.
                try {
                    if (err !== 'editconflict') {
                        throw err;
                    }
                    await retryWithFreshToken(api, pageTitle, ocrText); // retry once
                } catch (finalErr) {
                    // Log and continue so one bad page never aborts the loop.
                    logFailure(pageTitle, finalErr);
                }
            }
        }
    }
- Write integration tests for the write-back: mock the MediaWiki API, verify correct payloads are sent, verify edit-conflict retry behavior, and verify hard failures are logged without aborting the loop.
Key Deliverable: Approved pages are saved to the wiki text layer; edit conflicts are retried once; hard failures are logged and reported; no silent data loss.
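The three outcomes described for this phase (success, edit conflict with one retry, hard failure) can be exercised without a wiki by injecting the edit call. The sketch below is illustrative only: `writeBack`, `saveWithOneRetry`, and the `saveEdit(title, text)` callback are hypothetical names, and the callback is assumed to reject with an error code string, mirroring how `mw.Api` reports errors.

```javascript
// Hypothetical write-back loop with the edit call injected for testability.
// `saveEdit(title, text)` resolves on success and rejects with an error
// code string such as 'editconflict' (an assumption for this sketch).
async function writeBack(approvedPages, saveEdit) {
  const report = { saved: [], failed: [] };
  for (const [title, text] of Object.entries(approvedPages)) {
    try {
      await saveWithOneRetry(title, text, saveEdit);
      report.saved.push(title);
    } catch (code) {
      // Hard failure: record it and continue with the next page.
      report.failed.push({ title, code: String(code) });
    }
  }
  return report;
}

async function saveWithOneRetry(title, text, saveEdit) {
  try {
    await saveEdit(title, text);
  } catch (code) {
    if (code !== 'editconflict') throw code;
    await saveEdit(title, text); // retry once; a second failure propagates
  }
}

// Mock: page 2 conflicts once and then succeeds; page 3 always fails.
function makeMockSaveEdit() {
  const attempts = {};
  return async (title) => {
    attempts[title] = (attempts[title] || 0) + 1;
    if (title === 'Page:Book.pdf/2' && attempts[title] === 1) {
      throw 'editconflict';
    }
    if (title === 'Page:Book.pdf/3') {
      throw 'protectedpage';
    }
  };
}

writeBack(
  { 'Page:Book.pdf/1': 'a', 'Page:Book.pdf/2': 'b', 'Page:Book.pdf/3': 'c' },
  makeMockSaveEdit()
).then((report) => console.log(report));
```

In this run, pages 1 and 2 end up in `saved` and page 3 in `failed`. A real retry would also refresh the page's base revision before resubmitting, which the mock does not model.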
Phase 6 : Week 12: Polish, Documentation, and Cleanup
Goal: Deliver a production-ready, well-tested, well-documented feature.
- Complete QUnit coverage: authorization, page selection, OCR runner, feedback modal, write-back - including all error and edge-case paths.
- Perform end-to-end manual testing on a Wikisource staging environment across multiple browsers (Chrome, Firefox, Safari).
- Verify that the UI degrades gracefully for non-authorized users (no panel visible, no JS errors in the console).
- Update the Wikisource extension documentation and any relevant on-wiki documentation pages to describe the new bulk OCR workflow, the authorization requirements, and the OCR engine options.
- Ensure all new UI strings are i18n-ready in both repositories (en.json + qqq.json).
- Complete final code cleanup, respond to all open Gerrit review comments, and confirm there are no regressions in the existing single-page OCR flow.
Key Deliverable: All patches merged, CI green, documentation updated, all mentor review comments resolved.
Evaluation Plan
Mid-term Evaluation (Week 7)
- The authorization layer, page selection UI, OCR engine configuration (including the rotate option from T413556), and OCR job runner are all functional. T411157 is fixed. An authorized user can select pages, choose an OCR engine, run the OCR job, and see a progress indicator - with results stored locally. The feedback modal is partially built (renders results; per-page approve/reject not yet complete).
Final Evaluation (Week 12)
- The complete bulk OCR pipeline is live: page selection → OCR configuration → job execution → feedback/review modal → write-back. All QUnit tests pass, the feature is documented on-wiki and in extension docs, and the existing single-page OCR flow has no regressions.
Timeline Summary
| Weeks | Phase | Key Deliverable |
|---|---|---|
| 1–2 | Research & Audit | Approved technical design doc (wireframes + data flow + identified hooks/modules) |
| 3–4 | Authorization + Page Selection UI | Authorized users see OOUI page selector; unit tests pass |
| 5–6 | OCR Config + Job Execution | T411157 fixed; bulk OCR job runs end-to-end; results stored locally |
| 7–9 | Feedback & Review Modal | T365940 implemented; per-page approve/reject modal; all states tested |
| 10–11 | Write-Back Pipeline | Approved pages committed to wiki text layer; retry + error logging |
| 12 | Polish, Docs, Cleanup | Full test coverage, docs updated, CI green, no open reviews |
Participation
Communication: I will post daily progress summaries to the project's communication channel (Wikimedia Zulip, per mentor preference). I will schedule a weekly check-in with my mentors to review open patches and surface blockers early. Gerrit review comments will receive a response within 24 hours.
Progress Tracking: I will maintain a public weekly log (Wikimedia user page or personal blog) documenting what I built, what I learned, and the plan for the next week - shared with mentors at the start of each week.
Publishing Code: All work will be submitted as patches to Gerrit following MediaWiki's contribution guidelines, commit message conventions, and testing requirements. No phase is considered complete until its patch has passed CI and received mentor approval.
Availability: I can commit 35–40 hours per week throughout the GSoC period. During the Salesforce internship (mid-May to mid-July), this will be distributed across evenings (4 hrs/day) and full weekends. After mid-July, my course structure is flexible and I can contribute full-time. Any planned absence will be communicated to mentors at least one week in advance.
About Me
Education: I am a 3rd-year B.Tech student in Electrical Engineering at the Indian Institute of Technology, Roorkee, entering my 4th year this summer. CGPA: 8.83.
Other time commitments: I have an internship at Salesforce Hyderabad from mid-May to mid-July (weekdays, 9 AM–6 PM IST). I have planned my GSoC schedule around this: 4 hours each evening plus full weekends during that period easily covers the required weekly hours. Post mid-July, my 4th-year schedule is flexible and I can contribute full-time.
GSoC and other applications? I am applying for GSoC with Wikimedia for this project and for one other Wikimedia project (Programs & Events Dashboard — System Stats), which is my primary choice.
What does making this project happen mean to you? Wikisource's mission - to make humanity's public-domain written heritage freely available to everyone - is one of the most tangible, lasting things the Wikimedia movement does. Right now, volunteers transcribing an entire book must click through OCR one page at a time, which is tedious enough that it becomes a real barrier to contribution. Building bulk OCR with a proper review step means a volunteer who might have given up after twenty pages can now process an entire work in a single sitting - and review the results before committing them, so quality stays high. For me, this is exactly the kind of infrastructure work that multiplies the effort of every future contributor. Having already shipped several merged PRs in the WikiEduDashboard project and built production systems handling thousands of users at IIT Roorkee, I know how to navigate a large open-source codebase, write code that survives code review, and deliver features that real users depend on. I am excited to bring that same approach to Wikisource.
Past Experience
Wikimedia Contributions (Merged PRs - WikiEduDashboard)
| PR | Description |
|---|---|
| #6593 | Conditionally hide wikidata stats tab |
| #6599 | Post instructor userpage template on course approval |
| #6612 | Add "Update Scheduled" state to admin course actions |
| #6655 | Add setting for block bots |
| #6669 | Skip /requested_accounts.json API call on irrelevant routes |
| #6670 | Implemented a 30-second cache for /requested_accounts.json responses |
| #6675 | Add dynamic refresh functionality to the notification bell |
| #6687 | Added a 5-second timestamp-based throttle to API.fetchNews() |
| #6766 | Add regression test for suppressed parent revision bug |
| #20 | fix: calculate byte change using the previous revision's length when the parent revision is missing. |
PRs in Review
| PR | Description |
|---|---|
| #6732 | Add article_scoped flag and logic to base Course model |
| #6708 | feat: Add survey completion time tracking and display for analytics |
Relevant Technical Experience
- JavaScript / Frontend Architecture: My merged WikiEduDashboard PRs involved JavaScript work ranging from API throttling and caching patterns to React component state and dynamic UI updates. This maps directly to the JS-heavy nature of the Wikisource extension.
- Production Systems at Scale: As Project Leader at IIT Roorkee's Information Management Group, I maintain the Placement & Internship Portal used by 10,000+ students - handling race conditions, data consistency, and performance optimization under real production load. Building a write-back pipeline that handles edit conflicts and partial failures reliably draws on exactly this kind of experience.
- Analytics and Event Tracking (Internship): During my backend internship, I built analytics infrastructure for tracking user interaction sequences on a gamification platform - designing systems that handle high-frequency events, store results reliably, and surface structured summaries. The OCR job runner (execute → store results → surface to user for review) follows a similar pattern.
- Open Source Contributions: Completed Hacktoberfest 2024 with successful contributions across multiple repositories.

