Project title
Bulk OCR on Wikisource
Brief summary
Wikisource is an online, wiki-based digital library of free-content textual sources operated by the Wikimedia Foundation. The Bulk OCR feature aims to give volunteers an easy way to OCR multiple pages, or even an entire book, on Wikisource. The ability to run bulk OCR on a work should, however, be restricted to certain user groups. To this end, the Wikisource extension needs features that let authorized users OCR multiple pages at once and insert the OCRed text into the text layer of the corresponding pages of the book on Wikisource.
Expected outcomes
By the end of the project, contributors will have written well-documented code that improves the existing bulk OCR workflow, which allows authorized users to OCR the pages of a particular work on Wikisource. They are expected to add a way to select among different OCR configurations and a feedback loop that runs before the write-to-wiki process is triggered. They are also expected to update any existing documentation pages to describe the new workflow.
Skills required/preferred
JavaScript, HTML, CSS, and familiarity with object-oriented programming and with the Wikimedia OCR project. Experience with PHP and MediaWiki is a bonus.
Possible mentors
Parthiv Menon (@theprotonade), Satdeep Gill (@SGill)
Expected size of the project
350 hours
Rating
Medium
Microtasks
Any other additional information for contributors:
- T359703
- Merged patch for bulk OCR
- Wikimedia OCR tool
- Wikimedia OCR source code
- Wikimedia OCR documentation
- Wikisource extension
- Legacy user script for bulk OCR
Why are you proposing this project?
Benefits the Wikisource community
What is the expected impact?
Users who can perform bulk OCR will see feedback on the pages being OCRed before approving the entire OCR process, which helps reduce the amount of faulty OCRed text inserted into the wiki. Success would typically mean having proper UI elements for choosing among different OCR engines and a feedback modal for approving or rejecting the entire OCR process.
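As a rough illustration of the approval step, the sketch below shows one way the data behind a feedback modal could be prepared: summarize the OCR results for review, then write only usable pages and only after explicit approval. Every name in it (OcrResult, summarize, pagesToWrite, the sample page titles) is hypothetical and chosen for illustration; none of it is the Wikisource extension's or the Wikimedia OCR tool's actual API, and the real OCR and write calls are deliberately left out.

```typescript
// Hypothetical shapes; not taken from the Wikisource extension or Wikimedia OCR.
type OcrResult = { page: string; text: string; error?: string };
type Review = { total: number; empty: string[]; failed: string[] };

// Build the summary a feedback modal could show before anything is written.
function summarize(results: OcrResult[]): Review {
  const failed = results.filter((r) => r.error !== undefined).map((r) => r.page);
  const empty = results
    .filter((r) => r.error === undefined && r.text.trim() === "")
    .map((r) => r.page);
  return { total: results.length, empty, failed };
}

// Decide which pages may be written: nothing without approval,
// and never a page whose OCR failed or came back blank.
function pagesToWrite(results: OcrResult[], approved: boolean): OcrResult[] {
  if (!approved) {
    return [];
  }
  return results.filter((r) => r.error === undefined && r.text.trim() !== "");
}
```

Keeping these two functions free of network and UI calls is a design choice made here for the sketch: the same logic could then back a modal in the browser or be exercised in unit tests without an OCR service.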