This is an investigation card to figure out:
- How Tesseract currently works on English, French, etc.
- Can Tesseract already work on Indic languages, and is it just not enabled right now? (Asaf has a tool that may be able to do this.)
- Should the enable-OCR gadget be a default gadget, part of the ProofreadPage extension, or a separate extension?
- How does the current Indic language workflow need to change to incorporate Google OCR? How is Bengali Wikisource using Google Drive?
- Compare Tesseract to Google OCR: if Google is cheap, sending everything there might give better quality or be easier. Should we replace Tesseract for the languages currently using it?
- Find out about the OCR API service on Tool Labs that the Wikisource gadget uses. Who owns it? What programming language is it written in?
The outcome of this ticket should be a set of follow-up tickets.
There's information on the English Wikisource OCR workflow, with screenshots, and a list of active Indic language Wikisources at
https://meta.wikimedia.org/wiki/Community_Tech/Google_OCR_for_Indic_language_Wikisources/notes