Build an OCR tool for Wikisource so we don't need to rely on external services.
| Status | Assigned | Task |
| --- | --- | --- |
| Open | None | T161979 Optimize OCR model for Wikisource for each book based on initial proofreading |
| Resolved | Samwilson | T161978 Epic: Generalized OCR for Wikisource |
| Resolved | aezell | T244100 Spike: New/Improved OCR tool [8 hours] |
| Resolved | aborrero | T247422 Update Tesseract on Toolforge to v4.1.0 |
| Resolved | kaldari | T246944 Improve OCR: Test accuracy and features of various OCR engines |
@ifried, @Samwilson, @aezell - I talked with Alexandros Kosiaris about how we could communicate with Google's OCR API from a production extension (similar to what Content Translation is already doing). He informed me that all you have to do is proxy the API requests through the HTTP proxy specified by $wgCopyUploadProxy. Thus it should be relatively easy to move Wikisource OCR into a MediaWiki extension, if we decide we want to do that.
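To make the proxying idea concrete, here is a rough sketch of sending an OCR request through an HTTP proxy. It is in Python purely for illustration; a real extension would be PHP and would read the proxy from $wgCopyUploadProxy. It assumes "Google's OCR API" means the Cloud Vision `images:annotate` endpoint. The proxy URL, API key, and function names are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the OCR API in question is Google Cloud Vision's REST endpoint.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def make_proxied_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_ocr_request(image_url, api_key):
    """Build (but do not send) a text-detection request for one image URL."""
    body = {
        "requests": [
            {
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        f"{VISION_ENDPOINT}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_via_proxy(image_url, api_key, proxy_url):
    """Send the request through the proxy and return the decoded JSON response."""
    opener = make_proxied_opener(proxy_url)
    with opener.open(build_ocr_request(image_url, api_key), timeout=30) as resp:
        return json.load(resp)
```

The request is built separately from the opener so the payload can be inspected or tested without any network access; only `ocr_via_proxy` actually goes through the proxy.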
No, there is no production/external API for Tesseract, and we would not want to call Toolforge from production. If Tesseract is needed, we'll need to request that Platform Engineering build a service for that. Since Platform Engineering already has a large backlog, we should make that request as soon as we are sure that Tesseract would be needed in production, as it could take a long time (a year or more) to get such a service up and running in production.