
Epic: Generalized OCR for Wikisource
Closed, Resolved (Public)


Build an OCR tool for Wikisource so we don't need to rely on external services.

Event Timeline

kaldari renamed this task from Generalized OCR for Wikisource to Epic: Generalized OCR for Wikisource. Oct 19 2020, 5:09 PM
kaldari added a project: Epic.

@ifried, @Samwilson, @aezell - I talked with Alexandros Kosiaris about how we could communicate with Google's OCR API from a production extension (similar to what Content Translation is already doing). He informed me that all you have to do is proxy the API requests through the HTTP proxy specified by $wgCopyUploadProxy. Thus it should be relatively easy to move Wikisource OCR into a MediaWiki extension, if we decide we want to do that.
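A minimal sketch of the proxying idea above, written in Python for illustration rather than the PHP a MediaWiki extension would use. The proxy address, the `build_ocr_request` and `build_proxied_opener` helpers, and the API key handling are all assumptions made for this example; in production the proxy address would come from $wgCopyUploadProxy, which this sketch does not read.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical proxy address, standing in for the value of $wgCopyUploadProxy.
PROXY_URL = "http://proxy.example.org:8080"
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_url, api_key):
    """Build (but do not send) a Cloud Vision text-detection request."""
    body = {
        "requests": [
            {
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        f"{VISION_ENDPOINT}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_proxied_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Sending the request would then look like `build_proxied_opener(PROXY_URL).open(build_ocr_request(image_url, api_key), timeout=30)`. Keeping request construction separate from the proxied opener means the proxy setting can change without touching the API call itself.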

That sounds like a great idea.

That would work for the Google Cloud Vision API, but is there a production/external API for Tesseract? Or is it okay to call Toolforge for that?

No, there is no production/external API for Tesseract, and we would not want to call Toolforge from production. If Tesseract is needed, we'll need to request that Platform Engineering build a service for that. Since Platform Engineering already has a large backlog, we should make that request as soon as we are sure that Tesseract would be needed in production, as it could take a long time (a year or more) to get such a service up and running in production.

@Samwilson this was the ticket listed on the community wishlist. What do we consider the status of this now?

Samwilson claimed this task.
Samwilson added a project: Wikimedia OCR.

I think we can call this done!

We do still have an external service (Google Cloud Vision API), but we don't rely on it (we have the internal Tesseract as the default).