Epic: Generalized OCR for Wikisource
Open, Needs Triage · Public

Description

Build an OCR tool for Wikisource so that we don't need to rely on external services.

Event Timeline

Halfak created this task. Apr 2 2017, 12:42 PM
Restricted Application added a subscriber: Aklapper. Apr 2 2017, 12:42 PM
Pols12 added a subscriber: Pols12.
Xover added a subscriber: Xover. Oct 31 2019, 7:31 AM
Ltrlg added a subscriber: Ltrlg. Jan 2 2020, 7:44 AM
kaldari renamed this task from Generalized OCR for Wikisource to Epic: Generalized OCR for Wikisource. Oct 19 2020, 5:09 PM
kaldari added a project: Epic.

@ifried, @Samwilson, @aezell - I talked with Alexandros Kosiaris about how we could communicate with Google's OCR API from a production extension (similar to what Content Translation is already doing). He informed me that all you have to do is proxy the API requests through the HTTP proxy specified by $wgCopyUploadProxy. Thus it should be relatively easy to move Wikisource OCR into a MediaWiki extension, if we decide we want to do that.
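For illustration only, the proxying approach described above could look roughly like the sketch below. It is written in Python rather than as MediaWiki extension code, and it is not the Content Translation implementation. The proxy URL is a placeholder standing in for whatever $wgCopyUploadProxy is set to, the helper names are hypothetical, and the request shape assumes the public Cloud Vision `images:annotate` REST endpoint with an API key.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Cloud Vision REST endpoint for image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"
# Placeholder only; in production this would be the value of $wgCopyUploadProxy.
EXAMPLE_PROXY = "http://proxy.example.org:8080"


def build_ocr_payload(image_url):
    """Build the JSON body asking Cloud Vision for document text detection."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }


def build_ocr_request(image_url, api_key):
    """Build (but do not send) the POST request for one page image."""
    query = urllib.parse.urlencode({"key": api_key})
    body = json.dumps(build_ocr_payload(image_url)).encode("utf-8")
    return urllib.request.Request(
        VISION_ENDPOINT + "?" + query,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_proxied_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Sending the request would then be `build_proxied_opener(EXAMPLE_PROXY).open(build_ocr_request(url, key))`, which needs a reachable proxy and a valid API key, so it is left out of the sketch.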

That sounds like a great idea.

That would work for the Google Cloud Vision API, but is there a production/external API for Tesseract? Or is it okay to call Toolforge for that?

No, there is no production/external API for Tesseract, and we would not want to call Toolforge from production. If Tesseract is needed, we'll need to request that Platform Engineering build a service for that. Since Platform Engineering already has a large backlog, we should make that request as soon as we are sure that Tesseract would be needed in production, as it could take a long time (a year or more) to get such a service up and running in production.

MJL added a subscriber: MJL. Mon, Nov 16, 5:59 PM