
Investigation: Tool to use Google OCR in Indic language Wikisources
Closed, Resolved · Public · 8 Story Points

Description

Based on T120788: Tool to use Google OCRs in Indic language Wikisource

This is an investigation card to figure out:

  • How Tesseract currently works on English, French, etc.
  • Can Tesseract already work on Indic languages, with support just not enabled right now? (Asaf has a tool that may be able to do this.)
  • Should the enable-OCR gadget be enabled by default, be part of the ProofreadPage extension, or be a separate extension?
  • How does the current Indic language workflow need to change to incorporate Google OCR? How is Bengali Wikisource using Google Drive?
  • Comparing Tesseract to Google OCR: if Google is cheap, it might be better quality/easier to send everything there. Should we replace Tesseract for the languages currently using it?
  • Find out about the OCR API service on Tool Labs that the Wikisource gadget uses. Who owns it? What programming language is it written in?

The outcome of this ticket should be writing a bunch more tickets.

On https://meta.wikimedia.org/wiki/Community_Tech/Google_OCR_for_Indic_language_Wikisources/notes
there's information on the English Wikisource OCR workflow, with screenshots, and a list of active Indic language Wikisources.

Event Timeline

DannyH created this task. Jul 11 2016, 11:07 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 11 2016, 11:07 PM
DannyH updated the task description. Jul 14 2016, 5:15 PM
DannyH updated the task description. Jul 14 2016, 5:30 PM
kaldari updated the task description. Jul 14 2016, 5:56 PM
DannyH updated the task description. Jul 14 2016, 5:57 PM
DannyH set the point value for this task to 8. Jul 14 2016, 5:59 PM
DannyH moved this task from To be estimated/discussed to Estimated on the Community-Tech board.
DannyH updated the task description. Jul 19 2016, 12:24 AM
kaldari raised the priority of this task from Normal to High. Jul 25 2016, 11:50 PM
Niharika claimed this task. Jul 28 2016, 8:48 PM

Format for a Google OCR API request when the image is hosted on Google Cloud Storage (for testing):

POST https://vision.googleapis.com/v1/images:annotate?key={YOUR_API_KEY}
{
 "requests": [
  {
   "features": [
    {
     "type": "TEXT_DETECTION"
    }
   ],
   "image": {
    "source": {
     "gcsImageUri": "gs://ocr-testing/Fauna_of_British_India.jpg"
    }
   }
  }
 ]
}

To use on an image not hosted in Google Cloud Storage:

POST https://vision.googleapis.com/v1/images:annotate?key={YOUR_API_KEY}
{
 "requests": [
  {
   "features": [
    {
     "type": "TEXT_DETECTION"
    }
   ],
   "image": {
    "content": "XXXXXXX"
   }
  }
 ]
}

...where XXXXXXX is the base64-encoded image data.
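
For anyone scripting the inline variant, here is a minimal Python sketch of how the request body could be assembled. It assumes the "content" field takes the image bytes as a base64 string; the function name build_inline_request is made up for illustration and is not part of any existing tool.

```python
import base64
import json


def build_inline_request(image_bytes: bytes) -> dict:
    """Build an images:annotate body that carries the image inline.

    Hypothetical helper: it only assembles the JSON shown above and
    does not contact the API.
    """
    # JSON cannot carry raw bytes, so the image is base64-encoded to text.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "features": [{"type": "TEXT_DETECTION"}],
                "image": {"content": encoded},
            }
        ]
    }


def load_and_build(path: str) -> str:
    """Read an image file and return the request body as a JSON string."""
    with open(path, "rb") as handle:
        return json.dumps(build_inline_request(handle.read()))
```

The string returned by load_and_build() is what would be sent as the POST body to the images:annotate URL above.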

Sad news. Looks like the Google Vision API does not support Indic languages:
https://developers.google.com/vision/text-overview

We tested it with some Hindi and Kannada texts, and the API consistently either errored out or detected the text as a non-Indic language. This was true even when we used the "languageHints" parameter to explicitly tell the API that the text was Hindi.

It's strange that Google Drive would support OCR for Indic languages but the Google Vision API would not. I'm going to ask my contact at Google about this and see if they know why that's the case.

Heard back from Google. They sent me the full list of languages supported by the Google Vision API (which is much longer than the list in their documentation). It includes several Indic languages, including Hindi. Guess we need to do some more testing.

The full list is:

Latin scripts: Afrikaans, Catalan, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Croatian, Hungarian, Icelandic, Indonesian, Italian, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Turkish, Vietnamese

Cyrillic scripts: Russian, Ukrainian, Bulgarian, Serbian

CJK scripts: Chinese, Japanese, Korean

"Aksara" scripts: Arabic, Persian, Pushto, Urdu, Assamese, Bengali, Azerbaijani, Belarusian, Kazakh, Kirghiz, Macedonian, Mongolian, Uzbek, Hindi, Marathi, Nepali, Sanskrit, Tamil

Apparently, the Vision API will only detect Hindi if you provide a "languageHints" entry set to the 2-letter code for Hindi ("hi"):

{
 "requests": [
  {
   "features": [
    {
     "type": "TEXT_DETECTION"
    }
   ],
   "image": {
    "source": {
     "gcsImageUri": "gs://ocr-testing/hindi/Omniglot.gif"
    }
   },
   "imageContext": {
    "languageHints": [
     "hi"
    ]
   }
  }
 ]
}

At least this gives us a potential path forward :)
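
A rough Python sketch of the hinted request, for reference. The helper names are hypothetical, and the response shape used in extract_text() (full text in the first "textAnnotations" entry, per-image failures under "error") is an assumption about the API's reply, since no response is shown in this task.

```python
import json
import urllib.request

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_hinted_request(gcs_uri: str, language: str) -> dict:
    """Build a TEXT_DETECTION body with a single language hint."""
    return {
        "requests": [
            {
                "features": [{"type": "TEXT_DETECTION"}],
                "image": {"source": {"gcsImageUri": gcs_uri}},
                "imageContext": {"languageHints": [language]},
            }
        ]
    }


def extract_text(response: dict) -> str:
    """Pull the recognized text out of a reply (assumed response shape)."""
    first = response["responses"][0]
    if "error" in first:
        raise RuntimeError(first["error"].get("message", "unknown error"))
    annotations = first.get("textAnnotations", [])
    # Assumption: the first annotation holds the whole detected text.
    return annotations[0]["description"] if annotations else ""


def annotate(body: dict, api_key: str) -> dict:
    """POST the body to the Vision API and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Usage would be along the lines of extract_text(annotate(build_hinted_request("gs://ocr-testing/hindi/Omniglot.gif", "hi"), api_key)), with api_key supplied by whoever runs the test.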

With the investigation done, here are the tasks I propose for the project:

  • Create a wrapper on Tool Labs to interact with the Google API
  • Create a wikitext editor gadget:
    • On clicking it, the gadget queries the Google API with the image of the book on that page
    • It populates the content div with the text it got back from the API
    • Note that it needs to grab the content language (probably a ULS global variable) and use it as a language hint while making the API request
  • Deploy and test the gadget on a test Wikisource
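
To make the wrapper idea concrete, here is a minimal sketch of what the Tool Labs side might look like as a plain WSGI app in Python. Everything in it is an assumption for discussion: the "image" and "lang" query parameter names, the JSON reply format, and the run_ocr callable (which would do the actual Google API call) are all hypothetical.

```python
import json
from urllib.parse import parse_qs


def make_app(run_ocr):
    """Return a WSGI app that proxies OCR requests from the gadget.

    run_ocr(image_url, lang) -> str is a hypothetical callable that
    sends the image to the Google API with lang as the language hint.
    """

    def app(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        image = params.get("image", [""])[0]
        lang = params.get("lang", [""])[0]
        if not image or not lang:
            # Both values are needed: the page image and the language hint.
            status = "400 Bad Request"
            payload = {"error": "image and lang are required"}
        else:
            status = "200 OK"
            payload = {"text": run_ocr(image, lang)}
        # ensure_ascii=False keeps Indic text readable in the reply.
        body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app
```

Keeping the API key inside the wrapper rather than in the gadget would mean it never reaches the browser, which is one reason to route requests through Tool Labs instead of calling Google directly.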

@Niharika: That sounds like a good set of tasks. Unfortunately, there is no Test Wikisource, but we should be fine testing it as a user script on English Wikisource. There isn't much risk since the gadget/user script won't actually be making any edits.

kaldari closed this task as Resolved. Aug 15 2016, 4:58 PM
kaldari moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.
DannyH moved this task from Estimated to Archive on the Community-Tech board.
Ankry added a subscriber: Ankry. Jul 19 2018, 7:14 PM