
Tool to use Google OCRs in Indic language Wikisource
Closed, ResolvedPublic

Description

For a long time, Indic-language Wikisource projects have depended entirely on manual proofreading, which wastes a lot of time and energy. Recently Google released OCR software for more than 20 Indic languages. This software is far better and more accurate than the previous OCRs, but it has many limitations. Uploading the same large file twice (once for Google OCR and again at Commons) is not an easy solution for most contributors, as Internet connections in India are slow. What I suggest is to develop a tool which can feed the PDF or DjVu files already uploaded to Commons directly to Google's OCR, so that uploading them twice can be avoided. -- Bodhisattwa (talk) 13:50, 10 November 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 39 support votes, and was ranked #25 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#Tool_to_use_Google_OCRs_in_Indic_language_Wikisource


Related Objects

Event Timeline


@Tshrinivasan I missed the change, sorry for not following up sooner. I am looking into it: we (WMF) may actually be able to negotiate something with Google -- I think they offer access to it as part of their cloud computing services: https://cloud.google.com/vision/

I'd be happy to help with the project at the Hackathon, if needed. :)

@Niharika and @Tshrinivasan I will be at the hackathon, so I would love to talk to you at Wikimania about the code and the problem this bug is solving -- I want to make sure the conversations we start help you in this work.

I'm also interested in helping with this during Wikimania.

Looking into switching to Google Cloud Platform's Cloud Vision API. According to their pricing guide:

1 - 1,000 units/month: Free
1,001 - 1 Million units/month: $2.50/1,000 units
1,000,001 - 5 Million units/month: $2.00/1,000 units
5,000,001 - 20 Million units/month: $0.60/1,000 units

They don't offer any suggestion for what a "unit" is, but I suppose it is an image of a page.
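
For a rough sense of scale, here is a small sketch (not part of this thread) that applies the tiered prices above, assuming one "unit" corresponds to one OCR-ed page image:

```
# Sketch: estimate monthly cost from the tiers quoted above,
# assuming one "unit" = one OCR-ed page image.
def monthly_cost(units):
    tiers = [
        (1_000, 0.00),        # first 1,000 units/month are free
        (1_000_000, 2.50),    # 1,001 - 1M: $2.50 per 1,000 units
        (5_000_000, 2.00),    # 1,000,001 - 5M: $2.00 per 1,000 units
        (20_000_000, 0.60),   # 5,000,001 - 20M: $0.60 per 1,000 units
    ]
    cost, prev_cap = 0.0, 0
    for cap, price_per_1000 in tiers:
        in_tier = min(units, cap) - prev_cap
        if in_tier <= 0:
            break
        cost += in_tier / 1000 * price_per_1000
        prev_cap = cap
    return cost

print(monthly_cost(12_000))   # ~27.5 (USD)
```

On that assumption, even a fairly busy project OCR-ing around 12,000 pages a month would pay roughly $27.50.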

Google might give us special limits. They did so for Google books API for
the BUB tool.

What are the languages that are currently unsupported by the OCR on WikiSource?

I know the Indic Languages are listed at http://wiki.wikimedia.in/List_of_Indian_language_wiki_projects

I created a brief list of the languages supported by different OCR tools and the Wikisource languages: https://docs.google.com/spreadsheets/d/1TRBWwxOxWWodJNIxrOFOF-mUDo7MOanIVCrFqRK0dP4/edit?usp=sharing .

@kaldari & @Niharika Currently the OCR gadget depends on external tools built around Tesseract OCR (see the list of languages at https://github.com/tesseract-ocr/langdata), which are used when HOCR text is not available in the file itself (PDFs and/or DjVus can carry HOCR text with them if available). See https://wikisource.org/wiki/MediaWiki:OCR.js .

@Phe maintains the Tesseract-based OCR system, which runs on Tool Labs at http://tools.wmflabs.org/phetools/ . The gadget has to make a call to it (I was under the wrong impression that it was built into the ProofreadPage extension). Code is at https://github.com/wikisource/ocr-tools .

It looks like you would need to create another option in either the JS or the tool in PheTools that supports the alternative OCR.

According to the interface at http://tools.wmflabs.org/phetools/ocr.php there is an "ocr robot". Is that the primary tool that performs the OCRing? Does anyone know what the actual workflow is here? i.e. when and how the OCRing actually takes place?

@Ijon I came upon http://wikisource-l.wikimedia.narkive.com/0IJXx5mq/ocr-as-a-service and consequently http://openocr.wmflabs.org (returning a 502 now). Maybe you can tell us about your experience using Tesseract? Did you enable it for all languages or just English?
Looking at https://github.com/tesseract-ocr/langdata it seems that quite a few Indic languages have training data but it's not clear if the results were inaccurate or there are other issues with it.
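
For anyone who wants to check the quality question directly, a quick local test might look like this (a sketch only, assuming Tesseract is installed with the relevant traineddata, e.g. ben.traineddata, plus the pytesseract and Pillow packages; the file name is a placeholder):

```
# Local quality check for Tesseract on a scanned page image (sketch only).
from PIL import Image
import pytesseract

page = Image.open("page-0001.png")                    # one scanned page
text = pytesseract.image_to_string(page, lang="ben")  # or "tam", "hin", etc.
print(text)
```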

Hey, so here is a brief report of information I gathered during Wikimania. The thing is that there's already a working script (including for Indic languages) somewhere which works as a CLI tool. I didn't note where, but @8ohit.dua should be able to tell more about it. He and I teamed up to launch ocr4wikisource on Tool Labs, that is, a web frontend for the previously mentioned script.

So far I've been busy on other tasks, and I still have to finish my Flask training before I can do anything really useful regarding this development.

Ha, here is the CLI interface, I guess: https://github.com/tshrinivasan/OCR4wikisource

Correct. The script requires you to download it and run it locally on your computer. It works by splitting your PDF into 10-page chunks, uploading them to Google Drive, OCR-ing them using a command, downloading the OCR-ed text back and reintegrating the chunks. Our initial plan was to make this available on Tool Labs, but this process can be made easier with the help of Google API access and will need some further thinking.

I am interested in also looking at how well Tesseract does this.

Well, Google did an overview of Tesseract, and used to host its repository, so maybe that's what they are using internally. They may have unpublished changes, and the training data set used internally may be different from the one available on GitHub, who knows… But at least trying the GitHub version shouldn't be a big deal.

Other links which may be of some interest in this context (or not), extracted from this French post on OCR and Linux:

Hi @Niharika, one small correction to your previous comment:

It works by splitting your PDF into 10-page chunks, uploading them to Google Drive, OCR-ing them using a command, downloading the OCR-ed text back and reintegrating the chunks.

The present script splits the file into single pages, not 10-page chunks. It then uploads the pages one by one to Google Drive, OCRs them, downloads the OCR-ed text files to the local PC, and then uploads the text page by page to Wikisource.
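
To make that workflow concrete, here is a rough sketch of the per-page Google Drive OCR step; it is an approximation for illustration only, not OCR4wikisource's actual code, and it assumes an already-authenticated google-api-python-client Drive v3 `service` object:

```
# Sketch of the per-page Google Drive OCR step (illustration only).
# Importing a file as a Google Doc makes Drive run OCR on it.
from googleapiclient.http import MediaFileUpload

def ocr_one_page(service, page_path, lang="bn"):
    metadata = {
        "name": page_path,
        "mimeType": "application/vnd.google-apps.document",
    }
    media = MediaFileUpload(page_path, mimetype="application/pdf")
    doc = service.files().create(
        body=metadata, media_body=media, ocrLanguage=lang, fields="id"
    ).execute()
    # Export the resulting Google Doc as plain text, then delete it.
    text = service.files().export(
        fileId=doc["id"], mimeType="text/plain"
    ).execute()
    service.files().delete(fileId=doc["id"]).execute()
    return text.decode("utf-8")
```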

@kaldari & @Niharika Currently the OCR gadget depends on external tools built around Tesseract OCR (see the list of languages at https://github.com/tesseract-ocr/langdata), which are used when HOCR text is not available in the file itself (PDFs and/or DjVus can carry HOCR text with them if available). See https://wikisource.org/wiki/MediaWiki:OCR.js .

I am a little confused here. Going by that language list a bunch of Indic languages are already supported by Tesseract, including Bengali (https://github.com/tesseract-ocr/langdata/tree/master/ben). @Tpt, @Phe, @Sadads - do we know if Tesseract OCR for these languages is not very good or are there other reasons why we don't use it?

@Sadads, (if you know) could you also tell me which tool(s) makes the OCR call to Phetools?

@Niharika, Bengali does have Tesseract support, and the latest trained data (https://github.com/tesseract-ocr/tessdata/blob/master/ben.traineddata) was updated (T117711) recently by @Tpt during Wikimania. But even after the update, the Tesseract OCR gives inferior quality output compared to Google OCR. Still, Google OCR is the best option we have.

Hello, @Niharika -- I have no experience with other languages and Tesseract. It worked well enough for English, and I made it available as a service at the time in response to an expressed need, but have not looked at it since. I'm not sure anyone has been using it, so I'm not taking the time to look into that 502 just yet.

I don't see that Tesseract is a solution to this ticket's stated need.

Tesseract is working fine for English. But it is not usable for Indian languages like Tamil, Telugu, etc.

But Google OCR works fine for most of the languages. Hence we are using Google OCR in the OCR4WikiSource project.

There are ways to train Tesseract for Unicode and Indian languages, but it needs a huge amount of effort. The tough part of training Tesseract is that we have to train it for each font (style) of text.

Training it is the best solution, but it will take time. We have to find people who can train it for Indian languages.

Good news. We were able to get the Google Vision API to successfully detect Hindi and Bengali text (See T140037). The next step would be to create an API proxy on Tool Labs and then create an OCR gadget on the Indic language WikiSource projects that uses it (roughly based on the English Wikisource gadget).
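
For reference, a minimal sketch of the kind of request such a proxy would send to the Vision API (this uses the public v1 REST endpoint; the API key and helper name are placeholders, not the actual gadget or proxy code):

```
# Minimal Cloud Vision OCR call over REST (sketch only).
import base64
import json
import requests

API_KEY = "..."  # a Google Cloud API key with the Vision API enabled

def vision_ocr(image_path, lang_hint="bn"):
    with open(image_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "requests": [{
            "image": {"content": content},
            "features": [{"type": "TEXT_DETECTION"}],
            "imageContext": {"languageHints": [lang_hint]},
        }]
    }
    r = requests.post(
        "https://vision.googleapis.com/v1/images:annotate",
        params={"key": API_KEY},
        data=json.dumps(payload),
    )
    r.raise_for_status()
    annotation = r.json()["responses"][0].get("fullTextAnnotation", {})
    return annotation.get("text", "")
```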

Languages supported by existing phetools/Tesseract gadget code:

bel, ben, cat, ces, dan, deu, eng, epo, spa, est, fra, heb, hrv, hun, ind, isl, ita, nor, pol, por, rus, swe
(Belarusian, Bengali, Catalan, Czech, Danish, German, English, Esperanto, Spanish, Estonian, French, Hebrew, Croatian, Hungarian, Indonesian, Icelandic, Italian, Norwegian, Polish, Portuguese, Russian, Swedish)

Languages supported by Google Vision API (note that their own documentation is out of date):

afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie

All possible languages supported by Tesseract:

afr, amh, ara, asm, aze, aze_cyrl, bel, ben, bod, bos, bul, cat, ceb, ces, chi_sim, chi_tra, chr, cym, dan, dan_frak, deu, dzo, ell, eng, enm, epo, equ, est, eus, fas, fin, fra, frk, frm, gle, glg, grc, guj, hat, heb, hin, hrv, hun, iku, ind, isl, ita, ita_old, jav, jpn, kan, kat, kat_old, kaz, khm, kir, kor, kur, lao, lat, lav, lit, mal, mar, mkd, mlt, msa, mya, nep, nld, nor, ori, pan, pol, por, pus, ron, rus, san, sin, slk, slk_frak, slv, spa, sqi, srp, srp_latn, swa, swe, syr, tam, tel, tgk, tgl, tha, tir, tur, uig, ukr, urd, uzb, uzb_cyrl, vie, yid

All Wikisource languages (sorry these are not 3-letter codes):

ang, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, fa, fi, fo, fr, gl, gu, he, hr, ht, hu, hy, id, is, it, ja, kn, ko, la, li, lt, mk, ml, mr, nl, no, or, pl, pt, ro, ru, sah, sa, sk, sl, sr, sv, ta, te, th, tr, uk, vec, vi, yi, zh_min_nan, zh
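
To make it easier to compare the lists for the Indic languages specifically, here is a small cross-check sketch (not part of the original comment); only the Indic subsets of the two lists above are reproduced, and the ISO 639-1 to 639-2/3 mapping is hand-written for just these languages:

```
# Which Indic Wikisource languages are covered by each OCR backend (sketch).
indic_wikisource = {  # ISO 639-1 -> 639-2/3, hand-written for these languages
    "as": "asm", "bn": "ben", "gu": "guj", "kn": "kan", "ml": "mal",
    "mr": "mar", "or": "ori", "sa": "san", "ta": "tam", "te": "tel",
}
vision_api = {"asm", "ben", "hin", "mar", "nep", "san", "tam", "urd"}
tesseract = {"asm", "ben", "guj", "hin", "kan", "mal", "mar", "nep",
             "ori", "pan", "san", "sin", "tam", "tel", "urd"}

for iso1, iso3 in sorted(indic_wikisource.items()):
    print(f"{iso1} ({iso3}): Vision API: {iso3 in vision_api}, "
          f"Tesseract: {iso3 in tesseract}")
```

Running it shows the gap discussed in the next comment: several Indic Wikisource languages have Tesseract training data but are missing from the Vision API list.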

Now, maybe I am wrong, but I can understand why Malayalam, Telugu, Oriya, Gujrati, and Kannada are not working. As per the Google Vision API language list that @kaldari posted (note that their own documentation is out of date), the supported languages are:

afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie

The above list does not include Malayalam, Telugu, Oriya, Gujrati, or Kannada.

In Google Drive, however, Malayalam, Telugu, Oriya, Gujrati, and Kannada all OCR fine.

The Indic language projects that can now use Google OCR are Bengali, Sanskrit, Tamil, Assamese, and Marathi (via the instructions at https://wikisource.org/wiki/Wikisource:Google_OCR).

I sent an email to my contact at Google asking if they had any plans to support Malayalam, Telugu, Oriya, Gujrati, or Kannada.

Since the Assamese Wikisource was already loading the phetools OCR tool by default (from Common.js), even though it isn't a language supported by phetools, I just went ahead and replaced it with the Google OCR tool. It now works, but none of the system messages exist. I posted a note on their Village Pump to see if we could get someone there to help.

The Indic language projects that can now use Google OCR are Bengali, Sanskrit, Tamil, Assamese, and Marathi (via the instructions at https://wikisource.org/wiki/Wikisource:Google_OCR).

I sent an email to my contact at Google asking if they had any plans to support Malayalam, Telugu, Oriya, Gujrati, or Kannada.

@kaldari, All these languages can use Google OCR via the Google Drive API. This script utilises the Google Drive API, and has been used in different Indic Wikisource projects. Maybe converting this script into a tool or gadget would be an effective step.

My thinking is that the OCR4wikisource workflow isn't quite optimal: I think it'd be better to use such a tool to add a text layer to the PDF, and then proceed with proofreadpage's normal system of pulling that out for each page. This would match how other Wikisources work. (Or, of course, use the single-page OCR button — am I correct in thinking that the objection to this is the slow speed?)
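
As an illustration of the text-layer idea for DjVu files (one of the two file formats in use), here is a sketch only: it assumes the DjVuLibre djvused tool is installed, and the whole-page hidden-text s-expression is the simplest possible form; real tooling would want per-word coordinates.

```
# Write OCR text into a DjVu page's hidden-text layer via djvused (sketch).
import subprocess
import tempfile

def set_page_text(djvu_path, page_number, text, width=2550, height=3300):
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    sexp = f'(page 0 0 {width} {height} "{escaped}")\n'
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     suffix=".sexp", delete=False) as f:
        f.write(sexp)
        sexp_file = f.name
    subprocess.run(
        ["djvused", djvu_path,
         "-e", f"select {page_number}; set-txt {sexp_file}", "-s"],
        check=True,
    )
```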

But as far as getting a system up and running that uses Google Drive:

  • In order to use the OCR4wikisource tool we would need Google Drive API access, which (unless something can be negotiated with Google) will probably incur some cost. @kaldari can you confirm?
  • There is also a possibility that us exposing the OCR component of Google Drive as a publicly-usable service would be in contravention of the GD terms of use. It's fine for individual users to run scripts against their own accounts, but a different matter once it becomes a web-based thing running against a WMF account. Can the WMF Legal department help with this?

Also, is there any plan for Google to add support for the missing languages to their Cloud Vision API? That would be the simplest way ahead!

There is also a possibility that us exposing the OCR component of Google Drive as a publicly-usable service would be in contravention of the GD terms of use. It's fine for individual users to run scripts against their own accounts, but a different matter once it becomes a web-based thing running against a WMF account. Can the WMF Legal department help with this?

We can take a look.


@Samwilson: Before we have @ZhouZ look at this, let's identify all the Google endpoints that https://github.com/tienfuc/gdcmdtools interacts with so that we will know which ToSes actually apply.

@Ravidreams, @Bodhisattwa: I talked to several folks from Google on Friday. They confirmed that we could not use Google Drive as a free public OCR service (according to their Terms of Use). Also, I brought up the issue of the quality difference between the Vision API and Google Drive's OCR. They said that they would need an example in order to investigate the difference. I asked for an example about a month ago, but no one has supplied one yet. If anyone has an example of a case where the Google Drive OCR does a better job than the Google Vision API, please send it to me along with an explanation of the difference. Otherwise, there isn't much we can do. Finally, I asked them about supporting Malayalam, Telugu, Oriya, Gujrati, and Kannada. They said that they would look into it and get back to me.

I've added a side-by-side view of the OCR output of the first page of this example document: https://commons.wikimedia.org/wiki/File:OCR-test-1.djvu

@kaldari, @Samwilson, for Bengali, there is not much difference in the OCR output from the Cloud Vision API or the Google Drive API.

@Samwilson: What language is your test in? What resolution was it OCRed at? The results from that test look pretty similar. It looks like there are a few letters that are wrong in the Vision version and a few that are wrong in the Drive version, but not a significant difference between the two. The punctuation seems to be slightly better in the Vision version, although it incorrectly detected a spot on the page as a comma, which the Drive version didn't.

Perhaps we should do a test in Tamil, since that is the language that Ravidreams was complaining about.
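
If someone does run that test, a line-by-line diff of the two outputs is probably the easiest way to produce the "explanation of the difference" that Google asked for; a small helper sketch (not from the thread):

```
# Compare two OCR outputs of the same page, e.g. Vision API vs Google Drive.
import difflib

def show_ocr_diff(vision_text, drive_text):
    diff = difflib.unified_diff(
        vision_text.splitlines(), drive_text.splitlines(),
        fromfile="vision", tofile="drive", lineterm="",
    )
    print("\n".join(diff))
```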

I think it's safe to close this task as done now. CC @DannyH

@jayantanth, @Bodhisattwa, @Samwilson - An update: Kannada, Telugu, and Panjabi are now properly supported by the Google Vision API. Malayalam, Oriya, and Gujrati are still broken, however.

@jayantanth, @Bodhisattwa, @Samwilson - It looks like Malayalam, Oriya/Odia, and Gujrati have now been fixed as well!