Test and deploy the OCR gadget on Wikisource
Closed, Resolved | Public | 3 Estimated Story Points

Description

Test the new OCR user script on English or multilingual Wikisource. If it works well, deploy it to an Indic language Wikisource as a gadget.

Event Timeline

kaldari renamed this task from Deploy and test the OCR gadget on Wikisource to Test and deploy the OCR gadget on Wikisource. Aug 15 2016, 4:50 PM
kaldari updated the task description.

You can deploy at BNWS (Bengali Wikisource) for testing; we have no issues with that.

DannyH set the point value for this task to 3. Aug 30 2016, 5:47 PM
DannyH moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.

@jayantanth: It looks like the existing phetools/Tesseract support Bengali. Has Bengali WikiSource tried using those yet?

This administrator is probably the best person to contact about getting the gadget enabled on Tamil: https://ta.wikisource.org/wiki/%E0%AE%AA%E0%AE%AF%E0%AE%A9%E0%AE%B0%E0%AF%8D:Balajijagadesh

@jayantanth: It looks like the existing phetools/Tesseract support Bengali. Has Bengali WikiSource tried using those yet?

T120788#2504287

Deployed in MediaWiki:Common.js on Bengali Wikisource. The script is working fine.

Deployed on Kannada Wikisource as Gadget-GoogleOcr.js. The script does not work for either PDF or DjVu files.

I tested on Tamil Wikisource. The OCR accuracy is far lower than that of OCR through the Google Drive API. If there is an official communication channel between WMF and Google's Cloud Vision API team, can this be conveyed? I don't see merit in recommending or using this tool until this is fixed, as every percentage point of accuracy matters when OCR is involved.

@kaldari Yes, we have phetools/Tesseract, but its OCR output is not on par with Google OCR.

@Ravidreams : We do have a contact at Google. Can you provide me with a testcase Tamil image (or several images) that perform better in the Google Drive API than the Google Vision OCR API? Are there any particular patterns to the difference?

@Omshivaprakash: It looks like the Google OCR API doesn't yet support Kannada. See T120788#2618286. I've filed a request to add support for Kannada to phetools since there is a Tesseract language pack available for Kannada: https://github.com/phil-el/phetools/issues/9.

I have tested the script on all Indic Wikisources.

Working fine (text is returned):
Bengali, Assamese, Marathi, Tamil, Sanskrit

Not working (no text output):
Malayalam, Telugu, Oriya, Gujarati, Kannada

I may be wrong, but I think I understand why Malayalam, Telugu, Oriya, Gujarati, and Kannada are not working. As @kaldari said, the Google Vision API supports the languages below (note that Google's own documentation is out of date):

afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie

The above list does not include Malayalam, Telugu, Oriya, Gujarati, or Kannada.

In Google Drive, Malayalam, Telugu, Oriya, Gujarati, and Kannada are all OCRed fine.
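
To illustrate the point about the language list, here is a minimal sketch of how a script could check support before sending a page to the Vision API. The function name `isVisionSupported` and the mapping from Wikisource subdomain codes to three-letter codes are assumptions made for illustration; only the list of supported codes comes from the comment above, and the real gadget may not do any such check.

```javascript
// Three-letter codes copied from the list quoted above.
const VISION_SUPPORTED = new Set(
  ('afr ara asm aze bel ben bul cat ces chi dan dut eng est fil fin fre ger ' +
   'hin hrv hun ice ind ita jpn kaz kir kor lav lit mac mar may mon nep nor ' +
   'per pol por pus rum rus san slo slv spa srp swe tam tur ukr urd uzb vie')
    .split(' ')
);

// Assumed mapping from Wikisource subdomain codes to three-letter codes,
// limited to the Indic languages mentioned in this task.
const WIKI_TO_ISO = new Map([
  ['bn', 'ben'], ['as', 'asm'], ['mr', 'mar'], ['ta', 'tam'], ['sa', 'san'],
  ['ml', 'mal'], ['te', 'tel'], ['or', 'ori'], ['gu', 'guj'], ['kn', 'kan'],
]);

// Returns true only when the wiki language maps to a code in the list.
function isVisionSupported(wikiLang) {
  const iso = WIKI_TO_ISO.get(wikiLang);
  return iso !== undefined && VISION_SUPPORTED.has(iso);
}

const working = ['bn', 'as', 'mr', 'ta', 'sa'].filter(isVisionSupported);
const failing = ['ml', 'te', 'or', 'gu', 'kn'].filter((l) => !isVisionSupported(l));
console.log(working, failing);
```

With this mapping, the five working wikis pass the check and the five failing ones do not, which matches the test results reported above.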

@Samwilson @Bodhisattwa: Be careful about adding the Clean up OCR script. A lot of those rules are specific to English and don't make sense for other languages. For example, the "remove unwanted spaces around punctuation marks" will cause errors in French, and many of the rules are for specific English words. I would suggest just cherry-picking the rules that are going to work for any language.

Agreed with @kaldari; please use the Clean up OCR script separately. Please don't mix it up with this.

@kaldari @Bodhisattwa : That's rather what I was thinking, after looking at it a bit more. The existing clean-up script isn't very usable outside its current home anyway, and so it seemed to me that it'd be nice to create a decent generalised OCR cleanup library for multiple languages. Surely such a thing exists, though? Anyway, if it doesn't, we have to do the hard bits of it regardless, so we might as well package it up nicely for others as well, I reckon! (But that's all an aside from the issue at hand, I guess.)
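
As a rough idea of what cherry-picking rules per language could look like, here is a small sketch. The rule names, the `cleanupRules` table, and `cleanupOcrText` are all hypothetical and are not taken from the existing Clean up OCR script; the point is only that each rule declares which languages it is safe for.

```javascript
// Hypothetical rule table. langs is '*' for rules assumed safe everywhere,
// or a list of language codes for language-specific rules.
const cleanupRules = [
  {
    name: 'collapse-spaces',
    langs: '*',
    apply: (text) => text.replace(/[ \t]{2,}/g, ' '),
  },
  {
    name: 'trim-line-ends',
    langs: '*',
    apply: (text) => text.replace(/[ \t]+$/gm, ''),
  },
  {
    // French puts a space before ; : ! ? so this rule is limited to English.
    name: 'no-space-before-punctuation',
    langs: ['en'],
    apply: (text) => text.replace(/ +([;:!?])/g, '$1'),
  },
];

// Applies, in order, only the rules that are valid for the given language.
function cleanupOcrText(text, lang) {
  return cleanupRules
    .filter((rule) => rule.langs === '*' || rule.langs.includes(lang))
    .reduce((current, rule) => rule.apply(current), text);
}

console.log(cleanupOcrText('Hello  world !', 'en'));
console.log(cleanupOcrText('Bonjour  le monde !', 'fr'));
```

The English input loses the space before the exclamation mark, while the French input keeps it and only has its doubled space collapsed.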

@Bodhisattwa are the system messages working correctly on Tamil Wikisource? I guess if it's installed site-wide then you'll not add it as a Gadget — but if you are, do the instructions make sense?

@kaldari (and anyone else with access), we can monitor Google Cloud Vision API usage (that's the API the tool uses) via this link.

I think we should implement some kind of throttling at our end if the requests become too many. The toollabs tool is also open to everyone. We should consider tying it up with OAuth to prevent spam requests.
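
For discussion, a minimal sketch of the kind of per-user throttling I mean, written in JavaScript for illustration only. The class name `SlidingWindowLimiter` and the limit values are assumptions, and it keeps state in memory, so a real deployment would need shared storage and a trustworthy user identifier (for example from OAuth).

```javascript
// Sliding-window limiter: allows at most maxRequests per user within any
// window of windowMs milliseconds.
class SlidingWindowLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.hits = new Map(); // user -> timestamps of accepted requests
  }

  // Returns true and records the request if the user is under the limit.
  allow(user, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(user) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.maxRequests;
    if (allowed) {
      recent.push(now);
    }
    this.hits.set(user, recent);
    return allowed;
  }
}

// Example values only: 2 requests per user per second.
const limiter = new SlidingWindowLimiter(2, 1000);
const results = [
  limiter.allow('alice', 0),
  limiter.allow('alice', 10),
  limiter.allow('alice', 20),   // third request inside the window is rejected
  limiter.allow('bob', 20),     // other users are counted separately
  limiter.allow('alice', 1011), // earlier requests have aged out
];
console.log(results);
```

Rejected requests are not recorded, so a user who keeps hitting the limit is not locked out for longer than one window.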

@Niharika: I went ahead and put some limits on the Vision API quotas:

  • Requests per day: 10,000 (was 864,000)
  • Requests per 100 seconds: 500 (was 1,000)
  • Requests per 100 seconds per user: 100 (was 1,000)

We can tweak these further if needed. I also added Sam as a project owner so he can help keep tabs on it.

@Samwilson: Report from Balajijagadesh on Tamil wiki: "Okay... Its working now.. after doing OCR the page freezes sometimes in mozilla firefox". Any guess what might be causing that? Sounds like Firefox running out of memory for some reason.

I've replied on his talk page. I've not been able to replicate the problem yet.

Also, I'm getting the following error on the quotas page Niharika linked to above: "The API doesn't exist or you don't have permission to access it". I think I need to be joined to a project? At the moment I have no projects listed there.

@Samwilson: Are you able to access the quotas page yet?

@kaldari: yes, it is working now. Must just take a day or so for some reason. Thanks.

kaldari closed this task as Resolved. Edited Sep 15 2016, 6:34 AM
kaldari claimed this task.

Current status of the tool:

  • Global script on Bengali Wikisource (with interface messages)
  • Global script on Assamese Wikisource (with interface messages)
  • Global script on Sanskrit Wikisource (with interface messages)
  • Global script on Marathi Wikisource, but no interface messages created
  • Gadget on Tamil Wikisource (with interface messages)

I think that's good enough to close this task. We can follow up in the main task at T120788.