Test and deploy the OCR gadget on Wikisource
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	• DannyH
	Aug 11 2016, 10:36 PM

Description

Test the new OCR user script on English or multilingual Wikisource. If it works well, deploy it to an Indic language Wikisource as a gadget.

Related Objects

Status	Assigned	Task
Resolved	Tshrinivasan	T120788 Tool to use Google OCRs in Indic language Wikisource
Resolved	kaldari	T142770 Test and deploy the OCR gadget on Wikisource
Resolved	Samwilson	T142769 Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR
Resolved	kaldari	T145725 Get Google OCR tool to reliably load in the toolbar
Resolved	Samwilson	T145567 Move Google OCR script into MediaWiki namespace and update all references to it

Event Timeline

• DannyH created this task. Aug 11 2016, 10:36 PM

Restricted Application added a subscriber: Aklapper. Aug 11 2016, 10:36 PM

• DannyH added a parent task: T120788: Tool to use Google OCRs in Indic language Wikisource. Aug 11 2016, 10:36 PM

• DannyH added a subtask: T142769: Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR.

• DannyH mentioned this in T140037: Investigation: Tool to use Google OCR in Indic language Wikisources.

Bodhisattwa subscribed. Aug 12 2016, 3:49 AM

Yann subscribed. Aug 14 2016, 9:33 PM

Niharika removed a subtask: T142769: Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR. Aug 15 2016, 10:08 AM

Niharika added a subtask: T142769: Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR. Aug 15 2016, 10:25 AM

kaldari renamed this task from Deploy and test the OCR gadget on Wikisource to Test and deploy the OCR gadget on Wikisource. Aug 15 2016, 4:50 PM

kaldari updated the task description.

You can deploy at BNWS for testing we have know issues.

• DannyH set the point value for this task to 3. Aug 30 2016, 5:47 PM

• DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Samwilson subscribed. Sep 2 2016, 2:58 AM

kaldari updated the task description. Sep 8 2016, 3:41 AM

kaldari updated the task description. Sep 8 2016, 4:23 AM

@jayantanth: It looks like the existing phetools/Tesseract supports Bengali. Has Bengali Wikisource tried using it yet?

If we want to test on Tamil Wikisource (which isn't supported by phetools), here are some pages to test with: https://ta.wikisource.org/wiki/Index:%E0%AE%AA%E0%AF%87%E0%AE%B1%E0%AF%81%E0%AE%95%E0%AE%BE%E0%AE%B2%E0%AE%AA%E0%AF%8D_%E0%AE%AA%E0%AE%BF%E0%AE%B0%E0%AE%9A%E0%AF%8D%E0%AE%9A%E0%AE%A9%E0%AF%88%E0%AE%95%E0%AE%B3%E0%AF%8D.pdf.

kaldari edited projects, added Community-Tech-Sprint, All-and-every-Wikisource; removed Community-Tech. Sep 8 2016, 5:48 AM

This administrator is probably the best person to contact about getting the gadget enabled on Tamil: https://ta.wikisource.org/wiki/%E0%AE%AA%E0%AE%AF%E0%AE%A9%E0%AE%B0%E0%AF%8D:Balajijagadesh

I've asked Balajijagadesh.

In T142770#2618265, @kaldari wrote:

@jayantanth: It looks like the existing phetools/Tesseract supports Bengali. Has Bengali Wikisource tried using it yet?

T120788#2504287

Deployed in MediaWiki:Common.js on Bengali Wikisource. The script is working fine.

Bodhisattwa awarded a token. Sep 8 2016, 4:00 PM

Deployed on Kannada Wikisource as Gadget-GoogleOcr.js. The script does not work for either PDF or DjVu files.

I tested on Tamil Wikisource. The OCR accuracy is far lower than that of OCR through the Google Drive API. If there is an official communication channel between WMF and Google's Cloud Vision API team, can this be conveyed? I don't see merit in recommending or using this tool until this is fixed, as every percentage point of accuracy matters when OCR is involved.

In T142770#2618265, @kaldari wrote:

@jayantanth: It looks like the existing phetools/Tesseract supports Bengali. Has Bengali Wikisource tried using it yet?

@kaldari Yes, we have phetools/Tesseract, but its OCR output is not on par with Google OCR.

@Ravidreams: We do have a contact at Google. Can you provide me with a Tamil test image (or several) that performs better with the Google Drive API than with the Google Vision OCR API? Are there any particular patterns to the difference?

@Omshivaprakash: It looks like the Google OCR API doesn't yet support Kannada. See T120788#2618286. I've filed a request to add support for Kannada to phetools since there is a Tesseract language pack available for Kannada: https://github.com/phil-el/phetools/issues/9.

I have tested the script on all Indic Wikisources.

Working fine (text returned):
Bengali, Assamese, Marathi, Tamil, Sanskrit

Not working (no text output):
Malayalam, Telugu, Oriya, Gujarati, Kannada

kaldari closed subtask T142769: Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR as Resolved. Sep 8 2016, 5:30 PM

• DannyH moved this task from Ready to In Development on the Community-Tech-Sprint board. Sep 8 2016, 5:48 PM

Now, I may be wrong, but I think I understand why Malayalam, Telugu, Oriya, Gujarati, and Kannada are not working. According to @kaldari, the Google Vision API supports the languages below (note that Google's own documentation is out of date):

afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie

The above list does not include Malayalam, Telugu, Oriya, Gujarati, or Kannada.

In Google Drive, Malayalam, Telugu, Oriya, Gujarati, and Kannada are all OCRed fine.
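The comparison above can be reproduced mechanically. The following is a minimal sketch, not part of the gadget: it copies the supported-code list quoted in this task and checks the ten tested languages against it. The three-letter codes assigned to the Indic languages (`mal`, `tel`, `ori`, `guj`, `kan`, and so on) are assumed ISO 639-2 codes, and the names `VISION_SUPPORTED`, `INDIC_WIKISOURCES`, and `split_by_support` are hypothetical.

```python
# Three-letter codes copied verbatim from the list quoted in this task.
VISION_SUPPORTED = frozenset("""
afr ara asm aze bel ben bul cat ces chi dan dut eng est fil fin fre ger hin
hrv hun ice ind ita jpn kaz kir kor lav lit mac mar may mon nep nor per pol
por pus rum rus san slo slv spa srp swe tam tur ukr urd uzb vie
""".split())

# Languages tested in this task, mapped to assumed ISO 639-2 codes.
INDIC_WIKISOURCES = {
    "Bengali": "ben", "Assamese": "asm", "Marathi": "mar",
    "Tamil": "tam", "Sanskrit": "san", "Malayalam": "mal",
    "Telugu": "tel", "Oriya": "ori", "Gujarati": "guj", "Kannada": "kan",
}


def split_by_support(languages, supported=VISION_SUPPORTED):
    """Return (supported_names, unsupported_names), each sorted by name."""
    ok = sorted(n for n, code in languages.items() if code in supported)
    missing = sorted(n for n, code in languages.items() if code not in supported)
    return ok, missing


if __name__ == "__main__":
    ok, missing = split_by_support(INDIC_WIKISOURCES)
    print("In the quoted list:", ", ".join(ok))
    print("Not in the quoted list:", ", ".join(missing))
```

Under these assumptions, the split matches the test results reported above: the five languages that returned text are in the quoted list, and the five that returned nothing are not.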

Had a discussion with @Samwilson about adding the Clean up OCR script to this new script.

@Samwilson @Bodhisattwa: Be careful about adding the Clean up OCR script. A lot of those rules are specific to English and don't make sense for other languages. For example, the "remove unwanted spaces around punctuation marks" will cause errors in French, and many of the rules are for specific English words. I would suggest just cherry-picking the rules that are going to work for any language.
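One way to cherry-pick rules is to tag each rule with the languages it is safe for and apply only the matching ones. The following is a rough sketch of that idea, not the actual Clean up OCR script: the three rules, the `CleanupRule` class, and the `clean` function are all hypothetical, and the language codes are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class CleanupRule:
    name: str
    apply: Callable[[str], str]
    # None means the rule is considered safe for any language.
    languages: Optional[FrozenSet[str]] = None


# Hypothetical example rules; the real script's rules are not reproduced here.
RULES = [
    CleanupRule("collapse runs of spaces",
                lambda t: re.sub(r"[ \t]{2,}", " ", t)),
    CleanupRule("strip trailing whitespace",
                lambda t: re.sub(r"[ \t]+$", "", t, flags=re.MULTILINE)),
    # French puts a space before ; : ! ? so this rule is limited to English.
    CleanupRule("remove space before punctuation",
                lambda t: re.sub(r" +([;:!?])", r"\1", t),
                frozenset({"en"})),
]


def clean(text: str, lang: str) -> str:
    """Apply only the rules that are language-agnostic or tagged for `lang`."""
    for rule in RULES:
        if rule.languages is None or lang in rule.languages:
            text = rule.apply(text)
    return text


if __name__ == "__main__":
    print(clean("Hello  world !", "en"))
    print(clean("Bonjour  le monde !", "fr"))
```

With this layout, enabling cleanup for a new Wikisource only requires reviewing which tagged rules apply to its language; the untagged rules run everywhere.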

Agreed with @kaldari. Please use the Clean up OCR script separately and don't mix it in with this one.

kaldari updated the task description. Sep 8 2016, 7:02 PM

@kaldari @Bodhisattwa : That's rather what I was thinking, after looking at it a bit more. The existing clean-up script isn't very usable outside its current home anyway, and so it seemed to me that it'd be nice to create a decent generalised OCR cleanup library for multiple languages (surely such a thing exists, though?). Anyway, if it doesn't, we have to do the hard bits of it regardless, so we might as well package it up nicely for others as well, I reckon! (But that's all an aside from the issue at hand, I guess.)

@Bodhisattwa are the system messages working correctly on Tamil Wikisource? I guess if it's installed site-wide then you'll not add it as a Gadget — but if you are, do the instructions make sense?

• DannyH moved this task from In Development to Needs Review/Feedback on the Community-Tech-Sprint board. Sep 9 2016, 12:35 AM

MKar subscribed. Sep 9 2016, 1:19 PM

@kaldari (and anyone else with access), we can monitor Google Cloud Vision API usage (that's the API the tool uses) via this link.

I think we should implement some kind of throttling at our end in case the request volume gets too high. The toollabs tool is also open to everyone. We should consider tying it to OAuth to prevent spam requests.

@Niharika: I went ahead and put some limits on the Vision API quotas:

Requests per day: 10,000 (was 864,000)
Requests per 100 seconds: 500 (was 1,000)
Requests per 100 seconds per user: 100 (was 1,000)

We can tweak these further if needed. I also added Sam as a project owner so he can help keep tabs on it.
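For throttling on our own side, the same three limits could be mirrored in the tool before a request ever reaches Google. The following is a minimal in-memory sketch, assuming a single process and a rolling 24-hour window for the daily cap (Google's own daily quota may reset differently). The `QuotaGate` class and its parameters are hypothetical and not part of the existing tool.

```python
import time
from collections import defaultdict, deque


class QuotaGate:
    """Sliding-window limiter mirroring the three quotas listed above."""

    def __init__(self, per_user=100, overall=500, window=100.0,
                 per_day=10_000, clock=time.monotonic):
        self.per_user, self.overall = per_user, overall
        self.window, self.per_day = window, per_day
        self.clock = clock
        self._all = deque()                 # all requests in the short window
        self._by_user = defaultdict(deque)  # per-user requests in the short window
        self._daily = deque()               # all requests in the last 24 hours

    @staticmethod
    def _expire(stamps, cutoff):
        # Drop timestamps that have fallen out of the window.
        while stamps and stamps[0] <= cutoff:
            stamps.popleft()

    def allow(self, user):
        """Record the request and return True, or return False if over quota."""
        now = self.clock()
        mine = self._by_user[user]
        self._expire(self._all, now - self.window)
        self._expire(mine, now - self.window)
        self._expire(self._daily, now - 86_400)
        if (len(mine) >= self.per_user
                or len(self._all) >= self.overall
                or len(self._daily) >= self.per_day):
            return False
        for stamps in (self._all, mine, self._daily):
            stamps.append(now)
        return True


if __name__ == "__main__":
    gate = QuotaGate()
    print(gate.allow("ExampleUser"))
```

The `user` key would ideally be an OAuth-authenticated username, as suggested above; keying on anything a client can choose freely would make the per-user limit easy to bypass.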

@Samwilson: Report from Balajijagadesh on Tamil wiki: "Okay... Its working now.. after doing OCR the page freezes sometimes in mozilla firefox". Any guess what might be causing that? Sounds like Firefox running out of memory for some reason.

I've replied on his talk page. I've not been able to replicate the problem yet.

Also, I'm getting the following error on the quotas page Niharika linked to above: "The API doesn't exist or you don't have permission to access it". I think I need to be added to a project? At the moment I have no projects listed there.

@Samwilson: Are you able to access the quotas page yet?