Test the new OCR user script on English or multilingual Wikisource. If it works well, deploy it to an Indic language Wikisource as a gadget.
|Resolved||Tshrinivasan||T120788 Tool to use Google OCRs in Indic language Wikisource|
|Resolved||kaldari||T142770 Test and deploy the OCR gadget on Wikisource|
|Resolved||Samwilson||T142769 Create a ProofreadPage wikitext editor user script for Wikisource which uses Google Vision API to do OCR|
|Resolved||kaldari||T145725 Get Google OCR tool to reliably load in the toolbar|
|Resolved||Samwilson||T145567 Move Google OCR script into MediaWiki namespace and update all references to it|
If we want to test on Tamil Wikisource (which isn't supported by phetools), here are some pages to test with: https://ta.wikisource.org/wiki/Index:%E0%AE%AA%E0%AF%87%E0%AE%B1%E0%AF%81%E0%AE%95%E0%AE%BE%E0%AE%B2%E0%AE%AA%E0%AF%8D_%E0%AE%AA%E0%AE%BF%E0%AE%B0%E0%AE%9A%E0%AF%8D%E0%AE%9A%E0%AE%A9%E0%AF%88%E0%AE%95%E0%AE%B3%E0%AF%8D.pdf.
This administrator is probably the best person to contact about getting the gadget enabled on Tamil: https://ta.wikisource.org/wiki/%E0%AE%AA%E0%AE%AF%E0%AE%A9%E0%AE%B0%E0%AF%8D:Balajijagadesh
I tested on Tamil Wikisource. The OCR accuracy is much lower than that of OCR through the Google Drive API. If there is an official communication channel between WMF and Google's Cloud Vision API team, can this be conveyed? I don't see merit in recommending or using this tool until this is fixed, as every percentage point of accuracy matters when OCR is involved.
I have tested the tool on all the Indic Wikisources.
Working fine, text returned:
Bengali, Assamese, Marathi, Tamil, Sanskrit
Not working, no text output:
Malayalam, Telugu, Oriya, Gujarati, Kannada
I may be wrong, but I think I understand why Malayalam, Telugu, Oriya, Gujarati and Kannada are not working. As @kaldari said, the Google Vision API supports the languages below (note that Google's own documentation is out of date):
afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie
The list above does not include Malayalam, Telugu, Oriya, Gujarati or Kannada.
In Google Drive, Malayalam, Telugu, Oriya, Gujarati and Kannada all OCRed fine.
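To make the comparison above easy to repeat, here is a minimal sketch that checks each Indic Wikisource language against the code list quoted in this thread. The name-to-code mapping (ISO 639-2 codes) and the helper name `split_by_support` are my own assumptions for illustration; nothing here calls the Vision API itself.

```python
# Language codes quoted in this thread as supported by the Vision API.
SUPPORTED = set(
    "afr, ara, asm, aze, bel, ben, bul, cat, ces, chi, dan, dut, eng, est, "
    "fil, fin, fre, ger, hin, hrv, hun, ice, ind, ita, jpn, kaz, kir, kor, "
    "lav, lit, mac, mar, may, mon, nep, nor, per, pol, por, pus, rum, rus, "
    "san, slo, slv, spa, srp, swe, tam, tur, ukr, urd, uzb, vie".split(", ")
)

# Assumed mapping from the languages tested above to ISO 639-2 codes.
INDIC_LANGUAGES = {
    "Bengali": "ben", "Assamese": "asm", "Marathi": "mar",
    "Tamil": "tam", "Sanskrit": "san",
    "Malayalam": "mal", "Telugu": "tel", "Oriya": "ori",
    "Gujarati": "guj", "Kannada": "kan",
}


def split_by_support(languages, supported=SUPPORTED):
    """Return (supported_names, unsupported_names), each sorted by name."""
    working = sorted(n for n, code in languages.items() if code in supported)
    failing = sorted(n for n, code in languages.items() if code not in supported)
    return working, failing


working, failing = split_by_support(INDIC_LANGUAGES)
print("Supported:", ", ".join(working))
print("Not in list:", ", ".join(failing))
```

The split matches the test results reported above: the five languages that returned text are in the list, and the five that returned nothing are not.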
@Samwilson @Bodhisattwa: Be careful about adding the Clean up OCR script. A lot of those rules are specific to English and don't make sense for other languages. For example, the "remove unwanted spaces around punctuation marks" will cause errors in French, and many of the rules are for specific English words. I would suggest just cherry-picking the rules that are going to work for any language.
@kaldari @Bodhisattwa : That's rather what I was thinking, after looking at it a bit more. The existing clean-up script isn't very usable outside its current home anyway, and so it seemed to me that it'd be nice to create a decent generalised OCR cleanup library for multiple languages. Surely such a thing exists already, though? Anyway, if it doesn't, we have to do the hard bits of it regardless, so we might as well package it up nicely for others as well, I reckon! (But that's all an aside from the issue at hand, I guess.)
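As a rough illustration of the cherry-picking idea, a cleanup library could keep language-neutral rules apart from language-specific ones and only apply the latter when the page language matches. This is a sketch under assumptions, not the existing clean-up script: the rule sets, the `clean_ocr` name and the choice of which rules count as "generic" are all hypothetical.

```python
import re

# Rules assumed to be safe for any language: whitespace normalisation only.
GENERIC_RULES = [
    (re.compile(r"[ \t]+"), " "),            # collapse runs of spaces/tabs
    (re.compile(r" +$", re.MULTILINE), ""),  # strip trailing spaces per line
]

# Rules that only make sense for particular languages (hypothetical set).
LANGUAGE_RULES = {
    "en": [
        # Remove spaces before punctuation. Deliberately NOT applied to
        # French, where a space before "!", "?", ":" and ";" is correct.
        (re.compile(r" +([,.;:!?])"), r"\1"),
    ],
}


def clean_ocr(text, lang):
    """Apply the generic rules, then any rules registered for `lang`."""
    rules = GENERIC_RULES + LANGUAGE_RULES.get(lang, [])
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(clean_ocr("Hello  world !  ", "en"))
print(clean_ocr("Bonjour  le monde !", "fr"))
```

With this layout, a wiki whose language has no entry in `LANGUAGE_RULES` still gets the generic cleanup and nothing else, which seems like the safer default.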
@Bodhisattwa are the system messages working correctly on Tamil Wikisource? I guess if it's installed site-wide then you won't add it as a Gadget, but if you are, do the instructions make sense?
I think we should implement some kind of throttling at our end if the requests become too many. The toollabs tool is also open to everyone. We should consider tying it up with OAuth to prevent spam requests.
@Niharika: I went ahead and put some limits on the Vision API quotas:
- Requests per day: 10,000 (was 864,000)
- Requests per 100 seconds: 500 (was 1,000)
- Requests per 100 seconds per user: 100 (was 1,000)
We can tweak these further if needed. I also added Sam as a project owner so he can help keep tabs on it.
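The quotas above are enforced on Google's side. If we also want throttling at our end, one option is a per-user sliding-window limiter in the tool itself. The sketch below is only an assumption about how that could look: the class name, the in-memory storage and the injectable clock are all hypothetical, and the 100 requests per 100 seconds figure is simply borrowed from the per-user quota above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per user in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so it can be tested
        self._hits = defaultdict(deque)   # user -> timestamps of allowed requests

    def allow(self, user):
        """Record the request and return True, or return False if over limit."""
        now = self.clock()
        hits = self._hits[user]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Mirrors the per-user quota above: 100 requests per 100 seconds.
limiter = SlidingWindowLimiter(max_requests=100, window_seconds=100)
if limiter.allow("ExampleUser"):
    print("request allowed")
```

Keying the limiter on a username only helps if requests are tied to an identity, which is where OAuth would come in; keying on IP address would be a weaker fallback.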
I've replied on his talk page. I've not been able to replicate the problem yet.
Also, I'm getting the following error on the quotas page Niharika linked to above: "The API doesn't exist or you don't have permission to access it". I think I need to be added to a project? At the moment I have no projects listed there.
Current status of the tool:
- Global script on Bengali Wikisource (with interface messages)
- Global script on Assamese Wikisource (with interface messages)
- Global script on Sanskrit Wikisource (with interface messages)
- Global script on Marathi Wikisource, but no interface messages created
- Gadget on Tamil Wikisource (with interface messages)
I think that's good enough to close this task. We can follow up in the main task at T120788.