Google OCR button is not providing other language text output in Wikisource .
Closed, Resolved, Public

Description

Google OCR button is not providing other language text output in Wikisource

For example,

  1. it is not providing Bengali text output in English Wikisource here

Screenshot from 2018-06-24 21-02-33.png (768×1 px, 405 KB)

  2. it is not providing Devanagari (Sanskrit) text output in Bengali Wikisource here

Screenshot from 2018-06-24 21-29-00.png (768×1 px, 328 KB)

Event Timeline

Bodhisattwa renamed this task from "Google OCR button is not providing Bengali text output in English Wikisource" to "Google OCR button is not providing other language text output in Wikisource". Jun 24 2018, 4:00 PM
Bodhisattwa updated the task description. (Show Details)
Vvjjkkii renamed this task from "Google OCR button is not providing other language text output in Wikisource" to "7daaaaaaaa". Jul 1 2018, 1:02 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Billinghurst renamed this task from "7daaaaaaaa" to "Google OCR button is not providing other language text output in Wikisource". Jul 1 2018, 1:47 AM
Billinghurst renamed this task from "Google OCR button is not providing other language text output in Wikisource" to "Google OCR button is not providing Bengali text output in English Wikisource to Google OCR button is not providing other language text output in Wikisource .".
Billinghurst lowered the priority of this task from High to Medium.
Billinghurst updated the task description. (Show Details)
Billinghurst added a subscriber: Aklapper.
Bodhisattwa renamed this task from "Google OCR button is not providing Bengali text output in English Wikisource to Google OCR button is not providing other language text output in Wikisource ." to "Google OCR button is not providing other language text output in Wikisource .". Jul 1 2018, 7:33 AM

This is probably going to be a WONTFIX. In order for Google OCR to work well, it needs to know what language it's reading. (You would be surprised at how bad it is at guessing.) Currently, when you click the Google OCR button, it passes the language code for that wiki, under the assumption that most text on that wiki is going to be in the content language. We could add another step where it asks for the language explicitly, but that would needlessly slow down the process for the 99% of cases where the language of the text matches the language of the wiki. One workaround is to use https://tools.wmflabs.org/ws-google-ocr/ in those cases, since it allows you to specify the language of the text.

I'll leave the ultimate decision up to @Samwilson, but I would recommend WONTFIX.
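For illustration, here is a rough sketch of what "passing the language code" could look like. It assumes the backend is the Google Cloud Vision `images:annotate` endpoint, which accepts a list of `languageHints`; the thread does not confirm this. The function name `buildOcrRequest` and its parameters are hypothetical and are not taken from the actual tool.

```typescript
// Sketch only: assumes the OCR backend is Google Cloud Vision's
// images:annotate endpoint. buildOcrRequest and its parameter names
// are hypothetical and not taken from the real gadget or tool.

interface AnnotateRequest {
  requests: Array<{
    image: { source: { imageUri: string } };
    features: Array<{ type: string }>;
    imageContext: { languageHints: string[] };
  }>;
}

function buildOcrRequest(
  imageUri: string,
  wikiLang: string,
  overrideLangs: string[] = []
): AnnotateRequest {
  // Default to the wiki's content language; use the explicit
  // override only when the caller supplies one or more languages.
  const languageHints = overrideLangs.length > 0 ? overrideLangs : [wikiLang];
  return {
    requests: [
      {
        image: { source: { imageUri } },
        features: [{ type: "DOCUMENT_TEXT_DETECTION" }],
        imageContext: { languageHints },
      },
    ],
  };
}

// Bengali page on English Wikisource: override the default "en" hint.
// The image URL is a placeholder.
const body = buildOcrRequest("https://example.org/page.jpg", "en", ["bn"]);
console.log(JSON.stringify(body.requests[0].imageContext));
```

With no override, the hint falls back to the wiki language, which matches the current behaviour described above for the common case.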

We could add an extra button to the toolbar that just opens the ws-google-ocr tool in a new tab, with the image URL pre-filled. That would probably be the quickest to implement. For the WikiEditor toolbar, it could go a level down, under 'proofreading tools'; for the old toolbar, it'd just be next to the existing button (with some sort of external-link icon?).

I don't really have an idea of how many works there are that contain lengthy sections in other languages, where this feature would be useful. Might ask on the mailing list.

(I'm happy to work on this in my own time; we've enough CommTech stuff going on I think.)
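A minimal sketch of the extra-button idea, for illustration only: it builds a link to the ws-google-ocr tool with the image URL pre-filled. The query parameter names `image` and `lang` are assumptions about the tool's interface, and `buildOcrToolUrl` is a hypothetical helper.

```typescript
// Sketch only: the "image" and "lang" query parameter names are
// assumptions, and buildOcrToolUrl is a hypothetical helper.

const TOOL_BASE = "https://tools.wmflabs.org/ws-google-ocr/";

function buildOcrToolUrl(imageUrl: string, lang?: string): string {
  // Percent-encode each value so the image URL survives as one parameter.
  const params = ["image=" + encodeURIComponent(imageUrl)];
  if (lang) {
    params.push("lang=" + encodeURIComponent(lang));
  }
  return TOOL_BASE + "?" + params.join("&");
}

// Placeholder image URL; a toolbar button could open this link in a new tab.
console.log(buildOcrToolUrl("https://example.org/page.jpg", "bn"));
```

Leaving `lang` optional lets the user pick the language in the tool itself, so the button stays useful for pages with mixed scripts.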

@Samwilson Is this the ticket you mentioned in the standup?

I think this issue is resolved now with the new OCR button. It's now possible to open the 'advanced options' and enter different (or multiple) language names.

@Bodhisattwa can you confirm?

Yes, it's done now.

Samwilson claimed this task.

Thanks!