Google OCR error: "We can not access the URL currently"
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

The Google service returned an error: We can not access the URL currently. Please download the content and pass it in.

What happens?:

image.png (711×863 px, 47 KB)

What should have happened instead?:

It should be able to find/access the URL.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

The bug was found when a user tried to transcribe a page on Javanese Wikisource:

https://jv.wikisource.org/w/index.php?title=Kaca:Wulang_Basa_Jilid_2.pdf/68&action=edit&redlink=1

This is the result via Wikisource:

image.png (329×508 px, 52 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

And what does it mean by "Please download the content and pass it in."?

@Bennylin: If the displayed string is unclear, please file a separate ticket. One problem per ticket only, please. Thanks.

Aklapper renamed this task from OCR error to Google OCR error: "We can not access the URL currently".Mar 15 2023, 10:49 AM

And what does it mean by "Please download the content and pass it in."?

This is Google's error message: there are two ways for us to give them the image, one as a URL, and one by passing the image data to them. It's not very clear, I agree!

This issue does look like a duplicate of T331820. The image in question seems now to be working in the OCR tool.

One workaround for this error is to select a part of the image, e.g. just leave off a small margin around the whole thing. This causes the tool to send the data rather than the URL. (Not to say we shouldn't fix the actual problem, of course!)

(Sorry, I didn't mean a duplicate of that issue, but caused by it.)
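For anyone unfamiliar with the two ways of handing Google the image, here is a minimal sketch of the two request bodies, assuming the Vision REST v1 `images:annotate` format with `TEXT_DETECTION`. The helper names are made up for illustration, and this is not the OCR tool's actual code.

```python
import base64


def request_by_url(image_url: str) -> dict:
    """Build a request that asks Google to fetch the image from a public URL."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_url}},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }


def request_by_content(image_bytes: bytes) -> dict:
    """Build a request that carries the image data inline, base64-encoded."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [{
            "image": {"content": encoded},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }
```

The error in this task comes from the first form: Google has to download the file itself, and that download can fail. The second form avoids that step because we download the image and pass the data along.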

Google OCR cannot recognize punctuation placed outside the line in Chinese vertical text.

For example, all periods (。) were missed in https://zh.wikisource.org/wiki/Page:CADAL09006598_%E9%80%9A%E4%BF%97%E6%96%B0%E5%B0%BA%E7%89%98.djvu/6 .

How can one contact Google to fix this?

Google actually OCRs every image PDF it indexes. See the cached pages for

https://www.google.com/search?q=site%3Aupload.wikimedia.org+filetype%3Apdf+ssid

Can't we just ask Google to use that effort to help us OCR every book in https://commons.wikimedia.org/wiki/Commons:Library_back_up_project and return the OCR results to us, so that they can be improved by volunteers?

Is there any Google contact person for Wikimedia affairs?

I think the problem lies with Wikimedia Commons being slow to respond. If someone manually opens a rarely accessed book in a browser and picks a random page to view, the server might not display it immediately; it may take some time. The server has to extract the page from the PDF file, cache it as an image file, and then display it. For OCR, there should be dedicated tools that download the entire PDF file, convert it to images locally, and then send those images to Google for OCR processing.

A limitation of Google OCR has been found: it cannot recognize punctuation marks outside vertical lines. This is a common typesetting practice during the Chinese Republican era. For example, for this image, no punctuation marks were recognized. Are there any options available on Google to recognize them?

This is off topic, but where should I report this? Does this count as a bug?

It is (sort of) a bug, but there's nothing we can do about it, as it exists wholly within Google's service. The API docs are here: https://cloud.google.com/vision/docs/reference/rest/v1/Feature — there's not much in the way of configurability for text detection, beyond languageHints[].
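To illustrate how little there is to configure, here is a rough sketch of a text detection request that passes language hints. It assumes the REST v1 request format, where the hints sit under `imageContext.languageHints`; the function name and the `"zh"` value in the usage note are only examples, not settings known to help with this punctuation problem.

```python
from typing import Sequence


def text_detection_request(image_url: str, language_hints: Sequence[str]) -> dict:
    """Build an images:annotate body with optional language hints."""
    request = {
        "image": {"source": {"imageUri": image_url}},
        "features": [{"type": "TEXT_DETECTION"}],
    }
    if language_hints:
        # languageHints is the main knob available for text detection.
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}
```

For example, `text_detection_request(url, ["zh"])` would hint that the page is Chinese. Nothing in the request shape above controls layout or punctuation handling.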

Is this a cooperation between WMF and Google, or does WMF simply pay to use the Google Vision API? If it's the former, maybe we can report this to Google?

@Samwilson Can we catch the error and resend the image data ourselves?

Yes, that sounds like a reasonable fix.
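A minimal sketch of that idea, assuming the Google error surfaces as an exception whose message contains the string from this task's title. `OcrError`, `ocr_with_fallback` and the three callables are hypothetical names for illustration; this is not the code in the OCR tool.

```python
from typing import Callable

# Assumption: the error text from this task is what we match on.
URL_ERROR_MARKER = "We can not access the URL currently"


class OcrError(Exception):
    """Hypothetical error raised when the OCR service reports a failure."""


def ocr_with_fallback(
    image_url: str,
    ocr_by_url: Callable[[str], str],
    ocr_by_content: Callable[[bytes], str],
    download: Callable[[str], bytes],
) -> str:
    """Try OCR by URL first; on the URL-access error, resend the image data."""
    try:
        return ocr_by_url(image_url)
    except OcrError as err:
        if URL_ERROR_MARKER not in str(err):
            raise  # Unrelated failure: do not hide it.
        # Google could not fetch the URL, so fetch it ourselves and retry.
        return ocr_by_content(download(image_url))
```

Only the specific URL-access error triggers the retry, so other failures (bad credentials, unsupported file, and so on) still reach the user unchanged.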

The solution would be easy. Just write a bot that downloads a PDF from Commons and converts the file to JPG images locally. It would then upload every JPG to Google, get the OCRed text, and put the text on Wikisource.

It would only require two parameters from the user: the filename of the PDF (or DjVu) and the target Wikisource domain name (like zh.wikisource.org). The user should be an autoconfirmed user on the target Wikisource and should confirm that they expect the quality to be OK (avoiding handwritten manuscripts, which would give bad OCR quality).

Mass OCR is explicitly forbidden on quite a lot of language Wikisources.

So it should be up to each site to decide. The tool might have an allowlist reflecting each site's decision.

There is a discussion on zhws about OCRing all Chinese books on Commons (1.5M files).
https://zh.wikisource.org/wiki/Wikisource:%E5%86%99%E5%AD%97%E9%97%B4#OCR%E5%9C%96%E6%9B%B8%E9%A4%A8

The books contain many duplicates, and only some of them have print quality good enough for OCR. So users should be able to submit each file they choose for OCR.

PR for re-sending the full image data: https://github.com/wikimedia/wikimedia-ocr/pull/120

@wmr: the other issues you raise here are not related to the current task. Could you please create new tasks for them if you think they need addressing? Thanks!

I have created:

https://phabricator.wikimedia.org/T352503

Samwilson claimed this task.

Resolved as part of T296912.