WikisourceOCR: Google OCR is not working
Closed, Resolved · Public · 3 Estimated Story Points · Bug Report

Assigned To

Authored By

	Jayprakash12345
	Dec 2 2021, 7:58 AM

Description

Steps to reproduce

Go to https://en.wikisource.org/w/index.php?title=Page:The_Life_of_the_Spider.djvu/54&action=edit
Select Google OCR from the Transcribe text dropdown
Click on Transcribe text

Expected

The transcribed text should be added once the response arrives

Actual

Getting error "The Google service returned an error: We can not access the URL currently. Please download the content and pass it in."

Extra

URL: https://ocr.wmcloud.org/api.php?engine=google&langs[]=en&image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F6%2F6d%2FThe_Life_of_the_Spider.djvu%2Fpage55-1024px-The_Life_of_the_Spider.djvu.jpg&uselang=en

{
    "engine": "google",
    "langs": [
        "en"
    ],
    "psm": 3,
    "crop": [],
    "image_hosts": [
        "upload.wikimedia.org",
        "upload.wikimedia.beta.wmflabs.org"
    ],
    "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/The_Life_of_the_Spider.djvu/page55-1024px-The_Life_of_the_Spider.djvu.jpg",
    "uselang": "en",
    "error": "The Google service returned an error: We can not access the URL currently. Please download the content and pass it in."
}

Related Objects

Mentioned In: T332125: Google OCR error: "We can not access the URL currently"
T338100: English Wikisource OCR gadgets fails to identify text

Event Timeline

Jayprakash12345 created this task. Dec 2 2021, 7:58 AM

Restricted Application added projects: Community-Tech, User-Jayprakash12345. Dec 2 2021, 7:58 AM

Restricted Application added a subscriber: Aklapper.

Jayprakash12345 updated the task description. Dec 2 2021, 8:03 AM

It seems the image https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/The_Life_of_the_Spider.djvu/page55-1024px-The_Life_of_the_Spider.djvu.jpg is accessible to us, but not to Google.

The same error results when I try this request via the API explorer tool: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate?apix_params=%7B%22resource%22%3A%7B%22requests%22%3A%5B%7B%22image%22%3A%7B%22source%22%3A%7B%22imageUri%22%3A%22https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F6%2F6d%2FThe_Life_of_the_Spider.djvu%2Fpage55-1024px-The_Life_of_the_Spider.djvu.jpg%22%7D%7D%2C%22features%22%3A%5B%7B%22type%22%3A%22TEXT_DETECTION%22%7D%5D%7D%5D%7D%7D

Google might have attempted to load the image before its thumb had been rendered, and then cached some sort of not-found status when it didn't get it fast enough. It's probably worth waiting a little while and seeing if this resolves itself.

Just a note: according to Wikimedian Guntupalli Rameswaram, it has already not been working for the past two days.

Ruthven subscribed. Dec 2 2021, 2:23 PM

It looks like it's working correctly now. I've tested a few different pages from that book, and they're all okay.

If this happens again and causes issues, one of the following mitigations (or something else) might be in order:

On such an API failure, conclude that a Google-side cache has "gone bad" and fall back to downloading the image and passing it in as binary data.
Before making such an API call, poll the image URL with HTTP HEAD requests and don't fire the Google request until one has returned 200, which means the image has rendered and will now be in the Thumbor cache.
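
The second mitigation could be sketched roughly as below. This is only an illustration, not the OCR tool's actual code: `head_status`, `wait_until_rendered`, the retry parameters, and the User-Agent string are all hypothetical names and values chosen for the example.

```python
import time
import urllib.error
import urllib.request


def head_status(url, user_agent="ocr-tool-example/0.1 (placeholder contact)"):
    """Return the HTTP status of a HEAD request to `url` (hypothetical helper)."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (e.g. 403, 404, 429) arrive as exceptions.
        return err.code


def wait_until_rendered(url, status_fn=head_status, attempts=5, delay=1.0,
                        sleep=time.sleep):
    """Poll `url` until it answers 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if status_fn(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

If `wait_until_rendered` returns False, the caller would presumably skip the URL-based request and fall back to the first mitigation, sending the image bytes instead.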

Also, it's possible this is because the Google IP is making so many image requests for "heavy" images (i.e. from PDFs or DjVus) that the renderfile-nonstandard rate limit of 140 files/minute is being hit and Google is repeatedly getting 429s.

It occurs to me that if the Google server is hitting the API for an image that isn't rendered, if it's using a default user agent, it would get a 403 telling it to set a user agent. And it wouldn't be too unreasonable to cache that failure for a bit as they might want to avoid hammering a server that's apparently not interested.

Hitting the image thumbnail URL might seem safe because the user has already loaded it to view the page, but the next page result is also pre-fetched, so it's actually quite possible for Google to be hitting uncached images.

In T296912#7578757, @Inductiveload wrote:

It occurs to me that if the Google server is hitting the API for an image that isn't rendered, if it's using a default user agent, it would get a 403 telling it to set a user agent. And it wouldn't be too unreasonable to cache that failure for a bit as they might want to avoid hammering a server that's apparently not interested.

Hitting the image thumbnail URL might seem safe because the user has already loaded it to view the page, but the next page result is also pre-fetched, so it's actually quite possible for Google to be hitting uncached images.

I use pywikibot with Wikisource, and the error message in my app is:

WARNING: Http response status 400

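The Python code is:

from pywikibot.proofreadpage import ProofreadPage

# `page` is an existing ProofreadPage; retry until the Google OCR call succeeds.
i = 0
flag = True
while flag:
    try:
        y = page.ocr(ocr_tool=ProofreadPage._GOOGLE_OCR)
        flag = False
    except Exception:
        i += 1
        print("Retrying, attempt {}...".format(i))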

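Sometimes it works after a few tries, other times it takes a great many. I can see that the cached image is available on the server, which I checked with:

import io
from urllib.request import urlopen

from PIL import Image
from pywikibot.proofreadpage import ProofreadPage

# `page` is an existing ProofreadPage.
i = 0
flag = True
URL = None
while flag:
    try:
        # Download and open the thumbnail first to confirm it is available.
        URL = page.url_image
        data_stream = io.BytesIO(urlopen(URL).read())
        imagen = Image.open(data_stream)
        y = page.ocr(ocr_tool=ProofreadPage._GOOGLE_OCR)
        flag = False
    except Exception:
        i += 1
        print(URL)
        print("Retrying, attempt {}...".format(i))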

It's also possible Google is hammering the thumbnailer for other purposes from the same IP and the 429s aren't anything we can control (not that we know that Google is getting 429s...)

In any case, always downloading the image to the OCR service and passing it directly in the request would completely circumvent the issue.
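
As a rough illustration of what "passing it directly" means for the request body: the Vision `images:annotate` endpoint accepts base64-encoded image data in an `image.content` field in place of `image.source.imageUri`. The helper name `build_annotate_body` below is hypothetical and this is a sketch, not the tool's implementation (which is written in PHP).

```python
import base64


def build_annotate_body(image_bytes, feature_type="TEXT_DETECTION"):
    """Build an images:annotate request body that embeds the image inline.

    Hypothetical helper: the image is sent as base64 `content`, so Google
    never has to fetch the thumbnail URL itself.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": feature_type}],
            }
        ]
    }
```

The returned dictionary would then be serialized as JSON and posted to the API with whatever credentials the service already uses.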

@Inductiveload gave me a sample file that is having issues so I could look through the thumbor logs: https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg

I found two log entries which indicate that thumbor generated the thumbnail fine:

/var/log/haproxy/haproxy.log.3.gz:Dec 18 10:59:17 thumbor2004 haproxy[36337]: 10.64.0.38:48776 [18/Dec/2021:10:59:16.298] thumbor thumbor/server8802 0/0/0/1381/1451 200 142161 - - CL-- 6/5/3/0/0 0/0 "GET /wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg HTTP/1.1"
/var/log/haproxy/haproxy.log.3.gz:Dec 18 10:59:16 thumbor1005 haproxy[2322]: 10.64.0.38:46096 [18/Dec/2021:10:59:15.416] thumbor thumbor/server8803 0/0/0/826/827 200 142158 - - ---- 5/4/2/0/0 0/0 "GET /wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg HTTP/1.1"

Any 404s *should* have shown up in that search...
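
For anyone repeating this kind of search, a status tally over haproxy lines could be scripted along these lines. The regular expression is an assumption inferred from the two log lines above (five slash-separated timers, then the status code and byte count), and `count_statuses` is a hypothetical helper.

```python
import re
from collections import Counter

# Five slash-separated timers, then the HTTP status and the byte count.
# This layout is inferred from the sample lines, not from a format spec.
STATUS_RE = re.compile(r" -?\d+/-?\d+/-?\d+/-?\d+/-?\d+ (\d{3}) \d+ ")


def count_statuses(lines):
    """Count HTTP status codes found in an iterable of haproxy log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Looking up `"404"` or `"429"` in the returned counter shows whether any such responses appear in the lines that were searched.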

AIUI now, any future requests for that specific thumbnail will only hit Varnish/Swift, not thumbor. We only have 1/128 sampled logs for Varnish requests, but generally if someone is going fast enough to be triggering 429 errors they'll show up in the log at least once. So if someone knows the specific IP (or range) or user-agent used by this tool, I can try to grep for it (anyone with logstash should be able to IIRC) to confirm that it's 429s. But I'd have hoped Google could give a more useful error message here?

Trying it on the Google API Explorer thing is, sadly, no more instructive:

{
  "responses": [
    {
      "error": {
        "code": 4,
        "message": "We can not access the URL currently. Please download the content and pass it in."
      }
    }
  ]
}

However, this should be trivial to circumvent in src/Engine/GoogleCloudVisionEngine.php by changing:

$image = $this->getImage($imageUrl, $crop);

to:

$image = $this->getImage($imageUrl, $crop, self::DO_DOWNLOAD_IMAGE);

This will force an image download to the OCR tool, which will then include the image data in the request rather than the URL. This definitely does work, because if you specify an image crop, it always downloads the image first to do the crop, and that is working fine.

Oh, I didn't realize this was a Google Cloud thing. At the Varnish layer, we have different rate limits for "public clouds", which are somewhat dynamic and ever-changing in response to constant (D)DoS attacks from them. If downloading the thumb to Cloud/Toolforge and uploading that to Google is something that works, I'd recommend that instead.

According to https://github.com/googleapis/nodejs-vision/issues/270#issuecomment-481064953 Google also does outbound rate limiting, so it really could be a lot of things going wrong. Really seems like the only real fix is to upload the image to Google.

JMcLeod_WMF moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. Apr 25 2022, 2:20 PM

KSiebert moved this task from Needs Discussion to Engineering Backlog on the Community-Tech board. Apr 27 2022, 9:08 AM

KSiebert set the point value for this task to 3.

TheresNoTime changed the subtype of this task from "Task" to "Bug Report". Aug 4 2022, 6:14 PM

Soda subscribed. Oct 27 2022, 2:05 PM

JMcLeod_WMF removed a project: Community-Tech. Feb 28 2023, 12:23 AM

KTT-Commons mentioned this in T338100: English Wikisource OCR gadgets fails to identify text. Jun 4 2023, 2:03 PM

Samwilson moved this task from Backlog to Google on the Wikimedia OCR board. Sep 15 2023, 5:39 AM

Samwilson mentioned this in T332125: Google OCR error: "We can not access the URL currently". Nov 30 2023, 8:03 AM

A patch to add a workaround for this: https://github.com/wikimedia/wikimedia-ocr/pull/120

The above has been merged for a while, and I think things have improved.
