WikisourceOCR: Google OCR is not working
Open, Needs Triage · Public · 3 Estimated Story Points · Bug Report

Description

Steps to reproduce

Expect

Text should be added once the response is received.

Actual

Getting error "The Google service returned an error: We can not access the URL currently. Please download the content and pass it in."

Extra

URL: https://ocr.wmcloud.org/api.php?engine=google&langs[]=en&image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F6%2F6d%2FThe_Life_of_the_Spider.djvu%2Fpage55-1024px-The_Life_of_the_Spider.djvu.jpg&uselang=en

{
    "engine": "google",
    "langs": [
        "en"
    ],
    "psm": 3,
    "crop": [],
    "image_hosts": [
        "upload.wikimedia.org",
        "upload.wikimedia.beta.wmflabs.org"
    ],
    "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/The_Life_of_the_Spider.djvu/page55-1024px-The_Life_of_the_Spider.djvu.jpg",
    "uselang": "en",
    "error": "The Google service returned an error: We can not access the URL currently. Please download the content and pass it in."
}
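
For anyone reproducing this from a script, the request above can be rebuilt and the response checked roughly as follows. This is only a sketch: `build_ocr_url` and `ocr_error` are names made up for illustration, the endpoint and parameters are copied from the URL above, and it assumes a failed request carries an "error" key as this response does.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://ocr.wmcloud.org/api.php"  # taken from the URL in this report


def build_ocr_url(image_url, engine="google", langs=("en",), uselang="en"):
    """Build the api.php query URL (hypothetical helper, not part of the tool)."""
    params = [("engine", engine)]
    params += [("langs[]", lang) for lang in langs]
    params += [("image", image_url), ("uselang", uselang)]
    return API_ENDPOINT + "?" + urlencode(params)


def ocr_error(response_text):
    """Return the "error" string from a JSON response body, or None if absent."""
    return json.loads(response_text).get("error")


image = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/"
    "The_Life_of_the_Spider.djvu/page55-1024px-The_Life_of_the_Spider.djvu.jpg"
)
url = build_ocr_url(image)
print(url)

# Round-trip check: decoding the query string gives back the original image URL.
decoded = parse_qs(urlsplit(url).query)
print(decoded["image"][0] == image)
```

`urlencode` percent-encodes the brackets in `langs[]`, so the printed URL differs in spelling from the one above but decodes to the same parameters.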

Event Timeline

Restricted Application added a subscriber: Aklapper.

Just a note: per Wikimedian Guntupalli Rameswaram, it has already not been working for the past two days.

It looks like it's working correctly now. I've tested a few different pages from that book, and they're all okay.

If this happens again and causes issues, one of the following mitigations (or something else) might be in order:

  • On such an API failure, conclude that a Google-side cache has "gone bad" and fall back to downloading the image and passing it in as binary data
  • Before making such an API call, poll the image with HTTP HEAD requests and don't fire the Google request until one has returned 200, which means the image has rendered and will now be in the Thumbor cache.
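
Both mitigations could be sketched along these lines. All function names here are hypothetical, and `ocr_by_url`, `ocr_by_bytes` and `download` stand in for whatever the tool really uses to call Google and fetch images. The sketch also assumes a URL-fetch failure surfaces as a `RuntimeError`, which may not match the real client.

```python
import time
import urllib.error
import urllib.request


def thumbnail_is_ready(url, opener=urllib.request.urlopen):
    """Return True if a HEAD request for the thumbnail answers 200."""
    # A real client should also send a descriptive User-Agent header.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request) as response:
            return response.status == 200
    except urllib.error.URLError:  # HTTPError (404, 429, ...) is a subclass
        return False


def wait_for_thumbnail(url, attempts=5, delay=1.0, is_ready=thumbnail_is_ready):
    """Poll until the thumbnail is rendered; give up after `attempts` tries."""
    for attempt in range(attempts):
        if is_ready(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def ocr_with_fallback(url, ocr_by_url, ocr_by_bytes, download):
    """Try OCR by URL first; on failure, download the image and send the bytes."""
    try:
        return ocr_by_url(url)
    except RuntimeError:
        # Assumption: the URL-based call raises RuntimeError when Google
        # reports that it cannot access the URL.
        return ocr_by_bytes(download(url))
```

The fallback costs one extra download only when the URL-based call fails, whereas the HEAD poll adds a round trip to every request.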

Also, it's possible this is because the Google IP is making so many image requests for "heavy" images (i.e. from PDFs or DjVus) that the renderfile-nonstandard rate limit of 140 files/minute is being hit and Google is repeatedly getting 429s.

It occurs to me that if the Google server is hitting the API for an image that isn't rendered, if it's using a default user agent, it would get a 403 telling it to set a user agent. And it wouldn't be too unreasonable to cache that failure for a bit as they might want to avoid hammering a server that's apparently not interested.
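
We can't control which user agent Google sends, but if the tool ends up downloading the image itself, it can at least set one on its own requests. A minimal sketch with the standard library follows. The `USER_AGENT` value and the `make_image_request` helper are placeholders, not what the tool actually sends.

```python
import urllib.request

# Hypothetical identifier; a real tool would put its own name and contact here.
USER_AGENT = "ExampleOCRTool/0.1 (https://example.org/contact)"


def make_image_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = make_image_request(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/"
    "The_Life_of_the_Spider.djvu/page55-1024px-The_Life_of_the_Spider.djvu.jpg"
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```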

Hitting the image thumbnail URL might seem safe because the user has already loaded it to view the page, but the next page result is also pre-fetched, so it's actually quite possible for Google to be hitting uncached images.

I use pywikibot with Wikisource, and the error message my app gets is:

WARNING: Http response status 400

The Python code is:

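from pywikibot.proofreadpage import ProofreadPage

# `page` is an existing ProofreadPage instance.
i = 0
flag = True
while flag:
    try:
        y = page.ocr(ocr_tool=ProofreadPage._GOOGLE_OCR)
        flag = False
    except Exception:
        i += 1
        print("Retrying, attempt {}...".format(i))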

Sometimes it works after a few tries; other times it takes a great many. I can see that the cached image is available, which I checked with:

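import io
from urllib.request import urlopen

from PIL import Image
from pywikibot.proofreadpage import ProofreadPage

# `page` is an existing ProofreadPage instance.
URL = page.url_image
i = 0
flag = True
while flag:
    try:
        # Download and open the image first to confirm it is available.
        data_stream = io.BytesIO(urlopen(URL).read())
        imagen = Image.open(data_stream)
        y = page.ocr(ocr_tool=ProofreadPage._GOOGLE_OCR)
        flag = False
    except Exception:
        i += 1
        print(URL)
        print("Retrying, attempt {}...".format(i))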
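
Since these loops retry forever with no pause, a bounded retry with a growing delay might be kinder to the service while this is unresolved. A sketch follows; the `retry` helper and its defaults are illustrative and not part of pywikibot.

```python
import time


def retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With the code above it would be called as `y = retry(lambda: page.ocr(ocr_tool=ProofreadPage._GOOGLE_OCR))`.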

It's also possible Google is hammering the thumbnailer for other purposes from the same IP and the 429s aren't anything we can control (not that we know that Google is getting 429s...)

In any case, always downloading the image to the OCR service and passing it directly in the request would completely circumvent the issue.
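
For reference, the difference between the two request styles looks roughly like this at the REST level. This is a sketch under assumptions: the field names follow the Cloud Vision `images:annotate` JSON format as I understand it, `vision_request_body` is a made-up helper, and the `DOCUMENT_TEXT_DETECTION` feature type is a guess at what the tool requests.

```python
import base64
import json


def vision_request_body(image_bytes=None, image_url=None, langs=("en",)):
    """Build an images:annotate body with inline bytes or a remote URL."""
    if (image_bytes is None) == (image_url is None):
        raise ValueError("pass exactly one of image_bytes or image_url")
    if image_bytes is not None:
        # Inline content: Google never has to fetch anything from Wikimedia.
        image = {"content": base64.b64encode(image_bytes).decode("ascii")}
    else:
        # Remote source: Google's servers fetch the URL themselves.
        image = {"source": {"imageUri": image_url}}
    return {
        "requests": [{
            "image": image,
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],  # assumed type
            "imageContext": {"languageHints": list(langs)},
        }]
    }


body = vision_request_body(image_bytes=b"\xff\xd8\xff")  # placeholder bytes
print(json.dumps(body, indent=2))
```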

@Inductiveload gave me a sample file that is having issues so I could look through the thumbor logs: https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg

I found two log entries which indicate that thumbor generated the thumbnail fine:

/var/log/haproxy/haproxy.log.3.gz:Dec 18 10:59:17 thumbor2004 haproxy[36337]: 10.64.0.38:48776 [18/Dec/2021:10:59:16.298] thumbor thumbor/server8802 0/0/0/1381/1451 200 142161 - - CL-- 6/5/3/0/0 0/0 "GET /wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg HTTP/1.1"
/var/log/haproxy/haproxy.log.3.gz:Dec 18 10:59:16 thumbor1005 haproxy[2322]: 10.64.0.38:46096 [18/Dec/2021:10:59:15.416] thumbor thumbor/server8803 0/0/0/826/827 200 142158 - - ---- 5/4/2/0/0 0/0 "GET /wikipedia/commons/thumb/9/95/Gospel_of_Saint_Matthew_in_West-Saxon.djvu/page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg HTTP/1.1"

Any 404s *should* have shown up in that search...
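
If anyone wants to scan more of these lines for non-200 responses, a small parser like the following might help. It is a sketch that only assumes the HAProxy HTTP log layout visible in the two entries above; the regex and the `parse_haproxy_line` name are mine.

```python
import re

# Field layout follows the HAProxy HTTP log lines quoted above.
HAPROXY_LINE = re.compile(
    r'haproxy\[\d+\]: (?P<client>\S+) \[(?P<accepted>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>[\d/+-]+) (?P<status>\d{3}) (?P<bytes>\d+) '
    r'.*"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)"'
)


def parse_haproxy_line(line):
    """Return the interesting fields of one log line, or None if no match."""
    match = HAPROXY_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    # The last timer is the total request time in milliseconds.
    fields["total_ms"] = int(fields["timers"].split("/")[-1].lstrip("+"))
    return fields


sample = (
    'Dec 18 10:59:17 thumbor2004 haproxy[36337]: 10.64.0.38:48776 '
    '[18/Dec/2021:10:59:16.298] thumbor thumbor/server8802 0/0/0/1381/1451 '
    '200 142161 - - CL-- 6/5/3/0/0 0/0 "GET /wikipedia/commons/thumb/9/95/'
    'Gospel_of_Saint_Matthew_in_West-Saxon.djvu/'
    'page53-646px-Gospel_of_Saint_Matthew_in_West-Saxon.djvu.jpg HTTP/1.1"'
)
entry = parse_haproxy_line(sample)
print(entry["server"], entry["status"], entry["total_ms"])
```

Filtering on `entry["status"] != 200` would then surface any 404s or 429s that reached thumbor.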

AIUI now, any future requests for that specific thumbnail will only hit Varnish/Swift, not thumbor. We only have 1/128 sampled logs for Varnish requests, but generally if someone is going fast enough to be triggering 429 errors they'll show up in the log at least once. So if someone knows the specific IP (or range) or user-agent used by this tool, I can try to grep for it (anyone with logstash should be able to IIRC) to confirm that it's 429s. But I'd have hoped Google could give a more useful error message here?

Trying it on the Google API Explorer thing is, sadly, no more instructive:

{
  "responses": [
    {
      "error": {
        "code": 4,
        "message": "We can not access the URL currently. Please download the content and pass it in."
      }
    }
  ]
}
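
One thing the response does give us is a machine-readable code, which a client could use to decide when to retry with an upload. A sketch follows. Code 4 is DEADLINE_EXCEEDED in the `google.rpc.Code` enum, but treating it as the trigger is my assumption, and both function names are hypothetical.

```python
import json

# google.rpc.Code value 4 is DEADLINE_EXCEEDED.
DEADLINE_EXCEEDED = 4


def response_errors(body_text):
    """Return (code, message) pairs for every per-image error in a response."""
    body = json.loads(body_text)
    return [
        (item["error"].get("code"), item["error"].get("message", ""))
        for item in body.get("responses", [])
        if "error" in item
    ]


def should_retry_with_upload(body_text):
    """True if any error suggests Google could not fetch the image URL."""
    # Assumption: code 4 is what Google returns when its URL fetch fails.
    return any(code == DEADLINE_EXCEEDED for code, _ in response_errors(body_text))
```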

However, this should be trivial to circumvent in src/Engine/GoogleCloudVisionEngine.php:

$image = $this->getImage($imageUrl, $crop);

to

$image = $this->getImage($imageUrl, $crop, self::DO_DOWNLOAD_IMAGE);

This will force an image download to the OCR tool, and it'll then include that in the request rather than the URL. This definitely does work, because if you specify an image crop, it always downloads the image first to do the crop, and that is working fine.

Oh, I didn't realize this was a Google Cloud thing. At the Varnish layer, we have different rate limits for "public clouds", which are somewhat dynamic and ever-changing in response to constant (D)DoS attacks from them. If downloading the thumb to Cloud/Toolforge and uploading that to Google is something that works, I'd recommend that instead.

According to https://github.com/googleapis/nodejs-vision/issues/270#issuecomment-481064953 Google also does outbound rate limiting, so it really could be a lot of things going wrong. Really seems like the only real fix is to upload the image to Google.

KSiebert set the point value for this task to 3.
TheresNoTime changed the subtype of this task from "Task" to "Bug Report". · Aug 4 2022, 6:14 PM