
OCR requests that return a 500 error fail with a CORS error
Closed, Resolved | Public | 2 Estimated Story Points | BUG REPORT

Assigned To
Authored By
Daimona
Jun 25 2021, 12:28 PM
Referenced Files
F34530252: invalid_url.png
Jun 28 2021, 2:37 PM
F34530247: invalid_language.png
Jun 28 2021, 2:37 PM
F34530250: invalid_psm.png
Jun 28 2021, 2:37 PM
F34530256: google_error.png
Jun 28 2021, 2:37 PM

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Go here
  • Enable the OCR tool via the console: $('.ext-wikisource-ExtractTextWidget').show()
  • Run the OCR and observe the console

What happens?:
A CORS error is logged, because the server responds with a 500 when the language "nap" is not supported. Other engineers and I had already hit this bug when testing locally. A user-facing error is also shown: "No text was returned by the OCR tool."

What should have happened instead?:
There should be no CORS error.

Event Timeline

Setting to High, since this might be broken on several Wikisources.

Daimona raised the priority of this task from High to Needs Triage. Jun 25 2021, 12:33 PM

Hmm actually we already show a user-facing error. So the high-prio task is T285544. Also updating task description to reflect this.

Daimona renamed this task from "OCR requests that return a 500 error fail silently with a CORS error" to "OCR requests that return a 500 error fail with a CORS error". Jun 25 2021, 12:33 PM
Daimona updated the task description.
Daimona set the point value for this task to 2.

In the backend, I was only able to get a CORS error when I got a 500 (e.g. a timeout). The access-control-allow-origin header was not set. Is this ok @Daimona?
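For context: a browser refuses to expose a cross-origin response to the calling script when the Access-Control-Allow-Origin header is missing, whatever the status code, so a 500 sent without it shows up in the console as a CORS error rather than as a server error. If error responses ever needed to be readable by the frontend, one option would be to build the headers in a single place for both success and error paths. The following is only a minimal sketch in JavaScript, not the OCR tool's actual code: the helper names, the JSON error shape, and the wildcard origin are all assumptions made for illustration.

```javascript
// Hypothetical helpers; none of these names come from the OCR tool's codebase.

// Build the headers for ANY response (success or error) in one place, so the
// CORS header cannot be forgotten on an error path such as a 500.
function buildResponseHeaders(allowedOrigin, extraHeaders = {}) {
  return {
    'Content-Type': 'application/json',
    ...extraHeaders,
    // Set last so that extraHeaders cannot accidentally override it.
    'Access-Control-Allow-Origin': allowedOrigin,
  };
}

// Assumed error body shape: { "error": "<message>" }.
function buildErrorResponse(status, message, allowedOrigin) {
  return {
    status,
    headers: buildResponseHeaders(allowedOrigin),
    body: JSON.stringify({ error: message }),
  };
}

// Example: an error response that the browser would still let the caller read.
const response = buildErrorResponse(
  500,
  'The tesseract engine returned an internal error.',
  '*' // assumption: a public API that allows any origin
);
console.log(response.status, response.headers['Access-Control-Allow-Origin']);
```

With the header present, the frontend can read the status and body of the failed request instead of seeing only an opaque network failure.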

In the frontend, we now show a wider variety of error messages to the user, rather than just "No text was returned by the OCR tool." For example:

  1. "The tesseract engine returned an internal error." Shown if, for example, an invalid PSM parameter is passed to tesseract. I think it is unlikely that a user will be able to provoke this error.
    invalid_psm.png (740×1 px, 551 KB)
  2. "The following language is not supported by the OCR engine: yy" Shown when an invalid language is passed, as in T285544.
    invalid_language.png (741×1 px, 552 KB)
  3. "Image URL must begin with one of the following domain names and end with a valid file extension: upload.wikimedia.org and upload.wikimedia.beta.wmflabs.org" Shown if the image URL is invalid. I think it is unlikely that a user will be able to provoke this.
    invalid_url.png (740×1 px, 560 KB)
  4. "The Google service returned an error: We can not access the URL currently. Please download the content and pass it in." Shown for an error with the Google Vision API, e.g. a timeout, which you might be able to see if you go here and choose the Google OCR engine.
    google_error.png (701×1 px, 133 KB)

We still show "No text was returned by the OCR tool." in some cases, including when there is no text to extract and when the tool returns a 500 error (e.g. when tesseract times out).
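The behaviour described above (a specific message when the tool reports one, the generic message otherwise) could be sketched roughly as follows. This is an illustration only, not the extension's actual code: the function name and the response shapes ({ text } on success, { error } on a handled failure) are assumptions.

```javascript
// Generic fallback text quoted in this task.
const GENERIC_MESSAGE = 'No text was returned by the OCR tool.';

// Hypothetical helper. Assumed body shapes (not confirmed by this task):
//   success:         { text: "..." }
//   handled failure: { error: "..." }
//   unhandled 500:   no usable body at all (e.g. a tesseract timeout)
// Returns the message to show to the user, or null if there is nothing to report.
function getUserMessage(status, body) {
  // Prefer a specific error message whenever the tool supplied one.
  if (body && typeof body.error === 'string' && body.error !== '') {
    return body.error;
  }
  // Successful request with extracted text: no error to display.
  if (status >= 200 && status < 300 && body && body.text) {
    return null;
  }
  // Empty result or unreadable failure: fall back to the generic message.
  return GENERIC_MESSAGE;
}

console.log(getUserMessage(500, null));
```

Keeping the fallback in one place means any new failure mode that lacks a specific message still produces a user-facing error.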

@NRodriguez The description of T281767 states:

  • Note: This ticket does not include work to distinguish between different error types, such as timeout errors vs. formatting or other errors. We will just implement a general message that can apply to most cases and, if we want to further fine-tune the messaging, that can be tackled in a separate ticket.

We now do distinguish between different types of error (see above). We might want to review this, perhaps in a separate ticket. I am not sure how helpful some of the messages are (e.g. the Google service error), and they are a little hard to read because they disappear quite quickly.

Test environment: https://en.wikisource.beta.wmflabs.org Wikisource – (a168c1a) 07:23, 28 June 2021. https://ocr-test.wmcloud.org Version 0.6.0-5-g81a7edc.

> In the backend, I was only able to get a CORS error when I got a 500 (e.g. a timeout). The access-control-allow-origin header was not set. Is this ok @Daimona?

I think I only tested this via the frontend, but yeah, I think this is fine.

> In the frontend, we now show a wider variety of error messages to the user rather than just "No text was returned by the OCR tool."

Ohhh right, this is a nice side effect!

Thanks for your close attention to detail!