Wikimedia OCR fails with 400 status
Closed, Resolved · Public · BUG REPORT

Description

On this page (and all other pages I checked at random), Wikimedia OCR returns a 400 HTTP status and the error "Image URL must begin with one of the following domain names and end with a valid file extension: upload.wikimedia.org and upload.wikimedia.beta.wmflabs.org".

The returned JSON is:

{
  "engine":"tesseract", "langs":["en"], "psm":3, "crop":[],
  "image_hosts":["upload.wikimedia.org","upload.wikimedia.beta.wmflabs.org"],
  "image":"\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/e\/e5\/Ling-Nam%3B_or%2C_Interior_views_of_southern_China%2C_including_explorations_in_the_hitherto_untraversed_island_of_Hainan_%28IA_cu31924023225307%29.pdf\/page160-1450px-Ling-Nam%3B_or%2C_Interior_views_of_southern_China%2C_including_explorations_in_the_hitherto_untraversed_island_of_Hainan_%28IA_cu31924023225307%29.pdf.jpg",
  "uselang":"en",
  "error":"Image URL must begin with one of the following domain names and end with a valid file extension: upload.wikimedia.org and upload.wikimedia.beta.wmflabs.org"
}

Since the image URL actually matches the whitelist (modulo the leading //), it's possible the returned error is itself a bug triggered by an underlying problem.

In any case, from the user side this one is kinda "UBN"-y.

Edit: Oh, I should add, first reported at s:WS:S/H#Error_from_the_OCR_tool??.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Looks like it's because it's sending a protocol-relative URL (i.e. starting with //), and the OCR tool is looking for a leading https.

Possibly related to the work for T298663. @Soda do you have any ideas?

Probably quickest fix is to update that URL regex.
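For context, a minimal sketch of what that quick fix might look like, assuming the tool validates with a preg_match over interpolated $hostRegex/$formatRegex alternations as quoted later in this task (the function name here is illustrative, not the tool's actual API):

<?php
// Hypothetical validation helper. $hostRegex and $formatRegex are assumed
// to be pre-built alternations such as
// "upload\.wikimedia\.org|upload\.wikimedia\.beta\.wmflabs\.org".
function isAllowedImageUrl( string $url, string $hostRegex, string $formatRegex ): bool {
    // The scheme is optional, so protocol-relative URLs (leading "//") pass;
    // ^ and $ anchor the match so an allowed host must start the URL and a
    // valid file extension must end it.
    return (bool)preg_match( "/^(https?:)?\/\/($hostRegex)\/.+($formatRegex)$/", $url );
}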

Checked on the web interface (ocr.wmcloud.org) by manually adding "https:" to the URL and it worked fine, so it does indeed seem to be just a dumb regex match problem. Easy bugs are the bestest bugs! :-)

Easy but maybe some weird bits to it...

Looks like it's not enough to only accept URLs without https, because it then fails to fetch (with "Invalid URL: scheme is missing in '//upload.wikimedia.org/wikiped…'"). I'll update the above patch.
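The extra step, roughly (a sketch only; the function name is illustrative):

<?php
// Hypothetical normalizer: an HTTP client needs an absolute URL, so a
// protocol-relative "//host/path" URL gets an explicit https: scheme
// prepended before the image is fetched.
function normalizeImageUrl( string $url ): string {
    if ( str_starts_with( $url, '//' ) ) {
        return 'https:' . $url;
    }
    return $url;
}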

There's still an issue (with Firefox at any rate) of the form field (which is of type url) not accepting protocol-relative URLs. That's sort of fine, but the problem is that the advanced options link will prefill that field with a value that means the form can't be submitted. I'll update the patch (although I'm going to be AFK for a few hours now so if someone else wants to jump on this then go for it).

I think we can use the pattern="/regex/" to bypass the default URL checking?

> Looks like it's because it's sending a protocol-relative URL (i.e. starting with //), and the OCR tool is looking for a leading https.
>
> Possibly related to the work for T298663. @Soda do you have any ideas?
>
> Probably quickest fix is to update that URL regex.

We should also add some code to the ProofreadPage implementation to validate and normalize the URLs, I feel. This API is user-exposed, and it would be better to have consistent output instead of just praying that the combination of MediaWiki + Thumbor/InstantCommons + OpenSeadragon magically gives us a consistent URL.
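MediaWiki core already has a helper for expanding protocol-relative URLs, so the normalization could be as small as the following sketch (where exactly ProofreadPage builds the image URL is assumed here, and $thumbUrl is an illustrative variable):

<?php
// Sketch: canonicalize a possibly protocol-relative thumbnail URL before
// exposing it through the API. wfExpandUrl() with PROTO_CANONICAL expands
// "//upload.wikimedia.org/…" to "https://upload.wikimedia.org/…".
$imageUrl = wfExpandUrl( $thumbUrl, PROTO_CANONICAL );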

> There's still an issue (with Firefox at any rate) of the form field (which is of type url) not accepting protocol-relative URLs.

Protocol-relative (or rather, all non-absolute) URLs are not permitted in type=url fields by the HTML forms standard, because relative URLs only make sense within the context of a rendered web page in a web browser (the browser normalizes them to an absolute URL using the context of the current web page).

> I think we can use the pattern="/regex/" to bypass the default URL checking?

Nope. pattern=… can only further constrain an already valid absolute URL.

> We should also add some code to the ProofreadPage implementation to validate and normalize the URLs, I feel.

Indeed. While in the JS API it can often be useful or necessary to get access to protocol (or even host) relative URLs—because user scripts and Gadgets do operate in the context of the rendered web page—the second we transport those URLs outside that context (e.g. to ocr.wmcloud, or Toolforge, or...) the URLs need to be canonical (absolute).

Change 866279 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/extensions/ProofreadPage@master] Normalize URLs before exposing them via the Openseadragon API

https://gerrit.wikimedia.org/r/866279

Change 866279 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Normalize URLs before exposing them via the Openseadragon API

https://gerrit.wikimedia.org/r/866279

I've updated the OCR tool to accept these URLs now, and released version 1.0.7 with the fix. The OCR button should be working again. The fix on the Wikisource side will roll out next week in the normal train.

@Samwilson The regex "/(https?:)?\/\/($hostRegex)\/.+($formatRegex)$/" will accept any URL that merely contains //, not only URLs that begin with it. For example,
https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=en&image=foobar://upload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fe%2Fe5%2FLing-Nam%253B_or%252C_Interior_views_of_southern_China%252C_including_explorations_in_the_hitherto_untraversed_island_of_Hainan_%2528IA_cu31924023225307%2529.pdf%2Fpage160-1450px-Ling-Nam%253B_or%252C_Interior_views_of_southern_China%252C_including_explorations_in_the_hitherto_untraversed_island_of_Hainan_%2528IA_cu31924023225307%2529.pdf.jpg&uselang=en
which returns a 500 error.

It might be fixed by anchoring the start of the pattern: /^(https?:)?\/\/($hostRegex)\/.+($formatRegex)$/
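To illustrate the difference, a quick check with assumed alternations for the two production hosts (the $formatRegex value is an illustrative subset of the real extension list):

<?php
$hostRegex   = 'upload\.wikimedia\.org|upload\.wikimedia\.beta\.wmflabs\.org';
$formatRegex = '\.jpg|\.png'; // illustrative subset

$url = 'foobar://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg';

// Unanchored: matches, because "//upload.wikimedia.org/…" appears as a
// substring right after the bogus "foobar:" scheme.
var_dump( preg_match( "/(https?:)?\/\/($hostRegex)\/.+($formatRegex)$/", $url ) );  // int(1)

// Anchored with ^: the optional scheme group must match at the start of the
// string, so "foobar:" is rejected.
var_dump( preg_match( "/^(https?:)?\/\/($hostRegex)\/.+($formatRegex)$/", $url ) ); // int(0)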

Should we accept URLs without https:// or //, i.e. just upload.wikimedia.org/...? I don't know what changed to make it start requesting URLs with just //, nor whether it might start requesting URLs without // in the future.

The error message when you enter an invalid URL is: "Image URL must begin with one of the following domain names and end with a valid file extension: upload.wikimedia.org and upload.wikimedia.beta.wmflabs.org". This is not technically correct (the URL begins with a scheme or //, not with the domain name itself). Should the message be updated?

This is a good point, and the change I made to the regex wasn't actually required (the URLs are normalized before that validation step). I've made a new PR to fix it: https://github.com/wikimedia/wikimedia-ocr/pull/62
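That ordering, sketched (building on the illustrative normalizeImageUrl() above; the exact exception class and message handling are assumptions, not the tool's real code):

<?php
// Assumed order of operations: normalize first, then validate, so the
// validation regex only ever sees absolute URLs and can stay strict.
$url = normalizeImageUrl( $imageUrl ); // prepends "https:" to "//…" URLs
if ( !preg_match( "/^https?:\/\/($hostRegex)\/.+($formatRegex)$/", $url ) ) {
    // Reject with the existing user-facing error message.
    throw new InvalidArgumentException( 'Image URL must begin with one of the following domain names and end with a valid file extension: …' );
}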

With this new change, the error message can now stay the same, I think.