Wikimedia OCR: "-" and "_" being stripped from language codes
Closed, Resolved | Public | 1 Estimated Story Points | BUG REPORT

Description

What is the problem?

Languages like sr-Latn and Canadian_Aboriginal are being treated as invalid because the - and _ characters in their codes are being stripped.

Instead of running OCR, the tool returns an error like:

The following language is not supported by the OCR engine: srLatn

sr-Latn is supported by Google. Canadian_Aboriginal is supported by Tesseract.

This bug affects 2 of the ~60 languages supported by Google and 17 of the ~160 languages supported by Tesseract.
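
A minimal sketch of the failure mode described above, in Python for illustration only. The tool's real sanitizing code is not shown in this report, so everything here is an assumption: the names `strip_non_letters` and `check` are hypothetical, `SUPPORTED` is a tiny made-up sample of the engine lists, and the sketch simply assumes that every character other than an ASCII letter is removed before the code is looked up.

```python
import re

# Hypothetical sample of supported codes; the real engine lists are much longer.
SUPPORTED = {"en", "sr-Latn", "Canadian_Aboriginal"}


def strip_non_letters(code: str) -> str:
    """Assumed buggy behaviour: drop everything that is not an ASCII letter."""
    return re.sub(r"[^A-Za-z]", "", code)


def check(code: str) -> str:
    """Look up the stripped code and mimic the error message from the report."""
    cleaned = strip_non_letters(code)
    if cleaned not in SUPPORTED:
        return f"The following language is not supported by the OCR engine: {cleaned}"
    return "ok"


print(check("en"))                   # plain two-letter code is unaffected
print(check("sr-Latn"))              # hyphen is lost, lookup fails
print(check("Canadian_Aboriginal"))  # underscore is lost, lookup fails
```

Under these assumptions, codes made only of letters pass, while any code containing - or _ is rewritten into a string that is not in the supported list.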

URLs to reproduce problem

Expected behavior: The image gets OCR'd
Observed behavior: An error is returned of the form: The following language is not supported by the OCR engine: CanadianAboriginal

Environment

Wikimedia OCR: Version 0.1.0-5-gf2af8be

Event Timeline

Fixed by T282760: Add language-mapping data (currently up for code review). Note, however, that the language codes have changed, since we try to accept only ISO 639-1 codes, which people are used to on the wiki. Tesseract still has some extra non-standard ones. See the API for the available list at /api/available_langs?engine=tesseract. So, for instance, sr-Latn is now sr-latn. Meanwhile, I believe Canadian_Aboriginal is a script and is not intended to be accepted as a language code.

Sending this over to Dom in case he wants to revisit it, but effectively I think this task is invalid now that we only accept ISO 639-1 codes (with a few extras specific to Tesseract). If you agree please feel free to close as invalid.

@MusikAnimal I still appear to be getting this issue.

For example, if I choose sr-latn from the language selector and submit, I get: The following language is not supported by the OCR engine: srlatn.

Direct link: https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2FGevel_-_Venray_-_20241580_-_RCE.jpg&langs[]=sr-latn&engine=google

It is version 0.5.0, which should be up-to-date.
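
As a quick check that the hyphen is actually present in the request, the direct link above can be parsed with Python's standard library. This is only a sketch of the client side; it says nothing definite about where the tool drops the character, but it suggests the stripping happens after the request is received.

```python
from urllib.parse import parse_qs, urlsplit

# The direct link quoted in the comment above, split only for readability.
url = (
    "https://ocr-test.wmcloud.org/"
    "?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2F"
    "Gevel_-_Venray_-_20241580_-_RCE.jpg"
    "&langs[]=sr-latn&engine=google"
)

# parse_qs decodes the query string into a dict of lists.
params = parse_qs(urlsplit(url).query)

print(params["langs[]"])  # the language code as sent, hyphen included
print(params["engine"])
```

The `langs[]` value decodes to `sr-latn`, while the error message reports `srlatn`.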

Should be fixed by https://github.com/wikimedia/wikimedia-ocr/commit/5013405a821b0875de15700ddab812eb273c13be, which I mistakenly just force-merged (sorry!)

@Daimona Does that regex need to include numbers as well now? We have just included the language ru-petr1708.

Submitting with that language (e.g. https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F1%2F1e%2FThe_Book_of_Scottish_Song.djvu%2Fpage20-1024px-The_Book_of_Scottish_Song.djvu.jpg&langs[]=ru-petr1708&engine=google)
returns: The following language is not supported by the OCR engine: ru-petr

I'd say yes! I'm not sure if Tesseract and Google are following any standard with the language codes, so when in doubt, it's probably a good idea to add whatever characters are currently used.
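
To illustrate the point about digits, here is a hedged Python sketch. Neither pattern is taken from the linked commit or pull request; `NARROW` and `WIDENED` are assumed character classes, and `sanitize` and `codes_changed_by_sanitizing` are hypothetical helpers. The second helper shows one way to confirm that every code in a supported list survives sanitizing unchanged.

```python
import re

# Assumed pattern after the first fix: letters, hyphen and underscore are kept.
NARROW = re.compile(r"[^A-Za-z_-]")
# Assumed widened pattern: digits are kept as well.
WIDENED = re.compile(r"[^A-Za-z0-9_-]")


def sanitize(code: str, pattern: re.Pattern) -> str:
    """Remove every character that the given pattern does not allow."""
    return pattern.sub("", code)


def codes_changed_by_sanitizing(codes, pattern):
    """Return the codes that sanitizing would alter; ideally an empty list."""
    return [c for c in codes if sanitize(c, pattern) != c]


# A small sample of codes mentioned in this task, not the full engine lists.
sample = ["sr-latn", "ru-petr1708", "Canadian_Aboriginal"]

print(sanitize("ru-petr1708", NARROW))               # digits are dropped
print(codes_changed_by_sanitizing(sample, NARROW))   # lists the affected code
print(codes_changed_by_sanitizing(sample, WIDENED))  # nothing is altered
```

Running a check like `codes_changed_by_sanitizing` against the full list of codes for each engine would flag any future code that uses a character outside the allowed set.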

Great. I think Sam just made https://github.com/wikimedia/wikimedia-ocr/pull/41

This has been merged, and I can now successfully submit all supported languages (except T284827, but I doubt that was caused by this).

Moving straight to Done as this is a relatively small change.

ldelench_wmf set the point value for this task to 1.