Wikimedia OCR: "-" and "_" being stripped from language codes
Closed, Resolved · Public · 1 Estimated Story Points · Bug Report

Authored By

	dom_walden
	May 4 2021, 2:10 PM

Description

What is the problem?

Languages like sr-Latn and Canadian_Aboriginal are being treated as invalid because the "-" and "_" characters are being stripped from their codes.

Instead, the tool returns an error like:

The following language is not supported by the OCR engine: srLatn

sr-Latn is supported by Google. Canadian_Aboriginal is supported by Tesseract.

This bug affects 2 of the ~60 languages supported by Google and 17 of the ~160 languages supported by Tesseract.
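The symptom is consistent with a sanitizing step that keeps only letters. The following is a minimal Python sketch of that behavior, not the tool's actual code: the pattern and the function name `sanitize_letters_only` are assumptions made for illustration.

```python
import re

# Assumed pattern: matches every character that is not an ASCII letter.
LETTERS_ONLY = re.compile(r"[^a-zA-Z]")


def sanitize_letters_only(code: str) -> str:
    """Hypothetical sanitizer that drops everything except ASCII letters."""
    return LETTERS_ONLY.sub("", code)


# Both codes lose their separator and no longer match a supported language.
print(sanitize_letters_only("sr-Latn"))              # srLatn
print(sanitize_letters_only("Canadian_Aboriginal"))  # CanadianAboriginal
```

The two printed values match the codes that appear in the error messages quoted in this report.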

Urls to reproduce problem

Expected behavior: The image gets OCR'd
Observed behavior: An error is returned of the form "The following language is not supported by the OCR engine: CanadianAboriginal"

Environment

Wikimedia OCR: Version 0.1.0-5-gf2af8be

Related Objects

Mentioned In: T282760: Add language-mapping data
T280617: Wikisource OCR: Validate language codes
Mentioned Here: T284827: Wikimedia OCR: 500 error with lang "equ"
T282760: Add language-mapping data

Event Timeline

dom_walden created this task. May 4 2021, 2:10 PM

Restricted Application added a subscriber: Aklapper. May 4 2021, 2:10 PM

ifried moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. May 4 2021, 3:15 PM

dom_walden mentioned this in T280617: Wikisource OCR: Validate language codes. May 6 2021, 12:58 PM

Fixed by T282760: Add language-mapping data (currently up for code review). Note, however, that the language codes have changed: we try to accept only ISO 639-1 codes, which people are used to on the wikis, although Tesseract still has some extra non-standard ones. See the API for the available list at /api/available_langs?engine=tesseract. For instance, sr-Latn is now sr-latn. Canadian_Aboriginal, meanwhile, is I believe a script and not intended to be accepted as a language code.

MusikAnimal mentioned this in T282760: Add language-mapping data. Jun 2 2021, 10:45 PM

MusikAnimal edited projects, added Community-Tech (CommTech-Sprint-2); removed Community-Tech. Jun 8 2021, 1:37 AM

Restricted Application edited projects, added Community-Tech; removed Community-Tech (CommTech-Sprint-2). Jun 8 2021, 1:37 AM

Sending this over to Dom in case he wants to revisit it, but effectively I think this task is invalid now that we only accept ISO 639-1 codes (with a few extras specific to Tesseract). If you agree please feel free to close as invalid.

In T281866#7140922, @MusikAnimal wrote:

Sending this over to Dom in case he wants to revisit it, but effectively I think this task is invalid now that we only accept ISO 639-1 codes (with a few extras specific to Tesseract). If you agree please feel free to close as invalid.

@MusikAnimal I still appear to be getting this issue.

For example, if I choose sr-latn from the language selector and submit, I get: The following language is not supported by the OCR engine: srlatn.

Direct link: https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2FGevel_-_Venray_-_20241580_-_RCE.jpg&langs[]=sr-latn&engine=google

It is version 0.5.0, which should be up-to-date.
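To repeat this check for other languages, a link of the same shape as the direct link above can be built programmatically. This is a sketch only: the helper name `build_ocr_url` is hypothetical, and the query parameters (`image`, `langs[]`, `engine`) are simply those visible in that link.

```python
from urllib.parse import urlencode

# Host taken from the direct link in this comment.
BASE_URL = "https://ocr-test.wmcloud.org/"


def build_ocr_url(image_url: str, lang: str, engine: str) -> str:
    """Hypothetical helper: build a reproduction link for a single language."""
    query = {"image": image_url, "langs[]": lang, "engine": engine}
    # safe="[]" leaves the brackets of "langs[]" unescaped, as in the link above.
    return BASE_URL + "?" + urlencode(query, safe="[]")


url = build_ocr_url(
    "https://upload.wikimedia.org/wikipedia/commons/6/64/Gevel_-_Venray_-_20241580_-_RCE.jpg",
    "sr-latn",
    "google",
)
print(url)
```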

Should be fixed by https://github.com/wikimedia/wikimedia-ocr/commit/5013405a821b0875de15700ddab812eb273c13be, which I mistakenly just force-merged (sorry!)

Follow-up PR with tests: https://github.com/wikimedia/wikimedia-ocr/pull/36

ldelench_wmf moved this task from Backlog to 🌟Top Priority on the Wikimedia OCR board. Jun 9 2021, 1:05 PM

dmaza assigned this task to Daimona. Jun 9 2021, 2:05 PM

In T281866#7142135, @Daimona wrote:

Should be fixed by https://github.com/wikimedia/wikimedia-ocr/commit/5013405a821b0875de15700ddab812eb273c13be, which I mistakenly just force-merged (sorry!)

@Daimona Does that regex need to include numbers as well now? We have just included the language ru-petr1708.

Submitting with that language (e.g. https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F1%2F1e%2FThe_Book_of_Scottish_Song.djvu%2Fpage20-1024px-The_Book_of_Scottish_Song.djvu.jpg&langs[]=ru-petr1708&engine=google)
returns: The following language is not supported by the OCR engine: ru-petr

In T281866#7148002, @dom_walden wrote:

@Daimona Does that regex need to include numbers as well now? We have just included the language ru-petr1708.

I'd say yes! I'm not sure if Tesseract and Google are following any standard with the language codes, so when in doubt, it's probably a good idea to add whatever characters are currently used.
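The ru-petr1708 result reported above is what one would expect if the sanitizing pattern keeps letters, "-" and "_" but not digits. The sketch below compares such a pattern with a widened one. Both patterns and the `sanitize` helper are assumptions made for illustration and are not copied from the linked commit or pull request.

```python
import re

# Assumed pattern: keeps ASCII letters, "_" and "-", and strips digits.
WITHOUT_DIGITS = re.compile(r"[^a-zA-Z_-]")
# Assumed widened pattern: additionally keeps digits.
WITH_DIGITS = re.compile(r"[^a-zA-Z0-9_-]")


def sanitize(code: str, pattern: re.Pattern) -> str:
    """Hypothetical helper: remove every character the pattern matches."""
    return pattern.sub("", code)


print(sanitize("ru-petr1708", WITHOUT_DIGITS))  # ru-petr
print(sanitize("ru-petr1708", WITH_DIGITS))     # ru-petr1708
print(sanitize("sr-latn", WITH_DIGITS))         # sr-latn
```

An alternative design would be to skip character stripping entirely and compare the submitted code against the list of supported languages, which avoids having to widen the pattern each time a new kind of code is added.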

In T281866#7148133, @Daimona wrote:

I'd say yes! I'm not sure if Tesseract and Google are following any standard with the language codes, so when in doubt, it's probably a good idea to add whatever characters are currently used.

Great. I think Sam just made https://github.com/wikimedia/wikimedia-ocr/pull/41

Daimona set Final Story Points to 1. Jun 10 2021, 10:35 AM

In T281866#7148135, @dom_walden wrote:

Great. I think Sam just made https://github.com/wikimedia/wikimedia-ocr/pull/41

This has been merged, and I can now successfully submit all supported languages (except T284827, but I doubt that was caused by this).

Moving straight to Done as this is a relatively small change.

ldelench_wmf closed this task as Resolved. Jun 11 2021, 4:06 PM

ldelench_wmf set the point value for this task to 1.
