What is the problem?
Language codes that contain a hyphen or underscore, such as sr-Latn and Canadian_Aboriginal, are being treated as invalid because the - and _ characters are stripped from them.
Instead of running OCR, the tool returns an error like:
The following language is not supported by the OCR engine: srLatn
sr-Latn is supported by Google. Canadian_Aboriginal is supported by Tesseract.
This bug affects 2 out of ~60 supported languages for Google and 17 out of ~160 for Tesseract.
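To illustrate the suspected cause, here is a minimal Python sketch. It is not the tool's actual code: the function names, the regular expressions, and the small SUPPORTED table are all hypothetical, and it assumes the language code is sanitized with an alphanumeric-only filter before being checked against the engine's language list.

```python
import re

# Hypothetical subset of each engine's supported codes, for illustration only.
SUPPORTED = {
    "google": {"sr", "sr-Latn"},
    "tesseract": {"eng", "Canadian_Aboriginal"},
}


def sanitize_strict(code: str) -> str:
    """Assumed buggy behavior: drop every character that is not a letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", code)


def sanitize_keep_separators(code: str) -> str:
    """One possible fix: keep '-' and '_' as well, so the code survives intact."""
    return re.sub(r"[^A-Za-z0-9_-]", "", code)


def unsupported(engine: str, langs: list[str], sanitize) -> list[str]:
    """Return the sanitized codes that the engine's list does not contain."""
    cleaned = [sanitize(lang) for lang in langs]
    return [code for code in cleaned if code not in SUPPORTED[engine]]


# With the strict filter, sr-Latn becomes srLatn and fails the lookup.
print(unsupported("google", ["sr-Latn"], sanitize_strict))
# With separators kept, the same request passes validation.
print(unsupported("google", ["sr-Latn"], sanitize_keep_separators))
```

If the real sanitization step works roughly like sanitize_strict, allowing - and _ through it should be enough for both engines.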
URLs to reproduce the problem
- https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2FGevel_-_Venray_-_20241580_-_RCE.jpg&langs%5B%5D=sr-Latn&engine=google
- https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2FGevel_-_Venray_-_20241580_-_RCE.jpg&langs%5B%5D=Canadian_Aboriginal&engine=tesseract
Expected behavior: The image gets OCR'd.
Observed behavior: An error is returned, of the form: The following language is not supported by the OCR engine: CanadianAboriginal
Environment
Wikimedia OCR: Version 0.1.0-5-gf2af8be