
Add API endpoint to retrieve supported languages
Closed, Resolved · Public · 3 Estimated Story Points

Description

Wikimedia OCR should have an API endpoint allowing you to fetch the supported languages for the given engine. This would allow the MediaWiki backend to pre-fetch the list for use in autocompletion on the frontend.

It's also just a courteous thing to do, if we are providing an API to OCR images. No one should have to hard code the list of languages we support!

Acceptance criteria

  • Add an API endpoint to get a list of supported languages for a given engine

Event Timeline

Restricted Application added a subscriber: Aklapper.

PR: https://github.com/wikimedia/wikimedia-ocr/pull/29

This partially implements autocompletion in the UI. It only honors the engine that was selected at page load, so if you want Tesseract languages you have to have ?engine=tesseract in the URL. This bug can be fixed in a separate PR; I didn't want to spend much time on it since this hasn't had any design or product review. The focus of this PR is the API endpoint.

Also, I assume we'll want to map Tesseract's supported language list to return ISO 639-1 codes, which is what we use on-wiki. Once that's done, the language names will show in the autocompletion. This is why, for example, for Google you get en – English but with ?engine=tesseract you get only eng.

MusikAnimal set the point value for this task to 3. · May 6 2021, 3:20 AM

I assume we'll want to map Tesseract's supported language list to return ISO 639-1, which is what we use on-wiki.

We might have to do some manual mapping, unfortunately. For example, Hebrew is identified by its old ISO 639-1 code of iw (which we can look up on Wikidata if need be). Worse is the list of Tesseract languages, which seems to be mostly ISO 639-3 but also has full English names (e.g. Armenian, Greek) as well as duplicates (Bengali and ben) which (I'm guessing) are different models for the same language (or maybe for different scripts of the same language?).

I think we've been thinking that on-wiki we'll want to show the languages in the same way they're done in ULS. But perhaps that's not right, because these are not actually languages but OCR models: we need to show 'Italian - Old' somehow, even though that's not really a language. We can get a long way by assuming Google = ISO 639-1 and Tesseract = ISO 639-3, but there are a bunch of outliers. If labels for those languages don't need to be translated, it probably doesn't matter.

It feels like it might be nice, from the wiki side of things, to retrieve a list of language_code → localized language name pairs from the tool. Although perhaps that doesn't work well for setting defaults (e.g. how, on Hebrew Wikisource where the $wgContentLanguage is he, do we know to default to iw for Google and heb or Hebrew for Tesseract?)
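To make the mapping discussion above concrete, here is an illustrative sketch of what a manual normalization layer could look like. The lookup tables and function name are hypothetical (and deliberately partial), not part of the tool; the code values themselves (iw → he, heb → he, Armenian → hy, etc.) are standard ISO 639 facts:

```python
# Hypothetical sketch of normalizing engine-specific language
# identifiers to the ISO 639-1 codes used on-wiki. The tables
# below are illustrative, not exhaustive.

# Deprecated ISO 639-1 codes that Google's API may still return.
LEGACY_TO_CURRENT = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}

# Partial ISO 639-3 -> ISO 639-1 mapping for Tesseract codes.
ISO639_3_TO_1 = {
    "heb": "he",
    "eng": "en",
    "ben": "bn",
}

# Tesseract entries that are full English names rather than codes.
NAME_TO_CODE = {
    "armenian": "hy",
    "greek": "el",
}

def to_iso639_1(identifier):
    """Map an engine-specific language identifier to ISO 639-1,
    falling back to the original identifier when no mapping is known."""
    key = identifier.lower()
    key = LEGACY_TO_CURRENT.get(key, key)
    key = NAME_TO_CODE.get(key, key)
    return ISO639_3_TO_1.get(key, key)
```

Entries like 'Italian - Old', which are models rather than languages, would still fall through unchanged here, so a fallback label would be needed on the wiki side.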

https://tesseract-ocr.github.io/tessdoc/Data-Files suggests that these are not duplicates, but e.g. ben is Bengali and should only be listed once.

@MusikAnimal @Samwilson Is there anything more to be done here? You can move it into Product sign-off or Done if not.

I have checked that the languages returned by the API endpoint are correct for both engines (for Google we only return the "Supported languages").

https://ocr-test.wmcloud.org/api/available_langs?engine=tesseract
https://ocr-test.wmcloud.org/api/available_langs?engine=google
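For reference, the endpoints above could be queried with a small Python sketch like the following. The response shape (`available_langs` key) is an assumption based on the endpoint name, not a documented schema, and the network call is left commented out so the sketch runs offline:

```python
import json
from urllib.parse import urlencode
# from urllib.request import urlopen  # needed only for the live call below

API_BASE = "https://ocr-test.wmcloud.org/api/available_langs"

def available_langs_url(engine):
    """Build the available_langs URL for a given engine."""
    return API_BASE + "?" + urlencode({"engine": engine})

def parse_available_langs(body):
    """Parse a JSON response body; assumes (hypothetically) that the
    endpoint returns an object with an 'available_langs' list."""
    return json.loads(body).get("available_langs", [])

# Live call (commented out so the sketch has no network dependency):
# with urlopen(available_langs_url("tesseract")) as resp:
#     langs = parse_available_langs(resp.read())
```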

Thanks for the review! T282760: Add language-mapping data will supersede this, so I guess I'll move this back to In Development since we're basically re-doing it.

All follow-up PRs merged :)

QA notes: The same API endpoints exist, but now the list of languages is more complete (T282760) and the localized names for each language are provided.

Testing for this was done as part of T282760.

Moving straight to Done because all the interesting things are happening in T282760.