
Add API endpoint to retrieve supported languages
Closed, Resolved · Public · 3 Estimated Story Points

Description

Wikimedia OCR should have an API endpoint allowing you to fetch the supported languages for the given engine. This would allow the MediaWiki backend to pre-fetch the list for use in autocompletion on the frontend.

It's also just a courteous thing to do, if we are providing an API to OCR images. No one should have to hard code the list of languages we support!

Acceptance criteria

  • Add an API endpoint to get a list of supported languages for a given engine

Event Timeline

Restricted Application added a subscriber: Aklapper.

PR: https://github.com/wikimedia/wikimedia-ocr/pull/29

This partially implements autocompletion in the UI. It only honors the engine that was selected at page load, so if you want Tesseract languages you have to have ?engine=tesseract in the URL. This bug can be fixed in a separate PR; I didn't want to spend much time on it since this hasn't had any design or product review. The focus of this PR is the API endpoint.

Also, I assume we'll want to map Tesseract's supported language list to return ISO 639-1 codes, which is what we use on-wiki. Once that's done, the language names will show in the autocompletion. This is why, for example, for Google you get en – English but with ?engine=tesseract you get only eng.

MusikAnimal set the point value for this task to 3. · May 6 2021, 3:20 AM

I assume we'll want to map Tesseract's supported language list to return ISO 639-1, which is what we use on-wiki.

We might have to do some manual mapping, unfortunately. For example, Hebrew is identified by its old ISO 639-1 code of iw (which we can look up on Wikidata if need be). Worse is the list of Tesseract languages, which seems to be mostly ISO 639-3 but also has full English names (e.g. Armenian, Greek) as well as duplicates (Bengali and ben) which (I'm guessing) are different models for the same language (or maybe for different scripts of the same language?).

I think we've been thinking that on-wiki we'll want to show the languages in the same way they're done in ULS. But perhaps that's not right, because these are not actually languages but OCR models: we need to show 'Italian - Old' somehow, even though that's not really a language. We can get a long way by assuming Google = ISO 639-1 and Tesseract = ISO 639-3, but there are a bunch of outliers. If labels for those languages don't need to be translated, it probably doesn't matter.

It feels like it might be nice, from the wiki side of things, to retrieve a list of language_code → localized language name pairs from the tool. Although perhaps that doesn't work well for setting defaults (e.g. how, on Hebrew Wikisource where the $wgContentLanguage is he, do we know to default to iw for Google and heb or Hebrew for Tesseract?)
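To make the mapping discussion above concrete, here is an illustrative sketch of what a manual normalization layer could look like. The lookup tables and function name are hypothetical (and deliberately partial), not part of the tool; the code values themselves (iw → he, heb → he, Armenian → hy, etc.) are standard ISO 639 facts:

```python
# Hypothetical sketch of normalizing engine-specific language
# identifiers to the ISO 639-1 codes used on-wiki. The tables
# below are illustrative, not exhaustive.

# Deprecated ISO 639-1 codes that Google's API may still return.
LEGACY_TO_CURRENT = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}

# Partial ISO 639-3 -> ISO 639-1 mapping for Tesseract codes.
ISO639_3_TO_1 = {
    "heb": "he",
    "eng": "en",
    "ben": "bn",
}

# Tesseract entries that are full English names rather than codes.
NAME_TO_CODE = {
    "armenian": "hy",
    "greek": "el",
}

def to_iso639_1(identifier):
    """Map an engine-specific language identifier to ISO 639-1,
    falling back to the original identifier when no mapping is known."""
    key = identifier.lower()
    key = LEGACY_TO_CURRENT.get(key, key)
    key = NAME_TO_CODE.get(key, key)
    return ISO639_3_TO_1.get(key, key)
```

Entries like 'Italian - Old', which are models rather than languages, would still fall through unchanged here, so a fallback label would be needed on the wiki side.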

https://tesseract-ocr.github.io/tessdoc/Data-Files suggests that these are not duplicates, but e.g. ben is Bengali and should only be listed once.

@MusikAnimal @Samwilson Is there anything more to be done here? You can move it into Product sign-off or Done if not.

I have checked that the languages returned by the API endpoint are correct for both engines (for Google we only return the "Supported languages").

https://ocr-test.wmcloud.org/api/available_langs?engine=tesseract
https://ocr-test.wmcloud.org/api/available_langs?engine=google
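For reference, the endpoints above could be queried with a small Python sketch like the following. The response shape (`available_langs` key) is an assumption based on the endpoint name, not a documented schema, and the network call is left commented out so the sketch runs offline:

```python
import json
from urllib.parse import urlencode
# from urllib.request import urlopen  # needed only for the live call below

API_BASE = "https://ocr-test.wmcloud.org/api/available_langs"

def available_langs_url(engine):
    """Build the available_langs URL for a given engine."""
    return API_BASE + "?" + urlencode({"engine": engine})

def parse_available_langs(body):
    """Parse a JSON response body; assumes (hypothetically) that the
    endpoint returns an object with an 'available_langs' list."""
    return json.loads(body).get("available_langs", [])

# Live call (commented out so the sketch has no network dependency):
# with urlopen(available_langs_url("tesseract")) as resp:
#     langs = parse_available_langs(resp.read())
```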

Thanks for the review! T282760: Add language-mapping data will supersede this, so I guess I'll move this back to In Development since we're basically re-doing it.

All follow-up PRs merged :)

QA notes: The same API endpoints exist, but now the list of languages is more complete (T282760) and the localized names for each language are provided.

Testing for this was done as part of T282760.

Moving straight to Done because all the interesting things are happening in T282760.