Add on-wiki UI for selecting languages
Closed, Resolved · Public

Assigned To

Authored By

	Samwilson
	Apr 6 2021, 3:44 AM

Description

Both Google Cloud Vision API and Tesseract allow for specifying multiple languages when processing an image's text, to help make the OCR more accurate.

Google: https://cloud.google.com/vision/docs/languages
Tesseract: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html (installable as individual Debian packages, many (all?) of which are already installed on Toolforge)

Only Tesseract provides a dynamic means of retrieving what languages are supported. For Google it's just a list on the above page.
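As a rough illustration of that dynamic lookup, the sketch below shells out to the Tesseract CLI's `--list-langs` option and parses the result. The function names are made up for this example, and the exact header line and output stream are assumptions that may vary between Tesseract versions.

```python
from __future__ import annotations

import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # 'List of available languages (3):'. Drop it if it ends with a colon.
    if lines and lines[0].endswith(":"):
        lines = lines[1:]
    return sorted(lines)


def installed_tesseract_langs() -> list[str]:
    """Run the Tesseract CLI and return its installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list on stderr rather than stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

Keeping the parsing separate from the subprocess call means the parser can be checked against sample output without Tesseract being installed.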

Currently, we just use each Wikisource's content language as the OCR language, but this is not optimal for pages containing multiple languages, nor for Multilingual Wikisource.

The language codes for the two engines differ, so we'll have to map them to some sort of common system.
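One possible shape for such a mapping is sketched below. The table is a small illustrative subset, not the mapping the tool actually uses, and the names `ENGINE_CODES`, `to_engine_codes` and `tesseract_lang_arg` are hypothetical. It assumes the common system is the short code familiar from wiki language codes, with Tesseract using its three-letter traineddata names and Google using short language hints.

```python
from __future__ import annotations

# Illustrative subset only: common code -> engine-specific code.
ENGINE_CODES: dict[str, dict[str, str]] = {
    "en": {"tesseract": "eng", "google": "en"},
    "fr": {"tesseract": "fra", "google": "fr"},
    "de": {"tesseract": "deu", "google": "de"},
    "hi": {"tesseract": "hin", "google": "hi"},
}


def to_engine_codes(common_codes: list[str], engine: str) -> tuple[list[str], list[str]]:
    """Translate common codes for one engine, and report the ones it lacks."""
    supported: list[str] = []
    unsupported: list[str] = []
    for code in common_codes:
        engine_code = ENGINE_CODES.get(code, {}).get(engine)
        if engine_code is None:
            unsupported.append(code)
        else:
            supported.append(engine_code)
    return supported, unsupported


def tesseract_lang_arg(common_codes: list[str]) -> str:
    """Build the value for Tesseract's -l option, which joins codes with '+'."""
    supported, _ = to_engine_codes(common_codes, "tesseract")
    return "+".join(supported)
```

Returning the unsupported codes separately lets a caller warn the user instead of silently dropping a language that one engine does not offer.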

Details

	Subject	Repo	Branch	Lines +/-
	Add on-wiki UI for selecting languages	mediawiki/extensions/Wikisource	master	+185 -49
	Add on-wiki UI for selecting languages	mediawiki/extensions/Wikisource	master	+205 -48


Related Objects

Mentioned In: T357857: OCR language selection does not work
T331961: Add Transkribus as an option in the OCR menu (WikiEditor)
T328830: Configurable option to change default language in Wikimedia OCR
T318594: Tesseract OCR: Allow saving "psm" parameter option
T316428: OCR is not working for Devanagari script in wikisource.org
T301985: Next/Previous button for Wikimedia OCR advance mode.
Mentioned Here: T331961: Add Transkribus as an option in the OCR menu (WikiEditor)
T287080: Wikisource OCR: Add prohibited-character-list to API
T287125: Wikisource OCR: Allow custom thresholding pre-processing step
T280214: Wikisource OCR: Accept Google options on the API

Event Timeline

Samwilson created this task. Apr 6 2021, 3:44 AM

Restricted Application added a subscriber: Aklapper. Apr 6 2021, 3:44 AM

The tool work for this has been done in T280214.

Although this work is done, and multiple language selection is available in the advanced options of the tool, I wonder if this should stay open for implementing an on-wiki UI for multiple language selection (I can't find another ticket for that).

We did have some discussion about how to do that UI, and I think it was something along the lines of a TagMultiselectWidget with each selection being made via a UniversalLanguageSelector popup (which is how we do multiple language selection in SVG Translate Tool, for example). The main question is: where should the widget go? It might look weird in the 'Transcribe text' dropdown menu.

Samwilson renamed this task from Add support for multiple languages to Add on-wiki UI for selecting multiple languages. Nov 17 2021, 6:05 AM

Remember that the languages are not necessarily the same list as "real" languages: they're just different models provided by the relevant tool, and there can be multiple models for one language (e.g. a special model for old school printing).

Importantly, the UI needs the list of available languages.

In T279405#7509281, @Inductiveload wrote:

Remember that the languages are not necessarily the same list as "real" languages: they're just different models provided by the relevant tool, and there can be multiple models for one language (e.g. a special model for old school printing).

Importantly, the UI needs the list of available languages.

Good points! We can get the lists of languages from https://ocr.wmcloud.org/api/available_langs?engine=tesseract so maybe rather than ULS we just have more-or-less the same language chooser UI that is currently on the tool's form, i.e. just a MenuTagMultiselectWidget, with custom data.
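A minimal sketch of that idea follows, assuming the endpoint's language list can be reduced to a mapping of model code to display name (the actual response shape is not shown in this task). It builds the request URL and converts the mapping into `data`/`label` option entries of the kind a MenuTagMultiselectWidget takes. The function names are hypothetical.

```python
from __future__ import annotations

from urllib.parse import urlencode

API_BASE = "https://ocr.wmcloud.org/api/available_langs"


def available_langs_url(engine: str) -> str:
    """Build the available_langs URL for one OCR engine."""
    return f"{API_BASE}?{urlencode({'engine': engine})}"


def to_menu_options(langs: dict[str, str]) -> list[dict[str, str]]:
    """Turn an assumed {code: name} mapping into widget option entries.

    The code is kept in the label because one language can have several
    models, so the display name alone may not be unique.
    """
    ordered = sorted(langs.items(), key=lambda item: (item[1].casefold(), item[0]))
    return [{"data": code, "label": f"{name} ({code})"} for code, name in ordered]
```

For example, `to_menu_options({"fra": "French", "eng": "English"})` yields the English entry first, then the French one, each labelled with its code.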

We could turn the dropdown into a gear icon, and open a dialog with all the options. It'd have to contain the engine choice, and languages, and tesseract PSM, in order to make the 'advanced options' link redundant (once ODS is doing the cropping, that is).

I think a dialog makes much more sense, because there are a lot more options you might consider adding, for example OCR blacklist chars (T287080) as well as a pre-processing threshold step (T287125) and probably a small handful of other usefully-twiddlable controls. Stuffing it all into a menu is going to get messy.

Samwilson renamed this task from Add on-wiki UI for selecting multiple languages to Add on-wiki UI for selecting languages. Jan 21 2022, 12:26 AM

Samwilson mentioned this in T301985: Next/Previous button for Wikimedia OCR advance mode. Feb 18 2022, 7:04 AM

dmaza removed a project: Community-Tech. Aug 1 2022, 2:12 PM

Restricted Application added a project: Community-Tech. Aug 1 2022, 2:12 PM

dmaza removed a project: Community-Tech. Aug 1 2022, 2:13 PM

Samwilson mentioned this in T316428: OCR is not working for Devanagari script in wikisource.org. Aug 27 2022, 11:41 AM

Samwilson mentioned this in T318594: Tesseract OCR: Allow saving "psm" parameter option. Sep 27 2022, 7:32 AM

Samwilson mentioned this in T328830: Configurable option to change default language in Wikimedia OCR. Feb 6 2023, 12:50 AM

Soda merged a task: T328830: Configurable option to change default language in Wikimedia OCR. Feb 28 2023, 2:36 AM

Soda added subscribers: Gopavasanth, Soda.

Soda claimed this task. Feb 28 2023, 2:43 AM

Samwilson mentioned this in T331961: Add Transkribus as an option in the OCR menu (WikiEditor). Mar 14 2023, 6:07 AM

How are you going with this @Soda? I think there might be some overlap with T331961, so you and @KLawal-WMF might need to compare notes.

Yeah sure. @KLawal-WMF, how far have you gotten?

In T279405#8731593, @Soda wrote:

Yeah sure. @KLawal-WMF, how far have you gotten?

Transkribus has been added as an option in the OCR menu. I am currently resolving comments by @Samwilson.

FRomeo_WMF subscribed. Oct 30 2023, 11:23 AM

Hi @Soda, how far have you gotten? I am working on adding more options for the engines.

Change 971235 had a related patch set uploaded (by Kolakachi; author: Kolakachi):

[mediawiki/extensions/Wikisource@master] Add more options for ocr engines

https://gerrit.wikimedia.org/r/971235

gerritbot added a project: Patch-For-Review. Nov 2 2023, 4:49 PM

Change 989530 had a related patch set uploaded (by Kolakachi; author: L10n-bot):

[mediawiki/extensions/Wikisource@master] Add on-wiki UI for selecting languages

https://gerrit.wikimedia.org/r/989530

Change 989530 abandoned by Kolakachi:

[mediawiki/extensions/Wikisource@master] Add on-wiki UI for selecting languages

Reason:

https://gerrit.wikimedia.org/r/989530

Change 971235 merged by jenkins-bot:

[mediawiki/extensions/Wikisource@master] Add on-wiki UI for selecting languages

https://gerrit.wikimedia.org/r/971235

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.16; 2024-01-30). Jan 24 2024, 3:01 AM

Maintenance_bot removed a project: Patch-For-Review. Jan 24 2024, 3:30 AM

KLawal-WMF closed this task as Resolved. Jan 24 2024, 9:28 AM

KLawal-WMF claimed this task.

Samwilson mentioned this in T357857: OCR language selection does not work. Feb 18 2024, 4:29 AM
