Wikisource OCR: Validate language codes
Open, Needs Triage · Public · 5 Estimated Story Points

Description

Users can provide a list of languages that are in the image being OCRed, but these codes are not currently validated. Both engines have lists of supported languages, against which we can check the user input. We should show a friendly error when an unsupported language code is used (currently, Google will blindly accept invalid languages and return its best guess, and Tesseract will fail with a seemingly-unrelated message).

Acceptance Criteria:

  • Validate the language input against the languages that the selected engine supports

Error copy:

This language is NOT supported by the OCR.
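
A minimal sketch of the check described above, in Python for illustration only (it is not the tool's actual code). The engine keys, the abbreviated language lists, and the function names `find_unsupported` and `validate` are all hypothetical; the real lists would come from each engine's documented supported languages.

```python
# Hypothetical, heavily abbreviated lists. The real ones would be taken
# from each engine's own list of supported languages.
SUPPORTED_LANGS = {
    "google": {"en", "fr", "de"},
    "tesseract": {"eng", "fra", "deu"},
}

# Error copy from the task description.
ERROR_COPY = "This language is NOT supported by the OCR."


def find_unsupported(engine: str, langs: list[str]) -> list[str]:
    """Return the codes in `langs` that the chosen engine does not list."""
    supported = SUPPORTED_LANGS[engine]
    return [code for code in langs if code not in supported]


def validate(engine: str, langs: list[str]) -> None:
    """Raise ValueError with the error copy if any code is unsupported."""
    invalid = find_unsupported(engine, langs)
    if invalid:
        raise ValueError(f"{ERROR_COPY} ({', '.join(invalid)})")
```

Checking before the request reaches the engine means the user gets the same friendly message for both engines, instead of Google's silent best guess or Tesseract's unrelated failure.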

Event Timeline

Restricted Application added a subscriber: Aklapper.
ldelench_wmf set the point value for this task to 3. · Wed, Apr 21, 5:35 PM
ldelench_wmf moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.
ldelench_wmf changed the point value for this task from 3 to 5. · Wed, Apr 21, 5:37 PM
ifried renamed this task from Validate language codes to Wikisource OCR: Validate language codes. · Wed, Apr 21, 6:00 PM
ifried moved this task from Estimated to Kanban-2020-21-Q4 on the Community-Tech board.

What's the process for providing error messages? Other than the error copy, do you need any other product design support for this copy and do we provide it on here? cc @nayoub

Error copy:

This language is NOT supported by the OCR.

Side note, do we have UX writing guidelines? cc @nayoub

@HMonroy @NRodriguez For the Google OCR engine, we now only treat as valid what Google considers "Supported languages".

The Google OCR engine technically supports more:

So it is no longer possible to enter a language like sco (Scots) or cy (Welsh) into Wikimedia OCR.

Do we want to support more languages?

A slightly related issue: I assume Google's list of supported languages might change in the future. Do we have a plan for keeping Wikimedia OCR up to date?

Pinging @Samwilson, since I know Harumi is busy with some Editing team work :)

If we wanted to include support for the experimental & mapped languages, would this be possible? Could we create a ticket for it? Any thoughts/concerns?

Yep, not hard to add the remaining language codes.

Keeping our list up to date is a different thing, and it doesn't look like there's any easy way to do it automatically: https://groups.google.com/g/cloud-vision-discuss/c/wkOZLwUTvYE/m/xpu-XQnMAQAJ

I guess we just keep it static, and try to revisit it every year or something.

If you enter an invalid language (like in this link), you get a nice error message such as:

The following language is not supported by the OCR engine: blah

The languages we accept vary depending on whether you choose Google or Tesseract.

I tested both engines, checking that we accept all the languages the respective engine supports (other than what we discussed in T280617#7057720 and some languages being incorrectly interpreted due to T281866).

If the decision about supporting other languages will be handled in T281913, I will move this along.

Test Environment: https://ocr-test.wmcloud.org Version 0.1.0-5-gf2af8be