
Wikisource OCR: Accept Tesseract options on the API
Closed, Resolved · Public · 3 Estimated Story Points

Description

Acceptance Criteria:

  • The new Wikimedia OCR should accept Tesseract options through the API, such as multiple languages, page segmentation mode (PSM), and engine mode.
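
As a rough illustration of the criterion above, the sketch below maps API-style options onto a Tesseract command line. It is not the wikimedia-ocr implementation: the helper name `build_tesseract_args` and its parameter names are hypothetical, and it assumes "Engine" refers to Tesseract's `--oem` setting. The `-l`, `--psm`, and `--oem` flags and the `+`-joined language syntax are standard Tesseract CLI usage.

```python
from typing import List, Optional, Sequence


def build_tesseract_args(
    image_path: str,
    langs: Sequence[str] = (),
    psm: Optional[int] = None,
    oem: Optional[int] = None,
) -> List[str]:
    """Translate API-style options into a Tesseract command line.

    Hypothetical helper for illustration only.
    """
    # Tesseract defines page segmentation modes 0-13 and engine modes 0-3.
    if psm is not None and not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem is not None and not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")

    # Using "stdout" as the output base makes Tesseract print the text.
    args = ["tesseract", image_path, "stdout"]
    if langs:
        # Multiple languages are passed as a single "+"-joined value.
        args += ["-l", "+".join(langs)]
    if psm is not None:
        args += ["--psm", str(psm)]
    if oem is not None:
        args += ["--oem", str(oem)]
    return args
```

For example, `build_tesseract_args("page.png", ["eng", "fra"], psm=4, oem=1)` yields a list that can be handed to `subprocess.run`. Rejecting out-of-range values before invoking Tesseract lets the API answer with a clear validation error instead of a generic failure.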

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 15 2021, 2:31 AM
ldelench_wmf set the point value for this task to 3. Apr 15 2021, 11:49 PM

The multiple-languages part of this will be dealt with in T280214 (because the language list is common to both engines). We might still want to do some per-engine verification of the language codes, though.
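
One possible shape for that per-engine verification is sketched below. Everything here is an assumption made for illustration: the `SUPPORTED_LANGS` table, the engine key `other_engine`, and the function name are hypothetical, and real language lists would have to come from each engine.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical per-engine language lists, for illustration only.
SUPPORTED_LANGS: Dict[str, FrozenSet[str]] = {
    "tesseract": frozenset({"eng", "fra", "deu"}),
    "other_engine": frozenset({"en", "fr"}),
}


def unsupported_langs(engine: str, langs: Iterable[str]) -> List[str]:
    """Return the requested codes that the given engine does not list."""
    supported = SUPPORTED_LANGS[engine]
    return [lang for lang in langs if lang not in supported]
```

A caller could reject the request when the returned list is non-empty, naming the offending codes in the error message.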

PR merged: https://github.com/wikimedia/wikimedia-ocr/pull/22

Note that, depending on which options you choose, you might get errors about an invalid DPI (dots per inch). In production/staging these display as a 500 error page. We're not really sure which conditions require the DPI to be set, and in my testing I would sometimes still get the same error even after setting it, so we're omitting a DPI option for the time being. See the discussion at https://github.com/wikimedia/wikimedia-ocr/pull/22#discussion_r625531031