Acceptance Criteria:
- The new Wikimedia OCR should accept Tesseract options through the API, such as multiple languages, PSM (page segmentation mode), and engine mode (OEM).
The multiple languages part of this will be dealt with in T280214 (because the lang list is common to both engines). We might want to still do some per-engine verification of the language codes though.
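As a rough illustration of what per-engine verification could look like, here is a minimal Python sketch that checks the requested options and turns them into Tesseract command-line arguments. It is not the tool's actual implementation: the function name `build_tesseract_args` and the `available_langs` parameter are hypothetical. The only Tesseract-specific facts it assumes are the `-l`, `--psm` and `--oem` flags, the `+` separator for multiple languages, and the value ranges listed by `tesseract --help-psm` (0 to 13) and `tesseract --help-oem` (0 to 3).

```python
# Value ranges as listed by `tesseract --help-psm` and `tesseract --help-oem`.
VALID_PSM = range(0, 14)  # page segmentation modes 0..13
VALID_OEM = range(0, 4)   # OCR engine modes 0..3


def build_tesseract_args(langs, psm=None, oem=None, available_langs=frozenset()):
    """Validate the options and return them as a list of Tesseract CLI arguments.

    Hypothetical helper: `available_langs` stands in for whatever list of
    language codes the engine reports as installed.
    """
    unknown = [code for code in langs if code not in available_langs]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")

    args = []
    if langs:
        # Tesseract takes multiple languages joined with '+', e.g. "eng+fra".
        args += ["-l", "+".join(langs)]
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError(f"Invalid PSM: {psm!r}")
        args += ["--psm", str(psm)]
    if oem is not None:
        if oem not in VALID_OEM:
            raise ValueError(f"Invalid engine mode: {oem!r}")
        args += ["--oem", str(oem)]
    return args


if __name__ == "__main__":
    print(build_tesseract_args(["eng", "fra"], psm=6, oem=1,
                               available_langs={"eng", "fra", "deu"}))
```

Rejecting bad values before the engine is invoked would let the API report them as a client error instead of letting them surface as a failure inside Tesseract.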
PR merged:
Note that, depending on which options you choose, you might get errors about an invalid DPI (dots per inch). In production/staging this will display as a 500 error page. We're not sure which conditions require the DPI to be set, and in my testing I would sometimes get the same error even after setting it, so we're omitting a DPI option for the time being. See discussion at
I've tested some combinations of the Tesseract options via the UI. The returned OCR text varies with the options chosen, which suggests the options are being passed to Tesseract.
For example, compare:
Test environment: Version 0.2.0
Testing on the above and switching the Tesseract PSM options yielded 500 errors.