
Wikimedia OCR: Validate Tesseract options
Closed, Resolved | Public | 3 Estimated Story Points | BUG REPORT

Event Timeline

The options you provided are valid. It is the combination of options that confuses Tesseract, and that's why you're seeing a 500. To test validation, try a PSM of 30, for instance: https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F1%2F1b%2F07029jfBaliuag%252C_Bulacan_Town_Hall_Municipal_Office_Buildingsfvf_26.jpg&engine=tesseract&psm=30&oem=0 The psm and oem options are cast to ints, so a gibberish string passed as the value will be interpreted as zero.
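To make the two behaviours above concrete, here is a minimal Python sketch (the tool itself is written in PHP, so this is an illustration, not its actual code). The helper names `php_style_int` and `validate_options` are hypothetical, and the ranges are an assumption based on the values the tesseract 4+ command line documents for `--psm` (0 to 13) and `--oem` (0 to 3).

```python
import re

# Assumed valid ranges, taken from the tesseract 4+ command-line help.
VALID_PSM = range(0, 14)  # --psm 0..13
VALID_OEM = range(0, 4)   # --oem 0..3


def php_style_int(value):
    """Mimic a PHP-style (int) cast: use a leading integer if present, else 0."""
    match = re.match(r"\s*([+-]?\d+)", value)
    return int(match.group(1)) if match else 0


def validate_options(psm_raw, oem_raw):
    """Return a list of validation errors; an empty list means both values pass."""
    errors = []
    psm = php_style_int(psm_raw)
    oem = php_style_int(oem_raw)
    if psm not in VALID_PSM:
        errors.append(f"psm must be between 0 and 13, got {psm}")
    if oem not in VALID_OEM:
        errors.append(f"oem must be between 0 and 3, got {oem}")
    return errors
```

Note that `validate_options("3", "0")` returns no errors: each value is valid on its own, which is why this kind of validation cannot catch a combination that only fails once tesseract runs.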

Back to the 500: it has something to do with the DPI (dots per inch) being lost when certain options are used. We're not really sure what conditions break it, but I'm beginning to believe part of the problem is with the Tesseract package we're using, because even when I force a valid DPI value I still get the same error. See discussion at https://github.com/wikimedia/wikimedia-ocr/pull/22#discussion_r625531031

As per the comment above, it seems that some tesseract options cannot be used together. For instance:

Generated command:
"tesseract" - "/tmp/ocr151TMx" --psm 3 --oem 0

Returned message:
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract." at /var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/FriendlyErrors.php line 66

I still haven't verified whether this is intended behaviour for tesseract, a tesseract bug, or incomplete configuration. What's certain is that the root cause is somewhere inside tesseract, not in the OCR tool, nor the PHP library we're using.


Now, anticipating this behaviour is not something that we're supposed to do (special cases aside), because knowing which option values won't work together is an internal detail that we need not know about. At most, it's the PHP library that should do it for us, but that's not currently the case. Reacting to the issue after the fact is also not possible for us, since tesseract will just die with exit code 1, which might happen for a variety of reasons, not just bad combinations of options.
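The exit-code point can be illustrated with a small Python sketch. It does not invoke tesseract: the helper `run_child` is hypothetical and starts a stand-in Python child process that writes a message to stderr and exits with code 1. The second stderr message is made up for the example.

```python
import subprocess
import sys


def run_child(stderr_text):
    """Run a stand-in process that writes stderr_text to stderr and exits with 1."""
    script = "import sys; sys.stderr.write(sys.argv[1]); sys.exit(1)"
    return subprocess.run(
        [sys.executable, "-c", script, stderr_text],
        capture_output=True,
        text=True,
    )


# Two unrelated failures, simulated by the stand-in process.
bad_options = run_child("Failed loading language 'eng'")
other_failure = run_child("hypothetical unrelated failure")

# The exit codes are identical; only the (unstructured) stderr text differs.
same_exit_code = bad_options.returncode == other_failure.returncode
```

Telling the two cases apart would require parsing the stderr text, which is the kind of implementation detail the comment above argues the tool should not depend on.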

Instead, I think it would be a nice idea to have a catch-all catch (pardon the pun) for low-level exceptions. Practically speaking, this means properly handling instances of TesseractOcrException (defined by the library) in the main exception handler. For now we could even just display a generic error message ("something went wrong with tesseract"), and maybe special-case some exception subclasses to provide a more detailed error message. Passing the command output through doesn't seem like a good idea in general, as it might contain sensitive data.
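A rough sketch of that catch-all, written in Python for illustration only (the real handler would live in the tool's PHP code). The `TesseractOcrException` class here is a stand-in for the library's base exception; `EngineNotFoundError`, `user_facing_error` and `run_ocr` are hypothetical names invented for this example.

```python
GENERIC_MESSAGE = "Something went wrong with tesseract."


class TesseractOcrException(Exception):
    """Stand-in for the base exception defined by the PHP library."""

    def __init__(self, message, command_output=""):
        super().__init__(message)
        # Raw command output is kept for logging, never shown to the user.
        self.command_output = command_output


class EngineNotFoundError(TesseractOcrException):
    """Hypothetical subclass that gets a more specific message."""


def user_facing_error(exc):
    """Map a low-level engine exception to a message that is safe to display."""
    if isinstance(exc, EngineNotFoundError):
        return "The tesseract engine is not available on this server."
    return GENERIC_MESSAGE


def run_ocr(engine_call):
    """Call the engine and turn any engine exception into an error result."""
    try:
        return {"text": engine_call(), "error": None}
    except TesseractOcrException as exc:
        return {"text": None, "error": user_facing_error(exc)}
```

The design choice is that only the base class is caught, so unrelated bugs still surface as real errors, while the raw command output stays out of the response.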

A quick search seems to suggest that we might be missing the appropriate data for running the legacy engine, see e.g. https://github.com/tesseract-ocr/tesseract/issues/2315 . Possible solutions:

  • Add it to the installation requirements (which might become a bit complicated)
  • As above, but make it optional and hide the relevant tesseract options if not available. This also seems to be dealing with implementation details that the library should handle for us, so probably a bad idea.
  • Ignore it, I guess? And provide a nicer error message.

Whatever the solution, having a catch-all clause seems like a good idea to me. Still better than a 500, at least.

We now return a generic error message (The tesseract engine returned an internal error.) when you submit certain combinations of Tesseract options.

For example, try one of the links in the description.

I tested every combination of Tesseract options with a few different images. The error always happens with the same combination of options.

Test environment: https://ocr-test.wmcloud.org Version 0.5.0-13-g467f8ac.

FTR, that is T284831. In a nutshell, setting oem=0 will fail regardless of everything else because the language files used on toolforge don't have support for the legacy engine.

I was able to see a nicely formatted error message. However, I was unable to see an error message for the first link, given that the image does not have any scannable text.

Oh actually, that's because we removed the oem setting (T285262), which was necessary to trigger that specific error.

Daimona set the point value for this task to 3. (Jun 24 2021, 9:49 PM)