Wikimedia OCR: 500 error with lang "equ"
Closed, Resolved | Public | 1 Estimated Story Points | BUG REPORT

Description

What is the problem?

If I submit an OCR request with lang = equ (which is the "Math / equation detection module"), I get a 500 error:

Message:  Uncaught PHP Exception thiagoalessio\TesseractOCR\UnsuccessfulCommandException: "Error! The command did not produce any output.  
    
  Generated command:  
  "tesseract" - "/tmp/ocrCetOOv" --psm 3 --oem 3 -l equ  
    
  Returned message:  
  Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/equ.traineddata  
  Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.  
  Failed loading language 'equ'  
  Tesseract couldn't load any languages!  
  Could not initialize tesseract." at /var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/FriendlyErrors.php line 66  
Time:  2021-06-11T15:19:04.537858+00:00  
Channel:  request  
Context:   exception:  {    
         "class": "thiagoalessio\\TesseractOCR\\UnsuccessfulCommandException",    
         "message": "Error! The command did not produce any output.\n\nGenerated command:\n\"tesseract\" - \"/tmp/ocrCetOOv\" --psm 3 --oem 3 -l equ\n\nReturned message:\nError opening data file /usr/share/tesseract-ocr/4.00/tessdata/equ.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your \"tessdata\" directory.\nFailed loading language 'equ'\nTesseract couldn't load any languages!\nCould not initialize tesseract.",    
         "code": 0,    
         "file": "/var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/FriendlyErrors.php:66",    
         "trace": [    
             "/var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php:39",    
             "/var/www/tool/src/Engine/TesseractEngine.php:87",    
             "/var/www/tool/src/Controller/OcrController.php:199",    
             "/var/www/tool/vendor/symfony/cache/LockRegistry.php:100",    
             "/var/www/tool/vendor/symfony/cache/Traits/ContractsTrait.php:88",    
             "/var/www/tool/vendor/symfony/cache-contracts/CacheTrait.php:70",    
             "/var/www/tool/vendor/symfony/cache/Traits/ContractsTrait.php:95",    
             "/var/www/tool/vendor/symfony/cache-contracts/CacheTrait.php:33",    
             "/var/www/tool/src/Controller/OcrController.php:200",    
             "/var/www/tool/src/Controller/OcrController.php:146",    
             "/var/www/tool/vendor/symfony/http-kernel/HttpKernel.php:157",    
             "/var/www/tool/vendor/symfony/http-kernel/HttpKernel.php:79",    
             "/var/www/tool/vendor/symfony/http-kernel/Kernel.php:195",    
             "/var/www/tool/public/index.php:21"    
         ]    
     }    
Extra:   host:  ocr-test.wmcloud.org    
   uri:  http://ocr-test.wmcloud.org/api.php?engine=tesseract&image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F6%2F64%2FGevel_-_Venray_-_20241580_-_RCE.jpg&lang=equ

Steps to reproduce problem
  1. https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fa%2Fab%2FThe_Blind_Man%2527s_Eyes_cover.jpg&langs%5B%5D=equ&engine=tesseract&psm=3&oem=3

Expected behavior: OCR text is returned
Observed behavior: 500 error

Environment

Wikimedia OCR: https://ocr-test.wmcloud.org Version 0.5.0-7-gc1906ef

Event Timeline

tools.ocr-test@tools-sgebastion-07:~$ tesseract --list-langs | grep equ
tools.ocr-test@tools-sgebastion-07:~$

So it's not installed, but since it's listed in the language map, the app thinks it's valid. I think the best option would be to install it; if that is not possible for some reason, then removing it from the list seems to be the only option.
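To illustrate the mismatch described above, here is a minimal Python sketch (not the tool's actual PHP code) that filters a language map down to the codes Tesseract reports as installed. The function names, the sample map, and the assumed header line of the `tesseract --list-langs` output are hypothetical and for illustration only.

```python
def parse_list_langs(output):
    """Parse text printed by `tesseract --list-langs` into a set of language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which is assumed here to
        # start with "List of available languages".
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.add(line)
    return langs


def installed_only(language_map, installed):
    """Keep only the entries of the language map whose code is installed."""
    return {code: name for code, name in language_map.items() if code in installed}


if __name__ == "__main__":
    # Sample output and map are made up for this sketch.
    sample_output = "List of available languages (2):\neng\nosd\n"
    language_map = {"eng": "English", "equ": "Math / equation detection module"}
    print(installed_only(language_map, parse_list_langs(sample_output)))
```

With a check like this, a code such as equ that appears in the map but has no traineddata file would never be offered to the user.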

According to the comments here, equ might've been excluded on purpose (depending on how tesseract was installed in the first place) since it was mainly a Tesseract 3 thing that worked very poorly. If this is the case, then it might be better to remove it straight away.

I'm pretty sure the math pseudo-language is not supported in Tesseract 4.x.

It should be supported according to this, but I've just tried it out locally (using Tesseract 5) and I got some gibberish from an image as simple as this one. Actually, what I'm getting is very similar to what's described in this Stack Overflow question.

I'm just going to remove it for now.

Daimona set the point value for this task to 1. Jun 11 2021, 4:00 PM

The language selector no longer lists equ.

If I try to submit with langs[]=equ it returns a validation message: The following language is not supported by the OCR engine: equ.
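
The behavior described above could be sketched roughly as follows, in Python rather than the tool's own PHP. The function name is hypothetical; the singular message wording is taken from this report, while the plural wording is an assumption added for illustration.

```python
def validate_langs(requested, supported):
    """Return a validation message for unsupported codes, or None if all are supported."""
    unsupported = [code for code in requested if code not in supported]
    if not unsupported:
        return None
    if len(unsupported) == 1:
        # Wording as quoted in this report.
        return ("The following language is not supported by the OCR engine: "
                + unsupported[0] + ".")
    # Plural wording is an assumption, not taken from the tool.
    return ("The following languages are not supported by the OCR engine: "
            + ", ".join(unsupported) + ".")


if __name__ == "__main__":
    print(validate_langs(["equ"], {"eng", "fra"}))
```

Running the check before invoking the engine means an unknown code produces a validation message instead of an uncaught exception and a 500 response.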

Test environment: https://ocr-test.wmcloud.org Version 0.5.0-10-ga771905.