
Update our tesseract version to 5.0.0 alpha
Closed, Resolved · Public · 3 Estimated Story Points

Description

Acceptance Criteria:

  • We upgrade our version of tesseract from 4.0.0 stable release to 5.0.0 alpha

Context:
On the talk page, a contributor pointed to the net benefits of this newer version.

See comment and examples of output here:
https://ocr-test.wmcloud.org/ with a single page with Google OCR and output here. The output shows very good OCR quality but, as is well known, the two-column layout is not recognized.
https://ocr-test.wmcloud.org/ with a single page with Tesseract OCR and output here. The output quality is very bad, but the two-column layout is recognized very well. The same page is uploaded at https://archive.org/details/bharatkoshpage-82 along with the text output of their Tesseract version, 5.0.0-alpha-20201231-10-g1236. That output is very good and is recognized as two-column. Just out of curiosity, which version of Tesseract are we using? Jayanta (CIS-A2K) (talk) 06:29, 29 April 2021 (UTC)

SW: This is doable, but we would have to come back and rework once it's no longer in alpha. Concern is creating tech debt that we may not return to.
NR: How do we upgrade when 5.0.0 comes out in ~6 months?
SW: It will be upgraded automatically. Suspect that the reason 5.0.0 is taking so long to release is API changes.
DM can reach out to IA to see how they handled it.

Event Timeline

Restricted Application added a subscriber: Aklapper.

@dmaza were we able to get a word from IA? is it ok to upgrade?

Context provided by IA on the risks, pros, and cons to weigh when deciding whether to upgrade:

~~~ We switched to the 5.0.0 alpha release tags because the speed got a lot better, as did the quality, and various runtime errors that we were seeing in 4.x were fixed.
1.- Were there any workarounds you had to deal with after the upgrade?
2.- Is it stable enough?
3.- Any known issues off the top of your head that we should be aware of when we upgrade?
~~~

1.- Not really, we've filed issues for problems that we ran into. There was a bug in leptonica, so I think we still ship our own leptonica, but otherwise not really. We have one or two patches of our own, but that is just an enhancement to hOCR generation (writing the scan_res property).
2.- I would say so. I believe we have fewer persistent problems than with 4.x, and some fixes are not being backported.
3.- Not really aware of any compatibility problems. There are mostly more options (in binarisation, for example).
I would suggest sticking to the (alpha) release tags, rather than the latest master commit.
We build our own .deb version of Tesseract using GitLab CI. There is an Ubuntu PPA with development builds as well, but we wanted to stick to specific versions (in some cases with minor patches).

Daimona subscribed.

I think one thing to consider is cross-platform compatibility (i.e. Windows and Mac), see my comment at T284831#7153886. I guess it might have to be discussed at some point.

FTR, I'm available for this, but ideally I'd want to pair with someone else.

ldelench_wmf set the point value for this task to 3. Jun 24 2021, 11:27 PM

This is now live on https://ocr-test.wmcloud.org, using 5.0.0-alpha-20210401 built from source and I've updated our sysadmin docs. Initial tests look good to me but we should test thoroughly, particularly with Indic languages which this is supposed to have better support for. I do not immediately notice any performance improvements but it certainly doesn't seem any worse.

@MusikAnimal I am seeing more "The tesseract engine returned an internal error" errors when transcribing Hindi and Marathi books.

The emailed error appears to be:

Message:  thiagoalessio\TesseractOCR\UnsuccessfulCommandException: Error! The command did not produce any output.  
    
  Generated command:  
  "tesseract" - "/tmp/ocrzYi9br" --psm 3 -l hin  
    
  Returned message:  
  Tesseract Open Source OCR Engine v5.0.0-alpha-20210401 with Leptonica  
  Estimating resolution as 270  
  Detected 60 diacritics  
  Floating point exception in /var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/FriendlyErrors.php:66  
  Stack trace:  
  #0 /var/www/tool/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php(39): thiagoalessio\TesseractOCR\FriendlyErrors::checkCommandExecution(Object(thiagoalessio\TesseractOCR\Command), '', 'Tesseract Open ...')  
  #1 /var/www/tool/src/Engine/TesseractEngine.php(81): thiagoalessio\TesseractOCR\TesseractOCR->run()  
  #2 /var/www/tool/src/Controller/OcrController.php(237): App\Engine\TesseractEngine->getText('https://upload....', Array)  
  #3 /var/www/tool/vendor/symfony/cache/LockRegistry.php(100): App\Controller\OcrController->App\Controller\{closure}(Object(Symfony\Component\Cache\CacheItem), true)  
  #4 /var/www/tool/vendor/symfony/cache/Traits/ContractsTrait.php(88): Symfony\Component\Cache\LockRegistry::compute(Object(Closure), Object(Symfony\Component\Cache\CacheItem), true, Object(Symfony\Component\Cache\Adapter\FilesystemAdapter), Object(Closure), Object(Symfony\Bridge\Monolog\Logger))  
  #5 /var/www/tool/vendor/symfony/cache-contracts/CacheTrait.php(70): Symfony\Component\Cache\Adapter\AbstractAdapter->Symfony\Component\Cache\Traits\{closure}(Object(Symfony\Component\Cache\CacheItem), true)  
  #6 /var/www/tool/vendor/symfony/cache/Traits/ContractsTrait.php(95): Symfony\Component\Cache\Adapter\AbstractAdapter->doGet(Object(Symfony\Component\Cache\Adapter\FilesystemAdapter), '675b5b08b843186...', Object(Closure), 1, Array, Object(Symfony\Bridge\Monolog\Logger))  
  #7 /var/www/tool/vendor/symfony/cache-contracts/CacheTrait.php(33): Symfony\Component\Cache\Adapter\AbstractAdapter->doGet(Object(Symfony\Component\Cache\Adapter\FilesystemAdapter), '675b5b08b843186...', Object(Closure), 1, Array)  
  #8 /var/www/tool/src/Controller/OcrController.php(238): Symfony\Component\Cache\Adapter\AbstractAdapter->get('675b5b08b843186...', Object(Closure))  
  #9 /var/www/tool/src/Controller/OcrController.php(161): App\Controller\OcrController->getText()  
  #10 /var/www/tool/vendor/symfony/http-kernel/HttpKernel.php(157): App\Controller\OcrController->homeAction()  
  #11 /var/www/tool/vendor/symfony/http-kernel/HttpKernel.php(79): Symfony\Component\HttpKernel\HttpKernel->handleRaw(Object(Symfony\Component\HttpFoundation\Request), 1)  
  #12 /var/www/tool/vendor/symfony/http-kernel/Kernel.php(195): Symfony\Component\HttpKernel\HttpKernel->handle(Object(Symfony\Component\HttpFoundation\Request), 1, true)  
  #13 /var/www/tool/public/index.php(21): Symfony\Component\HttpKernel\Kernel->handle(Object(Symfony\Component\HttpFoundation\Request))  
  #14 {main}  
Time:  2021-07-01T09:38:16.750667+00:00  
Channel:  tesseract  
Extra:   host:  ocr-test.wmcloud.org    
   uri:  http://ocr-test.wmcloud.org/?engine=tesseract&image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F2%2F20%2F%25E0%25A4%25AC%25E0%25A4%25BE%25E0%25A4%25AC%25E0%25A5%2581%25E0%25A4%25B0.pdf%2Fpage31-1024px-%25E0%25A4%25AC%25E0%25A4%25BE%25E0%25A4%25AC%25E0%25A5%2581%25E0%25A4%25B0.pdf.jpg&langs%5B0%5D=hi&psm=3

I have seen this in about 7 out of the 50 images I have tried to transcribe in Hindi.

So far, I have only seen it happen when choosing certain resolution sizes of the image.

For example, this image returns the error for sizes 364px, 456px, 583px and 1024px (and possibly others).
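To check which sizes trigger the error, the width component of a thumbnail URL can be swapped out programmatically. The following is a minimal sketch, assuming the URL follows the `page<N>-<width>px-` pattern seen in the URLs quoted in this task. `with_width` and `width_variants` are hypothetical helpers, not part of the OCR tool.

```python
import re

# Matches the "page<N>-<width>px-" part of a Commons PDF thumbnail URL,
# as seen in the URLs quoted in this task (assumed pattern).
_WIDTH_RE = re.compile(r"(/page\d+-)\d+(px-)")


def with_width(thumb_url: str, width: int) -> str:
    """Return thumb_url with its pixel width replaced by `width`."""
    new_url, count = _WIDTH_RE.subn(rf"\g<1>{width}\g<2>", thumb_url, count=1)
    if count == 0:
        raise ValueError("URL does not look like a page thumbnail")
    return new_url


def width_variants(thumb_url: str, widths):
    """Build one URL per width so each can be fed to tesseract in turn."""
    return [with_width(thumb_url, w) for w in widths]
```

Calling `width_variants(url, [364, 456, 583, 1024])` would produce the four sizes reported as failing above, ready to be downloaded and passed to tesseract one at a time.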

Other examples:

This looks like a tesseract bug. Might be related to issue 3314, although that's already fixed in the version we're using. Switching to a newer tesseract version may or may not help. I might try to attach a debugger and see if we can get a trace or something.

I SSHed to the server to debug this issue. First of all, I wanted to confirm that it's a tesseract issue:

$ curl -s "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9e/%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf/page26-1024px-%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf.jpg" | tesseract stdin stdout -l hin

Estimating resolution as 240
Floating point exception

So yeah, it is. Then I re-downloaded the hin.tessdata file from the source, to make sure we didn't have a bad copy. Same result as above. Then I attached a debugger to the process, and got some useful info:

$ sudo wget -O testimg.jpg "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9e/%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf/page26-1024px-%E0%A4%A6%E0%A5%87%E0%A4%B5%E0%A4%95%E0%A5%80%E0%A4%A8%E0%A4%82%E0%A4%A6%E0%A4%A8_%E0%A4%B8%E0%A4%AE%E0%A4%97%E0%A5%8D%E0%A4%B0.pdf.jpg"

$ gdb tesseract
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tesseract...done.
(gdb) run testimg.jpg stdout -l hin
Starting program: /usr/local/bin/tesseract testimg.jpg stdout -l hin
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 240

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7e95f36 in tesseract::Textord::make_a_word_break (this=this@entry=0x7ffff69c0c20, row=row@entry=0x555555c155e0, blob_box=..., blob_box@entry=..., prev_gap=prev_gap@entry=32767, prev_blob_box=prev_blob_box@entry=...,
    real_current_gap=real_current_gap@entry=294, within_xht_current_gap=294, next_blob_box=..., next_gap=-60, blanks=@0x7fffffffcd01: 0 '\000', fuzzy_sp=@0x7fffffffccff: false, fuzzy_non=@0x7fffffffcd00: false,
    prev_gap_was_a_space=@0x7fffffffcd02: false, break_at_next_gap=@0x7fffffffcd03: false) at src/textord/tospace.cpp:1234
1234            blanks = static_cast<uint8_t>(current_gap / row->space_size);

I guess row->space_size might be 0. Before reporting this upstream, I'd like to try git master of tesseract and see if it's still broken.
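That guess fits the signal: on x86-64 Linux, SIGFPE is what an integer division by zero raises. Below is a rough Python model of the expression at tospace.cpp:1234, with a guard added purely for illustration. It is an assumption about the failure mode, not the upstream fix, and `blanks_for_gap` is a hypothetical name.

```python
def blanks_for_gap(current_gap: int, space_size: int) -> int:
    """Model of `static_cast<uint8_t>(current_gap / row->space_size)`.

    Returns the number of blanks for a gap, or 0 when space_size is 0
    (the guard is illustrative only; it is not what upstream did).
    """
    if space_size == 0:
        # In C++ this integer division would trap with SIGFPE on x86-64.
        return 0
    # C++ integer division truncates toward zero; int() on a true division
    # result matches that for the small values involved here.
    quotient = int(current_gap / space_size)
    # static_cast<uint8_t> keeps only the low 8 bits.
    return quotient & 0xFF
```

With the gap of 294 from the trace, any non-zero `space_size` gives a small blank count, while a zero `space_size` is the case that would crash the real code.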

I tried upgrading tesseract to git master, but the bug is still there:

$ tesseract testimg.jpg stdout -l hin
Estimating resolution as 240
Floating point exception

$ tesseract --version | head -n1
tesseract 5.0.0-alpha-20210401-139-g38f0f

So I've rolled back to 5.0.0-alpha-20210401.
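The version string above has the shape that `git describe` produces: a tag, then the number of commits past it, then `g` and an abbreviated commit hash. A small sketch for telling a tagged build from a build off master, assuming that shape holds for the first line of `tesseract --version`; `parse_version_line` and `TesseractBuild` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class TesseractBuild(NamedTuple):
    tag: str                    # e.g. "5.0.0-alpha-20210401"
    commits_past_tag: int       # 0 for an exact release tag
    commit: Optional[str]       # abbreviated hash, None for a tagged build


# git-describe appends "-<count>-g<hash>" when the build is past a tag.
_DESCRIBE_SUFFIX = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$")


def parse_version_line(line: str) -> TesseractBuild:
    """Parse the first line of `tesseract --version` output."""
    version = line.strip().split()[-1]
    match = _DESCRIBE_SUFFIX.match(version)
    if match is None:
        return TesseractBuild(version, 0, None)
    return TesseractBuild(match["tag"], int(match["count"]), match["hash"])
```

A non-zero `commits_past_tag` marks a build that is not on a release tag, which is the situation IA recommended avoiding.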

This is the upstream bug report: https://github.com/tesseract-ocr/tesseract/issues/3483

Fixed upstream (on git master). I've just updated ocr-test to use the master version of tesseract. Some of the examples from T282150#7189424 are now working, but others still result in a FPE (2nd, 3rd and 4th in "other examples"). I've reported this in the upstream issue.

The second upstream issue was also fixed, and I've just updated tesseract to the new git master. I've also updated the docs with simplified instructions for tesseract upgrades.

The remaining examples at T282150#7189424 are now working.
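Re-checking the example images after each upgrade could be scripted. The sketch below assumes the images have already been downloaded locally and that `tesseract` is on the PATH. It relies on the POSIX behaviour of Python's `subprocess`, which reports a process killed by a signal as the negated signal number. `describe_exit` and `check_image` are hypothetical helpers.

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit status {returncode}"


def check_image(path: str, lang: str = "hin") -> str:
    """Run tesseract on one local image and report how it ended."""
    result = subprocess.run(
        ["tesseract", path, "stdout", "-l", lang],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return describe_exit(result.returncode)
```

An image that still hits the floating point exception would be reported as killed by SIGFPE, which separates the crash from ordinary non-zero exits such as a missing language file.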

I've identified some potentially interesting images which show differences between the old (4.0.0) and new (5.0.0) versions of Tesseract:

Telugu:

Bengali:

Hindi:

Farsi:

Malayalam:

Punjabi:

Odia: