Acceptance Criteria:
- We upgrade our version of Tesseract from the 4.0.0 stable release to 5.0.0 alpha
Context:
On the talk page, a contributor pointed to the net benefits of this newer version.
See comment and examples of output here:
https://ocr-test.wmcloud.org/ with a single page with Google OCR and output here. The output shows very good OCR quality, but the two-column layout is not recognized.
https://ocr-test.wmcloud.org/ with a single page with Tesseract OCR and output here. The output quality is very bad, but the two-column layout is recognized very well. The same page is uploaded at https://archive.org/details/bharatkoshpage-82 along with the output text from their Tesseract version, 5.0.0-alpha-20201231-10-g1236. That output is very good and is recognized as two-column. Just out of curiosity, what version of Tesseract are we using? Jayanta (CIS-A2K) (talk) 06:29, 29 April 2021 (UTC)
SW: This is doable, but we would have to come back and rework it once 5.0.0 is no longer in alpha. The concern is creating tech debt that we may not return to.
NR: How do we upgrade when 5.0.0 comes out in ~6 months?
SW: It will be upgraded automatically. I suspect that 5.0.0 is taking so long to release because of API changes.
DM can reach out to IA to see how they handled it.
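
To help answer the "what version are we using" question and to flag when a host is still running a pre-release build, a small check along these lines might be useful. This is only a minimal sketch: it assumes the `tesseract` binary is on the PATH and that the first line of `tesseract --version` looks like `tesseract 4.0.0` or `tesseract 5.0.0-alpha-20201231-10-g1236`. The function names are hypothetical and not part of any existing tool.

```python
import re
import subprocess

# Assumed banner shape (first line of `tesseract --version`), e.g.
# "tesseract 4.0.0" or "tesseract 5.0.0-alpha-20201231-10-g1236".
_VERSION_RE = re.compile(r"^tesseract\s+v?(\d+)\.(\d+)\.(\d+)(?:-(\S+))?")


def parse_tesseract_version(banner):
    """Return ((major, minor, patch), suffix_or_None) from a version banner."""
    lines = banner.strip().splitlines()
    match = _VERSION_RE.match(lines[0].strip()) if lines else None
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch)), suffix


def is_prerelease(banner):
    """True if the banner carries an alpha/beta/rc suffix."""
    suffix = parse_tesseract_version(banner)[1]
    return suffix is not None and suffix.startswith(("alpha", "beta", "rc"))


def installed_tesseract_banner():
    """Run `tesseract --version` and return its text output."""
    result = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case the banner is printed there.
    return result.stdout or result.stderr
```

Calling `is_prerelease(installed_tesseract_banner())` on the OCR host would then show whether the alpha build is still in place, which could serve as a reminder to revisit the upgrade once a stable 5.0.0 is available.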