Add kraken OCR engine to Wikimedia OCR
Open, Needs Triage | Public | Feature

Description

Wikimedia OCR currently uses the free Tesseract OCR engine (which only supports printed text) and the commercial Google and Transkribus OCR engines.

The free kraken OCR engine supports both printed and handwritten text. Like Tesseract, kraken is used in the OCR-D project for OCR of historical prints. It is much slower than Tesseract, but it sometimes gives better results, and it would be the only non-commercial OCR engine available here for handwritten text.

I suggest starting with my free models for German print and German handwriting (they are not limited to German and can also be used for other languages written in Latin script). Many more models exist, for example for Arabic or Hebrew script.

I have already implemented a prototype and sent a draft pull request for Wikimedia OCR.

Event Timeline

sweil updated the task description.

The current implementation offers three different models for text recognition.
Is there a need for non-Latin scripts as well? If so, which ones: Arabic, Hebrew, others?

Kraken also supports different models for the segmentation (region and line detection).
The segmentation model should be selectable from the web interface and the API, too.

The implementation now supports different segmentation models for kraken, too. Currently either default or ubma_segmentation can be selected.
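As a rough illustration of what selecting a segmentation model could look like behind the web interface or API, here is a minimal Python sketch that assembles a kraken command line from a chosen model name. It is not the code from the pull request. The function name `build_kraken_command`, the model file paths, and the recognition model name `german_print.mlmodel` are all hypothetical; only the two option names `default` and `ubma_segmentation` come from this task. The sketch assumes kraken's command-line layout of `segment -bl` (baseline segmentation, with `-i` for a custom segmentation model) followed by `ocr -m` for the recognition model.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from the selectable option names to model files.
# "default" means: let kraken use its built-in segmentation model.
# The path for ubma_segmentation is an assumption for illustration only.
SEGMENTATION_MODELS: Dict[str, Optional[str]] = {
    "default": None,
    "ubma_segmentation": "/models/ubma_segmentation.mlmodel",
}


def build_kraken_command(
    image: str,
    output: str,
    recognition_model: str,
    segmentation: str = "default",
) -> List[str]:
    """Build (but do not run) a kraken command line as an argument list."""
    if segmentation not in SEGMENTATION_MODELS:
        raise ValueError(f"unknown segmentation model: {segmentation}")

    # Global input/output pair, then the baseline segmenter subcommand.
    cmd = ["kraken", "-i", image, output, "segment", "-bl"]

    # Pass a custom segmentation model only when one was selected.
    seg_path = SEGMENTATION_MODELS[segmentation]
    if seg_path is not None:
        cmd += ["-i", seg_path]

    # Finally, the recognition step with the chosen recognition model.
    cmd += ["ocr", "-m", recognition_model]
    return cmd


if __name__ == "__main__":
    print(build_kraken_command(
        "page.png", "page.txt", "german_print.mlmodel", "ubma_segmentation"
    ))
```

Returning an argument list instead of a shell string keeps user-selected values out of shell interpolation, and rejecting unknown names means only models known to the server can be requested.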