Page MenuHomePhabricator

Wikisource OCR: Allow custom thresholding pre-processing step
Open, Needs TriagePublicFeature

Assigned To
None
Authored By
Inductiveload
Jul 21 2021, 9:38 PM
Referenced Files
F34559158: t50.png
Jul 21 2021, 9:39 PM
F34559157: t45.png
Jul 21 2021, 9:39 PM
F34559152: Hamlet,_Second_Quarto,_1603_(Folger_STC_22278).djvu.jpg
Jul 21 2021, 9:38 PM

Description

Some texts are printed lightly on dark paper/with bleed-though, for example this one: https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Hamlet%2C_Second_Quarto%2C_1603_%28Folger_STC_22278%29.djvu/page63-1024px-Hamlet%2C_Second_Quarto%2C_1603_%28Folger_STC_22278%29.djvu.jpg

Hamlet,_Second_Quarto,_1603_(Folger_STC_22278).djvu.jpg (1×1 px, 246 KB)

This means OCR struggled if you set 50% as a threshold level, because that leaves lots of junk in the image:

convert /tmp/input.jpg -threshold 45% t45.png

50%:

t50.png (1×1 px, 30 KB)

45%:

t45.png (1×1 px, 28 KB)

The different in OCR is major: using the same trained model (eng)

50%:

Now what my Lord s proofe ha(l_) madeyouknow,

45%:

Now what my Lord is proofe hath made you know,