
Train a Tesseract model for 18th century printing
Closed, Declined (Public)

Description

Both Tesseract and the Google OCR suffer badly on "old" printing styles as often found around the 18th century. This is usually a combination of:

  • Fairly distinctive fonts which aren't common for bulk text in later years
    • Frequent use of long-s and ligatures that confuse OCR engines (long-s especially looks like 'f')
  • Poor printing quality due to the technology of the times
  • Strong paper color leading to lower contrast due to the paper technology and age effects

However, the text from this period is actually all rather similar, so a modified OCR model may be able to improve all of it.

An example of a page that OCRs fairly badly is this one (it's not the worst example, but it's typical):

Cu31924104036037.djvu.jpg (1×1 px, 443 KB)

Event Timeline

The ita_old (Italian - Old) language already supports long-S, so perhaps we can reference that?

Using ita_old on the above image gives reasonable results:

HE Main of this Fifth Book of Moſes is a Re-
capitulation of ſeveral memorable Paſſages, con-
- tained in the Other Fo#r. But they are here
…
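
For anyone wanting to repeat this kind of test from a script, here is a minimal sketch. It assumes the tesseract command-line tool is installed and that the ita_old traineddata is available to it; the file name page.jpg and both helper names are hypothetical and are not taken from this task.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="ita_old", psm=None):
    """Build the argument list for a Tesseract CLI run that prints text to stdout."""
    # "stdout" as the output base makes tesseract write the text to standard output.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Optional page segmentation mode, e.g. 6 for a single uniform block of text.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path, lang="ita_old"):
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical input file; swap in a real page image.
    print(run_ocr("page.jpg", lang="ita_old"))
```

Swapping the lang argument (for example to eng) gives a quick side-by-side of how each model handles the same page.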

Huh, cool. The real question, I guess, is how we assemble further data for training. We have two options, AIUI:

  • Extract thousands of lines from images like the above and proofread them line by line (probably needs a decent UI more than anything else, followed by crowdsourcing the line transcriptions, possibly scraping them from proofread pages somehow)
  • Figure out a way to carefully "degrade" known input text to appear like the above image (so set a similar font and run through a 1700s-ification program) and feed that in (this is how Tesseract is normally trained, I think).
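
As an illustration of the text side of the second option, here is a minimal sketch of a "1700s-ification" step that reintroduces long-s into modern text. The rule is a deliberate simplification and an assumption on my part (lowercase s becomes ſ unless it ends a word or comes directly before f); real period usage has more exceptions. The function name is hypothetical. Rendering the text in a period font and degrading the resulting image are separate steps not shown here.

```python
import re

# Simplified assumption: a lowercase "s" becomes long-s when another lowercase
# letter follows it, except when that letter is "f". Word-final "s" and
# capital "S" are left alone.
_LONG_S_PATTERN = re.compile(r"s(?=[a-z])(?!f)")


def add_long_s(text):
    """Return text with round s replaced by long-s under the simplified rule above."""
    return _LONG_S_PATTERN.sub("ſ", text)


if __name__ == "__main__":
    print(add_long_s("the invaluable benefit of that illustrious press"))
```

Lines processed this way could then be typeset in a Caslon-like font to produce synthetic training images paired with known ground truth.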

So I have managed to train up a model with the second method: https://en.wikisource.org/wiki/User:Inductiveload/Tesseract

The model is ~12MB, so I'm not sure how best to upload it (Phabricator doesn't seem to want to take it).

My proposal is to think of a good name for this. I called it eng_oldprint; I'm very much not wedded to that name, and I'm mindful that it might be easy to conflate with Old English (though the ISO code for that is ang). Then add it to the OCR server as another language that can be picked from the Advanced Settings page.

Over time, as I get better at figuring out how to train these models, this one could be tweaked or more models added (there's a very fine balance between errors from undertraining and errors from overfitting the model).

Demo:

page94.jpg (1×1 px, 223 KB)

OCR output for this page from three models, in order: eng, eng_oldprint (this model), ita_old.

eng:
58 Terra-Filius. N° xI.

founded upon this politick fuppofition, that when
they had got 2 new Printing—bouf, they could ne-
ver want new books; but by what means focver
it was bu'lt, my lord Crarexbon has the honour,

and we, his happy pofterity, the invaluable benefic
of it.

1 fhould think it an undertaking well worthy
the lborious Mr. Hearne, to give the world an ac-
count, from year to year, of the many incompa-
rable tomes, which iffue from that illuftrious prefs.
‘This, 1 apprehend, would do great honour to the
univerfity, and to its learned authors, fince the cata-
Jogue would not be crouded with any of thofe he-
retical, pernicious, and free-thinking tralts, which
are the noifom fpawn of other modern prefles: we
fhould find there no ill-meaning Effays spon_human
Unde; flanding, no Oceana’s, no Hypothefes of Liber-
#y, 10 defcants upon Original Comtrails, mor en-
quiries into the State of Nature, no Appeals to the
Laity and commen Senfe in matters of religion, no
vindications of Confeience and private Fudgment,
10 defences of Refiftance in any poflible” cafes, no
apologies for the Revolution, and the prefent Go-
vernment, &c. to fully the Academical Types, and
reproach the fclemn Imprimatur of the univerfity
— New, accurate Editions of primitive Fathers,
and antient Chrenicles, or modern fermons, and long
fyfends of Logick, Metaphyficks, and School-divinity
are the folid produCtions of this auguft Typographa-
sarm————Such are the effes, and fuch the advan-
tages of reftraining the lence of the prefs! How
would letters flourifh? how would arts revive?
bow would religion lift up her awful front? and
bow weuld the church rejoyce, if fuch a whole-
fome check were put upon the prefs throughout
the world ? '

But PranTinc is not the only, nor the principal
wle, far which thefe fupendous Rone-walls were

erected;

eng_oldprint (this model):
58 Terræ-F1lius. n" x1,

founded upon this politick ſuppoſition, that when
they had got a new Frmcng houſe, they could ne-
ver want new books; but by what means ſocver
it was bu't, my lord Clarendon has the honour,

and we, his happy poſlcrity, the invaluable beneſic
of it,

I ſhould think it an undertaking well worthy
the laborious Mr. Hearne, to give the world an ac-
count, from year to year, of the many incompa-
rable tomes, which iſſue from that illuſtrious preſs.
This, I apprehend, would do great honour to the
univerſity, and to its leamed authors, ſince the cata-
logue would not be crouded with any of thoſe he-
retical, pernicious, and free-thinking tracts, which
are the noiſom ſpawn of other modern preſſes: we
ſhould ſind there no ill.meaning Eſſays upon human
Underſtandmg, no Oceana's, no Hypotheſes of Liber-
ty, no deſcants upon Original Contracls, nor en-
quiries into the Stare of Nature, no Appeals to the
Laity and common Senſe in matters of religion, no
vindications of Conſcience and privare Judgment,
no defences of Reſiſtance in any poſſible caſes, no
apologies for the Revolution, and the preſent Go-
vernment, &c. to ſully the Academical Types, and
reproach the ſclemn Imprimatur of the univerſity
——New, accurate Editions of primitive Fathers,
and antient Chronicles, or modern ſermons, and long
ſyſturas of Logick, Metaphyſicks, and School-divinity
are the ſolid productions of this auguſt Typographa-
um————Such are the effects, and ſuch the advan-
tages of reſtraining the lrcence of the preſs! How
would letters flouriſh? how would arts revive?
bow would religiou lift up her awful front? and
how wculd the church rejoyce, if ſuch a whole-
ſome check were put upon the preſs throughout
the world ? l

But Printixg is not the only, not the principal
uſe, tar which theſe ſupendous ſtone-walls weie

erected 3

ita_old:
s8 Terra-Eilìus. N° XI.

founded upon this politick ſuppofition, that when
they had got a new Prinfl'ng-bouſſ, they could ne-
ver want new dooks , but by what means Hocver
i was bu'lt, my lorà CLARENDON has the honour,

and we, bis happy poſterity, the invaluable benefit
of it.

I fhou!d think it an undertaking well wor:hy
the laborious Mr. Hearne, to give the world an ac-
count, from year to pear, of the many incompa-
Table tomes, which iſſue from that illuſtrious preſs,
This, I apprehend, would do pgreat honour to the
univerſity, and to its learned authors, fince the cata-
ogue would not be crouded with any of thoſè he-
retical, pernicious, and free-thinking tra&s, which
are the noiſom ſpawn of other modern preſles: we
fhould find there no ill-meaning E/ſays upon human
Underfianding, no Oceana's, no Hypotheſes of Liber-
#, no deſcants upon Original Contraéts, nor en-
quiries into the Stare of Nature, no Appeals t0 the
Laity and commen Senſe in matters of religion, no
vindications of Conſcience and private Fudgment,
no defences of Reſiſtance in any poſſible calſes, no
apologies for the Revolution, and the preſent Go-
ernment, &c., to fully the Academical Types, and
reproach the ſclemn Imfrimatur of the univerſity
*— New, accurate Editions of primitive Fathers,
and antient Chronicles, or modern fermons, and long
{yſtunis of Logick, Metapbyſicks, and School-divivity
are the ſolid produ&ions of this auguſt Typograph&-
#7—— Such are the effeets, and fuch the advan-
tages of reſtraining the licence of the preſs! How
would letters fiouriſh? how would arts revive?
how wouid religion lift up her awful fiont ? and
how weuld the church rejoyce, if ſuch a whole-
Hme check were put upon the preſs throughout
the world? ‘

But PauntI86 is not the only, nor the principal
vſe, far which theſe ſtupendous ſtone-walls were

ere&ted;
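
To put a number on a comparison like this, one common measure is the character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the reference length. Below is a minimal sketch. The reference line is my own assumed transcription of the first body line (inferred from where the three outputs agree), not a proofread ground truth, and one line is far too little to rank the models reliably.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Assumed reference transcription of the first body line (not proofread).
REFERENCE = "founded upon this politick ſuppoſition, that when"

# First body line of each model's output, copied from the demo.
OUTPUTS = {
    "eng": "founded upon this politick fuppofition, that when",
    "eng_oldprint": "founded upon this politick ſuppoſition, that when",
    "ita_old": "founded upon this politick ſuppofition, that when",
}

if __name__ == "__main__":
    for model, line in OUTPUTS.items():
        print(model, round(cer(REFERENCE, line), 4))
```

Running the same measurement over a whole proofread page, or several, would give a fairer picture and would also help spot undertraining or overfitting between model versions.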

Model:

Actually on reflection, a name like eng_old_caslon_longs is probably pretty descriptive.

It may make sense to also train a "no longs" old-printing variant, because there are quite a few works that OCR badly due to the similar old-style print quality and that look somewhat like Caslon (though the italic 'h' doesn't look right for Caslon) but don't have long-s:

bart2.png (1×1 px, 1 MB)

Probably this can just be done with the Mannheim models: T287460