Train a Tesseract model for 18th century printing
Closed, Declined · Public

Assigned To

Authored By

	Inductiveload
	Apr 2 2021, 10:06 PM

Description

Both Tesseract and the Google OCR suffer badly on "old" printing styles as often found around the 18th century. This is usually a combination of:

Fairly distinctive fonts which aren't common for bulk text in later years
Frequent use of long-s and ligatures that confuse OCR engines (long-s especially looks like 'f')
Poor printing quality due to the technology of the times
Strong paper color leading to lower contrast due to the paper technology and age effects

However, the text from this period is actually all rather similar, so a modified OCR model may be able to improve all of it.

An example of a page that OCRs fairly badly is this (it's not the worst example, but it's typical):

Cu31924104036037.djvu.jpg (1×1 px, 443 KB)

Related Objects

Status	Assigned	Task
Declined	Inductiveload	T279208 Train a Tesseract model for 18th century printing
Open	None	T287060 Investigate a way to generate ground-truth test in a given font
Open	None	T287055 Customised font for simulating old printing

Event Timeline

Inductiveload created this task. · Apr 2 2021, 10:06 PM

Restricted Application added a subscriber: Aklapper. · Apr 2 2021, 10:06 PM

The ita_old (Italian - Old) language already supports long-S, so perhaps we can reference that?

Using ita_old on the above image gives reasonable results:

HE Main of this Fifth Book of Moſes is a Re-
capitulation of ſeveral memorable Paſſages, con-
- tained in the Other Fo#r. But they are here
…
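For anyone wanting to reproduce this kind of comparison locally, here is a minimal sketch that shells out to the Tesseract command-line tool with a chosen language model. It assumes `tesseract` is on the PATH and that the named model (for example `ita_old`) is installed; the helper names and the `page.jpg` path are hypothetical and not part of this task.

```python
import subprocess


def tesseract_cmd(image_path, lang, tessdata_dir=None):
    """Build the argument list for a Tesseract run that writes text to stdout."""
    # "stdout" as the output base makes Tesseract print the text
    # instead of writing a .txt file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the .traineddata file lives outside the
        # default tessdata directory (e.g. a locally trained model).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr(image_path, lang, tessdata_dir=None):
    """Run Tesseract and return the recognised text.

    Raises CalledProcessError if Tesseract exits with an error,
    for instance when the requested language model is missing.
    """
    result = subprocess.run(
        tesseract_cmd(image_path, lang, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr("page.jpg", "ita_old")` and `ocr("page.jpg", "eng")` on the same scan would then give two outputs to compare side by side.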

Huh, cool. The real question I guess is how we assemble further data for training. We have two options, AIUI:

Extract thousands of lines from images like the above and proofread them line by line (probably needs a decent UI more than anything else, followed by crowdsourcing the line transcriptions, possibly scraping them from proofread pages somehow)
Figure out a way to carefully "degrade" known input text to appear like the above image (so set a similar font and run through a 1700s-ification program) and feed that in (this is how Tesseract is normally trained, I think).
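As a rough illustration of the text side of the second option, the sketch below converts modern spelling to long-s spelling before it would be rendered in a period font. The rule is a deliberate simplification and an assumption on my part (long s everywhere except at the end of a word and before 'f'); real 18th-century practice had more exceptions, and the function name is hypothetical. Rendering the text in a font and degrading the resulting image are not shown.

```python
import re

LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def long_s_ify(text):
    """Replace lowercase 's' with long s using a simplified period rule.

    Assumed rule (a simplification, not a full typographic model):
      * 's' followed by another lowercase letter becomes long s,
        so word-final 's' stays round;
      * 's' directly before 'f' stays round;
      * capital 'S' is never changed.
    """
    # The lookahead matches any lowercase ASCII letter except 'f'.
    return re.sub(r"s(?=[a-eg-z])", LONG_S, text)
```

For example, `long_s_ify("Passages of Moses")` gives `Paſſages of Moſes`, which matches the spellings seen in the OCR output above.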

Inductiveload mentioned this in T286656: Old English support in Tesseract. · Jul 19 2021, 7:55 PM

Inductiveload added a project: User-Inductiveload. · Jul 20 2021, 10:36 PM

Restricted Application added a project: Community-Tech. · Jul 20 2021, 10:36 PM

Inductiveload moved this task from Inbox/backlog to Train special OCR models on the User-Inductiveload board. · Jul 20 2021, 10:38 PM

Inductiveload claimed this task. · Jul 21 2021, 12:52 AM

So I have managed to train up a model with the second method: https://en.wikisource.org/wiki/User:Inductiveload/Tesseract

The model is ~12MB, so I'm not sure how best to upload it (Phabricator doesn't seem to want to take it).

My proposal is to think of a good name for this (I called it eng_oldprint, very much not wedded to that name, but mindful that it might be easy to conflate with Old English, though the ISO code for that is ang). Then add it to the OCR server as another language to be picked from the Advanced Settings page.

Over time, as I get better at figuring out how to train these models, this one could be tweaked or more models added (there's a very fine balance between errors from undertraining and errors from overfitting the model).

Demo:

eng:

58 Terra-Filius. N° xI. founded upon this politick fuppofition, that when they had got 2 new Printing—bouf, they could ne- ver want new books; but by what means focver it was bu'lt, my lord Crarexbon has the honour, and we, his happy pofterity, the invaluable benefic of it. 1 fhould think it an undertaking well worthy the lborious Mr. Hearne, to give the world an ac- count, from year to year, of the many incompa- rable tomes, which iffue from that illuftrious prefs. ‘This, 1 apprehend, would do great honour to the univerfity, and to its learned authors, fince the cata- Jogue would not be crouded with any of thofe he- retical, pernicious, and free-thinking tralts, which are the noifom fpawn of other modern prefles: we fhould find there no ill-meaning Effays spon_human Unde; flanding, no Oceana’s, no Hypothefes of Liber- #y, 10 defcants upon Original Comtrails, mor en- quiries into the State of Nature, no Appeals to the Laity and commen Senfe in matters of religion, no vindications of Confeience and private Fudgment, 10 defences of Refiftance in any poflible” cafes, no apologies for the Revolution, and the prefent Go- vernment, &c. to fully the Academical Types, and reproach the fclemn Imprimatur of the univerfity — New, accurate Editions of primitive Fathers, and antient Chrenicles, or modern fermons, and long fyfends of Logick, Metaphyficks, and School-divinity are the folid produCtions of this auguft Typographa- sarm————Such are the effes, and fuch the advan- tages of reftraining the lence of the prefs! How would letters flourifh? how would arts revive? bow would religion lift up her awful front? and bow weuld the church rejoyce, if fuch a whole- fome check were put upon the prefs throughout the world ? ' But PranTinc is not the only, nor the principal wle, far which thefe fupendous Rone-walls were erected;

eng_oldprint (this model):

58 Terræ-F1lius.
n" x1, founded upon this politick ſuppoſition, that when they had got a new Frmcng houſe, they could ne- ver want new books; but by what means ſocver it was bu't, my lord Clarendon has the honour, and we, his happy poſlcrity, the invaluable beneſic of it, I ſhould think it an undertaking well worthy the laborious Mr. Hearne, to give the world an ac- count, from year to year, of the many incompa- rable tomes, which iſſue from that illuſtrious preſs. This, I apprehend, would do great honour to the univerſity, and to its leamed authors, ſince the cata- logue would not be crouded with any of thoſe he- retical, pernicious, and free-thinking tracts, which are the noiſom ſpawn of other modern preſſes: we ſhould ſind there no ill.meaning Eſſays upon human Underſtandmg, no Oceana's, no Hypotheſes of Liber- ty, no deſcants upon Original Contracls, nor en- quiries into the Stare of Nature, no Appeals to the Laity and common Senſe in matters of religion, no vindications of Conſcience and privare Judgment, no defences of Reſiſtance in any poſſible caſes, no apologies for the Revolution, and the preſent Go- vernment, &c. to ſully the Academical Types, and reproach the ſclemn Imprimatur of the univerſity ——New, accurate Editions of primitive Fathers, and antient Chronicles, or modern ſermons, and long ſyſturas of Logick, Metaphyſicks, and School-divinity are the ſolid productions of this auguſt Typographa- um————Such are the effects, and ſuch the advan- tages of reſtraining the lrcence of the preſs! How would letters flouriſh? how would arts revive? bow would religiou lift up her awful front? and how wculd the church rejoyce, if ſuch a whole- ſome check were put upon the preſs throughout the world ? l But Printixg is not the only, not the principal uſe, tar which theſe ſupendous ſtone-walls weie erected 3

ita_old:

s8 Terra-Eilìus. N° XI.
founded upon this politick ſuppofition, that when they had got a new Prinﬂ'ng-bouſſ, they could ne- ver want new dooks , but by what means Hocver i was bu'lt, my lorà CLARENDON has the honour, and we, bis happy poſterity, the invaluable benefit of it. I fhou!d think it an undertaking well wor:hy the laborious Mr. Hearne, to give the world an ac- count, from year to pear, of the many incompa- Table tomes, which iſſue from that illuſtrious preſs, This, I apprehend, would do pgreat honour to the univerſity, and to its learned authors, fince the cata- ogue would not be crouded with any of thoſè he- retical, pernicious, and free-thinking tra&s, which are the noiſom ſpawn of other modern preſles: we fhould find there no ill-meaning E/ſays upon human Underfianding, no Oceana's, no Hypotheſes of Liber- #, no deſcants upon Original Contraéts, nor en- quiries into the Stare of Nature, no Appeals t0 the Laity and commen Senſe in matters of religion, no vindications of Conſcience and private Fudgment, no defences of Reſiſtance in any poſſible calſes, no apologies for the Revolution, and the preſent Go- ernment, &c., to fully the Academical Types, and reproach the ſclemn Imfrimatur of the univerſity *— New, accurate Editions of primitive Fathers, and antient Chronicles, or modern fermons, and long {yſtunis of Logick, Metapbyſicks, and School-divivity are the ſolid produ&ions of this auguſt Typograph&- #7—— Such are the effeets, and fuch the advan- tages of reſtraining the licence of the preſs! How would letters fiouriſh? how would arts revive? how wouid religion lift up her awful fiont ? and how weuld the church rejoyce, if ſuch a whole- Hme check were put upon the preſs throughout the world? ‘ But PauntI86 is not the only, nor the principal vſe, far which theſe ſtupendous ſtone-walls were ere&ted;
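One way to put a number on a comparison like the one above is a character error rate (CER): the edit distance between an OCR output and a proofread transcription, divided by the length of the transcription. The sketch below is a plain-Python illustration, not something used in this task; the function names are hypothetical, and the "truth" string in the usage note is assumed to be the correct reading of that word.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ocr_text, truth):
    """Character error rate of ocr_text against a ground-truth string."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)
```

Taking a single word from the demo and assuming "ſuppoſition" is the correct transcription, `cer("fuppofition", "ſuppoſition")` is 2/11 (both long-s characters misread as 'f'), while an output that reads the word correctly scores 0.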

Model:

Inductiveload added a subtask: T287055: Customised font for simulating old printing. · Jul 22 2021, 3:07 PM

Actually on reflection, a name like eng_old_caslon_longs is probably pretty descriptive.

It may make sense to also train a "no longs" old-printing variant, because there are quite a few works that OCR badly due to the similar old-style print quality that look somewhat like Caslon (though the italic 'h' doesn't look quite right for Caslon) but don't have long-s: