
Add Mannheim University OCR models
Open, Needs TriagePublicFeature

Description

The OCR project at Mannheim has produced some fairly impressive models.

Models are at https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain (navigate to the various subfolders)

Specifically:

  • GT4HistOCR
  • frak2021
  • Fraktur_5000000

These are (I assume) generally useful for the deWS project.

However, they also do well on English non-Fraktur texts with long s, with Fraktur_5000000 beating out my own custom special-purpose models by a factor of 2(!) in character error rate!
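For context, character error rate (CER) here just means the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A minimal sketch (the example strings are taken from the sample below; nothing here is specific to Tesseract):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# One substitution ("filthy" -> "fiithy") over 27 reference characters:
print(cer("are they not a filthy crew.", "are they not a fiithy crew."))
```

So "a factor of 2" means one model makes roughly half as many character-level mistakes as the other on the same page.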

george5t.png (985×595 px, 48 KB)

A New T OUCH on the TIMES.

To itrs oun Proper Pune.

EORG E he is the mildeſt King,
that ever ſat on Britain's throne,
Behold how wiſely he has acted,
to his ſubjects every one.

But we're of a rebellious nature,
and our minds are ne'er content,

Likewiſe the moſt of our reflection,
are on the King and Parliament.

There's quakers, new lights, independents,
methodiſts and fwaulers too,

Thoſe mimons and ſimons,
are they not a fiithy crew.

Thoſe hypocrites that live amongſt us,
our religion they deſpiſe,

Empty fools without foundation,
neither loyal, juſt nor wiſe.

Our churchmen they are littie better,
if the truth it were all known,

They take the King for Britain's head,
but part of's laws they will not own.

Tis brotherly love's gone ſrom amongſt us,
neighbours they cannot agree,

They ſpend their money on the law,
and bring themſelves to poverty.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think that, in general, adding new (existing) models should be fairly easy. What I'd like to understand first is exactly which files we need from the directory linked in the task description, since there are lots of .traineddata files for every category.

The most recent ones are likely best.

Makes sense; the problem is that I can't figure out the naming scheme, which would allow us to automatically pull the latest version from there.

I believe Wikimedia OCR uses the "fast" models.

Yes, since we use the tesseract-ocr-all Debian package, which ships the fast versions. T284835 was a proposal to add both fast and best, but it's not being worked on.

The scheme is somehow based on the accuracy of the model. I am not sure why some models have one with a "plain name" and some don't.
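If the numeric suffix in the filenames really does encode the model's validation error, picking the "best" file could be automated along these lines. (The filenames below are purely illustrative; the real listing at the UB Mannheim URL would need to be checked, and the "plain name" variants carry no number at all.)

```python
import re

# Hypothetical directory listing -- verify against the actual contents of
# https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain
listing = [
    "frak2021-1.069.traineddata",
    "frak2021-0.905.traineddata",
    "frak2021.traineddata",        # "plain name" variant, metric unknown
]

def pick_lowest_error(names):
    """Return the file whose embedded number (assumed to be CER) is lowest."""
    scored = []
    for name in names:
        m = re.search(r"-(\d+\.\d+)\.traineddata$", name)
        if m:
            scored.append((float(m.group(1)), name))
    return min(scored)[1] if scored else None

print(pick_lowest_error(listing))
```

That only works if the number really is an error rate (lower = better) rather than a version; if it's an accuracy figure the comparison would need flipping, which is exactly the ambiguity above.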

But none of them appear to have changed: even frak2021 was last updated in March, and the others not since last year. So I think they're not turning over very fast.

Is there any more information I can provide here?

Aklapper changed the subtype of this task from "Task" to "Feature Request".Jan 2 2022, 10:11 PM

@Inductiveload: Do you have a specific question? (There's probably no further info needed.)

Sure: can we progress this? It's been months and months.

Please either elaborate what's blocking you, or who "we" is. See also https://www.mediawiki.org/wiki/Bug_management/Development_prioritization . Thanks.

I have done enough of CommTech's homework.

If they want to disclaim responsibility for their semi-complete projects, they could at least reply to this within 6 months to let someone know that they do not intend to do it, rather than going radio silent even after I answered the question they asked, which at least somewhat implies some kind of ownership.

And don't give me that passive aggressive stuff. You know what this is blocked by. Thanks.

And just removing the project tag doesn't count; there should then be a public notice about who is responsible for this code.

@Inductiveload: Sounds like you expect the Community-Tech team (which once worked on Wikimedia OCR) to reply. It's up to each team what they plan to work on or want to have on their workboard. See also https://meta.wikimedia.org/wiki/Community_Tech#Current_projects and https://meta.wikimedia.org/wiki/Community_Tech/OCR_Improvements , I guess. Regarding your "don't give me", please see https://www.mediawiki.org/wiki/Code_of_Conduct . Thanks.

They didn't "work on it", they wrote it. It is "their" code. They also specifically denied a request that would allow wikis to configure a different OCR backend instance (though there is a workaround), so there's a level of enforced ownership just in that attitude.

If they want to say they're not going to work on it, the minimum they can do is say so, and since there's information already provided here, maybe say how such a thing could be implemented such that they'd consider a patch. Because I know what happens next: I spend hours and hours figuring out how to patch it on my own, and then I get told I did it wrong (or maybe no one even bothers to review at all).

And I am sorry about the tone, it must be hard being the bug wrangler when there's such a huge institutional lack of interest or resourcing for development and maintenance.

But please also recognise that handing out links like that comes across as: "look, here's a page (which I'm sure you've seen anyway) whose policy says no one has to do anything at all, so deal with it or learn to fix it yourself (and if you're lucky, we'll review your patch when you do). Thanks."

Thanks for the clarification. I mostly agree with your last comment here about WMF's underlying resourcing problems.

Don't hesitate to ask me (the author of the mentioned models) if there remain any open questions.

Meanwhile, german_print is the best model from UB Mannheim.