Improve OCR: Test accuracy and features of various OCR engines
Closed, Resolved (Public)

Description

As an OCR user, I want the various OCR engines to be tested for accuracy, so that the team can have a better understanding of primary gaps and issues to be addressed in the 'Improve OCR' project.

Background: Let's see how the various engines handle italics, diacritics, multicolumn text, etc.

Some handy links:
https://tools.wmflabs.org/ws-google-ocr/
https://tools.wmflabs.org/indic-ocr/
https://archive.org/create/

Acceptance Criteria:

  • Test the primary OCR tools for accuracy and features

Event Timeline

Anyone have a link for easily testing Tesseract?

I did some very preliminary testing for fun. Google Vision and Google Drive handled diacritics and leading caps with flying colors, but had trouble with the columns. Google Vision pretty much didn't handle the columns at all and even broke up some lines incorrectly. Google Drive did a lot better but still got confused a couple times (especially due to one column being longer than the other). Internet Archive handled the columns with flying colors, but ignored all the diacritics and got confused by leading caps.

It seems clear even this early that we will still want to provide access to as many of these OCR tools as we can reliably maintain. They each have strengths and weaknesses.

Too bad we can't do some kind of multi-pass operation and combine the results of each system's output.

> Too bad we can't do some kind of multi-pass operation and combine the results of each system's output.

@aezell - Actually this might be possible with a local Tesseract solution. For example, Tesseract can be configured to do column detection. Perhaps this could be paired with Google Vision: you could use the Google Vision character output but the Tesseract line order, by seeing which lines are the most similar between the two outputs and assuming they are the same lines.
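A rough sketch of that line-matching idea, using only Python's standard `difflib`. The function names, the 0.5 threshold, and the sample lines are made up for illustration; neither engine is actually called here, and real output would likely need smarter matching.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two lines."""
    return SequenceMatcher(None, a, b).ratio()


def reorder_by_reference(accurate_lines, ordered_lines, threshold=0.5):
    """Emit lines from accurate_lines in the order given by ordered_lines.

    accurate_lines: lines from the engine with the better characters
                    (Google Vision in the comment above).
    ordered_lines:  lines from the engine with the better reading order
                    (Tesseract in the comment above).
    """
    remaining = list(accurate_lines)
    result = []
    for ref in ordered_lines:
        # Pick the not-yet-used accurate line closest to this reference line.
        best = max(remaining, key=lambda line: similarity(line, ref), default=None)
        if best is not None and similarity(best, ref) >= threshold:
            result.append(best)
            remaining.remove(best)
        else:
            # No confident match: keep the reference line rather than drop text.
            result.append(ref)
    # Append anything that never matched so no text is lost.
    result.extend(remaining)
    return result


# Made-up sample data, not real OCR output.
vision = ["Alice was beginning", "down the rabbit hole",
          "to get very tired", "she fell quickly"]
tesseract = ["Aitice was beginning", "to get very tired",
             "down the rabbit hoIe", "she fell quickIy"]
print(reorder_by_reference(vision, tesseract))
```

The greedy matching is quadratic in the number of lines, which should be fine for a single page but is only a starting point.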

I'm really curious to test out Tesseract though and see how it compares with the other engines.

It looks like the only way to get Internet Archive to detect diacritics is to specifically set the language to French (or another language that commonly uses diacritics). Setting it to English or no language results in all the diacritics being stripped.

I ran this page image (300 dpi) through all the OCR services. Here are the results:

| Engine | Formatting Errors | Character Errors | Whitespace Errors | Diacritics Preserved | Curly Quotes Preserved | Other Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Internet Archive | none | 4 | 0 | no | yes | confused by opening caps and ç, converted most diacritics to correct character without diacritics |
| Internet Archive (French) | none | 11 | 0 | yes | yes | confused by opening caps, changed w to m, changed ; to j, changed l to i, etc. |
| Tesseract 4.0.0-beta.1 | none | 8 | 1 | only é | yes | changed l’ to P, confused by diacritics other than é |
| Google OCR (English) | extensive errors | 0 | 2 | yes | sometimes | no paragraph breaks, only line breaks |
| Indic OCR | none | 2 | 4 | yes | sometimes | changed ? into ., omitted a quotation mark |

"Character Errors" means errors other than not detecting diacritics or curly quotes.
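For anyone wanting to automate a count along these lines, here is a minimal sketch. It is an assumption about how one could approximate the "Character Errors" column, not how the numbers above were produced: it strips diacritics and straightens curly quotes on both sides, then takes the Levenshtein distance. An edit distance counts individual characters, so a single misread word can score more than one. The function names are hypothetical.

```python
import unicodedata

# Map curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})


def normalize(text: str) -> str:
    """Strip diacritics and straighten curly quotes."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(QUOTE_MAP)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_errors(ocr_text: str, reference: str) -> int:
    """Count character edits, ignoring diacritic and curly-quote differences."""
    return edit_distance(normalize(ocr_text), normalize(reference))


print(character_errors("Aitice", "Alice"))  # 2
print(character_errors("l'ete", "l’été"))   # 0
```

Whitespace differences are not separated out here; they would need their own pass to match the "Whitespace Errors" column.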

The big takeaways are:

  • Google OCR is really great at OCR, but terrible at formatting.
  • Internet Archive can theoretically detect diacritics, but only if you choose a language that uses them, in which case accuracy may decline.
  • Indic OCR is probably the best single OCR service (although combining Tesseract and Google OCR may be even better).

@Samwilson, @aezell - Thought you might find these results interesting ^.

Thanks so much for doing this, @kaldari !

It looks like there will always be trade-offs. @Samwilson and I were talking the other day about the fact that we will likely need to support multiple OCR engines for the foreseeable future. Perhaps our energy is better spent improving the workflow around using them than trying to cobble together some amazing OCR result.

After all, Wikisource is built as a transcription engine and its users have a long history of perfecting transcriptions. We won't reach a point anytime soon where the WS pages are considered "finished" just because we have a really excellent OCR backend in use. So, if we know that the proofreading and validation are still going to happen as they do today, maybe it's best for us to focus on those engines with good formatting support and diacritic support, knowing that some characters here and there will need to be modified.

That's a good point. I could imagine an interface where you click the "OCR" button and it pops up a little list of options like:

  • "Google Vision (better for single column text)"
  • "Tesseract (better for multi-column text)"

Then they could just choose which option is right for that text.

ifried renamed this task from Test accuracy and features of various OCR engines to Improve OCR: Test accuracy and features of various OCR engines. Mar 12 2020, 4:48 PM
ifried updated the task description.

@Samwilson, @aezell - I just made a very important discovery. When you are sending an OCR request to the Google Vision API, if you set the request type to "DOCUMENT_TEXT_DETECTION" rather than "TEXT_DETECTION", it correctly detects columns, headers, etc. and gives you the text in the right order! In fact it gives you the exact same output as Indic OCR (which I believe uses the Google Drive API). And despite the documentation on Google's site, it seems to accept more file types than just PDF and TIFF; specifically it seems to be fine with JPEGs, although I've only tried JPEGs hosted directly on Google Cloud rather than passed to the API. This potentially means that our OCR problems are solved!
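To make the difference concrete, here is a hedged sketch of the JSON body for the Vision REST endpoint (`POST https://vision.googleapis.com/v1/images:annotate`). Only the feature `type` changes between the two modes. The helper name, bucket, and file name are placeholders, and authentication and the HTTP call itself are left out.

```python
import json


def build_annotate_request(image_uri: str,
                           feature_type: str = "DOCUMENT_TEXT_DETECTION") -> dict:
    """Build the JSON body for a Vision API images:annotate call."""
    if feature_type not in ("TEXT_DETECTION", "DOCUMENT_TEXT_DETECTION"):
        raise ValueError(f"unexpected feature type: {feature_type}")
    return {
        "requests": [
            {
                # Image hosted on Google Cloud Storage, as in the test above.
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": feature_type}],
            }
        ]
    }


# Hypothetical bucket and file name, for illustration only.
body = build_annotate_request("gs://example-bucket/test-ocr-document.jpg")
print(json.dumps(body, indent=2))
```

As far as I know, the recognized text comes back in the `fullTextAnnotation.text` field of each response, but that is worth confirming against the API reference before relying on it.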

Wow. That's a big shift. I was actually going to update this task to say that with all the idiosyncrasies with the various platforms, we were going to focus on the front-end workflow experience.

But, if this holds up across more testing, it seems like relying on the service a bit more might be a viable option.

I guess we should put a pin in this investigation but given the slowdown we are in, it's probably OK to keep poking at it.

That's terrific. I have a weird memory that we switched *to* TEXT_DETECTION intentionally at some point. But maybe I'm dreamin.

It's a simple matter to switch ws-google-ocr to use DOCUMENT_TEXT_DETECTION; should we do that now?

> That's terrific. I have a weird memory that we switched *to* TEXT_DETECTION intentionally at some point. But maybe I'm dreamin.

I don't remember if we ever used DOCUMENT_TEXT_DETECTION, but we may have decided against it since the documentation says it only supports PDF and TIFF.

> It's a simple matter to switch ws-google-ocr to use DOCUMENT_TEXT_DETECTION; should we do that now?

Yeah, seems like it would be worth trying. We should switch it and make sure everything still works OK.

Very cool. Thanks for all the research @kaldari!

kaldari claimed this task.

Change 587899 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[labs/tools/wikisource-ocr@master] Switch to 'document text detection' instead of 'text detection'

https://gerrit.wikimedia.org/r/587899

@Samwilson @aezell - Now that we have Tesseract 4.1.1 on Toolforge, I went back and tested with it. Interestingly, the accuracy was greatly improved by specifying the languages to apply (even for the English part), suggesting to me that Tesseract doesn't have good language detection (a problem that merlijn.wajer at the Internet Archive is apparently working on).

| Engine | Formatting Errors | Character Errors | Whitespace Errors | Diacritics Preserved | Curly Quotes Preserved | Other Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Internet Archive | none | 4 | 0 | no | yes | confused by opening caps and ç, converted most diacritics to correct character without diacritics |
| Internet Archive (French) | none | 11 | 0 | yes | yes | confused by opening caps, changed w to m, changed ; to j, changed l to i, etc. |
| Tesseract 4.0.0-beta.1 | none | 8 | 1 | only é | yes | "Alice"→"Aitice", changed l’ to P, confused by diacritics other than é |
| Tesseract 4.1.1 | none | 13 | 1 | only é | yes | "Alice"→"Aitice", all other errors in the French part |
| Tesseract 4.1.1 (eng+fra+Latin) | none | 2 | 1 | yes | yes | 2 apostrophes missing in the French part |
| Google OCR (English) | extensive errors | 0 | 2 | yes | sometimes | no paragraph breaks, only line breaks |
| Indic OCR | none | 2 | 4 | yes | sometimes | changed ? into ., omitted a quotation mark |

"Character Errors" means errors other than not detecting diacritics or curly quotes.
Test file: Test OCR document.jpg
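For reference, the "eng+fra+Latin" setting corresponds to Tesseract's `-l` option, which takes language names joined with `+`. A small sketch of building that command from Python follows; the helper name is made up, and the model name `Latin` is copied from the row above (depending on how the trained data is installed it may need a different name on another system).

```python
import shlex


def tesseract_command(image_path, languages=None):
    """Return the argument list for a Tesseract run that prints text to stdout."""
    cmd = ["tesseract", image_path, "stdout"]
    if languages:
        # Multiple languages are passed as one "+"-joined value.
        cmd += ["-l", "+".join(languages)]
    return cmd


cmd = tesseract_command("Test OCR document.jpg", ["eng", "fra", "Latin"])
print(shlex.join(cmd))
```

To actually run it, something like `subprocess.run(cmd, capture_output=True, text=True, check=True).stdout` should work, assuming Tesseract and the relevant trained data are installed.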

A new test using Test OCR document 2.jpg

| Engine | Formatting Errors | Character Errors | Whitespace Errors | Curly Quotes Preserved | Other Notes |
| --- | --- | --- | --- | --- | --- |
| Tesseract 4.1.1 | none | 15 | 0 | yes | 'Lancaster.'→'———', 'I should'→'1 sheuld', period changed to comma, 'a'→'a_', 'negro'→'necro' |
| Tesseract 4.1.1 (eng+Latin) | none | 13 | 1 | yes | 'Lancaster.'→'enge', 'I should'→'1 sheuld', period changed to comma |
| Google OCR (English) | none | 3 | 0 | no | 'I' deleted, 'inflict'→'indlict', em dash changed to space |
| Indic OCR | none | 1 | 0 | no | em dash changed to hyphen |

"Character Errors" means errors other than not detecting diacritics or curly quotes.