Improve OCR: Test accuracy and features of various OCR engines
Closed, Resolved · Public

Assigned To

Authored By

	kaldari
	Mar 4 2020, 8:55 PM

Description

As an OCR user, I want the various OCR engines to be tested for accuracy, so that the team can have a better understanding of primary gaps and issues to be addressed in the 'Improve OCR' project.

Background: Let's see how the various engines handle italics, diacritics, multicolumn text, etc.

Some handy links:
https://tools.wmflabs.org/ws-google-ocr/
https://tools.wmflabs.org/indic-ocr/
https://archive.org/create/

Acceptance Criteria:

Test the primary OCR tools for accuracy and features

Related Objects

Status	Assigned	Task
Open	None	T161979 Optimize OCR model for Wikisource for each book based on initial proofreading
Resolved	Samwilson	T161978 Epic: Generalized OCR for Wikisource
Resolved	aezell	T244100 Spike: New/Improved OCR tool [8 hours]
Resolved	kaldari	T246944 Improve OCR: Test accuracy and features of various OCR engines

Event Timeline

kaldari created this task. Mar 4 2020, 8:55 PM

Restricted Application added a subscriber: Aklapper. Mar 4 2020, 8:55 PM

kaldari mentioned this in T244100: Spike: New/Improved OCR tool [8 hours]. Mar 4 2020, 8:56 PM

Tgr subscribed. Mar 4 2020, 9:17 PM

kaldari updated the task description. Mar 4 2020, 9:45 PM

Restricted Application added a project: Internet-Archive. Mar 4 2020, 9:45 PM

Anyone have a link for easily testing Tesseract?

I did some very preliminary testing for fun. Google Vision and Google Drive handled diacritics and leading caps with flying colors, but had trouble with the columns. Google Vision pretty much didn't handle the columns at all and even broke up some lines incorrectly. Google Drive did a lot better but still got confused a couple times (especially due to one column being longer than the other). Internet Archive handled the columns with flying colors, but ignored all the diacritics and got confused by leading caps.

It seems clear even this early that we will still want to provide access to as many of these OCR tools as we can reliably maintain. They each have strengths and weaknesses.

Too bad we can't do some kind of multi-pass operation and combine the results of each system's output.


@aezell - Actually, this might be possible with a local Tesseract solution. For example, Tesseract can be configured to do column detection, so it could be paired with Google Vision: use the Google Vision character output but the Tesseract line order, matching lines between the two outputs by similarity and assuming the most similar lines are the same line.
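The pairing idea above can be sketched in a few lines. This is only an illustration, not something that exists in any of the tools: the function name `reorder_by_reference`, the `min_ratio` threshold, and the greedy matching strategy are all assumptions, and the sample lines are made up.

```python
from difflib import SequenceMatcher


def reorder_by_reference(text_lines, order_lines, min_ratio=0.6):
    """Return text_lines rearranged to follow the order of order_lines.

    text_lines:  lines from the engine with the better characters
                 (e.g. Google Vision), possibly in the wrong order.
    order_lines: lines from the engine with the better layout
                 (e.g. Tesseract), possibly with character errors.
    min_ratio:   hypothetical similarity cutoff below which no match is made.
    """
    remaining = list(text_lines)
    result = []
    for ref in order_lines:
        if not remaining:
            break
        # Greedy choice: the unused line most similar to this reference line.
        best = max(remaining, key=lambda cand: SequenceMatcher(None, ref, cand).ratio())
        if SequenceMatcher(None, ref, best).ratio() >= min_ratio:
            result.append(best)
            remaining.remove(best)
    # Lines that never matched are kept at the end rather than dropped.
    result.extend(remaining)
    return result


# Made-up example: accurate characters read across two columns ...
vision = ["Alice was beginning", "Down the rabbit-hole", "to get very tired", "she fell slowly"]
# ... and the correct column order with some character errors.
tesseract = ["Aitice was beginning", "to get very tired", "Down the rabbit-hoIe", "she fell slowly"]
merged = reorder_by_reference(vision, tesseract)
print("\n".join(merged))
```

A greedy match like this would likely struggle with repeated or very short lines, so a real implementation would probably need to take line position into account as well.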

I'm really curious to test out Tesseract, though, and see how it compares with the other engines.

It looks like the only way to get Internet Archive to detect diacritics is to explicitly set the language to French (or another language that commonly uses diacritics). Setting it to English or to no language results in all the diacritics being stripped.

I ran this page image (300 dpi) through all the OCR services. Here are the results:

Engine	Formatting Errors	Character Errors	Whitespace Errors	Diacritics Preserved	Curly Quotes Preserved	Other Notes
Internet Archive	none	4	0	no	yes	confused by opening caps and ç, converted most diacritics to correct character without diacritics
Internet Archive (French)	none	11	0	yes	yes	confused by opening caps, changed w to m, changed ; to j, changed l to i, etc.
Tesseract 4.0.0-beta.1	none	8	1	only é	yes	changed l’ to P, confused by diacritics other than é
Google OCR (English)	extensive errors	0	2	yes	sometimes	no paragraph breaks, only line breaks
Indic OCR	none	2	4	yes	sometimes	changed ? into ., omitted a quotation mark

"Character Errors" means errors other than not detecting diacritics or curly quotes.
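The thread does not say how the counts in the table were produced. As a rough illustration of the definition above, one way to count character errors while ignoring missing diacritics and straightened quotes is to normalize both texts before comparing them. The function names and the counting rule (one error per differing character position in a diff) are assumptions, and this sketch does not separate whitespace errors the way the table does.

```python
import unicodedata
from difflib import SequenceMatcher

# Map curly quotes to straight quotes so quote style is not counted as an error.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text):
    """Strip combining marks (diacritics) and straighten curly quotes."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(QUOTE_MAP)


def count_character_errors(reference, ocr_output):
    """Count differing characters between the two texts after normalization."""
    a, b = normalize(reference), normalize(ocr_output)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


# Lost diacritics and straightened quotes are not counted.
print(count_character_errors("l\u2019\u00e9t\u00e9, \u00e7a", "l'ete, ca"))
# A substituted letter is counted.
print(count_character_errors("inflict", "indlict"))
```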

The big takeaways are:

Google OCR is really great at OCR, but terrible at formatting.
Internet Archive can theoretically detect diacritics, but only if you choose a language that uses them, in which case accuracy may decline.
Indic OCR is probably the best single OCR service (although combining Tesseract and Google OCR may be even better).

@Samwilson, @aezell - Thought you might find these results interesting ^.

Thanks so much for doing this, @kaldari !

It looks like there will always be trade-offs. @Samwilson and I were talking the other day about the fact that we will likely need to support multiple OCR engines for the foreseeable future. Perhaps our energy is better spent improving the workflow around using them and less on trying to cobble together some amazing OCR result.

After all, Wikisource is built as a transcription engine and its users have a long history of perfecting transcriptions. We won't reach a point anytime soon where the WS pages are considered "finished" just because we have a really excellent OCR backend in use. So, if we know that the proofreading and validation are still going to happen as they do today, maybe it's best for us to focus on those engines with good formatting support and diacritic support, knowing that some characters here and there will need to be modified.

That's a good point. I could imagine an interface where you click the "OCR" button and it pops up a little list of options like:

"Google Vision (better for single column text)"
"Tesseract (better for multi-column text)"

Then they could just choose which option is right for that text.

ifried moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. Mar 12 2020, 4:41 PM

ifried renamed this task from Test accuracy and features of various OCR engines to Improve OCR: Test accuracy and features of various OCR engines. Mar 12 2020, 4:48 PM

ifried updated the task description.

Prtksxna subscribed. Mar 12 2020, 5:35 PM

@Samwilson, @aezell - I just made a very important discovery. When you are sending an OCR request to the Google Vision API, if you set the request type to "DOCUMENT_TEXT_DETECTION" rather than "TEXT_DETECTION", it correctly detects columns, headers, etc. and gives you the text in the right order! In fact it gives you the exact same output as Indic OCR (which I believe uses the Google Drive API). And despite the documentation on Google's site, it seems to accept more file types than just PDF and TIFF; specifically it seems to be fine with JPEGs, although I've only tried JPEGs hosted directly on Google Cloud rather than passed to the API. This potentially means that our OCR problems are solved!
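For reference, the difference between the two modes comes down to the feature type in the request body sent to the Vision API's `images:annotate` endpoint. The sketch below only builds the JSON payload and does not send it; the helper name `build_vision_request` and the `gs://` URI are placeholders, and this is not the code used in ws-google-ocr.

```python
import json

# Public REST endpoint for Cloud Vision image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_vision_request(image_uri, language_hints=None, document_mode=True):
    """Build an images:annotate request body for a single image.

    document_mode=True  -> DOCUMENT_TEXT_DETECTION (layout-aware)
    document_mode=False -> TEXT_DETECTION
    """
    feature_type = "DOCUMENT_TEXT_DETECTION" if document_mode else "TEXT_DETECTION"
    request = {
        "image": {"source": {"imageUri": image_uri}},
        "features": [{"type": feature_type}],
    }
    if language_hints:
        # Optional language hints, e.g. ["en", "fr"].
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


# Hypothetical image location, for illustration only.
payload = build_vision_request("gs://example-bucket/page.jpg", language_hints=["en", "fr"])
print(json.dumps(payload, indent=2))
```

Actually sending the payload would require an authenticated POST to `VISION_ENDPOINT`, which is left out here.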

Wow. That's a big shift. I was actually going to update this task to say that with all the idiosyncrasies with the various platforms, we were going to focus on the front-end workflow experience.

But, if this holds up across more testing, it seems like relying on the service a bit more might be a viable option.

I guess we should put a pin in this investigation but given the slowdown we are in, it's probably OK to keep poking at it.

That's terrific. I have a weird memory that we switched *to* TEXT_DETECTION intentionally at some point. But maybe I'm dreamin.

It's a simple matter to switch ws-google-ocr to use DOCUMENT_TEXT_DETECTION; should we do that now?


I don't remember if we ever used DOCUMENT_TEXT_DETECTION, but we may have decided against it since the documentation says it only supports PDF and TIFF.


Yeah, seems like it would be worth trying. We should switch it and make sure everything still works OK.

Samwilson mentioned this in T248058: Google OCR tool: use 'document text detection' rather than 'text detection'. Mar 19 2020, 6:47 AM

It works well in my local testing with ws-google-ocr (in English and French anyway).

So shall we do T247284 (Improve OCR: Move ws-google-ocr repository to Gerrit) and then T248058 (Google OCR tool: use 'document text detection' rather than 'text detection')?

Sounds good to me!

Very cool. Thanks for all the research @kaldari!

kaldari closed this task as Resolved. Apr 2 2020, 2:46 PM

kaldari claimed this task.

Change 587899 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[labs/tools/wikisource-ocr@master] Switch to 'document text detection' instead of 'text detection'

https://gerrit.wikimedia.org/r/587899

gerritbot added a project: Patch-For-Review. Apr 10 2020, 1:10 AM

kaldari added a parent task: T244100: Spike: New/Improved OCR tool [8 hours]. Nov 19 2020, 2:20 AM

@Samwilson @aezell - Now that we have Tesseract 4.1.1 on Toolforge, I went back and tested with it. Interestingly, the accuracy was greatly improved by specifying the languages to apply (even for the English part), suggesting to me that Tesseract doesn't have good language detection (a problem that merlijn.wajer at the Internet Archive is apparently working on).

Engine	Formatting Errors	Character Errors	Whitespace Errors	Diacritics Preserved	Curly Quotes Preserved	Other Notes
Internet Archive	none	4	0	no	yes	confused by opening caps and ç, converted most diacritics to correct character without diacritics
Internet Archive (French)	none	11	0	yes	yes	confused by opening caps, changed w to m, changed ; to j, changed l to i, etc.
Tesseract 4.0.0-beta.1	none	8	1	only é	yes	"Alice"→"Aitice", changed l’ to P, confused by diacritics other than é
Tesseract 4.1.1	none	13	1	only é	yes	"Alice"→"Aitice", all other errors in the French part
Tesseract 4.1.1 (eng+fra+Latin)	none	2	1	yes	yes	2 apostrophes missing in the French part
Google OCR (English)	extensive errors	0	2	yes	sometimes	no paragraph breaks, only line breaks
Indic OCR	none	2	4	yes	sometimes	changed ? into ., omitted a quotation mark

"Character Errors" means errors other than not detecting diacritics or curly quotes.
Test file: Test OCR document.jpg
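
The "Tesseract 4.1.1 (eng+fra+Latin)" row corresponds to passing an explicit language list to Tesseract with `-l`. A minimal sketch of such an invocation follows; the helper names are made up, and it assumes the `tesseract` binary is on the path and that the `eng`, `fra`, and `Latin` models are installed under exactly those names (the thread does not show the actual command used).

```python
import subprocess


def tesseract_command(image_path, languages):
    """Build the argument list for running Tesseract with explicit languages."""
    # "stdout" tells Tesseract to print the recognized text instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_tesseract(image_path, languages):
    """Run Tesseract and return the recognized text (requires Tesseract installed)."""
    completed = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


command = tesseract_command("Test OCR document.jpg", ["eng", "fra", "Latin"])
print(" ".join(command))
```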

A new test using Test OCR document 2.jpg

Engine	Formatting Errors	Character Errors	Whitespace Errors	Curly Quotes Preserved	Other Notes
Tesseract 4.1.1	none	15	0	yes	'Lancaster.'→'———', 'I should'→'1 sheuld', period changed to comma, 'a'→'a_', 'negro'→'necro'
Tesseract 4.1.1 (eng+Latin)	none	13	1	yes	'Lancaster.'→'enge', 'I should'→'1 sheuld', period changed to comma
Google OCR (English)	none	3	0	no	'I' deleted, 'inflict'→'indlict', em dash changed to space
Indic OCR	none	1	0	no	em dash changed to hyphen

"Character Errors" means errors other than not detecting diacritics or curly quotes.
