
WS OCR: Optionally keep annotation like bold/italic if present
Closed, Declined · Public · Feature

Description

It should be configurable (by the user, on a page-by-page basis) whether to retain formatting such as bold or italics when it is present in the OCR result, as it is from Google.

This can be useful sometimes (and it can be uselessly wrong at other times).

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper changed the subtype of this task from "Task" to "Feature Request". Jul 8 2021, 3:33 PM

It doesn't look like Google returns information about italics (or anything other than position, really). It splits things up into blocks, which contain paragraphs. Blocks can be regular text, tables, images, horizontal/vertical lines, barcodes, or unknown. We could probably do something with paragraphs (which is what T250185 is about), to ensure e.g. that there's always a blank line between them, or to (optionally) rewrap paragraphs. But it doesn't look like we can do anything with inline formatting.
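To illustrate the paragraph idea, here is a rough sketch, not the tool's actual code. It assumes the OCR result is available as a plain dict shaped like the JSON form of Google's full-text annotation (pages, then blocks, then paragraphs, then words, then symbols, with an optional detected break on a symbol). The function names, the `make_word` helper, and the sample data are all hypothetical. It keeps only regular text blocks, rewraps each paragraph onto one line, and puts a blank line between paragraphs.

```python
# Break types treated as a plain space when rewrapping a paragraph.
# Assumption: a "HYPHEN" break means the word continues on the next line,
# so the two halves are joined without a space.
BREAKS_AS_SPACE = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE", "LINE_BREAK"}


def paragraph_text(paragraph):
    """Join a paragraph's symbols into a single rewrapped line."""
    parts = []
    for word in paragraph.get("words", []):
        for symbol in word.get("symbols", []):
            parts.append(symbol.get("text", ""))
            break_type = (
                symbol.get("property", {}).get("detectedBreak", {}).get("type")
            )
            if break_type in BREAKS_AS_SPACE:
                parts.append(" ")
    return "".join(parts).strip()


def annotation_to_text(annotation):
    """Return the text blocks' paragraphs separated by blank lines."""
    paragraphs = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            # Skip tables, images, lines, barcodes and unknown blocks.
            if block.get("blockType", "TEXT") != "TEXT":
                continue
            for paragraph in block.get("paragraphs", []):
                text = paragraph_text(paragraph)
                if text:
                    paragraphs.append(text)
    return "\n\n".join(paragraphs)


def make_word(text, break_type=None):
    """Hypothetical helper: build a word dict for the sample data below."""
    symbols = [{"text": ch} for ch in text]
    if break_type is not None and symbols:
        symbols[-1]["property"] = {"detectedBreak": {"type": break_type}}
    return {"symbols": symbols}


# Made-up sample: two paragraphs in a text block, plus a barcode block.
sample = {
    "pages": [{
        "blocks": [
            {"blockType": "TEXT", "paragraphs": [
                {"words": [make_word("Hello", "SPACE"),
                           make_word("wor", "HYPHEN"),
                           make_word("ld", "LINE_BREAK")]},
                {"words": [make_word("Second", "SPACE"),
                           make_word("para", "EOL_SURE_SPACE"),
                           make_word("graph")]},
            ]},
            {"blockType": "BARCODE", "paragraphs": [
                {"words": [make_word("123")]},
            ]},
        ]
    }]
}

print(annotation_to_text(sample))
```

Note that nothing in this structure carries bold or italic information, which is why the sketch can only deal with paragraph boundaries.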

Tesseract used to include italic information, but doesn't now.

It doesn't look like there's anything we can do here. :-(

Darn, I was sure there was italic information in there somewhere (for Google, I knew Tesseract v4 didn't do it), but I can't see any sign of it now. I must have been going mad, or maybe just switched timelines again (hate it when that happens).

NRodriguez subscribed.

Declining due to the unforeseen complexity of this task, as stated above :(