It should be configurable (by the user, on a page-by-page basis) whether to retain things like italics when they are present in the OCR result, as they are from Google.
This can be useful sometimes (and it can be uselessly wrong at other times).
It doesn't look like Google returns information about italics (or anything other than position, really). It splits things up into blocks, which contain paragraphs. Blocks can be regular text, tables, images, horizontal/vertical lines, barcodes, or unknown. We could probably do something with paragraphs (which is what T250185 is about), to ensure e.g. that there's always a blank line between them, or to (optionally) rewrap paragraphs. But it doesn't look like we can do anything with inline formatting.
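For what it's worth, the paragraph handling could be sketched roughly like this. It assumes we have already pulled each paragraph's text out of the OCR response as a plain string; the function name `join_paragraphs` and its `rewrap`/`width` parameters are made up for illustration and are not part of any existing API.

```python
import textwrap


def join_paragraphs(paragraphs, rewrap=False, width=80):
    """Join OCR paragraphs with exactly one blank line between them.

    `paragraphs` is assumed to be a list of strings, one per paragraph,
    which may still contain the OCR engine's own line breaks.
    """
    cleaned = []
    for para in paragraphs:
        if rewrap:
            # Collapse all whitespace (including OCR line breaks),
            # then re-fill the text to the requested width.
            text = textwrap.fill(" ".join(para.split()), width=width)
        else:
            # Keep the original line breaks; only trim the edges.
            text = para.strip()
        if text:  # skip paragraphs that are empty after cleaning
            cleaned.append(text)
    return "\n\n".join(cleaned)
```

Called with `rewrap=False` this only normalises the spacing between paragraphs; with `rewrap=True` it also discards the original line breaks. It makes no attempt to rejoin words hyphenated across lines, which would need separate handling.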
Tesseract used to include italic information, but doesn't now.
It doesn't look like there's anything we can do here. :-(
Darn, I was sure there was italic information in there somewhere (for Google, I knew Tesseract v4 didn't do it), but I can't see any sign of it now. I must have been going mad, or maybe just switched timelines again (hate it when that happens).