WS OCR: Optionally keep annotation like bold/italic if present
Closed, Declined · Public · Feature

Assigned To

Authored By

	Inductiveload
	Jul 8 2021, 2:54 PM

Description

The user should be able to choose, on a page-by-page basis, whether to retain formatting such as italics when it is present in the OCR result, as it is from Google.

This can be useful sometimes (and it can be uselessly wrong at other times).

Related Objects

Mentioned In: T278839: Wikisource: Investigate how hoCR/ALTO can support formatting on Wikisource
Mentioned Here: T250185: Make Wikisource-OCR handle paragraphs better

Event Timeline

Inductiveload created this task. Jul 8 2021, 2:54 PM

Restricted Application added a project: Community-Tech. Jul 8 2021, 2:54 PM

Restricted Application added a subscriber: Aklapper.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Jul 8 2021, 3:33 PM

HMonroy moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. Jul 14 2021, 11:29 PM

HMonroy edited projects, added Community-Tech (CommTech-Sprint-5); removed Community-Tech. Jul 15 2021, 3:26 PM

Samwilson claimed this task. Aug 2 2021, 3:23 AM

Samwilson moved this task from Ready 🎬 to In Development 💻 on the Community-Tech (CommTech-Sprint-5) board.

ldelench_wmf edited projects, added Community-Tech (CommTech-Sprint-6); removed Community-Tech (CommTech-Sprint-5). Aug 2 2021, 6:17 PM

ldelench_wmf moved this task from Ready 🎬 to In Development 💻 on the Community-Tech (CommTech-Sprint-6) board.

It doesn't look like Google returns information about italics (or anything other than position, really). It splits things up into blocks, which contain paragraphs. Blocks can be regular text, tables, images, horizontal/vertical lines, barcodes, or unknown. We could probably do something with paragraphs (which is what T250185 is about), to ensure e.g. that there's always a blank line between them, or to (optionally) rewrap paragraphs. But it doesn't look like we can do anything with inline formatting.
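As a rough illustration of the paragraph handling mentioned above (the part T250185 covers), here is a minimal Python sketch. It assumes the OCR result has already been reduced to a list of blocks, each a list of paragraph strings. That shape, the function name `join_paragraphs`, and the `rewrap_width` parameter are all hypothetical; none of them come from the Google API or from the Wikisource OCR tool.

```python
import textwrap


def join_paragraphs(blocks, rewrap_width=None):
    """Join OCR paragraphs with exactly one blank line between them.

    blocks: list of blocks, each a list of paragraph strings
            (an assumed, simplified shape, not a real API response).
    rewrap_width: if given, collapse each paragraph's internal line
                  breaks and rewrap it to this width; if None, keep
                  the original line breaks inside each paragraph.
    """
    paragraphs = []
    for block in blocks:
        for para in block:
            if rewrap_width is None:
                # Keep internal line breaks, trim surrounding whitespace.
                text = para.strip()
            else:
                # Collapse all whitespace runs, then rewrap.
                text = textwrap.fill(" ".join(para.split()), width=rewrap_width)
            if text:
                # Skip paragraphs that are empty after cleanup.
                paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

Calling `join_paragraphs(blocks)` only normalises the spacing between paragraphs, while `join_paragraphs(blocks, rewrap_width=80)` also rewraps each one. Inline formatting is not touched either way, since the input carries none.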

Tesseract used to include italic information, but doesn't now.

It doesn't look like there's anything we can do here. :-(

Darn, I was sure there was italic information in there somewhere (for Google, I knew Tesseract v4 didn't do it), but I can't see any sign of it now. I must have been going mad, or maybe just switched timelines again (hate it when that happens).

Due to the unforeseen complexity of this task, as stated above :(

Samwilson mentioned this in T278839: Wikisource: Investigate how hoCR/ALTO can support formatting on Wikisource. Oct 22 2021, 12:26 AM
