Make Wikisource-OCR handle paragraphs better
Open, Needs Triage · Public

Assigned To

None

Authored By

	kaldari
	Apr 14 2020, 3:34 PM

Description

One way that Indic-OCR still beats Wikisource Google OCR is that Indic-OCR (usually) detects paragraph breaks correctly and inserts two line breaks between paragraphs, while we just put a line break at the end of each line without any regard for paragraphs. Let's improve our paragraph handling so that less manual editing is needed.

Related Objects

Mentioned In: T348829: Add toolbar button for OCR cleanup
T278839: Wikisource: Investigate how hoCR/ALTO can support formatting on Wikisource
T286347: WS OCR: Optionally keep annotation like bold/italic if present
T281494: SPIKE: Enable Clean up in OCR Proofreading (4hours)
T278443: Wikisource OCR: fix issue with lines being formatted incorrectly

Event Timeline

kaldari created this task. Apr 14 2020, 3:34 PM

Restricted Application added a subscriber: Aklapper. Apr 14 2020, 3:34 PM

I'm moving this to 'to be estimated,' so we can discuss next steps with the team & @Samwilson.

Yes, it looks like there is a lot we can do to improve paragraphs. The response from the Google API contains details about every symbol, word, paragraph, and block.

Documentation of the returned format is at
https://cloud.google.com/vision/docs/reference/rest/v1/AnnotateImageResponse

I tested with a simplistic rebuilding of the text from these details, and it worked remarkably well:

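// Rebuild plain text from a decoded Vision API response (first page only).
// The function name is illustrative; it is not part of the tool.
function rebuildText(array $response): string {
    $text = '';
    $page = $response['responses'][0]['fullTextAnnotation']['pages'][0];
    foreach ($page['blocks'] as $block) {
        foreach ($block['paragraphs'] as $para) {
            $words = [];
            foreach ($para['words'] as $word) {
                // A word's text is the concatenation of its symbols.
                $words[] = implode('', array_column($word['symbols'], 'text'));
            }
            // One space between words, a blank line after each paragraph.
            $text .= implode(' ', $words) . "\n\n";
        }
    }
    return $text;
}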

So we should be able to add paragraphs, and reflow the lines. Punctuation is probably the biggest hurdle: we have to account for what are often language-specific rules, and probably won't get it all correct. The annoying thing is that obviously Google already do this conversion themselves for the Drive API, but it doesn't seem that it's available through this API (although I haven't really finished reading the docs).
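One thing that might help with the spacing problem: the API reference describes a `property.detectedBreak.type` field on symbols, with values such as `SPACE`, `SURE_SPACE`, `EOL_SURE_SPACE`, `HYPHEN`, and `LINE_BREAK`. The following is a rough sketch (in Python rather than PHP, for brevity) of reflowing a paragraph from those breaks instead of adding a space after every word. The function names are made up for illustration, and it assumes the break is always reported on the last symbol before the gap, which has not been checked against real responses here.

```python
# Break types after which a single space is emitted once lines are reflowed.
# HYPHEN is deliberately absent: the two halves of the word are simply joined.
SPACE_BREAKS = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE", "LINE_BREAK"}


def paragraph_text(para):
    """Rebuild one paragraph, spacing symbols according to their detected break."""
    parts = []
    for word in para["words"]:
        for symbol in word["symbols"]:
            parts.append(symbol["text"])
            # The break, when present, sits under property.detectedBreak.type.
            brk = symbol.get("property", {}).get("detectedBreak", {}).get("type")
            if brk in SPACE_BREAKS:
                parts.append(" ")
    return "".join(parts).strip()


def page_text(response):
    """Join the paragraphs of the first page with blank lines between them."""
    page = response["responses"][0]["fullTextAnnotation"]["pages"][0]
    return "\n\n".join(
        paragraph_text(para)
        for block in page["blocks"]
        for para in block["paragraphs"]
    )
```

With this approach a comma that arrives as its own word stays attached to the word before it, because no space is emitted unless a break says so. Language-specific spacing rules would still need handling on top.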

Rebuilding the text from its components in this way means we can do other clever things though, such as putting hyphenated words back together, or maybe even inserting wiki templates where appropriate. These would have to be done on a per-Wikisource basis, but maybe that's not too hard. The worst case is that we leave things to be cleaned up by hand, and that's always going to be the case anyway.
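As a rough illustration of the per-Wikisource idea, here is a text-level sketch (in Python rather than PHP, for brevity) that rejoins words hyphenated across a line break, switched on per language. The `DEHYPHENATE` table and the function name are hypothetical, and the heuristic is deliberately crude: it only joins when the continuation starts with a lowercase letter, so it will still mangle genuine compounds like "well-\nknown".

```python
import re

# Hypothetical per-wiki switch, keyed by language code. Not a real config.
DEHYPHENATE = {"en": True, "de": True}

# A hyphen at the end of a line, with a word character on either side.
_LINE_END_HYPHEN = re.compile(r"(?<=\w)-\n(?=\w)")


def rejoin_hyphenated(text, lang):
    """Remove line-end hyphens when the next line continues in lowercase."""
    if not DEHYPHENATE.get(lang, False):
        return text

    def join(match):
        # The lookahead guarantees there is a character after the match.
        next_char = match.string[match.end()]
        # Lowercase continuation: treat as one word and drop "-\n".
        # Otherwise (e.g. "Anglo-\nSaxon") leave the text untouched.
        return "" if next_char.islower() else match.group(0)

    return _LINE_END_HYPHEN.sub(join, text)
```

Anything the heuristic gets wrong is left for proofreaders, which matches the worst case described above.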

All those loops make my spidey-sense go off.

I know this was just to prove it was possible but I shuddered.