Make Wikisource-OCR handle paragraphs better
Open, Needs Triage, Public

Description

One way that Indic-OCR still beats Wikisource Google OCR is that it (usually) detects paragraph breaks correctly and inserts two line breaks between paragraphs, whereas we just put a line break at the end of each line with no regard for paragraphs. Let's improve our paragraph handling so that less manual editing is needed.

Event Timeline

ifried subscribed.

I'm moving this to 'to be estimated,' so we can discuss next steps with the team & @Samwilson.

Yes, it looks like there is a lot we can do to improve paragraphs. The response from the Google API contains details about every symbol, word, paragraph, and block.

Documentation of the returned format is at
https://cloud.google.com/vision/docs/reference/rest/v1/AnnotateImageResponse

I tested with a simplistic rebuilding of the text from these details, and it worked remarkably well:

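// Rebuild plain text from the first page of a decoded API response,
// emitting one output paragraph per detected paragraph.
function rebuildText(array $response): string {
    $text = '';
    foreach ($response['responses'][0]['fullTextAnnotation']['pages'][0]['blocks'] as $block) {
        foreach ($block['paragraphs'] as $para) {
            foreach ($para['words'] as $word) {
                foreach ($word['symbols'] as $symbol) {
                    $text .= $symbol['text'];
                }
                // Naive word separator: always a single space.
                $text .= ' ';
            }
            // Blank line between paragraphs.
            $text .= "\n\n";
        }
    }
    return $text;
}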

So we should be able to add paragraphs, and reflow the lines. Punctuation and spacing are probably the biggest hurdle: the rules are often language-specific, and we probably won't get them all correct. The annoying thing is that Google obviously already do this conversion themselves for the Drive API, but it doesn't seem to be available through the Vision API (although I haven't really finished reading the docs).
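On the spacing question: the Vision documentation describes a `detectedBreak` property on symbols, with types such as SPACE, SURE_SPACE, EOL_SURE_SPACE, HYPHEN and LINE_BREAK, which might save us from guessing where spaces go. Here is a rough sketch, assuming the JSON response has been decoded into a PHP array and that these break annotations are actually present and reliable for our scans (untested here). The function name `rebuildTextWithBreaks` is made up for illustration.

```php
<?php
// Sketch only: rebuild text from a decoded Vision API response, using each
// symbol's detectedBreak (if any) to decide where spaces belong.
function rebuildTextWithBreaks(array $response): string
{
    // Break types that are turned into a single space when reflowing.
    $spaceBreaks = ['SPACE', 'SURE_SPACE', 'EOL_SURE_SPACE', 'LINE_BREAK'];
    $paragraphs = [];
    $pages = $response['responses'][0]['fullTextAnnotation']['pages'] ?? [];
    foreach ($pages as $page) {
        foreach ($page['blocks'] ?? [] as $block) {
            foreach ($block['paragraphs'] ?? [] as $para) {
                $text = '';
                foreach ($para['words'] ?? [] as $word) {
                    foreach ($word['symbols'] ?? [] as $symbol) {
                        $text .= $symbol['text'] ?? '';
                        $break = $symbol['property']['detectedBreak']['type'] ?? null;
                        if (in_array($break, $spaceBreaks, true)) {
                            // A line end inside a paragraph becomes a space.
                            $text .= ' ';
                        }
                        // HYPHEN (end-of-line hyphen not present in the text)
                        // and "no break" both append nothing, so a word split
                        // across lines is joined back together.
                    }
                }
                $text = trim($text);
                if ($text !== '') {
                    $paragraphs[] = $text;
                }
            }
        }
    }
    // Two line breaks between paragraphs, none trailing.
    return implode("\n\n", $paragraphs);
}
```

Treating LINE_BREAK as a space is what reflows the lines within a paragraph. That would be wrong for scripts written without spaces between words, so it would probably need a per-language switch.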

Rebuilding the text from its components in this way means we can do other clever things though, such as putting hyphenated words back together, or maybe even inserting wiki templates where appropriate. These would have to be done on a per-Wikisource basis, but maybe that's not too hard. The worst case is that we leave things to be cleaned up by hand, and that's always going to be the case anyway.
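As a rough illustration of the hyphenation idea, here is a minimal sketch that works on already-extracted text whose lines are still separated by newlines. It assumes that a hyphen at the end of a line, followed by a lowercase letter on the next line, marks a word split across lines. That heuristic and the function name `rejoinHyphenatedWords` are my own assumptions, not anything the API provides.

```php
<?php
// Sketch only: join words that were split across lines with a hyphen.
// Heuristic (an assumption): letter + "-" + newline + lowercase letter.
function rejoinHyphenatedWords(string $text): string
{
    // \p{L} = any letter, \p{Ll} = lowercase letter, \R = any newline sequence.
    $joined = preg_replace('/(\p{L})-\R(\p{Ll})/u', '$1$2', $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back
    // to the unmodified text in that case.
    return $joined ?? $text;
}
```

This will also strip the hyphen from genuine compounds that happen to break at a line end ("well-" / "known"), which is one reason the rules would have to be per-Wikisource, and why some cleanup by hand will always remain.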

All those loops make my spidey-sense go off.

I know this was just to prove it was possible but I shuddered.

I was feeling a little loopy!

I look forward to developing a general-purpose system of text reconstruction, a la https://xkcd.com/974/ ;-)

kaldari renamed this task from "Can Wikisource-OCR handle paragraphs better?" to "Make Wikisource-OCR handle paragraphs better". Apr 21 2020, 7:53 PM
kaldari updated the task description.
Samwilson subscribed.