
Wikisource OCR: fix issue with lines being formatted incorrectly
Open, Needs Triage, Public

Description

Acceptance Criteria:

  • Determine how to fix the issue in Wikimedia OCR whereby the transcribed text keeps the hard line breaks of the scanned page, so each line ends at the last word of the printed line rather than wrapping naturally in the new text (a rough sketch of such unwrapping appears after this list)
  • Note: This should not apply to poems
  • Note: Indic OCR seems to be handling the issue better. Maybe we can look into what that OCR tool is doing.
  • Wouldn't it be possible to disable line breaks after the text is validated?
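For illustration, a minimal sketch of what paragraph-aware unwrapping could look like (a hypothetical helper, not the actual Wikimedia OCR code; it assumes blank lines separate paragraphs and that poems are detected and skipped before this step):

```
// Hypothetical sketch: unwrap OCR line breaks within paragraphs.
// Blank lines are treated as paragraph boundaries and preserved.
function unwrapLines(text: string): string {
  return text
    .split(/\n{2,}/) // keep paragraph breaks intact
    .map((para) =>
      para
        .replace(/-\n(?=\p{Ll})/gu, '') // rejoin words hyphenated at line end
        .replace(/\n/g, ' ') // fold remaining line breaks into spaces
        .replace(/ {2,}/g, ' ') // collapse doubled spaces
        .trim()
    )
    .join('\n\n');
}
```

The hyphen rule only rejoins a word when the next character is lowercase, to reduce the risk of mangling genuine hyphenated compounds that happen to break across lines.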

URL of example: https://en.wikisource.org/w/index.php?title=Page:Myths_of_Mexico_and_Peru.djvu/22&action=edit

Visual Example:

Issues when using Wikimedia OCR:

[Screenshot: image.png]

Working properly with Indic OCR:
[Screenshot: image (1).png]

Event Timeline

Samwilson subscribed.

This is similar to T250185 (maybe a dupe, I'm not sure).

Please note that while on some wikisources it is the convention to remove the linebreaks inside a paragraph, on others — for example the Scandinavian ones — it is the convention that they should be kept (except of course in the rare cases that cause formatting issues). The point is that when you are proofreading/validating the text, you have to compare the wikitext and the image word for word and line for line, and without intact linebreaks you will easily lose your place in the text many times over. In my opinion, it is much faster to work on a text with the original linebreaks that make the wikitext and image match line for line, and I would consider it a rather annoying bug if they were automatically removed. So if you do go down this road, please consider whether it should be done using some kind of opt-in mechanism.

As Peter says, this needs some form of configurability, probably at the per-user level. English Wikisource generally unwraps lines, but even there, there are users who rely on hard linebreaks when proofreading. OCR is also imperfect at detecting page features, so for some scans automatic unwrapping will end up going to the opposite extreme (all text in one big lump with no line breaks).

And as a general issue you'll want to make sure you have test cases from the early 18th century, when typography was different and inconsistent, and on which Tesseract et al. have not primarily been trained. The difference in quality is significant, and for features like automatic unwrapping this kind of case is important to test.

Related: T230415 and T279019

Some of the differences between tools in this regard are on the client side. The Phetools OCR Gadget, for example, post-processes the data returned from the OCR server API using a bunch of regex replacements. In my own private little OCR toy I do that processing server-side in a push parser while converting hOCR to plain text (providing config UI client-side is blocked on the lack of any sane JS UI framework for gadget/user script developers, *cough* *cough*).
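As a rough sketch of that hOCR-to-plain-text step (assuming the standard ocr_par/ocr_line class names from the hOCR spec; this is illustrative, not the actual Phetools or toy code):

```
// Sketch: convert hOCR markup to plain text in the browser.
// `unwrap` chooses between joining lines with spaces or keeping them.
function hocrToPlainText(hocr: string, unwrap: boolean): string {
  const doc = new DOMParser().parseFromString(hocr, 'text/html');
  const paragraphs: string[] = [];
  for (const par of doc.querySelectorAll('.ocr_par')) {
    const lines = [...par.querySelectorAll('.ocr_line')].map(
      (line) => line.textContent?.replace(/\s+/g, ' ').trim() ?? ''
    );
    paragraphs.push(lines.join(unwrap ? ' ' : '\n'));
  }
  return paragraphs.join('\n\n');
}
```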

In general, I think the smartest architecture will be to do those kinds of transformations server-side, but let the client turn them on and off for a given request based on the current need or user preference. Even if the feature implemented in the Wikisource extension does not expose these options to end users, this would let the local project community provide alternate frontends that take advantage of the same backend; for example in the form of a Gadget that always requests unwrapped text, or always requests Ancient Greek OCR even though it's operating on English Wikisource, etc. Or maybe the built-in functionality has to limit language options for practical reasons, so we make a local Gadget that exposes all the possible language and script combinations and priorities to handle things like this. That would cover all the hyper-specialized edge cases that can't be catered to in the main tool, but will be a significant improvement for /that/ book, or /that/ contributor.
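To make that concrete, a hypothetical request shape for such an architecture (the endpoint and parameter names below are invented for illustration, not the real Wikimedia OCR API):

```
// Hypothetical client: the server applies the transformations, the
// client toggles them per request (e.g. from a Gadget or a preference).
async function fetchOcr(
  imageUrl: string,
  opts: { lang: string; unwrap: boolean }
): Promise<string> {
  const params = new URLSearchParams({
    image: imageUrl,
    lang: opts.lang, // e.g. 'grc' even when used on English Wikisource
    unwrap: String(opts.unwrap), // server-side transform, per-request toggle
  });
  const res = await fetch(`https://ocr.example.org/api?${params}`);
  if (!res.ok) throw new Error(`OCR request failed: ${res.status}`);
  const data = (await res.json()) as { text: string };
  return data.text;
}
```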