Wikisource OCR: fix issue with lines being formatted incorrectly
Open, Needs Triage, Public

Assigned To

None

Authored By

	ifried
	Mar 25 2021, 3:27 PM

Description

Acceptance Criteria:

Determine how to fix the issue in Wikimedia OCR in which each output line breaks where the line ended in the original scan, rather than where a line break would be appropriate for the new text.
Note: This should not apply to poems
Note: Indic OCR seems to be handling the issue better. Maybe we can look into what that OCR tool is doing.
Wouldn't it be possible to disable line breaks after the text is validated?

URL of example: https://en.wikisource.org/w/index.php?title=Page:Myths_of_Mexico_and_Peru.djvu/22&action=edit

Visual Example (screenshots not preserved):

Issues when using Wikimedia OCR: [screenshot]

Working properly with Indic OCR: [screenshot]

Related Objects

Mentioned Here: T230415: Stop ignoring paragraph and region separators in DjVu file OCR text layer
T279019: Refresh DjVu image metadata
T250185: Make Wikisource-OCR handle paragraphs better

Event Timeline

ifried created this task. Mar 25 2021, 3:27 PM

Restricted Application added a subscriber: Aklapper. Mar 25 2021, 3:27 PM

ifried updated the task description. Mar 25 2021, 3:29 PM

ifried updated the task description. Mar 25 2021, 3:31 PM

This is similar to T250185 (maybe a dupe, I'm not sure).

Please note that while on some Wikisources it is the convention to remove the linebreaks inside a paragraph, on others, for example the Scandinavian ones, the convention is to keep them (except of course in the rare cases where they cause formatting issues). The point is that when you are proofreading/validating the text, you have to compare the wikitext and the image word for word and line for line, and without intact linebreaks you will easily lose your place in the text many times over. In my opinion, it is much faster to work on a text with the original linebreaks that make the wikitext and image match line for line, and I would consider it a rather annoying bug if they were automatically removed. So if you do go down this road, please consider whether it should be done using some kind of opt-in mechanism.

jayantanth subscribed. Mar 28 2021, 7:15 AM

ifried updated the task description. Mar 30 2021, 11:06 PM

As Peter says, this needs some form of configurability, probably at the per-user level. English Wikisource generally unwraps lines, but even there some users rely on hard linebreaks when proofreading. OCR is also imperfect at detecting page features, so for some scans automatic unwrapping will end up going to the opposite extreme (all text in one big lump with no line breaks).

And as a general issue, you'll want to make sure you have test cases from the early 18th century, when typography was different and inconsistent, and which Tesseract et al. have not primarily been trained on. The difference in quality is significant, and for features like automatic unwrapping this kind of case is important to test.

Related: T230415 and T279019

Some of the differences between tools in this regard are on the client side. The Phetools OCR Gadget, for example, post-processes the data returned from the OCR server API using a bunch of regex replacements. In my own private little OCR toy I do that processing server-side in a push parser while converting hOCR to plain text (providing config UI client-side is blocked on the lack of any sane JS UI framework for gadget/user script developers, *cough* *cough*).
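For illustration, a minimal sketch of the kind of regex-based unwrapping described above. This is not the Phetools gadget's actual code. The function name `unwrap_paragraphs` is hypothetical, and it assumes the OCR output is plain text with paragraphs separated by blank lines.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines inside each paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for para in paragraphs:
        # Rejoin words split by a hyphen at a line end: "for-\nmatted" -> "formatted".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Turn each remaining line break into a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        unwrapped.append(para)
    return "\n\n".join(unwrapped)
```

Note that the hyphen rule would also merge genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), and that poems, which the task description excludes, would lose their line structure entirely. Both are reasons to keep this kind of processing optional.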

In general, I think the smartest architecture will be to do those kinds of transformations server-side, but let the client turn them on and off for a given request based on the current need or user preference. Even if the feature implemented in the Wikisource extension does not expose these options to end users, this would let the local project community provide alternate frontends that take advantage of the same backend; for example in the form of a Gadget that always requests unwrapped text, or always requests Ancient Greek OCR even though it's operating on English Wikisource, etc. Or maybe the built-in functionality has to limit language options for practical reasons, so we make a local Gadget that exposes all the possible language and script combinations and priorities to handle things like this. All the hyper-specialized edge cases that can't be catered to in the main tool, but will be a significant improvement for /that/ book, or /that/ contributor.
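A rough sketch of that opt-in arrangement, assuming the server keeps a registry of named transformations and each request lists the ones it wants. The registry, the transform names, and `postprocess` are all hypothetical and do not correspond to any existing Wikimedia OCR API.

```python
import re
from typing import Callable, Dict, Iterable

# Hypothetical registry of server-side transformations, in application order.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    # Rejoin words hyphenated across a line break.
    "dehyphenate": lambda t: re.sub(r"(\w)-\n(\w)", r"\1\2", t),
    # Replace single line breaks with spaces; blank-line paragraph breaks survive.
    "unwrap": lambda t: re.sub(r"(?<!\n)\n(?!\n)", " ", t),
}


def postprocess(ocr_text: str, enabled: Iterable[str] = ()) -> str:
    """Apply only the transformations the client asked for in this request."""
    requested = set(enabled)
    unknown = requested - set(TRANSFORMS)
    if unknown:
        raise ValueError(f"unknown transforms: {sorted(unknown)}")
    # Registry order is used so dehyphenation always runs before unwrapping.
    for name, transform in TRANSFORMS.items():
        if name in requested:
            ocr_text = transform(ocr_text)
    return ocr_text
```

With nothing requested the text comes back untouched, so proofreaders who rely on the original linebreaks see no change, while a gadget that always wants unwrapped text could pass `["dehyphenate", "unwrap"]` on every request.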

MusikAnimal removed projects: All-and-every-Wikisource, Community-Tech. Oct 21 2021, 5:41 PM

Restricted Application added a project: Community-Tech. Oct 21 2021, 5:41 PM

MusikAnimal removed a project: Community-Tech. Oct 21 2021, 5:41 PM
