
Wikisource: Investigate improving column support for OCR [8H]
Closed, Resolved · Public

Description

As a product manager, I want engineers to explore whether it is possible to support columns in OCR-ed text, so we can determine which solutions (if any) we can implement. Proper column support would vastly improve the user experience and the quality of output for Wikisource editors.

Examples with Indic OCR:

Wikimedia OCR:

Acceptance Criteria:

  • Investigate the technical reason/root cause of why OCR tools are not properly handling multiple columns (i.e., the text is returned as a single stream, with lines from different columns interleaved)
  • Investigate if there is a technical solution to properly format OCR-ed texts with multiple columns
  • If there is a solution, share findings with team as well as estimated scope of work
  • This investigation can focus on Wikimedia OCR, but it may be helpful to briefly look at Basic OCR & IndicOCR as well, to get a more comprehensive sense of how different OCR tools currently handle the issue

Visual Examples:

Screen Shot 2021-03-18 at 11.09.13 AM.png

Screen Shot 2021-03-18 at 11.12.31 AM.png

Event Timeline

ARamirez_WMF renamed this task from Wikisource: Investigate improving column support for OCR to Wikisource: Investigate improving column support for OCR [8H] on Mar 18 2021, 11:35 PM.
ARamirez_WMF moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Solution 1: use Tesseract

Testing with Page:Encyclopædia Britannica, Ninth Edition, v. 24.djvu/174 from above gives interleaved text with Google OCR:

very coarse piece of work. The silver altar frontals from the old
Till the 14th century Venice continued to adhere to the old cathedral, now in St Mark's, with sacred subjects and figures of saints
Byzantine style of sculpture, which, though often delicate in exe- repoussé in each panel, made in the 14th century, are no less rude

but very good results with Tesseract (using fully automatic page segmentation, which is the default):

Till the 14th century Venice continued to adhere to the oll
Byzantine style of sculpture,” which, though often delicate in exe-
ciition and decorative in effect, slowly lost spirit and vigour, and

[…]

ery cease pieco of work, The silver altar frontas from the eld
cathedral, now in St Mark’s, with sacred subjects and figures of saints
repoussé in each panel, made in the Lith century, are no less rude

Tesseract has multiple page segmentation modes, and it would not be hard to expose these to users so they can experiment and choose the one that works best for their image (a sketch of passing the mode through follows the list below). This also helps with things like tabular text.

$ tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
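
A minimal sketch of exposing this choice, using the pytesseract wrapper (the function and file names here are hypothetical; the same could be done by shelling out to the tesseract binary directly):

from PIL import Image
import pytesseract

def ocr_with_psm(image_path: str, psm: int = 3, lang: str = "eng") -> str:
    """Run Tesseract with a user-chosen page segmentation mode."""
    # --psm is passed through unchanged to the tesseract binary.
    return pytesseract.image_to_string(
        Image.open(image_path), lang=lang, config=f"--psm {psm}"
    )

# e.g. trying PSM 1 (automatic segmentation with OSD) on the page above:
# text = ocr_with_psm("britannica_v24_p174.png", psm=1)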

Solution 2: allow the user to select only part of the image

A more flexible solution, which would work with any OCR engine, is to add a system for users to select a rectangular part of the image and submit only that part for OCR. This would also work for images with text boxes in odd locations (e.g.). In T262146 there's going to be some work done on the image-displaying code, which may include something like Leaflet; that would make it relatively easy to add a UI for clicking and dragging to select areas of the image.

This approach would be harder to integrate into a bulk-OCR system, but in the page-by-page workflow it could remember the previously selected rectangle. That would be useful for excluding repeating portions of pages that are not required for OCR.
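
A sketch of the server side of this, assuming the front-end (e.g. Leaflet) sends the selection as pixel coordinates (the names here are hypothetical):

from PIL import Image
import pytesseract

def ocr_region(image_path: str, box: tuple[int, int, int, int], lang: str = "eng") -> str:
    """OCR only the selected rectangle; box is (left, top, right, bottom) in pixels."""
    return pytesseract.image_to_string(Image.open(image_path).crop(box), lang=lang)

# In the page-by-page workflow, the previously selected rectangle could
# simply be re-used for the next page:
# text = ocr_region("page_175.png", last_box)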


In both of these solutions, it's also possible to get more information (from both Google and Tesseract) about the coordinates of words in the image, and so we could reconstruct the text of any rectangle of the image ourselves. I suggest we only look into doing this as a last resort: layout analysis is precisely what OCR engines are good at, and anything we build ourselves is likely to be rudimentary by comparison.
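
For reference, a sketch of that last-resort approach using Tesseract's word-level bounding boxes via pytesseract (Google's Cloud Vision API also returns word geometry); the rectangle filter here is deliberately naive:

from PIL import Image
import pytesseract
from pytesseract import Output

def text_in_rect(image_path: str, rect: tuple[int, int, int, int]) -> str:
    """Rebuild text from whole-page OCR, keeping only words whose bounding
    boxes fall inside rect = (left, top, right, bottom)."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    left, top, right, bottom = rect
    words = []
    for i, word in enumerate(data["text"]):
        x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
        if word.strip() and left <= x and x + w <= right and top <= y and y + h <= bottom:
            words.append(word)
    return " ".join(words)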

I think that for most works, it would be better to make a flexible template to split the pages and allow the user to adjust it manually. Generally, there are two major types of column usage in books:

  1. Header - Two Columns - Footer
  2. Header - Two Columns - Image - Two Columns - Image - Two Columns - Footer

Most books will have one split pattern that works for all the column pages, so setting one as a template and then propagating it throughout a range of pages is probably the way to go.
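
A sketch of what such a template could look like (names and coordinates are hypothetical), stored as fractional page coordinates so the same pattern can be propagated across pages whose pixel sizes vary slightly:

from PIL import Image

# Split pattern 1 from the list above, as fractional (left, top, right, bottom) boxes.
HEADER_TWO_COLUMNS_FOOTER = {
    "header":       (0.0, 0.00, 1.0, 0.08),
    "left column":  (0.0, 0.08, 0.5, 0.94),
    "right column": (0.5, 0.08, 1.0, 0.94),
    "footer":       (0.0, 0.94, 1.0, 1.00),
}

def split_page(image_path, template):
    """Yield each region of the page as a cropped image, in reading order."""
    image = Image.open(image_path)
    w, h = image.size
    for name, (l, t, r, b) in template.items():
        yield name, image.crop((int(l * w), int(t * h), int(r * w), int(b * h)))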

However, bound periodicals will need at least one split pattern for body pages and another for the front page.

Newspapers are a special case: there will need to be a main shape enclosing each article, and then the columns within it can be split.

Ideally, as much of this as possible should be done by detecting white space, with the option for the user to drag the split lines to adjust them.
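
One plausible way to detect that white space (an assumption about approach, not an existing feature of any of these tools) is a vertical projection profile: count the dark pixels in each pixel column and place the default split line at the midpoint of the widest ink-free run, which the user can then drag:

import numpy as np
from PIL import Image

def find_gutter(image_path: str, ink_threshold: int = 128) -> int:
    """Return the x position of the widest run of ink-free pixel columns."""
    gray = np.array(Image.open(image_path).convert("L"))
    blank = (gray < ink_threshold).sum(axis=0) == 0  # True where a column has no ink
    best_start, best_len, start = 0, 0, None
    for x, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = x                        # a blank run begins
        elif not is_blank and start is not None:
            if x - start > best_len:         # a blank run ends; keep the widest
                best_start, best_len = start, x - start
            start = None
    if start is not None and len(blank) - start > best_len:
        best_start, best_len = start, len(blank) - start
    return best_start + best_len // 2        # midpoint = suggested split line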

Presenting just one column at a time will make proofreading much easier, and this is the approach that Distributed Proofreaders uses. Rather than making an actual split, it's probably better to make a virtual one, so that the exact split lines can easily be adjusted.

There’s also Layout Parser, which employs deep learning to analyze, parse, and OCR very complicated layouts:

https://github.com/Layout-Parser/layout-parser
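
A sketch following the example in Layout Parser's own documentation (it requires the Detectron2 backend to be installed; the model name is from their model zoo, and the file name here is hypothetical):

import numpy as np
from PIL import Image
import layoutparser as lp

image = np.array(Image.open("page_174.png").convert("RGB"))
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)                    # detected layout regions
text_blocks = [b for b in layout if b.type == "Text"]
# Each block's .coordinates could then feed the region-OCR approach from Solution 2.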