
Indic OCR merging two columns together
Closed, Invalid | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Upload a file to Commons that has Marathi content in two columns, for example:

https://commons.wikimedia.org/wiki/File:TagorechiGoshti-Marathi.djvu

  • Try to import the pages into Wikisource.

What happens?:
Google OCR and Indic OCR both merge the content of the two columns.

What should have happened instead?:
It should read the first column and then the second column.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

(screenshot 1) The default setting handles the columns correctly but uses the wrong font:

https://commons.wikimedia.org/wiki/File:Default_ocr.png

(screenshot 2) Google OCR reads the text correctly, but the columns are merged:

https://commons.wikimedia.org/wiki/File:Google_OCR_columns_wrongly_merged.png

Event Timeline

https://commons.wikimedia.org/wiki/File:Default_ocr.png

I think it could help if "short-term, debug only" images were not uploaded with a generic name to Commons (what's the educational value?) but uploaded here.

If the OCR software can read two-column text on English Wikisource correctly, why does it merge the columns into one when it reads Marathi/Hindi pages?

Hard to say: there isn't much space between the two columns, so the software may interpret them as one; there might be bugs in the software; etc.

It does look like Google isn't great at this. There's not much we can do about that, but have you tried using Tesseract instead (with Wikimedia OCR)? For example, does this look okay?:

https://ocr.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fa%2Fa2%2FTagorechiGoshti-Marathi.djvu%2Fpage5-543px-TagorechiGoshti-Marathi.djvu.jpg&engine=tesseract&langs%5B%5D=mr&psm=3
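For anyone who wants to try other pages or languages, the link above can be rebuilt from its query parameters. This is only a sketch based on the parameters visible in that URL (`image`, `engine`, `langs[]`, `psm`); the helper name `build_ocr_url` and its defaults are made up for illustration, and whether the tool accepts other parameters or values is not something this thread confirms.

```python
from urllib.parse import urlencode

# Base address taken from the link in the comment above.
OCR_TOOL_BASE = "https://ocr.wmcloud.org/"


def build_ocr_url(image_url, engine="tesseract", langs=("mr",), psm=3):
    """Build a Wikimedia OCR tool link (hypothetical helper, not part of the tool).

    The parameter names mirror the query string of the link posted above.
    In Tesseract, psm 3 is the fully automatic page segmentation mode.
    """
    params = [("image", image_url), ("engine", engine)]
    # One "langs[]" entry per language code, as in the original link.
    params += [("langs[]", lang) for lang in langs]
    params.append(("psm", psm))
    # urlencode percent-encodes both keys and values ("langs[]" -> "langs%5B%5D").
    return OCR_TOOL_BASE + "?" + urlencode(params)


page5 = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/"
    "TagorechiGoshti-Marathi.djvu/page5-543px-TagorechiGoshti-Marathi.djvu.jpg"
)
print(build_ocr_url(page5))
```

Changing `langs` or `psm` in the call is a quick way to compare how different settings segment the same page.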

The IndicOCR gadget is using the https://indic-ocr.toolforge.org/ backend, which (if I'm remembering things correctly) is using the Google Drive API and so is a different OCR engine to the Google OCR provided by Wikimedia OCR.

Yes, this looks OK. But on the Marathi Wikisource site, I can see Indic OCR and Google OCR, both of which merge the columns incorrectly. Please add Tesseract to the site so that two-column text can be processed.

shantanuo claimed this task.

Until someone adds a button, I am using the advanced options as shown in this image: https://commons.wikimedia.org/wiki/File:Tesseract_advanced.png

Aklapper changed the task status from Resolved to Invalid. Nov 14 2021, 3:21 PM
Aklapper removed shantanuo as the assignee of this task.

Changing task status as no code was changed.

I guess a good fix here would be for the Indic OCR gadget to add itself as an option in the 'Transcribe text' dropdown menu, rather than as a separate button. That would reduce confusion perhaps.

Also, the page region OCR that's done but waiting for merge will hopefully help with this, as you can OCR each column separately if the OCR engine fails to segment the page correctly: https://phabricator.wikimedia.org/T294903
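To illustrate the "OCR each column separately" idea, here is a rough sketch of how a two-column page could be split into two regions that are then recognised in reading order. It is not the implementation from T294903; the function name `column_boxes` and the `gutter_x` parameter are hypothetical, and it assumes a simple vertical gutter (defaulting to the middle of the page). The boxes use the `(left, upper, right, lower)` pixel convention that image libraries such as Pillow expect for cropping.

```python
def column_boxes(page_width, page_height, gutter_x=None):
    """Return crop boxes for the left and right columns, in reading order.

    Hypothetical helper: assumes exactly two columns separated by a vertical
    gutter at x = gutter_x (defaults to the horizontal middle of the page).
    """
    if gutter_x is None:
        gutter_x = page_width // 2
    if not 0 < gutter_x < page_width:
        raise ValueError("gutter_x must lie strictly inside the page")
    left_column = (0, 0, gutter_x, page_height)
    right_column = (gutter_x, 0, page_width, page_height)
    # Left column first, then right, matching the expected reading order.
    return [left_column, right_column]


# Example with an assumed 543 x 800 pixel page image.
for box in column_boxes(543, 800):
    print(box)
```

Each box would be sent to the OCR engine on its own and the results concatenated, so the engine never has to decide where the column boundary is.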