
Pywikibot Proofreadpage OCR function uses low-resolution instead of high-resolution images
Closed, Resolved · Public · BUG REPORT

Description

As I said in the title: the proofreadpage module has a function, url_image(self), that generates the
URL of the image to send to the OCR web service, but it scrapes the rendered page for that URL and ends up with a lower-than-optimal resolution, resulting in lower-quality OCR.

Steps to replicate the issue

What happens?:
(in this particular case) pywikibot uses images with 4x fewer pixels, so the quality of the OCR is much worse.

What should have happened instead?:
pywikibot should have a better way of obtaining the page image's URL. Honestly, using BeautifulSoup to scrape it out of the HTML is quite a bad idea.
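
For illustration only (this is not the actual pywikibot code; the function and lookup below are made up), here is roughly what an HTML-scraping approach looks like, and why it is capped at whatever thumbnail width the wiki happened to render:

```
import requests
from bs4 import BeautifulSoup


def scraped_image_url(page_url: str) -> str:
    """Illustrative sketch: pull the scan thumbnail out of the rendered HTML.

    Whatever width MediaWiki chose for the embedded <img> (often ~1000px)
    is the resolution the OCR service ends up receiving.
    """
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    img = soup.find('img')  # first image in the page HTML; fragile by design
    src = img['src']
    # src points at a fixed-width thumb such as .../page141-987px-....jpg
    return 'https:' + src if src.startswith('//') else src
```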

Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.

The Wikisource extension uses the same image that's already in the Page namespace page (i.e. from ProofreadPage). The width of this image can be customized per Index page, but is usually somewhere around 1000 pixels.

That's curious. I went to check and noticed that when you zoom out in the OpenSeadragon viewer, the OCR button uses the lower-resolution image. I don't think this is expected.

Compare:
https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es

https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-3000px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es

Fortunately, if you request an arbitrarily large image size (I guess something like 3000px is enough for most cases), the thumbnail server gives you the highest resolution available, and OCR quality increases.
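
Not part of any fix, just a sketch of the workaround described above: rewrite the width segment of a thumbnail URL like the ones compared here (the 3000px figure is only the guess from this comment) and let the thumbnail server serve what it can:

```
import re


def bump_thumb_width(thumb_url: str, width: int = 3000) -> str:
    """Rewrite the '<N>px-' segment of a Commons thumbnail URL,
    e.g. '.../page141-987px-Foo.djvu.jpg' -> '.../page141-3000px-Foo.djvu.jpg'."""
    return re.sub(r'\d+px-', f'{width}px-', thumb_url, count=1)


url = ('https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/'
       'Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/'
       'page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg')
print(bump_thumb_width(url))  # asks for the 3000px rendering instead
```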

The ProofreadPage extension comes with the imageforpage API, which should be used for this :)
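
For reference, a minimal sketch of what calling that API could look like with plain requests. This is not the merged pywikibot implementation; the prppifpprop values reflect my reading of the extension's API help and may differ between ProofreadPage versions:

```
import requests

API = 'https://es.wikisource.org/w/api.php'  # any wiki running ProofreadPage


def image_for_page(page_title: str) -> dict:
    """Query ProofreadPage's imageforpage prop for the scan behind a Page: page."""
    params = {
        'action': 'query',
        'format': 'json',
        'formatversion': 2,
        'prop': 'imageforpage',
        # Parameter/prop value names may vary between extension versions.
        'prppifpprop': 'filename|size|fullsize|responsiveimages',
        'titles': page_title,
    }
    data = requests.get(API, params=params).json()
    # Inspect the page object for the image name and its full-size dimensions.
    return data['query']['pages'][0]


# Example Page: title on es.wikisource, matching the scan compared above.
print(image_for_page(
    'Página:Origen de las especies por medio de la selección natural.djvu/141'))
```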

Change 979428 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[pywikibot/core@master] Rewrite url_image() function in ProofreadPage module

https://gerrit.wikimedia.org/r/979428

Change 969516 had a related patch set uploaded (by Mpaa; author: Mpaa):

[pywikibot/core@master] proofreadpage.py: fetch URL of page scan via API

https://gerrit.wikimedia.org/r/969516

Change 979428 abandoned by Sohom Datta:

[pywikibot/core@master] [FIX][IMPR] Rewrite url_image() function in ProofreadPage module

Reason:

Better implementation at https://gerrit.wikimedia.org/r/c/pywikibot/core/+/969516

https://gerrit.wikimedia.org/r/979428

Xqt assigned this task to Mpaa.

Change 969516 merged by jenkins-bot:

[pywikibot/core@master] proofreadpage.py: fetch URL of page scan via API

https://gerrit.wikimedia.org/r/969516