
Pywikibot Proofreadpage OCR function uses low-resolution instead of high-resolution images
Closed, Resolved · Public · BUG REPORT

Description

As I said in the title: the proofreadpage module has a function, url_image(self), that generates the
URL of the image to send to the OCR web service, but it scrapes the rendered page for that URL and ends up with a lower-than-optimal resolution, resulting in lower-quality OCR.

Steps to replicate the issue

What happens?:
(in this particular case) pywikibot uses images with 4x fewer pixels, so the quality of the OCR is much worse.

What should have happened instead?:
pywikibot should have a better way of obtaining the page image's URL. Honestly, using BeautifulSoup to scrape it out of the HTML is quite a bad idea.
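
For illustration only (this is not the actual pywikibot code; the function and lookup below are made up), here is roughly what an HTML-scraping approach looks like, and why it is capped at whatever thumbnail width the wiki happened to render:

```
import requests
from bs4 import BeautifulSoup


def scraped_image_url(page_url: str) -> str:
    """Illustrative sketch: pull the scan thumbnail out of the rendered HTML.

    Whatever width MediaWiki chose for the embedded <img> (often ~1000px)
    is the resolution the OCR service ends up receiving.
    """
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    img = soup.find('img')  # first image in the page HTML; fragile by design
    src = img['src']
    # src points at a fixed-width thumb such as .../page141-987px-....jpg
    return 'https:' + src if src.startswith('//') else src
```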

Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.

The Wikisource extension uses the same image that's already in the Page namespace page (i.e. from ProofreadPage). The width of this image can be customized per Index page, but is usually somewhere around 1000 pixels.

That's curious. I went to check and noticed that when you zoom out in the OpenSeadragon viewer, the OCR button uses the lower-resolution image. I don't think this is expected.

Compare:
https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es

https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-3000px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es

Fortunately, if you request an arbitrarily large image size (I guess something like 3000px is enough for most cases), the thumbnail server gives you the highest resolution available, and OCR quality increases.
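
Not part of any fix, just a sketch of the workaround described above: rewrite the width segment of a thumbnail URL like the ones compared here (the 3000px figure is only the guess from this comment) and let the thumbnail server serve what it can:

```
import re


def bump_thumb_width(thumb_url: str, width: int = 3000) -> str:
    """Rewrite the '<N>px-' segment of a Commons thumbnail URL,
    e.g. '.../page141-987px-Foo.djvu.jpg' -> '.../page141-3000px-Foo.djvu.jpg'."""
    return re.sub(r'\d+px-', f'{width}px-', thumb_url, count=1)


url = ('https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/'
       'Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/'
       'page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg')
print(bump_thumb_width(url))  # asks for the 3000px rendering instead
```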

The ProofreadPage extension comes with the imageforpage API, which should be used for this :)
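
For reference, a minimal sketch of what calling that API could look like with plain requests. This is not the merged pywikibot implementation; the prppifpprop values reflect my reading of the extension's API help and may differ between ProofreadPage versions:

```
import requests

API = 'https://es.wikisource.org/w/api.php'  # any wiki running ProofreadPage


def image_for_page(page_title: str) -> dict:
    """Query ProofreadPage's imageforpage prop for the scan behind a Page: page."""
    params = {
        'action': 'query',
        'format': 'json',
        'formatversion': 2,
        'prop': 'imageforpage',
        # Parameter/prop value names may vary between extension versions.
        'prppifpprop': 'filename|size|fullsize|responsiveimages',
        'titles': page_title,
    }
    data = requests.get(API, params=params).json()
    # Inspect the page object for the image name and its full-size dimensions.
    return data['query']['pages'][0]


# Example Page: title on es.wikisource, matching the scan compared above.
print(image_for_page(
    'Página:Origen de las especies por medio de la selección natural.djvu/141'))
```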

Change 979428 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[pywikibot/core@master] Rewrite url_image() function in ProofreadPage module

https://gerrit.wikimedia.org/r/979428

Change 969516 had a related patch set uploaded (by Mpaa; author: Mpaa):

[pywikibot/core@master] proofreadpage.py: fetch URL of page scan via API

https://gerrit.wikimedia.org/r/969516

Change 979428 abandoned by Sohom Datta:

[pywikibot/core@master] [FIX][IMPR] Rewrite url_image() function in ProofreadPage module

Reason:

Better implementation at https://gerrit.wikimedia.org/r/c/pywikibot/core/+/969516

https://gerrit.wikimedia.org/r/979428

Xqt assigned this task to Mpaa.

Change 969516 merged by jenkins-bot:

[pywikibot/core@master] proofreadpage.py: fetch URL of page scan via API

https://gerrit.wikimedia.org/r/969516