As the title says: the proofreadpage module has a method, url_image(self), that generates the
URL of the image to send to the OCR web service. It scrapes the page HTML to find that URL and ends up with a lower-than-optimal resolution, resulting in lower-quality OCR.
Steps to replicate the issue:
- Use the pywikibot proofreadpage module to do OCR on a page. The OCR web service receives this URL: https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es
- Use the in-Wikisource OCR button, and the OCR web service receives this URL: https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu/page141-1974px-Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu.jpg&line_id=&uselang=es
What happens?
In this particular case, pywikibot requests an image with a quarter of the pixels (987px wide instead of 1974px), so the OCR quality is much worse.
What should have happened instead?
pywikibot should have a more robust way of obtaining the page image URL. Scraping the HTML with beautifulsoup to find it is fragile and, as seen here, loses resolution.
Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.
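One possible direction (a sketch, not pywikibot's current implementation): instead of scraping HTML, ask the MediaWiki imageinfo API for a server-generated thumbnail URL at an explicit width. The `action=query&prop=imageinfo` parameters (`iiurlwidth`, and `iiurlparam` with a `pageN-Wpx` handler string for multipage DjVu/PDF files) are standard MediaWiki API features; the helper names and the Commons endpoint below are my own choices for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for files hosted on Commons (illustrative choice).
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_params(filename: str, page: int, width: int) -> dict:
    """Build imageinfo query parameters for one page of a multipage file."""
    return {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "titles": f"File:{filename}",
        "iiprop": "url|size",                   # thumburl plus original dimensions
        "iiurlwidth": width,                    # width of the thumbnail to generate
        "iiurlparam": f"page{page}-{width}px",  # handler string selecting the page
    }


def page_image_url(filename: str, page: int, width: int) -> str:
    """Return the server-generated thumbnail URL for one page (no HTML scraping)."""
    url = COMMONS_API + "?" + urllib.parse.urlencode(imageinfo_params(filename, page, width))
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    # The API keys pages by page-id; take the single entry.
    info = next(iter(data["query"]["pages"].values()))["imageinfo"][0]
    return info["thumburl"]
```

With `iiprop=size` the response also includes the file's original width, so a caller could first ask for the original dimensions and then request the thumbnail at full width, which is presumably how full-resolution URLs like the 1974px one above can be built deterministically.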