
Wikisource OCR: add support for tesseract on wikimedia ocr
Closed, Resolved | Public | 5 Estimated Story Points


As a Wikisource user, I want Tesseract to be added to Wikimedia OCR, so that we can have one robust tool to improve & maintain (rather than 2 available via Preferences).


Acceptance Criteria:

  • Add Tesseract to Wikimedia OCR

Event Timeline

ARamirez_WMF set the point value for this task to 5. Apr 1 2021, 11:48 PM

Oops, sorry @HMonroy I didn't realise you'd claimed this! I had a bit of a crack at it: (not fully polished yet or anything, so feel free to ignore if you've already made progress!).

@Samwilson A couple of thoughts on skimming (and I do mean skimming) the diff…

First, is the imageUrl input sanitized anywhere? It looks like you're feeding it straight on to HttpClient, so it could contain any old garbage. And it doesn't appear as though it is constrained to WMF sites, so there may be some potential for abuse.
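A minimal sketch of the kind of check being asked about, in Python for illustration only (it is not the tool's own code). The function name and the allowlist are hypothetical: the thread does not say which hosts should be permitted, so `upload.wikimedia.org` stands in as an assumed example.

```python
from urllib.parse import urlsplit

# Assumption: a hypothetical allowlist. The real set of permitted
# hosts is not stated anywhere in this thread.
ALLOWED_HOSTS = {"upload.wikimedia.org"}


def is_allowed_image_url(image_url: str) -> bool:
    """Return True only for https URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(image_url)
        # Reject embedded credentials such as https://user@host/...
        if parts.username is not None or parts.password is not None:
            return False
        host = (parts.hostname or "").lower()
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs.
        return False
    return parts.scheme == "https" and host in ALLOWED_HOSTS
```

Comparing the parsed hostname exactly, rather than substring-matching the raw string, is what keeps a URL like `https://upload.wikimedia.org.evil.example/x.jpg` from slipping through.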

Second, why the tmp files? Tesseract can work on stdin/-out (in the usual way), and for any single image the process isn't resumable (at best you could cache the downloaded image to avoid redownload). For my own wrapper I do it that way, and while that's just a personal toy that part of it has been rock solid for me.
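For illustration, the stdin/stdout approach could look roughly like the Python sketch below. It relies on the `tesseract` command-line frontend accepting `stdin` and `stdout` as special names in place of the input file and output base. The function names and the injectable `run` parameter are assumptions made for this example, not part of either wrapper discussed here.

```python
import subprocess
from typing import Callable, List


def build_tesseract_command(lang: str = "eng") -> List[str]:
    # "stdin" and "stdout" replace the input file and output base,
    # so no temporary files are needed.
    return ["tesseract", "stdin", "stdout", "-l", lang]


def ocr_image_bytes(image: bytes, lang: str = "eng",
                    run: Callable = subprocess.run) -> str:
    """Pipe image bytes to tesseract and return the recognised text.

    `run` is injectable (hypothetical design choice) so the function
    can be exercised without tesseract installed.
    """
    result = run(build_tesseract_command(lang), input=image,
                 capture_output=True, check=True)
    return result.stdout.decode("utf-8")
```

With `check=True`, a non-zero exit from tesseract raises `subprocess.CalledProcessError` instead of silently returning empty text.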

I'll assume you've considered using the native API (vs. wrapping the command line frontend) for Tesseract and rejected it as too complicated, but I'll still mention that that interface will give you lots and lots of knobs to tweak that are either not available or very very inconvenient and limited when going through the command line frontend.

@Xover thanks for looking at it! Yes, you're quite right about not sanitizing the image URL, and there are a bunch of other things it doesn't do as well. I started working on it before I checked here, and once I saw that Harumi was working on it I stopped and just pushed what I'd done so far.

As for the temp files: I started working on a caching idea so that multiple requests would not initiate multiple OCR processes, but then realised that was too much for this first PR so half-removed that idea.

It sounds like we'll end up caching lots, although perhaps just the OCR text rather than the image (keyed by a hash of the image), especially if we end up going with an approach that OCRs the whole work. That doesn't have to happen immediately though.
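As a rough sketch of that idea (OCR text cached under a hash of the image), assuming an in-memory dictionary and SHA-256 as the hash; neither choice comes from this thread, and the class name is hypothetical:

```python
import hashlib
from typing import Callable, Dict


class OcrTextCache:
    """Cache OCR text keyed by a hash of the image bytes.

    Assumption: an in-memory dict stands in for whatever cache
    backend the real tool would use.
    """

    def __init__(self, ocr: Callable[[bytes], str]) -> None:
        self._ocr = ocr
        self._texts: Dict[str, str] = {}

    @staticmethod
    def key_for(image: bytes) -> str:
        # Identical image bytes always map to the same key.
        return hashlib.sha256(image).hexdigest()

    def get_text(self, image: bytes) -> str:
        key = self.key_for(image)
        if key not in self._texts:
            # Only run the (expensive) OCR step on a cache miss.
            self._texts[key] = self._ocr(image)
        return self._texts[key]
```

This sketch keys on the image alone; if the same image can be OCRed with different languages or engines, those options would presumably need to be part of the key as well.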

This is all merged and now deployed to the test site:

Tesseract is only available on the grid engine, not Kubernetes, so I've switched ocr-test to use that. Also increased max_execution_time to 120 seconds because I kept getting timeouts when testing.

dom_walden subscribed.

I have been able to use the tesseract option to do OCR, for example.