Wikisource OCR: add support for tesseract on wikimedia ocr
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	ifried
	Apr 1 2021, 11:45 PM

Description

As a Wikisource user, I want Tesseract to be added to Wikimedia OCR, so that we can have one robust tool to improve & maintain (rather than 2 available via Preferences).

Resources:

tesseract on github

Acceptance Criteria:

Add Tesseract to Wikimedia OCR

Related Objects

Mentioned In: T279553: Wikisource OCR: Add Tesseract to Docker
T278999: Wikisource OCR: Investigate adding Tesseract to Wikimedia OCR [16H]

Event Timeline

ifried created this task. Apr 1 2021, 11:45 PM

Restricted Application added a subscriber: Aklapper. Apr 1 2021, 11:45 PM

ARamirez_WMF set the point value for this task to 5. Apr 1 2021, 11:48 PM

ldelench_wmf moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board. Apr 1 2021, 11:48 PM

ifried moved this task from Up Next (June 3-21) to Kanban-2020-21-Q3 on the Community-Tech board. Apr 2 2021, 12:00 AM

ifried edited projects, added Community-Tech (Kanban-2020-21-Q3); removed Community-Tech.

HMonroy claimed this task. Apr 2 2021, 5:50 PM

HMonroy moved this task from Ready 🎬 to In Development 💻 on the Community-Tech (Kanban-2020-21-Q3) board.

ifried updated the task description. Apr 2 2021, 9:25 PM

ifried mentioned this in T278999: Wikisource OCR: Investigate adding Tesseract to Wikimedia OCR [16H].

Oops, sorry @HMonroy I didn't realise you'd claimed this! I had a bit of a crack at it: https://github.com/wikimedia/wikimedia-ocr/pull/5/files (not fully polished yet or anything, so feel free to ignore if you've already made progress!).

@Samwilson A couple of thoughts on skimming (and I do mean skimming) the diff…

First, is the imageUrl input sanitized anywhere? It looks like you're feeding it straight on to HttpClient, so it could contain any old garbage. And it doesn't appear as though it is constrained to WMF sites, so there may be some potential for abuse.
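To make the concern concrete, here is a minimal sketch of the kind of check being asked for. It is written in Python purely for illustration and is not the tool's own code. The function name `is_allowed_image_url` and the contents of `ALLOWED_HOSTS` are assumptions; the real set of permitted hosts would need to be decided by the team.

```python
from urllib.parse import urlsplit

# Assumed allowlist for illustration only; the actual permitted hosts
# are not specified anywhere in this task.
ALLOWED_HOSTS = {"upload.wikimedia.org"}


def is_allowed_image_url(url: str) -> bool:
    """Return True only for plain HTTPS URLs on an allowlisted host."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased by urlsplit, None if missing
        port = parts.port      # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme != "https":
        return False
    # Reject embedded credentials such as https://trusted@evil.example/
    if parts.username or parts.password:
        return False
    if port not in (None, 443):
        return False
    # Exact match, so "upload.wikimedia.org.evil.example" is rejected.
    return host in ALLOWED_HOSTS
```

Matching the hostname exactly against an allowlist, rather than searching the URL string for a substring, is what keeps lookalike hosts and credential tricks out.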

Second, why the tmp files? Tesseract can work on stdin/-out (in the usual way), and for any single image the process isn't resumable (at best you could cache the downloaded image to avoid redownload). For my own wrapper I do it that way, and while that's just a personal toy that part of it has been rock solid for me.
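A rough sketch of the stdin/stdout approach, in Python for illustration and not taken from either wrapper discussed here. It assumes the `tesseract` binary is on the PATH; the helper names `build_tesseract_command` and `ocr_image_bytes` and the 120-second default timeout are hypothetical.

```python
import subprocess
from typing import List


def build_tesseract_command(lang: str = "eng") -> List[str]:
    """Build the argument list for reading an image from stdin.

    Tesseract treats the literal names "stdin" and "stdout" as the
    input image and the output base, so no temp files are needed.
    """
    return ["tesseract", "stdin", "stdout", "-l", lang]


def ocr_image_bytes(image_bytes: bytes, lang: str = "eng",
                    timeout: float = 120.0) -> str:
    """Pipe raw image bytes to Tesseract and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(lang),
        input=image_bytes,          # sent to the process on stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,            # raises TimeoutExpired if exceeded
        check=True,                 # raises CalledProcessError on failure
    )
    return result.stdout.decode("utf-8")
```

Passing the arguments as a list instead of a shell string means the language code is never interpreted by a shell.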

I'll assume you've considered using the native API (vs. wrapping the command line frontend) for Tesseract and rejected it as too complicated, but I'll still mention that that interface will give you lots and lots of knobs to tweak that are either not available or very very inconvenient and limited when going through the command line frontend.

@Xover thanks for looking at it! Yes, you're quite right about not sanitizing the image URL, and there are a bunch of other things it doesn't do as well. I started working on it before I checked here, and once I saw that Harumi was working on it I stopped and just pushed what I'd done so far.

As for the temp files: I started working on a caching idea so that multiple requests would not initiate multiple OCR processes, but then realised that was too much for this first PR so half-removed that idea.

It sounds like we'll end up caching lots, although perhaps just the OCR text rather than the image (keyed by a hash of the image), especially if we end up going with an approach that OCRs the whole work. That doesn't have to happen immediately though.
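The idea of caching the OCR text keyed by a hash of the image could look roughly like this. It is a sketch in Python for illustration only: the class name `OcrTextCache`, the in-memory dictionary, and the choice of SHA-256 are all assumptions, and a real deployment would presumably use a persistent store.

```python
import hashlib
from typing import Callable, Dict, Tuple


class OcrTextCache:
    """Cache OCR text by image content hash, so repeats skip the OCR run."""

    def __init__(self, run_ocr: Callable[[bytes, str], str]) -> None:
        # run_ocr is whatever function actually performs the OCR.
        self._run_ocr = run_ocr
        self._texts: Dict[Tuple[str, str], str] = {}

    def get_text(self, image_bytes: bytes, lang: str = "eng") -> str:
        # Key on the image content and the language, since the same
        # image OCRed with a different language model gives other text.
        key = (hashlib.sha256(image_bytes).hexdigest(), lang)
        if key not in self._texts:
            self._texts[key] = self._run_ocr(image_bytes, lang)
        return self._texts[key]
```

Only the text is stored, not the image, which keeps the cache small even if whole works end up being OCRed.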

MBinder_WMF edited projects, added Community-Tech (CommTech-Sprint-1); removed Community-Tech (Kanban-2020-21-Q3). Apr 6 2021, 8:43 PM

ldelench_wmf moved this task from Ready 🎬 to In Development 💻 on the Community-Tech (CommTech-Sprint-1) board. Apr 6 2021, 9:02 PM

Ready for review: https://github.com/wikimedia/wikimedia-ocr/pull/5

ifried mentioned this in T279553: Wikisource OCR: Add Tesseract to Docker. Apr 7 2021, 3:25 PM

PR 5 is merged.

Follow-up PR: https://github.com/wikimedia/wikimedia-ocr/pull/13

This is all merged and now deployed to the test site: https://ocr-test.toolforge.org

Tesseract is only available on the grid engine, not Kubernetes, so I've switched ocr-test to use that. Also increased max_execution_time to 120 seconds because I kept getting timeouts when testing.

I have been able to use the tesseract option to do OCR, for example.

jayantanth subscribed. Apr 29 2021, 10:52 AM

approved by product! great work!

NRodriguez moved this task from Product sign-off 🤘 to Done 🏁 on the Community-Tech (CommTech-Sprint-1) board. May 24 2021, 3:19 PM

ldelench_wmf closed this task as Resolved. May 24 2021, 7:33 PM
