Request creation of ocrtoy VPS project
Closed, Resolved (Public)

Description

Project Name: ocrtoy

Wikitech Usernames of requestors: xover

Purpose: OCR backend for a per-page gadget for Wikisource

Brief description: This will be (at least initially) a simple custom CGI script written in Perl, so some kind of web server will be involved. It takes a language code and an image URL, runs OCR for the given language on the image, and returns the extracted text as JSON (i.e. a replacement for the currently broken phe-tools on Toolserver). It will require some non-default Perl modules (all are on CPAN with the typical licensing; list in T239934) plus the Tesseract OCR engine (Apache 2.0 license).
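To make the request flow described above concrete, here is a minimal sketch, written in Python purely for illustration (the actual tool is planned in Perl). The query parameter names `lang` and `image`, the JSON keys `text` and `error`, and the language-code pattern are assumptions, not something this task specifies. Downloading the image and mapping wiki language codes to Tesseract language names are left out.

```python
import json
import re
import subprocess
from urllib.parse import parse_qs, urlparse

# Hypothetical parameter names; the task does not define them.
LANG_PARAM = "lang"
IMAGE_PARAM = "image"

# Assumed shape of a language code, e.g. "eng" or "chi_sim".
LANG_RE = re.compile(r"^[a-z]{2,3}(_[a-z]+)?$")


def parse_request(query_string):
    """Extract and validate the language code and image URL."""
    params = parse_qs(query_string)
    lang = params.get(LANG_PARAM, [""])[0]
    image_url = params.get(IMAGE_PARAM, [""])[0]
    if not LANG_RE.match(lang):
        raise ValueError("invalid language code")
    parsed = urlparse(image_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("invalid image URL")
    return lang, image_url


def tesseract_command(image_path, lang):
    """Build a Tesseract CLI call that writes the recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang):
    """Run Tesseract on an already-downloaded image (needs tesseract on PATH)."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def json_response(text=None, error=None):
    """Serialise either the OCR text or an error message as JSON."""
    if error is not None:
        return json.dumps({"error": error})
    return json.dumps({"text": text})
```

Validating the language code before it reaches the command line, and passing the arguments as a list rather than a shell string, keeps user-supplied input from being interpreted by a shell.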

The custom code will be released under an open license (probably the same terms as Perl itself) and will be hosted on GitHub.

OCR is inherently CPU-intensive, and the use case is interactive (there will be a user sitting waiting for the output), so if the tool is actually adopted across multiple language Wikisources it may need multiple fast cores to handle the load. The initial plan is for on-demand processing of single images, so storage and RAM needs should be moderate and scale roughly linearly with use, but I don't have good estimates for either. My guess is that 8GB of memory might be tight at peak, while at 16GB the bottleneck will probably be elsewhere (CPU). Likewise for storage: 10GB including the OS will probably be enough, but will probably require being stingy with logging.
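As a rough way to reason about the CPU side, the number of busy cores can be approximated as request rate times per-page OCR time (Little's law), assuming each OCR run occupies one core. The sketch below is only a back-of-envelope helper: the function name, the 70% target utilisation, and any input values are hypothetical, since no measured figures exist yet.

```python
import math


def cores_needed(requests_per_minute, seconds_per_page, target_utilisation=0.7):
    """Estimate the core count for a given load, assuming one core per OCR run."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    # Average number of OCR runs in flight at any moment (Little's law).
    busy_cores = (requests_per_minute / 60.0) * seconds_per_page
    # Leave headroom so interactive users are not queued at peak.
    return math.ceil(busy_cores / target_utilisation)
```

For example, with purely illustrative inputs of 30 requests per minute and 4 seconds per page, `cores_needed(30, 4)` gives 3.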

Long term, the functionality may be extended with various batch processing modes, which would need quite a lot of preferably fast storage (100GB as a low-end estimate, up to maybe 1TB worst case), but that's currently planned for a separate tool (and will come later in any case).

How soon you are hoping this can be fulfilled: I'm hoping soon since I'm trying to build a replacement for an existing tool that's currently broken (and generating quite a bit of community frustration; see T228594 for details), but there is no particular rush otherwise.

Event Timeline

Bstorm claimed this task.
Bstorm added a subscriber: Bstorm.

Project should be available at https://horizon.wikimedia.org
Let us know if there are any problems.