
Cache OCR transcriptions
Closed, Resolved | Public | 2 Estimated Story Points

Description

We should cache the result of the getText() methods, making sure the cache key is built from the options passed in. This isn't strictly necessary, but it is easy and inexpensive to implement, and it will speed things up for anyone who makes identical requests within a short period of time.

The cache expiry should be configurable.

Acceptance criteria

  • Implement caching so that repeated requests to OCR the same image with the same options are served from the cache rather than re-running OCR
  • Make the expiry configurable via the .env file

Event Timeline

Restricted Application added a subscriber: Aklapper.
dom_walden added a subscriber: dom_walden.

We now cache the extracted text, so subsequent identical requests complete within 100-300 ms.

Requests are considered identical if they have the same image URL, language, engine, psm and oem parameters.
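
As a rough illustration of how such a key could be derived, here is a minimal Python sketch. It is not the tool's actual implementation: the function name ocr_cache_key, the "ocr-" prefix, and the choice of SHA-256 over a JSON payload are all assumptions made for the example. Only the five parameters come from the description above.

```python
import hashlib
import json


def ocr_cache_key(image_url, lang, engine, psm, oem):
    """Build a deterministic cache key from the request parameters.

    Hypothetical helper: the name, prefix and hashing scheme are
    illustrative assumptions, not the tool's real code.
    """
    # Serialize with sorted keys so the result never depends on dict ordering.
    payload = json.dumps(
        {"image": image_url, "lang": lang, "engine": engine, "psm": psm, "oem": oem},
        sort_keys=True,
    )
    # Hash the payload so the key has a fixed length and no awkward characters.
    return "ocr-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that agree on all five parameters map to the same key, and changing any one of them (for example psm) produces a different key, so the cached text is never reused for a request with different options.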

Caching is done server-side, so different users making the same request will benefit from it (for example, if they are transcribing the same page on Wikisource).

Cached data is stored for 1 hour by default, but this is configurable (by changing the APP_CACHE_TTL variable in the .env file).
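
For illustration only, reading such a setting with a one-hour fallback might look like the Python sketch below. The variable name APP_CACHE_TTL and the one-hour default come from this comment. The assumption that the value is expressed in seconds, the helper name cache_ttl, and the fallback to the default on unparseable input are guesses made for the example.

```python
import os

# One hour, matching the default described above (unit assumed to be seconds).
DEFAULT_TTL_SECONDS = 3600


def cache_ttl(env=None):
    """Return the cache TTL in seconds from APP_CACHE_TTL (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("APP_CACHE_TTL", "")
    try:
        ttl = int(raw)
    except ValueError:
        # Unset or unparseable: fall back to the default (an assumption).
        return DEFAULT_TTL_SECONDS
    # Treat negative values as 0, which this sketch reads as "do not cache".
    return max(ttl, 0)
```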

If we decide to turn caching off, I was able to do this by setting APP_CACHE_TTL=0. However, anything already in the cache will continue to be returned until it expires (after ~1 hour).
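
One plausible explanation for this behaviour is that each entry's expiry time is fixed when it is written, so a later change to the TTL only affects new writes. The task does not confirm this. The Python sketch below is a toy in-memory model of that assumption, with a hypothetical TtlCache class, and is not the tool's cache backend.

```python
import time


class TtlCache:
    """Toy cache whose entries carry an expiry stamped at write time (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry has outlived the TTL it was written with: drop it.
            del self._entries[key]
            return None
        return value

    def set(self, key, value, ttl):
        if ttl <= 0:
            # Caching disabled: store nothing new, but leave old entries alone.
            return
        self._entries[key] = (self._clock() + ttl, value)
```

Under this model, an entry written with a 3600-second TTL stays readable after the setting is changed to 0, because reads only compare the current time against the expiry stored with the entry. Clearing the cache explicitly would be needed to stop serving old entries immediately.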

Test environment: https://ocr-test.wmcloud.org (version 0.3.0)