
Caching strategies for scores in Lift Wing
Open, Needs Triage, Public

Description

The next step in Lift Wing is probably deciding how to cache score results to avoid expensive re-computations. There are some high-level strategies to consider:

  • HTTP caching at the CDN edge - We don't currently return any HTTP cache headers in our responses to clients, and the API Gateway sets no-cache if nothing is already specified. Having scores cached at the CDN layer would give us basic protection against high traffic spikes, especially if the requests target the same content. The downside is that we could offer the caching only to external users (namely the ones using the API Gateway), not the internal ones.
  • Score cache in Cassandra. We could basically replicate the ORES Redis score cache, but in Cassandra. When a score is requested, we'd fire a call to Cassandra to check whether a value was already computed, and if so return it immediately. Otherwise, we'd compute the result and store it. Among the pros, both internal and external clients would benefit from the cache; the downside is that bursts in traffic would hit our backend services anyway (since the CDN wouldn't protect us).

Both strategies have some challenges to solve:

  • How to invalidate the cache?
  • How long can a cached value remain in the cache?
  • When/if the cache gets full, what is the policy for new data?
  • etc.

The CDN option doesn't seem viable for the moment since we don't expose a complete REST API: most of the parameters (like features) that really make one score different from another are carried in the POST payload, which is usually not cached at the Varnish/ATS layer (it would be very expensive). For example, let's pick:

curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/cswiki-goodfaith:predict -X POST -d '{"rev_id": 23040023}' -i -H "Host: cswiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"  --http1.1

The cached URL would be /v1/models/cswiki-goodfaith:predict and its value would be the JSON payload of the response. That would clearly be wrong, since we don't vary the cached content based on the rev_id.
We could think about adding an extra "translation" layer in front of the current one, basically offering a real (cacheable) REST API, but at this stage of Lift Wing it would be a major endeavor (and we have already made Lift Wing's API Gateway API public).

The remaining solution to try could be the Cassandra cache, but we'd need to plan it carefully.

Please add ideas/suggestions/doubts/etc. :)

Event Timeline

Varnish can cache POST requests: https://docs.varnish-software.com/tutorials/caching-post-requests/

This is probably worth following up on, but it may be a bad/expensive idea for our CDN.

Upsides of local-ish to LW caching (e.g. Cassandra):

We have control over what is cached and how:

  • maximum rev age that is cached
  • which models are cached
  • how much cache-side space is allocated to each model/endpoint
  • caching strategy (LRU etc)
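The per-model space allocation and the LRU strategy mentioned above can be combined. A sketch, assuming a hypothetical per-model capacity map (the capacities and model names are made up for illustration):

```python
from collections import OrderedDict

class PerModelLRU:
    """LRU cache with a separate entry budget per model/endpoint."""

    def __init__(self, capacity_per_model):
        self.capacity = capacity_per_model        # e.g. {"cswiki-goodfaith": 10000}
        self.caches = {m: OrderedDict() for m in capacity_per_model}

    def put(self, model, rev_id, score):
        cache = self.caches[model]
        cache[rev_id] = score
        cache.move_to_end(rev_id)                 # mark as most recently used
        if len(cache) > self.capacity[model]:
            cache.popitem(last=False)             # evict the least recently used

    def get(self, model, rev_id):
        cache = self.caches[model]
        if rev_id not in cache:
            return None
        cache.move_to_end(rev_id)                 # a hit refreshes recency
        return cache[rev_id]
```

In practice this policy would live in (or in front of) the Cassandra layer rather than in process memory, but the knobs are the same: one budget per model, eviction order defined by recency.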

Upsides to getting "further out" caching:

  • Workload/maintenance burden is shared with more people
  • DDoS protection is easier on the edge of the network
  • potential synergistic benefits of "one big cache for everybody" vs. "smaller shared caches"
  • new LW services automatically benefit from existing cache without needing to change code/cache structure.¹

¹ This may also backfire: caching POSTs that are actually state changing would be bad, so we would need to be careful when matching requests for caching.

As for Varnish body-caching, one thing to note is that different amounts/kinds of whitespace in the request body may cause cache misses, so the hashing function would have to be carefully constructed.
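One way to make the body hash whitespace-insensitive is to parse the JSON and re-serialize it canonically (sorted keys, fixed separators) before hashing. A minimal sketch (the key format is an assumption, not what Varnish actually does):

```python
import hashlib
import json

def cache_key(url, body):
    # Canonicalize the JSON body so that whitespace and key order
    # differences don't produce different cache keys.
    canonical = json.dumps(json.loads(body), sort_keys=True,
                           separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{url}#{digest}"
```

Two requests with the same URL and semantically identical bodies then map to the same key, regardless of formatting.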