Caching strategies for scores in Lift Wing
Open, Needs Triage, Public

Assigned To

None

Authored By

	elukey
	Aug 11 2023, 10:29 AM

Description

The next step in Lift Wing is probably to work out how to cache score results to avoid expensive recomputations. There are some high-level strategies to follow:

HTTP caching at the CDN edge - We don't currently return any HTTP cache headers in our responses to clients, and the API gateway sets no-cache if nothing is already specified. Having scores cached at the CDN layer could give us basic protection against high traffic spikes, especially if clients repeatedly request the same scores. The downside is that we could offer the caching only to external users (namely the ones using the API gateway), not the internal ones.

Score cache in Cassandra - We could basically replicate the ORES Redis score cache, but in Cassandra. When a score is requested, we'd fire a call to Cassandra to check whether a value was already computed and, if so, return it immediately. Otherwise, we'd compute the result and store it. Among the pros, both internal and external clients would benefit from the cache; the downside is that bursts in traffic would still hit our backend services (since the CDN wouldn't protect us).
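To make the lookup-then-compute flow above concrete, here is a minimal sketch. It assumes a plain in-memory dict as a stand-in for a Cassandra table keyed by (model, rev_id); `ScoreCache`, `get_or_compute` and `fake_predict` are hypothetical names, not existing Lift Wing code.

```python
from typing import Callable, Dict, Tuple

# Assumption: scores are uniquely identified by (model name, revision id).
ScoreKey = Tuple[str, int]


class ScoreCache:
    """In-memory stand-in for a Cassandra-backed score cache (illustrative only)."""

    def __init__(self) -> None:
        self._store: Dict[ScoreKey, dict] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, model: str, rev_id: int, compute: Callable[[str, int], dict]
    ) -> dict:
        key = (model, rev_id)
        cached = self._store.get(key)
        if cached is not None:
            # Value already computed: return it without touching the model server.
            self.hits += 1
            return cached
        # Cache miss: compute the score, then store it for later requests.
        self.misses += 1
        score = compute(model, rev_id)
        self._store[key] = score
        return score


def fake_predict(model: str, rev_id: int) -> dict:
    # Placeholder for the expensive model call; the output shape is made up.
    return {"model": model, "rev_id": rev_id, "prediction": rev_id % 2 == 0}


cache = ScoreCache()
first = cache.get_or_compute("cswiki-goodfaith", 23040023, fake_predict)
second = cache.get_or_compute("cswiki-goodfaith", 23040023, fake_predict)
```

The second call is served from the cache, so the expensive path runs only once per (model, rev_id) pair. Note that a miss still reaches the backend, which is why this approach alone does not absorb traffic bursts of previously unseen revisions.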

Both strategies have some challenges to solve:

How to invalidate the cache?
How long can a cached value remain in the cache?
When/if the cache gets full, what is the policy for new data?
etc.
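One possible way to reason about these questions is a cache that combines a time-to-live with least-recently-used eviction. The sketch below is only an illustration of those three policies (explicit invalidation, expiry, eviction when full); `TtlLruCache` and its parameters are hypothetical, and a real store such as Cassandra would handle expiry and capacity differently.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TtlLruCache:
    """Bounded cache with per-entry TTL and LRU eviction (illustrative only)."""

    def __init__(
        self,
        max_entries: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._data: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # "How long can a value remain?": drop entries older than the TTL.
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        # "What if the cache is full?": evict least recently used entries.
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)

    def invalidate(self, key: Hashable) -> None:
        # "How to invalidate?": explicit removal, e.g. when a model is redeployed.
        self._data.pop(key, None)
```

The TTL bounds staleness without any invalidation signal, while explicit invalidation covers events the cache cannot infer on its own, such as a model version change.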

The CDN option doesn't seem viable for the moment because we don't expose a complete REST API: most of the parameters (like features) that really distinguish one score from another are carried in the POST payload, which is usually not cached at the Varnish/ATS layer (it would be very expensive). For example, let's pick:

curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/cswiki-goodfaith:predict -X POST -d '{"rev_id": 23040023}' -i -H "Host: cswiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"  --http1.1

The cached URL would be /v1/models/cswiki-goodfaith:predict and its value would be the json payload of the response. That would clearly be wrong since we don't vary the cached content based on the rev_id.
We could think about adding an extra "translation" layer in front of the current one, basically offering a real (cacheable) REST API, but at this stage of Lift Wing it would be a major endeavor (and we have already made the API Gateway's Lift Wing API public).
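The URL-only keying problem described above can be shown in a few lines. This is a sketch under the assumption that a cache derives its key from the request path alone; the two key functions and the second rev_id are hypothetical and only serve the illustration.

```python
import hashlib

PATH = "/v1/models/cswiki-goodfaith:predict"


def url_only_key(path: str, body: bytes) -> str:
    # Ignores the body, as a cache keyed purely on the URL would.
    return path


def url_and_body_key(path: str, body: bytes) -> str:
    # Mixes a digest of the POST body into the key so payloads are distinguished.
    return f"{path}#{hashlib.sha256(body).hexdigest()}"


body_a = b'{"rev_id": 23040023}'
body_b = b'{"rev_id": 23040024}'  # hypothetical second revision

# Both requests collapse to one key, so the second would get the first's score.
collides = url_only_key(PATH, body_a) == url_only_key(PATH, body_b)

# With the body included, the two requests map to different cache entries.
distinct = url_and_body_key(PATH, body_a) != url_and_body_key(PATH, body_b)
```

Hashing the body fixes correctness but requires the cache layer to read and digest every POST payload, which is the cost concern raised above.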

The remaining solution to try could be the Cassandra cache, but we'd need to plan it carefully.

Please add ideas/suggestions/doubts/etc.. :)

Event Timeline

elukey created this task. Aug 11 2023, 10:29 AM

Restricted Application added a subscriber: Aklapper. Aug 11 2023, 10:29 AM

Varnish can cache POST requests: https://docs.varnish-software.com/tutorials/caching-post-requests/

This is probably worth following up on, though it may be a bad/expensive idea for our CDN.

Upsides of caching local-ish to LW (e.g. Cassandra):

We have control over what is cached and how:

maximum rev age that is cached
which models are cached
how much cache-side space is allocated to each model/endpoint
caching strategy (LRU etc)

Upsides to getting "further out" caching:

Workload/maintenance burden is shared with more people
DDoS protection is easier on the edge of the network
potential synergistic benefits of "one big cache for everybody" vs. "smaller shared caches"
new LW services automatically benefit from existing cache without needing to change code/cache structure.¹

¹ This may also backfire: caching POSTs that are actually state changing would be bad, so we would need to be careful when matching requests for caching.

As for Varnish body-caching, one thing to note is that e.g. different amounts/kinds of whitespace may cause cache misses, so the hashing function would have to be carefully constructed.
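A rough sketch of what a whitespace-insensitive key could look like, assuming the body is always valid JSON: parse it and re-serialize it in a canonical form before hashing. `canonical_body_key` is a hypothetical helper written in Python for illustration; it is not Varnish configuration.

```python
import hashlib
import json


def canonical_body_key(path: str, raw_body: str) -> str:
    # Parse, then re-serialize with sorted keys and no optional whitespace,
    # so that equivalent JSON documents produce identical bytes.
    parsed = json.loads(raw_body)
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{path}#{digest}"


PATH = "/v1/models/cswiki-goodfaith:predict"
compact = canonical_body_key(PATH, '{"rev_id": 23040023}')
spaced = canonical_body_key(PATH, '{  "rev_id":23040023 }\n')
other = canonical_body_key(PATH, '{"rev_id": 23040024}')
```

Here `compact` and `spaced` produce the same key while `other` does not. Parsing every body has its own cost, and any field that does not affect the score would need to be dropped before hashing to avoid needless misses.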

hnowlan subscribed. Aug 16 2023, 2:25 PM

calbon moved this task from Unsorted to Backlog/SRE on the Machine-Learning-Team board. Aug 29 2023, 2:52 PM
