
Caching service request for MinT
Open, Needs Triage, Public

Description

To address the performance needs of our machine translation service MinT, we need a dedicated cache store. Details as follows:

  • The cache will store machine translations that are expected to be requested repeatedly in the future.
    • The cache key is going to be sourcelanguagecode-targetlanguagecode-hash(source content); see the sketch after this list.
    • The cache value is the machine-translated content. For storage optimization, this could also be compressed.
  • A cached translation is not going to change as long as the MT model does not change, so it could be cached for a long time. However, its usefulness and its impact on cache store size need to be considered. So an ideal TTL would be 7 days.
  • Data store size: the cache key would take 2 bytes per language code, 2 bytes for delimiters, and 32 bytes for a sha256 digest, roughly 36 bytes. Assuming 1000 characters per translation, gzip at 50% compression would take about 500 bytes. But since this will be Unicode, the single-byte assumption does not hold, so let us assume 2000 bytes. One million such records will take roughly 2 GB. Add datastore overheads too.
  • The cache will be accessed from the deployed machine translation service.
  • The traffic for the machine translation service is currently 2 req/second; we cannot speculate much about its future growth. Going by 10 req/second may be safe. But I don't think this much traffic is a big issue for a redis-like cache system.
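
To make the key/value/TTL scheme above concrete, here is a minimal sketch, assuming a Redis-like store accessed through the Python redis client; the function names, connection details, and exact compression choice are illustrative, not a committed design:

```
import gzip
import hashlib

import redis  # assumption: a Redis-like store accessed via the redis-py client

TTL_SECONDS = 7 * 24 * 3600  # the 7-day TTL suggested above

cache = redis.Redis(host="localhost", port=6379)  # placeholder connection details


def cache_key(source_lang: str, target_lang: str, source_text: str) -> str:
    # sourcelanguagecode-targetlanguagecode-hash(source content); only the
    # hash of the source text ever reaches the cache, never the text itself.
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return f"{source_lang}-{target_lang}-{digest}"


def store_translation(source_lang, target_lang, source_text, translation):
    # gzip the value, per the storage optimization mentioned above
    cache.set(
        cache_key(source_lang, target_lang, source_text),
        gzip.compress(translation.encode("utf-8")),
        ex=TTL_SECONDS,
    )


def get_cached_translation(source_lang, target_lang, source_text):
    value = cache.get(cache_key(source_lang, target_lang, source_text))
    return gzip.decompress(value).decode("utf-8") if value is not None else None
```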

Event Timeline

@santhosh, we'll discuss this in our team meeting next week and get back to you, thank you!

@santhosh we don't have any updates for you yet, and most of our team will be out due to summer holidays. Is it ok to pick this up in early September?

@jijiki That should be ok. Our team capacity is also thin this month.
Meanwhile, we implemented diskcache-based caching which we plan to use as a fallback cache option (for use in dev boxes, testing, etc.).
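
For reference, a rough sketch of what such a diskcache-based fallback could look like, using the python-diskcache library; the path, TTL, and helper function are placeholders rather than the actual MinT implementation:

```
from diskcache import Cache  # python-diskcache: persists entries to local disk

# hypothetical location; a real deployment would configure its own path
fallback_cache = Cache("/tmp/mint-fallback-cache")


def get_or_translate(key, translate):
    # serve from the on-disk cache when possible, otherwise translate and store
    cached = fallback_cache.get(key)
    if cached is not None:
        return cached
    translation = translate()
    fallback_cache.set(key, translation, expire=7 * 24 * 3600)
    return translation
```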

Hi @jijiki, any updates on this request? Thanks.

Hi @jijiki, any updates on this request? Thanks.

Hey @santhosh, we are discussing this internally; I will let you know soon!

@santhosh given the low traffic and the low storage needs, we could start off by adding memcached pods in the translation service's namespace, and go from there. If we find that resource-wise this is not enough, we can iterate. What do you think?

> @santhosh given the low traffic and the low storage needs, we could start off by adding memcached pods in the translation service's namespace, and go from there. If we find that resource-wise this is not enough, we can iterate. What do you think?

For context, regarding the "low traffic" observation, it is worth noting that we had to disable a feature that exposed MinT machine translation to Wikipedia readers on 23 wikis because the number of requests was exceeding the server capacity. More details in T363338#10038113 and in this report.

We plan to gradually enable the feature back on the 23 pilot Wikipedias (and eventually on all Wikipedias) as server capacity and techniques such as caching make it possible. So even if the current traffic is low, I wanted to share the above experience so that we can consider how traffic may increase soon.

> @santhosh given the low traffic and the low storage needs, we could start off by adding memcached pods in the translation service's namespace, and go from there. If we find that resource-wise this is not enough, we can iterate. What do you think?

> For context, regarding the "low traffic" observation, it is worth noting that we had to disable a feature that exposed MinT machine translation to Wikipedia readers on 23 wikis because the number of requests was exceeding the server capacity. More details in T363338#10038113 and in this report.

Did you request assistance from SRE with capacity planning? I don't see us tagged in that task. It's not like we can't increase capacity horizontally. We could have offered advice.

> We plan to gradually enable the feature back on the 23 pilot Wikipedias (and eventually on all Wikipedias) as server capacity and techniques such as caching make it possible. So even if the current traffic is low, I wanted to share the above experience so that we can consider how traffic may increase soon.

Sounds like a good time to reach out to SRE for assistance with capacity planning!

@akosiaris Horizontal scaling is required; we will reach out for that separately, very soon.

The nature of machine translation requests and processing is that, for many of the upcoming use cases, the input will be the same as something we have already translated, and the output is the same as long as the machine translation model does not change. And as machine translation is CPU intensive, it is a good idea to cache.

@jijiki The translation of a given word/phrase/sentence is going to be the same across restarts and application deployments. We already have non-persistent fallback caching in our application. What we are looking for is a persistent cache solution. Memcached does not help there, right?

> @akosiaris Horizontal scaling is required; we will reach out for that separately, very soon.

> The nature of machine translation requests and processing is that, for many of the upcoming use cases, the input will be the same as something we have already translated, and the output is the same as long as the machine translation model does not change. And as machine translation is CPU intensive, it is a good idea to cache.

> @jijiki The translation of a given word/phrase/sentence is going to be the same across restarts and application deployments. We already have non-persistent fallback caching in our application. What we are looking for is a persistent cache solution. Memcached does not help there, right?

The answer is "it depends". Our infrastructure currently includes 2 caching services, Redis and memcached, for MediaWiki. In both cases, developers must consider the data volatile: we have no backups and they can't survive a server reboot. Data may stay until it reaches its TTL, or be evicted to make room for new entries.

Is there a design document I could take a look at? It would help me understand better what your needs are.

@Pginer-WMF @santhosh, I've tried below to summarize our conversation in the P+T meeting. Please do point out mistakes and I'll correct them; I am mostly reconstructing from memory. I've also added a suggested path forward for next actions, let me know what you think.

Observations

Current Status

MinT is currently serving about ~2 rps across both DCs. It does this using 4 pods, 2 per DC, with a cumulative CPU usage of ~0.5 CPUs (with spikes up to 4 CPUs).


It is also consuming between ~100 GB and 135 GB of RAM, which is on par with the number of pods and the per-pod limits.

2024-07-25 to 2024-07-31 event

During the timeframe of 2024-07-25 to 2024-07-31, an experiment with 23 wikis was run, increasing usage roughly threefold (from 2 rps to 6 rps). While the relative increase was substantial, the absolute one is rather minuscule (4 rps). The event correlates with a sustained increase in CPU usage and throttling. Parts of the event are covered elsewhere (including the report mentioned above).


The event has led to 2 (to my knowledge) different courses of work: one being some capacity planning, the other related to the caching needs for MinT. The latter was prioritized; the former was to happen later.

Caching needs

A solution to address the issues experienced during the event would be the following. Assuming there is some repetition between the various articles end users request to be translated, we would benefit from translating each one only once and serving subsequent requests from the cache. For privacy reasons (among others?), we don't want to keep the original text around, so it will be hashed and a unique id will be generated in order to identify it later on, without storing the original content in any cache.

The edge caches were brought up as a potential solution, but they are probably not going to be very helpful here. The main reason is that the API as currently exposed by CXserver is just a POST to a group of static URLs, which does not allow differentiating between requests.

My understanding is that there is already some basic form of in-memory caching in MinT. We currently don't have any concrete knowledge of how true the hypothesis about repetition of requested translations is, but since we already scrape metrics from MinT, it should be possible to instrument MinT and fetch that information. This would be useful anyway, no matter what the next steps are, and would inform them.

Suggestions

Capacity

First off, I'd suggest we reprioritize the caching vs capacity planning parts. We know we only use 4 pods for MinT; increasing them is possible. First off, it will provide more CPU to the deployment, avoiding the high throttling levels. We already kind of have a first-order approximation (~6 rps) from enabling 23 languages. To err a bit on the safe side, I suggest we pick 3 Wikipedias from that set of 23, 1 small, 1 medium and 1 large, enable the feature for them, let it run for 1 week and gauge the effects. I expect we'll see way less CPU throttling and fewer requests (and I might be proven wrong). From there we guesstimate (nothing particularly fancy, it will prove to be wrong anyway) the required number of pods. This alone might solve most problems. From there on, we talk whenever more languages are added and continue. Note that memory usage requirements are quite high (32 GB per pod), so we might end up having to stall while waiting for new hardware.

Caching

A metric like the following would be emitted (I am making up the metric name right now, adapted from the traffic graphs, but feel free to use any metric that would make more sense to you):

machinetranslation_requests{cache_hit="yes"}

with cache_hit="yes" if the request ended in a cache hit and "no" otherwise.
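
One possible way to emit such a metric from the service, sketched here with the Python prometheus_client library; the handler shape is illustrative, only the metric and label names come from the suggestion above:

```
from prometheus_client import Counter  # assumes prometheus_client is available

# Counter labelled by cache_hit; prometheus_client exposes it as
# machinetranslation_requests_total on the /metrics endpoint.
TRANSLATION_REQUESTS = Counter(
    "machinetranslation_requests",
    "Machine translation requests, labelled by whether they hit the in-memory cache",
    ["cache_hit"],
)


def handle_translation_request(key, cache, translate):
    cached = cache.get(key)
    TRANSLATION_REQUESTS.labels(cache_hit="yes" if cached is not None else "no").inc()
    return cached if cached is not None else translate()

# The hit ratio discussed below can then be read off in Prometheus with
# something along the lines of:
#   sum(rate(machinetranslation_requests_total{cache_hit="yes"}[5m]))
#     / sum(rate(machinetranslation_requests_total[5m]))
```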

From that alone, we should be able to understand how successful a shared cache across pods would be. An in-memory cache hit ratio of 100% would mean a shared cache isn't going to help (there is so much repetition that the dedicated in-memory cache is already sufficient), and conversely an in-memory cache hit ratio of 0% would also be bad (there is so little repetition that a shared cache would have a very low cache hit ratio). Between those 2 extremes there is a valley of interesting values to try and explain, but let's cross that bridge when we get there.

The pod is already scraped by Prometheus, but I am not sure it is exposing all the metrics it should (I see the Errors and Latency rows are empty in Grafana). Maybe some work should be prioritized to increase visibility there.

As a last note, I wish you had reached out to us earlier regarding the event and the capacity problems; we probably could have helped.

Thanks for your input, @akosiaris. As always, super useful.

I created a ticket to capture the proposal to enable this in 3 wikis and provided some candidates (Persian, Icelandic, and Fon): T377943: Enable the access to MinT for Wiki Readers MVP in 3 wikis of different sizes

One aspect I'd like to get some clarity on is your mention of having those enabled for a week. Some questions:

  • Would it be an issue if the period is extended a bit to, for example, two weeks? This can be an opportunity to start measuring the user impact (T373862), where we may want to cover potential effects on editing activity, for which more time may be needed in smaller editing communities.
  • Do you suggest enabling the feature in all three wikis at once, or would doing it more gradually (i.e., enable Fon first, some time after have Fon and Icelandic, and finally have all three) be preferred?

> Thanks for your input, @akosiaris. As always, super useful.

Glad I could help.

> I created a ticket to capture the proposal to enable this in 3 wikis and provided some candidates (Persian, Icelandic, and Fon): T377943: Enable the access to MinT for Wiki Readers MVP in 3 wikis of different sizes

> One aspect I'd like to get some clarity on is your mention of having those enabled for a week. Some questions:

> • Would it be an issue if the period is extended a bit to, for example, two weeks? This can be an opportunity to start measuring the user impact (T373862), where we may want to cover potential effects on editing activity, for which more time may be needed in smaller editing communities.

No objections from my side. Take as long as you want. I mentioned 1 week because I think it's a value that would provide us with enough data to make a somewhat plausible guesstimation.

> • Do you suggest enabling the feature in all three wikis at once, or would doing it more gradually (i.e., enable Fon first, some time after have Fon and Icelandic, and finally have all three) be preferred?

Personally, I'd go for 3 and, if this ends up causing issues, fall back to 1 while fixing the issues. But I am fine with either approach.