
Caching service request for recommendation api
Open, Needs Triage, Public

Description

The recommendation API can serve translation recommendations based on community-defined page collections such as Wiki99/food and Wikiproject_Women's_Health. To do so efficiently, it maintains a local cache of the collections, their articles, and the articles' language links. The cache is stored in a local file that gets repopulated after every deployment. Populating it takes a few minutes with 4 collections and is expected to grow linearly with feature adoption.

We would like to move to a permanent cache that persists across deployments. This would eliminate the downtime of the collection-related endpoints while the cache is being filled and reduce the number of API calls the service has to make.

Event Timeline

Hi @SBisson

Unfortunately recommendation-api is an overloaded term: we have 2 recommendation-apis, one that powers e.g. https://en.wikipedia.org/api/rest_v1/#/Recommendation and one that powers https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation. I suppose this task is about the latter, but can you please confirm this?

Furthermore, can you provide some more information about the service, as well as guesstimates about the size of the cache store? Before we offer a solution, we'd like to know more about how the service works, what kind of requests it receives and from where, and how and where the service is deployed. It would also help if you could add some more information on what kind of issues you run into when deploying. "This would eliminate the downtime of the collection-related endpoints while the cache is being filled and reduce the number of API calls the service has to make." gives a few hints, but we'd appreciate more details.

Hi @akosiaris,

Of course Recommendation-API is very generic and could use a little more specificity, as recommendations are popping up across our ecosystem. The two examples you gave are actually 2 incarnations of the same thing. I'm not sure if the first one is still in use. (By CX2 maybe? @Nikerabbit)

About the service:

It is used by the new Content Translation unified dashboard, also referred to as CX3, to provide the user with translation recommendations for pages (articles that exist in the source language but not in the target language) and sections (articles that exist in both the source and target languages but that have sections that could be added in the target language).

One of the sources for translation recommendations is community-defined page collections. Here is an example on meta. From the CX3 UI, the user may choose to get recommendations from this specific collection or from all available collections. In order to serve those requests as quickly as possible, all available collections are pre-fetched with all their articles' titles, QIDs, and language links, and stored in the cache.
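
For illustration only, a cached entry for one collection might look roughly like the sketch below. The field names and values are hypothetical, not the service's actual schema:

```
# Hypothetical shape of one cached collection entry (illustrative only).
cached_collection = {
    "name": "Wiki99/food",
    "revision_id": 123456789,  # used later to detect whether the collection changed
    "articles": [
        {
            "title": "Example article",
            "qid": "Q123",
            "langlinks": {"fr": "Exemple", "es": "Ejemplo"},
        },
        # ... one entry per article in the collection
    ],
}
```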

The cache is currently stored in a file local to the image. When a new version of the service is deployed, based on a new image, the cache needs to be populated on the first startup and it takes some time. During that time, the service is operational but the collection-related endpoints, like listing the collections, are useless, so I refer to it as "downtime". Additionally, page collections change infrequently, but they all have to be reprocessed with dozens of API calls after each deployment, and we want to deploy more often, not less.

There is also an hourly cache update process running, but most of the time it sees that the latest revision ID of the collection has not changed and does nothing, so this is generally inexpensive and non-disruptive.
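
Conceptually, the hourly refresh is something like the following sketch; the helper functions and cache keys are hypothetical, not the actual implementation:

```
# Rough sketch of the hourly refresh described above (names are hypothetical).
def refresh_collection(cache, collection_name):
    latest_rev = fetch_latest_revision_id(collection_name)  # one lightweight API call
    cached_rev = cache.get(f"{collection_name}:revision")

    if cached_rev == latest_rev:
        return  # collection unchanged since the last run: inexpensive and non-disruptive

    # Only when the collection page actually changed do the expensive fetches run.
    articles = fetch_collection_articles(collection_name)  # titles, QIDs, language links
    cache.set(f"{collection_name}:articles", articles)
    cache.set(f"{collection_name}:revision", latest_rev)
```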

In terms of cache size, we currently have 4 collections with a total of 2448 articles, and this is what I see in the filesystem:

-rw-r--r--   1 sbisson  staff   32768  6 Dec 10:23 cache.db
-rw-r--r--   1 sbisson  staff   32768  6 Dec 10:23 cache.db-shm
-rw-r--r--   1 sbisson  staff  317272  6 Dec 10:23 cache.db-wal

It is using a Python JSONDisk cache with compression_level=6. See implementation. I don't know how the number and size of collections is expected to evolve over time. Maybe @Pginer-WMF or @PWaigi-WMF have some estimates.
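
Assuming this refers to python-diskcache's JSONDisk (which would also explain the SQLite cache.db/-shm/-wal files above), the setup looks roughly like this; the directory path is made up for illustration:

```
from diskcache import Cache, JSONDisk

# Hypothetical directory; today it lives inside the container image, which is
# why the cache is lost and repopulated on every deployment.
cache = Cache(
    "/srv/recommendation-api/cache",
    disk=JSONDisk,          # values stored as (zlib-compressed) JSON
    disk_compress_level=6,  # "disk_" prefixed settings are passed through to JSONDisk
)

cache.set("collections:wiki99-food:revision", 123456789)
print(cache.get("collections:wiki99-food:revision"))
```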

Hope this helps

Hi,

Thanks for putting all this info together. Some answers inline

The two examples you gave are actually 2 incarnations of the same thing. I'm not sure if the first one is still in use

Partially. They are from the same original team, but one is in Node.js, the other one in Python. The basic premise behind them is apparently the same, but the implementation, and thus the details, are different. Yes, the first one is still in use, but it is also abandoned and we are working on fixing that. In any case, I think a lesson for the future here is that naming, one of the 2 unsolved problems in computing, is a difficult one and requires some wider conversations to avoid issues like the one described above.

Deployed on Liftwing: /docs endpoint and deployment config

Whatever the chosen path ends up being, serviceops isn't the team managing LiftWing infrastructure. That team is Machine-Learning-Team, and they definitely need to be involved in this, possibly in a more authoritative way than serviceops.

When a new version of the service is deployed, based on a new image, the cache needs to be populated on the first startup and it takes some time. During that time, the service is operational but the collection-related endpoints, like listing the collections, are useless, so I refer to it as "downtime".

Thanks for explaining this! It has allowed me to better reason about this.

This framing doesn't sound like a problem where a centralized caching solution would be the canonical answer. Caching is generally about looking up data more quickly, but you already have that, even if it is per instance. Centralizing it should be about increasing the cache hit rate and/or making lookups even faster, not solving what, per your wording, is a deployment issue.

Since this appears to be a deployment issue, I would suggest improvements to the deployment process instead of a centralized (from the PoV of the recommendation-api instances) caching store. I would encourage involving Machine-Learning-Team in this, since the suggestions I'll outline below will be utilizing LiftWing infrastructure (the Kubernetes parts at least).

  • One approach is the initContainer one. Due to how the default Deployment Strategy works in Kubernetes, it is typical to use this to perform some initialization work per instance without suffering the effects of "downtime" as you defined it above. The idea is that during deployment, every new instance runs the init container, which does the API fetches you talked about; once it finishes, the service can properly start and begin to serve requests. This proceeds in batches of instances until done. Older instances are still running and serving requests, being shut down in batches as the deployment progresses. Note that init containers come with a set of drawbacks, and it's not uncommon to perform something similar in the entrypoint of the image instead (MinT does that).
  • Another would be altering the readiness probe of the service, so that the readiness probe of the container doesn't return OK until everything that needs to be fetched has been fetched. The rest works more or less the same way as above: the deployment strategy requires that pods/instances are marked as ready (i.e. the readiness probe succeeds) before proceeding in batches, allowing older instances to keep serving traffic.

Init container gotchas

This is a list of various ways that init containers can bite you. I am adding them for completeness. If you are careful, keep them in mind, and code the initContainer defensively, an initContainer is an acceptable approach. This mostly applies to the entrypoint approach too, with the difference that the entrypoint is mildly easier to debug.

  • Using an initContainer to fetch some stuff from the network and assuming the network is reliable. At some point, something happens in one of the pods' initContainers, and if there is no logic to handle it, or the pod is misconfigured for that (e.g. restartPolicy=Never), an entire deploy needs to be reverted and debugged.
  • Using an initContainer to initialize something without a lock. Yes, I've seen a MySQL Galera pathological case where a race existed that would lock up the entire deployment if 2 different instances of the init process ran (hibernate...) and everything needed to be restarted.
  • Using an initContainer to run something that takes a pretty long time. The entire deploy ends up taking a long time and there isn't yet good feedback to the deployer. If it hits a deployment timeout it will silently roll back, but it's one of those cases where the rollback can (and sometimes will) fail, because part of the rollback requires starting initContainers for the old version of the code. But it wasn't the code that failed, it was the initContainer, due to some external dependency.
  • Not writing the initContainer code to be idempotent. That's actually a meta issue, and the points above are manifestations of it up to a point. But writing idempotent code means quite a lot of checks (a rough sketch of what that can look like follows this list).
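
To make the idempotency point a bit more concrete, below is a minimal sketch of a defensive warm-up script that could run as an initContainer or at the start of the image entrypoint. All names (paths, helpers, the freshness check) are made up for illustration and not taken from the actual service:

```
import sys
from pathlib import Path

# Hypothetical location of the pre-populated cache inside the container.
CACHE_DB = Path("/srv/recommendation-api/cache/cache.db")


def cache_is_warm() -> bool:
    """Cheap check so reruns of this script are no-ops (idempotency)."""
    return CACHE_DB.exists() and CACHE_DB.stat().st_size > 0


def warm_cache() -> None:
    # Placeholder for the real work: fetch collections, article titles,
    # QIDs and language links, and write them into the cache.
    ...


def main() -> int:
    if cache_is_warm():
        print("cache already populated, nothing to do")
        return 0
    try:
        warm_cache()
    except Exception as exc:  # network hiccups etc. should not wedge the deploy
        print(f"cache warm-up failed: {exc}", file=sys.stderr)
        return 1  # let Kubernetes restart the initContainer instead of hanging
    return 0


if __name__ == "__main__":
    sys.exit(main())
```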

Personally, I would suggest going down the readiness probe path first; it is going to be somewhat faster to code a check for whether the files you need are present on the filesystem. But the final decision probably rests somewhere between your team and what Machine-Learning-Team feels they can support.
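
To illustrate that path, the readiness endpoint can simply refuse to report ready until the cache file is in place. This is a minimal sketch assuming a FastAPI-style service and a made-up cache path; the real service's framework and layout may differ:

```
from pathlib import Path

from fastapi import FastAPI, Response

app = FastAPI()

# Hypothetical path; the readiness check only cares that the warm-up finished.
CACHE_DB = Path("/srv/recommendation-api/cache/cache.db")


@app.get("/healthz/ready")
def ready(response: Response):
    """Kubernetes readiness probe target: not ready until the cache is populated."""
    if CACHE_DB.exists() and CACHE_DB.stat().st_size > 0:
        return {"status": "ready"}
    response.status_code = 503  # keeps the pod out of rotation while warming up
    return {"status": "warming up"}
```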