Handle edge cache invalidation for the api gateway
Open, High, Public

Description

Currently, the API gateway is configured to bypass the edge cache.

Given that we want to start serving traffic with it, we need to start using the edge cache and allow properties to be invalidated properly, e.g. when an edit happens, but (probably?) also when the parsed wikitext output changes (because of, say, a template change).

Invalidating the cache amounts to posting a json payload that follows https://schema.wikimedia.org/repositories//primary/jsonschema/resource_change/current.yaml to eventgate-main.
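As a rough illustration of what such a payload could look like, here is a minimal Python sketch. The field names (`$schema`, `meta.stream`, `meta.uri`, `meta.domain`, `meta.dt`), the schema version string, the example URL, and the idea of sending the events as a JSON array are all assumptions made for illustration; the schema linked above is the authority on the actual shape.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit


def build_purge_event(url: str, stream: str = "resource-purge") -> dict:
    """Build a resource_change-shaped event for one URL (field names are assumed)."""
    return {
        "$schema": "/resource_change/1.0.0",  # assumed schema version
        "meta": {
            "stream": stream,  # assumed target stream name
            "uri": url,
            "domain": urlsplit(url).hostname,
            "dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }


# Example URL only; any URL served through the gateway would do.
event = build_purge_event("https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth")

# Assumption: the intake endpoint accepts a JSON array of events in the POST body.
body = json.dumps([event])
```

The resulting `body` would then be sent in an HTTP POST to eventgate-main; the exact endpoint path is deliberately left out here because it is not stated in this task.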

Basically, what we want is a daemon, one per service and per datacenter, that listens to $dc.resource_change, selects only URLs that don't start with /api/rest_v1 (so, MediaWiki properties), and:

  • Emit a corresponding event for the URL in our service to $dc.resource_change to notify anyone of the change (optional; frankly, I don't think this is strictly needed but I'll let someone from DE comment on this - maybe @Ottomata )
  • Emit a corresponding event for the URL in our service to $dc.resource-purge
  • If the service has a cache, emit a purge request for the service. This last thing is currently not necessary anywhere.
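The selection and fan-out described above could be sketched roughly as follows. This is not an actual implementation: the function names, the `translate` callback that turns a MediaWiki URL into the service's URL, and the minimal event shape are hypothetical, and the Kafka consumer and producer are left out entirely.

```python
from typing import Callable
from urllib.parse import urlsplit

REST_PREFIX = "/api/rest_v1"


def is_mediawiki_url(uri: str) -> bool:
    """True for URLs whose path does not start with /api/rest_v1."""
    return not urlsplit(uri).path.startswith(REST_PREFIX)


def events_for(
    change: dict,
    dc: str,
    translate: Callable[[str], str],
    emit_change: bool = False,
) -> list[tuple[str, dict]]:
    """Return (topic, event) pairs to emit for one incoming resource_change event."""
    uri = change["meta"]["uri"]
    if not is_mediawiki_url(uri):
        return []  # not a MediaWiki URL: nothing to do
    service_uri = translate(uri)  # hypothetical URL mapping for this service
    out = []
    if emit_change:
        # Optional step: notify others that the service's resource changed.
        out.append((f"{dc}.resource_change", {"meta": {"uri": service_uri}}))
    out.append((f"{dc}.resource-purge", {"meta": {"uri": service_uri}}))
    return out
```

The third bullet (purging the service's own cache) is omitted, since the task notes it is not currently needed anywhere.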

A few considerations:

  • While I think there should be a way to allow a simple stream processor to look at the changes coming from mediawiki and emit purges based on rules defined centrally, I don't think that's the best idea here.
  • While we can't really use Varnish's Xkey, because Traffic Server has no equivalent or similar concept, we could still at least reduce the traffic on the Kafka queue by moving some logic into purged. Basically, add a configuration that maps article_url => derived URLs, eliminating the need for additional configuration everywhere. This can only work, though, for services that have no caching of their own.
  • The other services would be responsible to send out the purges after invalidating their own cache - basically what restbase does today
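To make the article_url => derived URLs idea a bit more concrete, here is a small Python sketch of such a mapping. The template strings and the `derived_urls` helper are hypothetical; purged has no such configuration as far as this task states, and the real URL layout of each service would need to be filled in.

```python
from urllib.parse import urlsplit

# Hypothetical templates: one entry per derived URL that depends on an article.
DERIVED_TEMPLATES = [
    "https://{host}/api/example/v1/page/{title}/html",
    "https://{host}/api/example/v1/page/{title}/summary",
]


def derived_urls(article_url: str, templates=DERIVED_TEMPLATES) -> list[str]:
    """Expand one article URL into the derived URLs that should also be purged."""
    parts = urlsplit(article_url)
    marker = "/wiki/"
    if not parts.path.startswith(marker):
        return []  # not an article URL: nothing is derived from it
    title = parts.path[len(marker):]  # keeps subpage titles such as "Foo/Bar"
    return [t.format(host=parts.hostname, title=title) for t in templates]
```

With a mapping like this evaluated in one place, a single purge of the article URL would be enough on the queue, which is where the reduction in Kafka traffic would come from.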

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 1 2022, 12:00 PM
Vgutierrez subscribed.

Re-tagging the task, I'm assuming it got into traffic-icebox by mistake :)

Note that we only need active purging if/when we emit cache control headers that tell the edge cache to cache long-term.

One key ingredient here is finding out what cache duration we need to get a good (enough) hit rate. If that duration is low enough, we don't need active purging.

I beg to disagree. We do have active purging on every other resource we publish, and not having it here would lead to inconsistencies that we don't want.

I'll let the traffic team comment on what caching time would be optimal (I suppose it's a very hard question, and I'd assume our current TTL is a good compromise), but I'll underline that anything above 1 minute would be a regression compared to today.

Joe triaged this task as High priority. · May 12 2023, 6:39 AM
Joe added a subscriber: Ottomata.

My idea for implementing this is as follows:

  • Create a benthos container
  • Add a release containing a Deployment with N replicas (is this the best solution?) to the namespace of the service, running benthos with the adequate configuration, one per datacenter, listening to the local queue
  • Limit what we need to configure to just the prefix for the URL we get from resource-change

My main doubt is how to make benthos:

  • read the correct kafka topic
  • load the message as json
  • drop any message where the meta.uri property does not include /wiki/
  • For the remaining messages, replace /wiki/ with the appropriate public url for this service
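The four steps above amount to a small per-message transformation. The sketch below expresses that logic in Python purely to pin down the intended behaviour; it is not Benthos configuration or Bloblang, and `PUBLIC_PREFIX` is a hypothetical stand-in for the service's public URL prefix. Reading from and writing to Kafka is left out.

```python
import json
from typing import Optional

PUBLIC_PREFIX = "/service/v1/page/"  # hypothetical public URL prefix for the service


def process(raw: bytes, public_prefix: str = PUBLIC_PREFIX) -> Optional[bytes]:
    """Parse one Kafka message, drop it or rewrite meta.uri, and re-serialize it."""
    event = json.loads(raw)  # load the message as JSON
    uri = event.get("meta", {}).get("uri", "")
    if "/wiki/" not in uri:
        return None  # drop: meta.uri does not include /wiki/
    # Rewrite only the first occurrence so titles containing "/wiki/" survive.
    event["meta"]["uri"] = uri.replace("/wiki/", public_prefix, 1)
    return json.dumps(event).encode("utf-8")
```

Returning `None` for dropped messages mirrors a filter stage: the caller simply skips producing anything for them.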

With my Benthos enthusiast hat on: the pattern is similar to what we do for webrequest sampling (i.e. kafka -> processing -> kafka), so the configuration will be quite similar to modules/profile/templates/benthos/instances/webrequest_live.yaml.erb, except for the processors section, which will contain Bloblang code that inspects meta.uri and either drops or rewrites the message.

Basically add a configuration that allows to figure out article_url => derived urls as a mapping, and eliminate all the need for additional configuration everywhere.

I like this idea.

Q. If this was done, would there be a need for the resource_purge event at all? Wouldn't the resource_change event (or, more ideally, the actual state change event ;) ) be enough to indicate that a purge should happen?

This idea is getting very close to T253026: Introduce a centralized Dependency Tracking Service :)

Yes, but we want to keep purges logically independent, and also to be able to emit them only once any other resource that needed to clean its cache has done so. It is also better operationally, as we want to keep the processing done at the edge by purged to a minimum.

The ML team is serving its Lift Wing model servers via the API gateway, so we'd benefit from edge caching as well :)

Change 936765 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] api-gateway: emit no-cache unless otherwise asked

https://gerrit.wikimedia.org/r/936765

Change 937061 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] cache: set api.wikimedia.org to normal caching

https://gerrit.wikimedia.org/r/937061

Change 936765 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: emit no-cache unless otherwise asked

https://gerrit.wikimedia.org/r/936765

Change 935771 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] add Benthos chart + WIP cache invalidator service

https://gerrit.wikimedia.org/r/935771

Change 935771 merged by Kamila Součková:

[operations/deployment-charts@master] add Benthos chart

https://gerrit.wikimedia.org/r/935771

Change 938256 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] add WIP Benthos cache invalidator to staging

https://gerrit.wikimedia.org/r/938256

Change 937061 merged by Hnowlan:

[operations/puppet@production] cache: set api.wikimedia.org to normal caching

https://gerrit.wikimedia.org/r/937061

Change 938256 merged by jenkins-bot:

[operations/deployment-charts@master] add Benthos smoke test to staging

https://gerrit.wikimedia.org/r/938256

Change 942440 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] kubernetes: add Benthos cache invalidator service

https://gerrit.wikimedia.org/r/942440

Change 942444 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] add namespace for benthos-cache-invalidator

https://gerrit.wikimedia.org/r/942444

Change 942440 merged by Kamila Součková:

[operations/puppet@production] kubernetes: add Benthos cache invalidator service

https://gerrit.wikimedia.org/r/942440

Change 942444 merged by jenkins-bot:

[operations/deployment-charts@master] add namespace for benthos-cache-invalidator

https://gerrit.wikimedia.org/r/942444

Change 943578 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] benthos: temporarily disable readiness probe

https://gerrit.wikimedia.org/r/943578

Change 943578 merged by jenkins-bot:

[operations/deployment-charts@master] benthos: temporarily disable readiness probe

https://gerrit.wikimedia.org/r/943578

Change 944936 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] benthos: bump chart version

https://gerrit.wikimedia.org/r/944936

Change 944936 merged by jenkins-bot:

[operations/deployment-charts@master] benthos: bump chart version

https://gerrit.wikimedia.org/r/944936

Change 945595 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/docker-images/production-images@master] benthos: add wmf-certificates for Kafka

https://gerrit.wikimedia.org/r/945595

Change 945595 merged by Kamila Součková:

[operations/docker-images/production-images@master] benthos: add wmf-certificates for Kafka

https://gerrit.wikimedia.org/r/945595

Change 945606 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] benthos-cache-invalidator: bump image version

https://gerrit.wikimedia.org/r/945606

Change 945606 merged by jenkins-bot:

[operations/deployment-charts@master] benthos-cache-invalidator: bump image version

https://gerrit.wikimedia.org/r/945606