Make CDN purges reliable
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	BBlack
	Apr 28 2016, 12:08 AM

Description

This meta-task is to serve as a pointer/blocker so we don't keep having to re-explain the same basic problems in many tickets.

Background

The CDN infrastructure is composed of two layers. Traffic flows from frontend-cache -> backend-cache -> MediaWiki application.

"Frontend" cache. Their main purpose is to handle high load. Traffic is generally distributed equally among them. Implementation-wise this is currently backed by Varnish, stored in RAM, has the logical capacity equal to what one such server can hold in memory (given each frontend server is effectively the same).
"Backend" cache. Their main purpose is to handle wide range of pages. Traffic is distributed by hashing the URL to one of the backends (any relevant request headers factor in as well).

When an article is edited, or when we cascade updates from templates and Wikidata items, we need to purge the relevant URLs from the CDN caches. We use HTCP (multicast UDP) to send the purges from MediaWiki to the cache nodes.

See https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#Infrastructure for a more complete overview, including links to in-depth docs.

Root problems

Network congestion. The use of HTCP (multicast UDP) generates a lot of internal traffic to our cache nodes.

Packet loss. UDP is unreliable, especially at high rates multicasted across broad networks and contending with other beefy traffic to the cache boxes on network queues and such. Historically, this hasn't been a huge issue when internal traffic was much more stable. For quite a long time, user-notable missed purges were rare.

Bad renewal of purged content, due to a cache-layer race condition. The multicast HTCP purges have no awareness of our distinct frontend and backend layers.

It is easy for a purge to reache a frontend first (instead of backends). Then upon the next visit to that article the frontend will fallback to the backend cache which may serve it its old copy. Thus it will have "whitewashed" the old version. Sometime later the backend recieves the purge, but the frontends have already moved on and this does not self-correct currently. Again, historically this wasn't a huge problem in practice; the race condition was never noticed much for articles that people paid most attention to.

This problem is non-trivial to solve because there can be a local backlog of purges. Even if we "simply" send the purge to all the backends first, and only then purge the frontends, this does not help per-se because the action isn't instantenous. Each server has its own inbox of purges it has received for processing. What matters is not the order in which they are sent to the cache layers, but the order in which they are processed.

Many URLs to purge (content variants and derivatives). Often a single piece of unique content is reflected under several distinct URLs (think language conversion, image resizing, mobile vs desktop rendering, the History page of an article, etc). Historically, this was solved by either never caching or never purging the "less-important" views of article metadata.

Impact

Since late 2015, the above problems have gotten worse and more-noticeable:

T124418 outlines how our rate of purge requests has multiplied by more than an order of magnitude in this recent timeframe. There are several distinct days on which the level raised higher (permanently), and we can only guess at the various causes:
- We know some of the causes were code changes that attempted to fix the variants problem (4) above by issuing purges for many more distinct URLs per unique content source than we have historically.
- We know some of the causes were code changes trying to fix problems (2) and (3) by sending multiple delayed repeats of every purge request a short time later to try to paper over races and loss, which further multiplies the total rate.
- We suspect that when most of the purging was centralized through the JobQueue somewhere in this timeframe, that this probably also multiplied the purge rate due to JobQueue bugs repeating past purges that were already completed for no good reason.
- Some wikis have actually added javascript at various places in the wikis themselves to execute automatic purge-on-view as a recourse as well, further exacerbating the problem in an incredibly frustrating way.

Because of the massive increase in raw purge rate at the caches, we're almost certainly in worse shape than we were before. Various parties' attempts to 'fix' the problems have overwhelmed us with far more purge traffic than we've ever had before, which results in more loss to network queues and buffers at various layers. We now get far more frequent reports of failed purging than we did historically. This image gives a decent view of the purge traffic increase:

Screen Shot 2016-04-07 at 7.47.28 PM.png (725×1 px, 236 KB)

What we're doing

We've basically given up on trying to backtrack through whatever has gone wrong in the past several months, since T124418 investigation basically went nowhere. However, we already have longer-term solutions in the works to fix various aspects of the underlying issues anyways, which will hopefully obviate this whole mess:

Enable XKey support in Varnish (Aug 2016). A key component is T122881 where (after upgrading to varnish4, which is still ongoing) we'll get the XKey vmod going to provide a realistic, scalable solution for problem (4) with content variants.
Deploy EventBus/Kafka support to MediaWiki (2015-2016). Another key component is the EventBus work (T116786), where we hope to centralize purge requests and fan them out to the caches more-reliably without using multicast. We'll probably solve the layer races within EventBus as well by having different subscription topics for different layers and staggering through them, but that's a relatively-minor detail for this ticket.
Shorten CDN expiry to reduce need for purging (2016). We're also looking in T124954 at reducing our maximum cache TTLs pretty dramatically, which makes any minor purge loss a much smaller fallout than it is with today's long TTLs, but that's stalled out a bit while working on the varnish4 -> xkey backlog for the first point.
Introduce MediaWiki rebound purging (2015). To reduce chances of problem (3) happening with race conditions, we added a stop-gap that effectively rolls the dice twice. The purge is repeated once, 20 seconds later, via the job queue. Configured in MediaWiki via $wgCdnReboundPurgeDelay.
Introduce chained purging in vhtcpd (2017). Within a single server (which hosts both a frontend instance and a backend instance) chain the purge processing so that backend is applied before frontend. This reduces chances of problem (3) happening, but does not rule it out because there is no coordination between backends. See also https://gerrit.wikimedia.org/r/382868/ (91cda076 https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging.

Future thoughts

The amount of remaining work to get from where we are today to a better solution is non-trivial. It will probably be months before we've significantly reduced or eliminated purging problems, not weeks or days. In the meantime, we don't have a whole lot of awesome ways to cope with this.

If easy administrative tools to simply re-issue purges (e.g. ?action=purge) do not paper over the problem, our only other recourse is having operators execute manual varnish cache bans. These do not scale on a human level (and in fact, detract from ongoing work including all the above), and they also do not scale well enough at a technical level that we want to automate this and make it any easier to execute them faster.

Currently, the majority of the real pragmatic problems this is causing are on upload.wikimedia.org links for Commons deletions, as seen in e.g. T119038, T109331, T133819, and probably several other duplicates of the same basic thing. A lot of the urgency from requestors on these is driven by a rise in Commons abuse from mobile networks to upload copyvio material (especially through labs-based proxy tools), which Commons admins are having to deal with an alarming rate of. At the rate at which they're deleting copyvio content, and the degree to which they care this content isn't visible from our servers anymore, they're falling into a bucket where they are affected by the general purging issues to a much greater and more-noticeable degree than most.

While that's totally our fault (the missed purges), it should be possible to fix individual cases with ?action=purge sorts of solutions. If it's not, then we have a content-variants problem or some other code problem in the midst of all of this as well.

A lot confusion happens in every ticket about this. Browser caching confuses reporters into thinking the item is still cached by us when it's not. Sometimes they're confused by our multiple geographic endpoints (esams, ulsfo, eqiad, codfw). Even within each datacenter, there are multiple frontend caches to which different users will map, getting inconsistent results when there's an issue. I don't have any good answers for this at the moment.

Regardless, caching isn't the only problem in these cases. The underlying problem of massive copyvio uploads on commons should be addressed on its own in some realistic and relatively-future-proof way that's less burdensome to administrators and operators everywhere, IMHO.

Details

Subject	Repo	Branch	Lines +/-
purged: make Kafka cluster name configurable	operations/puppet	production	+5 -5
cache: make upload consume purges from kafka	operations/puppet	production	+4 -0
cache::text: enable consuming from kafka everywhere	operations/puppet	production	+6 -12
purged: enable consuming from kafka on cp2029 too	operations/puppet	production	+5 -0
cache::text: enable reading purges from kafka on cp2027	operations/puppet	production	+5 -0
purged: add support for kafka	operations/puppet	production	+165 -21
Add integration tests using docker-compose	operations/software/purged	master	+148 -0
Add the ability to consume from kafka	operations/software/purged	master	+589 -1
Add schema for purge events.	mediawiki/event-schemas	master	+95 -0
ATS: stop logging PURGE traffic	operations/puppet	production	+0 -24
reverse-proxy: Disable rebound purges	operations/mediawiki-config	master	+0 -3
htcppurger: per-dc/cluster delay data	operations/puppet	production	+12 -0
purging: put BE before FE in varnishes list	operations/puppet	production	+1 -1

Related Objects
Search...

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	aaron	T97562 WANObjectCache relay daemon or mcrouter support
Resolved	Ottomata	T123954 Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters.
		Restricted Task
Duplicate	None	T109331 Deleted files sometimes remain visible to non-privileged users if permanently linked
Duplicate	None	T133819 upload-lb.ulsfo.wikimedia.org still allow access to some deleted files
Duplicate	BBlack	T119038 Image cache issue when 'over-writing' an image on commons
Resolved	• ema	T133821 Make CDN purges reliable
Resolved	aaron	T124418 Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
Resolved	• ema	T122881 Install XKey vmod
Resolved	• ema	T131499 Upgrade all cache clusters to Varnish 4
Resolved	• ema	T126206 Upgrade to Varnish 4: things to remember
Resolved	• ema	T128788 Port varnishlog.py to new VSL API
Resolved	• ema	T131353 Port remaining scripts depending on varnishlog.py to new VSL API
Resolved	• ema	T131501 Convert misc cluster to Varnish 4
Resolved	• ema	T134989 WDQS empty response - transfer clsoed with 15042 bytes remaining to read
Resolved	• ema	T131502 Convert upload cluster to Varnish 4
Resolved	BBlack	T131761 Solve large-object/stream/pass/chunked in upload cluster better
Resolved	• ema	T142076 Analyze Range requests on cache_upload frontend
Resolved	• ema	T142233 Varnish 4 stalls with two consecutive Range requests using HTTP persistent connections
Resolved	• ema	T131503 Convert text cluster to Varnish 4
Resolved	BBlack	T135696 Sort out vcl_deliver vs vcl_synth mess with v4 VCL
Resolved	• ema	T150660 Post Varnish 4 migration cleanup
Resolved	daniel	T102476 RFC: Requirements for change propagation
Resolved	• GWicke	T84923 Reliable publish / subscribe event bus
Resolved	Ottomata	T88459 Implementing the reliable event bus using Kafka
Invalid	Ottomata	T110748 Event Bus
Resolved	Ottomata	T110750 Investigate improving Confluent REST Proxy and Schema Registry for Event Bus
Resolved	Ottomata	T114443 EventBus MVP
Resolved	RobH	T114191 Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams
		Unknown Object (Task)
		Unknown Object (Task)
		Unknown Object (Task)
		Unknown Object (Task)
Resolved	Ottomata	T121553 setup kafka1001 & kafka1002
Resolved	• Cmjohnson	T121578 Rack 8 new misc servers
Resolved	elukey	T121558 setup kafka2001 & kafka2002
Resolved	RobH	T120885 codfw: rack 8 new misc systems
Resolved	• mobrovac	T116247 Define edit related events for change propagation
Resolved	Eevans	T116786 Integrate eventbus-based event production into MediaWiki
Declined	• csteipp	T120133 security review of ramsey/uuid
Resolved	• csteipp	T120212 Security review of EventBus extension
Resolved	• GWicke	T120409 RESTBase should honor wiki-wide deletion/suppression of users
Resolved	ssastry	T125266 Remove user name and edit comment from html <head>
Resolved	• Pchelolo	T122079 Update EventBus extension to produce User-block events
Resolved	Ottomata	T122077 Define schema for a User-block event
Resolved	Ottomata	T118578 Package EventLogging and dependencies for Jessie
Resolved	Ottomata	T118761 Move EventLogging/server to its own repo and set up CI
Resolved	Ottomata	T118780 Puppetize eventlogging-service
Resolved	• madhuvishy	T118903 Make eventlogging logs configurable via python config file [5 pts] {oryx}
Resolved	Ottomata	T118863 Deploy eventlogging from new repository [5 pts]
Resolved	• madhuvishy	T118869 Send HTTP stats about eventlogging-service to statsd [3 pts]
Resolved	Ottomata	T121112 Build tornado-sprocket python packages
Resolved	• mobrovac	T128463 New Service Request - Change Propagation
Resolved	• mobrovac	T130948 Scap3 promote stage not working
Resolved	BBlack	T124954 Decrease max object TTL in varnishes
		Restricted Task
Resolved	• ema	T249325 cache_text cluster consistently backlogged on purge requests
Open	None	T252127 Improve resource-purge request_id and dt propagation
Resolved	CDanis	T256446 monitoring & alerting for purged

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 385415 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] htcppurger: per-dc/cluster delay data

https://gerrit.wikimedia.org/r/385415

Change 385415 merged by BBlack:
[operations/puppet@production] htcppurger: per-dc/cluster delay data

https://gerrit.wikimedia.org/r/385415

Mentioned in SAL (#wikimedia-operations) [2017-10-20T18:52:25Z] <bblack> vhtcpd upgrade + queue delay puppetization deploy ( https://gerrit.wikimedia.org/r/385415 ) done on cp* fleet - T133821

BBlack closed subtask T124954: Decrease max object TTL in varnishes as Resolved.Oct 23 2017, 3:11 PM

Tgr mentioned this in T167400: Disable serving unpatrolled new files to Wikipedia Zero users.Nov 3 2017, 10:53 PM

This continues to be a pain point for WP0 abuse, and probably a major accident waiting to happen in general (imagine failing to honor DMCA takedown time limits, or some kind of attack material remaining available for days). Are there further steps planned to investigate/resolve the issue?

What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion log search interfaces?

The user-side of deletion logs does not inherently have a search function, unless the specific actions are marked with a tag.

Err, we should really move the sub-conversation back to T171881 . This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse specifically.

zhuyifei1999 mentioned this in T184182: Image thumbnail shows old version.Jan 15 2018, 8:29 PM

zhuyifei1999 merged a task: T184182: Image thumbnail shows old version.Jan 15 2018, 9:45 PM

zhuyifei1999 added a subscriber: Thgoiter.

Thgoiter unsubscribed.Jan 16 2018, 8:34 AM

Xeriphas1994 subscribed.Feb 18 2018, 9:57 AM

daniel closed subtask T102476: RFC: Requirements for change propagation as Resolved.Mar 7 2018, 8:51 PM

Krinkle closed subtask T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan as Resolved.Jul 9 2018, 9:12 PM

ArielGlenn removed a project: Patch-For-Review.Sep 3 2018, 9:09 AM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 8:38 PM

Bawolff added a subtask: Restricted Task.Feb 18 2019, 11:57 AM

RhinosF1 subscribed.Feb 18 2019, 12:11 PM

• mobrovac closed subtask Restricted Task as Resolved.Feb 20 2019, 11:58 PM

Bawolff reopened subtask Restricted Task as Open.Feb 21 2019, 4:30 PM

• mobrovac closed subtask Restricted Task as Resolved.Mar 12 2019, 2:37 PM

Krinkle renamed this task from Content purges are unreliable to Make CDN purges reliable.Apr 6 2020, 5:34 PM

Krinkle added projects: Performance-Team (Radar), Sustainability (MediaWiki-MultiDC).

Krinkle updated the task description. (Show Details)

Change 586390 had a related patch set uploaded (by Krinkle; owner: CDanis):
[operations/mediawiki-config@master] reverse-proxy: Disable rebound purges

https://gerrit.wikimedia.org/r/586390

gerritbot added a project: Patch-For-Review.Apr 6 2020, 6:21 PM

Krinkle added a subtask: T249325: cache_text cluster consistently backlogged on purge requests.Apr 6 2020, 7:41 PM

Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.Apr 6 2020, 7:55 PM

Change 586390 abandoned by CDanis:
reverse-proxy: Disable rebound purges

https://gerrit.wikimedia.org/r/586390

Joe mentioned this in T250205: Reduce rate of purges emitted by MediaWiki.Apr 14 2020, 5:26 PM

• ema mentioned this in T250781: Move all purge traffic to kafka.Apr 21 2020, 8:26 AM

Change 592615 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: stop logging PURGE traffic

https://gerrit.wikimedia.org/r/592615

Change 592615 merged by Ema:
[operations/puppet@production] ATS: stop logging PURGE traffic

https://gerrit.wikimedia.org/r/592615

• ema closed subtask T249325: cache_text cluster consistently backlogged on purge requests as Resolved.Apr 28 2020, 10:45 AM

Since purged is now in production, and that we have some work ongoing that will reduce the amount of purges we send (T250261), I think it's time to revisit the idea of moving purges to Kafka. This would also help with the transition of change-prop to kubernetes.

In order to make purges completely reliable we would need a quite complex setup, but I think we can reap the benefits progressively.

One first step would be to just substitute UDP multicast with kafka as a transport for purges. This would give us a series of advantages:

reliability of transmission
backlog and the ability to restart from a specific offset even upon a purged service restart
Ability to define prioritized queues for e.g. direct edits

In this model, we would need to do the following things:

Define a schema for a "url purge message".
Add the ability to read such messages from multiple kafka topics from purged. In this model, every purged server will be its own consumer group and will read all messages from the topics. Purged should either listen to multicast or consume kafka, not both.
Add a new method to HtmlCacheUpdate to submit messages to eventgate using the schema mentioned above. We should have the ability to pick a topic, depending on the priority of the purge.
Start sending purges on both channels from the jobrunners, and progressively switch purged to consume from kafka.

Here is a diagram representing the purge request flow for a job:

purges-with-kafka.png (605×455 px, 30 KB)

At a later time, we could think of changing the logic, and make purges avoid race conditions, removing the need for the rebound purges.
One way to implement this would be the following:

No more changes are needed at the application layer
All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards

This would ensure there are no fe/be race conditions.

In T133821#6092865, @Joe wrote:

Define a schema for a "url purge message".

If I can throw in another $0.02 here - I would scope this bigger than a URL, and think of it as a schema for purging broader things as well. "Purge a URL" is one kind of purge we have today, and will probably always need as a baseline capability, but we've always wanted to gain the ability to purge on a more semantic sort of level, as with the earlier (never really completed, and now everything has changed) X-Key work. The idea here is the ability to purge alternate K:V sets that can be used to tag small sets (not large swaths, it only scales to small-ish sets well!) of related content. For example, a purge might have a key of type article, and a value like enwiki:Foo, which would purge all of the potentially-many outputs related to enwiki's Foo article (history, various content snippet outputs from APIs, etc), which we'd control by having all the related content outputs contain a special header like X-Key: article=enwiki:foo, and having the caches build alternate lookup indices on these keys to efficiently purge content on them.

I volunteer RESTBase stack to be a guinea pig for this. Currently we already are processing all the purges via kafka. We post a resource_change event into several kafka topics (one for purges derived from direct page editing and one for purges derived from transclusions) for every URI we need to purge. We have a with a change-prop rule in the very end, that translate from kafka messages to HTCP messages and send them over UDP.

So, for RESTBase stack moving to kafka purging will be trivial. Once we have seen the solution working, we can move on to MediaWiki.

In T133821#6094125, @BBlack wrote:

In T133821#6092865, @Joe wrote:

Define a schema for a "url purge message".

If I can throw in another $0.02 here - I would scope this bigger than a URL, and think of it as a schema for purging broader things as well. "Purge a URL" is one kind of purge we have today, and will probably always need as a baseline capability, but we've always wanted to gain the ability to purge on a more semantic sort of level, as with the earlier (never really completed, and now everything has changed) X-Key work. The idea here is the ability to purge alternate K:V sets that can be used to tag small sets (not large swaths, it only scales to small-ish sets well!) of related content. For example, a purge might have a key of type article, and a value like enwiki:Foo, which would purge all of the potentially-many outputs related to enwiki's Foo article (history, various content snippet outputs from APIs, etc), which we'd control by having all the related content outputs contain a special header like X-Key: article=enwiki:foo, and having the caches build alternate lookup indices on these keys to efficiently purge content on them.

Ok, this means we need to create our own schema for this which we will be able to extend as much as we want in the future. I'll work on that first.

Joe added a project: serviceops.Apr 30 2020, 7:48 AM

Looking at our existing event schemas, resource_change has all the information we need, but also much more. We would like to get a much smaller object to transmit, and specifically we only want to define:

uri: the url to purge
root_event_ts: timestamp of the root event causing the purge
tags: [optional] - a set of optional tags to attach to the event.

in fact, the minimal valid message using resource_change as a schema would look like this:

{
    "$schema": "/resource_change/1.0.0",
    "meta": {
        "id": "aaaaaaaa-bbbb-bbbb-bbbb-123456789012",
        "dt": "2020-04-30T11:37:53+02:00",
        "stream": "purge",
        "uri": "https://it.wikipedia.org/wiki/Francesco_Totti"
    },
    "root_event": {
        "dt": "2020-04-24T09:00:00+02:00",
        "signature": ""
    }
}

while we could imagine creating something like:

{
    "$schema": "/resource_purge/1.0.0",
    "uri": "https://it.wikipedia.org/wiki/Francesco_Totti",
    "root_event_dt": "2020-04-24T09:00:00+02:00",
}

which is a 56% reduction in size. I consider that pretty significant, but maybe in kafka's context that doesn't really matter and we prefer standardizing on less schemas. I'll generate a new schema anyways, and we can discuss the merits in a review.

Change 593487 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/event-schemas@master] Add schema for purge events.

https://gerrit.wikimedia.org/r/593487

I like the root event timestamp info. We could potentially put in future rules to help by ignoring ancient purges, in some cases (e.g. if we can guarantee the cache's contents are no older then 24h, we can also ignore root events older than 24h, which might speed up replaying a backlog of data...).

Change 593487 abandoned by Giuseppe Lavagetto:
Add schema for purge events.

https://gerrit.wikimedia.org/r/593487

After a discussion on the patch, it was clearer to me that some information can't be removed from the message, and that makes resource_change the perfect fit for our use-case.

Josve05a unsubscribed.May 1 2020, 6:00 AM

Change 594147 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/software/purged@master] Add the ability to consume from kafka

https://gerrit.wikimedia.org/r/594147

Change 594148 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/software/purged@master] Add integration tests using docker-compose

https://gerrit.wikimedia.org/r/594148

akosiaris subscribed.May 4 2020, 4:13 PM

In T133821#6092867, @Joe wrote:

At a later time, we could think of changing the logic, and make purges avoid race conditions, removing the need for the rebound purges.
One way to implement this would be the following:

No more changes are needed at the application layer

All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.

This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards

This would ensure there are no fe/be race conditions.

I guess we have two categories of rebound purges. The MediaWiki ones (for DB replication lag) and the infrastructure ones for cache tier race mitigation. The proposed scheme would eliminate the later. The former case is largely negligible (only applies to URLs of pages directly edited, not changed via templates/files).

Change 595502 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] purged: add support for kafka

https://gerrit.wikimedia.org/r/595502

In T133821#6118058, @aaron wrote:

In T133821#6092867, @Joe wrote:

At a later time, we could think of changing the logic, and make purges avoid race conditions, removing the need for the rebound purges.
One way to implement this would be the following:

No more changes are needed at the application layer

All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.

This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards

This would ensure there are no fe/be race conditions.

I guess we have two categories of rebound purges. The MediaWiki ones (for DB replication lag) and the infrastructure ones for cache tier race mitigation. The proposed scheme would eliminate the later. The former case is largely negligible (only applies to URLs of pages directly edited, not changed via templates/files).

I think we can basically stop sending rebound purges at this point for anything but direct edits, given how more reliable purged in keeping the queue of outdated purges down.

Change 595905 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache::text: enable reading purges from kafka on cp3050

https://gerrit.wikimedia.org/r/595905

Change 594147 merged by Giuseppe Lavagetto:
[operations/software/purged@master] Add the ability to consume from kafka

https://gerrit.wikimedia.org/r/594147

• Pchelolo added a subtask: T252127: Improve resource-purge request_id and dt propagation.May 12 2020, 4:42 PM

Change 594148 merged by Giuseppe Lavagetto:
[operations/software/purged@master] Add integration tests using docker-compose

https://gerrit.wikimedia.org/r/594148

Joe mentioned this in T252619: Debian-glue doesn't check for the validity of the distribution in the changelog..May 13 2020, 7:42 AM

Mentioned in SAL (#wikimedia-operations) [2020-05-13T09:21:37Z] <_joe_> installing purged 0.11 on cp2028 T133821

Mentioned in SAL (#wikimedia-operations) [2020-05-13T09:32:51Z] <_joe_> installing purged 0.11 on cp2027 T133821

Mentioned in SAL (#wikimedia-operations) [2020-05-13T14:54:58Z] <_joe_> upgrading + restarting purged across ulsfo and codfw T133821

Change 595502 merged by Giuseppe Lavagetto:
[operations/puppet@production] purged: add support for kafka

https://gerrit.wikimedia.org/r/595502

Change 595905 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache::text: enable reading purges from kafka on cp2027

https://gerrit.wikimedia.org/r/595905

Change 596651 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache::text: enable consuming from kafka everywhere

https://gerrit.wikimedia.org/r/596651

Change 597051 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] purged: enable consuming from kafka on cp2029 too

https://gerrit.wikimedia.org/r/597051

Change 597051 merged by Giuseppe Lavagetto:
[operations/puppet@production] purged: enable consuming from kafka on cp2029 too

https://gerrit.wikimedia.org/r/597051

Change 596651 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache::text: enable consuming from kafka everywhere

https://gerrit.wikimedia.org/r/596651

Mentioned in SAL (#wikimedia-operations) [2020-05-18T14:19:27Z] <_joe_> start consuming $dc.resource-purge kafka topic from purged in all of codfw T133821

Mentioned in SAL (#wikimedia-operations) [2020-05-18T14:23:25Z] <_joe_> start consuming $dc.resource-purge kafka topic from purged in all of eqsin, ulsfo T133821

Mentioned in SAL (#wikimedia-operations) [2020-05-18T14:29:03Z] <_joe_> start consuming $dc.resource-purge kafka topic from purged in all of eqiad T133821

Mentioned in SAL (#wikimedia-operations) [2020-05-18T14:33:03Z] <_joe_> start consuming $dc.resource-purge kafka topic from purged in all of esams T133821

Status update: purged is now consuming purges from restbase directly via kafka and not via multicast anymore. This should unblock the complete migration of changeprop to kubernetes, amongst other things.

Change 604430 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: make upload consume purges from kafka

https://gerrit.wikimedia.org/r/604430

Change 604430 merged by Ema:
[operations/puppet@production] cache: make upload consume purges from kafka

https://gerrit.wikimedia.org/r/604430

Mentioned in SAL (#wikimedia-operations) [2020-06-10T16:06:06Z] <ema> cp3051: restart purged to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821

Mentioned in SAL (#wikimedia-operations) [2020-06-10T16:12:07Z] <ema> restart purged on all cache hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821

Mentioned in SAL (#wikimedia-operations) [2020-06-10T16:13:00Z] <ema> correction: restart purged on all *cache_upload* hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821

Change 604743 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] purged: make Kafka cluster name configurable

https://gerrit.wikimedia.org/r/604743

Change 604743 merged by Ema:
[operations/puppet@production] purged: make Kafka cluster name configurable

https://gerrit.wikimedia.org/r/604743

Krinkle moved this task from Later to Current: Performance Team on the Sustainability (MediaWiki-MultiDC) board.Jun 23 2020, 8:52 PM

Nemo_bis added a subtask: T256446: monitoring & alerting for purged.Jun 26 2020, 5:38 PM

CDanis closed subtask T256446: monitoring & alerting for purged as Resolved.Jul 2 2020, 1:58 PM

Krinkle moved this task from Current: Performance Team to Current: SRE on the Sustainability (MediaWiki-MultiDC) board.Jul 2 2020, 2:47 PM

jijiki moved this task from Incoming 🐫 to cross-team active on the serviceops board.Aug 17 2020, 11:45 PM

Krinkle edited projects, added Sustainability; removed Sustainability (MediaWiki-MultiDC).Sep 21 2020, 6:37 PM

ArielGlenn removed a project: Patch-For-Review.Sep 28 2020, 12:35 PM

This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and multicast purging is no more. We might have more to do on rate reduction separately in T250205 , but I don't think that needs to hold this ancient, epic, somewhat ambiguous task open.

Krinkle mentioned this in T318349: Text cluster is being hit with an average of 1.8k PURGE requests per second per host.Sep 23 2022, 1:40 PM

	F3845100: Screen Shot 2016-04-07 at 7.47.28 PM.png
	Apr 28 2016, 12:08 AM

	F31786513: purges-with-kafka.png
	Apr 29 2020, 10:29 AM

Make CDN purges reliableClosed, ResolvedPublicActions