The current cache purge request rate is very high, on the order of 150,000 requests per minute. That is two orders of magnitude above the edit rate, which holds steady at around 1,000 edits per minute.
A significant fraction of these purges is caused by invalidations due to template changes or Wikibase item edits. When a page that other pages link to or transclude (thus including templates and Wikidata items) gets edited, it spawns two recursive jobs:
- htmlCacheUpdate, which invalidates the ParserCache entries and sends a purge to the CDN. Its p95 completion time is under a few hours
- RefreshLinks, which re-parses those same linked pages to refresh the ParserCache. Its p95 completion time is around 5 days
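To make the fan-out concrete, here is a minimal Python sketch of the mechanism described above; all names (JobQueue, BACKLINKS, on_edit) are illustrative, not MediaWiki's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class JobQueue:
    """Toy stand-in for the MediaWiki job queue."""
    jobs: list = field(default_factory=list)

    def push(self, job):
        self.jobs.append(job)

# Hypothetical backlink table: pages that link to / transclude the key.
BACKLINKS = {
    "Template:Cite": ["Page_A", "Page_B", "Page_C"],
}

def on_edit(page, queue):
    """One edit to a widely-used page enqueues two recursive jobs,
    each covering every dependent page."""
    targets = BACKLINKS.get(page, [])
    # htmlCacheUpdate: drop ParserCache entries, purge the CDN (completes in hours)
    queue.push(("htmlCacheUpdate", targets))
    # RefreshLinks: re-parse the same pages to rebuild ParserCache (takes ~days)
    queue.push(("RefreshLinks", targets))

queue = JobQueue()
on_edit("Template:Cite", queue)
```

A single edit to a template with N backlinks thus translates into N CDN purges plus N re-parses, which is why the purge rate dwarfs the edit rate.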
This high rate of purges creates all sorts of scalability and reliability issues (see e.g. T249325 and T133821 for context).
To reduce the number of purges we need to send to the caches, the proposal is:
- Progressively lower the cache TTL of standard pages to ~1 day
- Send the CDN purge from htmlCacheUpdate only for direct edits
- For indirect (dependency-driven) updates, purge the CDN from RefreshLinks instead, and only if the page's standard cache TTL has not yet expired (once the TTL has lapsed, the cached copy is already gone and no purge is needed)
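The decision logic behind the last two points can be sketched as follows; this is an illustrative Python fragment under the proposal's assumptions, and the function and parameter names are hypothetical:

```python
import time

# Proposed ~1 day TTL for standard pages, in seconds.
CDN_TTL = 24 * 3600

def should_purge_cdn(is_direct_edit, last_rendered_at, now=None):
    """Decide whether a CDN purge should be sent for a page.

    - Direct edits always purge immediately (via htmlCacheUpdate), so
      editors and logged-in users see fast updates.
    - Dependency updates purge later (via RefreshLinks), and only when
      the cached copy may still be alive, i.e. less than CDN_TTL has
      passed since the page was last rendered/purged.
    """
    now = time.time() if now is None else now
    if is_direct_edit:
        return True
    return (now - last_rendered_at) < CDN_TTL
```

The point of the TTL check is that for the long tail of RefreshLinks jobs (p95 around 5 days), the 1-day CDN TTL has usually expired by the time the job runs, so the purge can simply be skipped.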
In this way, we guarantee that editors and logged-in users get fast updates, while preventing mass invalidations of pages from causing a stampede of requests from anonymous users, and we reduce the overall purge rate for the long tail of dependent page updates.
There was a previous attempt at this in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/295027/, but I think the approach proposed above addresses some of the concerns raised in that CR.