
Reduce rate of purges emitted by MediaWiki
Closed, Declined · Public

Description

Problem statement

The current cache purge request rate is very high, on the order of 150,000 requests per minute. This is two orders of magnitude larger than the edit rate, which is steady at around 1,000 edits per minute.

A significant fraction of these purges is caused by invalidation due to template changes or Wikibase item edits. When a page that other pages link to (which includes templates and Wikidata items) gets edited, it spawns two recursive jobs:

  • htmlCacheUpdate, which invalidates the ParserCache entries and sends a purge to the CDN; its p95 completion time is under a few hours
  • RefreshLinks, which re-parses those same linked pages to refresh the ParserCache; its p95 completion time is around 5 days
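To make the fan-out concrete, here is a minimal sketch (illustrative only; getBacklinkTitles() and jobsForTemplateEdit() are hypothetical names, not the actual MediaWiki job classes) of how one edit to a widely used template or item turns into work for every backlink:

  <?php
  // Illustrative sketch only (not MediaWiki code): how one edit fans out into
  // many cache purges via backlinks. getBacklinkTitles() is a hypothetical helper.
  function getBacklinkTitles( string $templateTitle ): array {
      // A high-use template or Wikidata item can have 10^5..10^6 backlinks.
      return [ 'PageA', 'PageB', 'PageC' /* ...potentially millions more... */ ];
  }

  function jobsForTemplateEdit( string $templateTitle ): array {
      $jobs = [];
      foreach ( getBacklinkTitles( $templateTitle ) as $page ) {
          // htmlCacheUpdate: invalidate the cached HTML and purge the CDN entry.
          $jobs[] = [ 'type' => 'htmlCacheUpdate', 'page' => $page ];
          // RefreshLinks: re-parse the page to refresh its ParserCache entry.
          $jobs[] = [ 'type' => 'refreshLinks', 'page' => $page ];
      }
      return $jobs; // one edit => work proportional to the number of backlinks
  }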

This high rate of purges creates all sorts of scalability and reliability issues (see e.g. T249325 and T133821 for context).

Proposed mitigation

To reduce the number of purges we need to send to the caches, the proposed solution is to:

  • Progressively lower the cache TTL of standard pages to ~1 day
  • Only send the CDN purge from HtmlCacheUpdate for direct edits
  • Otherwise, purge the CDN from RefreshLinks rather than from HtmlCacheUpdate, and only if the standard page cache TTL has not yet expired since the triggering change (see the sketch below)

In this way, editors and logged-in users still get fast updates, mass invalidation of pages can no longer cause a stampede of anonymous-user traffic, and the overall purge rate is reduced for the long tail of dependent page updates.
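As a rough sketch of that flow (hypothetical helper names such as bumpPageTouched(), sendCdnPurge() and reparse(); not the actual job implementation), assuming the lowered ~1-day page CDN TTL:

  <?php
  // Sketch of the proposed split: direct edits purge the CDN immediately;
  // indirect updates purge only from the re-parse job, and only while the
  // CDN copy from before the change could still be live.
  const PAGE_CDN_TTL = 86400; // ~1 day, the proposed lowered TTL

  function bumpPageTouched( string $page ): void { echo "touch $page\n"; }
  function sendCdnPurge( string $page ): void { echo "PURGE $page\n"; }
  function reparse( string $page ): void { echo "reparse $page\n"; }

  function runHtmlCacheUpdate( string $page, bool $isDirectEdit ): void {
      bumpPageTouched( $page );  // still invalidate for logged-in / cache-miss traffic
      if ( $isDirectEdit ) {
          sendCdnPurge( $page ); // only direct edits purge the CDN from here
      }
  }

  function runRefreshLinks( string $page, int $rootChangeTimestamp ): void {
      reparse( $page );          // refresh the ParserCache entry
      if ( time() - $rootChangeTimestamp < PAGE_CDN_TTL ) {
          sendCdnPurge( $page ); // otherwise the old CDN object has already expired
      }
  }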

There was a previous attempt at this in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/295027/, but I think the approach proposed above would address some of the concerns expressed in that code review.

Event Timeline

Joe renamed this task from placeholder: reduce rate of purges emitted by Mediawiki to Reduce rate of purges emitted by Mediawiki. Apr 14 2020, 5:26 PM
Joe updated the task description.
Krinkle renamed this task from Reduce rate of purges emitted by Mediawiki to Reduce rate of purges emitted by MediaWiki. Apr 14 2020, 7:59 PM
Krinkle updated the task description.
Krinkle updated the task description.
Krinkle subscribed.
daniel triaged this task as Medium priority. Apr 14 2020, 8:11 PM
daniel subscribed.

@Joe You are assigned to this ticket; is this something you are going to work on in the code? Or shall we assign someone from the CPT team once we start working on it?

Joe removed Joe as the assignee of this task. Apr 15 2020, 5:30 AM
Joe subscribed.

@Joe You are assigned to this ticket; is this something you are going to work on in the code? Or shall we assign someone from the CPT team once we start working on it?

I would hope the latter :) I am writing a series of tasks coming out of a discussion of current threats and mitigations in light of the ongoing COVID-19 related challenges; this is the first of them.

I'm not fond of the idea of not sending purges for indirect edits, nor of using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

I'm OK with:

  1. Make the CDN cache purge aspect of jobs a no-op if CURRENT_TIME > (rootJobTimestamp + CDN TTL + MAX NORMAL DB LAG); the page_touched part still happens
  2. Lower $wgCdnMaxAge to 1 day; even if the CDN purge didn't run yet, the conditional (If-Modified-Since) cache re-validations from the CDN will pick up updated content if page_touched was updated
  3. Deprioritize and log large high-use template backlink purge jobs a bit more
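A self-contained illustration of the check in point 1 (the constants and the shouldSkipCdnPurge() name are example values, not the actual configuration or the merged code):

  <?php
  // Skip the CDN purge once the cached copy would have expired on its own.
  const CDN_TTL = 86400;          // e.g. $wgCdnMaxAge lowered to 1 day (point 2)
  const MAX_NORMAL_DB_LAG = 60;   // seconds of replication lag we tolerate

  function shouldSkipCdnPurge( int $rootJobTimestamp, int $now ): bool {
      // Anything the CDN cached before the root change has aged out by now,
      // so purging buys nothing; the page_touched bump still happens regardless.
      return $now > $rootJobTimestamp + CDN_TTL + MAX_NORMAL_DB_LAG;
  }

  // Example: a root job from 3 days ago with a 1-day CDN TTL => purge is skipped.
  var_dump( shouldSkipCdnPurge( time() - 3 * 86400, time() ) ); // bool(true)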

Skipping backlink CDN purges for templates/entities/files with millions of backlinks might work, assuming the editor is logged in and we're not showing stale content to logged-out users... I can't think of a clever way to maintain expectations.

I'm not fond of the idea of not sending purges for indirect edits

Agreed. The proposal to stop sending these purges does not stand by itself but rather would be an implementation step of "moving" the purges from one place to another (remove here, add there).

I'm not fond of […] using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

[…] Skipping backlink CDN purges for templates/entities/files with millions of backlinks might work, assuming the editor is logged in and we're not showing stale content to logged-out users... I can't think of a clever way to maintain expectations.

I think for something like making a change to Template:Information or Template:Infobox, it's more valuable for logged-out users that pages get served quickly than for those indirect changes to be applied instantly (e.g. with a cache miss resulting in a 2-60 second blank screen awaiting an expensive reparse, possibly hitting the lower timeout threshold of a GET compared to on-edit/job parses).

As I understand it, that last part is exactly what we're proposing. Although it would work as you'd like for unregistered editors as well, I think? They too get a full session that bypasses the CDN and lasts for days/weeks (tied to their browser, including typical session restore/continuation).

In a nutshell:

  • Still purge from edit.
  • Still bump page_touched recursively from a (quick) job. This means any natural cache miss or passthrough (editor with a session) will still lazily re-parse as needed.
  • Move recursive purges out of the quick job that does the page_touched bumps and into the job that does the re-parses.

As additional optimisations we could:

  • Make the job aware of the CDN TTL and forego the purge if that amount of time has passed since the root job started (currently 4 days I think).

Given that the reparse jobs are graceful (fixed size batches, each subsequent batch goes to the back of the queue), I don't think we need any special code for treating high-use templates differently. Naturally templates with only a few host pages would need only one or two batches and thus make it through the queue presumably within minutes/hours either way.
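For illustration, a rough model of that graceful batching (the batch size and function name are made up; the real jobs slice the backlink set and re-enqueue continuation work at the back of the queue):

  <?php
  // Process a fixed-size slice of backlinks, then hand back a continuation
  // instead of holding the queue for millions of pages at once.
  const BATCH_SIZE = 300; // hypothetical batch size

  function processBacklinkBatch( array $backlinks, int $offset, callable $reparse ): ?array {
      foreach ( array_slice( $backlinks, $offset, BATCH_SIZE ) as $page ) {
          $reparse( $page );
      }
      $next = $offset + BATCH_SIZE;
      // Templates with few backlinks finish in one or two batches; huge ones
      // keep yielding to other jobs between batches.
      return $next < count( $backlinks ) ? [ 'offset' => $next ] : null;
  }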

If we consider 4 days too long as the maximum time it could take for a recursive template update to apply, I'd rather we invest in making the reparse queue work faster (more resources and/or more efficiency) than lower the TTL in general and expose more cache misses and on-demand reparses during page views.

As a long-term direction, I'd actually be interested in exploring what the MDN wiki does: logically forbid re-parsing on GET as a matter of principle (MDN only parses on POST or from jobs, their rev-latest parser cache has no TTL, and when content is known to be stale the user is informed that it might be stale while a high-priority job works to fix it asap). Anyway, that's for a separate task :)

Change 598114 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid HtmlCacheUpdateJob purges for content no longer in CDN

https://gerrit.wikimedia.org/r/598114

Change 598114 merged by jenkins-bot:
[mediawiki/core@master] Avoid HtmlCacheUpdateJob purges for content no longer in CDN

https://gerrit.wikimedia.org/r/598114


In a nutshell:

  • Still purge from edit.
  • Still bump page_touched recursively from a (quick) job. This means any natural cache miss or passthrough (editor with a session) will still lazily re-parse as needed.
  • Move recursive purges out of the quick job that does the page_touched bumps and into the job that does the re-parses.

That could work. The jobs that do the CDN purge would need to check (and possibly update) page_touched for sanity, in case the page_touched jobs have not already run.
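A small sketch of that sanity check (hypothetical function name; the real job would read and, if needed, update the page table's page_touched field):

  <?php
  // Before purging the CDN, make sure page_touched is at least as new as the
  // root change, in case the quick page_touched job has not run yet.
  function touchedTimestampForPurge( int $currentPageTouched, int $rootChangeTimestamp ): int {
      // If the stored value predates the change, bump it so a CDN miss will not
      // re-cache stale parser output against an old Last-Modified time.
      return max( $currentPageTouched, $rootChangeTimestamp );
  }

  // Example: page_touched from yesterday, root change an hour ago => bumped to the change time.
  echo touchedTimestampForPurge( time() - 86400, time() - 3600 ) . "\n";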

To traffic and SRE folks: where are we now on this with regard to load and priority? Was the reduction "enough" that the other ideas about cascading updates can wait until Q2?

akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam (two of them, in fact).

The task was more or less declined by the owners of the subsystem, who decided to go in a completely different direction. The problem is mostly still there, but I don't see a point in leaving this task open at this point.

I think that the biggest win for SRE will be getting all of the CDN layer onto single-host caching, so we can probably stop sending out rebound purges.