
Reduce rate of purges emitted by MediaWiki
Closed, Declined · Public

Description

Problem statement

The current cache purge request rate is very high, on the order of 150,000 requests per minute. This is two orders of magnitude larger than the edit rate, which is steady at around 1,000 edits per minute.

A significant fraction of these purges is caused by invalidation due to template changes or Wikibase item edits. When a page that other pages link to (which includes templates and Wikidata items) gets edited, it spawns two recursive jobs:

  • htmlCacheUpdate, which invalidates the ParserCache entries and sends a purge to the CDN; its p95 completion time is under a few hours
  • RefreshLinks, which re-parses those same linked pages to refresh the ParserCache; its p95 completion time is around 5 days
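To make the fan-out concrete, here is a minimal sketch (illustrative only; getBacklinkTitles() and jobsForTemplateEdit() are hypothetical names, not the actual MediaWiki job classes) of how one edit to a widely used template or item turns into work for every backlink:

  <?php
  // Illustrative sketch only (not MediaWiki code): how one edit fans out into
  // many cache purges via backlinks. getBacklinkTitles() is a hypothetical helper.
  function getBacklinkTitles( string $templateTitle ): array {
      // A high-use template or Wikidata item can have 10^5..10^6 backlinks.
      return [ 'PageA', 'PageB', 'PageC' /* ...potentially millions more... */ ];
  }

  function jobsForTemplateEdit( string $templateTitle ): array {
      $jobs = [];
      foreach ( getBacklinkTitles( $templateTitle ) as $page ) {
          // htmlCacheUpdate: invalidate the cached HTML and purge the CDN entry.
          $jobs[] = [ 'type' => 'htmlCacheUpdate', 'page' => $page ];
          // RefreshLinks: re-parse the page to refresh its ParserCache entry.
          $jobs[] = [ 'type' => 'refreshLinks', 'page' => $page ];
      }
      return $jobs; // one edit => work proportional to the number of backlinks
  }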

This high rate of purges creates all sorts of scalability and reliability issues (see e.g. T249325 and T133821 for context).

Proposed mitigation

To reduce the number of purges we need to send to the caches, the proposed solution is to:

  • Progressively lower the cache TTL of standard pages to ~1 day
  • Only send the CDN purge from HtmlCacheUpdate for direct edits
  • Otherwise, purge the CDN from RefreshLinks rather than from HtmlCacheUpdate, and only if the standard page cache TTL has not yet expired since the triggering change (see the sketch below)

In this way, editors and logged-in users still get fast updates, mass invalidation of pages can no longer cause a stampede of anonymous-user traffic, and the overall purge rate is reduced for the long tail of dependent page updates.
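As a rough sketch of that flow (hypothetical helper names such as bumpPageTouched(), sendCdnPurge() and reparse(); not the actual job implementation), assuming the lowered ~1-day page CDN TTL:

  <?php
  // Sketch of the proposed split: direct edits purge the CDN immediately;
  // indirect updates purge only from the re-parse job, and only while the
  // CDN copy from before the change could still be live.
  const PAGE_CDN_TTL = 86400; // ~1 day, the proposed lowered TTL

  function bumpPageTouched( string $page ): void { echo "touch $page\n"; }
  function sendCdnPurge( string $page ): void { echo "PURGE $page\n"; }
  function reparse( string $page ): void { echo "reparse $page\n"; }

  function runHtmlCacheUpdate( string $page, bool $isDirectEdit ): void {
      bumpPageTouched( $page );  // still invalidate for logged-in / cache-miss traffic
      if ( $isDirectEdit ) {
          sendCdnPurge( $page ); // only direct edits purge the CDN from here
      }
  }

  function runRefreshLinks( string $page, int $rootChangeTimestamp ): void {
      reparse( $page );          // refresh the ParserCache entry
      if ( time() - $rootChangeTimestamp < PAGE_CDN_TTL ) {
          sendCdnPurge( $page ); // otherwise the old CDN object has already expired
      }
  }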

There was a previous attempt at this in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/295027/, but I think the approach proposed above would address some of the concerns expressed in that code review.

Event Timeline

Joe renamed this task from placeholder: reduce rate of purges emitted by Mediawiki to Reduce rate of purges emitted by Mediawiki. Apr 14 2020, 5:26 PM
Joe updated the task description.
Krinkle renamed this task from Reduce rate of purges emitted by Mediawiki to Reduce rate of purges emitted by MediaWiki. Apr 14 2020, 7:59 PM
Krinkle updated the task description.
Krinkle updated the task description.
Krinkle subscribed.
daniel triaged this task as Medium priority. Apr 14 2020, 8:11 PM
daniel subscribed.

@Joe You are assigned to this ticket; is this something you are going to work on in the code? Or shall we assign someone from the CPT team once we start working on it?

Joe removed Joe as the assignee of this task. Apr 15 2020, 5:30 AM
Joe subscribed.

@Joe You are assigned to this ticket; is this something you are going to work on in the code? Or shall we assign someone from the CPT team once we start working on it?

I would hope the latter :) I am writing a series of tasks coming out of a discussion of current threats and mitigations in light of the ongoing COVID-19 related challenges; this is the first of them.

I'm not fond of the idea of not sending purges for indirect edits, nor of using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

I'm OK with:

  1. Make the CDN cache purge aspect of jobs a no-op if CURRENT_TIME > (rootJobTimestamp + CDN TTL + MAX NORMAL DB LAG); the page_touched part still happens
  2. Lower $wgCdnMaxAge to 1 day; even if the CDN purge didn't run yet, the conditional (If-Modified-Since) cache re-validations from the CDN will pick up updated content if page_touched was updated
  3. Deprioritize and log large high-use template backlink purge jobs a bit more
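A self-contained illustration of the check in point 1 (the constants and the shouldSkipCdnPurge() name are example values, not the actual configuration or the merged code):

  <?php
  // Skip the CDN purge once the cached copy would have expired on its own.
  const CDN_TTL = 86400;          // e.g. $wgCdnMaxAge lowered to 1 day (point 2)
  const MAX_NORMAL_DB_LAG = 60;   // seconds of replication lag we tolerate

  function shouldSkipCdnPurge( int $rootJobTimestamp, int $now ): bool {
      // Anything the CDN cached before the root change has aged out by now,
      // so purging buys nothing; the page_touched bump still happens regardless.
      return $now > $rootJobTimestamp + CDN_TTL + MAX_NORMAL_DB_LAG;
  }

  // Example: a root job from 3 days ago with a 1-day CDN TTL => purge is skipped.
  var_dump( shouldSkipCdnPurge( time() - 3 * 86400, time() ) ); // bool(true)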

Skipping backlink CDN purges for templates/entities/files with millions of backlinks might work, assuming the editor is logged in and we're not showing stale content to logged-out users... I can't think of a clever way to maintain expectations.

I'm not fond of the idea of not sending purges for indirect edits

Agreed. The proposal to stop sending these purges does not stand by itself but rather would be an implementation step of "moving" the purges from one place to another (remove here, add there).

I'm not fond of […] using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

[…] Skipping backlink CDN purges for templates/entities/files with millions of backlinks might work, assuming the editor is logged in and we're not showing stale content to logged-out users... I can't think of a clever way to maintain expectations.

I think for something like making a change to Template:Information or Template:Infobox, it's more valuable for logged-out users that pages get served quickly than for those indirect changes to be applied instantly (e.g. with a cache miss resulting in a 2-60 second blank screen awaiting an expensive reparse, possibly hitting the lower timeout threshold of a GET compared to on-edit/job parses).

As I understand it, that last part is exactly what we're proposing. Although it would work as you'd like for unregistered editors as well, I think? They too get a full session that bypasses the CDN and lasts for days/weeks (tied to their browser, including typical session restore/continuation).

In a nutshell:

  • Still purge from edit.
  • Still bump page_touched recursively from a (quick) job. This means any natural cache miss or passthrough (editor with a session) will still lazily re-parse as needed.
  • Move recursive purges out of the quick job that does the page_touched bumps and into the job that does the re-parses.

As additional optimisations we could:

  • Make the job aware of the CDN TTL and forego the purge if that amount of time has passed since the root job started (currently 4 days I think).

Given that the reparse jobs are graceful (fixed size batches, each subsequent batch goes to the back of the queue), I don't think we need any special code for treating high-use templates differently. Naturally templates with only a few host pages would need only one or two batches and thus make it through the queue presumably within minutes/hours either way.
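For illustration, a rough model of that graceful batching (the batch size and function name are made up; the real jobs slice the backlink set and re-enqueue continuation work at the back of the queue):

  <?php
  // Process a fixed-size slice of backlinks, then hand back a continuation
  // instead of holding the queue for millions of pages at once.
  const BATCH_SIZE = 300; // hypothetical batch size

  function processBacklinkBatch( array $backlinks, int $offset, callable $reparse ): ?array {
      foreach ( array_slice( $backlinks, $offset, BATCH_SIZE ) as $page ) {
          $reparse( $page );
      }
      $next = $offset + BATCH_SIZE;
      // Templates with few backlinks finish in one or two batches; huge ones
      // keep yielding to other jobs between batches.
      return $next < count( $backlinks ) ? [ 'offset' => $next ] : null;
  }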

If we consider 4 days too long as the maximum time it could take for a recursive template update to apply, I'd rather we invest in making the reparse queue work faster (more resources and/or more efficiency) than lower the TTL in general and expose more cache misses and on-demand reparses during page views.

As a long-term direction, I'd actually be interested in exploring what the MDN wiki does: logically forbid re-parsing on GET as a matter of principle (MDN only parses on POST or from jobs, their rev-latest parser cache has no TTL, and when content is known to be stale the user is informed that it might be stale while a high-priority job works to fix it asap). Anyway, that's for a separate task :)

Change 598114 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid HtmlCacheUpdateJob purges for content no longer in CDN

https://gerrit.wikimedia.org/r/598114

Change 598114 merged by jenkins-bot:
[mediawiki/core@master] Avoid HtmlCacheUpdateJob purges for content no longer in CDN

https://gerrit.wikimedia.org/r/598114


In a nutshell:

  • Still purge from edit.
  • Still bump page_touched recursively from a (quick) job. This means any natural cache miss or passthrough (editor with a session) will still lazily re-parse as needed.
  • Move recursive purges out of the quick job that does the page_touched bumps and into the job that does the re-parses.

That could work. The jobs that do the CDN purge would need to check (and possibly update) page_touched for sanity, in case the page_touched jobs have not already run.
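A small sketch of that sanity check (hypothetical function name; the real job would read and, if needed, update the page table's page_touched field):

  <?php
  // Before purging the CDN, make sure page_touched is at least as new as the
  // root change, in case the quick page_touched job has not run yet.
  function touchedTimestampForPurge( int $currentPageTouched, int $rootChangeTimestamp ): int {
      // If the stored value predates the change, bump it so a CDN miss will not
      // re-cache stale parser output against an old Last-Modified time.
      return max( $currentPageTouched, $rootChangeTimestamp );
  }

  // Example: page_touched from yesterday, root change an hour ago => bumped to the change time.
  echo touchedTimestampForPurge( time() - 86400, time() - 3600 ) . "\n";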

To traffic and SRE folks: where are we now on this with regard to load and priority? Was the reduction "enough" that the other ideas about cascading updates can wait until Q2?

akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam (two of them, in fact).

The task was more or less declined by the owners of the subsystem, who decided to go in a completely different direction. The problem is mostly still there, but I don't see a point in leaving this task open at this point.

I think that the biggest win for SRE will be getting all of the CDN layer onto single-host caching, so we can probably stop sending out rebound purges.