
Consider using delayed rebound purges for CDN
Closed, Resolved · Public

Description

In general, the app cache purging for MediaWiki works like this:

Case I:
a) User changes some asset
b) Cache keys and CDN may be purged
c) User sees the new asset (e.g. via a post-save redirect); ChronologyProtector and the sticky DC cookie ensure they see the new value, and cache misses on the asset write the changed data back to the cache
d) CDN caches the new asset

Case II:
a) User changes some asset
b) Cache keys and CDN may be purged
c) Some other user sees the asset later, and the slaves have caught up by then. They see the new value, and cache misses on the asset write the changed data back to the cache
d) CDN caches the new asset

The slaves and WAN cache quickly converge on the newest values. However, one can imagine another case...

Case III:
a) User changes some asset
b) Cache keys and CDN may be purged
c) Some other user requests the asset before the slaves have caught up (bad luck). They see the old value, and cache misses on the asset write the old data back to the cache. The slaves and WAN cache will still converge to the right value soon. But...
d) CDN caches the old asset and is stuck for the full TTL (or until purge or new changes)
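The Case III race can be sketched as follows. This is a toy simulation with made-up names (Replica, the cdn dict, the URL), not MediaWiki code; it only illustrates how a read from a lagged slave can get pinned in the CDN after a purge.

```python
# Toy model of the Case III race: a reader hits a lagged slave right
# after the purge, and the CDN re-caches the stale value.

class Replica:
    """Stand-in for a DB slave that has not yet applied the edit."""
    def __init__(self):
        self.value = "old"
        self.lagged = True   # replication not yet caught up

    def apply_writes(self):
        self.lagged = False
        self.value = "new"

    def read(self):
        return "old" if self.lagged else self.value


cdn = {}                     # URL -> cached body
replica = Replica()

# a) user edits; b) CDN entry is purged
cdn.pop("/wiki/Example", None)

# c) another reader arrives before replication catches up (bad luck):
body = cdn.get("/wiki/Example") or replica.read()
cdn["/wiki/Example"] = body  # d) CDN caches the OLD asset

replica.apply_writes()       # slaves and WAN cache converge shortly after...
assert cdn["/wiki/Example"] == "old"  # ...but the CDN is stuck until TTL/purge
```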

This is typically not a big problem for most assets, given that:
a) Rapidly changing dynamic content is usually uncached or has a very low TTL (e.g. RecentChanges)
b) Other assets (e.g. random pages) are less likely to hit this kind of coincidence

However, popular articles are assets where this is more likely to occur (e.g. "Barack Obama", featured articles, etc.).

Probably the easiest solution is to issue a second, "rebound", CDN-only purge after ~WANObjectCache::HOLDOFF_TTL, which is the effective slave-lag SLA limit. This could use the job queue and is fairly cheap, since the actual app cache (e.g. the parser cache) is not cleared.
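A minimal sketch of the rebound idea, with a hypothetical delayed job queue (the class and function names here are illustrative, not the actual MediaWiki job queue API): the second purge touches only the CDN, after the hold-off window has passed.

```python
# Sketch of a delayed "rebound" CDN purge via a toy job queue.
# Only the CDN is purged again; application caches are left alone.

HOLDOFF_TTL = 11  # seconds; stands in for WANObjectCache::HOLDOFF_TTL


class ReboundPurgeQueue:
    """Toy job queue holding (ready_at, url) pairs, drained by a worker."""
    def __init__(self):
        self.jobs = []

    def push(self, url, now, delay=HOLDOFF_TTL):
        # Scheduled at purge time; fires once slave lag should be over.
        self.jobs.append((now + delay, url))

    def run_ready(self, now, purge_cdn):
        remaining = []
        for ready_at, url in self.jobs:
            if ready_at <= now:
                purge_cdn(url)      # CDN-only; cheap, no app cache clearing
            else:
                remaining.append((ready_at, url))
        self.jobs = remaining


purged = []
q = ReboundPurgeQueue()
q.push("/wiki/Example", now=0)        # queued alongside the normal purge
q.run_ready(now=5, purge_cdn=purged.append)   # before hold-off: nothing fires
q.run_ready(now=12, purge_cdn=purged.append)  # past hold-off: rebound purge
assert purged == ["/wiki/Example"]
```

Any stale copy the CDN picked up during the lag window (Case III) is evicted by the rebound purge at most HOLDOFF_TTL seconds after the edit.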

Event Timeline

aaron claimed this task.
aaron raised the priority of this task from to Medium.
aaron updated the task description.
aaron added projects: Sustainability, Epic.
aaron added subscribers: Krinkle, jcrespo, Glaisher and 12 others.
aaron set Security to None.

To be more robust, one could imagine the following:

Have a memory store (memcached?) that stores URL => timestamp keys. On page view, before sending cache-control headers, the store is checked for the URL. If a key is there due to a recent purge, then the cache-control headers use a low TTL (say 5 seconds).

So for example:
a) user edits, and the normal purge happens
b) user has an edit token and bypasses CDN anyway on post-edit redirect (so they see their change)
c) as an HHVM post-send DeferredUpdate in the original edit request, the store is updated in all DCs *synchronously*, setting a key for the URL with the timestamp of the purge. After that finishes, a second purge is issued.
d) any views after the second purge will see the key and use a low TTL if needed; once the key expires, they go back to 30-day headers
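The TTL-selection part of the steps above can be sketched like this. The store, constants, and function names are all hypothetical stand-ins (a plain dict instead of memcached, guessed TTL values), just to show the header decision on each page view.

```python
# Sketch of the URL => purge-timestamp store idea: recently purged pages
# get a short Cache-Control TTL so the CDN cannot pin a possibly-stale
# copy for the full 30 days.

LONG_TTL = 30 * 24 * 3600   # normal 30-day CDN headers
LOW_TTL = 5                 # used while slaves may still be lagged
HOLDOFF = 11                # how long a purge entry stays relevant (seconds)

purge_times = {}            # stand-in for the memcached URL => timestamp store


def record_purge(url, now):
    """Called from the edit request when the purge is issued (step c)."""
    purge_times[url] = now


def cache_control_ttl(url, now):
    """Called on page view before emitting Cache-Control headers (step d)."""
    purged_at = purge_times.get(url)
    if purged_at is not None and now - purged_at < HOLDOFF:
        return LOW_TTL      # recent purge: keep the CDN TTL short
    return LONG_TTL         # otherwise, back to long-lived headers


record_purge("/wiki/Example", now=100)
assert cache_control_ttl("/wiki/Example", now=103) == LOW_TTL   # within hold-off
assert cache_control_ttl("/wiki/Example", now=120) == LONG_TTL  # entry expired
assert cache_control_ttl("/wiki/Other", now=103) == LONG_TTL    # never purged
```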

Change 252895 had a related patch set uploaded (by Aaron Schulz):
[WIP] Add $wgCdnReboundPurgeDelay for more consistent CDN purges

https://gerrit.wikimedia.org/r/252895

Change 252895 merged by jenkins-bot:
Add $wgCdnReboundPurgeDelay for more consistent CDN purges

https://gerrit.wikimedia.org/r/252895

Change 258365 had a related patch set uploaded (by Aaron Schulz):
[WIP] Configure $wgCdnReboundPurgeDelay

https://gerrit.wikimedia.org/r/258365

Change 258365 merged by jenkins-bot:
Configure $wgCdnReboundPurgeDelay

https://gerrit.wikimedia.org/r/258365