
Consider using delayed rebound purges for CDN
Closed, ResolvedPublic

Description

In general, the app cache purging for MediaWiki works like this:

Case I:
a) User changes some asset
b) Cache keys and CDN may be purged
c) User sees the new asset (e.g. via a post-save redirect). ChronologyProtector and the sticky DC cookie make sure they see the new value, and that cache misses on the asset read the changed data when writing back to the cache
d) CDN caches the new asset

Case II:
a) User changes some asset
b) Cache keys and CDN may be purged
c) Some other user views the asset later, by which time the slaves have caught up. They see the new value, and cache misses on the asset read the changed data when writing back to the cache
d) CDN caches the new asset

The slaves and WAN cache quickly converge on the newest values. However, one can imagine another case...

Case III:
a) User changes some asset
b) Cache keys and CDN may be purged
c) Some other user requests the asset before the slaves have caught up (bad luck). They see the old value, and cache misses on the asset read the old data when writing back to the cache. The slaves and WAN cache will still converge to the right value soon. But...
d) CDN caches the old asset and is stuck with it for the full TTL (or until a purge or a new change)

This is not typically a big problem for many assets given that:
a) Rapidly changing dynamic content is usually uncached or has a very low TTL (e.g. RecentChanges)
b) Other assets are less likely to have this kind of coincidence happen (like random pages)

However, popular articles are assets where this is more likely to occur (e.g. "Barack Obama", featured articles, etc.).

Probably the easiest solution is to do a second "rebound" CDN-only purge after ~WANObjectCache::HOLDOFF_TTL, which is effectively the slave lag SLA limit. This could use the job queue and is fairly cheap, since the actual app cache (e.g. parser cache) is not cleared.
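To make Case III and the rebound purge concrete, here is a minimal toy simulation in Python. It is only a sketch and not MediaWiki code: the `ToySite` class, the lag and TTL numbers, and the `HOLDOFF_TTL` value are all made up for illustration. It models one master, one lagged slave, and a CDN that caches whatever the slave returns on a miss.

```python
import heapq

# Illustrative stand-in only; not the real WANObjectCache::HOLDOFF_TTL value.
HOLDOFF_TTL = 11


class ToySite:
    """Toy model: master, one lagged slave, and a CDN with a long TTL."""

    def __init__(self, replica_lag, cdn_ttl):
        self.now = 0.0
        self.master = "v1"
        self.replica = "v1"
        self.cdn = None  # (value, expires_at) or None
        self.replica_lag = replica_lag
        self.cdn_ttl = cdn_ttl
        self._events = []  # heap of (time, seq, callback)
        self._seq = 0

    def _schedule(self, delay, callback):
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, t):
        """Move the clock to absolute time t, running due events in order."""
        while self._events and self._events[0][0] <= t:
            when, _, callback = heapq.heappop(self._events)
            self.now = when
            callback()
        self.now = t

    def purge_cdn(self):
        self.cdn = None

    def edit(self, value, rebound_delay=None):
        """Write to the master, purge the CDN, optionally queue a rebound purge."""
        self.master = value
        self._schedule(self.replica_lag, lambda: setattr(self, "replica", value))
        self.purge_cdn()
        if rebound_delay is not None:
            self._schedule(rebound_delay, self.purge_cdn)

    def view(self):
        """Anonymous page view: served by the CDN, or by the slave on a miss."""
        if self.cdn is not None and self.cdn[1] > self.now:
            return self.cdn[0]
        value = self.replica
        self.cdn = (value, self.now + self.cdn_ttl)
        return value


def run(rebound_delay):
    site = ToySite(replica_lag=3, cdn_ttl=3600)
    site.edit("v2", rebound_delay=rebound_delay)
    site.advance(1)       # unlucky view while the slave is still lagged
    first = site.view()
    site.advance(20)      # slave has caught up; rebound purge (if any) has run
    return first, site.view()


print(run(None))          # no rebound purge
print(run(HOLDOFF_TTL))   # rebound purge after the hold-off period
```

In this model the unlucky view gets the old value either way. Without the rebound purge the CDN keeps serving it for the rest of the TTL, while with the rebound purge the next view after the delay misses the CDN and picks up the new value from the caught-up slave.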

Event Timeline

aaron created this task. Sep 20 2015, 9:31 PM
aaron claimed this task.
aaron raised the priority of this task from to Normal.
aaron updated the task description.
aaron added projects: Availability, Epic.
aaron added subscribers: Krinkle, jcrespo, Glaisher and 12 others.
aaron removed a project: Epic. Sep 20 2015, 11:43 PM
aaron set Security to None.
aaron added a comment. Sep 28 2015, 9:25 PM

To be more robust, one could imagine the following:

Have a memory store (memcached?) that stores URL => timestamp keys. On page view, before sending cache-control headers, the store is checked for the URL. If a key is there due to a purge that happened not long ago, then the cache-control headers will have a low TTL (say 5 seconds).

So for example:
a) user edits, and the normal purge happens
b) user has an edit token and bypasses CDN anyway on post-edit redirect (so they see their change)
c) as an HHVM post-send DeferredUpdate in the original edit request, the store is updated in all DCs *synchronously* setting a key for the URL and a timestamp of the purging. After that finishes, a second purge is issued.
d) any views after the second purge will see the key and use a low TTL if needed, and once the key expires they go back to 30 day headers

aaron added a subscriber: BBlack. Sep 28 2015, 9:45 PM
aaron moved this task from Backlog to Doing on the Availability board. Oct 18 2015, 8:26 PM

Change 252895 had a related patch set uploaded (by Aaron Schulz):
[WIP] Add $wgCdnReboundPurgeDelay for more consistent CDN purges

https://gerrit.wikimedia.org/r/252895

Change 252895 merged by jenkins-bot:
Add $wgCdnReboundPurgeDelay for more consistent CDN purges

https://gerrit.wikimedia.org/r/252895

Change 258365 had a related patch set uploaded (by Aaron Schulz):
[WIP] Configure $wgCdnReboundPurgeDelay

https://gerrit.wikimedia.org/r/258365

Change 258365 merged by jenkins-bot:
Configure $wgCdnReboundPurgeDelay

https://gerrit.wikimedia.org/r/258365

aaron closed this task as Resolved. Jan 20 2016, 4:28 PM