Evaluate using 'stale-while-revalidate' HTTP cache control
Closed, ResolvedPublic

Description

Blog post with background information at https://www.mnot.net/blog/2014/06/01/chrome_and_stale-while-revalidate.

I noticed that the Google Font API uses this header. (e.g. https://fonts.googleapis.com/css?family=Open+Sans:400,300,700 on https://performance.wikimedia.org/)

cache-control: private, max-age=86400, stale-while-revalidate=604800
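For reference, the directives in a header like the one above can be pulled apart with a few lines of Python. This is only an illustrative sketch: the helper name `parse_cache_control` is made up for this example, and it handles only simple comma-separated directives, not every quoting rule in the HTTP grammar.

```python
def parse_cache_control(value):
    """Parse a Cache-Control header value into a dict of directives.

    Hypothetical helper for illustration. Valueless directives map to True,
    numeric arguments are converted to int, anything else stays a string.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        name = name.strip().lower()
        if sep:
            arg = arg.strip().strip('"')
            directives[name] = int(arg) if arg.isdigit() else arg
        else:
            directives[name] = True
    return directives


header = "private, max-age=86400, stale-while-revalidate=604800"
parsed = parse_cache_control(header)
print(parsed)
```

In the Google Fonts example, max-age=86400 is one day of normal freshness and stale-while-revalidate=604800 is a further seven days during which a cache may serve the stale copy while refreshing it in the background.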

As of writing, no browser has implemented this yet. Chrome was experimenting with it but has not set a milestone, and Firefox has a tracking bug but no signals so far. However, various web proxies and CDNs do support stale-while-revalidate.

This would give us an immediate improvement in cache-miss latency for ResourceLoader. And, once browsers implement it, it would also take away the latency of common requests throughout a session such as the startup module - which is referenced in all pages, but expires every 5 minutes.

ResourceLoader design doc: https://www.mediawiki.org/wiki/ResourceLoader/Features#Startup_Module

Browsers:

Proxies:

Event Timeline

Krinkle renamed this task from Evaluate use of 'stale-while-revalidate' HTTP cache control to Evaluate using 'stale-while-revalidate' HTTP cache control. Apr 12 2016, 3:33 AM

(From Team Offsite)

This might be interesting for thumbnails as well. About 1 in 30 thumbnail fetches results in a 304, which blocks rendering of those images. stale-while-revalidate would make this asynchronous, a lot like the offline-first fetch used in Service Worker code.

Change 341733 had a related patch set uploaded (by Krinkle):
[mediawiki/core] resourceloader: Add 1 minute stale-while-revalidate (Cache-control)

https://gerrit.wikimedia.org/r/341733

Change 341733 abandoned by Krinkle:
resourceloader: Add 1 minute stale-while-revalidate (Cache-control)

Reason:
Per Task.

https://gerrit.wikimedia.org/r/341733

Krinkle closed this task as Declined. Edited Apr 12 2017, 7:26 PM

I haven't had an opportunity to figure out why this breaks in Varnish. Either way, it is unsuitable for inclusion in MediaWiki by default if something as major as Varnish won't work with it out of the box without some workaround or VCL change.

In addition, I'm not comfortable shipping this in ResourceLoader when no browser has implemented support for it yet. It would mean that it gets silently adopted once they do implement it, and that implementation may not be bug-free. Last time we did that (with requestIdleCallback) we ran into major problems and had to back it out unexpectedly, because Chrome enabled the feature in a stable release while the feature didn't actually work properly.

Once at least one browser implements this we can revisit it and then we can also figure out what's wrong in how it interacts with Varnish.

Chrome is shipping this as of Chrome 75. Time to reconsider!

Firefox too, as of Firefox 68 (2019)
https://bugzilla.mozilla.org/show_bug.cgi?id=1536511

Change 341733 restored by Krinkle:

[mediawiki/core@master] resourceloader: Add 1 minute stale-while-revalidate (Cache-control)

https://gerrit.wikimedia.org/r/341733

I cherry-picked this to the Beta Cluster and confirmed that, after waiting for the existing cache to roll over, the directive makes its way through the traffic layers (ATS, Varnish) and out to the browser. This is important since we have a number of transformations in VCL/ATS that modify Cache-Control response headers.

Observed in Firefox when browsing random page views:

GET https://en.wikipedia.beta.wmflabs.org/w/load.php?lang=en&modules=startup&…

HTTP/2 200 OK
server: deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud
x-powered-by: PHP/7.4.30
cache-control: public, max-age=300, s-maxage=300, stale-while-revalidate=60
date: Thu, 27 Oct 2022 23:44:34 GMT
expires: Thu, 27 Oct 2022 23:49:34 GMT
age: 0
..
x-cache: deployment-cache-text07 hit, deployment-cache-text07 pass
x-cache-status: hit-local
..

As before, the browser re-uses this offline and unconditionally for 5 minutes with e.g. the network panel in devtools showing "200 OK (cached)" with an effective 0 ms roundtrip attributed to the non-request.

After this period, the next page view, despite taking place after the expiry, shows the same "200 OK (cached)" response with the same past-dated response headers. Alongside it is a second request that goes to the server for a fresh roundtrip and receives an HTTP 304 response, which renews the cached response for another 5 minutes. This renewal roundtrip would normally block page rendering but is now async.

Also worth noting is that this background request has an HTTP/2 bandwidth priority of Lowest, which should reduce the chances of it affecting the speed of other late resources such as async/lazy JS and below-the-fold images.
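The three states observed above (fresh, stale but usable while revalidating, and fully stale) can be summarised in a small sketch. The function name `cache_state` is made up for illustration and the boundary handling is simplified; real browser caches compute age and freshness in more detail.

```python
def cache_state(age, max_age, swr):
    """Classify a cached response by its age in seconds.

    Simplified model (assumption): exact boundary seconds are treated as
    already past the boundary.
    """
    if age < max_age:
        # Served from cache with no network request at all.
        return "fresh"
    if age < max_age + swr:
        # Served from cache immediately; revalidated in the background.
        return "stale-while-revalidate"
    # The page has to wait for a conditional request (typically a 304).
    return "stale"


# Values from the load.php response above: max-age=300, stale-while-revalidate=60
for age in (0, 299, 300, 359, 360):
    print(age, cache_state(age, max_age=300, swr=60))
```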

capture-200-cached.png (894×2 px, 181 KB)

capture-304-roundtrip.png (822×2 px, 167 KB)

Change 341733 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Add 1min grace via stale-while-revalidate Cache-Control

https://gerrit.wikimedia.org/r/341733

I wasn't able to conclusively see any major change (up or down) in domComplete, loadEventEnd, or mediaWikiLoadEnd. I checked the global metrics, mobile/desktop metrics, and the p75/p95 of several major browsers that I know support it (Chrome, Chrome Mobile, Firefox; but not Safari/iOS).

This benefit is limited to the 6th minute of a page view session. In the first minute, the startup manifest is naturally fetched. It doesn't block rendering, but as a subresource, async or not, it does count towards domComplete/loadEventEnd, and it delays interactive code (mediaWikiLoadEnd). For the five minutes following that first one, we allow it to be used completely offline. It can't get any better than that during those minutes. In the sixth minute, the new stale-while-revalidate directive allows the browser to use the cached manifest as-is without waiting on a server roundtrip, just like in the previous minutes (the revalidation happens in the background). From the seventh minute onwards, both max-age and SWR have expired, and so the browser has to do a light 304 Not Modified roundtrip that transfers 0 response body bytes but is a roundtrip nonetheless.
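To make the effect on a single session concrete, here is a rough simulation. It is a sketch under simplifying assumptions: the `simulate` helper and the example view times are made up, each background revalidation is assumed to succeed and renew the cached copy at the time of the page view, and "foreground" means any request the page has to wait for (including the very first fetch).

```python
def simulate(view_times, max_age=300, swr=60):
    """Return (foreground, background) request counts for one session.

    view_times: page view timestamps in seconds, in ascending order.
    """
    foreground = background = 0
    stored_at = None  # time the cached copy was last fetched or renewed
    for t in view_times:
        if stored_at is None:
            foreground += 1  # initial fetch, nothing cached yet
            stored_at = t
            continue
        age = t - stored_at
        if age < max_age:
            continue  # fresh: used offline, no request
        if age < max_age + swr:
            background += 1  # stale copy used now, renewed asynchronously
        else:
            foreground += 1  # fully stale: page waits for the 304 roundtrip
        stored_at = t
    return foreground, background


views = [0, 120, 330, 700]  # hypothetical page views, seconds into a session
with_swr = simulate(views)
without_swr = simulate(views, swr=0)
print(with_swr, without_swr)
```

With these made-up view times, the page view at 330 seconds is the only one that changes: it becomes a background renewal instead of a roundtrip the page waits for. Sessions without a page view in that one-minute window see no difference, which matches the reasoning above.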

I decided to check the session length dataset (docs) using Hive:

hive (wmf)> SELECT * FROM wmf.session_length_daily WHERE year=2022 AND month=12 AND day=17 AND wiki="en.wikipedia" ORDER BY session_length LIMIT 10;

For 17 December 2022, on en.wikipedia.org, the first ten minutes break down as follows:

en.wikipedia	0	100%
en.wikipedia	1	26%
en.wikipedia	2	14%
en.wikipedia	3	9%
en.wikipedia	4	6%
en.wikipedia	5	5%
en.wikipedia	6	4% <-- should benefit from stale-while-revalidate
en.wikipedia	7	3%
en.wikipedia	8	3%
en.wikipedia	9	3%

As such, it would not be surprising if the p95 doesn't reflect those 4% of browsing sessions. It might of course, since there's potential for some of the navtiming fragments (e.g. Chrome browser, Firefox browser, Minerva skin) to perhaps be more represented, or have more page views within the same time duration etc. But it's certainly close enough that it may also very well be lost in the noise.

Tooting the usual horn, I believe that with navtiming on Prometheus the higher percentiles like p99 would actually become meaningful, as we could apply them to larger time periods like a day or a week instead of a single minute only, and thus carry enough signal to help us notice these differences. Ref T175087: Create a navtiming processor for Prometheus. I'll close this as resolved and working as intended based on my lab test at T132418#8351613, and as not having caused a regression in RUM data.

Change #1037907 had a related patch set uploaded (by Alistair3149; author: Krinkle):

[mediawiki/core@REL1_39] ResourceLoader: Add 1min grace via stale-while-revalidate Cache-Control

https://gerrit.wikimedia.org/r/1037907

Change #1037907 merged by jenkins-bot:

[mediawiki/core@REL1_39] ResourceLoader: Add 1min grace via stale-while-revalidate Cache-Control

https://gerrit.wikimedia.org/r/1037907