Page MenuHomePhabricator

[Research] Investigate 30% load.php reqs increase since 2015-07-30
Closed, ResolvedPublic

Description

Starting at this date, lots of metrics from the ResourceLoader dashboard have gone up significantly.

https://grafana.wikimedia.org/#/dashboard/db/resourceloader
View the last 6 months and zoom in on stuff before August 30 (e.g. July 23 - August 30)

(Note: Before you zoom in you'll find that numbers are much lower before September 7, this is because T111170 got fixed; The old numbers can still be trusted in relation to each other.)

Screen Shot 2015-09-18 at 01.27.34.png (806×1 px, 423 KB)

Screen Shot 2015-09-18 at 01.53.25.png (504×887 px, 67 KB)

Something big changed. Mainly the HTTP 304 responses. It should be rare for a browser to make a 304 roundtrip for a long-expiry resource. A 1% HTTP 304 hitrate for long-expiry resource may sound bad but is normal and desired. It doesn't mean the browser often hits HTTP 200 for these instead. Rather, it means these requests are never made. (After all, the point of a long-expiry resource is not just that proxies hold on, browsers as well). As such, it is guaranteed that if a browser is in need of a long-expiry resource it either doesn't have it (HTTP 200) or it does have it and it can be used straight out of local cache. HTTP 304 roundtrips don't make sense for long-expiry resources. Yet, starting 2015-07-30 this metric went from <1% to 60%.

To see whether this was caused by a shift in traffic (e.g. requests that were formerly getting HTTP 200, which'd be a good thing), or that they were previously non-existent requests (e.g. requests previously going to local caches). Looking at the total request count (with movingAverage set to 100 to hide daily fluctuation) the answer is clear:

Screen Shot 2015-09-18 at 02.00.47.png (490×1 px, 113 KB)

Total traffic went from 1.1M to 1.3M and never came back down. And the ratio of requests expecting a 304 response (If-None-Match) went from 30% to 40% here.

Event Timeline

Krinkle raised the priority of this task from to High.
Krinkle updated the task description. (Show Details)
Krinkle subscribed.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Looking more closely, it happened started between 18:00 and 20:00:

Screen Shot 2015-09-18 at 02.08.26.png (466×1 px, 67 KB)

https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=172775#2015-07-30

18:02 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf16
18:33 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: I6665bf31: Use relative URLs to construct load.php requests (duration: 00m 12s)
18:49 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: I6db1771bf4: Use absolute URLs to construct load.php requests (duration: 00m 12s)
...
19:48 logmsgbot: ori Synchronized wmf-config: I0990ac5b: Update URL configuration for mobile when entering mobile mode (duration: 00m 12s)

The 1.26wmf16 switch is a red-herring. The actual cause is the change to mobile load.php urls. Confirmed by the fact that the graph went up, then down again, and finally kept going up. This follows @ori's commits that switched back and forth between mobile and desktop urls, eventually sticking with mobile (for mobile context). These commits address T106966. Mobile was wrongly using the desktop domain for load.php up until 2015-07-30. (See also T112646; which fixed another instance of this.)

This change should've been an improvement (in terms of DNS, SPDY, and more). But made things worse.

With help of @Catrope, I narrowed it down to request headers being malformed for mobile responses.

Desktop:

https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector
cache-control:public, max-age=300, s-maxage=300
https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=c54hDTto
cache-control:public, max-age=2592000, s-maxage=2592000

Mobile:

https://en.m.wikipedia.org/w/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile
s-maxage=300, must-revalidate, max-age=0
https://en.m.wikipedia.org/w/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=minerva&version=c54hDTto
s-maxage=300, must-revalidate, max-age=0

Caused by varnish/mobile-frontend.inc.vcl.erb#L166-L169:

if (resp.http.Cache-Control ~ "s-maxage=[1-9]") {
	set resp.http.Cache-Control = "s-maxage=300, must-revalidate, max-age=0";
} else {
	set resp.http.Cache-Control = "private, s-maxage=0, max-age=0, must-revalidate";
}

Introduced in 2011 with rOPUPe1e13ed3d696: Merge all existing Varnish3 VCL files into one new master file, with include… by @mark.

So mobile performance got worse because up until now mobile users were enjoying load.php requests from other domains that don't have bad Varnish configuration overrides (first bits.wikimedia.org, and later desktop domains; both are fine; only mobile has this override).

Krinkle set Security to None.

I tried to see if I could simply apply https://gerrit.wikimedia.org/r/#/c/219101/7/templates/varnish/text-frontend.inc.vcl.erb to mobile-frontend-inc.vcl.erb , but the logic for mobile is different:

if (resp.http.Cache-Control ~ "s-maxage=[1-9]") {
        set resp.http.Cache-Control = "s-maxage=300, must-revalidate, max-age=0";
} else {
        set resp.http.Cache-Control = "private, s-maxage=0, max-age=0, must-revalidate";
}

This (or any other CC mangling) should not be applied to load.php requests, but I'm not fluent enough in VCL to make that happen.

Change 239826 had a related patch set uploaded (by Krinkle):
varnish: Don't disable Cache-Control for all mobile traffic

https://gerrit.wikimedia.org/r/239826

Change 239826 merged by Ori.livneh:
varnish: Don't disable Cache-Control for all mobile traffic

https://gerrit.wikimedia.org/r/239826

This appears to be resolved now? The percentage of long cache headers in 304 responses went back down to 1-2% yesterday.

Krinkle claimed this task.

Yep. Request total of load.php, long-expiry 304 responses, and If-None-Match percentage all dropped back. Those three metrics show a pretty accurate mirror of the increases they saw last month.