
Disable SPDY on cache_text for a week
Closed, Resolved · Public

Description

Following the investigation on T125208, it seems like SPDY is currently a counter-productive choice on text caches. This is due to the combination of our large page bodies and SPDY apparently transmitting the whole page before any dependencies like <head> CSS.

By disabling SPDY on text caches for a week, we should be able to confirm the expected effect on first paint on https://grafana.wikimedia.org/dashboard/db/navigation-timing and https://performance.wikimedia.org/#!/week.

Details

Related Gerrit Patches:
operations/puppet (production): cache_text: re-enable SPDY
operations/puppet (production): disable SPDY for all cache_text
operations/puppet (production): SPDY support toggle, off for cp1008 canary

Event Timeline

Gilles created this task. Feb 5 2016, 2:32 PM
Gilles assigned this task to BBlack.
Gilles raised the priority of this task from to Needs Triage.
Gilles updated the task description. (Show Details)
Gilles added subscribers: Gilles, Peter, ori, Krinkle.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 5 2016, 2:32 PM
BBlack added a project: Traffic. (Edited) Feb 5 2016, 2:56 PM

We've talked about this a bit on IRC. If the perf test week works out, we'll probably want to keep avoiding SPDY and/or HTTP/2 on cache_text until some future date when the "large initial HTML body without inlined CSS" problem goes away (via loading article content separately and/or inlining our top CSS). As for the performance comparison itself, I think we could start early next week (Monday or Tuesday). We'll need to keep in mind some anomalies (not a perfect comparison), but that's always the case. One of the key anomalies in today's data is a slow rolling reboot of all the cache machines (which is going to hurt perf a bit temporarily as empty frontend memory caches refill and such).

Change 268892 had a related patch set uploaded (by BBlack):
SPDY support toggle, off for cp1008 canary

https://gerrit.wikimedia.org/r/268892
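(The puppet change itself isn't reproduced here; in raw nginx terms, such a toggle boils down to including or omitting the spdy parameter on the TLS listener. A minimal sketch, assuming an nginx build with the SPDY module:)

```
# Sketch only, not the actual puppet template.
# With the toggle on, the TLS listener offers spdy/3.1 via NPN/ALPN:
listen 443 ssl spdy;

# With the toggle off, the spdy parameter is dropped and clients
# fall back to HTTP/1.1 over TLS:
listen 443 ssl;
```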

Change 268893 had a related patch set uploaded (by BBlack):
disable SPDY for all cache_text

https://gerrit.wikimedia.org/r/268893

ori triaged this task as Medium priority. Feb 8 2016, 7:52 PM
ori set Security to None.
ori moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.

Change 268892 merged by BBlack:
SPDY support toggle, off for cp1008 canary

https://gerrit.wikimedia.org/r/268892

BBlack added a comment. Feb 9 2016, 4:47 PM

The cache kernel reboots will be done in a few hours. I figure we'll allow the rest of the day for the perf impact there to settle back to "normal", and then push the SPDY change for cache_text early tomorrow.

BBlack added a comment. Feb 9 2016, 4:48 PM

(Also note that pinkunicorn/cp1008 already has SPDY removed. You can locally hack e.g. en.wikipedia.org DNS to point at 208.80.154.42 to see how the waterfall graphs etc. look on this config.)
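(For anyone reproducing that check, a minimal sketch of the local override via /etc/hosts, using the cp1008 address given above:)

```
# /etc/hosts: point en.wikipedia.org at cp1008 (pinkunicorn) to preview
# the no-SPDY config; remove this line again afterwards.
208.80.154.42   en.wikipedia.org
```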

elukey added a subscriber: elukey. Feb 10 2016, 3:03 PM

Change 268893 merged by BBlack:
disable SPDY for all cache_text

https://gerrit.wikimedia.org/r/268893

BBlack added a comment. (Edited) Feb 10 2016, 5:57 PM

Note this went live ~15:10 UTC Feb 10 (spread over several minutes before/after). So far, preliminary data in our graphs looks (to me!) like, in the aggregate of client requests, it's a small net negative for perf.

Are you sure this wasn't 15:10 UTC? Isn't that when the patch was merged?

Yup, sorry, thinko while translating timezones. Updated above too!

Krinkle added a comment. (Edited) Feb 10 2016, 6:59 PM

Last 12 hours compared to the same time last week. It seems that starting in the hour after 15:00 (red mark) there is a noticeable regression.

The graph below shows that the EventLogging schema and its hit rate are themselves unaffected.

https://grafana.wikimedia.org/dashboard/db/performance-metrics
https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=NavigationTiming

I'd say some of those graphs fluctuate so much that we need more data to confirm the pattern. But the TTFB, DOM-Complete, and onLoad ones certainly have a negative trend emerging.

Assuming those trends hold up, we still have to evaluate what we think that means for clients: did we suffer a fairly notable hit to the perf of SPDY-capable clients on fast connections, but make it up with improvements for SPDY-capable clients on slow ones?

Helpful correlating data is also in https://grafana-admin.wikimedia.org/dashboard/db/client-connections, showing the dropoff in SPDY% and the increase in SSL sessionid reuse (presumably from replacing single long-lived SPDY sessions with multiple HTTP/1.1 sessions). Note that those stats are per *request*, not per *connection*.

So, looking at @Krinkle's metrics from above (https://grafana.wikimedia.org/dashboard/db/performance-metrics) from the change to now, we're seeing these shifts in the averages that seem clearly attributable to the SPDY removal:

| Stat        | old    | new    | diff   | diff % |
|-------------|--------|--------|--------|--------|
| TTFB        | 383ms  | 401ms  | +28ms  | +7%    |
| FirstPaint  | 866ms  | 903ms  | +37ms  | +4%    |
| DOMComplete | 1819ms | 1961ms | +143ms | +7%    |
| onLoadEvent | 1903ms | 2041ms | +138ms | +7%    |
| DNS         | 34ms   | 28ms   | -6ms   | -18%   |

Since these are averages across all the clients, if we assume our hypothesis is true that removing SPDY improved performance for some clients (SPDY-capable devices on slow networks), then that means the impact is worse than the above for other affected clients (SPDY-capable devices on faster networks). Can we get some idea from previous testing (or new testing) what the relative impacts to these two kinds of users are, and/or where the crossover point would be in client network performance, etc?
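To make the averaging argument concrete, here is a back-of-the-envelope sketch (the 30/70 split and the 50ms improvement are assumptions for illustration, not measurements): if a fraction p of clients improved by g while the remaining 1-p regressed by r, the observed mean shift is

```latex
% Mean shift as a mixture of two client populations:
%   p = fraction of clients that improved (assumed 0.3 for illustration)
%   g = their mean improvement            (assumed 50 ms)
%   r = mean regression of the remaining clients
\Delta_{\mathrm{mean}} = (1 - p)\,r - p\,g
% With the observed firstPaint shift \Delta_{\mathrm{mean}} = +37 ms:
% r = (37 + 0.3 \cdot 50) / 0.7 \approx 74 ms
```

So if 30% of clients (say, those on slow connections) got 50ms faster, the other 70% would have to be about 74ms slower to produce the observed +37ms average, roughly double the headline number.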

Gilles added a comment. (Edited) Feb 12 2016, 4:26 PM

firstPaint geometric mean per country, comparing Feb 11th and Feb 4th: https://docs.google.com/spreadsheets/d/1oZuFk152g-CRdVnw2aBaIAN3Z-nRciiztje8Mi9p1FM/edit?usp=sharing

I'm not sure what to think of it. Some countries definitely don't have enough samples for the result to be significant. But Denmark, Norway, Austria and Switzerland stand out, having around 1000 records each and looking like they have better firstPaint than before. Could it be that those countries have a higher proportion of people on slow connections in the countryside? Austria and Switzerland have mountains in common, where slow internet is still the standard (at least from my limited personal experience, it seems like mountain areas often have terrible performance for both DSL and mobile).

Note that so far I haven't looked at the mobile site vs the desktop site; that might be worth comparing as well.

Random horrible idea of the day: we could do some crazy hack in nginx code where we measure RTT during the initial part of the handshake and then decide to ignore spdy3/h2 from npn/alpn if RTT > X :)
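(That hack would need changes to nginx's C code, since the protocol is picked during the TLS handshake. As a first step, though, nginx 1.7.7+ already exposes the kernel's RTT estimate as a variable, which could be logged to see where a threshold X might sit; a sketch:)

```
# Sketch: log the kernel's per-connection RTT estimate (microseconds)
# to get a feel for the RTT distribution before attempting the hack.
log_format rtt '$remote_addr $tcpinfo_rtt $tcpinfo_rttvar';
access_log /var/log/nginx/rtt.log rtt;
```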

Could it be that those countries have a higher proportion of people on slow connections in the countryside?

There was major network disruption for some central Europe ISPs (UPC and others) yesterday, BTW.

Anyway, assuming you mean latency:

It's hard to say if latency is a good enough indicator of overall connection quality, but as with this experiment, we can try something like what @BBlack is suggesting and look at the aggregate results.

For what it's worth, I've looked at those countries with strange results again, this time comparing Feb 14th vs Feb 7th, with mobile and desktop separated (same Google doc as before, new tab). The previous results don't hold up, and we can't say that mobile is more affected than desktop. It doesn't seem like my vague theory holds true, and the outage @Nemo_bis mentioned might have affected the dates I looked at previously.

Anyway, Chrome will drop SPDY support on May 15th, so I don't think we should spend too much effort on this. We should just remember to scrutinize the effect of the HTTP/2 rollout when it happens.

If anything, we've just given ourselves a preview of the slight performance hit this Chrome update will cause if we don't support HTTP/2 by then :)

assuming you mean latency

This problem is likely not rooted in latency alone.

There may be cases where, because of bufferbloat and congestion, transmitting the same content over one TCP connection instead of several performs worse for an individual client.

until some future date when the "large initial HTML body without inlined CSS" problem goes away (via loading article content separately and/or inlining our top CSS).

AFAIK SPDY/HTTP2 provide in their design ways to improve this: a) reprioritizing some files over others, and/or b) sending a file the server knows the client will need before it is even requested.
Doing (a) means that when the next request, e.g. for CSS, comes in, the server pauses sending the HTML body, transmits the CSS instead, and only then continues with the body. However, when the bufferbloat is so bad that the full HTML body is already in flight but the client still needs, say, seconds to receive it, that won't help. But at least it would ensure that HTTP2 is always better than HTTP<2 with multiple connections in non-congested situations.
Doing (b) in addition to (a) would mean sending the HTML head, then the CSS that is likely to be requested, then the HTML body, which would work even when the request for the CSS arrives too late to reprioritize it over the HTML body.

Implementing both might be sufficient to make HTTP2 always an improvement, even under congestion, despite not having the advantage of multiple connections. Any idea how complicated that would be? How much support for these does nginx already contain?
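(For the record on the nginx question: nothing at the time of this task; the http2_push directive for (b) only arrived later, in nginx 1.13.9, and applies to HTTP/2 only, not SPDY. A sketch under that assumption, with an illustrative CSS path:)

```
# Sketch assuming nginx >= 1.13.9; paths are illustrative.
server {
    listen 443 ssl http2;

    location = /wiki/Example {
        # (b): push the head CSS before the client requests it,
        # so it doesn't queue behind the large HTML body.
        http2_push /static/head.css;
    }
}
```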

@JanZerebecki - no idea on your questions yet, but really we should look at those questions with the HTTP/2 code rather than the SPDY code, as that's where we'll be in the long run. In any case, I think with 5 days and pretty clear diffs, we've got all the data we need from this current experiment. Will revert configs shortly.

Change 270736 had a related patch set uploaded (by BBlack):
cache_text: re-enable SPDY

https://gerrit.wikimedia.org/r/270736

Change 270736 merged by BBlack:
cache_text: re-enable SPDY

https://gerrit.wikimedia.org/r/270736

@JanZerebecki SPDY doesn't have re-prioritization; only HTTP/2 does. SPDY can only set the priority of an asset at the beginning.

Indeed, it seems like the protocol itself allows for optimizations like the ones you describe. I assume stock servers don't do this because when the server gets a request for some CSS, it has no idea that the CSS comes from the <head>. And each browser differs in the priority values it uses when requesting things, so even if some browsers give different priorities to <head> CSS and async-loaded CSS, it might not be consistent across browsers.

We could definitely devise a custom solution that would let the server know that the CSS is in the head, though.
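A standard building block for that is the preload link relation: the application emits a Link header identifying head CSS, and a server or intermediary that understands it can prioritize or push the asset accordingly. A sketch, with an illustrative path:

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: </static/head.css>; rel=preload; as=style
```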

BBlack closed this task as Resolved. Feb 18 2016, 12:56 AM

This experiment is done.