
Study performance impact of disabling TCP selective acknowledgments
Closed, ResolvedPublic

Description

Yesterday we disabled TCP SACKs [1] on the caches/bastions to mitigate a remote DoS in the TCP implementation of the Linux kernel [2].

SACKs were disabled on the caches between 17:30 and 18:00 UTC on Mon 17th, and on the remaining hosts starting at 8:20 UTC on Tue 18th (with Puppet runs taking up to 30 minutes after the merge of the patch).

Once a fixed kernel is fully deployed we can consider whether to re-enable TCP SACKs. Various options:

  • If there's no visible performance impact, we can keep it disabled (reduces exploitable complexity)
  • If there's a performance regression, re-enable on the caches only
  • If there's a performance regression, re-enable fleet-wide if running a fixed Linux kernel (Puppet can apply this gradually as reboots happen)

Footnotes:
[1] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Selective_acknowledgments
[2] https://github.com/Netflix/security-bulletins/blob/master/advisories/third-party/2019-001.md
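
For context, the knob in question is the net.ipv4.tcp_sack sysctl (in production this is managed via Puppet; the sketch below only illustrates the sysctl itself and is not our actual manifest):

    # Minimal sketch, assuming a Linux host and root privileges.
    # net.ipv4.tcp_sack is exposed at /proc/sys/net/ipv4/tcp_sack
    # (1 = SACKs enabled, 0 = disabled).
    from pathlib import Path

    TCP_SACK = Path("/proc/sys/net/ipv4/tcp_sack")

    def set_tcp_sack(enabled: bool) -> None:
        """Toggle TCP selective acknowledgments at runtime."""
        TCP_SACK.write_text("1\n" if enabled else "0\n")

    def tcp_sack_enabled() -> bool:
        """Report the current state of the sysctl."""
        return TCP_SACK.read_text().strip() == "1"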

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 18 2019, 9:36 AM

loadEventEnd seems to have regressed around the time the change was deployed. In the week-over-week you can see that the curves seem to diverge past that point, increasing the week-over-week difference:

Ideally you want to re-enable this for half the fleet to verify, as daily/weekly traffic patterns add a lot of noise to the data. As long as a patched kernel is on the short horizon, I think we can wait for that before conducting such a test.

ema triaged this task as Normal priority. Jun 18 2019, 9:58 AM
ema added a project: Traffic.
ema added a comment. Jun 18 2019, 10:01 AM

As long as a patched kernel is on the short horizon, I think we can wait for that before conducting such a test.

Very short: the first cache reboots with a patched kernel are planned for this afternoon (EU time).

ema moved this task from Triage to Caching on the Traffic board. Jun 18 2019, 10:03 AM
ema added a comment. Jun 19 2019, 7:52 AM

loadEventEnd seems to have regressed around the time the change was deployed

I'm gonna acknowledge the alert in Icinga as we're aware of this and there's no immediate action to take.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T11:07:13Z] <ema> cache nodes: pause rolling reboots for kernel and varnish upgrades T224694 T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-19T12:20:48Z] <ema> cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-19T16:01:26Z] <ema> cache nodes: stop rolling reboots for today, 47/80 done T224694 T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-20T09:17:36Z] <ema> cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 T225998 T226048

ema added a comment. Jun 21 2019, 2:29 PM

All cache nodes are now running Linux 4.9.168-1+deb9u3. Next week we can thus re-enable SACKs on part of the cache fleet and further evaluate performance implications.

CDanis added a subscriber: CDanis. Jun 21 2019, 3:58 PM
Krinkle added a subscriber: Krinkle. Edited Jun 23 2019, 1:44 PM

loadEventEnd seems to have regressed around the time the change was deployed. In the week-over-week you can see that the curves seem to diverge past that point, increasing the week-over-week difference:

Indeed, from the Navigation Timing dashboard, loadEventEnd shows a global increase of ~100 ms at the 75th percentile compared to last week, starting around June 15th (two days before this task was created) and still ongoing today.

And it affects both desktop and mobile (link):

But based on other nav timing metrics, this might not be network-related.

responseStart

Did go up around the same date, but only by 10 milliseconds, and has since come back down, whereas the load time regression hasn't.

response

navtiming-response (responseStart to responseEnd) did not go up on the same day; it only went up several days later, by less than 3 milliseconds, and has since come back down.

The domInteractive (until end of HTML download/parse, not sub-resources or paint) metric and the firstPaint metric show a similar pattern to the responseStart metric. They went up by a small amount (around 10 milliseconds) on the same day (June 14-15), but have since come back down.

The ones that remained up and correlate with loadEventEnd are:

  • domComplete and loadEventStart (naturally, nearly identical to loadEventEnd).
  • mediaWikiLoadEnd (nav start to JS settle).
  • processing (responseEnd to domComplete).

This suggests it is either related to client-side rendering (amount of data or the speed at which it can be processed), or related to secondary requests (e.g. the HTML req might be fine, but CSS/JS/images are more affected by the networking issue).
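
To make the relationships between these metrics concrete, here is a small sketch of the derived intervals mentioned above in terms of the raw Navigation Timing marks (the definitions mirror this comment; the example numbers are made up):

    # Minimal sketch; all values are milliseconds relative to navigationStart.
    def derived_metrics(t: dict) -> dict:
        """Compute the intervals discussed above from raw Navigation Timing marks."""
        return {
            # time to first byte of the HTML response
            "responseStart": t["responseStart"],
            # responseStart -> responseEnd: the HTML transfer itself
            "response": t["responseEnd"] - t["responseStart"],
            # responseEnd -> domComplete: parsing, sub-resource loading, rendering
            "processing": t["domComplete"] - t["responseEnd"],
            # overall page load
            "loadEventEnd": t["loadEventEnd"],
        }

    example = {
        "responseStart": 180, "responseEnd": 320,
        "domComplete": 1250, "loadEventEnd": 1282,
    }
    print(derived_metrics(example))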

Mentioned in SAL (#wikimedia-operations) [2019-06-24T10:28:23Z] <moritzm> re-enabling TCP SACKs on cp5001-cp5003 (half of Varnish/upload in eqsin) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T10:36:58Z] <moritzm> re-enabling TCP SACKs on cp5007-cp5009 (half of Varnish/text in eqsin) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:04:13Z] <moritzm> re-enabling TCP SACKs on cp1075-1082 (half of Varnish/text and Varnish/upload in eqiad) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:10:37Z] <moritzm> re-enabling TCP SACKs on cp2001,2002,2004-2008,2010,2011, 2014, 2017 (half of Varnish/text and Varnish/upload in codfw) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:19:03Z] <moritzm> re-enabling TCP SACKs on cp3040-cp3047, cp3049 (half of Varnish/text and Varnish/upload in esams) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:26:25Z] <moritzm> re-enabling TCP SACKs on cp4024-4029 (half of Varnish/text and Varnish/upload in ulsfo) T225998

Gilles claimed this task. Jun 24 2019, 8:01 PM
Gilles moved this task from Inbox to Doing on the Performance-Team board.

Breakdown of servers and their config

eqsin:
Enabled: cp5001-cp5003, cp5007-cp5009
Disabled: cp5004-cp5006, cp5010-cp5012

ulsfo:
Enabled: cp4024-cp4029
Disabled: cp4021-cp4023, cp4030-cp4032

esams:
Enabled: cp3040-cp3047, cp3049
Disabled: cp3030, cp3032-cp3039

codfw:
Enabled: cp2001,cp2002,cp2004-cp2008,cp2010,cp2011,cp2014,cp2017
Disabled: cp2012,cp2013,cp2016,cp2018-cp2026

eqiad:
Enabled: cp1075-cp1082
Disabled: cp1083-cp1090
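
As a rough sketch, page-load samples can be bucketed against this breakdown by the cache frontend that received their /beacon request (the real analysis used the Hive queries in P8650; the recvfrom hostname format and the helpers below are assumptions for illustration):

    # Hypothetical sketch of the A/B classification; host lists copied from
    # the breakdown above, ranges expanded.
    def expand(prefix: str, start: int, end: int) -> set:
        """Expand e.g. ('cp', 5001, 5003) into {'cp5001', 'cp5002', 'cp5003'}."""
        return {f"{prefix}{n}" for n in range(start, end + 1)}

    SACK_ENABLED = (
        expand("cp", 5001, 5003) | expand("cp", 5007, 5009)   # eqsin
        | expand("cp", 4024, 4029)                            # ulsfo
        | expand("cp", 3040, 3047) | {"cp3049"}               # esams
        | {"cp2001", "cp2002"} | expand("cp", 2004, 2008)
        | {"cp2010", "cp2011", "cp2014", "cp2017"}            # codfw
        | expand("cp", 1075, 1082)                            # eqiad
    )

    def bucket(recvfrom: str) -> str:
        """Assign a sample to a group by the frontend that got its /beacon request."""
        host = recvfrom.split(".")[0]  # assumption: e.g. 'cp5001.eqsin.wmnet'
        return "SACK enabled" if host in SACK_ENABLED else "SACK disabled"

    print(bucket("cp5001.eqsin.wmnet"))  # SACK enabled
    print(bucket("cp5004.eqsin.wmnet"))  # SACK disabled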

Gilles added a comment. Edited Jun 25 2019, 8:26 AM

The results are in, looking at loadEventEnd from yesterday 14:00 GMT until now.

status        | median | p90  | p95     | sample size
SACK enabled  | 1282   | 4079 | 6338.75 | 335806
SACK disabled | 1291   | 4259 | 6783    | 339182
difference    | 0.7%   | 4.2% | 6.5%    |

It's definitely worth re-enabling SACK everywhere, as the performance difference is significant, particularly for users with slower/less reliable internet connections.

Hive queries used, for reference: P8650
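
For illustration, the kind of summary behind the table above can be sketched in plain Python (the actual numbers come from the Hive queries; note the difference row appears to be computed relative to the SACK-disabled group):

    import statistics

    def summarize(load_event_end_ms):
        """Median, p90, p95 and sample size for a list of loadEventEnd values."""
        q = statistics.quantiles(load_event_end_ms, n=100, method="inclusive")
        return {"median": statistics.median(load_event_end_ms),
                "p90": q[89], "p95": q[94], "n": len(load_event_end_ms)}

    def improvement(enabled, disabled):
        """Percentage by which each percentile is lower with SACK enabled,
        relative to the SACK-disabled group (matches the difference row above)."""
        return {k: 100 * (disabled[k] - enabled[k]) / disabled[k]
                for k in ("median", "p90", "p95")}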

The internal recvfrom field from EventLogging refers to the Varnish frontend that received the /beacon request (similar to the last hostname in the XCS response header). AFAIK that does not correlate with the multiple requests that made up the page load, as each of those requests is independently distributed by LVS.

Looking at the critical non-image requests (HTML, CSS, JS) for a logged-out page view, I see responses from cp3042 and some from cp3030, and images from separate cp hosts as well (text cluster vs. upload cluster).

Gilles added a comment. Edited Jun 25 2019, 8:12 PM

It does for text: when your IP gets hashed to a specific Varnish frontend, you get all "small" requests from it, i.e. anything below the size threshold to be stored in Varnish frontends.

Remember that x-cache headers are read from right to left. Trying this out right now with a cleared cache and cleared local storage, I get the following headers for all assets on the page:

Non-images

x-cache: cp1079 hit/1, cp3030 hit/1, cp3041 miss
x-cache: cp1087 hit/7, cp3041 hit/3, cp3041 hit/140
x-cache: cp1087 hit/7, cp3030 hit/6, cp3041 hit/3267
x-cache: cp1079 hit/9, cp3033 hit/2, cp3041 hit/5468
x-cache: cp1085 hit/5, cp3032 hit/3, cp3041 hit/4463
x-cache: cp1089 hit/1, cp3033 hit/3, cp3041 hit/1
x-cache: cp1083 hit/7, cp3030 hit/5, cp3041 hit/1494482 (image/icon served by MediaWiki)
x-cache: cp1081 hit/4, cp3041 hit/3, cp3041 hit/1458154 (image/icon served by MediaWiki)
x-cache: cp1075 hit/4, cp3030 hit/4, cp3041 hit/529456 (image/icon served by MediaWiki)
x-cache: cp1087 hit/1, cp3041 hit/2, cp3041 hit/675
x-cache: cp1085 hit/5, cp3040 hit/2, cp3041 hit/208723
x-cache: cp1083 hit/10, cp3042 hit/5, cp3041 hit/2938509 (image/icon served by MediaWiki)
x-cache: cp1087 hit/1, cp3032 hit/1, cp3041 hit/2138
x-cache: cp1081 pass, cp3030 hit/3, cp3041 hit/47756
x-cache: cp1087 hit/9, cp3032 hit/3, cp3041 hit/29240

Images

x-cache: cp3036 hit, cp3044 hit/1308
x-cache: cp3046 hit, cp3044 hit/103289
x-cache: cp3043 hit, cp3044 hit/1
x-cache: cp3043 hit, cp3044 hit/474536

So it was cp3041 every time for me, for anything besides images.

ema added a comment. Jun 26 2019, 6:03 AM

Remember that x-cache headers are read from right to left.

More details here: https://wikitech.wikimedia.org/wiki/Varnish#X-Cache
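
As a small illustration of the right-to-left reading, here is a sketch that parses the headers quoted above (format inferred from those samples, not taken from the wiki page):

    def parse_x_cache(value: str):
        """Return cache hops ordered from the user-facing frontend outward."""
        hops = [tuple(entry.split()[:2]) for entry in value.split(",")]
        # The rightmost entry is the frontend that answered the client,
        # the leftmost is the cache closest to the application layer.
        return list(reversed(hops))

    print(parse_x_cache("cp1079 hit/1, cp3030 hit/1, cp3041 miss"))
    # [('cp3041', 'miss'), ('cp3030', 'hit/1'), ('cp1079', 'hit/1')]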

Change 519193 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Re-enable TCP selective acknowledgements on hosts running a fixed kernel

https://gerrit.wikimedia.org/r/519193

Change 519193 merged by Muehlenhoff:
[operations/puppet@production] Re-enable TCP selective acknowledgements on hosts running a fixed kernel

https://gerrit.wikimedia.org/r/519193

ema added a comment. Jul 5 2019, 12:02 PM

@Gilles: is there anything left to be done here? Other than blogging about the results that is. :-)

Nothing left to do here, I believe.

MoritzMuehlenhoff closed this task as Resolved. Aug 20 2019, 6:28 AM

Yeah, I think we can close this one.