Production load.php spends ~ 10% time doing output compression within PHP
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Krinkle
	Jan 10 2020, 11:25 PM

Description

This seems likely caused by the ATS migration given that this has to do with how requests are made internally between ATS backends and Apache.

From reversed flame graphs at https://performance.wikimedia.org/arclamp/svgs/daily/, ctrl-F searching for handleGz:

2019-12-13 / load.php: 0.39% spent in MediaWiki\OutputHandler::handleGzip
2019-12-15 / load.php: 0.55% spent in MediaWiki\OutputHandler::handleGzip
2019-12-17 / load.php: 0.90% spent in MediaWiki\OutputHandler::handleGzip
2019-12-18 / load.php: 1.73% spent in MediaWiki\OutputHandler::handleGzip
2019-12-19 / load.php: 8.90% spent in MediaWiki\OutputHandler::handleGzip
…
2020-01-09 / load.php: 9.10% spent in MediaWiki\OutputHandler::handleGzip

Details

Subject	Repo	Branch	Lines +/-
Revert "ATS: temporarily leave AE untouched"	operations/puppet	production	+2 -3
ATS: temporarily leave AE untouched	operations/puppet	production	+3 -2
Revert "ATS: temporarily leave AE untouched"	operations/puppet	production	+2 -3
ATS: temporarily leave AE untouched	operations/puppet	production	+3 -2
Reapply "ATS: unset Accept-Encoding"	operations/puppet	production	+17 -0

Customize query in gerrit

Related Objects

Mentioned In: T247020: load.php?modules=startup miss rate tripled on 2020-02-05
T242024: Investigate load.php backend latency and minify-calls regression (November 2019)
Mentioned Here: T247020: load.php?modules=startup miss rate tripled on 2020-02-05
T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS)

Event Timeline

Krinkle created this task.Jan 10 2020, 11:25 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 10 2020, 11:25 PM

Krinkle renamed this task from [Regression] Production is making Apache/PHP do gzip output compression to Production load.php spends ~ 10% time doing output compression within PHP.Jan 10 2020, 11:43 PM

Krinkle added a project: Traffic.

Krinkle updated the task description. (Show Details)

Restricted Application added a project: SRE. · View Herald TranscriptJan 10 2020, 11:43 PM

I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer and traffic to not request compressed responses. Looks like this may've been lost in the transition?

Krinkle updated the task description. (Show Details)Jan 10 2020, 11:45 PM

Jdforrester-WMF subscribed.Jan 11 2020, 12:00 AM

• ema updated the task description. (Show Details)Jan 13 2020, 1:07 PM

• ema moved this task from Backlog to Caching on the Traffic board.Jan 13 2020, 1:10 PM

In T242478#5794025, @Krinkle wrote:

I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer and traffic to not request compressed responses.

That's correct. Cache misses in a given PoP, say esams, were routed to the varnish backends in eqiad, then to the appservers. We were removing Accept-Encoding between the eqiad varnishes and the applayer, but honoring it between esams and eqiad varnishes. In other words, varnish in eqiad was taking care of the compression before sending the response back to esams.

With the move to ATS, cache misses in remote PoPs now result in direct requests to the applayer. For this reason we had been passing remove-accept-encoding true to the ATS compress plugin. We found out however that the compress plugin is buggy, disabled it (T238494#5753680), and forgot about Accept-Encoding. We should now unset Accept-Encoding in Lua, keeping an eye on the performance penalty that this might imply. TTFB should decrease, but total transfer times are very likely gonna increase due to to larger amount of data to be transfered.

Change 564005 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "Revert "ATS: unset Accept-Encoding""

https://gerrit.wikimedia.org/r/564005

gerritbot added a project: Patch-For-Review.Jan 13 2020, 1:33 PM

Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.Jan 13 2020, 9:07 PM

Krinkle assigned this task to • ema.Jan 22 2020, 5:32 PM

Change 564005 merged by Ema:
[operations/puppet@production] Reapply "ATS: unset Accept-Encoding"

https://gerrit.wikimedia.org/r/564005

Mentioned in SAL (#wikimedia-operations) [2020-02-04T16:02:13Z] <ema> cp: rolling ats-backend-restart to unset Accept-Encoding before sending origin server requests T242478

Maintenance_bot removed a project: Patch-For-Review.Feb 4 2020, 4:10 PM

Change 570092 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570092

gerritbot added a project: Patch-For-Review.Feb 4 2020, 4:52 PM

Change 570092 merged by Ema:
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570092

Maintenance_bot removed a project: Patch-For-Review.Feb 4 2020, 5:11 PM

Change 570311 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570311

gerritbot added a project: Patch-For-Review.Feb 5 2020, 12:33 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-05T13:38:25Z] <ema> cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478

Change 570311 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570311

Mentioned in SAL (#wikimedia-operations) [2020-02-05T13:41:28Z] <ema> cp1075: unset Accept-Encoding on origin server requests T242478

Change 570336 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570336

Change 570336 merged by Ema:
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570336

Mentioned in SAL (#wikimedia-operations) [2020-02-05T14:13:51Z] <ema> cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478

I have applied the change to cp1075 for some minutes, and the effect on network transfer is a 4x increase.

unset-ae-cp1075-network.png (972×1 px, 106 KB)

Origin server responses from eqiad to esams are ~50M/s currently. A 4x increase would mean 200M/s, hence an additional +150M/s on the eqiad-esams link. According to @ayounsi this is acceptable.

The appservers are currently experiencing a latency/CPU increase due to unrelated events, but when they recover we can apply the change and observe impact both on origin server response size and appservers CPU usage.

Change 570352 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570352

Change 570352 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570352

Maintenance_bot removed a project: Patch-For-Review.Feb 5 2020, 3:10 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:12:00Z] <ema> cp: unset Accept-Encoding from ats-be requests to applayer T242478

Network

Dashboard: Cluster overview (eqiad appservers)

Outbound is naturally inflated, as expected. Note though, that this merely restores the status quo from before the ATS migration [last 90 days]

Screenshot 2020-02-05 at 17.46.51.png (566×1 px, 111 KB)

Flame graph

I've confirmed that in the MediaWiki PHP flame graphs, there is no longer the ~10% block spent in handleGzip.

2020-02-05_14 load.reversed	2020-02-05_15 load.reversed	2020-02-05_16 load.reversed

handleGzip 8.5%	handleGzip 3.1%	N/A

CPU

In terms of overall CPU usage on app servers, we're not seeing any meaningful change.

This is presumably because by request frequency for the app servers overall, most requests presumably involve work that is signifanctly more CPU-bound than the little bit of gzip'ing at the end.

However the change is imho still beneficial because:

It avoids all the bugs we previously experienced with doing request compression on Apache and/or in PHP.
It lowers the minimum response time for more latency-sensitive endpoints such as load.php. These are heavily cached and for appserver-overall graphs these backend requests don't move the needle much.
It offloads compression work away from app servers, instead of synchronously holding up responses on the Apache after MW is done. In particular, the way PHP does this internally makes any form of streaming impossible. And historically (since ~ 2016 I think) we've always done this work on traffic layers, and changing that wasn't an intended part of the ATS migration in December. I'm happy to revisit this as its own change, but preferably on better terms, when we have time for it - not reactionary as now. I propose we keep the restored status quo, an maybe revisit after the summer?

Krinkle closed this task as Resolved.Feb 5 2020, 6:22 PM

• mmodell subscribed.Feb 5 2020, 8:22 PM

Krinkle mentioned this in T247020: load.php?modules=startup miss rate tripled on 2020-02-05.Mar 5 2020, 7:29 PM

Summary from T247020: load.php?modules=startup miss rate tripled on 2020-02-05:

The fix for this task (T242478) made it so that handle compression at the edge again, instead of at the app servers.
The way we did it caused the responses to no longer be re-usable between gzip and no-gzip clients, thus tripping the backend request rate (Grafana: resourceloadermodule/startup/rate)

Screenshot 2020-03-05 at 19.20.20.png (509×2 px, 109 KB)

This was fixed by fixing a bug in how we handle the Accept-Encoding header in ATS. See T247020 for the patch and further details.

	F31667327: Screenshot 2020-03-05 at 19.20.20.png
	Mar 26 2020, 6:44 PM

	F31551102: Screenshot 2020-02-05 at 17.46.51.png
	Feb 5 2020, 6:21 PM

	F31551106: Screenshot 2020-02-05 at 17.48.57.png
	Feb 5 2020, 6:21 PM

	F31551110: Screenshot 2020-02-05 at 17.50.20.png
	Feb 5 2020, 6:21 PM

	F31551100: Screenshot 2020-02-05 at 17.45.28.png
	Feb 5 2020, 6:21 PM

	F31551108: Screenshot 2020-02-05 at 17.49.41.png
	Feb 5 2020, 6:21 PM

Production load.php spends ~ 10% time doing output compression within PHPClosed, ResolvedPublicActions