Page MenuHomePhabricator

Production load.php spends ~ 10% time doing output compression within PHP
Closed, ResolvedPublic

Assigned To
Authored By
Krinkle
Jan 10 2020, 11:25 PM
Referenced Files
F31667327: Screenshot 2020-03-05 at 19.20.20.png
Mar 26 2020, 6:44 PM
F31551102: Screenshot 2020-02-05 at 17.46.51.png
Feb 5 2020, 6:21 PM
F31551106: Screenshot 2020-02-05 at 17.48.57.png
Feb 5 2020, 6:21 PM
F31551110: Screenshot 2020-02-05 at 17.50.20.png
Feb 5 2020, 6:21 PM
F31551100: Screenshot 2020-02-05 at 17.45.28.png
Feb 5 2020, 6:21 PM
F31551108: Screenshot 2020-02-05 at 17.49.41.png
Feb 5 2020, 6:21 PM
F31551118: export.png
Feb 5 2020, 6:21 PM
F31549054: unset-ae-cp1075-network.png
Feb 5 2020, 2:52 PM

Description

This seems likely caused by the ATS migration given that this has to do with how requests are made internally between ATS backends and Apache.

From reversed flame graphs at https://performance.wikimedia.org/arclamp/svgs/daily/, ctrl-F searching for handleGz:

  • 2019-12-13 / load.php: 0.39% spent in MediaWiki\OutputHandler::handleGzip
  • 2019-12-15 / load.php: 0.55% spent in MediaWiki\OutputHandler::handleGzip
  • 2019-12-17 / load.php: 0.90% spent in MediaWiki\OutputHandler::handleGzip
  • 2019-12-18 / load.php: 1.73% spent in MediaWiki\OutputHandler::handleGzip
  • 2019-12-19 / load.php: 8.90% spent in MediaWiki\OutputHandler::handleGzip
  • 2020-01-09 / load.php: 9.10% spent in MediaWiki\OutputHandler::handleGzip

export.png (1×2 px, 229 KB)

Event Timeline

Krinkle renamed this task from [Regression] Production is making Apache/PHP do gzip output compression to Production load.php spends ~ 10% time doing output compression within PHP.Jan 10 2020, 11:43 PM
Krinkle added a project: Traffic.
Krinkle updated the task description. (Show Details)
Krinkle added a subscriber: ema.

I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer and traffic to not request compressed responses. Looks like this may've been lost in the transition?

I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer and traffic to not request compressed responses.

That's correct. Cache misses in a given PoP, say esams, were routed to the varnish backends in eqiad, then to the appservers. We were removing Accept-Encoding between the eqiad varnishes and the applayer, but honoring it between esams and eqiad varnishes. In other words, varnish in eqiad was taking care of the compression before sending the response back to esams.

With the move to ATS, cache misses in remote PoPs now result in direct requests to the applayer. For this reason we had been passing remove-accept-encoding true to the ATS compress plugin. We found out however that the compress plugin is buggy, disabled it (T238494#5753680), and forgot about Accept-Encoding. We should now unset Accept-Encoding in Lua, keeping an eye on the performance penalty that this might imply. TTFB should decrease, but total transfer times are very likely gonna increase due to to larger amount of data to be transfered.

Change 564005 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "Revert "ATS: unset Accept-Encoding""

https://gerrit.wikimedia.org/r/564005

Change 564005 merged by Ema:
[operations/puppet@production] Reapply "ATS: unset Accept-Encoding"

https://gerrit.wikimedia.org/r/564005

Mentioned in SAL (#wikimedia-operations) [2020-02-04T16:02:13Z] <ema> cp: rolling ats-backend-restart to unset Accept-Encoding before sending origin server requests T242478

Change 570092 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570092

Change 570092 merged by Ema:
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570092

Change 570311 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570311

Change 570311 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570311

Mentioned in SAL (#wikimedia-operations) [2020-02-05T13:41:28Z] <ema> cp1075: unset Accept-Encoding on origin server requests T242478

Change 570336 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570336

Change 570336 merged by Ema:
[operations/puppet@production] ATS: temporarily leave AE untouched

https://gerrit.wikimedia.org/r/570336

Mentioned in SAL (#wikimedia-operations) [2020-02-05T14:13:51Z] <ema> cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478

I have applied the change to cp1075 for some minutes, and the effect on network transfer is a 4x increase.

unset-ae-cp1075-network.png (972×1 px, 106 KB)

Origin server responses from eqiad to esams are ~50M/s currently. A 4x increase would mean 200M/s, hence an additional +150M/s on the eqiad-esams link. According to @ayounsi this is acceptable.

The appservers are currently experiencing a latency/CPU increase due to unrelated events, but when they recover we can apply the change and observe impact both on origin server response size and appservers CPU usage.

Change 570352 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570352

Change 570352 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily leave AE untouched"

https://gerrit.wikimedia.org/r/570352

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:12:00Z] <ema> cp: unset Accept-Encoding from ats-be requests to applayer T242478

Network

Outbound is naturally inflated, as expected. Note though, that this merely restores the status quo from before the ATS migration [last 90 days]

Screenshot 2020-02-05 at 17.46.51.png (566×1 px, 111 KB)

Flame graph

I've confirmed that in the MediaWiki PHP flame graphs, there is no longer the ~10% block spent in handleGzip.

2020-02-05_14 load.reversed2020-02-05_15 load.reversed2020-02-05_16 load.reversed
Screenshot 2020-02-05 at 17.48.57.png (1×2 px, 441 KB)
Screenshot 2020-02-05 at 17.49.41.png (1×2 px, 544 KB)
Screenshot 2020-02-05 at 17.50.20.png (1×2 px, 295 KB)
handleGzip 8.5%handleGzip 3.1%N/A

CPU

In terms of overall CPU usage on app servers, we're not seeing any meaningful change.

export.png (970×2 px, 148 KB)

This is presumably because by request frequency for the app servers overall, most requests presumably involve work that is signifanctly more CPU-bound than the little bit of gzip'ing at the end.

However the change is imho still beneficial because:

  • It avoids all the bugs we previously experienced with doing request compression on Apache and/or in PHP.
  • It lowers the minimum response time for more latency-sensitive endpoints such as load.php. These are heavily cached and for appserver-overall graphs these backend requests don't move the needle much.
  • It offloads compression work away from app servers, instead of synchronously holding up responses on the Apache after MW is done. In particular, the way PHP does this internally makes any form of streaming impossible. And historically (since ~ 2016 I think) we've always done this work on traffic layers, and changing that wasn't an intended part of the ATS migration in December. I'm happy to revisit this as its own change, but preferably on better terms, when we have time for it - not reactionary as now. I propose we keep the restored status quo, an maybe revisit after the summer?

Summary from T247020: load.php?modules=startup miss rate tripled on 2020-02-05:

  • The fix for this task (T242478) made it so that handle compression at the edge again, instead of at the app servers.
  • The way we did it caused the responses to no longer be re-usable between gzip and no-gzip clients, thus tripping the backend request rate (Grafana: resourceloadermodule/startup/rate)

Screenshot 2020-03-05 at 19.20.20.png (509×2 px, 109 KB)

  • This was fixed by fixing a bug in how we handle the Accept-Encoding header in ATS. See T247020 for the patch and further details.