Page MenuHomePhabricator

load.php?modules=startup miss rate tripled on 2020-02-05
Closed, ResolvedPublic

Description

Grafana dashboard: resourceloadermodule/startup/rate

It went from a stable 8 cache-miss backend requests per second (~480/min), to now 20-25 rps (~1300/min).

Close up (link):

(There is no moving average applied. Data points have a 15min resolution window from Graphite)

This suggests it started 2020-02-05 between 15:00-15:30, and graduallly tripped over the hour that followed.

13:38 ema: cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478
13:41 ema: cp1075: unset Accept-Encoding on origin server requests T242478
13:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
13:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
13:46 marostegui: Decrease buffer pool size on db1107 for testing - T242702
14:13 ema: cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478
14:23 vgutierrez: pooling cp5006 - T242093
14:26 XioNoX: push inital flowspec config to all routers
15:12 ema: cp: unset Accept-Encoding from ats-be requests to applayer T242478
15:15 vgutierrez: depooling & reimaging cp5012 as buster - T242093
15:24 effie: Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800
15:29 effie: restart php-fpm on canaries - T236800
16:07 elukey: update puppet compiler's facts
16:25 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T232140 Restore wgLogoHD to wikis without a MinervaCustomLogos defined (duration: 01m 09s)

Event Timeline

Krinkle created this task.Mar 5 2020, 7:29 PM
Restricted Application added projects: Operations, Performance-Team. · View Herald TranscriptMar 5 2020, 7:29 PM
Restricted Application added subscribers: Masumrezarock100, Aklapper. · View Herald Transcript

The Accept-Encoding changes seem most suspect?

Indeed. I suspect that maybe it is now getting both gzip and no-gzip traffic and/or no longer sharing CDN cache between the two. It's meant to internally only do no-gzip to the app layer though.

Mholloway renamed this task from load.php?modules=startup miss rate tripped on 2012-02-05 to load.php?modules=startup miss rate tripped on 2020-02-05.Mar 5 2020, 8:30 PM
Krinkle renamed this task from load.php?modules=startup miss rate tripped on 2020-02-05 to load.php?modules=startup miss rate trippled on 2020-02-05.Mar 5 2020, 8:32 PM
akosiaris triaged this task as High priority.Mar 6 2020, 11:06 AM
akosiaris added a subscriber: ema.
CDanis renamed this task from load.php?modules=startup miss rate trippled on 2020-02-05 to load.php?modules=startup miss rate tripled on 2020-02-05.Mar 6 2020, 11:09 AM
ema added a comment.Mar 6 2020, 12:33 PM

I suspect this is due to the fact that we are unsetting Accept-Encoding in do_global_send_request, which happens right before sending the request to the origin. We are thus unsetting the server request header. However, the check related to proxy.config.http.cache.ignore_accept_encoding_mismatch uses the client request field, which we leave untouched in Lua. It might thus be enough to perform the unsetting in do_global_read_request instead.

Change 577551 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: unset client req Accept-Encoding on ats-be

https://gerrit.wikimedia.org/r/577551

ema added a comment.Mar 9 2020, 9:31 AM

I suspect this is due to the fact that we are unsetting Accept-Encoding in do_global_send_request, which happens right before sending the request to the origin. We are thus unsetting the server request header. However, the check related to proxy.config.http.cache.ignore_accept_encoding_mismatch uses the client request field, which we leave untouched in Lua. It might thus be enough to perform the unsetting in do_global_read_request instead.

Theory confirmed in labs:

# cache hit without Accept-Encoding
ema@traffic-cache-atstext-buster:~$ curl -v -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org" '127.0.0.1:3128/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' 2>&1 | grep X-Cache
< X-Cache-Int: traffic-cache-atstext-buster hit

# cache misses with Accept-Encoding
ema@traffic-cache-atstext-buster:~$ curl -v -H "Accept-Encoding: gzip" -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org" '127.0.0.1:3128/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' 2>&1 | grep X-Cache
< X-Cache-Int: traffic-cache-atstext-buster miss
ema@traffic-cache-atstext-buster:~$ curl -v -H "Accept-Encoding: gzip" -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org" '127.0.0.1:3128/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' 2>&1 | grep X-Cache
< X-Cache-Int: traffic-cache-atstext-buster miss

Also confirmed that the proposed patch https://gerrit.wikimedia.org/r/577551 fixes the problem:

# cache hit with or without AE
ema@traffic-cache-atstext-buster:~$ curl -v -H "Accept-Encoding: gzip" -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org" '127.0.0.1:3128/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' 2>&1 | grep X-Cache
< X-Cache-Int: traffic-cache-atstext-buster hit
ema@traffic-cache-atstext-buster:~$ curl -v  -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.beta.wmflabs.org" '127.0.0.1:3128/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector' 2>&1 | grep X-Cache
< X-Cache-Int: traffic-cache-atstext-buster hit

Change 577551 merged by Ema:
[operations/puppet@production] ATS: unset client req Accept-Encoding on ats-be

https://gerrit.wikimedia.org/r/577551

ema closed this task as Resolved.Mar 9 2020, 10:20 AM
ema claimed this task.

The hitrate is now recovering after applying the patch:

Closing, thanks!