Page MenuHomePhabricator

Number of 5xx (mainly 503) errors surge from text caches since February 17th
Closed, ResolvedPublic

Description

Since February 17 around 20:00Z, we're seeing a steep increase in the number of 5xx responses we get from any datacenter.

Here is the graph for eqiad, where network/connectivity issues can't be the cause of such errors:

Looking at the SAL, I just see a couple of deployments around that time.

This needs investigation as some users are reporting more frequent timeouts when navigating our sites. Note that there was a second step up on February 23rd around 16:00Z, which made things even more worrying.

It's important to note that this does /not/ come from upload caches so it might be Mediawiki-related.

Findings so far

  • The surge comes from text caches
  • The surge is not tied to a specific datacenter and/or varnish backend, so we can exclude any hardware issue to a specific machine.
  • The first surge correlates with the timing of enabling keepalive on text caches; the progressive raise in 503s during one hour also seems to suggest the propagation of a puppet change across the cluster.
  • This still would not explain the second surge. We might be seeing the superposition of two independent effects here.

Details

Related Gerrit Patches:
operations/puppet : productionRevert "cache_misc: disable do_gzip"
operations/puppet : productiondisable do_gzip on cache_text (experiment)

Event Timeline

Joe created this task.Feb 24 2016, 8:47 AM
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptFeb 24 2016, 8:47 AM
Joe claimed this task.Feb 24 2016, 8:47 AM
Joe triaged this task as Unbreak Now! priority.
Joe added a comment.Feb 24 2016, 8:58 AM

So, SAL entries around the time of the two spikes:

February 17th:

20:25 logmsgbot: krinkle@tin Synchronized wmf-config/CommonSettings.php: Re-enable T99096 for mediawiki.org (duration: 01m 29s)
20:23 logmsgbot: catrope@tin Synchronized docroot/: (no message) (duration: 01m 33s)

February 23rd:

16:23 godog: depool restbase1007 and shut for ram upgrade
16:20 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable uploader group at Simple English Wikipedia gerrit:272744 (duration: 01m 32s)
16:16 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: VisualEditor: Switch to Single Edit Tab mode on Hungarian Wikipedia gerrit:270346 (duration: 01m 32s)
16:11 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable the structured language overlay and increase the instrumentation rate gerrit:271264 (duration: 01m 32s)
16:00 godog: shut restbase1009 for cpu/mem upgrade

I'll be looking at puppet changes merged around those times too.

Volans added a subscriber: Volans.Feb 24 2016, 8:58 AM
ema added a subscriber: ema.Feb 24 2016, 9:21 AM
Joe added a comment.Feb 24 2016, 9:30 AM

I am looking at the number of errors from the text caches on various days, focusing on the affected backend as x_cache suggests:

What I found is the following:

Errors on text backends that have sent a 503 spiked from 500/backend/day on february 16th to 7000/backend/day on february 18th.

I can now mostly exclude that this problem is due to any malfunctioning varnish hardware; it seems to be a specifical case of requests getting 503s as response to legit requests from the application layer, apparently.

I'll search for specific url patterns now.

Joe updated the task description. (Show Details)Feb 24 2016, 9:31 AM

An overly long list of puppet stuff before/after the Feb 17th spike

[17:20:41] (CR) Dzahn: [C: 2] partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291 (owner: Dzahn)
[17:47:31] (CR) Dzahn: [C: 2] partman: another fix to db.cfg to allow fully unattended install [puppet] - https://gerrit.wikimedia.org/r/271302 (owner: Papaul)
[18:33:56] (CR) Dzahn: [C: 2] dhcp: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/268706 (owner: Dzahn)
[19:00:12] (CR) Dzahn: [C: 2 V: 2] netboot.cfg - replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/268717 (owner: Dzahn)
[19:03:23] (CR) Alexandros Kosiaris: [C: 2 V: 2] etherpad: Simplify module and role [puppet] - https://gerrit.wikimedia.org/r/271283 (owner: Alexandros Kosiaris)
[19:08:10] (CR) BBlack: [C: 2 V: 2] nginx keepalives: enable for maps+misc [puppet] - https://gerrit.wikimedia.org/r/271317 (https://phabricator.wikimedia.org/T107749) (owner: BBlack)
[19:21:55] (CR) Dzahn: [C: 2 V: 2] dhcpd: fix MAC of es2014, add es2019 [puppet] - https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) (owner: Papaul)
[19:26:05] (CR) BBlack: [C: 2 V: 2] "compiler-checked" [puppet] - https://gerrit.wikimedia.org/r/271321 (owner: BBlack)
[19:34:01] (CR) Ori.livneh: [C: 2 V: 2] mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/271013 (https://
phabricator.wikimedia.org/T99096) (owner: Krinkle)
[19:35:16] (CR) Dzahn: [C: 2 V: 2] ruthenium: Tweak the update_parsoid.sh script to make it more robust [puppet] - https://gerrit.wikimedia.org/r/271323 (owner: S
ubramanya Sastry)
[20:00:24] (CR) BBlack: [C: 2 V: 2] nginx keepalives: enable for text [puppet] - https://gerrit.wikimedia.org/r/271318 (https://phabricator.wikimedia.org/T107749)
(owner: BBlack)
[20:16:20] (CR) Ori.livneh: [C: 2 V: 2] navtiming: tee save timing stats for mediawikiwiki to separate metric [puppet] - https://gerrit.wikimedia.org/r/271341 (ow
ner: Ori.livneh)
[20:25:18] (CR) Ori.livneh: [C: 2 V: 2] mediawiki: Apply public-wiki-rewrites to all remaining wiki domains [puppet] - https://gerrit.wikimedia.org/r/271330 (http
s://phabricator.wikimedia.org/T99096) (owner: Krinkle)
[20:42:40] (CR) Dzahn: [C: 2] beta: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260937 (owner: Dzahn)

ema added a subscriber: BBlack.Feb 24 2016, 9:47 AM

We did enable nginx keepalives for text at 8PM on the 17th https://phabricator.wikimedia.org/T107749#2036351 Could that be the reason for this? @BBlack

elukey added a subscriber: elukey.Feb 24 2016, 9:50 AM
Joe added a comment.Feb 24 2016, 10:00 AM

The timing of the first spike correlates pretty well with that change; this still doesn't explain the second step up.

Joe updated the task description. (Show Details)Feb 24 2016, 10:02 AM
Joe added a comment.EditedFeb 24 2016, 10:19 AM

The second surge is very steep, it happens between 16:01 UTC and 16.:05 UTC on Feb 23rd:
https://graphite.wikimedia.org/render/?width=800&height=600&_salt=1456302551.26&from=15%3A50_20160223&until=16%3A20_20160223&target=sumSeries(varnish.*.text.frontend.request.client.status.5xx.rate)

(these are again eqiad text caches 503s, as measured on the frontend).

This surge speed would seem related to a code deployment and not a puppet change, given the speed of the surge; the timing of mediawiki deployments doesn't correlate though.

ema added a comment.Feb 24 2016, 11:09 AM

A few interesting varnish metrics in esams at the time of the first spike.

ema added a comment.Feb 24 2016, 2:18 PM

We have also noticed an increase in varnish.fetch_chunked as well as a varnish.fetch_close drop.

Change 272976 had a related patch set uploaded (by BBlack):
disable do_gzip on cache_text (experiment)

https://gerrit.wikimedia.org/r/272976

Change 272976 merged by BBlack:
disable do_gzip on cache_text (experiment)

https://gerrit.wikimedia.org/r/272976

Change 272996 had a related patch set uploaded (by BBlack):
disable keepalives on cache_text T127931

https://gerrit.wikimedia.org/r/272996

Change 272996 merged by BBlack:
disable keepalives on cache_text T127931

https://gerrit.wikimedia.org/r/272996

Change 273013 had a related patch set uploaded (by BBlack):
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

Change 273013 merged by BBlack:
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

BBlack closed this task as Resolved.Feb 24 2016, 6:18 PM