Investigate major HTTP 500 spike since 2016-09-23
Closed, Resolved, Public

Description

https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?from=1470009600000&to=1479463200000

Screen Shot 2016-11-18 at 10.40.08.png (218×936 px, 63 KB)

Screen Shot 2016-11-18 at 10.40.14.png (718×2 px, 222 KB)

Screen Shot 2016-11-18 at 10.41.15.png (732×2 px, 121 KB)

Cherry-picked entries from the Server Admin Log around this date (https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=993285#2016-09-22):

2016-09-22
07:33 elukey: rebooting stat1004 for kernel upgrades
07:40 moritzm: rolling restart of trusty swift frontend servers in codfw for kernel security update
07:52 elukey: rebooted stat100[23] for kernel upgrades
07:58 elukey: uploaded varnishkafka 1.0.12-1 to reprepro
08:35 elukey: restarted varnishkafka on cp1099 (log abandoned)
08:40 elukey: installed varnishkafka 1.0.12 on cp1099
08:43 elukey: installing varnishkafka 1.0.12 on cache:upload esams
09:02 elukey: installing varnishkafka 1.0.12 on cache:upload codfw
12:25 elukey: installing varnishkafka 1.0.12 on cache:upload ulsfo and eqiad
15:02 bblack: upgrading openssl on cp*
18:38 logmsgbot: aaron@tin Synchronized php-1.28.0-wmf.20/includes/libs/rdbms/database/Database.php: rMW844cfd568a7c & rMW014a420b4525 (duration: 00m 49s)
18:47 logmsgbot: thcipriani@tin Synchronized php-1.28.0-wmf.20/extensions/CentralNotice: SWAT: Update extensions/CentralNotice submodule (T144952) (duration: 00m 52s)
19:09 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.20
20:08 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.20
22:49 logmsgbot: aaron@tin Synchronized php-1.28.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php: rMWa73a7ef92862 (duration: 01m 04s)

HTTP 5xx matches:

Screen Shot 2016-11-18 at 10.40.23.png (712×2 px, 151 KB)

Total request count did not change significantly, so the spike is probably not caused by higher overall traffic, but rather by something on our side.

Screen Shot 2016-11-18 at 10.40.32.png (720×2 px, 376 KB)
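The reasoning above (5xx count up, total request count flat) can be sketched as a quick check. This is only an illustration: the function names, the 10% tolerance, and the sample numbers are all hypothetical and not taken from the dashboards.

```python
def error_ratio(errors_5xx, total_requests):
    """Return the fraction of requests that ended in an HTTP 5xx."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return errors_5xx / total_requests


def classify_spike(before, after, traffic_tolerance=0.10):
    """Roughly classify a change in 5xx volume between two periods.

    `before` and `after` are (errors_5xx, total_requests) tuples.
    `traffic_tolerance` is an assumed threshold for "traffic did not
    significantly change"; it is not a value from this task.
    """
    errors_before, total_before = before
    errors_after, total_after = after

    traffic_change = (total_after - total_before) / total_before
    ratio_rose = (
        error_ratio(errors_after, total_after)
        > error_ratio(errors_before, total_before)
    )

    if not ratio_rose:
        # Errors grew no faster than requests: explained by traffic.
        return "traffic"
    if abs(traffic_change) <= traffic_tolerance:
        # Same traffic, higher error ratio: look at our side.
        return "error-rate"
    return "mixed"


# Made-up sample numbers, for illustration only.
print(classify_spike((100, 1_000_000), (500, 1_020_000)))
```

With flat traffic and a higher 5xx ratio the sketch returns "error-rate", which matches the conclusion drawn from the graphs here.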

Event Timeline

Pretty sure what you're looking at here is T147648 (also related: T147784)

Krinkle claimed this task.

Looks like that was it. It's coming back down now:

Screen Shot 2016-11-23 at 12.40.29.png (678×2 px, 154 KB)

Might take a while to return fully as it depends on the iOS app update being rolled out to users (and users may not want to update right away). Closing this task in favour of the others.