Page MenuHomePhabricator

Investigate spike in 500s during asw-c2-eqiad replacement
Closed, DeclinedPublic


Yesterday during asw-c2-eqiad switch replacement in T155875: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues we saw a spike in 500s returned by varnish. Most of those seemed to come from mw api, thus dependent services (e.g. restbase) was affected too.

The DB servers behind asw-c2 (db1055, 56, 57, 59) were purposefully not depooled from mediawiki since this failure mode is something mediawiki should be able to handle gracefully.

Related incident:

Event Timeline

Logstash during that time period ( January 26th 2017, 17:56:23.220 to January 26th 2017, 18:25:00.000):

With top messages being

Server {host} (#2) is not replicating?	706,722
Error connecting to {db_server}: {error}	371,205
Server {host} (#6) is not replicating?	348,429
Server {host} (#4) is not replicating?	260,459
Server {host} (#5) is not replicating?	187,095
Server {host} (#3) is not replicating?	160,324
LoadMonitor::getServerStates: host db1060 is unreachable	72,959
LoadMonitor::getServerStates: host db1087 is unreachable	63,196
LoadMonitor::getServerStates: host db1055 is unreachable	62,881
LoadMonitor::getServerStates: host db1059 is unreachable	45,745
LoadMonitor::getServerStates: host db1056 is unreachable	40,340
LoadMonitor::getServerStates: host db1088 is unreachable	34,825
{"id":"d0ce59d1e6742f3773cf137b","type":"DBExpectedError","file":"/srv/mediawiki/php-1.29.0-wmf.8/includes/libs/rdbms/database/DatabaseMysqlBase.php","line":820,"message":"Failed to query MASTER_POS_WAIT()","code":0,"url":"/rpc/RunJobs.php?wiki=simplewiki	12,635
LoadMonitor::getServerStates: host is unreachable	7,104
LoadMonitor::getServerStates: host is unreachable 2,912

One possibility would be and edge case of how the latest changes to the load balancer works- so that queries get stuck rather than timing out earlier due to waiting to read the replication status. Could that be a possibility, @aaron?

The api servers being the most affected would fit, as those are the ones more sensitive to lag checking.

Ottomata triaged this task as Medium priority.Mar 6 2017, 7:34 PM

I wonder why would the restart cause MASTER_GTID_WAIT() to fail in a non-timeout way, e.g. 'Failed to query MASTER_POS_WAIT()'.

There are other options- loadbalancing creating new connections timing out when there is no immediate error, or external storage doing that (there was also 1 ES server involved). I would honestly take a debug app server and try to reproduce it, faking the same kind of connectivity loss.

What it is clear is that when servers got depooled, the issue went away, so there is some timeout at app level we need to shorten.

jcrespo added a parent task: Restricted Task.May 8 2018, 12:39 PM
Marostegui added a subscriber: Marostegui.

This is impossible to investigate anymore - closing it. Reopen if needed.