Investigate spike in 500s during asw-c2-eqiad replacement
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	fgiunchedi
	Jan 27 2017, 12:08 PM

Description

Yesterday during asw-c2-eqiad switch replacement in T155875: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues we saw a spike in 500s returned by varnish. Most of those seemed to come from mw api, thus dependent services (e.g. restbase) was affected too.

The DB servers behind asw-c2 (db1055, 56, 57, 59) were purposefully not depooled from mediawiki since this failure mode is something mediawiki should be able to handle gracefully.

Related Objects
Search...

Status	Assigned	Task
Resolved	faidon	T155875 asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues
Resolved	None	T155999 DBA plan to mitigate asw-c2-eqiad reboots
Open	None	T119626 Eliminate SPOF at the main database infrastructure
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
		Restricted Task
Declined	None	T156475 Investigate spike in 500s during asw-c2-eqiad replacement

Event Timeline

fgiunchedi created this task.Jan 27 2017, 12:08 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 27 2017, 12:08 PM

fgiunchedi mentioned this in T155875: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues.Jan 27 2017, 12:09 PM

Logstash during that time period ( January 26th 2017, 17:56:23.220 to January 26th 2017, 18:25:00.000): https://logstash.wikimedia.org/goto/8d722054b5849d315d057bd46fe8a894

With top messages being

Server {host} (#2) is not replicating?	706,722
Error connecting to {db_server}: {error}	371,205
Server {host} (#6) is not replicating?	348,429
Server {host} (#4) is not replicating?	260,459
Server {host} (#5) is not replicating?	187,095
Server {host} (#3) is not replicating?	160,324
LoadMonitor::getServerStates: host db1060 is unreachable	72,959
LoadMonitor::getServerStates: host db1087 is unreachable	63,196
LoadMonitor::getServerStates: host db1055 is unreachable	62,881
LoadMonitor::getServerStates: host db1059 is unreachable	45,745
LoadMonitor::getServerStates: host db1056 is unreachable	40,340
LoadMonitor::getServerStates: host db1088 is unreachable	34,825
{"id":"d0ce59d1e6742f3773cf137b","type":"DBExpectedError","file":"/srv/mediawiki/php-1.29.0-wmf.8/includes/libs/rdbms/database/DatabaseMysqlBase.php","line":820,"message":"Failed to query MASTER_POS_WAIT()","code":0,"url":"/rpc/RunJobs.php?wiki=simplewiki	12,635
LoadMonitor::getServerStates: host 10.64.32.184 is unreachable	7,104
LoadMonitor::getServerStates: host 10.64.32.185 is unreachable 2,912

jcrespo added a parent task: T155999: DBA plan to mitigate asw-c2-eqiad reboots.Jan 31 2017, 4:56 PM

• Marostegui mentioned this in T155999: DBA plan to mitigate asw-c2-eqiad reboots.Feb 9 2017, 7:22 AM

One possibility would be and edge case of how the latest changes to the load balancer works- so that queries get stuck rather than timing out earlier due to waiting to read the replication status. Could that be a possibility, @aaron?

The api servers being the most affected would fit, as those are the ones more sensitive to lag checking.

Ottomata triaged this task as Medium priority.Mar 6 2017, 7:34 PM

I wonder why would the restart cause MASTER_GTID_WAIT() to fail in a non-timeout way, e.g. 'Failed to query MASTER_POS_WAIT()'.

There are other options- loadbalancing creating new connections timing out when there is no immediate error, or external storage doing that (there was also 1 ES server involved). I would honestly take a debug app server and try to reproduce it, faking the same kind of connectivity loss.

What it is clear is that when servers got depooled, the issue went away, so there is some timeout at app level we need to shorten.

jcrespo added a parent task: Restricted Task.May 8 2018, 12:39 PM

jcrespo mentioned this in T187962: Rack/cable/configure asw2-c-eqiad switch stack.May 14 2018, 4:37 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 9:02 PM

Krinkle added a project: Sustainability (Incident Followup).Sep 28 2021, 10:48 PM

Krinkle updated the task description. (Show Details)

This is impossible to investigate anymore - closing it. Reopen if needed.

Investigate spike in 500s during asw-c2-eqiad replacementClosed, DeclinedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Investigate spike in 500s during asw-c2-eqiad replacement
Closed, DeclinedPublic
Actions

Related Objects
Search...