
Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020)
Closed, Resolved · Public

Event Timeline

This seems to be related to T243148: db2085 was overwhelmed, which explains the high latency (Special:BlankPage health checks were taking ages to complete).

The latency dropped as soon as @Marostegui depooled db2085.

db2085 is an s1 and s8 codfw slave (multi-instance). We don't have read traffic on codfw databases, so how could it cause those latency issues?

And according to the graph, the latency increase indeed starts when db2085 went down:

Jan 19 07:19:49 icinga1001 icinga: SERVICE ALERT: db2085;puppet last run;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Jan 19 07:19:53 icinga1001 icinga: SERVICE ALERT: db2085;dhclient process;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Jan 19 07:19:53 icinga1001 icinga: SERVICE ALERT: db2085;Check size of conntrack table;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Jan 19 07:19:56 icinga2001 icinga: SERVICE ALERT: db2085;MegaRAID;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Jan 19 07:20:00 icinga2001 icinga: SERVICE ALERT: db2085;SSH;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Jan 19 07:20:05 icinga1001 icinga: HOST ALERT: db2085;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Jan 19 07:20:12 icinga2001 icinga: SERVICE ALERT: db2085;dhclient process;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Jan 19 07:20:12 icinga2001 icinga: HOST ALERT: db2085;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Jan 19 07:20:15 icinga1001 icinga: SERVICE ALERT: db2085;MariaDB Slave Lag: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.

From what I can see in logstash, it looks like it is coming from:

Wikimedia\Rdbms\LoadMonitor::getServerStates: host {db_server} is unreachable

I am not sure whether it is expected to be checking the health of codfw's databases or not. cc @Krinkle @aaron https://logstash.wikimedia.org/goto/f30c629c5e6b5c017e9c4273f79de55e
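To make sure I understand the mechanism, here is a minimal sketch of what I assume is going on. This is not the actual Wikimedia\Rdbms\LoadMonitor code; the plain TCP probe, the hostnames and the weights below are made up purely for illustration.

<?php
// Simplified illustration only -- not the real MediaWiki implementation.
// The real logic lives in Wikimedia\Rdbms\LoadMonitor::getServerStates; this
// sketch only shows why a request can stall when one replica stops answering:
// the reachability probe has to wait for its timeout before the host is
// marked down and excluded from replica selection.
function probeServerStates( array $servers, float $timeoutSec = 0.5 ): array {
	$states = [];
	foreach ( $servers as $name => $hostPort ) {
		[ $host, $port ] = $hostPort;
		// Hypothetical probe: a plain TCP connect with a short timeout stands
		// in for whatever connection attempt the load monitor really makes.
		$errno = 0;
		$errstr = '';
		$conn = @fsockopen( $host, $port, $errno, $errstr, $timeoutSec );
		if ( $conn === false ) {
			// Unreachable host: conceptually, this is where a
			// "host {db_server} is unreachable" style of log entry comes from.
			$states[$name] = [ 'up' => false, 'weight' => 0 ];
		} else {
			fclose( $conn );
			$states[$name] = [ 'up' => true, 'weight' => 1 ];
		}
	}
	return $states;
}

// Only reachable replicas keep a non-zero weight, so a crashed db2085 would
// be excluded -- but only after the probe against it has timed out.
var_dump( probeServerStates( [
	'db2085' => [ 'db2085.codfw.wmnet', 3306 ],
	'db2086' => [ 'db2086.codfw.wmnet', 3306 ],
] ) );

If that is roughly what happens, then every request that needed a database connection (at least until the result got cached) had to wait for the probe against db2085 to time out, which would explain the latency bump on the health check URLs.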

elukey triaged this task as Medium priority. Jan 20 2020, 7:49 AM
elukey added a project: Performance-Team.

As long as there are any health checks that hit MediaWiki in codfw that involve DB access (pretty much any normal/special page view), then LoadMonitor::getServerStates is reachable (in the course of picking a DB to connect to). That seems expected to me.
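For illustration, here is the selection side under the same assumptions as the sketch above. getReplicaForRead() is a hypothetical helper, not a MediaWiki API; it only shows why the server states are consulted on every code path that needs to pick a DB to connect to.

<?php
// Minimal sketch, continuing the assumptions above: given per-host states,
// picking a replica for a read simply skips hosts that were marked down.
function getReplicaForRead( array $states ): ?string {
	$candidates = [];
	foreach ( $states as $name => $state ) {
		if ( $state['up'] && $state['weight'] > 0 ) {
			// Repeat each host proportionally to its (integer) weight.
			$candidates = array_merge(
				$candidates,
				array_fill( 0, $state['weight'], $name )
			);
		}
	}
	if ( $candidates === [] ) {
		return null; // No reachable replica at all.
	}
	return $candidates[ array_rand( $candidates ) ];
}

// With db2085 marked down by the probe, only db2086 can be returned here.
$replica = getReplicaForRead( [
	'db2085' => [ 'up' => false, 'weight' => 0 ],
	'db2086' => [ 'up' => true, 'weight' => 1 ],
] );
echo $replica === null ? "no replica available\n" : "would read from $replica\n";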

As long as there are any health checks that hit MediaWiki in codfw that involve DB access (pretty much any normal/special page view), then LoadMonitor::getServerStates is reachable (in the course of picking a DB to connect to). That seems expected to me.

Thanks @aaron for the answer.
So it is expected, but is it desirable? My point is that an unused host in the passive datacenter went down and still apparently caused user impact, just from health checks.

What user impact did it cause?

@aaron none at all, since it was codfw. On the other hand, we were a bit alarmed by it, since we didn't expect such an alert from there.

Looks like the main action is to avoid these alarms in the future. A few questions (some may be obvious):

  • Did we know about the db issue in advance (e.g. maintenance-induced)? If so, the practice we follow for the primary DC (eqiad) could perhaps be adopted for the inactive one as well, e.g. depool first, even for the inactive DC.
  • If we didn't know about it ahead of time, then an alarm might still be fine if there is an easy actionable step, e.g. depool after the fact.
  • If we didn't know about it and/or don't want alarms for db servers in stand-by DCs, perhaps some of the Icinga checks should be disabled in inactive DCs like codfw, e.g. the app server health checks and maybe more.

Looks like the main action is to avoid these alarms in the future. A few questions (some may be obvious):

  • Did we know about the db issue in advance (e.g. maintenance-induced)? If so, the practice we follow for the primary DC (eqiad) could perhaps be adopted for the inactive one as well, e.g. depool first, even for the inactive DC.

No, nothing planned; it was a host that crashed (T243148).

  • If we didn't know about it ahead of time, then an alarm might still be fine if there is an easy actionable step, e.g. depool after the fact.
  • If we didn't know about it and/or don't want alarms for db servers in stand-by DCs, perhaps some of the Icinga checks should be disabled in inactive DCs like codfw, e.g. the app server health checks and maybe more.

Yeah, I think what we need to decide on is mostly whether to alarm on codfw latency, given that as of today codfw is passive.

Marostegui claimed this task.

I am going to close this, as there is not much else we can really do here, and it looks like a one-time thing.