
cloud misc database connection issues
Closed, Duplicate · Public

Description

Over the past months, dbproxy1005 has been detecting db1009 as down, which normally means it was unable to connect within 3 seconds, 3 times in a row. We believe this could be because of temporary overload caused by some application (with 99% probability, one owned by cloud).
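For readers unfamiliar with this kind of check, the "3 seconds, 3 times in a row" rule can be sketched roughly as follows. This is only an illustration, not the actual check running on dbproxy1005: the function names `tcp_check` and `is_marked_down` are hypothetical, the parameter name `fall` is borrowed from HAProxy's health-check terminology, and a plain TCP connect stands in for whatever probe the proxy really uses.

```python
import socket
from typing import Iterable


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds within `timeout` seconds.

    Hypothetical stand-in for the proxy's real probe (assumption).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def is_marked_down(check_results: Iterable[bool], fall: int = 3) -> bool:
    """Return True once `fall` consecutive checks have failed.

    A single successful check resets the failure counter, so only an
    unbroken run of failures marks the server as down.
    """
    consecutive_failures = 0
    for ok in check_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= fall:
            return True
    return False
```

Under these assumptions, a short overload that makes three consecutive probes time out is enough to flip the server to "down", even if it recovers immediately afterwards, which is consistent with the micro-downtime theory.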

Last time this happened was Fri 23 Feb 2018:

[20:38:35] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1

but it has happened several times in the past, I think more frequently lately.

dbproxy1005 is not the problem here: it is not yet used to fail over m5, although we would like to do that eventually. The problem is that it could be detecting micro-downtimes caused by overload. That a misbehaving cloud application is the cause is not verified, but it is the working hypothesis right now.

In addition to these events, after the latest grant changes, the number of aborted connections has increased greatly:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1009&var-port=9104&panelId=10&fullscreen&from=1518965801427&to=1519570601428

Please help us investigate both the long-running issue and the new issue with MySQL connections. This would be a blocker for T188029.

Event Timeline

jcrespo created this task. Feb 25 2018, 3:08 PM
Restricted Application added a subscriber: Aklapper. Feb 25 2018, 3:08 PM

This happened again yesterday, and it is a blocker for high availability for all services on s5. It could also be a sign of a worse outage in the future.

When I zoom out on that graph, it looks to me like this problem was happening back in early December, then stopped happening around the 15th, then started happening again on the 22nd of February. Do you agree?

If we're lucky, that extra datapoint should help narrow things down.

Is there any way to know what service is producing all those aborted clients?

There are metrics in MySQL (performance_schema.*, information_schema.*, and sys.*) and in Prometheus; I just need to take the time to look at them. I was hoping some application would be failing clearly enough that the cause would be easy to see.

Note that the connection failures are not as worrying as the proxy thinking s5-master is down; see the IRC log for the most recent occurrence:

[2018-02-26 15:20:02] <icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1

That could well correspond to when I was doing a massive import of labswiki into m5. That doesn't really account for the other times this happened though.

bd808 added a subscriber: bd808. Feb 27 2018, 3:51 PM