
Better mysql monitoring for number of connections and processlist strange patterns
Closed, Declined · Public

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: DBA, acl*sre-team.
jcrespo subscribed.
jcrespo triaged this task as High priority. Sep 13 2015, 5:06 PM
jcrespo set Security to None.

In particular, we should monitor not only the current connections but also the peak since the last check; otherwise short bursts may go undetected. Related: T109279
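
For the "peak since the last check" part, MySQL already keeps a high-water mark in Max_used_connections, so a rough approach (just a sketch; the flush policy and exact variables used are up for discussion) could be:

# Current and peak connection counts. Max_used_connections only resets on
# FLUSH STATUS or a server restart, so flushing after each check turns it
# into "peak since the last check".
mysql -BN -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected', 'Max_used_connections')"
mysql -e "FLUSH STATUS"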

This is what I have now:

$ time bash mysqlcheck.sh 
HTTP/1.1 200 OK
Content-Type: application/json
Connection: close
Content-Length: 182

{ "server_available": "yes", "current_time": "2015-09-17 19:41:19.066394", "last_master_update": "2015-09-17 14:11:24.000650", "max_connections": "5000", "current_connections": "6" }
real	0m0.010s
user	0m0.006s
sys	0m0.002s
$ vim mysqlcheck.sh # to make mysql fail
$ time bash mysqlcheck.sh 
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Connection: close
Content-Length: 26

{"server_available": "no"}
real	0m0.009s
user	0m0.000s
sys	0m0.008s

(BTW, this server is running, but depooled and with its replication thread stopped).
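
For anyone wanting to reproduce it, the general shape is roughly the following (a simplified sketch, not the exact script; the query, the timeout and the omission of last_master_update here are my simplifications):

#!/bin/bash
# Simplified sketch of the check: one query against the local server, and an
# HTTP-style response on stdout. Not the exact production script.
DATA=$(mysql --connect-timeout=1 -BN -e \
  "SELECT NOW(6), @@max_connections, COUNT(*) FROM information_schema.processlist" 2>/dev/null)

respond() {  # $1 = status line, $2 = JSON body
    printf 'HTTP/1.1 %s\r\nContent-Type: application/json\r\nConnection: close\r\nContent-Length: %d\r\n\r\n%s\n' \
        "$1" "${#2}" "$2"
}

if [ -z "$DATA" ]; then
    respond "503 Service Unavailable" '{"server_available": "no"}'
    exit 1
fi
IFS=$'\t' read -r NOW MAXCONN CURCONN <<< "$DATA"
respond "200 OK" "{ \"server_available\": \"yes\", \"current_time\": \"$NOW\", \"max_connections\": \"$MAXCONN\", \"current_connections\": \"$CURCONN\" }"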

@jcrespo, it seems like adding /proc/loadavg would be a useful metric to adjust on. But is it a naive assumption that load is a good proxy for MySQL's capacity to do work?

I am afraid not. A simplistic but accurate way of describing the Linux load parameters is "the size of the CPU queue". MySQL will very rarely be blocked by CPU; it will either be suffering from internal contention or waiting on IO. The number of current_connections could be a better measure of that queue. Maybe I can add more information, like active vs. waiting processes (usually, MySQL will be executing 32 or fewer queries at the same time, and will make the others wait).
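
For what it's worth, that active vs. waiting split can be pulled straight from the processlist, counting sleeping connections as waiting (this is only an illustration of the idea, not the final check):

# Active threads vs. connections that are just sitting idle (command = 'Sleep').
mysql -BN -e "
  SELECT IF(command = 'Sleep', 'waiting', 'active') AS state, COUNT(*)
  FROM information_schema.processlist
  GROUP BY 1"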

You may be asking the most difficult question of all in the database world: how do you measure how busy MySQL is? There is no definitive answer. It should probably be a combination of lag (already produced as the difference between "current_time" and "last_master_update", with a 0.25s error), MySQL process queuing (current_connections, maybe some extra information) and the latency of the HTTP request itself (which would measure MySQL query time and network congestion). Together those would cover data staleness, throughput issues and latency issues.
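
The lag half of that is essentially a pt-heartbeat read. Assuming the default heartbeat.heartbeat table, UTC timestamps, and that only the master writes heartbeat rows (all of which depend on how pt-heartbeat is run here), it would look something like:

# Seconds behind the master as seen from the slave; the ~0.25s error mentioned
# above would come from the heartbeat update interval. Table/column names are
# the pt-heartbeat defaults and may not match this setup exactly.
mysql -BN -e "
  SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
  FROM heartbeat.heartbeat"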

If we go to one of our busiest servers (db1072), although it is not at peak time right now, I get this:

$ time bash mysqlcheck.sh
HTTP/1.1 200 OK
Content-Type: application/json
Connection: close
Content-Length: 184

{ "server_available": "yes", "current_time": "2015-09-18 08:09:45.047804", "last_master_update": "2015-09-18 08:09:45.000680", "max_connections": "5000", "current_connections": "161" }
real	0m0.014s
user	0m0.010s
sys	0m0.001s

So: 0.014s means very short latency, which is good (obviously real HTTP latency should be measured, and would be larger than that); the lag is 0.047124s (minus the 0.25s error), so almost no lag, also good; and current_connections is 161 (there is some queuing, >32, but that is normal for such a busy server; it is far from the 5000 connection limit). To detect a large burst of connections, we need to make sure that current_connections does not reach 75% or 90% of max_connections. Tuning those thresholds is something that should be tested under realistic stress conditions.
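
A first stab at that burst check could be as simple as the following (75 and 90 are placeholders to be tuned under that stress testing; exit codes follow the usual Nagios convention):

# Warn/critical when current connections approach max_connections.
# Thresholds are placeholders only.
WARN=75
CRIT=90
read -r CUR MAX <<< "$(mysql -BN -e \
  "SELECT COUNT(*), @@max_connections FROM information_schema.processlist")"
PCT=$(( 100 * CUR / MAX ))
if   [ "$PCT" -ge "$CRIT" ]; then echo "CRITICAL: $CUR/$MAX connections ($PCT%)"; exit 2
elif [ "$PCT" -ge "$WARN" ]; then echo "WARNING: $CUR/$MAX connections ($PCT%)"; exit 1
else echo "OK: $CUR/$MAX connections ($PCT%)"; exit 0
fi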

Aaron is using MediaWiki itself to query pt-heartbeat. With that in mind, I wonder whether exposing a port is still interesting for external services, or whether I should just implement a check directly on Nagios.

I would do the second for now.

@jcrespo, interesting. If you're able to get those numbers into Nagios, I can probably figure out a way to get them onto the appservers.

@csteipp Are you suggesting the appservers depend on Nagios? I would strongly advise against that.

If you mean "copying" what Nagios will be doing, keep an eye on the upcoming check_mariadb.pl changes I am writing now. The logic is easy, but I have to figure out a way to distribute which server is each slave's original master (whether to auto-discover it or set it manually as a puppet property).
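
(For the auto-discover option, the immediate master is already known to the slave itself, e.g.:)

# Auto-discovery of the immediate master from the slave itself; for chained
# replication this would have to be repeated up the chain to find the
# original master.
mysql -e "SHOW SLAVE STATUS\G" | awk '/Master_Host:/ {print $2}'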

Aaron already implemented it in MediaWiki, at least the heartbeat one: https://gerrit.wikimedia.org/r/#/c/241133/

Change 253665 had a related patch set uploaded (by Jcrespo):
[WIP] Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

jcrespo changed the task status from Open to Stalled. Jan 18 2016, 5:54 PM

Change 253665 merged by Jcrespo:
Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

Before all the old 5.5 servers are decommissioned, this is the list of old checks:

DPKG [common]
Disk space [common]
Full LVS Snapshot [unneeded? we now use xtrabackup]
MySQL Idle Transactions [interesting]
MySQL InnoDB [interesting]
MySQL Processlist [interesting]
MySQL Recent Restart [interesting]
MySQL Replication Heartbeat [done]
MySQL Slave Delay [done]
MySQL Slave Running [done]
Slave_SQL_Running: [done]
MySQL disk space [done, needed if space can now be parametrized?]
NTP [common]
RAID [common]
SSH [common]
configured eth [common]
dhclient process [common]
mysqld processes [done]
puppet last run [common]
salt-minion processes [common]

Most of the pending ones seem to be the typical ones from the Percona Monitoring toolkit, although we need a balance between annoying and complete. The ones in the title are the minimal ones.
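
For example, the "MySQL Idle Transactions" one can probably be rebuilt on top of information_schema; something along these lines (the 60-second threshold is only an example):

# Connections that are idle (Sleep) while holding a transaction open for
# more than 60 seconds; the threshold is an example only.
mysql -BN -e "
  SELECT p.id, p.user, p.host, trx.trx_started
  FROM information_schema.innodb_trx trx
  JOIN information_schema.processlist p ON p.id = trx.trx_mysql_thread_id
  WHERE p.command = 'Sleep'
    AND trx.trx_started < NOW() - INTERVAL 60 SECOND"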

Change 289822 had a related patch set uploaded (by Jcrespo):
Revert "mariadb: set replication check's contact_group to admins"

https://gerrit.wikimedia.org/r/289822

Change 289822 merged by Jcrespo:
Revert "mariadb: set replication check's contact_group to admins"

https://gerrit.wikimedia.org/r/289822

Change 289825 had a related patch set uploaded (by Jcrespo):
Increase retries to 10 to avoid small bumps to alert

https://gerrit.wikimedia.org/r/289825

Change 289825 merged by Jcrespo:
Increase retries to 10 to avoid small bumps to alert

https://gerrit.wikimedia.org/r/289825

Marostegui lowered the priority of this task from High to Medium. Jun 12 2017, 10:29 AM
Marostegui subscribed.

Closing this in favour of T253120, which has more concrete points of action.