
Better mysql monitoring for number of connections and processlist strange patterns
Closed, Declined · Public

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: DBA, acl*sre-team.
jcrespo subscribed.
jcrespo triaged this task as High priority. Sep 13 2015, 5:06 PM
jcrespo set Security to None.

In particular, we should monitor not only the current connections but also the peak since the last check; otherwise short bursts may go undetected. Related: T109279
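
For the "peak since the last check" part, MySQL already keeps a high-water mark in Max_used_connections, so a rough approach (just a sketch; the flush policy and exact variables used are up for discussion) could be:

# Current and peak connection counts. Max_used_connections only resets on
# FLUSH STATUS or a server restart, so flushing after each check turns it
# into "peak since the last check".
mysql -BN -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected', 'Max_used_connections')"
mysql -e "FLUSH STATUS"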

This is what I have now:

$ time bash mysqlcheck.sh 
HTTP/1.1 200 OK
Content-Type: application/json
Connection: close
Content-Length: 182

{ "server_available": "yes", "current_time": "2015-09-17 19:41:19.066394", "last_master_update": "2015-09-17 14:11:24.000650", "max_connections": "5000", "current_connections": "6" }
real	0m0.010s
user	0m0.006s
sys	0m0.002s
$ vim mysqlcheck.sh # to make mysql fail
$ time bash mysqlcheck.sh 
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Connection: close
Content-Length: 26

{"server_available": "no"}
real	0m0.009s
user	0m0.000s
sys	0m0.008s

(BTW, this server is running, but depooled and with its replication thread stopped).
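
For anyone wanting to reproduce it, the general shape is roughly the following (a simplified sketch, not the exact script; the query, the timeout and the omission of last_master_update here are my simplifications):

#!/bin/bash
# Simplified sketch of the check: one query against the local server, and an
# HTTP-style response on stdout. Not the exact production script.
DATA=$(mysql --connect-timeout=1 -BN -e \
  "SELECT NOW(6), @@max_connections, COUNT(*) FROM information_schema.processlist" 2>/dev/null)

respond() {  # $1 = status line, $2 = JSON body
    printf 'HTTP/1.1 %s\r\nContent-Type: application/json\r\nConnection: close\r\nContent-Length: %d\r\n\r\n%s\n' \
        "$1" "${#2}" "$2"
}

if [ -z "$DATA" ]; then
    respond "503 Service Unavailable" '{"server_available": "no"}'
    exit 1
fi
IFS=$'\t' read -r NOW MAXCONN CURCONN <<< "$DATA"
respond "200 OK" "{ \"server_available\": \"yes\", \"current_time\": \"$NOW\", \"max_connections\": \"$MAXCONN\", \"current_connections\": \"$CURCONN\" }"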

@jcrespo, it seems like adding /proc/loadavg would be a useful metric to adjust on. But is it a naive assumption that load is a good proxy for MySQL's capacity to do work?

I am afraid not. A simplistic but accurate way of describing the Linux load parameters is "the size of the CPU queue". MySQL will very rarely be blocked by CPU; it will either be suffering from internal contention or waiting on IO. The number of current_connections could be a better measure of that queue. Maybe I can add more information, like active vs. waiting processes (usually, MySQL will be executing 32 or fewer queries at the same time, and will make the others wait).
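
For what it's worth, that active vs. waiting split can be pulled straight from the processlist, counting sleeping connections as waiting (this is only an illustration of the idea, not the final check):

# Active threads vs. connections that are just sitting idle (command = 'Sleep').
mysql -BN -e "
  SELECT IF(command = 'Sleep', 'waiting', 'active') AS state, COUNT(*)
  FROM information_schema.processlist
  GROUP BY 1"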

You may be asking the most difficult question of all in the database world: how do you measure how busy MySQL is? There is no definitive answer. It should probably be a combination of lag (already produced as the difference between "current_time" and "last_master_update", with a 0.25s error), MySQL process queuing (current_connections, maybe some extra information) and the latency of the HTTP request itself (which would measure MySQL query time and network congestion). Together those would cover data staleness, throughput issues and latency issues.
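
The lag half of that is essentially a pt-heartbeat read. Assuming the default heartbeat.heartbeat table, UTC timestamps, and that only the master writes heartbeat rows (all of which depend on how pt-heartbeat is run here), it would look something like:

# Seconds behind the master as seen from the slave; the ~0.25s error mentioned
# above would come from the heartbeat update interval. Table/column names are
# the pt-heartbeat defaults and may not match this setup exactly.
mysql -BN -e "
  SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
  FROM heartbeat.heartbeat"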

If we go to one of our busiest servers (db1072), although it is not at peak time right now, I get this:

$ time bash mysqlcheck.sh
HTTP/1.1 200 OK
Content-Type: application/json
Connection: close
Content-Length: 184

{ "server_available": "yes", "current_time": "2015-09-18 08:09:45.047804", "last_master_update": "2015-09-18 08:09:45.000680", "max_connections": "5000", "current_connections": "161" }
real	0m0.014s
user	0m0.010s
sys	0m0.001s

So: 0.014s means very short latency, which is good (obviously real HTTP latency should be measured, and would be larger than that); the lag is 0.047124s (minus the 0.25s error), so almost no lag, also good; and current_connections is 161 (there is some queuing, >32, but that is normal for such a busy server; it is far from the 5000 connection limit). To detect a large burst of connections, we need to make sure that current_connections does not reach 75% or 90% of max_connections. Tuning those thresholds is something that should be tested under realistic stress conditions.
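
A first stab at that burst check could be as simple as the following (75 and 90 are placeholders to be tuned under that stress testing; exit codes follow the usual Nagios convention):

# Warn/critical when current connections approach max_connections.
# Thresholds are placeholders only.
WARN=75
CRIT=90
read -r CUR MAX <<< "$(mysql -BN -e \
  "SELECT COUNT(*), @@max_connections FROM information_schema.processlist")"
PCT=$(( 100 * CUR / MAX ))
if   [ "$PCT" -ge "$CRIT" ]; then echo "CRITICAL: $CUR/$MAX connections ($PCT%)"; exit 2
elif [ "$PCT" -ge "$WARN" ]; then echo "WARNING: $CUR/$MAX connections ($PCT%)"; exit 1
else echo "OK: $CUR/$MAX connections ($PCT%)"; exit 0
fi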

Aaron is using MediaWiki itself to query pt-heartbeat. With that in mind, I wonder whether exposing a port is still interesting for external services, or whether I should just implement a check directly on Nagios.

I would do the second for now.

@jcrespo, interesting. If you're able to get those numbers into Nagios, I can probably figure out a way to get them onto the appservers.

@csteipp Are you suggesting the appservers depend on Nagios? I would strongly advise against that.

If you mean "copying" what Nagios will be doing, keep an eye on the upcoming check_mariadb.pl changes I am writing now. The logic is easy, but I have to figure out a way to distribute which server is each slave's original master (whether to auto-discover it or set it manually as a puppet property).
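
(For the auto-discover option, the immediate master is already known to the slave itself, e.g.:)

# Auto-discovery of the immediate master from the slave itself; for chained
# replication this would have to be repeated up the chain to find the
# original master.
mysql -e "SHOW SLAVE STATUS\G" | awk '/Master_Host:/ {print $2}'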

Aaron already implemented it in MediaWiki, at least the heartbeat one: https://gerrit.wikimedia.org/r/#/c/241133/

Change 253665 had a related patch set uploaded (by Jcrespo):
[WIP] Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

jcrespo changed the task status from Open to Stalled. Jan 18 2016, 5:54 PM

Change 253665 merged by Jcrespo:
Use heartbeat when possible to check slave lag

https://gerrit.wikimedia.org/r/253665

Before all the old 5.5 servers are decommissioned, this is the list of old checks:

DPKG [common]
Disk space [common]
Full LVS Snapshot [unneeded? we now use xtrabackup]
MySQL Idle Transactions [interesting]
MySQL InnoDB [interesting]
MySQL Processlist [interesting]
MySQL Recent Restart [interesting]
MySQL Replication Heartbeat [done]
MySQL Slave Delay [done]
MySQL Slave Running [done]
Slave_SQL_Running: [done]
MySQL disk space [done, needed if space can now be parametrized?]
NTP [common]
RAID [common]
SSH [common]
configured eth [common]
dhclient process [common]
mysqld processes [done]
puppet last run [common]
salt-minion processes [common]

Most of the pending ones seem to be the typical ones from the Percona Monitoring toolkit, although we need a balance between annoying and complete. The ones in the title are the minimal ones.
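
For example, the "MySQL Idle Transactions" one can probably be rebuilt on top of information_schema; something along these lines (the 60-second threshold is only an example):

# Connections that are idle (Sleep) while holding a transaction open for
# more than 60 seconds; the threshold is an example only.
mysql -BN -e "
  SELECT p.id, p.user, p.host, trx.trx_started
  FROM information_schema.innodb_trx trx
  JOIN information_schema.processlist p ON p.id = trx.trx_mysql_thread_id
  WHERE p.command = 'Sleep'
    AND trx.trx_started < NOW() - INTERVAL 60 SECOND"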

Change 289822 had a related patch set uploaded (by Jcrespo):
Revert "mariadb: set replication check's contact_group to admins"

https://gerrit.wikimedia.org/r/289822

Change 289822 merged by Jcrespo:
Revert "mariadb: set replication check's contact_group to admins"

https://gerrit.wikimedia.org/r/289822

Change 289825 had a related patch set uploaded (by Jcrespo):
Increase retries to 10 to avoid small bumps to alert

https://gerrit.wikimedia.org/r/289825

Change 289825 merged by Jcrespo:
Increase retries to 10 to avoid small bumps to alert

https://gerrit.wikimedia.org/r/289825

Marostegui lowered the priority of this task from High to Medium. Jun 12 2017, 10:29 AM
Marostegui subscribed.

Closing this in favour of T253120, which has more concrete points of action.