Monitor swap/memory usage on databases
Closed, Resolved, Public

Description

By default, monitor memory pressure and send a warning when swap starts being used (e.g. when 20% of it is in use, to avoid alerting on legitimate reasons to use it, such as unused fs cache), and maybe a critical when an OOM could happen imminently. Swapping is only a heuristic indicator of an outage, and it normally grows slowly, so it shouldn't page.
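
To make the swap-based proposal concrete, here is a minimal sketch in Python. It assumes the check reads `/proc/meminfo`-style text and uses the 20% figure given above as an example. The function names and the `>=` comparison are hypothetical and are not taken from any existing check.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def swap_status(meminfo, warn_fraction=0.20):
    """Assumed policy: warn (never page) once swap usage reaches warn_fraction."""
    if swap_used_fraction(meminfo) >= warn_fraction:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(swap_status(parse_meminfo(handle.read())))
```

The threshold is a parameter because the right value would need tuning per host class.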

Event Timeline

Marostegui triaged this task as Medium priority. (Sep 5 2018, 7:46 AM)
jcrespo raised the priority of this task from Medium to High. (Aug 4 2020, 6:21 AM)
jcrespo edited subscribers, added: Kormat; removed: Banyek.

dbstore1004 was low on memory, so low that puppet runs were failing (although mysql wasn't, due to its low OOM killer precedence). We should have both swapping and low memory alerts for mysql hosts, probably not paging, but enough to give advance notice or a performance impact alert before crashing. Prometheus metrics could be used for this.

Change 618947 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add proof of concept of memory alert

https://gerrit.wikimedia.org/r/618947

Change 618947 merged by Jcrespo:
[operations/puppet@production] mariadb: Add proof of concept of memory alert

https://gerrit.wikimedia.org/r/618947

I misconfigured db1121 with 430GB of buffer pool and started 2 mysqldump processes to generate some activity. The alert went off as expected:

Screenshot_20200807_191835.png (140×2 px, 54 KB)

Next week I will apply it to some more hosts to tune the thresholds.

Change 619257 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add memory monitoring to core (mw) db hosts

https://gerrit.wikimedia.org/r/619257

Change 619258 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add memory check to most other mariadb roles other than core

https://gerrit.wikimedia.org/r/619258

Change 619257 merged by Jcrespo:
[operations/puppet@production] mariadb: Add memory monitoring to core (mw) db hosts

https://gerrit.wikimedia.org/r/619257

Change 619258 merged by Jcrespo:
[operations/puppet@production] mariadb: Add memory check to most other mariadb roles other than core

https://gerrit.wikimedia.org/r/619258

Change 619301 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reduce s7 memory usage for dbstore1003

https://gerrit.wikimedia.org/r/619301

Change 619301 merged by Jcrespo:
[operations/puppet@production] mariadb: Reduce s7 memory usage for dbstore1003

https://gerrit.wikimedia.org/r/619301

Change 619306 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Increase labsdb* memory monitoring thresholds

https://gerrit.wikimedia.org/r/619306

Change 619306 merged by Jcrespo:
[operations/puppet@production] mariadb: Increase labsdb* memory monitoring thresholds

https://gerrit.wikimedia.org/r/619306

jcrespo claimed this task.

We settled for now on memory usage. We send a warning when usage reaches 90% and a critical (non-paging) when it reaches 95%. I think that will be more stable than trying to predict the very dynamic swapping activity. It is in no way perfect and will need tuning.

For now, the only databases with different parameters are the labsdb hosts, which have a 92/97% threshold due to their higher buffer pool configuration. We will revisit this at a later time to tune the thresholds and maybe add other kinds of monitoring, such as swapping activity.
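
For reference, the thresholds described above can be sketched as a small Python function. This is an illustration only, not the merged puppet check. The role names, the `>=` comparisons, and the definition of usage as total memory minus available memory are all assumptions.

```python
# Warning/critical thresholds in percent of memory used, as stated in this task.
DEFAULT_THRESHOLDS = (90.0, 95.0)
# Hypothetical role key; labsdb hosts use higher thresholds.
ROLE_THRESHOLDS = {"labsdb": (92.0, 97.0)}


def used_percent(total_bytes, available_bytes):
    """Assumed definition of usage: share of total memory that is not available."""
    return 100.0 * (total_bytes - available_bytes) / total_bytes


def memory_status(percent_used, role="core"):
    """Classify memory usage; CRITICAL here is meant to be non-paging."""
    warn, crit = ROLE_THRESHOLDS.get(role, DEFAULT_THRESHOLDS)
    if percent_used >= crit:
        return "CRITICAL"
    if percent_used >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Example: 512 units of memory in total, 40 still available.
    print(memory_status(used_percent(512, 40)))
    print(memory_status(used_percent(512, 40), role="labsdb"))
```

Keeping the per-role overrides in a single table means a later tuning pass only touches data, not logic.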