By default, monitor memory pressure and send a warning when swap starts being used (e.g. when 20% of it is in use, so that good reasons to use it, such as unused fs cache, do not trigger the alert), and maybe a critical when OOM can start happening immediately. Swapping is only a heuristic indicator of an outage, and it normally grows slowly, so it shouldn't page.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | jcrespo | T171928 Wikidata and dewiki databases locked |
| Resolved | | None | T172492 Database alerting |
| Resolved | | jcrespo | T172490 Monitor swap/memory usage on databases |
Event Timeline
dbstore1004 was low on memory, so low that puppet runs were failing (although mysql was not, due to its low OOM killer precedence). We should have both swapping and low-memory alerts for mysql hosts, probably not paging, but enough to give advance notice and a performance-impact alert before crashing. Prometheus metrics could be used for this.
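As a rough illustration of the two signals mentioned above, here is a minimal sketch that derives memory and swap usage percentages from `/proc/meminfo`-style text. The function names and the sample values are hypothetical, and defining "used" as `MemTotal - MemAvailable` is an assumption; the actual check may compute usage differently (for example from Prometheus metrics).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first field is the number; an optional "kB" unit follows.
            values[key.strip()] = int(fields[0])
    return values


def usage_percentages(meminfo):
    """Return (memory_used_pct, swap_used_pct).

    Assumption: memory in use is MemTotal - MemAvailable.
    """
    mem_total = meminfo["MemTotal"]
    mem_available = meminfo["MemAvailable"]
    swap_total = meminfo.get("SwapTotal", 0)
    swap_free = meminfo.get("SwapFree", 0)

    mem_used_pct = 100.0 * (mem_total - mem_available) / mem_total
    # Hosts without swap configured report SwapTotal as 0.
    if swap_total:
        swap_used_pct = 100.0 * (swap_total - swap_free) / swap_total
    else:
        swap_used_pct = 0.0
    return mem_used_pct, swap_used_pct


# Made-up sample values, for illustration only.
SAMPLE = """MemTotal:       1000000 kB
MemAvailable:    100000 kB
SwapTotal:       200000 kB
SwapFree:        150000 kB
"""

if __name__ == "__main__":
    mem_pct, swap_pct = usage_percentages(parse_meminfo(SAMPLE))
    print(f"memory used: {mem_pct:.1f}%, swap used: {swap_pct:.1f}%")
```

On a real host the input could be read with `open("/proc/meminfo").read()` instead of the sample string.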
Change 618947 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add proof of concept of memory alert
Change 618947 merged by Jcrespo:
[operations/puppet@production] mariadb: Add proof of concept of memory alert
I misconfigured db1121 with 430GB of buffer pool and started 2 mysqldump processes to generate some activity. The alert went off as expected.
Next week I will apply it to some more hosts to tune the thresholds.
Change 619257 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add memory monitoring to core (mw) db hosts
Change 619258 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add memory check to most other mariadb roles other than core
Change 619257 merged by Jcrespo:
[operations/puppet@production] mariadb: Add memory monitoring to core (mw) db hosts
Change 619258 merged by Jcrespo:
[operations/puppet@production] mariadb: Add memory check to most other mariadb roles other than core
Change 619301 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reduce s7 memory usage for dbstore1003
Change 619301 merged by Jcrespo:
[operations/puppet@production] mariadb: Reduce s7 memory usage for dbstore1003
Change 619306 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Increase labsdb* memory monitoring thresholds
Change 619306 merged by Jcrespo:
[operations/puppet@production] mariadb: Increase labsdb* memory monitoring thresholds
We settled on memory usage for now. We send a warning at 90% usage and a critical (non-paging) at 95% usage. I think that will be more stable than trying to predict the very dynamic swapping activity. It is in no way perfect and will need tuning.
For now, the only databases with different parameters are the labsdb hosts, which use 92/97% thresholds because of their larger buffer pool configuration. We will revisit this later to tune the thresholds and maybe add other kinds of monitoring, such as swapping activity.
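The threshold scheme described above can be sketched as follows. This is only an illustration, not the merged puppet check: the names `DEFAULT_THRESHOLDS`, `ROLE_THRESHOLDS` and `classify` are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption.

```python
# Warning/critical thresholds in percent of memory used, as stated above.
DEFAULT_THRESHOLDS = (90.0, 95.0)
# Per-role overrides; only labsdb hosts differ for now.
ROLE_THRESHOLDS = {"labsdb": (92.0, 97.0)}


def classify(used_pct, role="core"):
    """Map a memory usage percentage to an alert state for a host role.

    Assumption: a value equal to a threshold already triggers that state.
    """
    warning, critical = ROLE_THRESHOLDS.get(role, DEFAULT_THRESHOLDS)
    if used_pct >= critical:
        return "CRITICAL"  # non-paging critical
    if used_pct >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for role, pct in [("core", 93.0), ("labsdb", 93.0), ("labsdb", 97.5)]:
        print(role, pct, classify(pct, role))
```

Keeping the overrides in a single mapping makes later tuning a one-line change per role.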
