By default, monitor memory pressure and send a warning when swap starts being used (e.g. when 20% of it is in use, to avoid alerting when swap is used for good reasons, such as unused fs cache), and maybe a critical when OOM kills could start happening immediately. Swapping is only a heuristic indicator of an outage, and it normally grows slowly, so it shouldn't page.
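To make the 20% idea concrete, here is a minimal sketch of such a check, assuming it reads `/proc/meminfo`-style text. The function names `swap_used_percent` and `swap_status` and the idea of running this as a standalone check are illustrative assumptions, not what is deployed.

```python
def swap_used_percent(meminfo_text):
    """Return the percentage of swap in use, parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing can be in use
    return 100.0 * (total - free) / total


def swap_status(meminfo_text, warning_percent=20.0):
    """Hypothetical check: warn (never page) once swap use reaches the threshold."""
    if swap_used_percent(meminfo_text) >= warning_percent:
        return "WARNING"
    return "OK"
```

On a real host it could be fed with `open("/proc/meminfo").read()`. Treating "20% in use" as an inclusive threshold is an assumption.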
dbstore1004 was low on memory, so low that puppet runs were failing (although mysql wasn't, due to its low OOM killer precedence). We should have both swapping and low-memory alerts for mysql hosts: probably not paging, but enough to give advance notice of a performance impact before a crash. Prometheus metrics could be used for this.
I misconfigured db1121 with 430GB of buffer pool and started 2 mysqldump processes to generate some activity. The alert went off as expected:
Next week I will apply it to some more hosts to tune the thresholds.
We settled for now on memory usage. We send a warning at 90% usage and a critical (non-paging) at 95% usage. I think that will be more stable than trying to predict the very dynamic swapping activity. It is in no way perfect and will need tuning.
For now, the only databases with different parameters are the labsdb hosts, which have 92/97% thresholds due to their higher buffer pool configuration. We will revisit this at a later time to tune the thresholds and maybe add other kinds of monitoring, like swapping activity or others.
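For reference, the threshold logic described above can be sketched roughly as follows. This is only an illustration of the 90/95% and 92/97% rules, not the actual check: the function names, selecting labsdb hosts by hostname prefix, and treating the thresholds as inclusive are all assumptions.

```python
# Warning / critical thresholds, as percentages of memory in use.
DEFAULT_THRESHOLDS = (90.0, 95.0)
LABSDB_THRESHOLDS = (92.0, 97.0)  # higher because of the larger buffer pool


def thresholds_for(hostname):
    """Pick thresholds by host type (matching on a name prefix is an assumption)."""
    if hostname.startswith("labsdb"):
        return LABSDB_THRESHOLDS
    return DEFAULT_THRESHOLDS


def memory_status(hostname, used_percent):
    """Classify memory usage; CRITICAL here is still meant to be non-paging."""
    warning, critical = thresholds_for(hostname)
    if used_percent >= critical:
        return "CRITICAL"
    if used_percent >= warning:
        return "WARNING"
    return "OK"
```

For example, 91% usage would be a WARNING on db1121 but still OK on a host named like `labsdb1009` (a hypothetical hostname used only for illustration).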