Tendril seems to have excessive memory consumption that keeps growing over time until the OOM killer kicks in and starts killing processes, typically MySQL or nagios, causing pages.
Some example tasks:
T231165: db1115 (tendril) paged twice in 24h due to OOM
T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues
Some graphs that show this behavior:
Tendril heavily uses the event scheduler with lots of events:
# mysql.py -hdb1115 tendril -e "show events" | wc -l
2267
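To break that number down and confirm the scheduler is actually running, something like the following should work (a sketch; these are plain read-only queries against information_schema, using the same mysql.py invocation form as above):

# Count events per schema
mysql.py -hdb1115 information_schema -e "SELECT EVENT_SCHEMA, COUNT(*) AS events FROM EVENTS GROUP BY EVENT_SCHEMA ORDER BY events DESC"
# Confirm the event scheduler is enabled
mysql.py -hdb1115 information_schema -e "SHOW GLOBAL VARIABLES LIKE 'event_scheduler'"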
MySQL is configured with a 20GB InnoDB buffer pool:
innodb_buffer_pool_size = 20G
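To double-check the value the running server actually uses (a sketch; these are standard server variables, queried with the same mysql.py form as above):

mysql.py -hdb1115 tendril -e "SELECT @@innodb_buffer_pool_size/1024/1024/1024 AS buffer_pool_gb, @@innodb_buffer_pool_instances AS instances"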
As of now (1st Sept 05:32 AM UTC) this is the memory consumption:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29136 mysql     20   0  0.106t 0.062t  25624 S 244.4 50.3  41014:10 mysqld
root@db1115:/home/marostegui# free -g
              total        used        free      shared  buff/cache   available
Mem:            125          63           0           1          61          59
Swap:             0           0           0
# pmap 29136 | grep total
 total         113386168K
mysql:root@localhost [(none)]> show global status like 'Uptime';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| Uptime        | 657836 |
+---------------+--------+
1 row in set (0.00 sec)
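One way to correlate the growth with uptime is to log mysqld's RSS periodically (a sketch, assuming PID 29136 as shown above and a hypothetical log path under /var/tmp; both would need adjusting after a restart):

# Append a timestamped RSS sample every 10 minutes
while true; do
    echo "$(date -u +%FT%TZ) $(grep VmRSS /proc/29136/status)"
    sleep 600
done >> /var/tmp/mysqld_rss.log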
There are some bugs (some of them quite old) reporting memory issues with events, so this could be similar:
https://bugs.mysql.com/bug.php?id=18683
https://forums.mysql.com/read.php?35,675519,675519
https://jira.mariadb.org/browse/MDEV-14260?attachmentViewMode=list
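If the event scheduler is indeed the source of the leak, a controlled test would be to stop it for a while and watch whether mysqld's RSS keeps growing (a sketch; note this is disruptive, since tendril stops refreshing its data while the scheduler is off):

mysql.py -hdb1115 tendril -e "SET GLOBAL event_scheduler = OFF"
# ...observe RSS for a while, then re-enable:
mysql.py -hdb1115 tendril -e "SET GLOBAL event_scheduler = ON"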
And this is the graph after the last crash (T231165: db1115 (tendril) paged twice in 24h due to OOM):