Cross-posting from T231185#6107666 and T231185#6122136
== Tue May 5th: ==
We had an issue with tendril today: it was very slow and almost unresponsive. At first I thought it was another case of {T231769}, but it wasn't.
First of all, we realized that after restarting MySQL we got errors about TokuDB not being loaded:
```
daemon.log:May 5 05:00:00 db1115 mysqld[1029]: 2020-05-05 5:00:00 140204559673088 [ERROR] Event Scheduler: [root@208.80.154.151][tendril.tendril_global_status_log_5m] Unknown storage engine 'TokuDB'
```
That resulted in some tables failing. This was due to a mistake in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/585115/3/modules/profile/templates/mariadb/mysqld_config/tendril.my.cnf.erb, which has been fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594405/
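For reference, manually re-enabling the engine on a running instance would be something along these lines (a sketch, assuming the TokuDB plugin library was present on disk and simply failed to load at startup):
```
-- Load the TokuDB plugin by hand on the running instance:
INSTALL SONAME 'ha_tokudb';

-- Verify the engine is available again before re-running the events:
SHOW ENGINES;
```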
However, after manually enabling TokuDB we also saw that the host got heavily loaded as soon as MySQL was started, and the CPU usage went through the roof:
```
05:31:43 up 36 min, 3 users, load average: 64.56, 57.27, 36.13
```
{F31803099}
Connections were piling up on the host, and one of them in particular was taking ages to complete:
```
| 240574 | root | 10.64.32.25 | tendril | Connect | 852 | Copying to tmp table | insert into processlist_query_log
(server_id, stamp, id, user, host, db, time, info)
select p.server_id, now(), p.id, p.user, p.host, p.db, p.time, p.info
from processlist p
left join processlist_query_log q
on q.server_id = @server_id
and p.id = q.id
and p.user = q.user
and p.host = q.host
and p.db = q.db
and p.info = q.info
and p.time > q.time
and q.stamp > now() - interval 7 day
where
p.server_id = @server_id
and p.command = 'Query'
and q.server_id is null
```
That was the only query that would remain running after stopping the event scheduler.
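For the record, stopping the scheduler and getting rid of that thread can be done with standard statements like these (a sketch; the thread id is the one from the processlist output above):
```
-- Stop the event scheduler so no new copies of the event get launched:
SET GLOBAL event_scheduler = OFF;

-- Kill the long-running insert (id taken from the processlist above):
KILL 240574;
```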
One of the tables involved is `processlist_query_log`, which is huge:
{P11134}
According to the .ibd creation dates, it looks like the partitions are purged every month, but for some reason they have had lots of rows lately.
For instance:
```
-rw-rw---- 1 mysql mysql 11G May 2 00:58 processlist_query_log#P#p737911.ibd
-rw-rw---- 1 mysql mysql 16G May 3 00:56 processlist_query_log#P#p737912.ibd
-rw-rw---- 1 mysql mysql 15G May 4 00:56 processlist_query_log#P#p737913.ibd
```
That's lots of rows for just 24h.
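Instead of relying on the .ibd file sizes, the per-partition row counts and data sizes can be pulled from information_schema (a sketch; the `tendril` schema name is assumed from the processlist output, and InnoDB row counts are estimates):
```
-- Estimated rows and on-disk data size per partition of the big table:
SELECT PARTITION_NAME,
       TABLE_ROWS,
       ROUND(DATA_LENGTH / 1024 / 1024 / 1024, 1) AS data_gb
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'tendril'
  AND TABLE_NAME = 'processlist_query_log'
ORDER BY PARTITION_NAME;
```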
The select with a left join on that huge table was likely going to take hours, which explains that CPU usage.
**After truncating the table, everything went back to normal, as that query can now finish quickly thanks to the much smaller table size.**
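The cleanup itself would be a single statement (a sketch; the `tendril` schema name is assumed from the processlist output above). If keeping the newest data were preferred, single partitions could be truncated instead:
```
-- Drop all rows from the log table:
TRUNCATE TABLE tendril.processlist_query_log;

-- Alternative sketch: empty only one partition, keeping the rest
-- (partition name taken from the file listing above):
ALTER TABLE tendril.processlist_query_log TRUNCATE PARTITION p737912;
```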
Judging from the file sizes, something has made the partitions grow almost 5x since 1st May. They used to be around 3GB or 4GB per day, but since 1st May they are bigger than 10GB. That needs some investigation too.
== Sun May 10th: ==
This has happened again:
```
root@cumin1001:~# mysql.py -hdb1115 -e "show processlist" | grep Copy
35178632 root 10.64.32.25 tendril Connect 15049 Copying to tmp table insert into processlist_query_log \n (server_id, stamp, id, user, host, db, time, info)\n 0.000
35964338 root 10.64.32.13 tendril Connect 5533 Copying to tmp table insert into processlist_query_log\n (server_id, stamp, id, user, host, db, time, info)\n 0.000
36340984 root 10.64.32.13 tendril Connect 1032 Copying to tmp table insert into processlist_query_log\n (server_id, stamp, id, user, host, db, time, info)\n 0.000
36421610 root 10.64.32.25 tendril Connect 10 Copying to tmp table insert into processlist_query_log\n (server_id, stamp, id, user, host, db, time, info)\n 0.000
36423267 root 10.64.32.25 tendril Connect 2 Copying to tmp table insert into processlist_query_log\n (server_id, stamp, id, user, host, db, time, info)\n 0.000
36429799 root 10.64.32.25 tendril Connect 0 Copying to tmp table insert into strings (string)\n select lower(concat( NAME_CONST('con_prefix',_latin1' 0.000
```
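With several copies of the same event piled up like that, one option is to generate KILL statements for all of them from information_schema (a sketch, not what was run here; the filter on the query text matches the rows shown above):
```
-- Build KILL statements for every stacked copy of the insert
-- (review the generated list before actually running it):
SELECT CONCAT('KILL ', ID, ';') AS kill_stmt
FROM information_schema.PROCESSLIST
WHERE DB = 'tendril'
  AND STATE = 'Copying to tmp table'
  AND INFO LIKE 'insert into processlist_query_log%';
```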
For some reason that `processlist_query_log` table is getting a lot more data, and the `select` under that `insert` takes ages to complete (see the EXPLAIN sketch after the listing below):
```
-rw-rw---- 1 mysql mysql 11G May 6 00:56 processlist_query_log#P#p737915.ibd
-rw-rw---- 1 mysql mysql 16G May 7 00:56 processlist_query_log#P#p737916.ibd
-rw-rw---- 1 mysql mysql 16G May 8 00:57 processlist_query_log#P#p737917.ibd
-rw-rw---- 1 mysql mysql 15G May 9 01:01 processlist_query_log#P#p737918.ibd
-rw-rw---- 1 mysql mysql 9.2G May 10 06:27 processlist_query_log#P#p737919.ibd
-rw-rw---- 1 mysql mysql 48M May 10 06:47 processlist_query_log#P#p737920.ibd
```
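To see why the select behind that insert degrades so badly as the table grows, its plan can be checked in isolation (a sketch; `@server_id` is normally set by the event, the value below is just a placeholder):
```
-- Placeholder value; use a real monitored server_id here:
SET @server_id = 1;

-- Plan of the select that feeds the insert (query text as captured above):
EXPLAIN
select p.server_id, now(), p.id, p.user, p.host, p.db, p.time, p.info
  from processlist p
  left join processlist_query_log q
    on q.server_id = @server_id
   and p.id = q.id
   and p.user = q.user
   and p.host = q.host
   and p.db = q.db
   and p.info = q.info
   and p.time > q.time
   and q.stamp > now() - interval 7 day
 where p.server_id = @server_id
   and p.command = 'Query'
   and q.server_id is null;
```
If the plan shows it scanning large parts of `processlist_query_log` for every row of `processlist`, that would explain why the runtime explodes together with the partition sizes.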
After truncating `processlist_query_log` again, I stopped the event_scheduler to let all the threads finish. Once it was fully clean, I started it again, but the spike was massive and the host stalled again.
Finally, I restarted MySQL.
The CPU graph is scary, with a sudden spike:
{F31810767}
After restarting MySQL, the server went back to crazy values and had around 1200 connections again.
I am still investigating.
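One angle for that investigation would be to see which monitored servers are contributing the extra rows (a sketch; the column names are the ones from the insert above, and this query will itself be slow while the table is huge):
```
-- Rows written to processlist_query_log in the last 24h, per monitored server:
SELECT server_id, COUNT(*) AS rows_24h
FROM tendril.processlist_query_log
WHERE stamp > now() - INTERVAL 1 DAY
GROUP BY server_id
ORDER BY rows_24h DESC
LIMIT 20;
```
If one or two server_ids dominate, that would point at the hosts whose queries are inflating the log.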