
tendril_purge_global_status_log_5m and global_status_log need more frequent purging
Closed, Resolved · Public

Description

During the troubleshooting that happened at T252324 (Tendril mysql is stalling), I found that the table global_status_log_5m still had data from 2015:

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2015-06-19 11:40:00 | 2020-05-10 10:40:00 |
+---------------------+---------------------+
1 row in set (0.11 sec)

mysql:root@localhost [tendril]> select @stamp := now() - interval 7 day;
+----------------------------------+
| @stamp := now() - interval 7 day |
+----------------------------------+
| 2020-05-03 11:00:35              |
+----------------------------------+
1 row in set (0.00 sec)

mysql:root@localhost [tendril]> select * from global_status_log_5m where stamp < @stamp limit 2;
+-----------+---------------------+---------+-------------+
| server_id | stamp               | name_id | value       |
+-----------+---------------------+---------+-------------+
|      1082 | 2015-06-19 11:40:00 |     352 | 13561583583 |
|      1209 | 2015-06-19 11:40:00 |     348 |           0 |
+-----------+---------------------+---------+-------------+
2 rows in set (0.10 sec)

That table was causing contention on tendril, as it had grown to an insane number of rows:

mysql:root@localhost [(none)]> show explain for 291612;
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
| id   | select_type | table                | type  | possible_keys | key  | key_len | ref  | rows        | Extra       |
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
|    1 | SIMPLE      | global_status_log_5m | index | NULL          | i2   | 8       | NULL | 87877089565 | Using index |
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
1 row in set, 1 warning (0.00 sec)

There is an event that is supposed to purge anything older than 7 days, but it wasn't working (my guess is simply that it is unable to get a lock there):

mysql:root@localhost [tendril]> show create event tendril_purge_global_status_log_5m;
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Event                              | sql_mode | time_zone | Create Event                                                                                                                                                                   |
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tendril_purge_global_status_log_5m |          | SYSTEM    | CREATE DEFINER=`root`@`208.80.154.151` EVENT `tendril_purge_global_status_log_5m` ON SCHEDULE EVERY 5 MINUTE STARTS '2014-05-23 00:00:00' ON COMPLETION NOT PRESERVE ENABLE DO begin

    if (get_lock('tendril_purge_global_status_log_5m', 1) = 0) then
      signal sqlstate value '45000' set message_text = 'get_lock';
    end if;

    select @stamp := now() - interval 7 day;
    delete from global_status_log_5m where stamp < @stamp limit 10000;

    do release_lock('tendril_purge_global_status_log_5m');
  end |
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Even if it worked, it would probably never have finished, as the table was insanely big.
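
For scale: at 10000 rows per run and one run every 5 minutes, the event can delete at most 10000 × 288 ≈ 2.9M rows per day, while the optimizer estimate above puts the table at roughly 88 billion rows, so even a working event would have needed decades to catch up, and that is before counting new inserts.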

We should probably make sure it works, or take a more radical approach, something along the lines of the following (see the sketch after this list):

  • Stop the event scheduler
  • Wait for the processlist to be clean
  • Perform the delete + optimize
  • Start the event scheduler
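
A minimal sketch of that manual procedure, assuming a root session during a quiet window (the 7-day retention mirrors the existing event, and OPTIMIZE rebuilds the table, so it needs enough free disk space):

    -- Stop the event scheduler so no event competes for locks on the table
    SET GLOBAL event_scheduler = OFF;

    -- (wait here until SHOW PROCESSLIST no longer shows event threads)

    -- Purge everything older than 7 days, then reclaim the space
    DELETE FROM tendril.global_status_log_5m WHERE stamp < NOW() - INTERVAL 7 DAY;
    OPTIMIZE TABLE tendril.global_status_log_5m;

    -- Start the event scheduler again
    SET GLOBAL event_scheduler = ON;

In practice, with a table this size, a single DELETE may itself never finish, so batching it or simply truncating the table (losing history we do not use anyway) may be the only realistic options.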

The table global_status_log was also massive for holding just a month's worth of data, so we might want to purge it more frequently. It doesn't make any sense to keep that many days of data:

mysql:root@localhost [tendril]> show explain for 283543;
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows        | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 10164518550 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-04-11 00:00:00 | 2020-05-10 10:44:52 |
+---------------------+---------------------+
1 row in set (0.01 sec)

So I think we should maybe also set up a weekly cronjob purging that table. We don't use it for anything historical anyway, so it doesn't make sense to let it grow that much.
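
Whatever the scheduling mechanism ends up being, such a job would essentially just run a statement along these lines (the 7-day retention here is only an example; as the timeline below shows, this was eventually handled with a MariaDB event instead of a cron job):

    -- illustrative weekly purge of rows we never query again
    DELETE FROM tendril.global_status_log WHERE stamp < NOW() - INTERVAL 7 DAY;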

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

Setting it to High as this was (probably) the cause of two recent incidents on tendril.

root@db1115.eqiad.wmnet[tendril]> select count(*) from global_status_log_5m;
+-----------+
| count(*)  |
+-----------+
| 331934990 |
+-----------+
1 row in set (1 min 53.61 sec)

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-05-10 18:20:00 | 2020-05-20 11:40:00 |
+---------------------+---------------------+
1 row in set (0.02 sec)

I am going to change the event so that we purge data older than 2 days. It doesn't make any sense to keep 7 days there; we don't really use it for anything.
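
In terms of the event body shown in the description, that is essentially just the retention change below (per the SAL entry further down, the merged patch at r597535 also raises the per-run row limit; the exact value is not shown here):

    select @stamp := now() - interval 2 day;  -- keep 2 days instead of 7
    delete from global_status_log_5m where stamp < @stamp limit 10000;  -- limit also raised in the merged patch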

Regarding global_status_log, the numbers are also crazy for just the 10 days since I purged it:

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-05-10 12:18:14 | 2020-05-20 11:42:58 |
+---------------------+---------------------+
1 row in set (0.77 sec)

mysql:root@localhost [(none)]> show explain for 68836553;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 3531813949 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.00 sec)

I think I am going to create an event similar to the one for global_status_log_5m and enable it to purge data older than 2 days too.

Change 597535 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard: Change tendril_purge_global_status_log_5m

https://gerrit.wikimedia.org/r/597535

Change 597535 merged by Marostegui:
[operations/software/tendril@master] dashboard: Change tendril_purge_global_status_log_5m

https://gerrit.wikimedia.org/r/597535

Mentioned in SAL (#wikimedia-operations) [2020-05-20T14:43:30Z] <marostegui> Replace tendril_purge_global_status_log_5m event with the new one (purging every 2d of data and with a higher limit of rows) - T252331

Mentioned in SAL (#wikimedia-operations) [2020-05-21T07:22:27Z] <marostegui> Purge events from tendril.global_status_log older than 24h - T252331

Change 597792 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard.sql: Remove limit clause

https://gerrit.wikimedia.org/r/597792

Mentioned in SAL (#wikimedia-operations) [2020-05-22T10:10:55Z] <marostegui> Stop event_scheduler on db1115 - T252331

Change 597792 merged by Marostegui:
[operations/software/tendril@master] dashboard.sql: Change storing time

https://gerrit.wikimedia.org/r/597792

Mentioned in SAL (#wikimedia-operations) [2020-05-22T15:06:29Z] <marostegui> Decrease tendril_purge_global_status_log_5m storing rows time from 2 days to 1 day T252331

With the latest changes it looks like we have global_status_log_5m under control; it doesn't grow beyond 35-40M rows.
I am now going to spend time on the other big one: global_status_log.

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:02:19Z] <marostegui> Stop event scheduler on tendril T252331

This is approximately what global_status_log has accumulated since 10th May (15 days):

mysql:root@localhost [(none)]> show explain  for 102846891;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 4962983672 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.00 sec)

It is storing around 1.5M rows every 5 minutes, so I am going to create an event to also purge that table every 5 minutes, like we do with global_status_log_5m.
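
A sketch of such an event, modeled on the tendril_purge_global_status_log_5m definition in the description. The lock name, retention and row limit here are illustrative assumptions; the merged change (r598910, further down) contains the actual definition:

    DELIMITER ;;
    CREATE EVENT tendril_purge_global_status_log
    ON SCHEDULE EVERY 5 MINUTE
    ON COMPLETION NOT PRESERVE ENABLE
    DO begin
      -- skip this run if a previous one still holds the lock
      if (get_lock('tendril_purge_global_status_log', 1) = 0) then
        signal sqlstate value '45000' set message_text = 'get_lock';
      end if;

      -- keep roughly 24h of data; delete in bounded chunks per run
      select @stamp := now() - interval 1 day;
      delete from global_status_log where stamp < @stamp limit 100000;

      do release_lock('tendril_purge_global_status_log');
    end ;;
    DELIMITER ;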

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:48:27Z] <marostegui> Stop event scheduler on db1115 (tendril) - T252331

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:54:38Z] <marostegui> Install a new tendril_purge_global_status_log event on db1115 (tendril) T252331

I have been playing with an event similar to the one that purges global_status_log_5m, with roughly the same options, and it seems to keep the global_status_log table under control at around 370M records, keeping only 24h of data.
That is still a lot of rows, but way better than the ~4 billion rows we accumulated since I truncated it a few days ago, and definitely better than the 10164518550 rows we had before.
I am going to give it a few more hours before sending the patch with the event, just in case I need to tweak something else, but so far it looks like we have the most problematic tables under control and they won't grow beyond what they accumulate during 24h.

Change 598910 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard.sql: Create tendril_purge_global_status_log event

https://gerrit.wikimedia.org/r/598910

Change 598910 merged by Marostegui:
[operations/software/tendril@master] dashboard.sql: Create tendril_purge_global_status_log event

https://gerrit.wikimedia.org/r/598910

Marostegui claimed this task.

So this has been addressed by adding two events that purge the global_status_log_5m and global_status_log tables and keep them under control, with a stable, limited number of rows per day.

global_status_log has grown a lot already: 273G total
global_status_log_5m seems "under" control: 56G total

I will purge them tomorrow anyway.
Maybe the purging needs to be more aggressive.
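
For reference, on-disk totals like those can be pulled with a standard information_schema query (a generic example, not necessarily the exact command used here):

    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS total_gb
    FROM information_schema.tables
    WHERE table_schema = 'tendril'
      AND table_name IN ('global_status_log', 'global_status_log_5m');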

So the purging is actually working, and global_status_log_5m is more or less under control, with numbers similar to the ones at T252331#6165038:

root@db1115.eqiad.wmnet[tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-09-21 12:55:00 | 2020-09-22 13:00:00 |
+---------------------+---------------------+
1 row in set (3 min 21.620 sec)

root@db1115.eqiad.wmnet[tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-08-28 13:31:55 | 2020-09-22 13:04:05 |
+---------------------+---------------------+
1 row in set (7 min 57.478 sec)
root@db1115.eqiad.wmnet[tendril]> select count(*) from global_status_log_5m; 
+----------+
| count(*) |
+----------+
| 52435391 |
+----------+

global_status_log also shows values similar to the ones it used to have (T252331#6162637) if we extrapolate the numbers.

root@db1115.eqiad.wmnet[tendril]> show explain for 99316091;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 9485476946 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.001 sec)

So maybe it was just a glitch. I will purge it tomorrow anyway; it is not worth keeping all that.
I remember playing around with more aggressive purging, and it was not worth the time/extra load.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T05:37:53Z] <marostegui> Purge global_status_log table on tendril - T252331