
tendril_purge_global_status_log_5m and global_status_log need more frequent purging
Closed, Resolved · Public

Description

During the troubleshooting that happened at T252324 (Tendril mysql is stalling), I found that the table global_status_log_5m still had data from 2015:

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2015-06-19 11:40:00 | 2020-05-10 10:40:00 |
+---------------------+---------------------+
1 row in set (0.11 sec)

mysql:root@localhost [tendril]> select @stamp := now() - interval 7 day;
+----------------------------------+
| @stamp := now() - interval 7 day |
+----------------------------------+
| 2020-05-03 11:00:35              |
+----------------------------------+
1 row in set (0.00 sec)

mysql:root@localhost [tendril]> select * from global_status_log_5m where stamp < @stamp limit 2;
+-----------+---------------------+---------+-------------+
| server_id | stamp               | name_id | value       |
+-----------+---------------------+---------+-------------+
|      1082 | 2015-06-19 11:40:00 |     352 | 13561583583 |
|      1209 | 2015-06-19 11:40:00 |     348 |           0 |
+-----------+---------------------+---------+-------------+
2 rows in set (0.10 sec)

That table was causing contention on tendril, as it had grown to an insane number of rows:

mysql:root@localhost [(none)]> show explain for 291612;
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
| id   | select_type | table                | type  | possible_keys | key  | key_len | ref  | rows        | Extra       |
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
|    1 | SIMPLE      | global_status_log_5m | index | NULL          | i2   | 8       | NULL | 87877089565 | Using index |
+------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+
1 row in set, 1 warning (0.00 sec)

There is an event that is supposed to purge anything older than 7 days, but it wasn't working (my guess is simply that it is unable to get a lock there):

mysql:root@localhost [tendril]> show create event tendril_purge_global_status_log_5m;
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Event                              | sql_mode | time_zone | Create Event                                                                                                                                                                   |
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tendril_purge_global_status_log_5m |          | SYSTEM    | CREATE DEFINER=`root`@`208.80.154.151` EVENT `tendril_purge_global_status_log_5m` ON SCHEDULE EVERY 5 MINUTE STARTS '2014-05-23 00:00:00' ON COMPLETION NOT PRESERVE ENABLE DO begin

    if (get_lock('tendril_purge_global_status_log_5m', 1) = 0) then
      signal sqlstate value '45000' set message_text = 'get_lock';
    end if;

    select @stamp := now() - interval 7 day;
    delete from global_status_log_5m where stamp < @stamp limit 10000;

    do release_lock('tendril_purge_global_status_log_5m');
  end |
+------------------------------------+----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Even if it worked, it would probably never have finished, as the table was insanely big.
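
For scale: at 10000 rows per run and one run every 5 minutes, the event can delete at most 10000 × 288 ≈ 2.9M rows per day, while the optimizer estimate above puts the table at roughly 88 billion rows, so even a working event would have needed decades to catch up, and that is before counting new inserts.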

We should probably make sure it works, or take a more radical approach, something along the lines of the following (see the sketch after this list):

  • Stop the event scheduler
  • Wait for the processlist to be clean
  • Perform the delete + optimize
  • Start the event scheduler
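
A minimal sketch of that manual procedure, assuming a root session during a quiet window (the 7-day retention mirrors the existing event, and OPTIMIZE rebuilds the table, so it needs enough free disk space):

    -- Stop the event scheduler so no event competes for locks on the table
    SET GLOBAL event_scheduler = OFF;

    -- (wait here until SHOW PROCESSLIST no longer shows event threads)

    -- Purge everything older than 7 days, then reclaim the space
    DELETE FROM tendril.global_status_log_5m WHERE stamp < NOW() - INTERVAL 7 DAY;
    OPTIMIZE TABLE tendril.global_status_log_5m;

    -- Start the event scheduler again
    SET GLOBAL event_scheduler = ON;

In practice, with a table this size, a single DELETE may itself never finish, so batching it or simply truncating the table (losing history we do not use anyway) may be the only realistic options.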

The table global_status_log was also massive for holding just a month's worth of data, so we might want to purge it more frequently. It doesn't make any sense to keep that many days of data:

mysql:root@localhost [tendril]> show explain for 283543;
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows        | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 10164518550 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+-------------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-04-11 00:00:00 | 2020-05-10 10:44:52 |
+---------------------+---------------------+
1 row in set (0.01 sec)

So I think we should maybe also set up a weekly cronjob purging that table. We don't use it for anything historical anyway, so it doesn't make sense to let it grow that much.
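
Whatever the scheduling mechanism ends up being, such a job would essentially just run a statement along these lines (the 7-day retention here is only an example; as the timeline below shows, this was eventually handled with a MariaDB event instead of a cron job):

    -- illustrative weekly purge of rows we never query again
    DELETE FROM tendril.global_status_log WHERE stamp < NOW() - INTERVAL 7 DAY;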

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

Setting it to High as this was (probably) the cause of two recent incidents on tendril.

root@db1115.eqiad.wmnet[tendril]> select count(*) from global_status_log_5m;
+-----------+
| count(*)  |
+-----------+
| 331934990 |
+-----------+
1 row in set (1 min 53.61 sec)

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-05-10 18:20:00 | 2020-05-20 11:40:00 |
+---------------------+---------------------+
1 row in set (0.02 sec)

I am going to change the event so that we purge data older than 2 days. It doesn't make any sense to keep 7 days there; we don't really use it for anything.
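
In terms of the event body shown in the description, that is essentially just the retention change below (per the SAL entry further down, the merged patch at r597535 also raises the per-run row limit; the exact value is not shown here):

    select @stamp := now() - interval 2 day;  -- keep 2 days instead of 7
    delete from global_status_log_5m where stamp < @stamp limit 10000;  -- limit also raised in the merged patch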

Regarding global_status_log, the numbers are also crazy for just the 10 days since I purged it:

mysql:root@localhost [tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-05-10 12:18:14 | 2020-05-20 11:42:58 |
+---------------------+---------------------+
1 row in set (0.77 sec)

mysql:root@localhost [(none)]> show explain for 68836553;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 3531813949 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.00 sec)

I think I am going to create an event similar to the one for global_status_log_5m and enable it to purge data older than 2 days too.

Change 597535 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard: Change tendril_purge_global_status_log_5m

https://gerrit.wikimedia.org/r/597535

Change 597535 merged by Marostegui:
[operations/software/tendril@master] dashboard: Change tendril_purge_global_status_log_5m

https://gerrit.wikimedia.org/r/597535

Mentioned in SAL (#wikimedia-operations) [2020-05-20T14:43:30Z] <marostegui> Replace tendril_purge_global_status_log_5m event with the new one (purging every 2d of data and with a higher limit of rows) - T252331

Mentioned in SAL (#wikimedia-operations) [2020-05-21T07:22:27Z] <marostegui> Purge events from tendril.global_status_log older than 24h - T252331

Change 597792 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard.sql: Remove limit clause

https://gerrit.wikimedia.org/r/597792

Mentioned in SAL (#wikimedia-operations) [2020-05-22T10:10:55Z] <marostegui> Stop event_scheduler on db1115 - T252331

Change 597792 merged by Marostegui:
[operations/software/tendril@master] dashboard.sql: Change storing time

https://gerrit.wikimedia.org/r/597792

Mentioned in SAL (#wikimedia-operations) [2020-05-22T15:06:29Z] <marostegui> Decrease tendril_purge_global_status_log_5m storing rows time from 2 days to 1 day T252331

With the latest changes it looks like we have global_status_log_5m under control; it doesn't grow beyond 35-40M rows.
I am now going to spend time on the other big one: global_status_log.

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:02:19Z] <marostegui> Stop event scheduler on tendril T252331

This is approximately what global_status_log has accumulated since 10th May (15 days):

mysql:root@localhost [(none)]> show explain  for 102846891;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 4962983672 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.00 sec)

It is storing around 1.5M rows every 5 minutes, so I am going to create an event to also purge that table every 5 minutes, like we do with global_status_log_5m.
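
A sketch of such an event, modeled on the tendril_purge_global_status_log_5m definition in the description. The lock name, retention and row limit here are illustrative assumptions; the merged change (r598910, further down) contains the actual definition:

    DELIMITER ;;
    CREATE EVENT tendril_purge_global_status_log
    ON SCHEDULE EVERY 5 MINUTE
    ON COMPLETION NOT PRESERVE ENABLE
    DO begin
      -- skip this run if a previous one still holds the lock
      if (get_lock('tendril_purge_global_status_log', 1) = 0) then
        signal sqlstate value '45000' set message_text = 'get_lock';
      end if;

      -- keep roughly 24h of data; delete in bounded chunks per run
      select @stamp := now() - interval 1 day;
      delete from global_status_log where stamp < @stamp limit 100000;

      do release_lock('tendril_purge_global_status_log');
    end ;;
    DELIMITER ;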

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:48:27Z] <marostegui> Stop event scheduler on db1115 (tendril) - T252331

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:54:38Z] <marostegui> Install a new tendril_purge_global_status_log event on db1115 (tendril) T252331

I have been playing with an event similar to the one that purges global_status_log_5m, with roughly the same options, and it seems to keep the global_status_log table under control at around 370M records, keeping only 24h of data.
That is still a lot of rows, but way better than the ~4 billion rows we accumulated since I truncated it a few days ago, and definitely better than the 10164518550 rows we had before.
I am going to give it a few more hours before sending the patch with the event, just in case I need to tweak something else, but so far it looks like we have the most problematic tables under control and they won't grow beyond what they accumulate during 24h.

Change 598910 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software/tendril@master] dashboard.sql: Create tendril_purge_global_status_log event

https://gerrit.wikimedia.org/r/598910

Change 598910 merged by Marostegui:
[operations/software/tendril@master] dashboard.sql: Create tendril_purge_global_status_log event

https://gerrit.wikimedia.org/r/598910

Marostegui claimed this task.

So this has been addressed by adding two events that purge the global_status_log_5m and global_status_log tables and keep them under control, with a stable, limited number of rows per day.

global_status_log has grown a lot already: 273G total
global_status_log_5m seems "under" control: 56G total

I will purge them tomorrow anyway.
Maybe the purging needs to be more aggressive.
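
For reference, on-disk totals like those can be pulled with a standard information_schema query (a generic example, not necessarily the exact command used here):

    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS total_gb
    FROM information_schema.tables
    WHERE table_schema = 'tendril'
      AND table_name IN ('global_status_log', 'global_status_log_5m');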

So the purging is actually working, and global_status_log_5m is more or less under control, with numbers similar to the ones at T252331#6165038:

root@db1115.eqiad.wmnet[tendril]> select min(stamp),max(stamp) from global_status_log_5m;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-09-21 12:55:00 | 2020-09-22 13:00:00 |
+---------------------+---------------------+
1 row in set (3 min 21.620 sec)

root@db1115.eqiad.wmnet[tendril]> select min(stamp),max(stamp) from global_status_log;
+---------------------+---------------------+
| min(stamp)          | max(stamp)          |
+---------------------+---------------------+
| 2020-08-28 13:31:55 | 2020-09-22 13:04:05 |
+---------------------+---------------------+
1 row in set (7 min 57.478 sec)
root@db1115.eqiad.wmnet[tendril]> select count(*) from global_status_log_5m; 
+----------+
| count(*) |
+----------+
| 52435391 |
+----------+

global_status_log also shows values similar to the ones it used to have (T252331#6162637) if we extrapolate the numbers.

root@db1115.eqiad.wmnet[tendril]> show explain for 99316091;
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
| id   | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows       | Extra       |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
|    1 | SIMPLE      | global_status_log | index | NULL          | PRIMARY | 24      | NULL | 9485476946 | Using index |
+------+-------------+-------------------+-------+---------------+---------+---------+------+------------+-------------+
1 row in set, 1 warning (0.001 sec)

So maybe it was just a glitch. I will purge it tomorrow anyway; it is not worth keeping all that.
I remember playing around with more aggressive purging, and it was not worth the time/extra load.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T05:37:53Z] <marostegui> Purge global_status_log table on tendril - T252331