Page MenuHomePhabricator

Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th
Closed, ResolvedPublic

Description

There seems to be a change in growth rate of data stored in /var/lib/mysql on an-coord1001 since Apr 14th at around 20 UTC (more or less):

https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&from=1618350142055&to=1618583383311

The majority of the size seems to be due to the hive_metastore database:

elukey@an-coord1001:/var/lib/mysql$ sudo du -hs * | sort -h | tail
1.1G	analytics-meta-bin.017978
1.1G	analytics-meta-bin.017979
1.1G	analytics-meta-bin.017980
1.1G	analytics-meta-bin.017981
1.1G	analytics-meta-bin.017982
1.1G	analytics-meta-bin.017983
1.1G	analytics-meta-bin.017984
4.1G	druid
5.9G	oozie
6.6G	hive_metastore
MariaDB [(none)]> SELECT table_schema AS `Database`, table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC LIMIT 10;
+----------------+--------------------+------------+
| Database       | Table              | Size in MB |
+----------------+--------------------+------------+
| hive_metastore | SDS                |    2103.20 |
| druid          | druid_tasks        |    1817.69 |
| oozie          | WF_JOBS            |    1806.61 |
| hive_metastore | PARTITION_KEY_VALS |    1386.91 |
| oozie          | WF_ACTIONS         |    1162.08 |
| hive_metastore | PARTITIONS         |    1019.45 |
| hive_metastore | PARTITION_PARAMS   |     839.81 |
| oozie          | COORD_ACTIONS      |     776.83 |
| druid          | druid_tasklogs     |     751.72 |
| hive_metastore | SERDES             |     583.47 |
+----------------+--------------------+------------+

Event Timeline

elukey triaged this task as High priority.Apr 16 2021, 2:32 PM
elukey created this task.

Hm, could this be related to T280293: Delete UpperCased eventlogging legacy directories in /wmf/data/event 90 days from 2021-04-15 (after 2021-07-14)? It turns out that the Re-refine of data since the 14th that I did yesterday would have duplicated all the partition directories in HDFS. However, the Hive metastore shouldn't have had extra information added; all that happened there was the partition was dropped and re-added with a different location path. So the number of partitions Hive is keeping track of should be the same.

Oh, is that the MySQL binlog? If so, that makes sense. There would have been a lot of drop/add partitions executed yesterday during the re-refine, which would have resulted in a lot more statemetns saved in the binlog. It does look like yesterday Apr 15 has more rotated binlogs than any other day.

If it is the binlog we can definitely drop some of it (PURGE BINARY LOGS BEFORE etc..), we keep 14 days IIRC and everything should already be replicated to db1108 and an-coord1002

elukey@an-coord1001:~$ sudo ls -lht /var/lib/mysql/ | grep '\-bin'
-rw-rw---- 1 mysql mysql 811M Apr 16 15:02 analytics-meta-bin.017985
-rw-rw---- 1 mysql mysql 1.1G Apr 16 11:31 analytics-meta-bin.017984
-rw-rw---- 1 mysql mysql 1008 Apr 16 11:31 analytics-meta-bin.index
-rw-rw---- 1 mysql mysql 1.1G Apr 16 06:57 analytics-meta-bin.017983
-rw-rw---- 1 mysql mysql 1.1G Apr 16 02:11 analytics-meta-bin.017982
-rw-rw---- 1 mysql mysql 1.1G Apr 15 21:36 analytics-meta-bin.017981
-rw-rw---- 1 mysql mysql 1.1G Apr 15 17:05 analytics-meta-bin.017980
-rw-rw---- 1 mysql mysql 1.1G Apr 15 12:34 analytics-meta-bin.017979
-rw-rw---- 1 mysql mysql 1.1G Apr 15 08:46 analytics-meta-bin.017978
-rw-rw---- 1 mysql mysql 1.1G Apr 15 04:26 analytics-meta-bin.017977
-rw-rw---- 1 mysql mysql 1.1G Apr 15 00:10 analytics-meta-bin.017976
-rw-rw---- 1 mysql mysql 1.1G Apr 14 19:53 analytics-meta-bin.017975
-rw-rw---- 1 mysql mysql 1.1G Apr 14 15:39 analytics-meta-bin.017974
-rw-rw---- 1 mysql mysql 1.1G Apr 14 05:54 analytics-meta-bin.017973
-rw-rw---- 1 mysql mysql 1.1G Apr 13 17:32 analytics-meta-bin.017972
-rw-rw---- 1 mysql mysql 1.1G Apr 13 04:42 analytics-meta-bin.017971
-rw-rw---- 1 mysql mysql 1.1G Apr 12 16:19 analytics-meta-bin.017970
-rw-rw---- 1 mysql mysql 1.1G Apr 12 03:27 analytics-meta-bin.017969
-rw-rw---- 1 mysql mysql 1.1G Apr 11 14:50 analytics-meta-bin.017968
-rw-rw---- 1 mysql mysql 1.1G Apr 11 01:47 analytics-meta-bin.017967
-rw-rw---- 1 mysql mysql 1.1G Apr 10 13:18 analytics-meta-bin.017966
-rw-rw---- 1 mysql mysql 1.1G Apr 10 00:33 analytics-meta-bin.017965
-rw-rw---- 1 mysql mysql 1.1G Apr  9 11:49 analytics-meta-bin.017964
-rw-rw---- 1 mysql mysql 1.1G Apr  8 23:18 analytics-meta-bin.017963
-rw-rw---- 1 mysql mysql 1.1G Apr  8 10:58 analytics-meta-bin.017962
-rw-rw---- 1 mysql mysql 1.1G Apr  7 23:44 analytics-meta-bin.017961
-rw-rw---- 1 mysql mysql 1.1G Apr  7 11:00 analytics-meta-bin.017960
-rw-rw---- 1 mysql mysql 1.1G Apr  6 22:43 analytics-meta-bin.017959
-rw-rw---- 1 mysql mysql 1.1G Apr  6 09:58 analytics-meta-bin.017958
-rw-rw---- 1 mysql mysql 1.1G Apr  6 00:50 analytics-meta-bin.017957
-rw-rw---- 1 mysql mysql 1.1G Apr  5 13:36 analytics-meta-bin.017956
-rw-rw---- 1 mysql mysql 1.1G Apr  5 00:54 analytics-meta-bin.017955
-rw-rw---- 1 mysql mysql 1.1G Apr  4 12:20 analytics-meta-bin.017954
-rw-rw---- 1 mysql mysql 1.1G Apr  3 23:52 analytics-meta-bin.017953
-rw-rw---- 1 mysql mysql 1.1G Apr  3 11:08 analytics-meta-bin.017952
-rw-rw---- 1 mysql mysql 1.1G Apr  2 23:24 analytics-meta-bin.017951
-rw-rw---- 1 mysql mysql 1.1G Apr  2 13:11 analytics-meta-bin.017950

Yep it seems the binlog, see how many files for Apr 14/15+ vs the other days..

Mentioned in SAL (#wikimedia-analytics) [2021-04-16T15:14:10Z] <elukey> execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367

/dev/mapper/an--coord1001--vg-mysql   59G   42G   18G  71% /var/lib/mysql

We'll have to keep this monitored during the weekend, but we should be good. To be noted: I executed the purge command after set sql_log_bin=0.

When we'll reimage an-coord1001 we'll also consolidate the partitions into /srv, and we'll have more room to absorb these events:

/dev/mapper/an--coord1001--vg-srv    102G   48G   54G  48% /srv
/dev/mapper/an--coord1001--vg-mysql   59G   42G   18G  71% /var/lib/mysql

Mentioned in SAL (#wikimedia-analytics) [2021-04-18T07:25:00Z] <elukey> run "PURGE BINARY LOGS BEFORE '2021-04-11 00:00:00';" on an-coord1001 to free some space - T280367

Let's leave this open for a day more, but the progression of the daily binlog files looks shrinking, so we should be good.

Mentioned in SAL (#wikimedia-analytics) [2021-04-20T12:40:11Z] <elukey> PURGE BINARY LOGS BEFORE '2021-04-12 00:00:00'; on an-coord1001 - T280367

elukey claimed this task.

Executed another purge just to free some space, the new binary logs are back to normal number/size.

@razzi please review this task and let me know if anything is unclear :)