Page MenuHomePhabricator

Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st
Closed, ResolvedPublic

Description

Screen Shot 2020-09-29 at 12.04.06 PM.png (943×1 px, 165 KB)

After Sept 21st there was an increase in usage of /var/lib/mysql on an-coord1001, the partition that we use for the analytics-meta mariadb instance (containing dbs for oozie, hue, hive, etc..)

The partition is on a LVM volume, in theory we could shrink the /srv partition (underused) and expand the mysql one, but the growth need to be investigated.

Most of the partition usage is currently related to binlog files.

Event Timeline

elukey triaged this task as High priority.Sep 29 2020, 10:06 AM
elukey created this task.

The first suspicion is Hue Next, that is what we added recently. I asked to Jaime if it was possible to get a snapshot of the dbs of analytics-meta today and a week ago, and something interesting popped up:

Current backups:

+-----------+-----------+------------------------------------------+------------+---------------------+------------------+
| backup_id | file_path | file_name                                | size       | file_date           | backup_object_id |
+-----------+-----------+------------------------------------------+------------+---------------------+------------------+
|      8029 |           | oozie.WF_JOBS.sql.gz                     | 1349653582 | 2020-09-29 02:09:26 |             NULL |
|      8029 |           | oozie.WF_ACTIONS.sql.gz                  |  177362024 | 2020-09-29 02:08:37 |             NULL |
|      8029 |           | oozie.COORD_ACTIONS.sql.gz               |   77834639 | 2020-09-29 02:08:22 |             NULL |
|      8029 |           | hive_metastore.SDS.sql.gz                |   58722913 | 2020-09-29 02:08:49 |             NULL |
|      8029 |           | hive_metastore.PARTITION_KEY_VALS.sql.gz |   48340888 | 2020-09-29 02:08:07 |             NULL |
|      8029 |           | hive_metastore.PARTITIONS.sql.gz         |   42841071 | 2020-09-29 02:08:14 |             NULL |
|      8029 |           | druid.druid_tasklogs.sql.gz              |   37087037 | 2020-09-29 02:08:39 |             NULL |
|      8029 |           | hive_metastore.PARTITION_PARAMS.sql.gz   |   28298782 | 2020-09-29 02:08:03 |             NULL |
|      8029 |           | druid.druid_tasks.sql.gz                 |   21349730 | 2020-09-29 02:08:42 |             NULL |
|      8029 |           | druid.druid_segments.sql.gz              |   12819252 | 2020-09-29 02:08:05 |             NULL |
|      8029 |           | hive_metastore.SERDES.sql.gz             |   12429097 | 2020-09-29 02:08:05 |             NULL |
|      8029 |           | hive_metastore.SERDE_PARAMS.sql.gz       |   11757990 | 2020-09-29 02:08:06 |             NULL |
|      8029 |           | superset_production.logs.sql.gz          |   11578497 | 2020-09-29 02:08:30 |             NULL |
|      8029 |           | oozie.SLA_SUMMARY.sql.gz                 |   10557551 | 2020-09-29 02:08:15 |             NULL |
|      8029 |           | oozie.SLA_EVENTS.sql.gz                  |   10499728 | 2020-09-29 02:08:17 |             NULL |
|      8029 |           | hue.desktop_document2.sql.gz             |    7306235 | 2020-09-29 02:08:15 |             NULL |
|      8029 |           | oozie.SLA_REGISTRATION.sql.gz            |    4758795 | 2020-09-29 02:08:15 |             NULL |
|      8029 |           | druid_public_eqiad.druid_tasklogs.sql.gz |    2301190 | 2020-09-29 02:07:57 |             NULL |
|      8029 |           | oozie.COORD_JOBS.sql.gz                  |    2015250 | 2020-09-29 02:08:07 |             NULL |
|      8029 |           | search_airflow.log.sql.gz                |    1368608 | 2020-09-29 02:08:17 |             NULL |

Last week's backups:

+-----------+-----------+------------------------------------------+-----------+---------------------+------------------+
| backup_id | file_path | file_name                                | size      | file_date           | backup_object_id |
+-----------+-----------+------------------------------------------+-----------+---------------------+------------------+
|      7904 |           | oozie.WF_JOBS.sql.gz                     | 644448864 | 2020-09-22 02:09:33 |             NULL |
|      7904 |           | oozie.WF_ACTIONS.sql.gz                  | 129797232 | 2020-09-22 02:09:10 |             NULL |
|      7904 |           | oozie.COORD_ACTIONS.sql.gz               |  74858596 | 2020-09-22 02:09:03 |             NULL |
|      7904 |           | hive_metastore.SDS.sql.gz                |  58095990 | 2020-09-22 02:09:23 |             NULL |
|      7904 |           | hive_metastore.PARTITION_KEY_VALS.sql.gz |  47398588 | 2020-09-22 02:08:45 |             NULL |
|      7904 |           | hive_metastore.PARTITIONS.sql.gz         |  42035241 | 2020-09-22 02:08:47 |             NULL |
|      7904 |           | druid.druid_tasklogs.sql.gz              |  37087037 | 2020-09-22 02:09:18 |             NULL |
|      7904 |           | hive_metastore.PARTITION_PARAMS.sql.gz   |  26817649 | 2020-09-22 02:08:46 |             NULL |
|      7904 |           | druid.druid_tasks.sql.gz                 |  21099915 | 2020-09-22 02:09:13 |             NULL |
|      7904 |           | druid.druid_segments.sql.gz              |  12697421 | 2020-09-22 02:08:44 |             NULL |
|      7904 |           | hive_metastore.SERDES.sql.gz             |  12162637 | 2020-09-22 02:09:03 |             NULL |
|      7904 |           | hive_metastore.SERDE_PARAMS.sql.gz       |  11548845 | 2020-09-22 02:08:55 |             NULL |
|      7904 |           | superset_production.logs.sql.gz          |  11284161 | 2020-09-22 02:09:08 |             NULL |
|      7904 |           | oozie.SLA_SUMMARY.sql.gz                 |  10404646 | 2020-09-22 02:08:57 |             NULL |
|      7904 |           | oozie.SLA_EVENTS.sql.gz                  |  10342002 | 2020-09-22 02:08:54 |             NULL |
|      7904 |           | hue.desktop_document2.sql.gz             |   7209576 | 2020-09-22 02:08:56 |             NULL |
|      7904 |           | oozie.SLA_REGISTRATION.sql.gz            |   4691577 | 2020-09-22 02:08:52 |             NULL |
|      7904 |           | druid_public_eqiad.druid_tasklogs.sql.gz |   2301190 | 2020-09-22 02:08:36 |             NULL |
|      7904 |           | oozie.COORD_JOBS.sql.gz                  |   2024410 | 2020-09-22 02:08:47 |             NULL |
|      7904 |           | search_airflow.log.sql.gz                |   1336718 | 2020-09-22 02:08:58 |             NULL |
+-----------+-----------+------------------------------------------+-----------+---------------------+------------------+

oozie.WF_JOBS increased a lot, and given that the binlog files also grew, maybe there is something that inserts a lot to that table?

Ok I think I have a lead - I see a ton of entries for the recent days about actions like loop_mark_hour_done_21 and they seem related to the bulk load of the pagecount-ez (from what I can see from the refinery).

@fdans let's follow up on this, if it is temporary we can avoid other measures like expanding the partitions etc.. Basically the bulk load is causing oozie to log a ton of inserts to the db, growing its usage etc..

My 2 cents on that one: Oozie has a setting about how long it keeps historical information for workflows/coords/bundles. I imagine we can manually tweak it to drop recent info, but that would mean loosing possibly interesting data from other jobs.
Another approach is to manually query Mysql oozie table for workflows and drop the finished one from backfilling.

My 2 cents on that one: Oozie has a setting about how long it keeps historical information for workflows/coords/bundles. I imagine we can manually tweak it to drop recent info, but that would mean loosing possibly interesting data from other jobs.
Another approach is to manually query Mysql oozie table for workflows and drop the finished one from backfilling.

+1 on reviewing historical data that we keep

The other problem is that ~40G out of those 50+ used is related to the mariadb's binlog, since we keep it for two weeks and it contains all changes to the DB. In this case the backfilling job is causing the oozie db to grow (but in the order of some GBs) and also the binlog to grow a lot as well due to the intense activity. I was asking to Fran how long is the backfilling job going to run since less traffic from it will mean less data on the binlog, and a gradual data size reduction in the mariadb partition as old binlog data gets dropped.

Mentioned in SAL (#wikimedia-analytics) [2020-09-30T05:47:13Z] <elukey> "PURGE BINARY LOGS BEFORE '2020-09-22 00:00:00';" on an-coord1001's mariadb - T264081

I had a chat with Francisco and the bulk load should in theory last another 3/4 days (there is currently an issue under investigation though, so bulk loading is stopped atm). I just reduced the binlogs to a week via PURGE command, it should suffice to avoid having disk space issues again. We should nonetheless find a more long term solution like expanding the ext4 partition for /var/lib/mysql and reduce the /srv one.

Mentioned in SAL (#wikimedia-analytics) [2020-10-05T12:35:32Z] <elukey> execute "PURGE BINARY LOGS BEFORE '2020-09-28 00:00:00';" on an-coord1001's mysql to free space - T264081

Since for the remaining data we won't be using a loop (hourly job), this won't be a concern. The remaining backfilling, which I'm testing right now, will be happening over the next week.

elukey claimed this task.

It seems way more stable now, closing for the moment :)