Page MenuHomePhabricator

Cleanup heartbeat.heartbeat on all production instances
Open, MediumPublic

Description

Currently the heartbeat table on production database instances contains a lot of stale entries. E.g.:

root@db1083.eqiad.wmnet[(none)]> select * from heartbeat.heartbeat ORDER BY ts DESC;
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position   | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| 2020-11-20T13:12:25.001110 | 171970661 | db1083-bin.005754 |  707058317 | NULL                  |                NULL | s1    | eqiad      |
| 2020-10-29T08:02:21.000720 | 180363268 | db2112-bin.002375 |  568497678 | db1083-bin.005621     |            34053178 | s1    | codfw      |
| 2019-11-14T06:01:21.000750 | 171974720 | db1067-bin.003024 |  445994072 | NULL                  |                NULL | s1    | eqiad      |
| 2018-10-22T07:53:05.002070 | 180355171 | db2048-bin.004286 |  172403944 | db1067-bin.001687     |           371670049 | s1    | codfw      |
| 2018-07-18T06:01:44.001480 | 171970637 | db1052-bin.005945 |  479413784 | NULL                  |                NULL | s1    | eqiad      |
| 2017-05-10T13:29:31.001050 | 180359172 | db2016-bin.002977 |  503605243 | db1052-bin.004793     |           401461576 | s1    | codfw      |
| 2017-01-26T08:16:47.000790 | 171974683 | db1057-bin.002885 |  576316184 | db1052-bin.004556     |           210334207 | s1    | eqiad      |
| 2016-04-19T14:34:57.500710 |    103222 | db1052-bin.003635 |  253962980 | db2016-bin.001838     |           537698332 | s1    | NULL       |
| 2014-02-05T05:42:17.500660 |    103226 | db1056-bin.000672 |  400611588 | NULL                  |                NULL | NULL  | NULL       |
| 2013-04-17T19:13:49.500670 |     10166 | db1017-bin.001509 | 1015768234 | NULL                  |                NULL | NULL  | NULL       |
| 2013-03-25T22:13:51.501320 |    103221 | db1051-bin.000022 |  916971119 | db1017-bin.001426     |           168293056 | NULL  | NULL       |
| 2013-03-21T21:37:25.001370 |     10681 | db71-bin.000453   |  636529649 | db63-bin.000751       |           821645784 | NULL  | NULL       |
| 2013-03-21T21:37:25.000990 |     10673 | db63-bin.000751   |  821645784 | db1017-bin.001404     |           152420058 | NULL  | NULL       |
| 2013-03-09T15:14:02.500790 |     10679 | db69-bin.000022   |  791940369 | db63-bin.000679       |          1047555273 | NULL  | NULL       |
| 2013-03-03T23:16:13.501210 |     10646 | db36-bin.001811   |  618958721 | db63-bin.000658       |           581081709 | NULL  | NULL       |
| 2012-08-22T05:49:29.001100 |     10648 | db38-bin.000610   |   43354652 | NULL                  |                NULL | NULL  | NULL       |
| 2012-01-18T22:25:16.001070 |     10642 | db32-bin.000385   |  435653878 | NULL                  |                NULL | NULL  | NULL       |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
17 rows in set (0.001 sec)

This makes it very difficult to distinguish between entries that should no longer get updated, and entries that aren't getting updated that _should_ get updated.

Once the initial cleanup is done, we'll need alerting and potentially automation to ensure that cruft doesn't re-accumulate.

Event Timeline

Kormat created this task.Nov 20 2020, 1:13 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 20 2020, 1:13 PM

Let's backup them just in case- the reason why this was not done before is because the heartbeat table helps track past replication history in case a replication problem (e.g. an improper master switchover).

I have created a dump at: dbprov1003:/srv/backups/dumps/latest/heartbeat_tables.2020-11-20.tar.gz, so that is no longer a blocker.

Kormat updated the task description. (Show Details)Nov 20 2020, 1:43 PM
herron triaged this task as Medium priority.Nov 20 2020, 3:07 PM
Kormat updated the task description. (Show Details)Nov 20 2020, 3:12 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:04:48Z] <kormat> cleaning up heartbeat.heartbeat on pc1 T268336

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:09:14Z] <kormat> cleaning up heartbeat.heartbeat on pc2 T268336

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:10:02Z] <kormat> cleaning up heartbeat.heartbeat on pc3 T268336

Change 642379 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

Cleaning up the heartbeat tables in prod is a bit tricky, as there's a lot of cruft, and a mix of STATEMENT vs ROW replication. My suggestion is that we update stale rows to set ts to 0, instead of trying to delete the rows. That makes it easy to filter them out from queries.

LSobanski moved this task from Triage to Refine on the DBA board.Nov 24 2020, 12:48 PM

Change 642379 merged by Kormat:
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

Marostegui edited projects, added Orchestrator; removed SRE.Nov 25 2020, 9:54 AM

m1 table cleaned up