Page MenuHomePhabricator

Cleanup heartbeat.heartbeat on all production instances
Closed, ResolvedPublic

Description

Currently the heartbeat table on production database instances contains a lot of stale entries. E.g.:

root@db1083.eqiad.wmnet[(none)]> select * from heartbeat.heartbeat ORDER BY ts DESC;
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position   | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| 2020-11-20T13:12:25.001110 | 171970661 | db1083-bin.005754 |  707058317 | NULL                  |                NULL | s1    | eqiad      |
| 2020-10-29T08:02:21.000720 | 180363268 | db2112-bin.002375 |  568497678 | db1083-bin.005621     |            34053178 | s1    | codfw      |
| 2019-11-14T06:01:21.000750 | 171974720 | db1067-bin.003024 |  445994072 | NULL                  |                NULL | s1    | eqiad      |
| 2018-10-22T07:53:05.002070 | 180355171 | db2048-bin.004286 |  172403944 | db1067-bin.001687     |           371670049 | s1    | codfw      |
| 2018-07-18T06:01:44.001480 | 171970637 | db1052-bin.005945 |  479413784 | NULL                  |                NULL | s1    | eqiad      |
| 2017-05-10T13:29:31.001050 | 180359172 | db2016-bin.002977 |  503605243 | db1052-bin.004793     |           401461576 | s1    | codfw      |
| 2017-01-26T08:16:47.000790 | 171974683 | db1057-bin.002885 |  576316184 | db1052-bin.004556     |           210334207 | s1    | eqiad      |
| 2016-04-19T14:34:57.500710 |    103222 | db1052-bin.003635 |  253962980 | db2016-bin.001838     |           537698332 | s1    | NULL       |
| 2014-02-05T05:42:17.500660 |    103226 | db1056-bin.000672 |  400611588 | NULL                  |                NULL | NULL  | NULL       |
| 2013-04-17T19:13:49.500670 |     10166 | db1017-bin.001509 | 1015768234 | NULL                  |                NULL | NULL  | NULL       |
| 2013-03-25T22:13:51.501320 |    103221 | db1051-bin.000022 |  916971119 | db1017-bin.001426     |           168293056 | NULL  | NULL       |
| 2013-03-21T21:37:25.001370 |     10681 | db71-bin.000453   |  636529649 | db63-bin.000751       |           821645784 | NULL  | NULL       |
| 2013-03-21T21:37:25.000990 |     10673 | db63-bin.000751   |  821645784 | db1017-bin.001404     |           152420058 | NULL  | NULL       |
| 2013-03-09T15:14:02.500790 |     10679 | db69-bin.000022   |  791940369 | db63-bin.000679       |          1047555273 | NULL  | NULL       |
| 2013-03-03T23:16:13.501210 |     10646 | db36-bin.001811   |  618958721 | db63-bin.000658       |           581081709 | NULL  | NULL       |
| 2012-08-22T05:49:29.001100 |     10648 | db38-bin.000610   |   43354652 | NULL                  |                NULL | NULL  | NULL       |
| 2012-01-18T22:25:16.001070 |     10642 | db32-bin.000385   |  435653878 | NULL                  |                NULL | NULL  | NULL       |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
17 rows in set (0.001 sec)

This makes it very difficult to distinguish between entries that should no longer get updated, and entries that aren't getting updated that _should_ get updated.

Once the initial cleanup is done, we'll need alerting and potentially automation to ensure that cruft doesn't re-accumulate.

Progress:

  • s1
  • s2
  • s3
  • s4
  • s5
  • s6
  • s7
  • s8
  • x1
  • x2
  • es4
  • es5
  • m1
  • m2
  • m3
  • m5
  • pc1
  • pc2
  • pc3
NOTE: es1, es2 and es3 do not have replication enabled so no need.

Event Timeline

Let's backup them just in case- the reason why this was not done before is because the heartbeat table helps track past replication history in case a replication problem (e.g. an improper master switchover).

I have created a dump at: dbprov1003:/srv/backups/dumps/latest/heartbeat_tables.2020-11-20.tar.gz, so that is no longer a blocker.

herron triaged this task as Medium priority.Nov 20 2020, 3:07 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:04:48Z] <kormat> cleaning up heartbeat.heartbeat on pc1 T268336

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:09:14Z] <kormat> cleaning up heartbeat.heartbeat on pc2 T268336

Mentioned in SAL (#wikimedia-operations) [2020-11-23T14:10:02Z] <kormat> cleaning up heartbeat.heartbeat on pc3 T268336

Change 642379 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

Cleaning up the heartbeat tables in prod is a bit tricky, as there's a lot of cruft, and a mix of STATEMENT vs ROW replication. My suggestion is that we update stale rows to set ts to 0, instead of trying to delete the rows. That makes it easy to filter them out from queries.

Change 642379 merged by Kormat:
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

After almost 2h, I have entirely cleaned up s4 :)

I have started to clean up s1, but will most likely finish it once the switchover is done tomorrow. There's lots to clean up there

Regarding s1: I have finished cleaning up eqiad. Tomorrow only deleting the current master's ID would be the only thing left for it.
Going to start codfw clean up now.

Marostegui claimed this task.
Marostegui updated the task description. (Show Details)

All done