Currently the heartbeat table on production database instances contains a lot of stale entries. E.g.:
root@db1083.eqiad.wmnet[(none)]> select * from heartbeat.heartbeat ORDER BY ts DESC; +----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+ | ts | server_id | file | position | relay_master_log_file | exec_master_log_pos | shard | datacenter | +----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+ | 2020-11-20T13:12:25.001110 | 171970661 | db1083-bin.005754 | 707058317 | NULL | NULL | s1 | eqiad | | 2020-10-29T08:02:21.000720 | 180363268 | db2112-bin.002375 | 568497678 | db1083-bin.005621 | 34053178 | s1 | codfw | | 2019-11-14T06:01:21.000750 | 171974720 | db1067-bin.003024 | 445994072 | NULL | NULL | s1 | eqiad | | 2018-10-22T07:53:05.002070 | 180355171 | db2048-bin.004286 | 172403944 | db1067-bin.001687 | 371670049 | s1 | codfw | | 2018-07-18T06:01:44.001480 | 171970637 | db1052-bin.005945 | 479413784 | NULL | NULL | s1 | eqiad | | 2017-05-10T13:29:31.001050 | 180359172 | db2016-bin.002977 | 503605243 | db1052-bin.004793 | 401461576 | s1 | codfw | | 2017-01-26T08:16:47.000790 | 171974683 | db1057-bin.002885 | 576316184 | db1052-bin.004556 | 210334207 | s1 | eqiad | | 2016-04-19T14:34:57.500710 | 103222 | db1052-bin.003635 | 253962980 | db2016-bin.001838 | 537698332 | s1 | NULL | | 2014-02-05T05:42:17.500660 | 103226 | db1056-bin.000672 | 400611588 | NULL | NULL | NULL | NULL | | 2013-04-17T19:13:49.500670 | 10166 | db1017-bin.001509 | 1015768234 | NULL | NULL | NULL | NULL | | 2013-03-25T22:13:51.501320 | 103221 | db1051-bin.000022 | 916971119 | db1017-bin.001426 | 168293056 | NULL | NULL | | 2013-03-21T21:37:25.001370 | 10681 | db71-bin.000453 | 636529649 | db63-bin.000751 | 821645784 | NULL | NULL | | 2013-03-21T21:37:25.000990 | 10673 | db63-bin.000751 | 821645784 | db1017-bin.001404 | 152420058 | NULL | NULL | | 2013-03-09T15:14:02.500790 | 10679 | db69-bin.000022 | 791940369 | db63-bin.000679 | 1047555273 | NULL | NULL | | 2013-03-03T23:16:13.501210 | 10646 | db36-bin.001811 | 618958721 | db63-bin.000658 | 581081709 | NULL | NULL | | 2012-08-22T05:49:29.001100 | 10648 | db38-bin.000610 | 43354652 | NULL | NULL | NULL | NULL | | 2012-01-18T22:25:16.001070 | 10642 | db32-bin.000385 | 435653878 | NULL | NULL | NULL | NULL | +----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+ 17 rows in set (0.001 sec)
This makes it very difficult to distinguish between entries that should no longer get updated, and entries that aren't getting updated that _should_ get updated.
Once the initial cleanup is done, we'll need alerting and potentially automation to ensure that cruft doesn't re-accumulate.
Progress:
- s1
- s2
- s3
- s4
- s5
- s6
- s7
- s8
- x1
- x2
- es4
- es5
- m1
- m2
- m3
- m5
- pc1
- pc2
- pc3