During the last switchover, the kill/start of pt-heartbeat on the masters of the active datacenter did not go smoothly.
In particular, starting pt-heartbeat from salt didn't work as expected. We need to investigate whether and how the way pt-heartbeat daemonizes itself conflicts with the way salt executes commands on hosts, and what the possible solutions are.
As a more general topic, we need to find a reliable way to have a lag detection mechanism that works well with our multi-dc, multi-shard environment, given that MW will probably start using it, see T111266.
One possible solution could be to run pt-heartbeat on both masters (eqiad and codfw) all the time.
The limitation is that pt-heartbeat is an external process and as such could die, be killed, etc.
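
To make the "both masters" option more concrete, here is a rough sketch of how a consumer could compute lag if each master writes its own row to the heartbeat table. It assumes rows are read as (server_id, ts) pairs with ts as an ISO-style timestamp string; the function name, the server ids and the idea of passing in the active master's server_id are made up for illustration, not something that exists today.

```python
from datetime import datetime

def replication_lag(rows, master_server_id, now):
    """Return lag in seconds relative to the given master, or None.

    rows: iterable of (server_id, ts) pairs as read from the heartbeat
          table on a replica (assumed layout, one row per master).
    master_server_id: server_id of the master currently considered active
          (hypothetical input; how we learn it is out of scope here).
    now: current time as a naive datetime in the same timezone as ts.
    """
    for server_id, ts in rows:
        if server_id == master_server_id:
            # Lag is the age of the last heartbeat written by that master.
            return (now - datetime.fromisoformat(ts)).total_seconds()
    # No heartbeat row for that master: lag is unknown, not zero.
    return None

# Example with made-up server ids for the two masters.
rows = [
    (101, "2015-09-09T12:00:00.500000"),
    (202, "2015-09-09T11:59:50.000000"),
]
now = datetime(2015, 9, 9, 12, 0, 1, 500000)
lag_active = replication_lag(rows, 101, now)
lag_other = replication_lag(rows, 202, now)
lag_missing = replication_lag(rows, 303, now)
```

Note that this does not remove the limitation above: if pt-heartbeat dies on a master, the age of its row keeps growing and looks the same as real replication lag, so the process would still need separate monitoring.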