At the moment smokeping is active on only one host at a time (netmon1002); on failover the smokeping daemon is started on the standby host (netmon2001). RRDs are synced between the two (i.e. eqiad -> codfw normally).
This task proposes a change in how smokeping is deployed, which should make failover easier to manage and align smokeping with existing monitoring/alerting deployments. Specifically:
- Each netmon/smokeping host is active all the time, i.e. produces its own graphs. Currently, after a failover we'd see a change in e.g. latency graphs, because the source of the pings changes while the RRDs are synced between sites.
- Only the active smokeping host sends alerts; alternatively, both hosts can send alerts
- Failover is a DNS change (plus a puppet change if alerts are active/standby)
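
The alerting options above can be sketched as a small decision function. This is only an illustration of the proposed behaviour, not existing puppet or smokeping code: `AlertMode`, `should_send_alerts` and the `active_host` parameter are hypothetical names, and in practice the choice would presumably be expressed in puppet configuration.

```python
from enum import Enum


class AlertMode(Enum):
    """Hypothetical alerting modes matching the two options in the proposal."""
    ACTIVE_STANDBY = "active_standby"  # only the active host sends alerts
    ALL_HOSTS = "all_hosts"            # every smokeping host sends alerts


def should_send_alerts(host: str, active_host: str, mode: AlertMode) -> bool:
    """Return True if `host` should send smokeping alerts.

    Every host keeps probing and producing its own graphs regardless of
    the result; only alert delivery is gated here.
    """
    if mode is AlertMode.ALL_HOSTS:
        return True
    # Active/standby: alerting follows the host designated as active.
    return host == active_host


if __name__ == "__main__":
    for host in ("netmon1002", "netmon2001"):
        sends = should_send_alerts(host, "netmon1002", AlertMode.ACTIVE_STANDBY)
        print(f"{host}: alerts {'on' if sends else 'off'}")
```

With active/standby alerting, a failover would mean changing the value passed as `active_host` (the puppet change mentioned above) in addition to the DNS change; with alerts on all hosts, only the DNS change would be needed.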