Change smokeping to have pinging active/active, with alerts active/standby
Closed, Resolved · Public

Description

Currently smokeping is active on only one host at a time (netmon1002), and on failover the smokeping daemon is started on the standby host (netmon2001). RRDs are synced between the two (normally eqiad -> codfw).

This task proposes a change in how smokeping is deployed, which should make failover easier to manage and align smokeping with existing monitoring/alerting deployments. Specifically:

  • Each netmon/smokeping host is active all the time, i.e. it pings and produces its own graphs. Currently, after a failover we'd see a change in e.g. latency graphs, because the source of the pings changes while the RRDs are synced between sites.
  • Only the active smokeping host sends alerts; alternatively, both hosts can send alerts.
  • Failover is a DNS change (plus a puppet change if alerts are active/standby).
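
The proposal in the list above can be sketched as a small decision function. This is only an illustration of the intended behaviour, not the actual change, which lives in operations/puppet and is not shown in this task. The names `SmokepingRole`, `role_for`, `active_host` and `alerts_active_active` are hypothetical and were made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokepingRole:
    """What a given netmon host should do (hypothetical model)."""
    run_daemon: bool   # ping targets and write local RRDs/graphs
    send_alerts: bool  # emit alert notifications


def role_for(host: str, active_host: str,
             alerts_active_active: bool = False) -> SmokepingRole:
    """Return the role of `host`, given which host is currently active.

    Pinging is active/active: every host always runs the daemon.
    Alerting is active/standby unless `alerts_active_active` is set,
    in which case every host alerts.
    """
    return SmokepingRole(
        run_daemon=True,
        send_alerts=alerts_active_active or host == active_host,
    )


if __name__ == "__main__":
    # Host names taken from the task description.
    for host in ("netmon1002", "netmon2001"):
        print(host, role_for(host, active_host="netmon1002"))
```

With alerts active/standby, failing over means changing `active_host` (the puppet change mentioned above) in addition to the DNS change. With alerts active/active, the role of each host never changes and only DNS needs to move.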

Event Timeline

Restricted Application added a subscriber: Aklapper. · Thu, Jul 23, 8:21 AM

👍
Good idea! I'd say send alerts only from one host as it's already quite loud (no easy way to mute alerts).
Also, T169860 is most likely the future of Smokeping, so no need to spend too much time on it either.

Change 615760 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smokeping: default to active/active

https://gerrit.wikimedia.org/r/615760

Change 615760 merged by Filippo Giunchedi:
[operations/puppet@production] smokeping: default to active/active

https://gerrit.wikimedia.org/r/615760

fgiunchedi closed this task as Resolved. · Fri, Jul 24, 7:06 AM
fgiunchedi claimed this task.

This is done! Now both netmon2001 and netmon1002 smokeping daemons are running all the time, but only the active netmon server (netmon1002 now) will send alerts.