
failover alert2001 to alert1001
Closed, Resolved · Public

Description

Tracking task to fail alert host services back to alert1001 reasonably soon after the switch maintenance has completed.

Event Timeline

herron changed the task status from Open to Stalled.

stalling until related maintenance is finished

herron changed the task status from Stalled to In Progress. Apr 20 2023, 4:11 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-24T14:07:09Z] <herron> beginning alert host failover from alert2001 to alert1001 T333837

Mentioned in SAL (#wikimedia-operations) [2023-04-24T14:07:43Z] <herron> disabled icinga meta monitoring on wikitech-static T333837

Change 910878 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] Revert "alerting_host: failover icinga and alertmanger from eqiad to codfw"

https://gerrit.wikimedia.org/r/910878

Change 910879 had a related patch set uploaded (by Herron; author: Herron):

[operations/dns@master] Revert "dns: repoint alert host services to alert2001"

https://gerrit.wikimedia.org/r/910879

Change 910878 merged by Herron:

[operations/puppet@production] Revert "alerting_host: failover icinga and alertmanger from eqiad to codfw"

https://gerrit.wikimedia.org/r/910878

Change 910879 merged by Herron:

[operations/dns@master] Revert "dns: repoint alert host services to alert2001"

https://gerrit.wikimedia.org/r/910879

Mentioned in SAL (#wikimedia-operations) [2023-04-24T14:31:43Z] <herron> re-enabled icinga meta monitoring on wikitech-static T333837

herron triaged this task as Medium priority.

Alerting host services have been live from eqiad for ~30m, resolving!