toolschecker fell to pieces when labs-ns0 went down
Open, Needs Triage · Public

Description

During the labservices1001 (aka labs-ns0) outage, toolschecker fired all kinds of bogus alerts. It should have coped just fine and fallen back on labs-ns1.
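For illustration only, the expected fallback behaviour can be sketched as below. This is not toolschecker's actual code, which does not appear in this task. The names `resolve_with_fallback`, `fake_query`, and `AllNameserversFailed`, the hostname `tools.example`, and the returned address are all hypothetical.

```python
from typing import Callable, Dict, Sequence


class AllNameserversFailed(Exception):
    """Raised when none of the configured nameservers answered."""


def resolve_with_fallback(
    name: str,
    nameservers: Sequence[str],
    query: Callable[[str, str], str],
) -> str:
    """Ask each nameserver in order and return the first successful answer.

    `query(nameserver, name)` is a hypothetical callable that performs one
    lookup and raises OSError (which covers timeouts and refused
    connections) when that nameserver is unreachable.
    """
    errors: Dict[str, OSError] = {}
    for ns in nameservers:
        try:
            return query(ns, name)
        except OSError as exc:
            # Remember the failure and move on to the next nameserver.
            errors[ns] = exc
    raise AllNameserversFailed(errors)


def fake_query(nameserver: str, name: str) -> str:
    """Stand-in lookup simulating the outage: labs-ns0 is down, labs-ns1 answers."""
    if nameserver == "labs-ns0":
        raise TimeoutError("labs-ns0 did not respond")
    return "192.0.2.10"  # placeholder address from the documentation range


# With labs-ns0 down, the lookup should still succeed via labs-ns1.
answer = resolve_with_fallback("tools.example", ["labs-ns0", "labs-ns1"], fake_query)
```

A check built this way would alert only when every nameserver fails, which is the behaviour the description asks for.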

Andrew created this task. Dec 5 2016, 5:15 AM
Restricted Application added a project: Cloud-Services. Dec 5 2016, 5:15 AM
Restricted Application added a subscriber: Aklapper.
Krenair added a subscriber: Krenair. Dec 5 2016, 7:52 PM

Mentioned in SAL (#wikimedia-labs) [2017-01-03T20:43:28Z] <madhuvishy> Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

Mentioned in SAL (#wikimedia-labs) [2017-01-03T23:11:34Z] <madhuvishy> Disabled puppet on tools-checker-01 (T152369)

Mentioned in SAL (#wikimedia-labs) [2017-01-04T02:43:09Z] <madhuvishy> Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

Mentioned in SAL (#wikimedia-operations) [2017-01-04T20:49:00Z] <madhuvishy> adding temporary IP tables rule on labservices1001 to drop traffic from toolchecker for tests (T152369)

faidon added a subscriber: faidon. Thu, Jul 20, 1:19 PM

What needs to be done here, by whom, and with what priority? (Asking because it shows up on our monitoring workboard.)