toolschecker fell to pieces when labs-ns0 went down
Closed, Resolved · Public

Description

During the labservices1001 (aka labs-ns0) outage, toolschecker fired all kinds of bogus alerts. It should have coped just fine and fallen back on labs-ns1.

Andrew created this task. Dec 5 2016, 5:15 AM
Restricted Application added a project: Cloud-Services. Dec 5 2016, 5:15 AM
Restricted Application added a subscriber: Aklapper.
Krenair added a subscriber: Krenair. Dec 5 2016, 7:52 PM

Mentioned in SAL (#wikimedia-labs) [2017-01-03T20:43:28Z] <madhuvishy> Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

Mentioned in SAL (#wikimedia-labs) [2017-01-03T23:11:34Z] <madhuvishy> Disabled puppet on tools-checker-01 (T152369)

Mentioned in SAL (#wikimedia-labs) [2017-01-04T02:43:09Z] <madhuvishy> Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

Mentioned in SAL (#wikimedia-operations) [2017-01-04T20:49:00Z] <madhuvishy> adding temporary IP tables rule on labservices1001 to drop traffic from toolchecker for tests (T152369)

faidon added a subscriber: faidon. Jul 20 2017, 1:19 PM

What needs to be done here, by whom, and with what priority? (Asking because it shows up on our monitoring workboard.)

madhuvishy closed this task as Resolved. Jul 24 2017, 5:14 PM (edited)
madhuvishy claimed this task.
madhuvishy added a subscriber: madhuvishy.

I had finished investigating and fixing this, but apparently forgot to update this task :| Here goes.

toolschecker fired all the bogus alerts for two reasons:
1. The grid job submit test was the only check that actually flaked out. For some bizarre, unknown reason, when one of the nameservers is down, the commands try to resolve DNS for every exec and submit node against both servers (https://phabricator.wikimedia.org/P4699) before finally succeeding. That takes so long that the check was hitting the server-side uwsgi timeout of 300s.
2. The checks were being run serially, one after the other, so one check failing by hitting the timeout caused all the following checks to flake out (https://phabricator.wikimedia.org/P4697).

We've fixed this for the medium term by splitting the toolschecker checks into separate uwsgi endpoints that are run independently of each other: https://gerrit.wikimedia.org/r/#/c/334433/.
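
To illustrate why splitting the checks helps, here is a rough simulation of the failure mode described above. It is a simplified model and not the actual toolschecker code: the check names and durations are made up, and the only figure taken from this task is the 300s uwsgi timeout. It assumes that serial checks effectively share one time budget, while separate endpoints each get their own.

```python
# Simplified model of the failure mode, NOT the real toolschecker code.
UWSGI_TIMEOUT = 300  # seconds; the server-side uwsgi timeout mentioned in this task

# (check name, simulated duration in seconds).
# Names and durations are hypothetical, chosen so that the grid submit
# check is slow, as it was while one nameserver was down.
CHECKS = [
    ("grid_submit", 400),
    ("nfs_home", 2),
    ("redis", 1),
]


def run_serially(checks, timeout=UWSGI_TIMEOUT):
    """Model checks that run one after the other and share one time budget.

    Once the cumulative time exceeds the timeout, the slow check and every
    check after it are reported as failed.
    """
    results = {}
    elapsed = 0
    for name, duration in checks:
        elapsed += duration
        results[name] = elapsed <= timeout
    return results


def run_separately(checks, timeout=UWSGI_TIMEOUT):
    """Model checks served by separate endpoints, each with its own budget.

    A slow check only fails itself; the others are unaffected.
    """
    return {name: duration <= timeout for name, duration in checks}


if __name__ == "__main__":
    print("serial:  ", run_serially(CHECKS))
    print("separate:", run_separately(CHECKS))
```

In this model the serial run reports every check as failed, whereas the separate run reports only the slow grid submit check as failed, which matches the alerting behaviour we wanted during the outage.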