During the labservices1001 (aka labs-ns0) outage, toolschecker fired all kinds of bogus alerts. It should have coped just fine and fallen back to labs-ns1.
Description
Event Timeline
Mentioned in SAL (#wikimedia-labs) [2017-01-03T20:43:28Z] <madhuvishy> Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369
Mentioned in SAL (#wikimedia-labs) [2017-01-03T23:11:34Z] <madhuvishy> Disabled puppet on tools-checker-01 (T152369)
Mentioned in SAL (#wikimedia-labs) [2017-01-04T02:43:09Z] <madhuvishy> Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369
Mentioned in SAL (#wikimedia-operations) [2017-01-04T20:49:00Z] <madhuvishy> adding temporary IP tables rule on labservices1001 to drop traffic from toolchecker for tests (T152369)
What needs to be done here, by whom, and with what priority? (Asking because this shows up on our observability workboard.)
I had finished investigating and fixing this, but apparently forgot to update this task :| Here goes.
Toolschecker fired the bogus alerts for two reasons:

1. The grid job submit test was the only check that actually flaked. For some bizarre, still-unknown reason, when one of the nameservers is down, the grid commands try to resolve DNS for every exec and submit node against both servers (https://phabricator.wikimedia.org/P4699) before finally succeeding. Because that takes a really long time, the check was hitting the server-side uwsgi timeout of 300s.

2. Since the checks ran serially, one after the other, a single check hitting the timeout caused all the following checks to flake out as well (https://phabricator.wikimedia.org/P4697).
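For a sense of scale, here is a back-of-the-envelope sketch of how a dead nameserver can blow past a 300s timeout. The resolver timeout and attempt count are the glibc defaults, and the node count is a purely hypothetical assumption, not the real size of the grid:

```python
# Illustrative estimate only -- all numbers are assumptions, not measurements.
RESOLVER_TIMEOUT_S = 5   # glibc resolver default per-attempt timeout
ATTEMPTS = 2             # glibc resolver default attempt count
NODES = 40               # hypothetical number of exec + submit nodes

# Each lookup waits out every attempt against the dead nameserver
# before falling back to the healthy one.
delay_per_lookup = RESOLVER_TIMEOUT_S * ATTEMPTS
total_delay = delay_per_lookup * NODES

print(delay_per_lookup)  # 10 seconds per lookup
print(total_delay)       # 400 seconds -- well past a 300s uwsgi timeout
```

With even a modest node count, the accumulated fallback delay alone exceeds the 300s server-side timeout, so the check fails without anything actually being broken on the grid.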
As a medium-term fix, we've split the toolschecker checks into separate uwsgi endpoints so that each check runs independently: https://gerrit.wikimedia.org/r/#/c/334433/.
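The shape of that split can be sketched as a minimal WSGI app (the check names and paths below are hypothetical, not the real toolschecker routes): each check gets its own URL path, so a slow or hung check only ties up its own request and its own uwsgi timeout, instead of blocking every check that would have run after it in one serial handler.

```python
# Minimal sketch of "one endpoint per check" -- names are illustrative.

def check_dns():
    # Placeholder for a real DNS health check.
    return True

def check_grid_submit():
    # Placeholder for the grid job submit check that was hitting the timeout.
    return True

# Each check is registered under its own path, so uwsgi applies its
# request timeout to each check in isolation.
CHECKS = {
    "/dns": check_dns,
    "/grid-submit": check_grid_submit,
}

def application(environ, start_response):
    check = CHECKS.get(environ.get("PATH_INFO", ""))
    if check is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown check\n"]
    if check():
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK\n"]
    start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
    return [b"FAIL\n"]
```

Icinga (or any monitor) can then probe each path separately, so one failing check produces exactly one alert rather than a cascade.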