toolschecker fell to pieces when labs-ns0 went down
Closed, Resolved (Public)

Description

During the labservices1001 (aka labs-ns0) outage, toolschecker fired all kinds of bogus alerts. It should have coped just fine and fallen back on labs-ns1.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-labs) [2017-01-03T20:43:28Z] <madhuvishy> Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

Mentioned in SAL (#wikimedia-labs) [2017-01-03T23:11:34Z] <madhuvishy> Disabled puppet on tools-checker-01 (T152369)

Mentioned in SAL (#wikimedia-labs) [2017-01-04T02:43:09Z] <madhuvishy> Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

Mentioned in SAL (#wikimedia-operations) [2017-01-04T20:49:00Z] <madhuvishy> adding temporary IP tables rule on labservices1001 to drop traffic from toolchecker for tests (T152369)

What still needs to be done here, by whom, and with what priority? (Asking because it shows up on our observability workboard.)

madhuvishy claimed this task.
madhuvishy subscribed.

I had finished investigating and fixing this, but apparently forgot to update this task :| Here goes.

toolschecker fired all the bogus alerts for two reasons:

1. The grid job submit test is the only one that actually flaked out. For some bizarre, unknown reason, when one of the nameservers is down, the commands try to resolve DNS for every exec and submit node against both servers (https://phabricator.wikimedia.org/P4699) before finally succeeding. That takes so long that the request was hitting the server-side uwsgi timeout of 300s.
2. The checks were being run serially, one after the other, so one check failing by hitting the timeout was causing all the following checks to flake out (https://phabricator.wikimedia.org/P4697).
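To illustrate the cascade, here is a rough sketch in Python. It is a simplified model, not the toolschecker code: it assumes all checks share a single 300s budget when run serially, and the check names and durations are made up for illustration.

```python
# Simplified model of the failure mode described above.
# Assumption: serial checks share one request budget, so time spent in an
# earlier check counts against every later one. Names/durations are invented.

UWSGI_TIMEOUT_S = 300  # server-side uwsgi timeout mentioned in this task


def run_serially(check_durations, budget=UWSGI_TIMEOUT_S):
    """All checks run one after the other against one shared budget."""
    results = {}
    elapsed = 0
    for name, duration in check_durations.items():
        elapsed += duration
        # A check only passes if it finishes before the shared budget runs out.
        results[name] = elapsed <= budget
    return results


def run_independently(check_durations, budget=UWSGI_TIMEOUT_S):
    """Each check gets its own budget, so a slow check only hurts itself."""
    return {name: duration <= budget for name, duration in check_durations.items()}


# Hypothetical durations in seconds; only the grid check is slow.
durations = {"ldap": 2, "grid_job_submit": 400, "nfs_home": 3, "redis": 1}

print(run_serially(durations))
# {'ldap': True, 'grid_job_submit': False, 'nfs_home': False, 'redis': False}
print(run_independently(durations))
# {'ldap': True, 'grid_job_submit': False, 'nfs_home': True, 'redis': True}
```

In this model, only the slow check is genuinely broken, but every check that runs after it is reported as failing too, which matches the pattern of bogus alerts.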

We've fixed this for the medium term by splitting the toolschecker checks into separate uwsgi endpoints that run independently of each other: https://gerrit.wikimedia.org/r/#/c/334433/
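For anyone reading later, the idea of one endpoint per check can be sketched roughly like this. This is not the code from the Gerrit change: it is a bare WSGI app using only the standard library, and the paths, check functions, and the `call` helper are all hypothetical.

```python
# Minimal sketch: one URL path per check, so each check is its own request
# with its own timeout. All names below are hypothetical placeholders.

def check_grid_job_submit():
    return True  # placeholder for a real grid job submission test


def check_nfs_home():
    return True  # placeholder for a real NFS test


def check_that_hangs():
    # Stands in for a check that fails, e.g. because DNS lookups are slow.
    raise TimeoutError("simulated slow check")


CHECKS = {
    "/grid/start": check_grid_job_submit,
    "/nfs/home": check_nfs_home,
    "/slow": check_that_hangs,
}


def app(environ, start_response):
    """WSGI entry point: run only the check matching the request path."""
    check = CHECKS.get(environ.get("PATH_INFO", ""))
    headers = [("Content-Type", "text/plain")]
    if check is None:
        start_response("404 Not Found", headers)
        return [b"unknown check\n"]
    try:
        ok = bool(check())
    except Exception:
        ok = False  # a failing check only affects its own endpoint
    start_response("200 OK" if ok else "503 Service Unavailable", headers)
    return [b"OK\n" if ok else b"FAIL\n"]


def call(path):
    """Tiny helper to exercise the app without a server (illustration only)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body


print(call("/slow"))       # ('503 Service Unavailable', b'FAIL\n')
print(call("/nfs/home"))   # ('200 OK', b'OK\n')
```

With this layout the monitoring side polls each path on its own, so a slow or broken check only turns its own alert red.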