During the labservices1001 (aka labs-ns0) outage, toolschecker fired all kinds of bogus alerts. It should have coped just fine and fallen back on labs-ns1.
I had finished investigating and fixing this, but apparently forgot to update this task :| Here goes.
toolschecker fired all the bogus alerts for two reasons:
1. Only the grid job submit test actually flaked out. For some bizarre, still unknown reason, when one of the nameservers is down, the commands try to resolve DNS for every exec and submit node against both servers (https://phabricator.wikimedia.org/P4699) before finally succeeding. That takes long enough to hit the server-side uwsgi timeout of 300s.
2. The checks were being run serially, one after the other, so one check hitting the timeout caused all the following checks to flake out as well (https://phabricator.wikimedia.org/P4697).
We've fixed this in the medium term by splitting the toolschecker checks into separate uwsgi endpoints that are run separately - https://gerrit.wikimedia.org/r/#/c/334433/.
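To illustrate why the split helps, here is a rough sketch of the failure mode. It is a simplified model, not the actual toolschecker code. It assumes the serial checks effectively shared one 300s budget, and the check names and durations are made up for the example.

```python
from typing import Dict, Tuple

UWSGI_TIMEOUT_S = 300.0  # server-side timeout mentioned above

# Hypothetical checks: name -> (simulated duration in seconds, would it pass?)
# The names and durations are illustrative, not taken from toolschecker.
CHECKS: Dict[str, Tuple[float, bool]] = {
    "ldap": (1.0, True),
    "grid_job_submit": (400.0, True),  # succeeds eventually, but only after slow DNS
    "nfs_home": (1.0, True),
}


def run_serially(checks: Dict[str, Tuple[float, bool]], timeout_s: float) -> Dict[str, bool]:
    """All checks share one time budget, so one slow check starves the rest."""
    elapsed = 0.0
    results = {}
    for name, (duration, ok) in checks.items():
        elapsed += duration
        # A check only counts as passing if it finished before the shared deadline.
        results[name] = ok and elapsed <= timeout_s
    return results


def run_as_separate_endpoints(checks: Dict[str, Tuple[float, bool]], timeout_s: float) -> Dict[str, bool]:
    """Each check gets its own time budget, so a slow check only fails itself."""
    return {name: ok and duration <= timeout_s for name, (duration, ok) in checks.items()}


if __name__ == "__main__":
    print("serial:  ", run_serially(CHECKS, UWSGI_TIMEOUT_S))
    print("separate:", run_as_separate_endpoints(CHECKS, UWSGI_TIMEOUT_S))
```

In this model the serial run reports both the slow check and everything after it as failed, while the split run only reports the slow check itself.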