wcqs1002 and wcqs2001 are not yet in production. They are alerting on various unrelated things in Icinga (SSH access, various NRPE timeouts, etc.). This makes me think that they are either overloaded (strange, since there is no traffic to those servers) or that there is some hardware or network failure. No further investigation yet.
As of right now, both wcqs1002 and wcqs2001 seem to be running normally: blazegraph is active and all Icinga checks are green/OK. It's not obvious what the issue was, but it seems to be gone now.
re: "various NRPE timeouts": these happen when the nagios-nrpe-server service dies. An issue we have seen repeatedly on other hosts: the host runs out of memory for whatever reason, the OOM-killer picks the nagios-nrpe-server process as the victim, and all the standard Icinga checks that are executed on the host via NRPE (check_disk, check_cpu and so on) start failing. Then someone restarts nagios-nrpe-server and everything recovers. So maybe it was that here as well?
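
If someone wants to check the OOM theory, here is a rough sketch of how kernel log output (e.g. from `dmesg` or `journalctl -k`) could be scanned for OOM-killer victims. The function name, the sample lines, and the assumption that the daemon shows up in the kernel log as `nrpe` are mine and purely illustrative; the sample lines are not taken from wcqs1002 or wcqs2001, and the exact log wording varies between kernel versions.

```python
import re

# Assumption: the kernel reports each OOM kill with a line containing
# "Killed process <pid> (<name>)". Wording differs across kernel versions.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_victims(log_lines):
    """Return a list of (pid, process_name) for each OOM-kill line found."""
    victims = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


# Hypothetical sample input, made up for illustration only.
SAMPLE_LINES = [
    "[12345.678] some unrelated kernel message",
    "[12346.000] Out of memory: Killed process 1234 (nrpe) total-vm:20000kB",
]

if __name__ == "__main__":
    for pid, name in find_oom_victims(SAMPLE_LINES):
        print(f"OOM-killer victim: pid={pid} name={name}")
```

On a real host, the lines would come from the kernel log for the time window of the alerts. An empty result would point away from the OOM explanation, at least for the period the log still covers.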