wcqs1002 and wcqs2001 are not yet in production. They are alerting on various unrelated things in Icinga (SSH access, various NRPE timeouts, etc.). This makes me think that they are either overloaded (strange, since there is no traffic to those servers) or that there is some hardware or network failure. No further investigation yet.
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| wcqs: remove from loadbalancers to avoid alerting | operations/puppet | production | +1 -1 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | Ladsgroup | T271851 Clean up gui from the wdqs deploy repo and puppet |
| Resolved | None | T260568 [EPIC] Productionize WCQS |
| Duplicate | None | T294865 wcqs1002 and wcqs2001 unresponsive |
| Resolved | Gehel | T294961 Resolve kernel hang on wcqs* instances |
Event Timeline
Change 736564 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wcqs: state change production->monitoring_setup
Change 736644 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):
[operations/puppet@production] wcqs: remove from loadbalancers to avoid alerting
Change 736644 merged by Giuseppe Lavagetto:
[operations/puppet@production] wcqs: remove from loadbalancers to avoid alerting
As of right now both wcqs1002 and wcqs2001 seem to be running normally: Blazegraph is active and all Icinga checks are green/OK. It's not obvious what the issue was, but it seems to be gone now.
Re: "various NRPE timeouts": these happen when the nagios-nrpe-server service dies. An issue we have seen repeatedly on other hosts: the host runs out of memory for whatever reason, the OOM killer picks the nagios-nrpe-server process as its victim, and all the standard Icinga checks that are executed on the host via NRPE (check_disk, check_cpu and so on) start failing. Someone restarts nagios-nrpe-server and everything recovers. So maybe it was that here as well?
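One way to test that theory would be to look for OOM-killer reports in the kernel log. The following is only a rough sketch: it assumes the kernel prints a line containing `Killed process <pid> (<name>)` (the exact wording varies between kernel versions) and that the NRPE daemon shows up under the process name `nrpe`. The function names and the sample log lines are made up for illustration and are not taken from these hosts.

```python
import re

# Assumed shape of the kernel's OOM-killer report, e.g.
# "Out of memory: Killed process 1234 (nrpe) total-vm:...".
# The wording differs between kernel versions, so adjust as needed.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_victims(log_lines):
    """Return (pid, process_name) for every OOM kill found in log_lines."""
    victims = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


def was_oom_killed(log_lines, name="nrpe"):
    """True if a process called `name` appears among the OOM victims.

    The default name "nrpe" is an assumption about how the
    nagios-nrpe-server daemon is reported by the kernel.
    """
    return any(comm == name for _, comm in find_oom_victims(log_lines))


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = [
        "kernel: java invoked oom-killer: gfp_mask=0x100cca, order=0",
        "kernel: Out of memory: Killed process 4242 (nrpe) total-vm:20000kB",
    ]
    print(find_oom_victims(sample))
    print(was_oom_killed(sample))
```

The lines could come from `dmesg` or `journalctl -k` output on the affected host. If nothing matches, the OOM explanation is less likely and a hardware, network or kernel problem becomes the more plausible cause.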
Oh, sorry, I only just saw T294961#7480793 and T294961#7482633. @RKemper, so just to confirm: it seems your kernel upgrade fixed this :)