
wcqs1002 and wcqs2001 unresponsive
Closed, Duplicate | Public

Description

wcqs1002 and wcqs2001 are not yet in production. They are alerting on various unrelated checks in Icinga (SSH access, various NRPE timeouts, etc.). This makes me think that either they are overloaded (which would be strange, since there is no traffic to those servers) or there is some hardware or network failure. No further investigation has been done yet.

Event Timeline

Change 736564 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: state change production->monitoring_setup

https://gerrit.wikimedia.org/r/736564

Change 736644 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] wcqs: remove from loadbalancers to avoid alerting

https://gerrit.wikimedia.org/r/736644

Change 736644 merged by Giuseppe Lavagetto:

[operations/puppet@production] wcqs: remove from loadbalancers to avoid alerting

https://gerrit.wikimedia.org/r/736644

As of right now, both wcqs1002 and wcqs2001 seem to be running normally: Blazegraph is active and all Icinga checks are green/OK. It's not obvious what the issue was but it seems gone now.

Re: "various NRPE timeouts": these happen when the nagios-nrpe-server service dies. An issue we have seen repeatedly on other hosts goes like this: the host runs out of memory for whatever reason, the OOM killer picks the nagios-nrpe-server process as its victim, and all the standard Icinga checks that are executed on the host via NRPE (check_disk, check_cpu and so on) start failing. Someone then restarts nagios-nrpe-server and everything recovers. So maybe that is what happened here as well?
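To check that theory on a host, one could scan the kernel log for OOM-killer reports and see whether the NRPE daemon was among the victims. The following is only a rough sketch, not an existing tool: the function names are hypothetical, the sample lines are made up for illustration (they are not taken from wcqs1002 or wcqs2001), and it assumes the daemon shows up in the kernel log under the process name `nrpe` and that the kernel prints the usual "Killed process PID (name)" line.

```python
import re

# Matches the kernel's OOM-killer report, which contains text like
# "Killed process 4321 (nrpe)". The exact surrounding text varies by
# kernel version, so only the stable part is matched.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_victims(log_lines):
    """Return a list of (pid, process_name) for each OOM-kill line."""
    victims = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


def nrpe_was_oom_killed(log_lines, process_name="nrpe"):
    """True if any OOM-kill line names the given process.

    The default name "nrpe" is an assumption about how the
    nagios-nrpe-server daemon appears in the kernel log.
    """
    return any(name == process_name for _, name in find_oom_victims(log_lines))


# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    "kernel: Out of memory: Killed process 4321 (nrpe) total-vm:12345kB, anon-rss:678kB",
    "kernel: usb 1-1: new high-speed USB device number 2",
]

if __name__ == "__main__":
    print(find_oom_victims(SAMPLE_LINES))
    print(nrpe_was_oom_killed(SAMPLE_LINES))
```

In practice the input lines would come from the kernel log (for example the output of `dmesg` or `journalctl -k`) rather than from a hard-coded list.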

It's not obvious what the issue was but it seems gone now.

Oh, sorry, I only saw T294961#7480793 and T294961#7482633 now. @RKemper, then just to confirm: it seems your kernel upgrade fixed this :)