Paste P9032

icinga codfw down on some servers

Authored by jcrespo on Sep 3 2019, 8:47 AM.
Referenced Files
F30209585: raw.txt
Sep 3 2019, 8:47 AM
[2019-09-03 07:46:20] SERVICE ALERT: db2103;configured eth;OK;HARD;1;OK - interfaces up
[2019-09-03 07:46:18] SERVICE ALERT: icinga1001;High average GET latency for mw requests on appserver in codfw;CRITICAL;SOFT;1;cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock
[2019-09-03 07:46:18] HOST ALERT: db2112;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 30.36 ms
[2019-09-03 07:46:18] SERVICE ALERT: db2112;dhclient process;OK;HARD;1;PROCS OK: 0 processes with command name 'dhclient'
[2019-09-03 07:46:14] SERVICE ALERT: db2112;MariaDB Slave Lag: s1;CRITICAL;HARD;1;CRITICAL slave_sql_lag could not connect
[2019-09-03 07:46:14] SERVICE ALERT: db2103;DPKG;OK;HARD;1;All packages OK
[2019-09-03 07:46:12] SERVICE ALERT: db2112;SSH;OK;HARD;1;SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0)
[2019-09-03 07:46:12] HOST ALERT: db2103;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 30.28 ms
[2019-09-03 07:46:08] SERVICE ALERT: db2103;SSH;OK;HARD;1;SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0)
[2019-09-03 07:46:06] SERVICE ALERT: db2103;MariaDB disk space;OK;HARD;1;DISK OK
[2019-09-03 07:46:06] SERVICE ALERT: db2103;Disk space;OK;HARD;1;DISK OK
[2019-09-03 07:46:06] SERVICE ALERT: db2103;Check size of conntrack table;OK;HARD;1;OK: nf_conntrack is 0 % full
[2019-09-03 07:45:56] SERVICE ALERT: lvs2006;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2184.codfw.wmnet, mw2183.codfw.wmnet are marked down but pooled: apaches_80: Servers mw2255.codfw.wmnet, mw2234.codfw.wmnet, mw2225.codfw.wmnet, mw2182.codfw.wmnet, mw2181.codfw.wmnet, mw2227.codfw.wmnet, mw2274.codfw.wmnet, mw2169.codfw.wmnet, mw2193.codfw.wmnet, mw2235.codfw.wmnet, mw2190.codfw.wmnet, mw2269.codfw.wmnet, mw2228.codfw.wmnet, mw2175.codfw.wmnet, mw2239.codfw.wmnet, mw2194.codfw.wmnet, mw2174.codfw.wmnet, mw2171.codfw.wmnet, mw2178.codfw.wmnet, mw2191.codfw.wmnet, mw2192.codfw.wmnet, mw2270.codfw.wmnet, mw2230.codfw.wmnet, mw2254.codfw.wmnet, mw2238.codfw.wmnet, mw2277.codfw.wmnet, mw2177.codfw.wmnet, mw2189.codfw.wmnet, mw2226.codfw.wmnet, mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2224.codfw.wmnet, mw2258.codfw.wmnet, mw2236.codfw.wmnet are marked down but pooled: api_80: Servers mw2147.codfw.wmnet, mw2205.codfw.wmnet, mw2209.codfw.wmnet, mw2261.codfw.wmnet, mw2143.codfw.wmnet, mw2201.codfw.wmnet, mw2202.codfw.wmnet, mw2142.codfw.wm
[2019-09-03 07:45:52] SERVICE ALERT: an-tool1006;Check the NTP synchronisation status of timesyncd;UNKNOWN;HARD;3;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:38] SERVICE ALERT: lvs2003;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw2147.codfw.wmnet, mw2143.codfw.wmnet, mw2137.codfw.wmnet, mw2202.codfw.wmnet, mw2145.codfw.wmnet, mw2206.codfw.wmnet, mw2203.codfw.wmnet, mw2139.codfw.wmnet, mw2212.codfw.wmnet, mw2222.codfw.wmnet, mw2245.codfw.wmnet, mw2138.codfw.wmnet, mw2208.codfw.wmnet, mw2210.codfw.wmnet, mw2136.codfw.wmnet, mw2141.codfw.wmnet are marked down but pooled
[2019-09-03 07:45:34] SERVICE ALERT: db2112;MariaDB Slave IO: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:30] SERVICE ALERT: db2103;Check systemd state;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:28] SERVICE ALERT: db2112;MariaDB read only s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:26] SERVICE ALERT: db2103;MariaDB Slave Lag: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:24] SERVICE ALERT: db2103;MariaDB read only s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:22] HOST ALERT: db2103;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
[2019-09-03 07:45:20] SERVICE ALERT: db2112;Check size of conntrack table;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:18] SERVICE ALERT: db2112;mysqld processes;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:18] SERVICE ALERT: icinga1001;High average GET latency for mw requests on api_appserver in codfw;CRITICAL;SOFT;1;cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock
[2019-09-03 07:45:12] SERVICE ALERT: cp4027;IPsec;OK;SOFT;2;Strongswan OK - 34 ESP OK
[2019-09-03 07:45:10] SERVICE ALERT: db2112;DPKG;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:10] SERVICE ALERT: db2112;MariaDB Slave SQL: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:08] SERVICE ALERT: alnitak;check_impression_logs;CRITICAL;HARD;2;CRITICAL centralnotice-impressions.+.json.sampled10.log rotated 95535 min ago [critical => 9999], landingpage-impressions.+.json.log rotated 124515 min ago [critical => 9999]
[2019-09-03 07:45:02] SERVICE ALERT: db2112;Disk space;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:45:02] SERVICE ALERT: db2103;Check whether ferm is active by checking the default input chain;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:54] SERVICE ALERT: db2103;MariaDB Slave SQL: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:52] SERVICE ALERT: db2103;configured eth;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:52] SERVICE ALERT: db2112;puppet last run;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:50] SERVICE ALERT: db2112;dhclient process;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:50] SERVICE ALERT: db2112;MariaDB Slave Lag: s1;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:50] SERVICE ALERT: db2103;DPKG;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:44] SERVICE ALERT: db2112;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[2019-09-03 07:44:42] SERVICE ALERT: db2103;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[2019-09-03 07:44:40] SERVICE ALERT: db2103;Disk space;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[2019-09-03 07:44:40] SERVICE ALERT: db2103;MariaDB disk space;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
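
For anyone wanting to slice the alert lines above by host or state, a minimal Python sketch follows. It assumes the layout visible in this paste: a bracketed timestamp, then `SERVICE ALERT:` or `HOST ALERT:`, then semicolon-separated fields (host, service for service alerts only, state, state type, attempt, plugin output). The names `Alert`, `parse_alert` and `ALERT_RE` are made up for illustration and are not part of Icinga. Any text before the bracketed timestamp is ignored, and lines that do not match are skipped.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional

# Assumed layout, inferred from the lines in this paste:
#   [YYYY-MM-DD HH:MM:SS] SERVICE ALERT: host;service;state;type;attempt;output
#   [YYYY-MM-DD HH:MM:SS] HOST ALERT: host;state;type;attempt;output
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>SERVICE|HOST) ALERT: (?P<rest>.*)"
)


class Alert(NamedTuple):
    timestamp: str
    kind: str               # "SERVICE" or "HOST"
    host: str
    service: Optional[str]  # None for host alerts
    state: str              # e.g. OK, CRITICAL, UNKNOWN, UP, DOWN
    state_type: str         # SOFT or HARD
    attempt: int
    output: str


def parse_alert(line: str) -> Optional[Alert]:
    """Parse one alert line; return None if it does not match the layout."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    # Limit the split so semicolons inside the plugin output are preserved.
    n_fields = 6 if m["kind"] == "SERVICE" else 5
    parts = m["rest"].split(";", n_fields - 1)
    if len(parts) != n_fields:
        return None
    if m["kind"] == "SERVICE":
        host, service, state, state_type, attempt, output = parts
    else:
        host, state, state_type, attempt, output = parts
        service = None
    if not attempt.isdigit():
        return None
    return Alert(m["ts"], m["kind"], host, service, state, state_type,
                 int(attempt), output)


# Three lines copied from the paste above.
SAMPLE = [
    "[2019-09-03 07:45:22] HOST ALERT: db2103;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%",
    "[2019-09-03 07:44:44] SERVICE ALERT: db2112;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds",
    "[2019-09-03 07:44:42] SERVICE ALERT: db2103;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds",
]

if __name__ == "__main__":
    alerts = [a for a in map(parse_alert, SAMPLE) if a is not None]
    # Count alerts per (host, state) pair to see which hosts were affected.
    for (host, state), count in sorted(Counter((a.host, a.state) for a in alerts).items()):
        print(host, state, count)
```

Feeding the whole paste through `parse_alert` and grouping by host is one way to separate the db2103 and db2112 events from the unrelated alerts interleaved with them.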
