Paste P7323

Icinga alert stats June 2018

Authored by Volans on Jul 2 2018, 9:03 PM.
# Stats gathered from received Icinga emails:
Total pages: 13
All pages:
- db1082/MariaDB Slave Lag: s5
- search.svc.eqiad.wmnet/ElasticSearch health check for shards
- thumbor.svc.eqiad.wmnet/LVS HTTP IPv4
- videoscaler.svc.codfw.wmnet/LVS HTTP IPv4
- db1115/mysqld processes
- db1115/MariaDB disk space
- labvirt1019/Disk space
- wdqs.svc.eqiad.wmnet/LVS HTTP IPv4
- labcontrol1001/keystone admin endpoint port 35357
- api.svc.codfw.wmnet/LVS HTTP IPv4
- api.svc.eqiad.wmnet/LVS HTTP IPv4
- checker.tools.wmflabs.org/toolschecker: NFS read/writeable on labs instances
- labstore1004/drbd service
# Stats gathered from IRC logs of #wikimedia-operations (from my IRC bouncer)
Total CRITICAL reported on IRC: 586
Total unique CRITICAL alarms: 57
Total unique host groups that had at least one CRITICAL alarm: 92
Total unique hosts that had at least one CRITICAL alarm: 385
CRITICAL alarms per type:
345: puppet last run
45: Check systemd state
18: restbase endpoints health
17: MariaDB Slave Lag: s1
16: Check size of conntrack table
14: MariaDB Slave Lag: s5
14: HHVM jobrunner
11: MariaDB Slave Lag: s8
11: Disk space
7: etcd request latencies
5: kubelet operational latencies
5: Request latencies
5: MariaDB Slave Lag: s7
5: Device not healthy -SMART-
5: Check whether ferm is active by checking the default input chain
4: proton endpoints health
4: mcrouter process
4: Router interfaces
4: DPKG
3: dhclient process
3: MD RAID
2: ores
2: ensure kvm processes are running
2: configured eth
2: MariaDB Slave Lag: s3
2: High lag
1: pybal
1: novaobserver has only observer role
1: mysqld processes
1: mediawiki-installation DSH group
1: logstash syslog TCP port
1: logstash log4j TCP port
1: logstash JSON linesTCP port
1: keystone public endoint port 5000
1: keystone admin endpoint port 35357
1: cassandra-c SSL 10.192.48.48:7001
1: cassandra-c CQL 10.192.48.48:9042
1: cassandra-a SSL 10.192.16.162:7001
1: cassandra-a CQL 10.192.16.162:9042
1: Upload HTTP 5xx reqs/min
1: PyBal connections to etcd
1: PyBal backends health check
1: MariaDB disk space
1: MariaDB Slave Lag: x1
1: MariaDB Slave Lag: m2
1: Long running screen/tmux
1: Logstash syslog TLS listener on port 16514
1: Improperly owned -0:0- files in /srv/mediawiki-staging
1: IPv4 ping to ulsfo
1: IPv4 ping to codfw
1: High CPU load on API appserver
1: HP RAID
1: HHVM rendering
1: Esams HTTP 5xx reqs/min
1: Check correctness of the icinga configuration
1: BGP status
1: Apache HTTP
CRITICAL alarms per host group (lazy grouping, just removing the XXXX numbering):
115: mw
89: db
33: kubernetes
28: elastic
23: cp
20: restbase-dev
17: proton
17: ms-be
13: mc
13: dbstore
12: wtp
12: mwdebug
10: analytics
8: rdb
8: ores
8: francium
7: lvs
7: labvirt
6: ganeti
5: wdqs
5: restbase
5: logstash
5: es
5: chlorine
4: scb
4: neon
4: labstore
4: labcontrol
4: kafkamon
4: graphite
4: druid
4: dbproxy
3: planet
3: pc
3: labtestcontrol
3: kafka
3: install
3: etcd
2: webperf
2: stat
2: snapshot
2: sca
2: poolcounter
2: kafka-jumbo
2: cr2-eqiad
2: conf
2: argon
2: aqs
2: acrab
1: ununpentium
1: thorium
1: terbium
1: rutherfordium
1: ruthenium
1: ripe-atlas-ulsfo
1: ripe-atlas-codfw
1: radon
1: pybal-test
1: puppetmaster
1: puppetdb
1: oxygen
1: oresrdb
1: netmon
1: mx
1: mwlog
1: ms-fe
1: mendelevium
1: meitnerium
1: labtestweb
1: labtestservices
1: labtestneutron
1: labtestnet
1: labsdb
1: labpuppetmaster
1: labnodepool
1: labnet
1: kubestage
1: hydrogen
1: hassium
1: furud
1: flerovium
1: einsteinium
1: dns
1: deploy
1: cr2-esams
1: cr1-eqdfw
1: cr1-codfw
1: contint
1: cobalt
1: bromine
1: bast
1: auth
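
The "lazy grouping" used for the list above can be sketched roughly as follows. This is a minimal illustration, not the script that produced these stats: the function names `host_group` and `group_counts` are hypothetical, and the regex assumes that "removing the XXXX numbering" means dropping any trailing run of digits from the host name. The sample counts are the `proton`, `francium` and `cr2-eqiad` entries from the per-host list in this paste.

```python
import re
from collections import Counter


def host_group(host: str) -> str:
    """Drop a trailing run of digits: 'mw1334' -> 'mw'.

    Names without a numeric suffix ('francium', 'cr2-eqiad') are
    returned unchanged, so each one forms its own group.
    """
    return re.sub(r"\d+$", "", host)


def group_counts(per_host: dict) -> Counter:
    """Sum per-host CRITICAL counts into per-group counts."""
    counts = Counter()
    for host, n in per_host.items():
        counts[host_group(host)] += n
    return counts


# Per-host counts copied from this paste (subset only).
per_host = {
    "proton1001": 6,
    "proton2001": 4,
    "proton1002": 4,
    "proton2002": 3,
    "francium": 8,
    "cr2-eqiad": 2,
}

for group, n in group_counts(per_host).most_common():
    print(f"{n}: {group}")
# prints:
# 17: proton
# 8: francium
# 2: cr2-eqiad
```

Because only a trailing numeric suffix is stripped, related groups such as `restbase` and `restbase-dev`, or `kafka` and `kafka-jumbo`, stay separate, which matches the list above.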
CRITICAL alarms per host:
24: kubernetes2003
11: db1115
9: restbase-dev1006
9: dbstore1002
8: restbase-dev1004
8: francium
6: proton1001
6: db1116
5: kubernetes1003
5: db2094
5: chlorine
4: proton2001
4: proton1002
4: ores2001
4: neon
4: mwdebug2002
4: lvs1001
4: logstash1007
4: dbstore2001
4: db2059
3: restbase-dev1005
3: proton2002
3: mwdebug2001
3: mwdebug1001
3: mw1334
3: mw1311
3: mw1309
3: mw1308
3: labvirt1010
3: labtestcontrol2001
3: labcontrol1001
3: install1002
3: elastic1029
3: elastic1018
3: db2038
3: db1124
2: wtp1043
2: wdqs1003
2: restbase2005
2: restbase2001
2: rdb2006
2: rdb2004
2: planet1001
2: mwdebug1002
2: mw2286
2: mw2283
2: mw2278
2: mw2262
2: mw2259
2: mw2250
2: mw2236
2: mw2226
2: mw2208
2: mw2203
2: mw2193
2: mw2190
2: mw2153
2: mw2146
2: mw1337
2: mw1336
2: mw1310
2: mw1306
2: mw1305
2: mw1302
2: mw1280
2: mw1277
2: mw1255
2: mw1245
2: mw1226
2: mw1224
2: mw1222
2: mw1221
2: mc2020
2: labstore1003
2: kubernetes1002
2: kubernetes1001
2: kafkamon2001
2: kafkamon1001
2: graphite1001
2: elastic2034
2: elastic1027
2: db2088
2: db2085
2: db2084
2: db2082
2: db2062
2: db2061
2: db2052
2: db2047
2: cr2-eqiad
2: argon
2: analytics1003
2: acrab
1: wtp2019
1: wtp2014
1: wtp2013
1: wtp1039
1: wtp1038
1: wtp1035
1: wtp1032
1: wtp1030
1: wtp1029
1: wtp1028
1: webperf1002
1: webperf1001
1: wdqs2005
1: wdqs1009
1: wdqs1007
1: ununpentium
1: thorium
1: terbium
1: stat1006
1: stat1004
1: snapshot1007
1: snapshot1005
1: scb2005
1: scb1004
1: scb1003
1: scb1002
1: sca2004
1: sca1003
1: rutherfordium
1: ruthenium
1: ripe-atlas-ulsfo
1: ripe-atlas-codfw
1: restbase2006
1: rdb2002
1: rdb1005
1: rdb1003
1: rdb1002
1: radon
1: pybal-test2001
1: puppetmaster2001
1: puppetdb2001
1: poolcounter2001
1: poolcounter1003
1: planet2001
1: pc2006
1: pc2004
1: pc1006
1: oxygen
1: oresrdb1002
1: ores2009
1: ores2005
1: ores1006
1: ores1002
1: netmon2001
1: mx2001
1: mwlog2001
1: mw2279
1: mw2275
1: mw2271
1: mw2264
1: mw2254
1: mw2243
1: mw2238
1: mw2237
1: mw2235
1: mw2230
1: mw2224
1: mw2215
1: mw2191
1: mw2186
1: mw2185
1: mw2178
1: mw2174
1: mw2159
1: mw2155
1: mw2152
1: mw2145
1: mw2138
1: mw1347
1: mw1346
1: mw1335
1: mw1331
1: mw1323
1: mw1321
1: mw1320
1: mw1313
1: mw1304
1: mw1303
1: mw1300
1: mw1296
1: mw1293
1: mw1289
1: mw1288
1: mw1286
1: mw1285
1: mw1273
1: mw1267
1: mw1257
1: mw1256
1: mw1252
1: mw1251
1: mw1231
1: mw1230
1: ms-fe1007
1: ms-be2042
1: ms-be2039
1: ms-be2030
1: ms-be2027
1: ms-be2020
1: ms-be2019
1: ms-be1042
1: ms-be1041
1: ms-be1038
1: ms-be1036
1: ms-be1027
1: ms-be1025
1: ms-be1022
1: ms-be1021
1: ms-be1016
1: ms-be1014
1: ms-be1013
1: mendelevium
1: meitnerium
1: mc2036
1: mc2034
1: mc2030
1: mc2029
1: mc2028
1: mc2022
1: mc2021
1: mc1034
1: mc1033
1: mc1026
1: mc1021
1: lvs5003
1: lvs4005
1: lvs2001
1: logstash1009
1: labvirt1020
1: labvirt1019
1: labvirt1006
1: labvirt1003
1: labtestweb2001
1: labtestservices2002
1: labtestneutron2002
1: labtestnet2002
1: labstore2004
1: labstore1005
1: labsdb1005
1: labpuppetmaster1001
1: labnodepool1002
1: labnet1001
1: labcontrol1002
1: kubestage1001
1: kafka2001
1: kafka1002
1: kafka1001
1: kafka-jumbo1003
1: kafka-jumbo1001
1: hydrogen
1: hassium
1: graphite2002
1: graphite2001
1: ganeti2008
1: ganeti2005
1: ganeti2003
1: ganeti2002
1: ganeti1007
1: ganeti1002
1: furud
1: flerovium
1: etcd1006
1: etcd1005
1: etcd1004
1: es2019
1: es2016
1: es2015
1: es2012
1: es1016
1: elastic2031
1: elastic2026
1: elastic2024
1: elastic2020
1: elastic2015
1: elastic2014
1: elastic2010
1: elastic2008
1: elastic2002
1: elastic1052
1: elastic1046
1: elastic1039
1: elastic1037
1: elastic1035
1: elastic1032
1: elastic1028
1: elastic1019
1: elastic1017
1: einsteinium
1: druid1006
1: druid1005
1: druid1004
1: druid1001
1: dns4002
1: deploy1001
1: dbproxy1009
1: dbproxy1006
1: dbproxy1002
1: dbproxy1001
1: db2093
1: db2092
1: db2087
1: db2086
1: db2081
1: db2080
1: db2079
1: db2078
1: db2076
1: db2075
1: db2072
1: db2071
1: db2070
1: db2069
1: db2066
1: db2065
1: db2056
1: db2055
1: db2051
1: db2048
1: db2045
1: db2037
1: db2034
1: db1123
1: db1122
1: db1119
1: db1110
1: db1109
1: db1103
1: db1101
1: db1096
1: db1093
1: db1092
1: db1088
1: db1084
1: db1082
1: db1074
1: db1071
1: db1068
1: db1061
1: db1054
1: cr2-esams
1: cr1-eqdfw
1: cr1-codfw
1: cp5007
1: cp4031
1: cp4027
1: cp4026
1: cp3048
1: cp3047
1: cp3043
1: cp3042
1: cp3039
1: cp3008
1: cp2019
1: cp2018
1: cp2016
1: cp2014
1: cp2013
1: cp1074
1: cp1073
1: cp1071
1: cp1066
1: cp1059
1: cp1051
1: cp1050
1: cp1045
1: contint1001
1: conf2002
1: conf1005
1: cobalt
1: bromine
1: bast1001
1: auth2001
1: aqs1009
1: aqs1007
1: analytics1076
1: analytics1074
1: analytics1066
1: analytics1054
1: analytics1043
1: analytics1041
1: analytics1037
1: analytics1001
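
For reference, per-type and per-host counts like the ones above could be derived from an IRC log along these lines. This is only a sketch under stated assumptions: the paste does not show the log format, so the line layout `PROBLEM - <check> on <host> is CRITICAL: <output>` is an assumption, the names `LINE_RE` and `summarize` are hypothetical, and the sample lines are invented for illustration rather than taken from the real logs.

```python
import re
from collections import Counter

# ASSUMPTION: alert lines look like
#   "PROBLEM - <check name> on <host> is CRITICAL: <plugin output>"
# The real log format is not shown in this paste.
# The greedy ".+" lets check names that themselves contain " on " match.
LINE_RE = re.compile(r"PROBLEM - (?P<check>.+) on (?P<host>\S+) is CRITICAL")


def summarize(lines):
    """Return (per-check, per-host) Counters of CRITICAL alarms."""
    per_check, per_host = Counter(), Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a CRITICAL problem line (e.g. a recovery)
        per_check[m["check"]] += 1
        per_host[m["host"]] += 1
    return per_check, per_host


# Invented sample lines, for illustration only.
sample = [
    "PROBLEM - puppet last run on example1001 is CRITICAL: made-up output",
    "PROBLEM - puppet last run on example1002 is CRITICAL: made-up output",
    "PROBLEM - Disk space on example1001 is CRITICAL: made-up output",
    "RECOVERY - Disk space on example1001 is OK: made-up output",
]

per_check, per_host = summarize(sample)
print("Total CRITICAL reported on IRC:", sum(per_check.values()))
print("Total unique CRITICAL alarms:", len(per_check))
print("Total unique hosts:", len(per_host))
for check, n in per_check.most_common():
    print(f"{n}: {check}")
```

A plugin output that itself contains the text ` on <something> is CRITICAL` would confuse the greedy match, so a real script would probably need a stricter pattern.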