As part of progressively reducing Icinga's scope, we should move all paging checks/alerts off it. This will also help improve the reliability of paging alerts (e.g. T294166: Alert that should have paged via VictorOps was delayed because of partial networking outage) because we'll be using the VO API exclusively, as opposed to the email transport.
List of current (April 2022) paging alerts in Icinga
Prometheus-based (via Icinga check_prometheus)
- excessive RX traffic on LVS interfaces
- not enough php-fpm workers
- reduced availability (i.e. high 5xx) for ats-tls and varnish
- high rate of NEL errors
Native Icinga/NRPE checks
- zookeeper server (check_procs on java process)
- LVS/service::catalog checks. Will be removed by T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager
- MariaDB alerts (replica, disk space, read only, mysqld processes not running, etc)
- cfssl signer per-CA and cfssl-multirootca unit status
- acme-chief unit status
- Corp OIT ldap mirror (check_ldap)
- etcd replication (check_http_url_for_regexp_on_port!${::fqdn}!${etcdmirror_web_port}!/lag!^(-[1-9]|[0-5][^0-9]+))
- kafka broker server (check_procs on java process)
- exim queue
- fastnetmon is alerting
- phabricator.wikimedia.org unreachable / ssl expiring
- ircd (check_ircd basic irc client to check connectivity and clients connected)
- auth and recursive DNS (check_dns and check_dns_query_auth)
- elasticsearch health check for frozen writes (check timestamp on ES /mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything)
- "wiki content on commons" (and ssl expiry)
- superset (tcp/http) check
Note: checks belonging to some users (e.g. WMCS, fundraising) will be tackled as a separate effort.
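
For anyone porting the etcd replication check listed above, it may help to see what its regular expression accepts. The following is a minimal sketch, not the actual check implementation: the pattern is copied from the check definition, while the sample response bodies and the `lag_matches` helper are hypothetical. The format of the real `/lag` output, and whether a match means the check passes, are assumptions.

```python
import re

# Pattern copied verbatim from the check_http_url_for_regexp_on_port arguments.
LAG_PATTERN = re.compile(r"^(-[1-9]|[0-5][^0-9]+)")


def lag_matches(body: str) -> bool:
    """Return True if the (hypothetical) response body matches the pattern."""
    return LAG_PATTERN.search(body) is not None


# Hypothetical response bodies: a bare number followed by a newline.
for sample in ["0\n", "5\n", "6\n", "12\n", "-3\n"]:
    print(repr(sample), lag_matches(sample))
```

Under these assumptions the pattern matches a single digit from 0 to 5 followed by a non-digit character, or a minus sign followed by a digit from 1 to 9. Values such as 6 or 12 do not match, so an equivalent Prometheus-based alert would presumably need a threshold in the same range.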