As part of progressively reducing Icinga's scope, we should move all paging checks/alerts off it. This will also help improve the reliability of paging alerts (e.g. {T294166}), because we'll be using the VO API exclusively instead of the email transport.
== List of current (April 2022) paging alerts in Icinga ==
=== Prometheus-based (via Icinga `check_prometheus`) ===
* [x] excessive RX traffic on LVS interfaces
* [x] not enough php-fpm workers
* [x] reduced availability (i.e. high 5xx) for ats-tls and varnish
* [x] high rate of NEL errors
=== Native Icinga/NRPE checks ===
* [ ] zookeeper server (`check_procs` on java process)
* [x] LVS/service::catalog checks. Will be removed by {T291946}
* [ ] MariaDB alerts (replica, disk space, read-only, mysqld processes not running, etc.)
* [ ] cfssl signer per-CA and cfssl-multirootca unit status
* [ ] acme-chief unit status
* ~~Corp OIT ldap mirror (`check_ldap`)~~
* [ ] etcd replication (`check_http_url_for_regexp_on_port!${::fqdn}!${etcdmirror_web_port}!/lag!^(-[1-9]|[0-5][^0-9]+)`)
* [ ] kafka broker server (`check_procs` on java process)
* [x] exim queue
* [x] fastnetmon is alerting
* [ ] phabricator.wikimedia.org unreachable / ssl expiring
* [ ] ircd (`check_ircd`: a basic IRC client that checks connectivity and the number of connected clients)
* [ ] auth and recursive DNS (`check_dns` and `check_dns_query_auth`)
* [ ] elasticsearch health check for frozen writes (checks the timestamp on ES `/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything`)
* [ ] "wiki content on commons" (and ssl expiry)
* [ ] superset (tcp/http) check
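
When porting the etcd replication check off Icinga, it may help to pin down what its regexp actually accepts. The sketch below applies the pattern from the list above to a few sample response bodies. The function name `lag_is_acceptable` and the sample bodies are hypothetical, and it assumes that the `/lag` endpoint returns a bare number followed by a newline and that a regexp match means the check passes; neither assumption has been verified against etcdmirror.

```python
import re

# Pattern copied verbatim from the Icinga check definition in the list above.
LAG_PATTERN = re.compile(r"^(-[1-9]|[0-5][^0-9]+)")


def lag_is_acceptable(body: str) -> bool:
    """Return True if the response body matches the check's regexp.

    Hypothetical helper: assumes a match means the check passes.
    """
    return LAG_PATTERN.match(body) is not None


if __name__ == "__main__":
    # Sample bodies are assumptions about the endpoint's output format.
    for body in ["0\n", "5\n", "6\n", "42\n", "-3\n", "5"]:
        print(repr(body), lag_is_acceptable(body))
```

Under these assumptions the pattern accepts a single digit from 0 to 5 followed by at least one non-digit character, or a body starting with `-` and a digit from 1 to 9. It rejects anything else, including a bare `5` with no trailing character.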
Note: checks owned by some users (e.g. WMCS, fundraising) will be tackled as a separate effort.