As part of this quarter's o11y OKRs, we'll be looking into reducing the alert flood/spam on IRC that occurs during incidents, mostly caused by per-host alerts all firing at once. Possible solutions for these alerts include:
- Evaluate if the alert can be eliminated altogether
- If not, lower its severity to non-IRC notifications
- Evaluate whether a (possibly higher-level) aggregate alert on IRC is warranted
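To illustrate the third option, here is a minimal sketch of how per-host results could be folded into one aggregate notification per check. It is not how Icinga or Alertmanager do it; `HostCheck`, `aggregate_notifications`, and the `min_failing_ratio` threshold are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HostCheck:
    host: str
    check: str      # check name, e.g. "check_http_wikipedia"
    failing: bool


def aggregate_notifications(results, min_failing_ratio=0.1):
    """Return at most one message per check instead of one per host.

    min_failing_ratio is an assumed threshold: a check only produces a
    message when at least that fraction of its hosts is failing.
    """
    by_check = {}
    for result in results:
        by_check.setdefault(result.check, []).append(result)

    messages = []
    for check, items in sorted(by_check.items()):
        failing = sorted(r.host for r in items if r.failing)
        ratio = len(failing) / len(items)
        if failing and ratio >= min_failing_ratio:
            # Name only a few hosts so the message stays short on IRC.
            sample = ", ".join(failing[:3])
            messages.append(
                f"{check}: {len(failing)}/{len(items)} hosts failing (e.g. {sample})"
            )
    return messages
```

With this shape, an incident that breaks every appserver would yield a single line per check on IRC, while the per-host detail stays available in the non-IRC notifications.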
The list of alerts affected includes:
- Apache HTTP on <appserver>: Icinga talks HTTP to the appserver, asking for en.wikipedia.org (check_http_wikipedia)
- PHP7 rendering on <appserver>: Icinga talks HTTP to the appserver/jobrunner, sending the PHP_ENGINE=php7 cookie (check_http_wikipedia_main_php7, check_http_jobrunner_php7)
- confd template for <file> on <host>: the NRPE script check_confd_template does sanity/compilation/staleness checks
- mediawiki-installation DSH group on <host>: the check_dsh_groups script does basic consistency checks and alerts when hosts are not in the groups they are supposed to be in. This alert will "naturally" go away with mw-on-k8s, so holding off for now
- MediaWiki EtcdConfig up-to-date on <host>: basic sanity check via check_etcd_mw_config_lastindex.py on whether the appserver has an up-to-date etcd index
- Confd vcl based reload: can fire per-host simultaneously and thus spam IRC
- PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get
- PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is
- (will be replaced by checks in T320620: Port openapi/swagger checks/alerts to Prometheus) PROBLEM - Restbase/mobileapps/wikifeeds/termbox LVS codfw on restbase.svc.codfw.wmnet is CRITICAL
- RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes
- PROBLEM - aqs endpoints health on aqs1014 is CRITICAL
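To get a rough idea of which checks contribute most to the flood, notification lines like the ones above could be tallied per check. The sketch below assumes the `PROBLEM|RECOVERY - <check> on <host> is <STATE>` shape seen in these samples; the regular expression and the `count_problems_by_check` helper are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the sample notifications:
#   PROBLEM - <check description> on <host> is <STATE>[: details]
NOTIFICATION = re.compile(
    r"^(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is (?P<state>\w+)"
)


def count_problems_by_check(lines):
    """Count PROBLEM notifications per check description, ignoring recoveries."""
    counts = Counter()
    for line in lines:
        match = NOTIFICATION.match(line.strip())
        if match and match.group("kind") == "PROBLEM":
            counts[match.group("check")] += 1
    return counts
```

Checks with the highest counts during an incident are the most likely candidates for elimination, a lower severity, or an aggregate alert.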
