
Reduce IRC flood/spam during incidents
Closed, Resolved (Public)

Description

As part of this quarter's o11y OKRs we'll be looking into reducing the amount of alerting IRC flood/spam that occurs during incidents, mostly from per-host alerts all firing together. Solutions for said alerts include:

  • Evaluate if the alert can be eliminated altogether
  • If not, lower its severity to non-IRC notifications
  • Evaluate whether a (possibly higher-level) aggregate alert on IRC is warranted

The list of alerts affected includes:

  • Apache HTTP on <appserver>: icinga talks HTTP to the appserver asking for en.wikipedia.org (check_http_wikipedia)
  • PHP7 rendering on <appserver>: icinga talks HTTP to the appserver/jobrunner sending the PHP_ENGINE=php7 cookie (check_http_wikipedia_main_php7, check_http_jobrunner_php7)
  • confd template for <file> on <host>: the nrpe script check_confd_template does sanity/compilation/staleness checks
  • mediawiki-installation DSH group on <host>: the check_dsh_groups script does basic consistency checks and alerts when hosts are not in the groups they are supposed to be in. This alert will "naturally" go away with mw-on-k8s, so holding off for now
  • MediaWiki EtcdConfig up-to-date on <host>: basic sanity check via check_etcd_mw_config_lastindex.py on whether the appserver has an up-to-date etcd index
  • Confd vcl based reload: can fire per-host simultaneously and thus spam IRC
  • PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get
  • PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is
  • (will be replaced by checks in T320620: Port openapi/swagger checks/alerts to Prometheus) PROBLEM - Restbase/mobileapps/wikifeeds/termbox LVS codfw on restbase.svc.codfw.wmnet is CRITICAL
  • RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes
  • PROBLEM - aqs endpoints health on aqs1014 is CRITICAL

Event Timeline

A sample of alerts I found while looking for IRC floods from icinga-wm (one example per alert type, not repeating the floods here):

2022-04-20T14:01:08 -icinga-wm:#wikimedia-operations- PROBLEM - MediaWiki EtcdConfig up-to-date on mw1322 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) URL
2021-06-15T16:32:04 -icinga-wm:#wikimedia-operations- PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager URL
2021-07-16T15:26:37 -icinga-wm:#wikimedia-operations- PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) URL
2021-07-26T10:35:03 -icinga-wm:#wikimedia-operations- PROBLEM - Apache HTTP on mw2310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds URL
2021-07-26T10:35:03 -icinga-wm:#wikimedia-operations- PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds URL
2021-09-07T14:06:12 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on restbase1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service URL
2021-09-12T18:13:18 -icinga-wm:#wikimedia-operations- PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds URL
2021-10-03T06:01:18 -icinga-wm:#wikimedia-operations- PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp2027 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86323 seconds left URL
2022-01-13T13:13:00 -icinga-wm:#wikimedia-operations- PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-URL on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload-URL is broken URL
2022-03-01T15:43:05 -icinga-wm:#wikimedia-operations- PROBLEM - mediawiki-installation DSH group on mw1407 is CRITICAL: Host mw1407 is not in mediawiki-installation dsh group URL
2022-03-02T19:28:15 -icinga-wm:#wikimedia-operations- PROBLEM - Ensure local MW versions match expected deployment on mw1434 is CRITICAL: CRITICAL: 528 mismatched wikiversions URL
2022-04-28T07:35:33 -icinga-wm:#wikimedia-operations- PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6005 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl URL
2022-06-22T13:50:22 -icinga-wm:#wikimedia-operations- PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute URL
11:39 -icinga-wm:#wikimedia-operations- PROBLEM - confd service on doh1002 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

For "Apache HTTP on mw" I guess ideally it would be replaced by 2 things:

  • a paging alert based on "too many mw servers have failed apaches" with some threshold
  • a non-paging alert (IRC, maybe automatic ticket) for each individual server, BUT one that only triggers if we are not also above the threshold

my 2 cents, pending discussion with more serviceops
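
To make the proposal above concrete, here is a minimal sketch of the routing logic in Python. It is only an illustration, not how icinga or alertmanager would implement it: the function name, the `Notification` type and the 10% default threshold are all hypothetical (the comment only says "some threshold").

```python
from dataclasses import dataclass


@dataclass
class Notification:
    target: str   # "page" or "irc"
    message: str


def route_apache_alerts(failed_hosts, total_hosts, page_threshold=0.1):
    """Decide which notifications to send for failed per-host Apache checks.

    page_threshold is a hypothetical fraction of the fleet, chosen only
    for illustration.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    failed = sorted(set(failed_hosts))
    fraction = len(failed) / total_hosts
    if fraction >= page_threshold:
        # Too many servers failed: send a single page and suppress
        # the per-host notifications entirely.
        summary = f"{len(failed)}/{total_hosts} mw servers have failed apaches"
        return [Notification("page", summary)]
    # Below the threshold: one non-paging notification per host.
    return [Notification("irc", f"Apache HTTP on {h} is CRITICAL") for h in failed]
```

With this shape, an incident that takes out a large part of the fleet produces one notification rather than one per host, while an isolated broken server still shows up on IRC.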

For "Apache HTTP on mw" I guess ideally it would be replaced by 2 things:

  • a paging alert based on "too many mw servers have failed apaches" with some threshold
  • a non-paging alert (IRC, maybe automatic ticket) for each individual server, BUT one that only triggers if we are not also above the threshold

Yeah, something like that would work I think. As things stand today, I think our highest "bang for buck" for the above and PHP rendering (and possibly others of the same nature) is the following:

  • Use the up metric for apache-exporter and php-fpm-exporter as a proxy to signal that something is wrong with the fleet itself (i.e. we can't collect metrics), like the JobUnavailable alert we have now but tailored for php/apache on mw, and make it paging
  • Downgrade the per-host apache/php alerts above to warning so we still have visibility into per-host status and no IRC spam during incidents
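
As a rough illustration of the first bullet, the sketch below computes, per job, the fraction of targets whose `up` value is 0 and pages only when that fraction crosses a threshold. This is plain Python over already-collected samples, not a Prometheus rule. The 50% threshold and the `php-fpm` job name are assumptions made for the example.

```python
def unavailable_fraction(up_samples):
    """Fraction of targets whose scrape failed.

    up_samples maps instance name -> 1 (scrape succeeded) or 0 (failed).
    """
    if not up_samples:
        # No targets at all is treated as fully unavailable (an assumption).
        return 1.0
    down = sum(1 for value in up_samples.values() if value == 0)
    return down / len(up_samples)


def fleet_alerts(up_by_job, page_threshold=0.5):
    """Return (job, severity, fraction) for each job above the threshold.

    page_threshold is hypothetical; the real value would need tuning.
    """
    alerts = []
    for job, samples in sorted(up_by_job.items()):
        fraction = unavailable_fraction(samples)
        if fraction >= page_threshold:
            alerts.append((job, "page", fraction))
    return alerts


samples = {
    "apache": {"mw1": 1, "mw2": 0, "mw3": 0, "mw4": 0},
    "php-fpm": {"mw1": 1, "mw2": 1},
}
print(fleet_alerts(samples))
```

The per-host checks would stay in place at warning severity, so they remain visible in the dashboards without notifying IRC.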

FYI, I made this dashboard a while ago: https://logstash.wikimedia.org/app/dashboards#/view/AWm67Kpk8aQffZ3HmRpW hopefully it can help with the investigation work.

lmata triaged this task as Medium priority.
lmata moved this task from Inbox to In progress on the SRE Observability (FY2022/2023-Q1) board.

12:13:30 <icinga-wm> PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on thumbor2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring

This is another fairly noisy alert.

Change 825742 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] mediawiki: stop checking per-appserver availability

https://gerrit.wikimedia.org/r/825742

Regarding the appserver alerts, I think we should go in the following direction:

  • Have one metric that tells us if apache is up; I think that AppserversUnreachable is checking up{job="apache"}, rather than apache_up, which is what the exporter uses to signal if the apache server is reachable or not. But otherwise, we're covered and we can remove the individual server alerts
  • I'd also add a per-server alert if apache is down for more than 3 hours on that specific server, though - just so that if a server has been left misconfigured or broken for any reason we'll notice.
  • We also need a metric that tells us if php-fpm is able to respond to queries, although that is mostly covered by the PHPBusyWorkers alerts
  • Finally, and this is still lacking in alertmanager, we need a metric similar to the php7 rendering one; basically we need an HTTP check of a URL that involves calling the mediawiki code. Probably a good candidate is what pybal checks.
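
The first two bullets can be sketched as follows. This is a hedged illustration over in-memory samples rather than an actual alerting rule, and the function names and sample layout are made up for the example. It treats a host as reachable only when both `up` (the exporter was scraped) and `apache_up` (the exporter could reach Apache) are 1, and flags a host only once it has been continuously down for more than 3 hours.

```python
from datetime import datetime, timedelta


def apache_reachable(up, apache_up):
    """A host counts as reachable only if the exporter was scraped
    (up == 1) and the exporter itself could reach Apache (apache_up == 1)."""
    return up == 1 and apache_up == 1


def down_longer_than(samples, now, limit=timedelta(hours=3)):
    """Check whether a host has been continuously down for more than limit.

    samples is a list of (datetime, value) pairs sorted by time, where
    value is 1 when Apache is reachable and 0 when it is not.
    """
    down_since = None
    for timestamp, value in samples:
        if value == 0:
            if down_since is None:
                down_since = timestamp  # start of the current outage
        else:
            down_since = None  # any healthy sample resets the outage
    return down_since is not None and now - down_since > limit
```

Requiring a continuous outage keeps a server that flaps during an incident from triggering the per-server alert, so it should only catch hosts that were left broken or misconfigured.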

Change 830582 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: check apache_up too as part of AppserversUnreachable

https://gerrit.wikimedia.org/r/830582

Thank you for the feedback!

Regarding the appserver alerts, I think we should go in the following direction:

  • Have one metric that tells us if apache is up; I think that AppserversUnreachable is checking up{job="apache"}, rather than apache_up, which is what the exporter uses to signal if the apache server is reachable or not. But otherwise, we're covered and we can remove the individual server alerts

Indeed, I've addressed this in https://gerrit.wikimedia.org/r/c/operations/alerts/+/830582

  • I'd also add a per-server alert if apache is down for more than 3 hours on that specific server, though - just so that if a server has been left misconfigured or broken for any reason we'll notice.

I'm a little skeptical of the value of such an alert. Has this come up before without being covered, e.g. by puppet stale alerts?

  • We also need a metric that tells us if php-fpm is able to respond to queries, although that is mostly covered by the PHPBusyWorkers alerts

Agreed

  • Finally, and this is still lacking in alertmanager, we need a metric similar to the php7 rendering one; basically we need an HTTP check of a URL that involves calling the mediawiki code. Probably a good candidate is what pybal checks.

Unless I'm mistaken, this is the probes section in service::catalog calling /w/health-check.php?

Change 830582 merged by Filippo Giunchedi:

[operations/alerts@master] sre: check apache_up too as part of AppserversUnreachable

https://gerrit.wikimedia.org/r/830582

Re: MediaWiki EtcdConfig up-to-date: over the last 90 days we got ~10 floods of varying intensity, ranging from 80 to 10-15 for the most part: https://logstash.wikimedia.org/goto/a17e70746f78e8259665be97b88e65b4

(Attached screenshot: 2022-10-05-155047_1066x495_scrot.png)
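
For anyone wanting to reproduce a flood count like the one above from raw IRC logs, a minimal sketch follows. It is not the Logstash query behind the link. It assumes log lines start with an ISO timestamp as in the samples earlier in this task, and the 60-second gap and 10-alert minimum are hypothetical definitions of a "flood".

```python
from datetime import datetime, timedelta


def alert_time(line):
    """Parse the leading ISO timestamp of an icinga-wm IRC log line."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def find_floods(timestamps, max_gap=timedelta(seconds=60), min_size=10):
    """Group alert timestamps into floods.

    A flood is a run of at least min_size alerts in which consecutive
    alerts are at most max_gap apart. Both parameters are assumptions.
    Returns a list of (start_time, alert_count) tuples.
    """
    floods = []
    current = []
    for timestamp in sorted(timestamps):
        if current and timestamp - current[-1] > max_gap:
            # The gap is too large: close the current run.
            if len(current) >= min_size:
                floods.append((current[0], len(current)))
            current = []
        current.append(timestamp)
    if len(current) >= min_size:
        floods.append((current[0], len(current)))
    return floods
```

Feeding it the timestamps of all lines matching a single alert name gives a per-alert flood count and size, which helps rank the alerts worth aggregating first.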

Change 825742 merged by Filippo Giunchedi:

[operations/puppet@production] mediawiki: stop checking per-appserver availability

https://gerrit.wikimedia.org/r/825742

Change 841549 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: issue confd per-template alerts

https://gerrit.wikimedia.org/r/841549

Change 841549 merged by Filippo Giunchedi:

[operations/alerts@master] sre: issue confd per-template alerts

https://gerrit.wikimedia.org/r/841549

Change 841886 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] confd: remove check_confd_template icinga check

https://gerrit.wikimedia.org/r/841886

Change 841887 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] WIP mediawiki: remove PHP7 icinga checks

https://gerrit.wikimedia.org/r/841887

Change 841886 merged by Filippo Giunchedi:

[operations/puppet@production] confd: remove check_confd_template icinga check

https://gerrit.wikimedia.org/r/841886

Change 861850 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] varnish: teach confd-reload-vcl to write a Prometheus state file

https://gerrit.wikimedia.org/r/861850

Change 861850 merged by Filippo Giunchedi:

[operations/puppet@production] varnish: teach confd-reload-vcl to write a Prometheus state file

https://gerrit.wikimedia.org/r/861850

Change 862266 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] varnish: check vcl reload for old and new state

https://gerrit.wikimedia.org/r/862266

Change 862266 merged by Filippo Giunchedi:

[operations/puppet@production] varnish: check vcl reload for old and new state

https://gerrit.wikimedia.org/r/862266

Change 866264 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: exclude confd-reload-vcl from textfile staleness

https://gerrit.wikimedia.org/r/866264

Change 866264 merged by Filippo Giunchedi:

[operations/alerts@master] sre: exclude confd-reload-vcl from textfile staleness

https://gerrit.wikimedia.org/r/866264

Change 841887 merged by Filippo Giunchedi:

[operations/puppet@production] mediawiki: remove PHP7 icinga checks

https://gerrit.wikimedia.org/r/841887

Change 955915 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] nagios: emit warnings from check_dsh_groups

https://gerrit.wikimedia.org/r/955915

Change 955915 merged by Filippo Giunchedi:

[operations/puppet@production] nagios: emit warnings from check_dsh_groups

https://gerrit.wikimedia.org/r/955915

Change 961002 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] service: allow disabling icinga checks for 'node'

https://gerrit.wikimedia.org/r/961002

Change 961003 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] restbase: disable per-host icinga checks

https://gerrit.wikimedia.org/r/961003

Change 961002 merged by Filippo Giunchedi:

[operations/puppet@production] service: allow disabling icinga checks for 'node'

https://gerrit.wikimedia.org/r/961002

Change 961003 merged by Filippo Giunchedi:

[operations/puppet@production] restbase: disable per-host icinga checks

https://gerrit.wikimedia.org/r/961003

Mentioned in SAL (#wikimedia-operations) [2023-09-26T09:48:15Z] <godog> remove per-host restbase healthchecks, replaced by service-level swagger-exporter checks - T314118

Change 961062 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] maps: remove per-host healthchck

https://gerrit.wikimedia.org/r/961062

Change 961062 merged by Filippo Giunchedi:

[operations/puppet@production] maps: remove per-host healthchck

https://gerrit.wikimedia.org/r/961062

I'm going to call this resolved; we can reopen or start a new task to audit alerts causing floods on IRC during incidents.