Migrate SRE paging alerts off Icinga and to Alertmanager
Open, Needs Triage, Public

Description

As part of progressively reducing Icinga's scope, we should move all paging checks/alerts off it. This will also help improve the reliability of paging alerts (e.g. T294166: Alert that should have paged via VictorOps was delayed because of partial networking outage) because we'll be using the VO API exclusively, as opposed to the email transport.

List of current (April 2022) paging alerts in Icinga

Prometheus-based (via Icinga check_prometheus)

  • excessive RX traffic on LVS interfaces
  • not enough php-fpm workers
  • reduced availability (i.e. high 5xx) for ats-tls and varnish
  • high rate of NEL errors

Native Icinga/NRPE checks

  • zookeeper server (check_procs on java process)
  • LVS/service::catalog checks. Will be removed by T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager
  • MariaDB alerts (replica, disk space, read only, mysqld processes not running, etc)
  • cfssl signer per-CA and cfssl-multirootca unit status
  • acme-chief unit status
  • Corp OIT ldap mirror (check_ldap)
  • etcd replication (check_http_url_for_regexp_on_port!${::fqdn}!${etcdmirror_web_port}!/lag!^(-[1-9]|[0-5][^0-9]+))
  • kafka broker server (check_procs on java process)
  • exim queue
  • fastnetmon is alerting
  • phabricator.wikimedia.org unreachable / ssl expiring
  • ircd (check_ircd basic irc client to check connectivity and clients connected)
  • auth and recursive DNS (check_dns and check_dns_query_auth)
  • elasticsearch health check for frozen writes (check timestamp on ES /mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything)
  • "wiki content on commons" (and ssl expiry)
  • superset (tcp/http) check
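
As an illustration of what the etcd replication regexp in the list above accepts, here is a minimal Python sketch. The pattern is copied from the check definition; the sample response bodies (a bare lag number followed by a newline) are an assumption, since the actual format of the /lag endpoint is not shown here.

```python
import re

# Pattern copied verbatim from the Icinga check definition above.
LAG_OK = re.compile(r"^(-[1-9]|[0-5][^0-9]+)")


def lag_is_ok(body: str) -> bool:
    """Return True when the response body matches the check's regexp."""
    return LAG_OK.search(body) is not None


# Hypothetical response bodies: a number followed by a newline.
for body in ["0\n", "5\n", "6\n", "12\n", "-1\n"]:
    print(repr(body), lag_is_ok(body))
```

Under that assumption, a single digit from 0 to 5 followed by a non-digit matches, as does a body starting with a minus sign and a non-zero digit; a value of 6 or more, or any multi-digit positive value, does not.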

Note: checks belonging to some users (e.g. WMCS, fundraising) will be tackled as a separate effort.

Details

Repo               Branch      Lines +/-
operations/alerts  master      +126 -0
operations/alerts  master      +44 -0
operations/alerts  master      +117 -0
operations/puppet  production  +1 -1
operations/puppet  production  +9 -20
operations/puppet  production  +2 -19
operations/puppet  production  +2 -2
operations/puppet  production  +6 -0
operations/puppet  production  +21 -21
operations/puppet  production  +5 -0
operations/puppet  production  +189 -19
operations/puppet  production  +6 -1
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +8 -31
operations/puppet  production  +6 -17
operations/puppet  production  +19 -0
operations/puppet  production  +5 -5
operations/puppet  production  +1 -1
operations/puppet  production  +28 -2
operations/puppet  production  +5 -2
operations/puppet  production  +5 -0
operations/puppet  production  +1 -2
operations/puppet  production  +26 -7
operations/puppet  production  +0 -20
operations/puppet  production  +1 -0
operations/puppet  production  +4 -1
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +94 -27
operations/puppet  production  +4 -0
operations/puppet  production  +24 -0
operations/puppet  production  +0 -1
operations/alerts  master      +11 -8
operations/alerts  master      +74 -0
operations/puppet  production  +2 -44
operations/alerts  master      +6 -0
operations/puppet  production  +13 -6
operations/puppet  production  +78 -2
operations/puppet  production  +2 -12
operations/puppet  production  +0 -12
operations/alerts  master      +43 -0
operations/alerts  master      +66 -0
operations/puppet  production  +7 -1
operations/puppet  production  +3 -4
operations/puppet  production  +0 -36
operations/puppet  production  +4 -2
operations/puppet  production  +8 -0
operations/puppet  production  +16 -11
operations/alerts  master      +77 -0
operations/puppet  production  +0 -22
operations/alerts  master      +68 -0
operations/puppet  production  +0 -15
operations/alerts  master      +78 -0
operations/puppet  production  +5 -0

Event Timeline

Change 798448 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: limit mail alerts to prometheus/ops in codfw and eqiad

https://gerrit.wikimedia.org/r/798448

Change 798448 merged by Filippo Giunchedi:

[operations/alerts@master] sre: limit mail alerts to prometheus/ops in codfw and eqiad

https://gerrit.wikimedia.org/r/798448

Change 798526 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] Enforce hashtag-page in summary

https://gerrit.wikimedia.org/r/798526

Change 793723 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add fastnetmon alerting page

https://gerrit.wikimedia.org/r/793723

Change 793731 merged by Filippo Giunchedi:

[operations/puppet@production] fastnetmon: remove alert, ported to Prometheus / Alertmanager

https://gerrit.wikimedia.org/r/793731

Change 798526 merged by Filippo Giunchedi:

[operations/alerts@master] Enforce hashtag-page in summary

https://gerrit.wikimedia.org/r/798526

Change 801714 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] blackbox: add IRC probe module

https://gerrit.wikimedia.org/r/801714

Change 801723 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ldap-corp: disable paging

https://gerrit.wikimedia.org/r/801723

Change 801723 merged by Filippo Giunchedi:

[operations/puppet@production] ldap-corp: disable paging

https://gerrit.wikimedia.org/r/801723

Change 801714 merged by Filippo Giunchedi:

[operations/puppet@production] blackbox: add IRC probe module

https://gerrit.wikimedia.org/r/801714

Change 802071 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: TCP probe for ldap-ro

https://gerrit.wikimedia.org/r/802071

Change 802071 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: TCP probe for ldap-ro

https://gerrit.wikimedia.org/r/802071

Change 803553 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: generate per-service TCP blackbox module

https://gerrit.wikimedia.org/r/803553

Change 803554 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set SNI for ldap-ro

https://gerrit.wikimedia.org/r/803554

Change 803553 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: generate per-service TCP blackbox module

https://gerrit.wikimedia.org/r/803553

Change 803554 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set SNI for ldap-ro

https://gerrit.wikimedia.org/r/803554

Change 804266 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] phabricator: add blackbox http check

https://gerrit.wikimedia.org/r/804266

Change 804274 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: check commons.w.o with blackbox exporter

https://gerrit.wikimedia.org/r/804274

Change 804266 merged by Filippo Giunchedi:

[operations/puppet@production] phabricator: add blackbox http check

https://gerrit.wikimedia.org/r/804266

This works as expected (i.e. phab gets probed), though I think we'll want to revisit the labels, since the hostname doesn't show up (the IP address does):

probe_duration_seconds{address="10.64.16.8", family="ip4", instance="phabricator.wikimedia.org:443", job="probes/custom", module="http_phabricator_wikimedia_org_ip4", prometheus="ops", site="eqiad"}
probe_duration_seconds{address="2620:0:861:102:10:64:16:8", family="ip6", instance="phabricator.wikimedia.org:443", job="probes/custom", module="http_phabricator_wikimedia_org_ip6", prometheus="ops", site="eqiad"}

For silencing and ergonomics I think it makes sense to use the hostname as instance in this case, i.e. instance=phab1001:443. To address all "phabricator" probes we can match on module=~"http_phabricator_wikimedia_org.*". cc @jbond
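
To make the matcher idea concrete, here is a small Python sketch that mimics how a Prometheus `=~` matcher would select the two series shown above. The `select` helper is hypothetical and only approximates Prometheus semantics (fully anchored regexp, missing labels treated as empty strings); Python's `re` stands in for RE2, which is equivalent for patterns this simple.

```python
import re

# Label sets copied from the two probe_duration_seconds series above.
SERIES = [
    {
        "address": "10.64.16.8",
        "family": "ip4",
        "instance": "phabricator.wikimedia.org:443",
        "job": "probes/custom",
        "module": "http_phabricator_wikimedia_org_ip4",
        "prometheus": "ops",
        "site": "eqiad",
    },
    {
        "address": "2620:0:861:102:10:64:16:8",
        "family": "ip6",
        "instance": "phabricator.wikimedia.org:443",
        "job": "probes/custom",
        "module": "http_phabricator_wikimedia_org_ip6",
        "prometheus": "ops",
        "site": "eqiad",
    },
]


def select(series, label, pattern):
    """Hypothetical helper: keep series whose label value fully matches pattern."""
    rx = re.compile(pattern)
    return [s for s in series if rx.fullmatch(s.get(label, ""))]


# One matcher covers both address families of the phabricator probe.
both = select(SERIES, "module", "http_phabricator_wikimedia_org.*")
print([s["family"] for s in both])

# With the labels as they are today, a hostname-based matcher selects nothing.
print(select(SERIES, "instance", "phab1001:443"))
```

With the current labels the module matcher selects both the ip4 and ip6 series, while a matcher on instance="phab1001:443" selects none, which is the gap the proposed label change would close.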

Something else to be addressed: changing the blackbox modules config didn't trigger notify => Exec['assemble blackbox.yml'], and it must.

This works as expected

nice :) i can take a look at making some tweaks next week

Change 805816 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use hostname for blackbox::check::http

https://gerrit.wikimedia.org/r/805816

Change 805816 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use hostname for blackbox::check::http

https://gerrit.wikimedia.org/r/805816

Change 806207 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] phabricator: get envoy to listen on ipv6

https://gerrit.wikimedia.org/r/806207

Change 806207 merged by Dzahn:

[operations/puppet@production] phabricator: get envoy to listen on ipv6

https://gerrit.wikimedia.org/r/806207

Change 809586 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: adjust check::http params based on distro

https://gerrit.wikimedia.org/r/809586

This causes some shenanigans with blackbox exporter versions and exported resources, meaning we can't effectively vary options based on the distro. I've decided to backport Bullseye's version of blackbox-exporter to Buster instead and then upgrade, so we can ditch the distro-based conditionals. cc @taavi

Change 810910 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove distro-based conditionals for blackbox

https://gerrit.wikimedia.org/r/810910

Change 810910 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove distro-based conditionals for blackbox

https://gerrit.wikimedia.org/r/810910

Change 809586 abandoned by Filippo Giunchedi:

[operations/puppet@production] prometheus: adjust check::http params based on distro

Reason:

blackbox-exporter upgraded instead

https://gerrit.wikimedia.org/r/809586

Change 810929 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: disable protocol fallback for blackbox::check::http

https://gerrit.wikimedia.org/r/810929

Change 810929 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: disable protocol fallback for blackbox::check::http

https://gerrit.wikimedia.org/r/810929

Change 804274 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: check commons.w.o with blackbox exporter

https://gerrit.wikimedia.org/r/804274

Change 811241 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: deploy custom probedown alerts

https://gerrit.wikimedia.org/r/811241

Change 811242 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: deploy alerts as yml not yaml

https://gerrit.wikimedia.org/r/811242

Change 811294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: move blackbox http check to prometheus::rule

https://gerrit.wikimedia.org/r/811294

Change 811295 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: introduce blackbox::module

https://gerrit.wikimedia.org/r/811295

Change 811296 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: switch to blackbox::module

https://gerrit.wikimedia.org/r/811296

Change 811310 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: pass in ip4/ip6 addresses for commons blackbox

https://gerrit.wikimedia.org/r/811310

Change 811310 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: pass in ip4/ip6 addresses for commons blackbox

https://gerrit.wikimedia.org/r/811310

Change 811241 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: deploy custom probedown alerts

https://gerrit.wikimedia.org/r/811241

Change 811242 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: deploy alerts as yml not yaml

https://gerrit.wikimedia.org/r/811242

Change 811294 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: move blackbox http check to prometheus::rule

https://gerrit.wikimedia.org/r/811294

Change 811295 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: introduce blackbox::module

https://gerrit.wikimedia.org/r/811295

Change 811296 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: switch to blackbox::module

https://gerrit.wikimedia.org/r/811296

Change 812846 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] phabricator: switch to prometheus-only network probes/checks

https://gerrit.wikimedia.org/r/812846

Change 812854 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: switch to prometheus-only probes for commons

https://gerrit.wikimedia.org/r/812854

Change 812854 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: switch to prometheus-only probes for commons

https://gerrit.wikimedia.org/r/812854

Change 813213 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: fix ip addresses for commons.wikimedia.org probe

https://gerrit.wikimedia.org/r/813213

Change 813213 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: fix ip addresses for commons.wikimedia.org probe

https://gerrit.wikimedia.org/r/813213

Change 815258 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: set 'instance' for commons probe

https://gerrit.wikimedia.org/r/815258

Change 815258 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: set 'instance' for commons probe

https://gerrit.wikimedia.org/r/815258

Change 815304 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: enable x509 CN validation in blackbox

https://gerrit.wikimedia.org/r/815304

Change 815305 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use protocol in blackbox target files

https://gerrit.wikimedia.org/r/815305

Change 815306 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add blackbox TCP check

https://gerrit.wikimedia.org/r/815306

Change 815307 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] syslog: probe TLS endpoint with blackbox

https://gerrit.wikimedia.org/r/815307

Change 815304 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: enable x509 CN validation in blackbox

https://gerrit.wikimedia.org/r/815304

Change 815305 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use protocol in blackbox target files

https://gerrit.wikimedia.org/r/815305

Change 815306 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add blackbox TCP check

https://gerrit.wikimedia.org/r/815306

Change 815685 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: adjust blackbox check params/types

https://gerrit.wikimedia.org/r/815685

Change 815685 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: adjust blackbox check params/types

https://gerrit.wikimedia.org/r/815685

Change 815307 merged by Filippo Giunchedi:

[operations/puppet@production] syslog: probe TLS endpoint with blackbox

https://gerrit.wikimedia.org/r/815307

Change 805815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] mw_rc_irc: check ircd availability with blackbox prober

https://gerrit.wikimedia.org/r/805815

Change 815713 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: fix blackbox timeout Pattern

https://gerrit.wikimedia.org/r/815713

Change 815713 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: fix blackbox timeout Pattern

https://gerrit.wikimedia.org/r/815713

Change 812846 merged by Filippo Giunchedi:

[operations/puppet@production] phabricator: switch to prometheus-only network probes/checks

https://gerrit.wikimedia.org/r/812846

Change 805815 merged by Filippo Giunchedi:

[operations/puppet@production] mw_rc_irc: check ircd availability with blackbox prober

https://gerrit.wikimedia.org/r/805815

Change 815897 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: default server_name to hostname in tcp check

https://gerrit.wikimedia.org/r/815897

Change 815897 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: default server_name to hostname in tcp check

https://gerrit.wikimedia.org/r/815897

Change 818108 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: port Kafka alerts from Icinga

https://gerrit.wikimedia.org/r/818108

Change 818402 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: port zookeeper alerts

https://gerrit.wikimedia.org/r/818402

Change 818108 merged by Filippo Giunchedi:

[operations/alerts@master] sre: port Kafka alerts from Icinga

https://gerrit.wikimedia.org/r/818108

Change 825741 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: alert on appserver unavailability

https://gerrit.wikimedia.org/r/825741

Change 825741 merged by Filippo Giunchedi:

[operations/alerts@master] sre: alert on appserver unavailability

https://gerrit.wikimedia.org/r/825741

Change 818402 merged by Filippo Giunchedi:

[operations/alerts@master] sre: port Zookeeper alerts

https://gerrit.wikimedia.org/r/818402

I'm untagging this from current work as we're not going to work on it for now, favouring the progressive migration off Icinga instead.