
Shift frack alerting to use prometheus-alertmanager instead of icinga
Open, Needs Triage, Public

Description

Observability plans to sunset Icinga at the end of FY25/26, so FR needs to migrate off of it. We're planning to go with Prometheus alerting based on what's been done for production.

As Icinga is phased out, the frack hosts will need to send their alerts to Alertmanager. We plan to use the frack Prometheus instance to send the alerts to Alertmanager.

Tasks to accomplish:

  • update pfw / iptables rules for frmon to contact alerts hosts
  • verify what metrics currently in prometheus will work for alerts
  • set up config in frack prometheus to send alerts to alerts hosts
    • host config
    • user / service account
  • test creating or moving a metric/alert to prometheus
  • see if currently reported nsca metrics in /var/spool/prometheus/nagios_nsca.prom on each host would be usable
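For context on the last item, the textfile-collector files under /var/spool/prometheus/ use the standard Prometheus exposition format. A hypothetical sketch of what nagios_nsca.prom might contain (the metric and label names here are illustrative placeholders, not taken from the actual file):

```
# HELP nsca_check_status Status of an NSCA-reported check (hypothetical metric name)
# TYPE nsca_check_status gauge
nsca_check_status{check="check_disk"} 0
nsca_check_status{check="check_load"} 0
```

If the file already carries per-check status gauges along these lines, alert rules could be written directly against them.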

Helpful docs / links:
{T393640}
https://wikitech.wikimedia.org/wiki/Alertmanager
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master
https://prometheus-eqiad.wikimedia.org/ops/config
https://prometheus-eqiad.wikimedia.org/ops/alerts?search
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org

Icinga checks:

  • check_procs (for specific/named processes)
    • replacement: TBD ( maybe use the systemd unit info like node_systemd_unit_state{instance="frdata1002.frack.eqiad.wmnet:9100",name="apache2.service"} )
    • roles:
      • check_apache2: frdata payments_listener payments
      • check_coworker: civi frdev
      • check_dagster: fran
      • check_freeradius: auth
      • check_krb5kdc: auth
      • check_metabase: fran
      • check_nginx: frdata frpig payments
  • check_procs (total)
    • replacement: node_processes_state
    • roles: all
    • alert rules: processes.yml
  • check_zombie aka check_procs in state=Z
    • metric: node_processes_state{state="Z"}
    • roles: all
  • check_audit_downloads: civi
    • replacement: prometheus::collector::audit_file metric: audit_max_file_age
    • roles: crm
    • alert rules: added to crm.yml
  • check_cert
    • replacement: prometheus::collector::certificate_expiry metric: cert_expiry
    • roles:
      • check_cert_apple_api_cert: frpm
      • check_cert_clientcert_ca: frpm
      • check_cert_paypal_api_cert: frpm
      • check_cert_puppet_ca: frpm
      • check_cert_kafkatee: banner_logger
    • alert rules: TBD
  • check_disk: all
  • check_endpoints:
    • metrics endpoint_check_time endpoint_check_result
    • alert rules: alerting on endpoint_check_result
    • roles: civi auth frpig payments
  • check_haproxy: pay-lb, maybe fransw also
    • metrics:
      • haproxy_process_uptime_seconds
      • haproxy_server_status (maybe for up/down/drain)
      • haproxy_backend_active_servers
      • haproxy_backend_agg_check_status (maybe)
  • check_impression_logs: frban
    • replacement: prometheus metrics generated from rotate_impression_logs.pl
    • roles: banner_logger
    • alert rules: impression_logs.yml
  • check_ipsec: fran frban
  • check_kafkatee: frban
    • metrics:
      • kafkatee_broker_topic_state{broker=~"kafka-jumbo.*"}
      • ?
    • alert rules: kafkatee.yml
  • check_listener_ipn: frpig
    • replacement: check_endpoints
  • check_load: all
  • check_mailq: civi frdev frmx
    • metrics: postfix_mailqueue{queue="active"} postfix_mailqueue_total
    • alert rules: postfix.yml
  • check_memory: all
  • check_missing_thank_yous: frdb
    • metric: missing_thank_yous
    • alert_rules: database_query.yml
  • check_mysql: frdata frdb payments
    • alert rules: mariadb.yml
      • MariadbReplicationRunning
      • MariadbNoReplicas
      • MariadbSSLDisabled
      • MariadbReplicationLag - needs more work to not alert during backup runs by checking mysql_backup_is_running == 1
  • check_http
    • payments-paymentswiki
    • analytics_trino-trino
  • check_puppetrun: all
can use the prod Prometheus collector, but we will need to adjust the catalog_version/current-commit portion since we use a different method for our version; alternatively, we could adjust Puppet to use a commit-hash-based version
  • check_raid: all
  • fundraising queue (database)
    • check_recurring_contrib_processing: frdb
      • metric: civicrm_contribution_recur
    • check_recurring_gc_contribs_missed: frdb - retiring
    • check_recurring_gc_failures_missed: frdb - retiring
    • check_recurring_gc_jobs_required: frdb - retiring
    • check_recurring_gc_schedule_sanity: frdb - retiring
    • alert rules: database_query.yml
  • redis
    • check_redis (memory utilization), check_redis_donor_prefs (memory utilization): frqueue
      • replacement: prometheus metrics collected by redis exporter
      • roles: frqueue
      • alert rules: redis.yml
    • check_redis (replag), check_redis_donor_prefs (replag): frqueue
      • replacement: prometheus metrics collected by redis exporter. not looking at replag since it catches up quickly, but at replication established or not
      • roles: frqueue
      • alert rules: redis.yml
  • check_rsyslog_backlog: all
  • check_smtp: frmx
  • check_ssl:
    • civicrm
    • civicrm-civiproxy
    • frdata-fundraising
    • frdata-frdata
    • frdev-civicrm-staging
    • frpig
    • payments
    • payments-staging
  • check_timesync: all
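As a concrete illustration of the check_procs replacement sketched above, a Prometheus rule against node_systemd_unit_state might look roughly like the following (assuming node_exporter runs with the systemd collector enabled; the alert name, threshold, and labels are illustrative, not an actual deployed rule):

```yaml
groups:
  - name: process-alerts
    rules:
      - alert: Apache2NotRunning
        # node_systemd_unit_state is 1 for the unit's current state, 0 otherwise,
        # so matching state="active" == 0 means the unit is not active
        expr: node_systemd_unit_state{name="apache2.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
          team: 'fr-tech-ops'
        annotations:
          summary: "apache2.service is not active on {{ $labels.instance }}"
```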

Event Timeline

In the puppet-private repo:
commit f94d1d4501e54136766722074b6821502213ccc1 staged on T367370_prom_alerts branch for pfw config.
commit 064db8663032904ceeaa6b1a32fc6ab93343dcc4 staged on T367370_prom_alerts branch for iptables config.

Something else I wanted to add: with respect to authoring and deploying alerts, we have essentially centralized all alerts in the operations/alerts.git repository. That repo already contains scaffolding such as CI/test integration, and if you'd like to also commit FR alerts there, that's no problem. Deployment is straightforward in the sense that in production we clone the repo and then selectively deploy alerts based on user-provided directions (#deploy comments at the top of the file). HTH

Jgreen renamed this task from "Shift frack alerting to use alertmanager instead of icinga" to "Shift frack alerting to use prometheus-alertmanager instead of icinga". Jun 24 2025, 1:07 PM
Jgreen updated the task description.
Jgreen subscribed.

Change #1202827 had a related patch set uploaded (by Jgreen; author: Jgreen):

[operations/puppet@production] nsca_frack.cfg.erb remove deprecated check_endpoints service check

https://gerrit.wikimedia.org/r/1202827

@fgiunchedi We (fr-tech) are getting close to live testing with some alerts. We have started to build a set of alerts and are firing them off to our local alertmanager instance that will just send us email. With a config change, we could start pointing those at the production alertmanager instance.

We think the next logical step would be to set up the contact groups (team?) within alertmanager so that we can get them routed correctly via email/irc/splunk on-call. We want to make sure that we tag our alerts properly so that we don't cause issues for other folks. Is this something you can assist us with/point us in the right direction for?

Additionally, we have one alert that we saw today that is routed only through email (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-netops/fundraising.yaml) that we'd like to get routed via splunk on-call to page us also. Perhaps that will be possible after the contact groups are set up.

Thank you for following up, @Dwisehaupt, and I'm glad to know you are making progress! These days I am no longer part of Observability, and will defer to @hnowlan to take it from here.

Hi @Dwisehaupt,

> @fgiunchedi We (fr-tech) are getting close to live testing with some alerts. We have started to build a set of alerts and are firing them off to our local alertmanager instance that will just send us email. With a config change, we could start pointing those at the production alertmanager instance.
>
> We think the next logical step would be to set up the contact groups (team?) within alertmanager so that we can get them routed correctly via email/irc/splunk on-call. We want to make sure that we tag our alerts properly so that we don't cause issues for other folks. Is this something you can assist us with/point us in the right direction for?

Regarding the contact groups, this depends on the Alertmanager configuration and there are two outcomes:

  1. For a team already defined in the config: In this case we need to implement new receivers, which define the alert destinations based on specific thresholds (for example, an IRC/Slack channel).
  2. For a new team: In this case we need to create a new matching rule in the Alertmanager configuration along with the desired receivers for it.

In order not to cause alert issues for other teams, the correct team and severity labels must be set, as Alertmanager uses them to route alerts to the desired destinations. I'm happy to help either by reviewing your patches on this or sending them myself if you'd like assistance. Additionally, here's our Alertmanager user guide.
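To make the second outcome concrete, a new team's routing might be sketched in the Alertmanager configuration roughly as follows (the receiver name and email address are hypothetical placeholders, not the actual production config):

```yaml
route:
  routes:
    # route alerts carrying team=fr-tech to the team's receiver
    - matchers:
        - team = fr-tech
      receiver: fr-tech-email
receivers:
  - name: fr-tech-email
    email_configs:
      - to: 'fr-tech@example.org'  # placeholder address
```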

> Additionally, we have one alert that we saw today that is routed only through email (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-netops/fundraising.yaml) that we'd like to get routed via splunk on-call to page us also. Perhaps that will be possible after the contact groups are set up.

That alert is handled by the following match in the Alertmanager configuration. Based on the match and severity, that alert would be sent to the following receivers:

  • sre-irc
  • infrastructure-foundations-irc

I don't see any email receivers matching that alert; however, we can set up other receivers that would (for example) send it to a Slack channel.

Regarding the page for that alert, there are two outcomes:

  1. The alert must page SRE: In this case there's already a receiver for SRE pages. In order for it to page, all that's needed is to change the severity from critical to page.
  2. The alert must page another team: In this case we need to implement a routing key on SplunkOnCall, then add a new receiver for that routing key on the alertmanager config.

I can help you create the routing key on SplunkOnCall if you give me the team name.
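For the first outcome, the change is confined to the alert's labels in the rule file, e.g. (sketch only, team name illustrative):

```yaml
labels:
  severity: page   # previously: critical
  team: sre
```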

Jgreen updated the task description.

Change #1204648 had a related patch set uploaded (by Dwisehaupt; author: Dwisehaupt):

[operations/puppet@production] Alertmanager: Add fr-tech-ops and update fr-tech groups

https://gerrit.wikimedia.org/r/1204648

@andrea.denisse Thanks for the feedback. We want to mimic what we currently have for the icinga alerting groups. I think I have done that with this commit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204648). It goes a little further to split out fr-tech and fr-tech-ops since we have some hardware/OS level alerts that the whole fr-tech team doesn't need to have.

Our alert rules will live in fundraising and fire from our prometheus server. Given that, I believe we can reference the team in our rules to make sure that they get properly routed. Here is an example of one of our rules that we have been testing locally:

---

groups:

  - name: queue-alerts

    rules:

      - &redis_queue_size_warn
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 1500
        for: 5m
        labels:
          severity: warning
          team: 'fr-tech'
        annotations:
          description: 'High redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is high: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'
      - &redis_queue_size_crit
        alert: RedisQueueSize
        expr:
          redis_queue_total{
              cluster="frqueue",
              queue=~"(contribution_tracking|payments_init|pending|refund)"}
            > 2000
        for: 5m
        labels:
          severity: critical
          team: 'fr-tech'
        annotations:
          description: 'Critical redis queue size'
          summary: "Redis Queue {{ $labels.queue }} is Critical: [{{ $value }}]"
          dashboard: 'https://frmon.wikimedia.org/d/R5m3iU1Wk/queue?orgId=1&from=now-24h&to=now&timezone=utc'

Does this make sense?

As for the PfwCoreBGPDown alert: reading through the config, I think we are getting email alerts due to this rule. When we have the new groups set up, maybe we can have that route to our paging level.


Hi @Dwisehaupt, the routes and the patch LGTM.

I also checked SplunkOnCall and there's already a Fundraising team there with you and @Jgreen so the pages should work.

The Puppet repo has a special set of permissions so please let me know if you need help merging when ready.


@andrea.denisse Thanks for the review! We (fr-tech-ops) don't have rights to deploy changes to the prod puppet repo. Feel free to merge and deploy the change when it's convenient for you.

Change #1204648 merged by Andrea Denisse:

[operations/puppet@production] Alertmanager: Add fr-tech-ops and update fr-tech groups

https://gerrit.wikimedia.org/r/1204648
