Observability plans to sunset Icinga at the end of FY25/26, so FR needs to migrate off it. We plan to use Prometheus alerting, following what has been done for production.
As Icinga is phased out, the frack hosts will need to send their alerts to Alertmanager. We intend to use the frack Prometheus instance to send those alerts.
Tasks to accomplish:
- update pfw / iptables rules so frmon can contact the alerts hosts
- verify which metrics already in Prometheus will work for alerts
- set up config in frack prometheus to send alerts to alerts hosts
- host config
- user / service account
- test creating or moving a metric/alert to prometheus
- see if currently reported nsca metrics in /var/spool/prometheus/nagios_nsca.prom on each host would be usable
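
A rough sketch of what the "send alerts to alerts hosts" config could look like in the frack Prometheus `prometheus.yml`. The target hostnames and the rule file path are placeholders (not taken from this task), and the scheme and any auth settings depend on how the alerts hosts end up being exposed to frmon:

```yaml
# Sketch only: hostnames and paths below are placeholders, not the real values.
rule_files:
  # Hypothetical location for the rule files named further down
  # (processes.yml, redis.yml, mariadb.yml, ...).
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - scheme: https              # assumption; use http if TLS is not terminated there
      static_configs:
        - targets:
            # Placeholder alerts hosts; 9093 is the Alertmanager default port.
            - alerts-host-1.example.org:9093
            - alerts-host-2.example.org:9093
```

The firewall change in the first task would need to allow frmon to reach whatever host and port end up in `targets`.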
Helpful docs / links:
{T393640}
https://wikitech.wikimedia.org/wiki/Alertmanager
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master
https://prometheus-eqiad.wikimedia.org/ops/config
https://prometheus-eqiad.wikimedia.org/ops/alerts?search
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org
Icinga checks:
- check_procs (for specific/named processes)
- replacement: TBD (maybe use the systemd unit state, e.g. node_systemd_unit_state{instance="frdata1002.frack.eqiad.wmnet:9100",name="apache2.service"})
- roles:
- check_apache2: frdata payments_listener payments
- check_coworker: civi frdev
- check_dagster: fran
- check_freeradius: auth
- check_krb5kdc: auth
- check_metabase: fran
- check_nginx: frdata frpig payments
- check_procs (total)
- replacement: node_processes_state
- roles: all
- alert rules: processes.yml
- check_zombie aka check_procs in state=Z
- metric: node_processes_state{state="Z"}
- roles: all
- check_audit_downloads: civi
- replacement: prometheus::collector::audit_file, metric: audit_max_file_age
- roles: crm
- alert rules: added to crm.yml
- check_cert
- replacement: prometheus::collector::certificate_expiry, metric: cert_expiry
- roles:
- check_cert_apple_api_cert: frpm
- check_cert_clientcert_ca: frpm
- check_cert_paypal_api_cert: frpm
- check_cert_puppet_ca: frpm
- check_cert_kafkatee: banner_logger
- alert rules: TBD
- check_disk: all
- check_endpoints:
- metrics: endpoint_check_time, endpoint_check_result
- alert rules: alerting on endpoint_check_result
- roles: civi auth frpig payments
- check_haproxy: pay-lb, maybe fransw also
- metrics:
- haproxy_process_uptime_seconds
- haproxy_server_status (maybe for up/down/drain)
- haproxy_backend_active_servers
- haproxy_backend_agg_check_status (maybe)
- check_impression_logs: frban
- replacement: prometheus metrics generated from rotate_impression_logs.pl
- roles: banner_logger
- alert rules: impression_logs.yml
- check_ipsec: fran frban
- check_kafkatee: frban
- metrics:
- kafkatee_broker_topic_state{broker=~"kafka-jumbo.*"}
- ?
- alert rules: kafkatee.yml
- check_listener_ipn: frpig
- replacement: check_endpoints
- check_load: all
- check_mailq: civi frdev frmx
- metrics: postfix_mailqueue{queue="active"}, postfix_mailqueue_total
- alert rules: postfix.yml
- check_memory: all
- check_missing_thank_yous: frdb
- metric: missing_thank_yous
- alert rules: database_query.yml
- check_mysql: frdata frdb payments
- alert rules: mariadb.yml
- MariadbReplicationRunning
- MariadbNoReplicas
- MariadbSSLDisabled
- MariadbReplicationLag: needs more work so it does not alert during backup runs, e.g. by checking mysql_backup_is_running == 1
- check_http
- payments-paymentswiki
- analytics_trino-trino
- check_puppetrun: all
- can use the prod Prometheus collector, but we will need to adjust the catalog_version/current commit portion since we use a different method for our version. Alternatively, we adjust puppet to use a commit-hash-based version
- check_raid: all
- fundraising queue (database)
- check_recurring_contrib_processing: frdb
- metric: civicrm_contribution_recur
- check_recurring_gc_contribs_missed: frdb - retiring
- check_recurring_gc_failures_missed: frdb - retiring
- check_recurring_gc_jobs_required: frdb - retiring
- check_recurring_gc_schedule_sanity: frdb - retiring
- alert rules: database_query.yml
- redis
- check_redis (memory utilization), check_redis_donor_prefs (memory utilization): frqueue
- replacement: prometheus metrics collected by redis exporter
- roles: frqueue
- alert rules: redis.yml
- check_redis (replag), check_redis_donor_prefs (replag): frqueue
- replacement: prometheus metrics collected by redis exporter. Not looking at replag, since it catches up quickly, but at whether replication is established
- roles: frqueue
- alert rules: redis.yml
- check_rsyslog_backlog: all
- check_smtp: frmx
- check_ssl:
- civicrm
- civicrm-civiproxy
- frdata-fundraising
- frdata-frdata
- frdev-civicrm-staging
- frpig
- payments
- payments-staging
- check_timesync: all
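
For reference, a minimal sketch of how a few of the replacements above might look as Prometheus alerting rules. The alert names (other than MariadbReplicationLag), thresholds, `for` durations, and severity labels are all assumptions for illustration. `mysql_slave_status_seconds_behind_master` is the usual mysqld_exporter lag metric, but it has not been confirmed as the one used on frack, and the `unless on (instance)` match only works if both series carry the same `instance` label:

```yaml
# Sketch only: names, thresholds and durations are illustrative assumptions.
groups:
  - name: frack_examples
    rules:
      # check_zombie replacement, using the metric listed above.
      - alert: ZombieProcesses            # hypothetical alert name
        expr: node_processes_state{state="Z"} > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} zombie processes on {{ $labels.instance }}"

      # check_procs (named process) replacement via systemd unit state.
      # node_systemd_unit_state is 1 for the state the unit is currently in.
      - alert: Apache2UnitNotActive       # hypothetical alert name
        expr: node_systemd_unit_state{name="apache2.service",state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "apache2.service is not active on {{ $labels.instance }}"

      # Replication lag, suppressed while a backup is running.
      # Assumes both metrics share the same instance label value.
      - alert: MariadbReplicationLag
        expr: >
          (mysql_slave_status_seconds_behind_master > 300)
          unless on (instance)
          (mysql_backup_is_running == 1)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag on {{ $labels.instance }}"
```

If the backup metric is exported on a different port than the MariaDB metrics, the `instance` labels will differ and the match would need a `label_replace` or a shared host label instead.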