Move WMCS off of Icinga and introduce alertmanager
Open, Medium, Public, Goal

Description

Below is a list of Icinga checks with contact group wmcs that need to be migrated to Alertmanager/Prometheus (or deprecated) for WMCS to be off Icinga. Base checks are out of scope for this task.

| title | profile | contacts | status |
| ----- | ----- | ----- | ----- |
| ensure_nova_compute_running | profile::openstack::eqiad1::nova::compute::service | wmcs-team,admins | migrated |
| ensure_running_kvm_instances | profile::openstack::eqiad1::nova::compute::service | wmcs-team,admins | pending |
| ensure_single_nova_compute_proc | profile::openstack::eqiad1::nova::compute::service | wmcs-team,admins | migrated |
| check-cinder-snapshot-leaks | profile::openstack::base::cinder | wmcs-team-email,admins | pending |
| check-cloudinfra-spread | profile::openstack::eqiad1::keystone::service | wmcs-team-email,wmcs-bots | removed |
| check-deployment-prep-spread | profile::openstack::eqiad1::keystone::service | wmcs-team-email,wmcs-bots | removed |
| check-flavor_aggregates | profile::openstack::eqiad1::nova::fullstack::service | wmcs-team-email,wmcs-bots | pending |
| check-tools-spread | profile::openstack::eqiad1::keystone::service | wmcs-team-email,wmcs-bots | removed |
| high_iowait_stalling | profile::dumps::distribution::monitoring | wmcs-team,admins | migrated |
| network_in_saturated | profile::dumps::distribution::monitoring | wmcs-team,admins | migrated |
| network_out_saturated | profile::dumps::distribution::monitoring | wmcs-team,admins | migrated |
| https_wikitech-static | profile::icinga | wmcs-bots,admins | pending |
| tools-checker-dumps | profile::icinga | wmcs-team-email | pending |
| tools-checker-etcd-k8s | profile::icinga | wmcs-team-email,wmcs-bots | pending |
| tools-checker-grid-continuous-buster | profile::icinga | wmcs-bots,wmcs-team-email | pending |
| tools-checker-grid-start-buster | profile::icinga | wmcs-bots,wmcs-team-email | pending |
| tools-checker-k8s-node-ready | profile::icinga | wmcs-team-email,wmcs-bots | pending |
| tools-checker-labs-dns-private | profile::icinga | wmcs-team | pending |
| tools-checker-ldap | profile::icinga | wmcs-team-email,wmcs-bots | pending |
| tools-checker-nfs-home | profile::icinga | wmcs-team | pending |
| tools-checker-redis | profile::icinga | wmcs-team | pending |
| tools-checker-self | profile::icinga | wmcs-team | pending |
| tools-checker-toolscron | profile::icinga | wmcs-team-email | pending |
| wikitech-static-main-page | profile::icinga | wmcs-bots,admins | pending |
| check-flavor_aggregates | profile::openstack::codfw1dev::nova::fullstack::service | wmcs-team-email,wmcs-bots | pending |
| check-neutron-conntrack | profile::openstack::base::neutron::l3_agent | wmcs-team-email,admins | pending |
| Auth DNS TCP <name> | profile::openstack::eqiad1::pdns::auth::service | admins | pending |
| Auth DNS UDP <name> | profile::openstack::eqiad1::pdns::auth::service | admins | pending |
| DNS resolution <name> | profile::openstack::eqiad1::pdns::auth::service | admins | pending |

! MIGRATION TABLE !

| Migrated? (Y/N) | Title | Resource Type | Command | File | Profiles |
| ----- | ----- | ----- | ----- | ----- | ----- |
| N | Auth DNS TCP: X.eqiad1.X on server X.openstack.eqiad1.wikimediacloud.org | Monitoring::Service | check_dig_tcp | modules/profile/manifests/openstack/eqiad1/pdns/auth/service.pp:32 | profile::openstack::eqiad1::pdns::auth::service |
| N | Auth DNS UDP: X.eqiad1.X on server X.openstack.eqiad1.wikimediacloud.org | Monitoring::Service | check_dig | modules/profile/manifests/openstack/eqiad1/pdns/auth/service.pp:25 | profile::openstack::eqiad1::pdns::auth::service |
| N | check-flavor_aggregates | Nrpe::Monitor_service | /usr/local/lib/nagios/plugins/check_flavor_properties | modules/openstack/manifests/nova/fullstack/monitor.pp:8 | profile::openstack::base::nova::fullstack::service, profile::openstack::eqiad1::nova::fullstack::service, profile::openstack::codfw1dev::nova::fullstack::service |
| N | tools-checker-dumps | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:43 | profile::icinga |
| N | tools-checker-ldap | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:34 | profile::icinga |
| N | ensure_running_kvm_instances | Nrpe::Monitor_service | /usr/lib/nagios/plugins/check_procs | modules/openstack/manifests/nova/compute/monitor.pp:32 | profile::openstack::eqiad1::nova::compute::service |
| N | check-cinder-snapshot-leaks | Nrpe::Monitor_service | /usr/local/bin/check_cinder_snapshot_leaks.py | modules/openstack/manifests/cinder/monitor.pp:17 | profile::openstack::base::cinder, profile::openstack::codfw1dev::cinder, profile::openstack::eqiad1::cinder |
| N | tools-checker-redis | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:61 | profile::icinga |
| N | tools-checker-etcd-k8s | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:23 | profile::icinga |
| N | tools-checker-nfs-home | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:52 | profile::icinga |
| N | tools-checker-labs-dns-private | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:14 | profile::icinga |
| N | check-neutron-conntrack | Nrpe::Monitor_service | /usr/local/lib/nagios/plugins/check_neutron_conntrack | modules/openstack/manifests/monitor/neutron/l3_agent_conntrack.pp:7 | profile::openstack::base::neutron::l3_agent, profile::openstack::eqiad1::neutron::l3_agent, profile::openstack::codfw1dev::neutron::l3_agent |
| N | tools-checker-self | Monitoring::Service | check_http_url_at_address_for_string_with_timeout | modules/icinga/manifests/monitor/toollabs.pp:70 | profile::icinga |

Event Timeline

taavi renamed this task from Move off from icinga and introduce alertmanager to Move WMCS off of Icinga and introduce alertmanager. Dec 1 2023, 12:54 PM
taavi claimed this task.
taavi removed a project: wmcs-retrospective.
taavi updated the task description.
taavi updated the task description.
dcaro updated the task description.
taavi changed the task status from Open to In Progress. Jan 17 2024, 4:00 PM

In case it is useful, as part of the icinga migration I've been collecting checks and their type/status here: https://docs.google.com/spreadsheets/d/19nxCXldb804TJCXGy4Z2BHG_1wRksRnKcPC6sXfjQuM/edit#gid=1831147731

taavi removed taavi as the assignee of this task. Jun 25 2024, 3:35 PM

Some very low-hanging fruit for making progress on this task is the following set of Prometheus-based checks:

monitoring::check_prometheus { 'network_out_saturated':
monitoring::check_prometheus { 'network_in_saturated':
monitoring::check_prometheus { 'high_iowait_stalling':

Those are straightforward to port to alerts.git since they are already Prometheus-based, or, even better, can be deleted if they are no longer relevant.
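
For illustration only, here is a rough Python sketch of what one of these checks might look like once expressed as a standard Prometheus alerting rule group (`groups` / `rules` / `alert` / `expr` / `for` / `labels` / `annotations`). Everything specific in it is an assumption and not taken from this task or from alerts.git: the alert name `NetworkOutSaturated`, the group name, the `team` and `severity` label values, the PromQL expression and its threshold are all hypothetical placeholders. The real query should be copied from the existing monitoring::check_prometheus definition.

```python
import json


def build_alert_rule(name, expr, duration, severity, summary):
    """Return one Prometheus alerting rule as a plain dict."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        # Label names and values are assumptions; use whatever the
        # target repository's routing actually expects.
        "labels": {"team": "wmcs", "severity": severity},
        "annotations": {"summary": summary},
    }


def build_rule_group(group_name, rules):
    """Wrap rules in the top-level 'groups' structure of a rule file."""
    return {"groups": [{"name": group_name, "rules": list(rules)}]}


# Hypothetical expression and threshold, NOT the query used by the
# existing check; replace with the query from the Puppet definition.
rule = build_alert_rule(
    name="NetworkOutSaturated",
    expr='rate(node_network_transmit_bytes_total{device!="lo"}[5m]) > 1e9',
    duration="15m",
    severity="warning",
    summary="Outbound network traffic is close to saturation",
)

group = build_rule_group("wmcs_dumps_distribution", [rule])

# Dumped as JSON to avoid a third-party YAML dependency; the structure
# maps one-to-one onto the YAML layout of a rule file.
print(json.dumps(group, indent=2))
```

The point of the sketch is only that a check which already has a Prometheus query needs little more than a name, a `for` duration and routing labels to become an alert; the query itself carries over unchanged.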

Change #1111328 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/alerts@master] wmcs: Migrate network saturation alerts to the alerts.git repository

https://gerrit.wikimedia.org/r/1111328

Change #1111338 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/alerts@master] wmcs: Migrate iowait stalling alerts to the alerts.git repository

https://gerrit.wikimedia.org/r/1111338

Change #1111340 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] wmcs: Remove Puppet files for migrated Prometheus alerts

https://gerrit.wikimedia.org/r/1111340

Change #1111338 merged by Andrea Denisse:

[operations/alerts@master] wmcs: Migrate iowait stalling alerts to the alerts.git repository

https://gerrit.wikimedia.org/r/1111338

Change #1111328 merged by Andrea Denisse:

[operations/alerts@master] wmcs: Migrate network saturation alerts to the alerts.git repository

https://gerrit.wikimedia.org/r/1111328

Change #1111340 merged by Andrea Denisse:

[operations/puppet@production] wmcs: Remove Puppet files for migrated Prometheus alerts

https://gerrit.wikimedia.org/r/1111340

Aklapper changed the task status from In Progress to Open. Apr 11 2025, 10:20 PM

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one year (see T380300). Feel free to set that status again, or rather break down into smaller subtasks.

Change #1155138 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] monitoring services: add migration task T328502 to instances

https://gerrit.wikimedia.org/r/1155138

Change #1155138 merged by Tiziano Fogli:

[operations/puppet@production] monitoring services: add migration task T328502 to instances

https://gerrit.wikimedia.org/r/1155138

tappof changed the subtype of this task from "Task" to "Goal". Sep 2 2025, 1:33 PM

Change #1184715 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] wmcs: remove HighIOWaitStalling

https://gerrit.wikimedia.org/r/1184715

Change #1184715 merged by Filippo Giunchedi:

[operations/alerts@master] wmcs: remove HighIOWaitStalling

https://gerrit.wikimedia.org/r/1184715

Change #1200012 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] haproxy: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks

https://gerrit.wikimedia.org/r/1200012

Change #1200016 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack

https://gerrit.wikimedia.org/r/1200016

Change #1200018 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] nova: enable nrpe2nodexp wrapper on check-flavor_aggregates

https://gerrit.wikimedia.org/r/1200018

Change #1200012 merged by Tiziano Fogli:

[operations/puppet@production] cinder: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks

https://gerrit.wikimedia.org/r/1200012

Change #1200016 merged by Tiziano Fogli:

[operations/puppet@production] neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack

https://gerrit.wikimedia.org/r/1200016

Change #1200018 merged by Tiziano Fogli:

[operations/puppet@production] nova: enable nrpe2nodexp wrapper on check-flavor_aggregates

https://gerrit.wikimedia.org/r/1200018