
Migrate eventgate check_prometheus checks to alertmanager
Closed, Resolved (Public)

Description

We are migrating all of our existing check_prometheus-based checks from Icinga to Alertmanager.
This work is tracked in T293399.

This ticket covers the migration of the checks for our eventgate services, which are defined here: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/alerts.pp#L101-L193

  • Each of: ${eventgate_service}_validation_error_rate
  • Each of: eventgate_logging_external_latency_${site}
  • Each of: eventgate_logging_external_errors_${site}
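
To get a feel for how many concrete checks the three name templates above cover, they can be expanded with a small script. This is only a rough sketch: the service and site values below are hypothetical placeholders, not the real lists from alerts.pp, and `expand_checks` is a helper name invented for illustration.

```python
from string import Template

# Name templates copied from the list above.
CHECK_TEMPLATES = [
    "${eventgate_service}_validation_error_rate",
    "eventgate_logging_external_latency_${site}",
    "eventgate_logging_external_errors_${site}",
]


def expand_checks(services, sites):
    """Return every concrete check name the templates expand to."""
    names = []
    for raw in CHECK_TEMPLATES:
        template = Template(raw)
        if "${eventgate_service}" in raw:
            # Per-service template: one check per eventgate service.
            names.extend(template.substitute(eventgate_service=s) for s in services)
        else:
            # Per-site template: one check per site.
            names.extend(template.substitute(site=s) for s in sites)
    return names


# Hypothetical inputs, for illustration only; the real values live in alerts.pp.
EXAMPLE_SERVICES = ["service_a", "service_b"]
EXAMPLE_SITES = ["site_a", "site_b"]

if __name__ == "__main__":
    for name in expand_checks(EXAMPLE_SERVICES, EXAMPLE_SITES):
        print(name)
```

With the real service and site lists substituted in, the output could serve as a checklist of Icinga checks to confirm as migrated before the Puppet definitions are deleted.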

Event Timeline

Change 901623 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/alerts@master] eventgate: add EventgateErrorsLoggingExternal alert

https://gerrit.wikimedia.org/r/901623

Change 902694 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/alerts@master] Move Icinga eventgate logging external errors checks to alertmanager

https://gerrit.wikimedia.org/r/902694

Change 902703 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Remove EventGate Icinga checks that have been moved to alertmanager

https://gerrit.wikimedia.org/r/902703

Change 902694 merged by jenkins-bot:

[operations/alerts@master] Move Icinga eventgate logging external errors checks to alertmanager

https://gerrit.wikimedia.org/r/902694

Change 901623 merged by jenkins-bot:

[operations/alerts@master] eventgate: add EventgateLoggingExternalErrors alert

https://gerrit.wikimedia.org/r/901623

Change 906710 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: use generic eventgate HTTP error alert name

https://gerrit.wikimedia.org/r/906710

Change 906711 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: disable missing metrics pint check for validation errors

https://gerrit.wikimedia.org/r/906711

Change 906712 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: refactor eventgate validation alerts

https://gerrit.wikimedia.org/r/906712

Change 906710 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: use generic eventgate HTTP error alert name

https://gerrit.wikimedia.org/r/906710

Change 906711 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: disable missing metrics pint check for validation errors

https://gerrit.wikimedia.org/r/906711

Change 906712 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: refactor eventgate validation alerts

https://gerrit.wikimedia.org/r/906712

Change 908917 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] prometheus: delete migrated eventgate alerts

https://gerrit.wikimedia.org/r/908917

Change 908917 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: delete migrated eventgate alerts

https://gerrit.wikimedia.org/r/908917

Change 902703 abandoned by AOkoth:

[operations/puppet@production] Remove EventGate Icinga checks that have been moved to alertmanager

Reason:

changes made in different patch

https://gerrit.wikimedia.org/r/902703

Change 921023 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: clean up eventgate prometheus alerts

https://gerrit.wikimedia.org/r/921023

Change 921023 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: clean up eventgate prometheus alerts

https://gerrit.wikimedia.org/r/921023

All done, resolving