Port openapi/swagger checks/alerts to Prometheus
Closed, Resolved (Public)

Description

We did some work to get openapi/swagger metrics into Prometheus as part of T205870. Now that Alertmanager and alerts.git are in place, we should clean up the Icinga checks and move to Prometheus-based alerts.

While at it, I think it'd be beneficial to at least brainstorm about modules/lvs/manifests/monitor_services.pp, which is basically a separate list of openapi-based services. IMHO that list should be part of service::catalog instead.
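To illustrate the idea of deriving probe targets from a single catalog rather than from a separate list, here is a minimal sketch. It is only an assumption about how such a generator could look: the catalog layout, the `probes`/`type` keys, the `swagger_targets` helper, and the `.example` hostnames are all hypothetical and do not reflect the real service::catalog schema or the Puppet code. The only thing taken as given is the Prometheus file-based service discovery format, which is a list of objects with `targets` and `labels`.

```python
import json

# Hypothetical catalog: the real service::catalog schema is not shown in this task.
CATALOG = {
    "service-a": {"port": 8080, "probes": [{"type": "swagger"}]},
    "service-b": {"port": 9090, "probes": [{"type": "http"}]},
    "service-c": {"port": 7231, "probes": []},
}


def swagger_targets(catalog, site):
    """Build file_sd-style target groups for services that declare a swagger probe."""
    groups = []
    for name, svc in sorted(catalog.items()):
        probe_types = {p.get("type") for p in svc.get("probes", [])}
        if "swagger" not in probe_types:
            # No swagger probe declared: emit nothing rather than an empty group.
            continue
        groups.append({
            # Hostname pattern is made up for illustration.
            "targets": [f"{name}.svc.{site}.example:{svc['port']}"],
            "labels": {"service": name, "site": site},
        })
    return groups


if __name__ == "__main__":
    # Print the JSON that a file_sd targets file could contain.
    print(json.dumps(swagger_targets(CATALOG, "site1"), indent=2))
```

With one source of truth, adding a service to the catalog would be enough to get it probed, and there would be no second list to keep in sync.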

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 12 2022, 12:01 PM

Change 916914 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: generate swagger targets from service catalog

https://gerrit.wikimedia.org/r/916914

Will need a Grafana dashboard and a runbook to define the alerts.

Change 918547 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] team-sre: add openapi/swagger alerts

https://gerrit.wikimedia.org/r/918547

Change 919331 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: don't fail on unknown blackbox probe type

https://gerrit.wikimedia.org/r/919331

Change 919331 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: don't fail on unknown blackbox probe type

https://gerrit.wikimedia.org/r/919331

Change 923576 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: don't add empty targets

https://gerrit.wikimedia.org/r/923576

Change 923576 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: don't add empty targets

https://gerrit.wikimedia.org/r/923576

Change 916914 merged by Cwhite:

[operations/puppet@production] prometheus: generate swagger targets from service catalog

https://gerrit.wikimedia.org/r/916914

Change 924144 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: remove invalid cluster key

https://gerrit.wikimedia.org/r/924144

Change 924144 merged by Cwhite:

[operations/puppet@production] prometheus: remove invalid cluster key

https://gerrit.wikimedia.org/r/924144

Change 924145 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: disable new swagger job

https://gerrit.wikimedia.org/r/924145

Change 924145 merged by Cwhite:

[operations/puppet@production] prometheus: disable new swagger job

https://gerrit.wikimedia.org/r/924145

Change 925106 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: ensure absent invalid swagger targets file

https://gerrit.wikimedia.org/r/925106

Change 925106 merged by Cwhite:

[operations/puppet@production] prometheus: ensure absent invalid swagger targets file

https://gerrit.wikimedia.org/r/925106

Change 925113 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch_dashboards: fix package name typo

https://gerrit.wikimedia.org/r/925113

Change 925113 merged by Cwhite:

[operations/puppet@production] opensearch_dashboards: fix package name typo

https://gerrit.wikimedia.org/r/925113

colewhite changed the task status from Open to In Progress. Jun 1 2023, 7:48 PM
colewhite claimed this task.
colewhite triaged this task as Medium priority.

Change 925117 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: re-enable swagger jobs from service catalog

https://gerrit.wikimedia.org/r/925117

Change 925117 merged by Cwhite:

[operations/puppet@production] prometheus: re-enable swagger jobs from service catalog

https://gerrit.wikimedia.org/r/925117

Change 925118 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: fix swagger job relabel configs

https://gerrit.wikimedia.org/r/925118

Change 925118 merged by Cwhite:

[operations/puppet@production] prometheus: fix swagger job relabel configs

https://gerrit.wikimedia.org/r/925118

Change 925119 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: add external swagger checks to all sites

https://gerrit.wikimedia.org/r/925119

Change 925120 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] lvs: remove lvs::monitor_services

https://gerrit.wikimedia.org/r/925120

Change 925119 merged by Cwhite:

[operations/puppet@production] prometheus: add external swagger checks to all sites

https://gerrit.wikimedia.org/r/925119

Change 918547 merged by jenkins-bot:

[operations/alerts@master] team-sre: add openapi/swagger alerts

https://gerrit.wikimedia.org/r/918547

Change 925120 merged by Cwhite:

[operations/puppet@production] lvs: remove lvs::monitor_services

https://gerrit.wikimedia.org/r/925120

The new alerts are in place and the old checks have been removed from Icinga.