followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05
Open, Medium, Public

Description

In the event of long-duration issues in particular geographies, it would be nice if we could exclude those reports from the NEL metric used for alerting.

  • edit es_exporter config to also aggregate by country in addition to type -- it seems best to create a new metric aggregated this way
  • edit alertmanager rule to use the new metric
  • edit statograph configuration to use the new metric (verifying that the results of the query are the same as before)
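As a rough illustration of why a metric aggregated by both type and country helps, here is a minimal Python sketch. It is not the es_exporter or alertmanager configuration; the report fields, sample data, and function names are all hypothetical. It shows that per-(type, country) counts can be summed back into the existing per-type totals (the check in the last bullet), and that a country with a long-running outage can be left out of the sum.

```python
from collections import Counter

# Hypothetical NEL reports; field names and values are assumptions for illustration.
reports = [
    {"type": "tcp.timed_out", "country": "IT"},
    {"type": "tcp.timed_out", "country": "IT"},
    {"type": "tcp.timed_out", "country": "IT"},
    {"type": "tcp.timed_out", "country": "US"},
    {"type": "dns.unreachable", "country": "DE"},
]


def aggregate_by_type_and_country(reports):
    """Count reports per (type, country) pair, like the proposed new metric."""
    return Counter((r["type"], r["country"]) for r in reports)


def totals_by_type(counts, excluded_countries=frozenset()):
    """Sum per-country counts into per-type totals, skipping excluded countries."""
    totals = Counter()
    for (report_type, country), n in counts.items():
        if country not in excluded_countries:
            totals[report_type] += n
    return totals


counts = aggregate_by_type_and_country(reports)

# With no exclusions, the totals match the old aggregation by type only.
old_style = Counter(r["type"] for r in reports)
print(dict(totals_by_type(counts)) == dict(old_style))

# Excluding one country removes only its contribution.
print(dict(totals_by_type(counts, excluded_countries={"IT"})))
```

In the real setup the exclusion would presumably live in the alert rule's query rather than in application code; the sketch only shows the arithmetic.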

Event Timeline

akosiaris subscribed.

Removing SRE as the more specific working group is tagged already.

Change 901220 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] es_exporter: add NEL metrics by country

https://gerrit.wikimedia.org/r/901220

In the meantime I've also added 2 graphs to the Geo tab of the Superset dashboards, showing the number of NELs by country and by ISP. Hope that will be helpful.

Change 901220 merged by Volans:

[operations/puppet@production] es_exporter: add NEL metrics by country

https://gerrit.wikimedia.org/r/901220

Metrics are now being ingested by Prometheus (screenshot attached: image.png).

Change 902316 had a related patch set uploaded (by Volans; author: Volans):

[operations/alerts@master] NEL: add alert by country

https://gerrit.wikimedia.org/r/902316

Volans subscribed.

Removing myself as assignee since the sprint week effort has ended. The above patch is pending review.