Evaluate 'pint' for Prometheus alerts
Closed, Resolved, Public

Description

Cloudflare has published pint (https://github.com/cloudflare/pint), their tool for linting and checking Prometheus alerts. We need to evaluate it against our use cases and see whether we get value out of it.

As of Feb 2023 we have implemented the following:

  • CI checks for operations/alerts.git, based on pint
  • runtime pint checks for instance-specific alerts (i.e. alert files that carry a deploy-tag)
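
Since the runtime checks only cover alert files that carry a deploy-tag, it can help to list the files that lack one. The following is a minimal sketch, not the check used in operations/alerts.git: it assumes the tag is written as a comment line of the form `# deploy-tag: <name>[, <name>...]` and that rule files end in `.yaml`; both the format and the helper names are assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed header format (hypothetical): a comment such as "# deploy-tag: ops".
DEPLOY_TAG_RE = re.compile(r"^#[ \t]*deploy-tag:[ \t]*(\S.*)$", re.MULTILINE)


def deploy_tags(text: str) -> list[str]:
    """Return the comma-separated deploy-tag values from the first tag line."""
    match = DEPLOY_TAG_RE.search(text)
    if match is None:
        return []
    return [tag.strip() for tag in match.group(1).split(",") if tag.strip()]


def files_missing_deploy_tag(root: Path) -> list[Path]:
    """List *.yaml files under root that carry no deploy-tag comment."""
    return sorted(
        path
        for path in root.rglob("*.yaml")
        if not deploy_tags(path.read_text(encoding="utf-8"))
    )
```

Calling `files_missing_deploy_tag(Path("."))` from a checkout would then return the files that, under these assumptions, the runtime checks would skip.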

Still TODO:

  • Add 'pint' support for global/thanos alerts
  • Plan for non-instance-specific alerts (i.e. those currently without a deploy-tag)
  • Plan for pint checking of cloudmetrics alerts (Prometheus is moving off cloudmetrics)

Details

Related Changes in Gerrit:
Repo                  Branch      Lines +/-
operations/alerts     master      +9 -1
operations/alerts     master      +16 -7
operations/alerts     master      +8 -0
operations/alerts     master      +1 -0
operations/alerts     master      +4 -0
operations/alerts     master      +40 -0
operations/alerts     master      +3 -3
operations/alerts     master      +8 -0
operations/alerts     master      +6 -0
operations/alerts     master      +6 -7
operations/alerts     master      +3 -0
operations/alerts     master      +5 -0
operations/alerts     master      +1 -1
operations/alerts     master      +18 -16
operations/alerts     master      +10 -1
operations/alerts     master      +6 -0
operations/alerts     master      +21 -0
operations/alerts     master      +5 -0
operations/alerts     master      +25 -0
operations/alerts     master      +3 -0
operations/alerts     master      +3 -0
operations/alerts     master      +3 -0
operations/alerts     master      +3 -0
operations/alerts     master      +2 -0
operations/alerts     master      +45 -30
operations/alerts     master      +2 -0
operations/alerts     master      +1 -0
operations/alerts     master      +51 -90
operations/alerts     master      +4 -2
operations/alerts     master      +3 -0
operations/alerts     master      +3 -0
operations/alerts     master      +10 -0
operations/alerts     master      +12 -10
operations/puppet     production  +30 -2
operations/puppet     production  +28 -3
operations/alerts     master      +3 -0
operations/alerts     master      +44 -34
operations/alerts     master      +57 -25
operations/alerts     master      +2 -0
operations/alerts     master      +1 -0
operations/alerts     master      +2 -0
operations/alerts     master      +2 -0
operations/alerts     master      +9 -1
operations/alerts     master      +4 -33
operations/alerts     master      +104 -84
operations/alerts     master      +3 -5
operations/alerts     master      +22 -16
operations/alerts     master      +1 -0
operations/alerts     master      +73 -60
operations/alerts     master      +52 -7
operations/puppet     production  +1 -1
operations/puppet     production  +20 -6
operations/puppet     production  +4 -0
operations/alerts     master      +7 -7
operations/puppet     production  +18 -0
operations/puppet     production  +4 -0
operations/puppet     production  +1 -1
operations/puppet     production  +70 -0
operations/puppet     production  +234 -123
operations/debs/pint  master      +168 -0
operations/alerts     master      +13 -15

Event Timeline

Change 898768 merged by Filippo Giunchedi:

[operations/alerts@master] structured-data: deploy to 'ops' instance

https://gerrit.wikimedia.org/r/898768

Change 898783 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] structured-data: deploy to ops/eqiad only

https://gerrit.wikimedia.org/r/898783

Change 898783 merged by Filippo Giunchedi:

[operations/alerts@master] structured-data: deploy to ops/eqiad only

https://gerrit.wikimedia.org/r/898783

Change 898754 merged by jenkins-bot:

[operations/alerts@master] perf: deploy to 'ext' instance

https://gerrit.wikimedia.org/r/898754

Change 898765 merged by jenkins-bot:

[operations/alerts@master] search-platform: deploy alerts to specific Prometheus instances

https://gerrit.wikimedia.org/r/898765

Change 898776 merged by jenkins-bot:

[operations/alerts@master] netops: split routinator from ping offload

https://gerrit.wikimedia.org/r/898776

Change 899503 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only

https://gerrit.wikimedia.org/r/899503

Change 899506 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] perf: fix webperf metric names

https://gerrit.wikimedia.org/r/899506

Change 899503 merged by jenkins-bot:

[operations/alerts@master] search-platform: deploy blazegraph/cirrus/pipelines alerts to eqiad/codfw only

https://gerrit.wikimedia.org/r/899503

Change 899506 merged by jenkins-bot:

[operations/alerts@master] perf: fix webperf metric names

https://gerrit.wikimedia.org/r/899506

Change 898701 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add pint for thanos-rule

https://gerrit.wikimedia.org/r/898701

Change 899525 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: exclude promql/rate pint check

https://gerrit.wikimedia.org/r/899525

Change 899525 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: exclude promql/rate pint check

https://gerrit.wikimedia.org/r/899525

Change 900241 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] traffic: deploy alerts to 'ops' instance

https://gerrit.wikimedia.org/r/900241

Change 900241 merged by Filippo Giunchedi:

[operations/alerts@master] traffic: deploy alerts to 'ops' instance

https://gerrit.wikimedia.org/r/900241

Change 900626 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] traffic: use haproxy for EdgeTrafficDrop

https://gerrit.wikimedia.org/r/900626

Change 900628 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: deploy prometheus alerts to all instances

https://gerrit.wikimedia.org/r/900628

Change 900628 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: deploy prometheus alerts to all instances

https://gerrit.wikimedia.org/r/900628

Change 901140 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: deploy zk alerts to 'ops' instance

https://gerrit.wikimedia.org/r/901140

Change 901142 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: deploy kafka alerts to 'ops' instance

https://gerrit.wikimedia.org/r/901142

Change 901140 merged by Filippo Giunchedi:

[operations/alerts@master] sre: deploy zk alerts to 'ops' instance

https://gerrit.wikimedia.org/r/901140

Change 901142 merged by Filippo Giunchedi:

[operations/alerts@master] sre: deploy kafka alerts to 'ops' instance

https://gerrit.wikimedia.org/r/901142

Change 900626 merged by Filippo Giunchedi:

[operations/alerts@master] traffic: remove EdgeTrafficDrop

https://gerrit.wikimedia.org/r/900626

Change 905215 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move confd alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905215

Change 905216 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move k8s alerts to specific Prometheus instances

https://gerrit.wikimedia.org/r/905216

Change 905217 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move hardware alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905217

Change 905218 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move keyholder alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905218

Change 905219 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move alerting puppet agent failure to 'ops' instance

https://gerrit.wikimedia.org/r/905219

Change 905220 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move etcd alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905220

Change 905221 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move druid/webrequest alerts to 'analytics' instance

https://gerrit.wikimedia.org/r/905221

Change 905222 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] warn on deploy-tag missing

https://gerrit.wikimedia.org/r/905222

Change 905222 merged by Filippo Giunchedi:

[operations/alerts@master] warn on deploy-tag missing

https://gerrit.wikimedia.org/r/905222

Change 905215 merged by jenkins-bot:

[operations/alerts@master] sre: move confd alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905215

Change 905216 merged by jenkins-bot:

[operations/alerts@master] sre: move k8s alerts to specific Prometheus instances

https://gerrit.wikimedia.org/r/905216

Change 905217 merged by jenkins-bot:

[operations/alerts@master] sre: move hardware alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905217

Change 905218 merged by jenkins-bot:

[operations/alerts@master] sre: move keyholder alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905218

Change 905219 merged by jenkins-bot:

[operations/alerts@master] sre: move alerting puppet agent failure to 'ops' instance

https://gerrit.wikimedia.org/r/905219

Change 905220 merged by jenkins-bot:

[operations/alerts@master] sre: move etcd alerts to 'ops' instance

https://gerrit.wikimedia.org/r/905220

Change 905221 merged by jenkins-bot:

[operations/alerts@master] sre: move druid/webrequest alerts to 'analytics' instance

https://gerrit.wikimedia.org/r/905221

Change 905571 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add missing deploy-tag

https://gerrit.wikimedia.org/r/905571

Change 905571 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add missing deploy-tag

https://gerrit.wikimedia.org/r/905571

Change 905580 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add missing deploy-tag

https://gerrit.wikimedia.org/r/905580

Change 905581 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: add missing deploy-tag

https://gerrit.wikimedia.org/r/905581

Change 905582 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-persistence: add missing deploy-tag

https://gerrit.wikimedia.org/r/905582

Change 905587 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] wmcs: add missing deploy-tag

https://gerrit.wikimedia.org/r/905587

Change 905580 merged by jenkins-bot:

[operations/alerts@master] sre: add missing deploy-tag

https://gerrit.wikimedia.org/r/905580

Change 905581 merged by jenkins-bot:

[operations/alerts@master] data-engineering: add missing deploy-tag

https://gerrit.wikimedia.org/r/905581

Change 905582 merged by jenkins-bot:

[operations/alerts@master] data-persistence: add missing deploy-tag

https://gerrit.wikimedia.org/r/905582

Change 905587 merged by Filippo Giunchedi:

[operations/alerts@master] wmcs: add missing deploy-tag

https://gerrit.wikimedia.org/r/905587

As of today, all alert files have a corresponding deploy-tag and are therefore checked by pint (except for cloudmetrics, which is still TODO).

Change 905632 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move prometheus/wmcs scrapefailure

https://gerrit.wikimedia.org/r/905632

Change 905633 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] wmcs: fix deploy-tag for novafullstack

https://gerrit.wikimedia.org/r/905633

Change 905632 merged by jenkins-bot:

[operations/alerts@master] sre: move prometheus/wmcs scrapefailure

https://gerrit.wikimedia.org/r/905632

Change 905633 merged by jenkins-bot:

[operations/alerts@master] wmcs: fix deploy-tag for novafullstack

https://gerrit.wikimedia.org/r/905633

Change 906011 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: mute etcd-mirror pint promql checks

https://gerrit.wikimedia.org/r/906011

Change 906020 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: ignore 'status' label pint check

https://gerrit.wikimedia.org/r/906020

Change 906011 merged by Filippo Giunchedi:

[operations/alerts@master] sre: mute etcd-mirror pint promql checks

https://gerrit.wikimedia.org/r/906011

Change 906020 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: ignore 'status' label pint check

https://gerrit.wikimedia.org/r/906020

Change 906533 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: mute puppet-ca pint checks for missing series

https://gerrit.wikimedia.org/r/906533

Change 906574 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: fix varnishkafka metric names, deploy to all sites

https://gerrit.wikimedia.org/r/906574

Change 906574 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: fix varnishkafka metric names, deploy to all sites

https://gerrit.wikimedia.org/r/906574

Change 906581 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] Make deploy-tag compulsory

https://gerrit.wikimedia.org/r/906581

Change 906533 merged by Filippo Giunchedi:

[operations/alerts@master] sre: mute puppet-ca pint checks for missing series

https://gerrit.wikimedia.org/r/906533

Change 906711 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: disable missing metrics pint check for validation errors

https://gerrit.wikimedia.org/r/906711

Change 906711 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: disable missing metrics pint check for validation errors

https://gerrit.wikimedia.org/r/906711

Change 906581 merged by Filippo Giunchedi:

[operations/alerts@master] Make deploy-tag compulsory

https://gerrit.wikimedia.org/r/906581

Change 908206 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: report alert lint problems

https://gerrit.wikimedia.org/r/908206

Change 908206 merged by Filippo Giunchedi:

[operations/alerts@master] sre: report alert lint problems

https://gerrit.wikimedia.org/r/908206

Change 920199 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: disable pint promql/series check for SystemdUnitFailed

https://gerrit.wikimedia.org/r/920199

Change 920199 merged by Filippo Giunchedi:

[operations/alerts@master] sre: disable pint promql/series check for SystemdUnitFailed

https://gerrit.wikimedia.org/r/920199

Change 920201 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: ignore promql/series for code/thanos-query-frontend

https://gerrit.wikimedia.org/r/920201

Change 920201 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: ignore promql/series for code/thanos-query-frontend

https://gerrit.wikimedia.org/r/920201

Change 920205 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] dcops: temp disable promql/series pint check for InterfaceSpeedError

https://gerrit.wikimedia.org/r/920205

Change 920211 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] perf: disable promql/series lint checks for navtiming

https://gerrit.wikimedia.org/r/920211

Change 920205 merged by Filippo Giunchedi:

[operations/alerts@master] dcops: temp disable promql/series pint check for InterfaceSpeedError

https://gerrit.wikimedia.org/r/920205

Change 920211 merged by Filippo Giunchedi:

[operations/alerts@master] perf: disable promql/series lint checks for navtiming

https://gerrit.wikimedia.org/r/920211

fgiunchedi claimed this task.

This is done; the remaining work is tracked in subtasks.

Change #1124790 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: open tasks for long standing lint problems

https://gerrit.wikimedia.org/r/1124790

Change #1124790 merged by Filippo Giunchedi:

[operations/alerts@master] sre: open tasks for long standing lint problems

https://gerrit.wikimedia.org/r/1124790