Web interface to navigate Prometheus alerts and their status
Closed, Resolved (Public)

Description

alerts.w.o works fine for checking firing and silenced alerts; however, it would be useful for users to be able to see all defined alerts and their status. This information is not available to Alertmanager, but it is available to the individual systems that send alerts to AM. For LibreNMS, Icinga, and Grafana we already have this capability; for Prometheus we do not yet.

This task tracks a solution to expose all Prometheus alerts and their status. The information is available both on the Prometheus web interface and as the ALERTS and ALERTS_FOR_STATE meta-metrics. The solution might entail reverse-proxying the Prometheus web interface (behind SSO), using the meta-metrics above to build suitable dashboard(s), or both.
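As a rough illustration of what "all defined alerts and their status" could look like programmatically, here is a minimal sketch that reads the Prometheus HTTP API endpoint `/api/v1/rules` and lists every alerting rule with its state (`inactive`, `pending`, or `firing`). The `ALERTS` meta-metric, as far as I know, only has series for pending or firing alerts, so the rules endpoint is used here to also cover inactive ones. The helper names, the sample payload, and the alert names in it are hypothetical; `fetch_rules` assumes a directly reachable, unauthenticated Prometheus, which would not hold behind SSO.

```python
import json
import urllib.request
from collections import Counter


def fetch_rules(base_url):
    """Fetch the rules payload from a Prometheus server (hypothetical helper).

    Assumes base_url points at the Prometheus root (including any path
    prefix) and that no authentication is required.
    """
    url = base_url.rstrip("/") + "/api/v1/rules"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def alert_states(payload):
    """Return (group, alert name, state) for every alerting rule."""
    rows = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue  # skip recording rules
            rows.append((group["name"], rule["name"], rule["state"]))
    return rows


# Hypothetical sample shaped like a /api/v1/rules response.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "node",
                "rules": [
                    {"type": "alerting", "name": "HostDown", "state": "firing"},
                    {"type": "recording", "name": "job:up:sum"},
                    {"type": "alerting", "name": "DiskAlmostFull", "state": "inactive"},
                ],
            }
        ],
    },
}

rows = alert_states(SAMPLE)
counts = Counter(state for _, _, state in rows)
for group, name, state in rows:
    print(f"{group}/{name}: {state}")
```

For a dashboard based on the meta-metrics instead, a PromQL expression such as `count by (alertname, alertstate) (ALERTS)` would be the analogous starting point, with the caveat about inactive alerts noted above.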

Event Timeline

+1 for reverse-proxying the Prometheus web interface behind SSO; that seems straightforward to me and could be useful in other cases as well.

Change 764895 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: sketch out proxied prometheus web with IDP

https://gerrit.wikimedia.org/r/764895

Change 788721 had a related patch set uploaded (by Herron; author: Herron):

[labs/private@master] "private" add prometheus.wm.o placeholder key

https://gerrit.wikimedia.org/r/788721

Change 788721 merged by Herron:

[labs/private@master] "private" add prometheus.wm.o placeholder key

https://gerrit.wikimedia.org/r/788721

Change 764895 merged by Herron:

[operations/puppet@production] prometheus: enable prometheus web access via proxy with IDP

https://gerrit.wikimedia.org/r/764895

Change 848442 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: update web_idp urls to prometheus-$site.wm.o

https://gerrit.wikimedia.org/r/848442

Change 848442 merged by Herron:

[operations/puppet@production] prometheus: update web_idp urls to prometheus-$site.wm.o

https://gerrit.wikimedia.org/r/848442

Change 848480 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: web_idp pin to prometheus(12)005

https://gerrit.wikimedia.org/r/848480

Change 848480 merged by Herron:

[operations/puppet@production] prometheus: web_idp pin to prometheus(12)005

https://gerrit.wikimedia.org/r/848480

Change 848488 had a related patch set uploaded (by Herron; author: Herron):

[operations/dns@master] dns: add prometheus-$site.wm.o entries for prometheus web interface

https://gerrit.wikimedia.org/r/848488

Change 848488 merged by Herron:

[operations/dns@master] dns: add prometheus-$site.wm.o entries for prometheus web interface

https://gerrit.wikimedia.org/r/848488

Prometheus web interfaces can now be accessed behind SSO at https://prometheus-$site.wikimedia.org/$instance/ (where $site and $instance stand for the site and the Prometheus instance), e.g. https://prometheus-eqiad.wikimedia.org/ops/

A good next step here will be to create a landing page for the root of the site, e.g. https://prometheus-eqiad.wikimedia.org/ (currently an Apache default page), with high-level information and links to docs, along with handling of https://prometheus.wm.o

This is great to see -- thanks @herron for working on it!

Change 851004 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix Prometheus IDP entry

https://gerrit.wikimedia.org/r/851004

Change 851004 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix Prometheus IDP entry

https://gerrit.wikimedia.org/r/851004

It was noticed today by @JMeybohm that the Prometheus web interface is currently cached (and shouldn't be, for obvious reasons).

Change 857522 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: default to valid external url

https://gerrit.wikimedia.org/r/857522

Change 858406 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: disable caching of prometheus-site.wm.o

https://gerrit.wikimedia.org/r/858406

Change 858406 merged by Herron:

[operations/puppet@production] prometheus: disable caching of prometheus-site.wm.o

https://gerrit.wikimedia.org/r/858406

> It was noticed today by @JMeybohm that the Prometheus web interface is currently cached (and shouldn't be, for obvious reasons).

With https://gerrit.wikimedia.org/r/858406 I'm now seeing e.g. x-cache: cp1089 miss, cp1079 pass on requests to the prom web interface.
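To make that check repeatable, a small sketch like the following could parse the `x-cache` header value and flag responses that were served from cache. The function names are hypothetical, and the assumption that a cached response shows up as a status beginning with `hit` (possibly with a suffix such as a hit count) is mine, not something stated in this task.

```python
def parse_x_cache(value):
    """Split an x-cache header value into (cache host, status) pairs."""
    entries = []
    for part in value.split(","):
        fields = part.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def served_from_cache(value):
    """Return True if any cache layer reports a hit (assumed 'hit' prefix)."""
    return any(status.startswith("hit") for _, status in parse_x_cache(value))


# Header value quoted in the comment above.
observed = "cp1089 miss, cp1079 pass"
print(parse_x_cache(observed))
print("cached" if served_from_cache(observed) else "not cached")
```

With the header value quoted above, no layer reports a hit, which matches the expectation that the web interface is no longer cached.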

Interestingly, similar cache pass rules were added at the time the prom web interface was deployed, but it just so happened to be deployed right at the same time as a refactoring of that YAML, and they didn't get carried over. At any rate, it's looking much better now!

Change 857522 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: default to valid external url

https://gerrit.wikimedia.org/r/857522

Change 863380 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/863380

herron claimed this task.

Change 863380 merged by BCornwall:

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/863380

Change 929421 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

https://gerrit.wikimedia.org/r/929421

Change 928942 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Revert "service::catalog: add prometheus-https"

https://gerrit.wikimedia.org/r/928942

Change 928942 merged by BCornwall:

[operations/puppet@production] Revert "service::catalog: add prometheus-https"

https://gerrit.wikimedia.org/r/928942

Change 929421 merged by BCornwall:

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

https://gerrit.wikimedia.org/r/929421

Change 929768 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Add global_cert_name to Envoy config

https://gerrit.wikimedia.org/r/929768

Change 929768 abandoned by BCornwall:

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

Reason:

I5b10a4a2ad3a34b8ad2ef48052b13c93c62aedd0 supersedes this

https://gerrit.wikimedia.org/r/929768

Change 939326 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/939326

Change 939326 merged by Herron:

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/939326