
Monitor the availability of the superset deployments
Closed, Resolved (Public)

Description

We need to be alerted if the Superset pods are not running. This could mirror what was done for the Spark History Server (T353717).

Event Timeline

Gehel triaged this task as High priority. Feb 6 2024, 2:55 PM

Change 1005540 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/alerts@master] superset: add availability monitor

https://gerrit.wikimedia.org/r/1005540

Change 1005540 merged by jenkins-bot:

[operations/alerts@master] superset: add availability monitor

https://gerrit.wikimedia.org/r/1005540
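
The merged alert rule itself is not reproduced in this task. As a rough illustration of the kind of condition an availability monitor evaluates, here is a minimal Python sketch, assuming the check is based on the kube-state-metrics `kube_deployment_status_replicas_available` metric. The function name, the namespace argument, and any deployment names passed to it are hypothetical, and the real rule would be written in PromQL rather than Python.

```python
import re

# Matches one sample line of the Prometheus text exposition format, e.g.
# kube_deployment_status_replicas_available{namespace="ns",deployment="name"} 1
SAMPLE_RE = re.compile(
    r'^kube_deployment_status_replicas_available\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def unavailable_deployments(metrics_text, namespace):
    """Return the deployments in `namespace` that report no available replicas.

    Hypothetical helper: it only illustrates the "available replicas < 1"
    condition and does not handle escaped quotes inside label values.
    """
    down = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and unrelated metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("namespace") != namespace:
            continue
        if float(match.group("value")) < 1:
            down.append(labels.get("deployment", "<unknown>"))
    return down
```

Calling `unavailable_deployments(text, "some-namespace")` on a scraped metrics page returns the list of deployments that would trigger such an alert; an empty list means every deployment in that namespace has at least one available replica.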

In addition to the kube_deployment_status_replicas_available metric, it might be a good idea to use one or two Prometheus blackbox exporter probes to check the availability of https://superset.wikimedia.org

https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_%28blackbox_exporter%29
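
For readers unfamiliar with blackbox probing, the idea is to check the service from the outside, the way a user would reach it. Here is a minimal Python sketch of that check, not the blackbox exporter's actual implementation. It assumes the probed endpoint answers a healthy request with HTTP 200 and the body `OK`; the function names, the expected body, and the timeout are illustrative assumptions.

```python
import urllib.error
import urllib.request


def evaluate_probe(status, body, expected_body="OK"):
    """Decide whether a probe response counts as healthy.

    Assumption: a healthy endpoint returns HTTP 200 with `expected_body`.
    """
    return status == 200 and body.strip() == expected_body


def probe(url, timeout=5.0):
    """Fetch `url` and report True if the response looks healthy.

    Network errors, timeouts, and non-2xx responses all count as failures,
    which is roughly what an external availability probe cares about.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_probe(response.status, body)
    except (urllib.error.URLError, OSError):
        return False
```

A call such as `probe("https://superset.wikimedia.org/health")` would exercise the whole request path, provided the endpoint can be reached without authentication. That is the main advantage over a pod-level metric: it can also catch failures in the components sitting in front of the application.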

Hmm. It seems that we already have some HTTP blackbox probes defined here in the service catalog:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L4034-L4036

I specifically intercepted these in the nginx reverse proxy here:
https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/superset/templates/configmap.yaml#L202-L204

...but maybe that's not a good idea. Maybe we should remove this section and pass the probes through to the application. What do you think, @brouberol @Stevemunene?

I do think it's a good idea, actually.

Change #1014467 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] external-services: let /health requests get responded by Superset

https://gerrit.wikimedia.org/r/1014467

Change #1014467 merged by Brouberol:

[operations/deployment-charts@master] superset: let /health requests get responded by Superset

https://gerrit.wikimedia.org/r/1014467