Description
We need to be alerted if the Superset pods are not running. Possibly mirror what was done for the Spark History Server (T353717).
Status | Assigned | Task
---|---|---
Resolved | brouberol | T353782 Decommission an-tool1010
Resolved | brouberol | T347710 Migrate the Analytics Superset instances to our DSE Kubernetes cluster
Resolved | Stevemunene | T356484 Monitor the availability of the superset deployments
Event Timeline
Change 1005540 had a related patch set uploaded (by Stevemunene; author: Stevemunene):
[operations/alerts@master] superset: add availability monitor
Change 1005540 merged by jenkins-bot:
[operations/alerts@master] superset: add availability monitor
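The alert rule added by this patch is not reproduced in this task. As a rough illustration only, an availability monitor of this kind typically fires when a deployment reports too few available replicas, for example based on the kube-state-metrics metric `kube_deployment_status_replicas_available`. The sketch below shows that decision logic in Python; the function name, the threshold, and the deployment names in the sample data are all hypothetical.

```python
from typing import Dict, List


def unavailable_deployments(
    available_replicas: Dict[str, float], minimum: int = 1
) -> List[str]:
    """Return the deployments with fewer than `minimum` available replicas.

    `available_replicas` maps a deployment name to the latest sample of
    kube_deployment_status_replicas_available for that deployment.
    """
    return sorted(
        name for name, count in available_replicas.items() if count < minimum
    )


# Hypothetical samples; the deployment names are made up for illustration.
samples = {"superset-production": 2, "superset-staging": 0}
print(unavailable_deployments(samples))
```

In a real setup this comparison would live in a Prometheus alerting rule rather than in application code; the Python form is only meant to make the condition explicit.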
In addition to the kube_deployment_status_replicas_available metric, it might be a good idea to add one or two Prometheus blackbox exporter probes to check the availability of https://superset.wikimedia.org.
https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_%28blackbox_exporter%29
Hmm. It seems that we already have some HTTP blackbox probes defined here in the service catalog:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L4034-L4036
I specifically intercepted these in the nginx reverse proxy here:
https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/superset/templates/configmap.yaml#L202-L204
...but maybe that's not a good idea. Maybe we should remove this section and pass them through to the application. What do you think, @brouberol @Stevemunene?
Change #1014467 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] external-services: let /health requests get responded by Superset
Change #1014467 merged by Brouberol:
[operations/deployment-charts@master] superset: let /health requests get responded by Superset
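With /health requests passed through to Superset, an HTTP probe exercises the application itself rather than only the nginx reverse proxy. The following is a minimal sketch of such a check in Python, not the blackbox exporter configuration actually used. It assumes the health endpoint lives at https://superset.wikimedia.org/health and that an HTTP 200 response means healthy; `fetch_status` and `is_healthy` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Assumption: host from the discussion above, /health path from the patch title.
HEALTH_URL = "https://superset.wikimedia.org/health"


def fetch_status(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch `url` and return (HTTP status code, decoded body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; report them as a normal result.
        return err.code, err.read().decode("utf-8", "replace")


def is_healthy(
    url: str, fetch: Callable[[str], Tuple[int, str]] = fetch_status
) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        status, _body = fetch(url)
    except OSError:
        # Covers connection errors, DNS failures and timeouts.
        return False
    return status == 200
```

Calling `is_healthy(HEALTH_URL)` performs a real request. The `fetch` parameter exists so the check can be exercised with a stub instead of the network.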