Monitor the availability of the superset deployments
Closed, Resolved, Public

Assigned To

	Stevemunene

Authored By

	brouberol
	Feb 2 2024, 8:18 AM

Description

We need to get alerted if the Superset pods do not run. Possibly mirror what was done for the Spark History Server (T353717)

Details

	Subject	Repo	Branch	Lines +/-
	superset: add availability monitor	operations/alerts	master	+63 -0
	superset: let /health requests get responded by Superset	operations/deployment-charts	master	+1 -5


Related Objects

Status	Assigned	Task
Resolved	brouberol	T353782 Decommission an-tool1010
Resolved	brouberol	T347710 Migrate the Analytics Superset instances to our DSE Kubernetes cluster
Resolved	Stevemunene	T356484 Monitor the availability of the superset deployments

Event Timeline

brouberol created this task. Feb 2 2024, 8:18 AM

brouberol mentioned this in T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster. Feb 2 2024, 8:24 AM

BTullis removed a project: Epic. Feb 2 2024, 5:33 PM

BTullis removed subscribers: JAllemandou, MoritzMuehlenhoff, Volans.

Gehel triaged this task as High priority. Feb 6 2024, 2:55 PM

lbowmaker moved this task from Incoming (new tickets) to Radar (External Teams) on the Data-Engineering board. Feb 8 2024, 5:58 PM

Gehel moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board. Feb 9 2024, 1:26 PM

brouberol assigned this task to Stevemunene. Feb 20 2024, 10:18 AM

Stevemunene moved this task from Quarterly Goals to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board. Feb 23 2024, 8:56 AM

Stevemunene edited projects, added Data-Platform-SRE (2024.02.12 - 2024.03.03); removed Data-Platform-SRE.

Stevemunene moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Change 1005540 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/alerts@master] superset: add availability monitor

https://gerrit.wikimedia.org/r/1005540

gerritbot added a project: Patch-For-Review. Feb 23 2024, 11:07 AM

Stevemunene moved this task from In Progress to Needs Review on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board. Feb 23 2024, 11:07 AM

Stevemunene moved this task from Needs Review to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board. Feb 26 2024, 11:25 AM

Stevemunene moved this task from In Progress to To Be Deployed on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board. Feb 27 2024, 3:07 PM

Change 1005540 merged by jenkins-bot:

[operations/alerts@master] superset: add availability monitor

https://gerrit.wikimedia.org/r/1005540

In addition to the kube_deployment_status_replicas_available metric, it might be quite a good idea to use one or two Prometheus blackbox exporters to check on the availability of https://superset.wikimedia.org

https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_%28blackbox_exporter%29
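
To illustrate the metric-based half of this, here is a minimal Python sketch of the condition such an alert would express: flag any deployment whose available replica count drops below a minimum. The deployment names and sample values are hypothetical, and the actual rule merged in operations/alerts may be written quite differently.

```python
from typing import Mapping


def unavailable_deployments(
    available_replicas: Mapping[str, float], minimum: int = 1
) -> list[str]:
    """Return the deployments whose available replica count is below `minimum`."""
    return sorted(
        name for name, count in available_replicas.items() if count < minimum
    )


# Hypothetical values, as they might be read from
# kube_deployment_status_replicas_available for each deployment.
samples = {"superset-production": 1, "superset-next": 0}
print(unavailable_deployments(samples))  # ['superset-next']
```

A check like this only says whether pods are reported as available; it does not prove that the site answers requests, which is why an external HTTP probe is a useful complement.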

Hmm. It seems that we already have some http blackbox probes defined here in the service catalog:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/service.yaml#L4034-L4036

I specifically intercepted these in the nginx reverse proxy here:
https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/superset/templates/configmap.yaml#L202-L204

...but maybe that's not a good idea. Maybe we should remove this section and pass them through to the application. What do you think @brouberol @Stevemunene ?
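
As a rough sketch of why this matters: if the proxy answers the health check itself, a probe keeps passing even when the application behind it is down. The Python below shows the kind of check a probe effectively performs. The expected body of "OK" and the idea of probing a `/health` path on the public hostname are assumptions for illustration, not taken from the actual probe definition.

```python
import urllib.request


def is_healthy(status: int, body: str) -> bool:
    """Treat the endpoint as healthy only on HTTP 200 with a body of "OK".

    The "OK" body is an assumption about what the application returns.
    """
    return status == 200 and body.strip() == "OK"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Fetch `url` and report whether the response looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body)
    except OSError:
        # Covers connection errors, timeouts and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


if __name__ == "__main__":
    # Assumed URL: the /health path on the hostname mentioned above.
    print(probe("https://superset.wikimedia.org/health"))
```

With the interception in place, this result reflects only the proxy. With the request passed through, it reflects the application as well.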

Maintenance_bot removed a project: Patch-For-Review. Feb 28 2024, 12:30 PM

I do think it's a good idea actually

Gehel edited projects, added Data-Platform-SRE (2024.03.04 - 2024.03.24); removed Data-Platform-SRE (2024.02.12 - 2024.03.03). Mar 1 2024, 3:29 PM

Gehel moved this task from Backlog to To Be Deployed on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board.

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE (2024.03.04 - 2024.03.24). Mar 22 2024, 8:42 AM

Gehel moved this task from Backlog to To Be Deployed on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

Change #1014467 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] external-services: let /health requests get responded by Superset

https://gerrit.wikimedia.org/r/1014467

gerritbot added a project: Patch-For-Review. Mar 26 2024, 10:19 AM

Change #1014467 merged by Brouberol:

[operations/deployment-charts@master] superset: let /health requests get responded by Superset

https://gerrit.wikimedia.org/r/1014467

brouberol closed this task as Resolved. Mar 26 2024, 10:39 AM

Gehel moved this task from To Be Deployed to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board. Mar 26 2024, 4:44 PM
