tbs: user-story 9: Create an alert on metricsinfra for tekton being down on toolsbeta
Closed, Resolved | Public | 8 Estimated Story Points

Assigned To

	Raymond_Ndibe

Authored By

	dcaro
	Dec 14 2022, 2:01 PM

Description

At the time of writing, alerts can only be added by editing the database directly. More information:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

In short, there is a Cloud VPS project, metricsinfra, that runs a Prometheus and Alertmanager setup. The relevant host is metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud.

That host generates the Prometheus alert rules from a database hosted in Trove.

To log in to that database, ssh into metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and run sudo -i mariadb.

There, the prometheusconfig database has a table named alerts. Add one row to it for each alert that you want to create. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="toolsbeta",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}
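A new row for this task could be prepared along the following lines. This is only a sketch: the column names come from the example row above, but the alert name TektonDown, the duration, the summary text, the reuse of project_id 12, and the `== 0` comparison (so that the alert fires when nothing is running) are all assumptions, as is the `%s` placeholder style, which matches DB-API drivers such as PyMySQL. The expression is the one suggested in this task.

```python
import json

# Column names are taken from the example row; `id` is left out on the
# assumption that the database assigns it automatically.
ALERT_COLUMNS = ("project_id", "name", "expr", "duration", "severity", "annotations")


def build_alert_insert(project_id, name, expr, duration, severity, annotations):
    """Return (sql, params) for a parameterized INSERT into the alerts table."""
    columns = ", ".join(ALERT_COLUMNS)
    placeholders = ", ".join(["%s"] * len(ALERT_COLUMNS))
    sql = f"INSERT INTO alerts ({columns}) VALUES ({placeholders})"
    # The annotations column holds a JSON document, as in the example row.
    params = (project_id, name, expr, duration, severity, json.dumps(annotations))
    return sql, params


sql, params = build_alert_insert(
    project_id=12,  # assumption: same project as the toolsbeta example row
    name="TektonDown",  # hypothetical alert name
    expr='sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"}) == 0',
    duration="30m",  # hypothetical, copied from the example row
    severity="warn",
    annotations={"summary": "No tekton pods are running on toolsbeta"},  # hypothetical text
)
print(sql)
print(params)
```

The pair can be handed to a driver's cursor.execute(sql, params), which avoids quoting the expression and the JSON by hand.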

The expr column holds the Prometheus expression that the alert evaluates. You can explore and test expressions here:
https://toolsbeta-prometheus.wmcloud.org

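(Something like sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})>0 is probably enough to tell whether tekton is running. Because an alert fires when its expression returns results, the alert itself would need the opposite condition, for example == 0.)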
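The same check can be scripted against the Prometheus HTTP API instead of the web UI. The sketch below assumes that the instance serves the standard /api/v1/query endpoint directly under the URL above (it may sit behind a path prefix), and the sample payload is hand-written for illustration, not real output.

```python
import json
from urllib.parse import urlencode

# Assumption: the standard Prometheus HTTP API is reachable under this base URL.
BASE_URL = "https://toolsbeta-prometheus.wmcloud.org"
EXPR = 'sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})'


def build_query_url(base_url, expr):
    """Build an instant-query URL for the given expression."""
    query = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def running_pods(payload):
    """Return the value of the first series in an instant-query response,
    or None when the query matched no series at all."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is a [timestamp, "value as string"] pair.
    return float(result[0]["value"][1])


# Hand-written sample in the shape of a Prometheus instant-query response.
sample = json.loads(
    '{"status": "success", "data": {"resultType": "vector", '
    '"result": [{"metric": {}, "value": [1671026460, "3"]}]}}'
)
print(build_query_url(BASE_URL, EXPR))
print(running_pods(sample))
```

To run it against the live instance, fetch the URL with urllib.request.urlopen and pass the decoded JSON to running_pods. A None result means no series matched, which a plain == 0 comparison would not catch.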

Another place to look for a suitable expression is:
https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-toolsbeta&var-namespace=image-build&forceLogin&search=open

By inspecting the graphs there and their data sources, you can see which Prometheus instance and which expression provide the data you want.

The alert should also have an annotation called 'service' with the value 'toolforge,build_service'.
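Since the annotations column stores a JSON object, the 'service' annotation can be merged into the other annotations before the value is serialized. A minimal sketch, in which the helper name with_service and the summary text are hypothetical:

```python
import json

SERVICE = "toolforge,build_service"  # value requested in this task


def with_service(annotations, service=SERVICE):
    """Return the annotations as a JSON string with the 'service' key set."""
    merged = dict(annotations)  # copy, so the caller's dict is left untouched
    merged["service"] = service
    return json.dumps(merged)


annotations_json = with_service(
    {"summary": "No tekton pods are running on toolsbeta"}  # hypothetical text
)
print(annotations_json)
```

The resulting string is what would go into the annotations column of the new row.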

Details

	Subject	Repo	Branch	Lines +/-
	toolforge: add tekton metrics to prometheus	operations/puppet	production	+6 -0

Related Objects

Status	Assigned	Task
Open	None	T380882 openstack network problems (November 2024)
Resolved	aborrero	T380827 tools-nfs outage 2024-11-25
Open	None	T380832 [jobs-api] crashing
Open	None	T380959 [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components
Resolved	LucasWerkmeister	T320140 Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes
Resolved	matmarex	T319707 Migrate dtcheck from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320062 Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320011 Migrate rfa-voting-history from Toolforge GridEngine to Toolforge Kubernetes
Open	dcaro	T194332 [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build packs
Resolved	dcaro	T267374 [tbs.beta] Create a toolforge build service beta release
Resolved	dcaro	T325172 [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service
Resolved	None	T325174 [builds-builder,harbor,bulid-service,docs] user-story 11: Add section to admin docs on how to debug the service, how to pin-point the failing component and how to get the logs for each of them.
Resolved	dcaro	T325166 tbs: user-story 10: I want to know how to manage the service
Resolved	dcaro	T325167 tbs: user-story 10: Create admin wiki page for the toolforge build service
Resolved	Raymond_Ndibe	T325175 tbs: user-story 11: Add a runbook for each of the service alerts.
Resolved	Raymond_Ndibe	T325160 tbs: user-story 9: I want to know when the service is down
Resolved	Raymond_Ndibe	T325162 tbs: user-story 9: Create an alert on metricsinfra for tekton being down on toolsbeta