tbs: user-story 9: Create an alert on metricsinfra for tekton being down on toolsbeta
Open, Needs Triage | Public | 8 Estimated Story Points

Description

At the time of writing, this can only be done directly in the DB. More info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, that runs a Prometheus (with Alertmanager) setup. Specifically, there are a couple of hosts:
metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud

These hosts generate the Prometheus alerts from a DB that is hosted in Trove.

You have to log in to that DB (the credentials and host are in the controller hosts' config, /etc/prometheus-manager/config.yaml).

There you will find the prometheusconfig database. Its alerts table is the one you have to update with the alerts that you want to add. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="toolsbeta",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}

The expr column holds the Prometheus expression that you want to monitor. You can explore, check and test expressions here:
https://toolsbeta-prometheus.wmcloud.org

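(probably something like sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"}) == 0 might be enough; an alert fires while its expression returns results, so the comparison has to hold when tekton is down, not when it is up)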
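To check a candidate expression from a script instead of the web UI, something like the following could work. It is only a sketch: it assumes the instance serves the standard Prometheus HTTP API (/api/v1/query) directly under the host above, which may not hold if there is a path prefix or authentication in front of it. The helper name instant_query_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: the standard Prometheus HTTP API is reachable at the root of
# this host. A path prefix may be needed in practice.
PROMETHEUS_URL = "https://toolsbeta-prometheus.wmcloud.org"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query evaluating `expr` right now."""
    query_string = urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# Query the bare metric first to see its current value, before adding a
# comparison that turns it into an alert condition.
expr = 'sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})'
url = instant_query_url(expr)
print(url)
```

Fetching the printed URL (for example with curl) should return a JSON document whose data.result list is empty when the expression matches nothing.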

Another place to look for the expression to use is:
https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-toolsbeta&var-namespace=image-build&forceLogin&search=open

By inspecting the graphs and their datasources there, you can see which Prometheus instance and which expression give you the data you want.

As for the alert itself, it should also have an annotation called 'service' with the value 'toolforge,build_service'.
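Putting the pieces together, the new row could be prepared roughly like this. This is a sketch under several assumptions, not the actual procedure: the alert name TektonDown, the 5m duration and the summary text are made up; project_id 12 is copied from the example row on the guess that it is toolsbeta; the id column is assumed to be assigned by the database; and the == 0 comparison assumes the alert should fire when no tekton pod is running. The %s placeholders follow the style of common MySQL drivers and may need adjusting.

```python
import json

# Column names come from the example row above. `id` is left out on the
# assumption that the database assigns it automatically.
COLUMNS = ("project_id", "name", "expr", "duration", "severity", "annotations")


def build_alert_insert(row):
    """Return (sql, params) for a parameterized INSERT into the alerts table."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = "INSERT INTO alerts (%s) VALUES (%s)" % (", ".join(COLUMNS), placeholders)
    # The annotations column holds a JSON object, as in the example row.
    params = tuple(
        json.dumps(row[c]) if c == "annotations" else row[c] for c in COLUMNS
    )
    return sql, params


tekton_alert = {
    "project_id": 12,  # assumption: copied from the example row
    "name": "TektonDown",  # hypothetical name
    "expr": 'sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"}) == 0',
    "duration": "5m",  # hypothetical value
    "severity": "warn",
    "annotations": {
        "summary": "No running pods in the tekton-pipelines namespace",  # hypothetical text
        "service": "toolforge,build_service",
    },
}

sql, params = build_alert_insert(tekton_alert)
print(sql)
print(params)
```

Keeping the values as parameters rather than pasting them into the SQL string avoids quoting problems with the braces and double quotes in the expression and the JSON.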

Event Timeline

dcaro triaged this task as High priority. Dec 14 2022, 2:01 PM
dcaro created this task.
dcaro added a project: Toolforge Build Service.
dcaro updated the task description.

A potential metric to use could be sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})

dcaro raised the priority of this task from High to Needs Triage. Mon, Mar 6, 3:02 PM