
tbs: user-story 9: Create an alert on metricsinfra for tekton being down on toolsbeta
Closed, Resolved | Public | 8 Estimated Story Points

Description

At the time of writing, this can only be done directly in the DB. More info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, that runs a Prometheus (Alertmanager) setup. Within it, the host metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud generates the Prometheus alerts from a DB hosted in Trove.

To log in to that DB, ssh into metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and run sudo -i mariadb.

There you will find the prometheusconfig database, whose alerts table you have to update with the alerts that you want to add. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="toolsbeta",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}
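To make the shape of that row easier to see, here is a minimal Python sketch that parses the vertical output above into a dict and decodes the annotations column as JSON. The helper name parse_vertical_row is hypothetical, and the sketch assumes the output format shown above (one "column: value" pair per line).

```python
import json

# Example row copied from the task description (vertical mariadb output).
SAMPLE_ROW = """
*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="toolsbeta",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}
"""


def parse_vertical_row(text):
    """Parse one vertically formatted row into a {column: value} dict.

    Hypothetical helper: it splits each line on the first ':' only,
    so values that contain ':' (URLs, JSON) stay intact.
    """
    row = {}
    for line in text.splitlines():
        # Skip the "*** 1. row ***" separator and blank lines.
        if line.startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        row[key.strip()] = value.strip()
    return row


row = parse_vertical_row(SAMPLE_ROW)
annotations = json.loads(row["annotations"])  # the column holds a JSON object
```

Here row["expr"] holds the Prometheus expression and annotations is a plain dict with the summary and runbook keys.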

The expr column is the Prometheus expression that you want to monitor. You can explore and test expressions here:
https://toolsbeta-prometheus.wmcloud.org

(probably something like sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})>0 might be enough)
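An expression like the one above could also be checked from a script. The following is a rough sketch that assumes the instance exposes the standard Prometheus HTTP API at /api/v1/query directly under the URL given above; that path, and the helper names, are assumptions and not confirmed by this task.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL from the task; the /api/v1/query path below is an assumption.
PROMETHEUS_URL = "https://toolsbeta-prometheus.wmcloud.org"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def first_sample_value(response):
    """Return the first sample of an instant-vector response as a float.

    Returns None when the expression matched no series.
    """
    results = response["data"]["result"]
    if not results:
        return None
    # Each sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def fetch(expr):
    """Run the query over HTTP (hypothetical helper, needs network access)."""
    with urlopen(build_query_url(expr), timeout=10) as resp:
        return json.load(resp)
```

Calling first_sample_value(fetch('sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})')) would then give the number of running pods, or None if no series matched.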

Another place to look for the right expression is:
https://grafana-rw.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-toolsbeta&var-namespace=image-build&forceLogin&search=open

By inspecting the graphs and their data sources there, you can see which Prometheus instance and which expression give you the data you want.

The alert itself should also have an annotation called 'service' with the value 'toolforge,build_service'.
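Putting the pieces together, a new row for the alerts table could be prepared roughly as follows. This is only a sketch: the alert name TektonDown, the summary text, and the reuse of project_id 12, duration 30m and severity warn from the example row are assumptions. It also assumes the id column is filled in automatically and that the DB driver accepts %s placeholders (as PyMySQL does).

```python
import json

# Parameterized statement for the alerts table; column names are taken
# from the example row, the id column is assumed to be auto-generated.
INSERT_SQL = (
    "INSERT INTO alerts "
    "(project_id, name, expr, duration, severity, annotations) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def build_alert_row(project_id, name, expr, duration, severity, summary, runbook=None):
    """Return the parameter tuple for INSERT_SQL (hypothetical helper)."""
    annotations = {
        "summary": summary,
        # Annotation requested in the task description.
        "service": "toolforge,build_service",
    }
    if runbook:
        annotations["runbook"] = runbook
    return (project_id, name, expr, duration, severity, json.dumps(annotations))


params = build_alert_row(
    project_id=12,  # assumption: copied from the example row
    name="TektonDown",  # hypothetical alert name
    # Expression copied verbatim from the task. An alerting expression fires
    # while it returns samples, so the comparison may need inverting for a
    # "down" alert.
    expr='sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})>0',
    duration="30m",  # assumption: same as the example row
    severity="warn",  # assumption: same as the example row
    summary="Tekton has no running pods on toolsbeta",  # hypothetical text
)
```

With a DB-API cursor, the row would then be written with cursor.execute(INSERT_SQL, params) followed by a commit.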

Details

Title | Reference | Author | Source Branch | Dest Branch
buildservice: add tekton alert | repos/cloud/toolforge/alerts!3 | raymond-ndibe | add_tekton_to_alerts | main

Event Timeline

dcaro triaged this task as High priority. Dec 14 2022, 2:01 PM
dcaro created this task.
dcaro added a project: Toolforge Build Service.
dcaro updated the task description.

A potential metric to use could be sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"})

dcaro raised the priority of this task from High to Needs Triage. Mar 6 2023, 3:02 PM

Change 915771 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from Open to In Progress. May 4 2023, 5:20 PM
Raymond_Ndibe changed the task status from In Progress to Stalled.
Raymond_Ndibe claimed this task.
dcaro changed the task status from Stalled to In Progress. May 5 2023, 9:35 AM

Change 915771 merged by David Caro:

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from In Progress to Stalled. May 9 2023, 12:10 AM