
tbs: user-story 9: Create an alert on metricsinfra for tekton being down on tools
Closed, Resolved · Public

Description

As of writing this task, this can only be done directly in the DB; more info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, which runs a Prometheus/Alertmanager setup. Specifically, there are a couple of hosts:
metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud

that generate the alert configuration for Prometheus from a DB hosted in Trove.

You have to log in to that DB (you can find the credentials and host in the config on the controller hosts, /etc/prometheus-manager/config.yaml).
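For example, a hypothetical session (the placeholders here are not real values; the actual host, user and password are whatever that config.yaml says):

  # connect to the trove-hosted DB and look at the existing alerts
  mysql -h <trove-host> -u <user> -p prometheusconfig
  mysql> SELECT id, project_id, name, expr FROM alerts \G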

There you have the prometheusconfig database, with the alerts table, which you have to update with the alerts that you want to add. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}

The expr column is the Prometheus expression that you want to monitor; you can explore, check and test expressions here:
https://prometheus.wmflabs.org/

The alert itself should also have an annotation called 'service' with the value 'toolforge,build_service'.
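For illustration, a sketch of what such an insert could look like (the name, expr and duration here are hypothetical placeholders, and I'm assuming id is auto-generated; double-check the actual schema before running anything):

  INSERT INTO alerts (project_id, name, expr, duration, severity, annotations)
  VALUES (
      12,                                 -- the tools project, as in the example row above
      'TektonDown',                       -- hypothetical alert name
      'up{job="tekton-pipelines"} == 0',  -- hypothetical expression, see the discussion below
      '5m',                               -- hypothetical duration
      'warn',
      '{"summary": "Tekton appears to be down on tools", "service": "toolforge,build_service"}'
  );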


Event Timeline

dcaro triaged this task as High priority. Dec 14 2022, 2:01 PM
dcaro created this task.
dcaro added a project: Toolforge Build Service.
dcaro raised the priority of this task from High to Needs Triage. Mar 6 2023, 3:03 PM

The ID of the newly created alert on metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud is 10.

Change 911251 had a related patch set uploaded (by David Caro; author: David Caro):

[cloud/metricsinfra/prometheus-configurator@master] Adapt the openstack config to the new var names

https://gerrit.wikimedia.org/r/911251

Nice! I see, though, that the metric you chose for the alert (kube_pod_status_phase) does not seem to have any values :/

[screenshot: image.png (317×515 px, 20 KB)]

Where did you test it?

Answering myself :)

You can check the Prometheus for the tools project directly here:
https://tools-prometheus.wmflabs.org/tools/classic/graph?g0.range_input=1h&g0.expr=sum(kube_pod_status_phase%7Bnamespace%3D%22tekton-pipelines%22%2C%20phase%3D%22Running%22%7D)%20%3D%3D%200&g0.tab=1
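Decoded, the expression in that link is:

  sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"}) == 0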

Now, the problem is that those metrics don't exist on metricsinfra (they are not being pulled in, as pulling in all the k8s metrics might be too much), so we have to pull them in somehow.

NOTE: I found a cool way of avoiding the issue of the alert not being triggered when the metric does not exist: use something like sum(kube_pod_status_phase{namespace="tekton-pipelines"}) or on() vector(0) == 0. The key is the or on() vector(0) part, which makes the left side of the == be 0 when there's no metric called kube_pod_status_phase, instead of there being no data at all and the alert never triggering ;)
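Spelled out as a sketch (one caveat: in PromQL, or binds looser than ==, so you likely want parentheses to make sure the comparison applies to the fallback too):

  # if the sum returns no data at all, "or on() vector(0)" substitutes a
  # constant 0, so the == 0 comparison still matches and the alert can fire
  (
    sum(kube_pod_status_phase{namespace="tekton-pipelines"})
    or on() vector(0)
  ) == 0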

This is how we pull metrics from a service inside Kubernetes:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ce3bda193076558a4c5e800e74cf177f015a8329/modules/profile/manifests/toolforge/prometheus.pp#339

but that actually goes to the pod directly and requests the /metrics path from whatever is listening there (i.e. if your service exposes metrics, that's how you get them).

We want to get a subset of the already pulled k8s metrics in, hmm...

We could do the following:

  • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090; see the sketch after this list)
  • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
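For illustration, a sketch of roughly what that new scrape entry could look like (the job name, namespace, pod label and port here are all assumptions; the real config lives in the puppet file linked above):

  - job_name: tekton-pipelines
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [tekton-pipelines]
    relabel_configs:
      # keep only the tekton controller pods (the label value is an assumption)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: tekton-pipelines-controller
        action: keep
      # rewrite the scrape address to the controller's metrics port (9090)
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:9090
        target_label: __address__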

I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.

What do you think?

Change 911251 merged by David Caro:

[cloud/metricsinfra/prometheus-configurator@master] Adapt the openstack config to the new var names

https://gerrit.wikimedia.org/r/911251

dcaro changed the task status from Open to In Progress. Apr 25 2023, 3:18 PM

> We could do the following:
>
>   • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090)
>   • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
>
> I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.
>
> What do you think?

When you say "job being down", @dcaro, which job are you referring to?

> We could do the following:
>
>   • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090)
>   • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
>
> I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.
>
> What do you think?

So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?

> So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?

I'm asking for your opinion here :)

But if we were to agree that's the path forward, then yes, we would remove the entry in the database.

> When you say "job being down", @dcaro, which job are you referring to?

Sorry, Prometheus lingo. I mean that Prometheus has a list of URLs to get metrics from, and tekton would just be one more there (that's what Prometheus calls a job; it also adds a job=<jobname> label to each metric it scrapes). Prometheus already sends an alert when any of those URLs fails to return metrics, so if tekton goes down we would get an alert saying something like "JobDown: job tekton failed to scrape" or similar.
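With that in place, a dedicated tekton alert expression could be as simple as (assuming the job ends up being named tekton-pipelines):

  # fires when Prometheus fails to scrape the tekton job
  up{job="tekton-pipelines"} == 0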

>> So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?
>
> I'm asking for your opinion here :)
>
> But if we were to agree that's the path forward, then yes, we would remove the entry in the database.

It looks like the less complex option. If there is no downside to this approach, maybe we should take it, then.

Awesome, let's try it then; we can always adapt if we find new issues 👍

Change 915771 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from In Progress to Stalled. May 4 2023, 5:21 PM
dcaro changed the task status from Stalled to In Progress. May 5 2023, 9:35 AM

Change 915771 merged by David Caro:

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from In Progress to Stalled. May 9 2023, 12:10 AM