
tbs: user-story 9: Create an alert on metricsinfra for tekton being down on tools
Closed, Resolved · Public

Description

As of writing this task, this can only be done directly in the DB; more info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, which runs a Prometheus/Alertmanager setup. Specifically, there are a couple of hosts:
metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud

that generate the alert configuration for Prometheus from a DB hosted in Trove.

You have to log in to that DB (you can find the credentials and host in the config on the controller hosts, /etc/prometheus-manager/config.yaml).
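For example, a hypothetical session (the placeholders here are not real values; the actual host, user and password are whatever that config.yaml says):

  # connect to the trove-hosted DB and look at the existing alerts
  mysql -h <trove-host> -u <user> -p prometheusconfig
  mysql> SELECT id, project_id, name, expr FROM alerts \G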

There you have the prometheusconfig database, with the alerts table, which you have to update with the alerts that you want to add. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}

The expr column is the Prometheus expression that you want to monitor; you can explore, check and test expressions here:
https://prometheus.wmflabs.org/

The alert itself should also have an annotation called 'service' with the value 'toolforge,build_service'.
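For illustration, a sketch of what such an insert could look like (the name, expr and duration here are hypothetical placeholders, and I'm assuming id is auto-generated; double-check the actual schema before running anything):

  INSERT INTO alerts (project_id, name, expr, duration, severity, annotations)
  VALUES (
      12,                                 -- the tools project, as in the example row above
      'TektonDown',                       -- hypothetical alert name
      'up{job="tekton-pipelines"} == 0',  -- hypothetical expression, see the discussion below
      '5m',                               -- hypothetical duration
      'warn',
      '{"summary": "Tekton appears to be down on tools", "service": "toolforge,build_service"}'
  );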


Event Timeline

dcaro triaged this task as High priority. Dec 14 2022, 2:01 PM
dcaro created this task.
dcaro added a project: Toolforge Build Service.
dcaro raised the priority of this task from High to Needs Triage. Mar 6 2023, 3:03 PM

The ID of the newly created alert on metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud is 10.

Change 911251 had a related patch set uploaded (by David Caro; author: David Caro):

[cloud/metricsinfra/prometheus-configurator@master] Adapt the openstack config to the new var names

https://gerrit.wikimedia.org/r/911251

Nice! I see, though, that the metric you chose for the alert (kube_pod_status_phase) does not seem to have any values :/

[screenshot: image.png (317×515 px, 20 KB)]

Where did you test it?

Answering myself :)

You can check the Prometheus for the tools project directly here:
https://tools-prometheus.wmflabs.org/tools/classic/graph?g0.range_input=1h&g0.expr=sum(kube_pod_status_phase%7Bnamespace%3D%22tekton-pipelines%22%2C%20phase%3D%22Running%22%7D)%20%3D%3D%200&g0.tab=1
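Decoded, the expression in that link is:

  sum(kube_pod_status_phase{namespace="tekton-pipelines", phase="Running"}) == 0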

Now, the problem is that those metrics don't exist on metricsinfra (they are not being pulled in, as pulling in all the k8s metrics might be too much), so we have to pull them in somehow.

NOTE: I found a cool way of avoiding the issue of the alert not being triggered when the metric does not exist: use something like sum(kube_pod_status_phase{namespace="tekton-pipelines"}) or on() vector(0) == 0. The key is the or on() vector(0) part, which makes the left side of the == be 0 when there's no metric called kube_pod_status_phase, instead of there being no data at all and the alert never triggering ;)
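Spelled out as a sketch (one caveat: in PromQL, or binds looser than ==, so you likely want parentheses to make sure the comparison applies to the fallback too):

  # if the sum returns no data at all, "or on() vector(0)" substitutes a
  # constant 0, so the == 0 comparison still matches and the alert can fire
  (
    sum(kube_pod_status_phase{namespace="tekton-pipelines"})
    or on() vector(0)
  ) == 0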

This is how we pull metrics from a service inside Kubernetes:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ce3bda193076558a4c5e800e74cf177f015a8329/modules/profile/manifests/toolforge/prometheus.pp#339

but that actually goes to the pod directly and requests the /metrics path from whatever is listening there (i.e. if your service exposes metrics, that's how you get them).

We want to get a subset of the already pulled k8s metrics in, hmm...

We could do the following:

  • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090; see the sketch after this list)
  • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
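For illustration, a sketch of roughly what that new scrape entry could look like (the job name, namespace, pod label and port here are all assumptions; the real config lives in the puppet file linked above):

  - job_name: tekton-pipelines
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [tekton-pipelines]
    relabel_configs:
      # keep only the tekton controller pods (the label value is an assumption)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: tekton-pipelines-controller
        action: keep
      # rewrite the scrape address to the controller's metrics port (9090)
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:9090
        target_label: __address__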

I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.

What do you think?

Change 911251 merged by David Caro:

[cloud/metricsinfra/prometheus-configurator@master] Adapt the openstack config to the new var names

https://gerrit.wikimedia.org/r/911251

dcaro changed the task status from Open to In Progress. Apr 25 2023, 3:18 PM

> We could do the following:
>
>   • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090)
>   • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
>
> I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.
>
> What do you think?

When you say "job being down", @dcaro, which job are you referring to?

> We could do the following:
>
>   • Add the tekton metrics there (that means adding a new entry there that points to the tekton-controller pod, port 9090)
>   • That will also create a new time series, up{job="tekton-pipelines"} (or whichever name we give it), and we can alert on that one instead, so if there's no controller pod the scrape will simply fail.
>
> I actually think we already have alerts for the job being down, though I lean towards also having one specifically for tekton being down, even if for now they do a similar thing, because the intentions are different and we might get the tekton uptime some other way in the future.
>
> What do you think?

So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?

> So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?

I'm asking for your opinion here :)

But if we were to agree that's the path forward, then yes, we would remove the entry in the database.

> When you say "job being down", @dcaro, which job are you referring to?

Sorry, Prometheus lingo. I mean that Prometheus has a list of URLs to get metrics from, and tekton would just be one more there (that's what Prometheus calls a job; it also adds a job=<jobname> label to each metric it scrapes). Prometheus already sends an alert when any of those URLs fails to return metrics, so if tekton goes down we would get an alert saying something like "JobDown: job tekton failed to scrape" or similar.
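With that in place, a dedicated tekton alert expression could be as simple as (assuming the job ends up being named tekton-pipelines):

  # fires when Prometheus fails to scrape the tekton job
  up{job="tekton-pipelines"} == 0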

>> So this means we are removing the entry we added to the database, and instead taking the approach of making changes to the puppet file above?
>
> I'm asking for your opinion here :)
>
> But if we were to agree that's the path forward, then yes, we would remove the entry in the database.

It looks like the less complex option. If there is no downside to this approach, maybe we should take it, then.

Awesome, let's try it then; we can always adapt if we find new issues 👍

Change 915771 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from In Progress to Stalled. May 4 2023, 5:21 PM
dcaro changed the task status from Stalled to In Progress. May 5 2023, 9:35 AM

Change 915771 merged by David Caro:

[operations/puppet@production] toolforge: add tekton metrics to prometheus

https://gerrit.wikimedia.org/r/915771

Raymond_Ndibe changed the task status from In Progress to Stalled. May 9 2023, 12:10 AM