tbs: user-story 9: Create an alert on metricsinfra for harbor being down on tools
Closed, ResolvedPublic

Description

At the time of writing, this can only be done directly in the DB. More info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, where we have a Prometheus/Alertmanager setup. Specifically, there are a couple of hosts:
metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud

These hosts generate the Prometheus alerts from a DB that is hosted in Trove.

You have to log in to that DB (you can find the credentials and host in the config on the controller hosts, /etc/prometheus-manager/config.yaml).

There you have the prometheusconfig database with the table alerts, which you have to update with the alerts that you want to add. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}

The expr column is the Prometheus expression that you want to monitor. You can explore, check and test expressions here:
https://prometheus.wmflabs.org/

As for the alert itself, it should also have an annotation called 'service' with the value 'toolforge,build_service'.
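
To make the row format concrete, here is a minimal sketch of how a row from the alerts table could map onto a standard Prometheus alerting rule (alert, expr, for, labels, annotations). The function name row_to_rule is hypothetical, and putting severity under labels is an assumption: the actual mapping done on the metricsinfra controllers is not shown in this task.

```python
import json


def row_to_rule(row):
    """Turn one alerts-table row (as a dict) into a Prometheus rule dict.

    Hypothetical helper: column names come from the example row in this
    task; the severity-as-label mapping is an assumption.
    """
    annotations = row["annotations"]
    if isinstance(annotations, str):
        # The annotations column holds a JSON document.
        annotations = json.loads(annotations)
    return {
        "alert": row["name"],
        "expr": row["expr"],
        "for": row["duration"],
        "labels": {"severity": row["severity"]},
        "annotations": annotations,
    }


# The example row from the description, written as a Python dict.
example_row = {
    "id": 1,
    "project_id": 12,
    "name": "GridQueueProblem",
    "expr": 'sge_queueproblems{project="tools",state=~".*(e|E).*"}',
    "duration": "30m",
    "severity": "warn",
    "annotations": (
        '{"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} '
        'is in state {{ $labels.state }}", '
        '"runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/'
        'Admin/Runbooks/ToolsGridQueueProblem"}'
    ),
}

rule = row_to_rule(example_row)
print(json.dumps(rule, indent=2))
```

The 'service' annotation requested for the harbor alert would simply be one more key inside the annotations JSON.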

Details

Title: Add tekton alerts and tox for testing
Reference: repos/cloud/toolforge/alerts!1
Author: dcaro
Source Branch: add_tekton_alerts
Dest Branch: main

Event Timeline

dcaro triaged this task as High priority. Dec 14 2022, 2:02 PM
dcaro created this task.
dcaro added a project: Toolforge Build Service.
dcaro raised the priority of this task from High to Needs Triage. Mar 6 2023, 3:03 PM
Raymond_Ndibe changed the task status from Open to In Progress. May 4 2023, 5:20 PM
Raymond_Ndibe changed the task status from In Progress to Stalled.
Raymond_Ndibe claimed this task.

metrics infra controller alert record:

id: 11
project_id: 12
name: HarborDown
expr: sum(probe_success{job="probes/pingthing", url=~".*tools-harbor\.*"}) == 0
duration: 30m
severity: warn
annotations: {"summary": "Harbor on {{ $labels.project }} {{ $labels.instance }} is down", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/HarborDownOnTools",
"service": "toolforge,build_service"}

@dcaro I tried testing this alert by stopping harbor on tools-harbor with docker-compose -f /srv/ops/harbor/docker-compose.yml down and disabling tools-puppetmaster, then looking at the alert in https://prometheus.wmcloud.org/alerts. It doesn't seem to work. Can you help look at this? Maybe I made a mistake while testing, or maybe the alert itself doesn't work.

Something interesting I found:
for all the alerts in https://prometheus.wmcloud.org/alerts except ToolsHarborDown and ToolsbetaHarborDown, the alert expr returns a value when executed on https://prometheus.wmcloud.org/graph. This makes sense, but it doesn't explain why alerts created on the metricsinfra controller with project_id set to 12 (tools) appear on https://prometheus.wmcloud.org/alerts but not on https://tools-prometheus.wmflabs.org/tools/classic/alerts

It should create the rules under the tools specific prometheus host:

https://tools-prometheus.wmcloud.org

If you go there, you see that there's actually a harbor one, so that's good

The rule there says:

expr: avg_over_time(probe_success{module=~"http_tools_harbor_wmcloud_org_.*"}[1m])
  * 100 < 75

It's not the same as the one you have there though :/, that's not so good.
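
For reference, the arithmetic of that existing rule can be sketched like this: probe_success is 1 or 0 per probe, so averaging the samples in the 1m window and multiplying by 100 gives a success percentage, and the rule matches when that drops below 75. The function name and the sample lists below are made up for illustration; this is not the code Prometheus runs.

```python
def harbor_probe_alert_fires(samples, threshold_percent=75.0):
    """Mimic: avg_over_time(probe_success[1m]) * 100 < 75.

    `samples` is the list of probe_success values (1 or 0) that fell
    inside the 1m window. Hypothetical helper for illustration only.
    """
    if not samples:
        # With no samples in the window the expression returns no series,
        # so the rule cannot match.
        return False
    success_percent = sum(samples) / len(samples) * 100
    return success_percent < threshold_percent


print(harbor_probe_alert_fires([1, 1, 1, 1]))  # all probes succeeded
print(harbor_probe_alert_fires([1, 1, 0, 0]))  # half of the probes failed
```

Note that a window with exactly 75% success does not match, since the comparison is strictly "less than".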

By checking all the entries of the metric probe_success itself (https://tools-prometheus.wmflabs.org/tools/classic/graph?g0.range_input=12h&g0.expr=probe_success&g0.tab=0), there's one like:

probe_success{instance="127.0.0.1:9115",job="probes/pingthing",url="https://tools-harbor.wmcloud.org/api/v2.0/ping"}

That we can use :)

I think the expr on your alert is also not correct: it has '\.*' instead of '\..*'. But using '\..*' seems to give an error:

Error executing query: 1:48: parse error: unknown escape sequence U+002E '.'

So we can work around it using '[.].*' xd
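
A quick way to see why the original pattern never matches is to replay it outside Prometheus. The sketch below uses Python's re.fullmatch as a stand-in for Prometheus' fully anchored regex matchers (Python's engine is not RE2, but these two patterns behave the same in both); label_matches is a hypothetical helper, and the URL is the one from the probe_success series above.

```python
import re

URL = "https://tools-harbor.wmcloud.org/api/v2.0/ping"


def label_matches(pattern, value):
    """Return True if `pattern` matches the whole label value.

    Prometheus anchors =~ matchers at both ends, which re.fullmatch mimics.
    """
    return re.fullmatch(pattern, value) is not None


# Pattern from the alert: after "tools-harbor" it only allows literal dots,
# so the rest of the URL ("wmcloud.org/...") can never match.
original = r".*tools-harbor\.*"

# Workaround: one literal dot via a character class, then anything.
workaround = r".*tools-harbor[.].*"

print(label_matches(original, URL))    # False
print(label_matches(workaround, URL))  # True
```

The parse error itself comes from PromQL treating the backslash as an escape character inside double-quoted strings, so writing '\\.' in the query should also work; '[.]' just sidesteps the escaping question.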

okok, let me check why/where the alerts are getting generated.

The Toolforge prometheus instance does not and has never had alerts configured via the metricsinfra tooling. The correct way to configure alerts there is via the toolforge/alerts repository.

Yep, I think the confusion comes from this.

The pingthing setup is managed by puppet and hiera; it creates both the metric and the alert on the tools prometheus host directly.

Then there are the metricsinfra alerts, which are managed by the DB (config) + prometheus-configurator (creates the alerts + prometheus config) + metricsinfra-prometheus (gathers metrics and triggers alerts) + metricsinfra-thanos (stores metrics) + metricsinfra-alertmanager (manages alerts).

Yep that :)

The Toolforge prometheus instance does not and has never had alerts configured via the metricsinfra tooling. The correct way to configure alerts there is via the toolforge/alerts repository.

I have a question though: pingthing already creates alerts itself (ProbeDown and such), and those are not handled by the toolforge/alerts repository, right?
In the sense that if we want to change them, that comes from puppet/hiera directly, right?

The pingthing setup is managed by puppet and hiera; it creates both the metric and the alert on the tools prometheus host directly.

It doesn't create an alert; I think that was a leftover from some cherry-pick of the earlier Blackbox probe patch (as a side note, I'd really prefer if those cherry-picks were limited to toolsbeta only). I've sent a patch to purge those correctly.

I see, agree, thanks

I created a first MR, https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/1, with the tekton-only alert (will add the others later if this one is ok).

Also added tox for testing so I could run the tests locally.

Let me know what you think, will work with @Raymond_Ndibe on the others.

ps. also removed the entries from the metricsinfra db, so those should not pop up again.