tbs: user-story 9: Create an alert on metricsinfra for harbor being down on tools
Open, Needs Triage, Public

Description

At the time of writing, this can only be done directly in the DB. More info here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

Essentially, we have a Cloud VPS project, metricsinfra, with a Prometheus (Alertmanager) setup. Specifically, there are a couple of hosts:
metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud

These hosts generate the alerts for Prometheus from a DB that is hosted in Trove.

You have to log in to that DB (you can find the credentials and host in the controller hosts' config, /etc/prometheus-manager/config.yaml).

There you will find the prometheusconfig database with the alerts table, where you have to add a row for each alert that you want to create. An example row:

*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state=~".*(e|E).*"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem"}
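
Adding a row like the one above could be sketched as follows. This is only a rough illustration, not the project's tooling: the function name build_alert_insert is hypothetical, the %s placeholder style assumes a MySQL-type driver, and leaving out the id column assumes the database assigns it. The column names and values come from the example row.

```python
import json

# Columns as they appear in the example row above. `id` is left out on the
# assumption that the database assigns it automatically.
COLUMNS = ("project_id", "name", "expr", "duration", "severity", "annotations")


def build_alert_insert(project_id, name, expr, duration, severity, annotations):
    """Return (sql, params) for a parameterized INSERT into the alerts table.

    Hypothetical helper: the %s placeholders assume a MySQL-style driver.
    """
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO alerts ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    # The annotations column holds a JSON object serialized as text.
    params = (project_id, name, expr, duration, severity, json.dumps(annotations))
    return sql, params


# Values copied from the example row.
sql, params = build_alert_insert(
    project_id=12,
    name="GridQueueProblem",
    expr='sge_queueproblems{project="tools",state=~".*(e|E).*"}',
    duration="30m",
    severity="warn",
    annotations={
        "summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} "
                   "is in state {{ $labels.state }}",
        "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/"
                   "Admin/Runbooks/ToolsGridQueueProblem",
    },
)
print(sql)
print(params)
```

Passing the values as parameters rather than formatting them into the SQL string avoids quoting problems with the braces and quotes in expr and annotations.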

The expr column is the Prometheus expression that you want to alert on. You can explore, check and test expressions here:
https://prometheus.wmflabs.org/
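
Besides the web UI, an expression can be checked through the standard Prometheus HTTP API (/api/v1/query). The sketch below only builds the query URL. It assumes the API is reachable directly under the base URL given, which may not hold if the instance is served under a per-project path. The helper name instant_query_url is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build a Prometheus instant-query URL for the given expression.

    Hypothetical helper: base_url must point at wherever the Prometheus
    HTTP API is actually served (an assumption, adjust as needed).
    """
    query = urlencode({"query": expr})  # percent-encodes braces, quotes, etc.
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


expr = 'sge_queueproblems{project="tools",state=~".*(e|E).*"}'
url = instant_query_url("https://prometheus.wmflabs.org/", expr)
print(url)
```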

As for the alert itself, it should also have an annotation called 'service' with the value 'toolforge,build_service'.
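
The annotations value for the new alert could be put together like this. This is a minimal sketch: the summary text is a made-up placeholder and the helper name with_service_annotation is hypothetical. Only the 'service' key and its value come from this task.

```python
import json


def with_service_annotation(annotations, service="toolforge,build_service"):
    """Return the annotations as a JSON string with the 'service' key set."""
    merged = dict(annotations)  # copy so the caller's dict is left untouched
    merged["service"] = service
    return json.dumps(merged)


# Placeholder summary, not an agreed wording for the alert.
harbor_annotations = with_service_annotation(
    {"summary": "Harbor is down on tools"}
)
print(harbor_annotations)
```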