
Move (or delete?) trafficserver restart count alert from icinga to alerts.git
Closed, Resolved (Public)

Description

We have this alert in icinga that can be ported (or deleted, if no longer relevant) to alerts.git:

+    monitoring::check_prometheus { "trafficserver_${instance_name}_restart_count":
+        description     => "traffic_server ${instance_name} process restarted",
+        dashboard_links => ["https://grafana.wikimedia.org/d/000000610/ats-instance-drilldown?orgId=1&var-site=${::site} prometheus/ops&var-instance=${::hostname}&var-layer=${instance_name}"],
+        query           => "scalar(trafficserver_restart_count{${prometheus_labels}})",
+        method          => 'ge',
+        warning         => 2,
+        critical        => 2,
+        prometheus_url  => "http://prometheus.svc.${::site}.wmnet/ops",
+        notes_link      => 'https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server',
+    }
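
For orientation, a ported rule might look roughly like the sketch below, written in the standard Prometheus alerting-rule format. The group name, alert name, `for` duration, severity label, and annotation keys are all assumptions chosen for illustration, not taken from an existing rule. The expression carries over the Icinga threshold (`ge` 2) but leaves out the `${prometheus_labels}` selector, which would need to be filled in from the Puppet context.

```yaml
# Hypothetical sketch only: names, labels, and the `for` duration are assumptions.
groups:
  - name: trafficserver            # assumed group name
    rules:
      - alert: TrafficserverRestarted   # assumed alert name
        # Same threshold as the Icinga check: method 'ge' with warning/critical at 2.
        expr: trafficserver_restart_count >= 2
        for: 5m                    # assumed; the Icinga check defines no equivalent
        labels:
          severity: critical       # assumed; Icinga used 2 for both warning and critical
        annotations:
          summary: "traffic_server process restarted on {{ $labels.instance }}"
          runbook: "https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server"
```

Unlike the Icinga check, which wraps the query in `scalar()` for a single host, a rule like this fires once per matching time series, so the `instance` label identifies the affected host.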

Event Timeline

We should migrate this rather than delete it, as it's still useful.

BCornwall subscribed.

Forgive me if I'm off base, but hasn't this already been done in T300723? We already merged https://gerrit.wikimedia.org/r/c/operations/alerts/+/807214, which seems to fulfill this ticket.

I see, I had forgotten to remove it from puppet. I've created https://gerrit.wikimedia.org/r/889881 to address that.