⚓ T288726 Move > 60% of observability Prometheus-based checks to Alertmanager

Subject	Repo	Branch	Lines +/-
prometheus: remove per-exporter up checks	operations/puppet	production	+0 -22
thanos: aggregate exporter 'up' metrics	operations/puppet	production	+95 -0
sre: add alerts for exporter-specific unavailability	operations/alerts	master	+52 -0
thanos: reload rule via http endpoint	operations/puppet	production	+4 -1
thanos: fix yaml error	operations/puppet	production	+2 -2
thanos: add recording rules for exporter-specific availability	operations/puppet	production	+27 -0
prometheus: remove job unavailable alert	operations/puppet	production	+0 -15
team-sre: port job unavailable alert	operations/alerts	master	+73 -0
prometheus: remove textfile stale alert	operations/puppet	production	+0 -17
team-sre: port node-exporter textfile stale alert	operations/alerts	master	+41 -0
thanos: deploy global alerts only on hosts running thanos-rule	operations/puppet	production	+15 -4
icinga: remove alertmanager::alerts	operations/puppet	production	+0 -90
o11y: port alertmanager alerts	operations/alerts	master	+261 -0
o11y: port Icinga checks	operations/alerts	master	+36 -0
alerts: remove icinga overload alert, moved to AM	operations/puppet	production	+0 -15
logging: clean up legacy logstash alerts	operations/puppet	production	+0 -109
o11y: add logstash alerts	operations/alerts	master	+187 -0
o11y: add rsyslog alerts	operations/alerts	master	+41 -0
statsd: remove statsd_udp_inbound_errors	operations/puppet	production	+0 -11
o11y: add udp receive errors for statsd	operations/alerts	master	+43 -0
prometheus: add ThanosSidecarUploadFailure to prometheus/ops	operations/puppet	production	+18 -0
o11y: add prometheus alerts	operations/alerts	master	+79 -0
prometheus: remove alerts moved to AM	operations/puppet	production	+0 -31
o11y: add alerts ported from icinga/upstream	operations/alerts	master	+393 -0
profile: remove thanos alerts, moved to alerts.git	operations/puppet	production	+0 -290
o11y: port thanos-compact alerts from Icinga	operations/alerts	master	+86 -0
profile: remove thanos-compact alerts, ported to alerts.git	operations/puppet	production	+0 -76

Status	Assigned	Task
Open	None	T321808 Port most/all Icinga checks to Prometheus/Alertmanager
Open	None	T288622 All Prometheus based alerts move from Icinga to alert manager exclusively
Resolved	fgiunchedi	T288726 Move > 60% of observability Prometheus-based checks to Alertmanager

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Aug 13 2021, 7:55 AM

fgiunchedi updated the task description. Aug 13 2021, 9:11 AM

fgiunchedi updated the task description. Aug 13 2021, 9:15 AM

lmata moved this task from Inbox to Up next on the SRE Observability (FY2021/2022-Q1) board. Aug 18 2021, 4:47 PM

fgiunchedi moved this task from Up next to In progress on the SRE Observability (FY2021/2022-Q1) board. Aug 20 2021, 10:05 AM

Change 714372 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port thanos-rule alerts from Icinga

https://gerrit.wikimedia.org/r/714372

Change 714373 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove thanos-compact alerts, ported to alerts.git

https://gerrit.wikimedia.org/r/714373

Change 714373 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove thanos-compact alerts, ported to alerts.git

https://gerrit.wikimedia.org/r/714373

Change 714372 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port thanos-compact alerts from Icinga

https://gerrit.wikimedia.org/r/714372

fgiunchedi mentioned this in rOALE457bdc2a71ec: o11y: port thanos-compact alerts from Icinga. Aug 24 2021, 6:37 AM

Maintenance_bot removed a project: Patch-For-Review. Aug 24 2021, 7:10 AM

Change 714541 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove thanos alerts, moved to alerts.git

https://gerrit.wikimedia.org/r/714541

Change 714543 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add alerts ported from icinga/upstream

https://gerrit.wikimedia.org/r/714543

Change 714541 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove thanos alerts, moved to alerts.git

https://gerrit.wikimedia.org/r/714541

Change 714543 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add alerts ported from icinga/upstream

https://gerrit.wikimedia.org/r/714543

fgiunchedi mentioned this in rOALEbfb37321537f: o11y: add alerts ported from icinga/upstream. Aug 25 2021, 8:20 AM

fgiunchedi updated the task description. Aug 25 2021, 9:36 AM

Change 715032 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add prometheus alerts

https://gerrit.wikimedia.org/r/715032

Change 715033 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove alerts moved to AM

https://gerrit.wikimedia.org/r/715033

colewhite subscribed. Aug 26 2021, 3:24 PM

herron subscribed. Aug 26 2021, 3:24 PM

Change 715033 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove alerts moved to AM

https://gerrit.wikimedia.org/r/715033

Change 715032 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add prometheus alerts

https://gerrit.wikimedia.org/r/715032

fgiunchedi mentioned this in rOALE1667787bae51: o11y: add prometheus alerts. Aug 27 2021, 7:21 AM

fgiunchedi updated the task description. Aug 27 2021, 8:18 AM

Change 719123 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add udp receive errors for statsd

https://gerrit.wikimedia.org/r/719123

Change 719124 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: remove statsd_udp_inbound_errors

https://gerrit.wikimedia.org/r/719124

Change 719126 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add ThanosSidecarUploadFailure to prometheus/ops

https://gerrit.wikimedia.org/r/719126

Change 719126 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add ThanosSidecarUploadFailure to prometheus/ops

https://gerrit.wikimedia.org/r/719126

fgiunchedi updated the task description. Sep 8 2021, 6:45 AM

Change 719123 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add udp receive errors for statsd

https://gerrit.wikimedia.org/r/719123

Change 719124 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: remove statsd_udp_inbound_errors

https://gerrit.wikimedia.org/r/719124

fgiunchedi updated the task description. Sep 8 2021, 6:48 AM

fgiunchedi mentioned this in rOALEb741084b334e: o11y: add udp receive errors for statsd. Sep 8 2021, 6:48 AM

Change 720063 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] o11y: add rsyslog alerts

https://gerrit.wikimedia.org/r/720063

Change 720079 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] o11y: add logstash alerts

https://gerrit.wikimedia.org/r/720079

Change 720093 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logging: clean up legacy logstash alerts

https://gerrit.wikimedia.org/r/720093

Change 720063 merged by Cwhite:

[operations/alerts@master] o11y: add rsyslog alerts

https://gerrit.wikimedia.org/r/720063

colewhite mentioned this in rOALE15d0b211ae69: o11y: add rsyslog alerts. Sep 14 2021, 3:45 PM

Change 720079 merged by Cwhite:

[operations/alerts@master] o11y: add logstash alerts

https://gerrit.wikimedia.org/r/720079

colewhite mentioned this in rOALE8485bb5c6d56: o11y: add logstash alerts. Sep 14 2021, 3:59 PM

Change 720093 merged by Cwhite:

[operations/puppet@production] logging: clean up legacy logstash alerts

https://gerrit.wikimedia.org/r/720093

colewhite updated the task description. Sep 14 2021, 9:37 PM

Change 724761 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port alertmanager alerts

https://gerrit.wikimedia.org/r/724761

Change 724771 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove alertmanager::alerts

https://gerrit.wikimedia.org/r/724771

Change 725884 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port Icinga checks

https://gerrit.wikimedia.org/r/725884

Change 725885 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alerts: remove icinga overload alert, moved to AM

https://gerrit.wikimedia.org/r/725885

Change 725885 merged by Filippo Giunchedi:

[operations/puppet@production] alerts: remove icinga overload alert, moved to AM

https://gerrit.wikimedia.org/r/725885

Change 725884 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port Icinga checks

https://gerrit.wikimedia.org/r/725884

fgiunchedi mentioned this in rOALE15af021ec6e4: o11y: port Icinga checks. Oct 5 2021, 6:46 AM

fgiunchedi updated the task description. Oct 6 2021, 9:36 AM

Change 724761 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port alertmanager alerts

https://gerrit.wikimedia.org/r/724761

fgiunchedi mentioned this in rOALE99d17a5fa0aa: o11y: port alertmanager alerts. Oct 12 2021, 12:54 PM

fgiunchedi updated the task description. Oct 12 2021, 12:54 PM

Change 724771 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove alertmanager::alerts

https://gerrit.wikimedia.org/r/724771

Change 730206 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: deploy global alerts only on hosts running thanos-rule

https://gerrit.wikimedia.org/r/730206

Change 730206 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: deploy global alerts only on hosts running thanos-rule

https://gerrit.wikimedia.org/r/730206

fgiunchedi moved this task from FY2021/2022-Q1 to FY2021/2022-Q2 on the SRE Observability board. Oct 20 2021, 2:33 PM

fgiunchedi edited projects, added SRE Observability (FY2021/2022-Q2); removed SRE Observability (FY2021/2022-Q1).

lmata triaged this task as High priority. Oct 29 2021, 5:11 PM

fgiunchedi moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board. Nov 26 2021, 1:34 PM

Change 743394 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: port node-exporter textfile stale alert

https://gerrit.wikimedia.org/r/743394

Change 743395 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove textfile stale alert

https://gerrit.wikimedia.org/r/743395

fgiunchedi updated the task description. Dec 6 2021, 3:04 PM

Change 744033 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: port job unavailable alert

https://gerrit.wikimedia.org/r/744033

Change 744035 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove job unavailable alert

https://gerrit.wikimedia.org/r/744035

Change 743394 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: port node-exporter textfile stale alert

https://gerrit.wikimedia.org/r/743394

Change 743395 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove textfile stale alert

https://gerrit.wikimedia.org/r/743395

fgiunchedi updated the task description. Dec 15 2021, 10:06 AM

lmata edited projects, added SRE Observability (FY2021/2022-Q3); removed SRE Observability (FY2021/2022-Q2). Jan 11 2022, 2:50 PM

Change 744033 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: port job unavailable alert

https://gerrit.wikimedia.org/r/744033

Change 744035 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove job unavailable alert

https://gerrit.wikimedia.org/r/744035

fgiunchedi updated the task description. Mar 7 2022, 9:21 AM

Change 778259 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add alerts for exporter-specific unavailability

https://gerrit.wikimedia.org/r/778259

Change 778261 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: add recording rules for exporter-specific availability

https://gerrit.wikimedia.org/r/778261

Change 778261 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add recording rules for exporter-specific availability

https://gerrit.wikimedia.org/r/778261

Change 778354 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] thanos: fix yaml error

https://gerrit.wikimedia.org/r/778354

Change 778354 merged by Cwhite:

[operations/puppet@production] thanos: fix yaml error

https://gerrit.wikimedia.org/r/778354

lmata moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q3) board. Apr 7 2022, 11:04 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q4); removed SRE Observability (FY2021/2022-Q3). Apr 11 2022, 3:34 AM

Change 784629 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: reload rule via http endpoint

https://gerrit.wikimedia.org/r/784629

Change 784629 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: reload rule via http endpoint

https://gerrit.wikimedia.org/r/784629

Change 778259 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add alerts for exporter-specific unavailability

https://gerrit.wikimedia.org/r/778259

Change 784635 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: aggregate exporter 'up' metrics

https://gerrit.wikimedia.org/r/784635

Change 784636 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove per-exporter up checks

https://gerrit.wikimedia.org/r/784636

Change 784635 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: aggregate exporter 'up' metrics

https://gerrit.wikimedia.org/r/784635

Change 784636 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove per-exporter up checks

https://gerrit.wikimedia.org/r/784636

fgiunchedi updated the task description. Apr 27 2022, 1:19 PM

All o11y Prometheus-based alerts mentioned in the description have been migrated, resolving!

lmata moved this task from Inbox to Done on the SRE Observability (FY2021/2022-Q4) board. May 25 2022, 4:25 PM

Closed, Resolved (Public)

Description

Metrics

Alerting

Logging


Event Timeline

	fgiunchedi
	Aug 12 2021, 9:03 AM
