The incident in the parent task required a manual page. This should be fixed so that an outage pages automatically in the future.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | eoghan | T354478 ticket.wikimedia.org down: upstream connect error or disconnect/reset before headers
Open | | LSobanski | T354479 ticket.wikimedia.org should page when down
Event Timeline
Paging for ticket.wikimedia.org might be a bit expensive if it is done the same way as pages for MediaWiki, for example (especially outside of business hours). But that's my personal opinion and might need a bigger discussion.
We could explore paging just the Collaboration-Services sub-team. However, sub-team paging is not yet implemented; it is a bigger topic and needs coordination with SRE and the Observability Team first.
Also, the page could be delayed so that short outages don't trigger a page but longer outages do. The incident in T354478 would have resolved automatically after 30 minutes, for example. Currently, prometheus::blackbox::check::http does not support delaying probe down alerts; the delay is fixed at 2m. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp#181.
So to delay the page, we would need a Puppet change to prometheus::blackbox::check::http that makes the for parameter configurable. The blackbox check is used quite a lot, so the default would have to stay at 2m to avoid interfering with existing blackbox checks. We should also reach out to Observability and get feedback on this idea.
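To illustrate the effect of a configurable for duration, here is a minimal Python sketch. It is a simplified model, not how Prometheus is implemented: it assumes the alert condition is true for the whole outage and ignores scrape and rule evaluation intervals. The helper names `parse_duration` and `would_fire`, and the 1h value, are hypothetical and only serve the example.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a simple Prometheus-style duration such as '2m' or '1h'."""
    units = {"s": "seconds", "m": "minutes", "h": "hours"}
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{units[unit]: value})


def would_fire(outage: timedelta, alert_for: str = "2m") -> bool:
    """Return True if a continuous outage lasts at least the `for` duration.

    Simplified model: scrape and evaluation intervals are ignored.
    """
    return outage >= parse_duration(alert_for)


# An outage of roughly 30 minutes, like the one in T354478.
thirty_minute_outage = timedelta(minutes=30)

print(would_fire(thirty_minute_outage))        # with the fixed default of 2m
print(would_fire(thirty_minute_outage, "1h"))  # with a hypothetical 1h delay
```

Under these assumptions, the 30-minute outage alerts with the 2m default but stays silent with a 1h delay, which is the trade-off to weigh when choosing the duration.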
An alternative to the puppet change could be to add the specific check to /operations/alerts?
Definitely yes, I'm +1 on the general idea and happy to review patches / provide assistance
Change 991571 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable
Change 991571 merged by Jelto:
[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable
Blackbox checks can now be delayed by setting alert_after:
prometheus::blackbox::check::http { $host:
    team        => 'collaboration-services',
    severity    => $severity,
    alert_after => '1h',
    ...
}
So we can discuss after what duration ticket.wikimedia.org should alert and/or page.
Change 991765 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] vrts: test delaying blackbox::check::http
Mentioned in SAL (#wikimedia-operations) [2024-01-22T09:56:52Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479
Mentioned in SAL (#wikimedia-operations) [2024-01-22T10:00:55Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479
Change 991765 merged by Jelto:
[operations/puppet@production] vrts: test delaying blackbox::check::http
Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:21:26Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479
Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:26:12Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479
The change above has the expected effect. The ProbeDown alert for vrts hosts changes from for: 2m to for: 3m. I checked the alert config in Thanos:
name: ProbeDown
expr: avg_over_time(probe_success{module=~"http_ticket_test_wikimedia_org_.*"}[1m]) * 100 < 75
for: 3m
labels:
  prometheus: ops
  severity: task
  site: eqiad
  team: collaboration-services
...
summary: Service {{ $labels.instance }} has failed probes ({{ $labels.module }})
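As a rough illustration of what the expr above checks, the following Python sketch evaluates the same condition for one 1m window of probe_success samples. The number of samples per window depends on the scrape interval, which is not stated here; four samples per window is an assumption for the example, and `probe_down_condition` is a hypothetical helper name.

```python
def probe_down_condition(samples: list[int], threshold: float = 75.0) -> bool:
    """Mirror `avg_over_time(probe_success[1m]) * 100 < 75` for one window.

    `samples` holds the probe_success values (1 = success, 0 = failure)
    that fall inside the window.
    """
    if not samples:
        # No samples: the expression yields no result, so nothing is pending.
        return False
    success_percent = sum(samples) / len(samples) * 100
    return success_percent < threshold


# Assumed: four samples per 1m window.
print(probe_down_condition([1, 1, 1, 0]))  # 75% success is not below 75
print(probe_down_condition([1, 1, 0, 0]))  # 50% success is below 75
```

The condition has to stay true for the whole for duration before ProbeDown fires, so a single failed probe in an otherwise healthy window does not start the timer under this assumption.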
I ran two tests on ticket-test.wikimedia.org, stopping envoy once without the delay and once with it. Without the delay, the ProbeDown alert fired after 4 minutes; with the delay, it fired after 5 minutes. The difference is small, and the normal Prometheus scrape interval plus the latency between Prometheus, Alertmanager, and the IRC/Phab integrations has to be added on top, but it works as expected.
Without delay:
09:56 stop envoy
10:00 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)
With additional delay:
11:21 stop envoy
11:26 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)
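The arithmetic behind the two test runs can be checked with a short Python sketch. It only uses the minute-resolution timestamps listed above, so the results are approximate; `minutes_between` is a hypothetical helper name.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from "stop envoy" to the ProbeDown notification in each run.
without_delay = minutes_between("09:56", "10:00")
with_delay = minutes_between("11:21", "11:26")

print(without_delay)               # run with for: 2m
print(with_delay)                  # run with for: 3m
print(with_delay - without_delay)  # extra delay observed
```

The extra minute between the runs matches the change from for: 2m to for: 3m; the remaining time in both runs is detection and notification overhead.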
So I'm going to revert the alert_after of 3m for the vrts hosts, and then we can discuss a proper alert for VRTS.
Change 992108 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"
Change 992108 merged by Dzahn:
[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"