ticket.wikimedia.org should page when down
Open, Medium, Public

Description

The incident in the parent task required a manual page. This should be fixed so that an outage pages automatically in future.

Event Timeline

Reedy renamed this task from ticket.Wikimedia.org should page when down to ticket.wikimedia.org should page when down. Jan 6 2024, 7:52 PM
Jelto triaged this task as Medium priority. Jan 8 2024, 4:27 PM
Jelto subscribed.

Thanks for opening the task! We will pick that topic up in our next team meeting.

Paging for ticket.wikimedia.org might be a bit expensive if done the same way as pages for mediawiki, for example (especially outside of business hours). But that's my personal opinion and might need a bigger discussion.

We could explore paging just the Collaboration-Services sub-team. But sub-team paging is not yet implemented; it is a bigger topic and needs coordination with SRE and the Observability Team first.

Also, the page could be delayed so that short outages don't trigger a page but longer outages do. For example, the incident in T354478 would have resolved automatically after 30 minutes. Currently prometheus::blackbox::check::http does not support delaying probe down alerts; the delay is fixed at 2m. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp#181.
So to delay the page we would need a puppet change to prometheus::blackbox::check::http that makes the for parameter configurable. The blackbox check is used quite a lot, so the default would have to stay at 2m to not interfere with existing blackbox checks. We should also reach out to Observability and get feedback on this idea.
An alternative to the puppet change could be to add the specific check to /operations/alerts?

@fgiunchedi what do you think, is that something that could be introduced?

Definitely yes, I'm +1 on the general idea and happy to review patches / provide assistance

Change 991571 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable

https://gerrit.wikimedia.org/r/991571

Change 991571 merged by Jelto:

[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable

https://gerrit.wikimedia.org/r/991571

Blackbox checks can now be delayed by setting alert_after:

prometheus::blackbox::check::http { $host:
    team        => 'collaboration-services',
    severity    => $severity,
    alert_after => '1h',
    ...
}

So we can discuss after what duration ticket.wikimedia.org should alert and/or page.

Change 991765 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] vrts: test delaying blackbox::check::http

https://gerrit.wikimedia.org/r/991765

Mentioned in SAL (#wikimedia-operations) [2024-01-22T09:56:52Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479

Mentioned in SAL (#wikimedia-operations) [2024-01-22T10:00:55Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479

Change 991765 merged by Jelto:

[operations/puppet@production] vrts: test delaying blackbox::check::http

https://gerrit.wikimedia.org/r/991765

Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:21:26Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479

Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:26:12Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479

The change above has the expected effect. The ProbeDown alert for vrts hosts changes from for: 2m to for: 3m. I checked the alert config in Thanos:

name: ProbeDown
expr: avg_over_time(probe_success{module=~"http_ticket_test_wikimedia_org_.*"}[1m]) * 100 < 75
for: 3m
labels:
  prometheus: ops
  severity: task
  site: eqiad
  team: collaboration-services
...
summary: Service {{ $labels.instance }} has failed probes ({{ $labels.module }})

I've done two tests on ticket-test.wikimedia.org, stopping envoy once without the delay and once with it. Without the delay the ProbeDown alert fired after 4 minutes; with the delay it fired after 5 minutes. The difference is quite small, and both numbers include the normal Prometheus scrape interval and the latency between Prometheus, Alertmanager and the IRC/Phab integrations, but the delay works as expected.

Without delay:
09:56 stop envoy
10:00 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)

With additional delay:
11:21 stop envoy
11:26 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)
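
As a rough sanity check of the timings above, here is a minimal Python sketch that models how a rule of the form avg_over_time(probe_success[1m]) * 100 < 75 interacts with the for: duration. It is a simplified model, not the Prometheus implementation: the 15s scrape interval and 60s rule evaluation interval are assumptions (the real values are not stated in this task), the function name is made up for illustration, and notification latency is ignored.

```python
def seconds_until_firing(for_seconds, scrape_interval=15, eval_interval=60,
                         window=60, threshold=75.0, horizon=3600):
    """Seconds from outage start (t=0) until the modelled alert fires.

    Assumptions (not taken from the task): probes are scraped every
    `scrape_interval` seconds, the rule is evaluated every `eval_interval`
    seconds, and every probe from t=0 onwards fails.
    """
    pending_since = None
    for t in range(0, horizon + 1, eval_interval):
        # Probe samples inside the lookback window (t - window, t].
        sample_times = [s for s in range(-window, t + 1, scrape_interval)
                        if t - window < s <= t]
        # probe_success is 1 before the outage and 0 once it has started.
        samples = [1.0 if s < 0 else 0.0 for s in sample_times]
        success_pct = 100.0 * sum(samples) / len(samples)

        if success_pct < threshold:
            # The alert becomes pending on the first failing evaluation...
            if pending_since is None:
                pending_since = t
            # ...and fires once it has been pending for at least `for`.
            if t - pending_since >= for_seconds:
                return t
        else:
            # Any passing evaluation resets the pending state.
            pending_since = None
    return None


if __name__ == "__main__":
    for label, for_seconds in (("for: 2m", 120), ("for: 3m", 180)):
        print(label, "fires after", seconds_until_firing(for_seconds), "seconds")
```

Under these assumed intervals the model fires exactly one minute later with for: 3m than with for: 2m, which matches the one minute difference between the two tests. The absolute times in the model are shorter than the observed 4 and 5 minutes because the model leaves out the latency between Prometheus, Alertmanager and the IRC/Phab integrations.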

So I'm going to revert the alert_after: 3m for vrts hosts and then we can discuss a proper alert for VRTS.

Change 992108 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"

https://gerrit.wikimedia.org/r/992108

Change 992108 merged by Dzahn:

[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"

https://gerrit.wikimedia.org/r/992108

Claiming this as it's a process / SLA question for the time being.