ticket.wikimedia.org should page when down
Open, Medium, Public

Assigned To

Authored By

	RhinosF1
	Jan 6 2024, 3:22 PM

Description

The incident in the parent task required a manual page. This should be fixed so that an outage pages automatically in future.

Details

Subject	Repo	Branch	Lines +/-
Revert "vrts: test delaying blackbox::check::http"	operations/puppet	production	+0 -1
vrts: test delaying blackbox::check::http	operations/puppet	production	+1 -0
prometheus::blackbox::check: make for parameter configurable	operations/puppet	production	+6 -2


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		eoghan	T354478 ticket.wikimedia.org down: upstream connect error or disconnect/reset before headers
		Open		LSobanski	T354479 ticket.wikimedia.org should page when down

Event Timeline

RhinosF1 created this task. Jan 6 2024, 3:22 PM

Krd subscribed. Jan 6 2024, 3:34 PM

Reedy renamed this task from "ticket.Wikimedia.org should page when down" to "ticket.wikimedia.org should page when down". Jan 6 2024, 7:52 PM

NoOnEtHeMaStA added a commit: rMEXTf658c3375869: Update git submodules. Jan 7 2024, 9:15 PM

Aklapper removed a commit: rMEXTf658c3375869: Update git submodules. Jan 7 2024, 9:24 PM

Thanks for opening the task! We will pick that topic up in our next team meeting.

Jelto moved this task from Incoming to Backlog on the collaboration-services board. Jan 8 2024, 4:27 PM

Jelto added a subscriber: LSobanski. Jan 9 2024, 1:49 PM

Paging for ticket.wikimedia.org might be a bit expensive if done the same way as pages for MediaWiki, for example (especially outside of business hours). But that's my personal opinion and might need a bigger discussion.

We could explore paging just the Collaboration-Services sub-team. But sub-team paging is not yet implemented; it is a bigger topic and needs coordination with SRE and the Observability team first.

The page could also be delayed so that short outages don't trigger a page but longer outages do. The incident in T354478, for example, would have resolved automatically after 30 minutes. Currently prometheus::blackbox::check::http does not support delaying probe-down alerts; the delay is fixed at 2m. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp#181.
So to delay the page we would need a Puppet change to prometheus::blackbox::check::http that makes the for parameter configurable. The blackbox check is used quite a lot, so we would have to keep the default at 2m to avoid interfering with existing blackbox checks. We should also reach out to Observability and get feedback on this idea.
An alternative to the Puppet change could be to add the specific check to /operations/alerts?

In T354479#9462886, @Jelto wrote:

Currently prometheus::blackbox::check::http does not support delaying probe-down alerts; the delay is fixed at 2m. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp#181.
So to delay the page we would need a Puppet change to prometheus::blackbox::check::http that makes the for parameter configurable. The blackbox check is used quite a lot, so we would have to keep the default at 2m to avoid interfering with existing blackbox checks. We should also reach out to Observability and get feedback on this idea.

@fgiunchedi what do you think, is that something that could be introduced?

In T354479#9464988, @LSobanski wrote:

@fgiunchedi what do you think, is that something that could be introduced?

Definitely yes, I'm +1 on the general idea and happy to review patches / provide assistance

Change 991571 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable

https://gerrit.wikimedia.org/r/991571

gerritbot added a project: Patch-For-Review. Jan 18 2024, 11:04 AM

Change 991571 merged by Jelto:

[operations/puppet@production] prometheus::blackbox::check: make for parameter configurable

https://gerrit.wikimedia.org/r/991571

Blackbox checks can now be delayed by setting alert_after:

prometheus::blackbox::check::http { $host:
    team        => 'collaboration-services',
    severity    => $severity,
    alert_after => '1h',
    ...
}

So we can discuss after what duration ticket.wikimedia.org should alert and/or page.
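To support that discussion, here is a minimal Python sketch of how a candidate alert_after value relates to the length of an outage. The helper names duration_seconds and would_alert are hypothetical, only simple single-unit durations such as 2m or 1h are handled (not the full Prometheus duration syntax), and scrape, evaluation and notification latency are ignored.

```python
import re

# Seconds per unit for the simple durations handled here (assumption:
# only single-unit values such as "2m" or "1h" are needed).
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_seconds(value: str) -> int:
    """Convert a single-unit duration such as '2m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def would_alert(outage: str, alert_after: str) -> bool:
    """True if the outage lasts at least as long as the configured delay.

    Idealized: ignores scrape, rule evaluation and notification latency.
    """
    return duration_seconds(outage) >= duration_seconds(alert_after)


# An outage of about 30 minutes (as in T354478) against a few candidate delays.
for delay in ("2m", "15m", "1h"):
    print(delay, would_alert("30m", delay))
```

With these assumptions, a 30-minute outage would alert with a delay of 2m or 15m but not with the 1h used in the example above.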

Maintenance_bot removed a project: Patch-For-Review. Jan 18 2024, 3:30 PM

Change 991765 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] vrts: test delaying blackbox::check::http

https://gerrit.wikimedia.org/r/991765

gerritbot added a project: Patch-For-Review. Jan 19 2024, 12:27 PM

Mentioned in SAL (#wikimedia-operations) [2024-01-22T09:56:52Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479

Mentioned in SAL (#wikimedia-operations) [2024-01-22T10:00:55Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479

Jelto mentioned this in T355512: ProbeDown (vrts1002). Jan 22 2024, 10:05 AM

Change 991765 merged by Jelto:

[operations/puppet@production] vrts: test delaying blackbox::check::http

https://gerrit.wikimedia.org/r/991765

Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:21:26Z] <jelto> stop envoy on ticket-test.wikimedia.org to test alerting - T354479

Mentioned in SAL (#wikimedia-operations) [2024-01-22T11:26:12Z] <jelto> start envoy on ticket-test.wikimedia.org to test alerting - T354479

Maintenance_bot removed a project: Patch-For-Review. Jan 22 2024, 11:30 AM

The change above has the expected effect. The ProbeDown alert for vrts hosts changes from for: 2m to for: 3m. I checked the alert config in Thanos:

name: ProbeDown
expr: avg_over_time(probe_success{module=~"http_ticket_test_wikimedia_org_.*"}[1m]) * 100 < 75
for: 3m
labels:
  prometheus: ops
  severity: task
  site: eqiad
  team: collaboration-services
...
summary: Service {{ $labels.instance }} has failed probes ({{ $labels.module }})

I've done two tests on ticket-test.wikimedia.org: I disabled envoy once without the delay and once with it. Without the delay the ProbeDown alert fired after 4 minutes; with the delay it fired after 5 minutes. The difference is quite small, and we have to add the normal Prometheus scrape interval and the latency between Prometheus, Alertmanager and the IRC/Phab integrations, but it works as expected.

Without delay:
09:56 stop envoy
10:00 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)

With additional delay:
11:21 stop envoy
11:26 (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4)

So I'm going to revert the alert_after: 3m for vrts hosts and then we can discuss a proper alert for VRTS.
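The roughly one-minute difference between the two tests can be illustrated with a small simulation of the rule shown above (avg_over_time over 1m, threshold 75, with a for duration). This is only a sketch: the 15-second probe interval and 60-second rule evaluation interval are assumptions, not values taken from the production configuration, and the function name first_firing_time is hypothetical.

```python
def first_firing_time(for_seconds, probe_interval=15, eval_interval=60,
                      window=60, threshold=75.0, horizon=3600):
    """Seconds after the outage starts at which the rule first fires, or None.

    Model (all assumptions): probes run every probe_interval seconds and
    fail from t=0 onwards; the rule is evaluated every eval_interval
    seconds; eval_interval and window are multiples of probe_interval.
    """
    pending_since = None
    for now in range(0, horizon + 1, eval_interval):
        # Probe timestamps inside the lookback window (now - window, now].
        stamps = range(now - window + probe_interval, now + 1, probe_interval)
        # A probe succeeds (1) before the outage and fails (0) from t=0 on.
        samples = [1 if ts < 0 else 0 for ts in stamps]
        success_pct = 100 * sum(samples) / len(samples)
        if success_pct < threshold:
            # Condition is true: start or continue the pending period.
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_seconds:
                return now
        else:
            # Condition is false: reset the pending period.
            pending_since = None
    return None


print(first_firing_time(120))  # for: 2m
print(first_firing_time(180))  # for: 3m
```

Under these assumptions the rule fires 180 seconds after the outage starts with for: 2m and 240 seconds with for: 3m. The one-minute gap matches the observed difference; the absolute times in the tests (4 and 5 minutes) are longer, which is plausibly due to the notification latency mentioned above.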

Jelto claimed this task. Jan 22 2024, 11:32 AM

Jelto mentioned this in T355526: ProbeDown (vrts1002).

Jelto moved this task from Backlog to Work in Progress on the collaboration-services board.

Krd unsubscribed. Jan 22 2024, 11:35 AM

Change 992108 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"

https://gerrit.wikimedia.org/r/992108

gerritbot added a project: Patch-For-Review. Jan 22 2024, 11:37 AM

Dzahn awarded a token. Jan 22 2024, 5:50 PM

Change 992108 merged by Dzahn:

[operations/puppet@production] Revert "vrts: test delaying blackbox::check::http"

https://gerrit.wikimedia.org/r/992108

Maintenance_bot removed a project: Patch-For-Review. Jan 22 2024, 6:30 PM

Claiming this as it's a process / SLA question for the time being.

LSobanski moved this task from Work in Progress to Backlog on the collaboration-services board. Feb 6 2024, 8:23 AM
