
monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs"
Open, Needs Triage, Public

Description

The only page we got was from toolschecker being unable to access NFS, and that was because it was timing out and returning a 500.

That happened well after the outage started, and after we had started recovering (proxy was running, metricsinfra was accessible, ...).

Event Timeline

If you think this might be a good topic for the Incident Review meeting, please let @lmata or me know.

Aklapper renamed this task from monitoring: find out how we could have been paged for this outage to monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs". (Oct 2 2023, 6:28 PM)

I have enabled paging for the MainProxyDown alert on metricsinfra. Unfortunately, that might not have helped in this scenario if the alerting instance lost its IP first, but it is a step. Now, by alerting on the alerting instance being unable to alert, we get full coverage :) (and a tongue twister!)
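
For illustration only, one common way to "alert on the alerting instance being unable to alert" is a heartbeat (dead man's switch) checked from somewhere outside the alerting instance. The sketch below is an assumption about how that could look, not what is deployed on metricsinfra; the function name `should_page` and the 5-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_page(
    last_heartbeat: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # hypothetical threshold
) -> bool:
    """Decide whether an external watcher should page.

    The alerting instance is assumed to send a heartbeat periodically.
    If no heartbeat was ever seen, or the last one is older than
    max_silence, the alerting instance is treated as unable to alert.
    """
    if last_heartbeat is None:
        # Never heard from the alerting instance at all.
        return True
    return now - last_heartbeat > max_silence


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A heartbeat seen 10 minutes ago exceeds the 5-minute threshold.
    print(should_page(now - timedelta(minutes=10), now))
```

The point of running the check elsewhere is that it keeps working when the alerting instance itself loses its IP, which is the gap noted above.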