In https://wikitech.wikimedia.org/wiki/Incidents/2022-12-12_wdqs_codfw_brief_outage, we had a brief codfw outage that ended up self-healing. However, our automated monitoring emitted a page before that self-healing could take place.
Given the recent efforts by the Search team to formalize our WDQS uptime SLO, we should have our monitoring wait roughly half an hour (potentially longer) before paging.
There's a technical limitation, however: we have generic pybal pages that fire when insufficient hosts are alive (as seen by pybal's configured health checks), based on the service's pybal depool threshold. We should see if we can implement a way to disable paging from these general alerts for specific services (WDQS in this case). This will likely require some changes to the associated puppet code, but we'll have to talk to o11y to understand more and see if there's a reasonable/feasible way of relaxing pybal paging on a service-specific basis.
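To make the idea concrete, here is a minimal sketch of what a service-specific paging policy could look like as plain logic. It is only an illustration for discussion with o11y: `PagingPolicy`, `OVERRIDES`, and `should_page` are hypothetical names, not existing pybal, puppet, or alerting settings, and the 30-minute grace period is just the figure suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PagingPolicy:
    """Hypothetical per-service policy; not an existing pybal or puppet setting."""
    pages: bool = True               # whether the generic alert may page at all
    grace: timedelta = timedelta(0)  # how long a failure must persist before paging


# Hypothetical overrides; services not listed here keep the default policy.
OVERRIDES = {
    "wdqs-ssl:443": PagingPolicy(pages=True, grace=timedelta(minutes=30)),
}


def should_page(service: str, failing_since: Optional[datetime], now: datetime) -> bool:
    """Return True only if the service may page and has been failing for the full grace period."""
    policy = OVERRIDES.get(service, PagingPolicy())
    if failing_since is None or not policy.pages:
        return False
    return now - failing_since >= policy.grace
```

With this shape, an outage that self-heals within the grace period never pages for WDQS, while every other service keeps today's immediate paging behaviour. Setting `pages=False` for a service would model disabling the page entirely.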
AC:
- We know whether it is possible to tune alerts around WDQS
- Paging is disabled for the probe alert "Service wdqs-ssl:443 has failed probes"
- We have a decision on how to move forward that is validated by Observability
- Implementation is not part of this ticket