Summary
When ConfirmEdit declares that it is in "failover mode", we want to alert the Product Safety and Integrity team and SRE.
Technical notes
- We have a warning-level message in the captcha log channel: "hCaptcha is unavailable, falling back to FancyCaptcha". When this message is logged, we should generate an alert.
- We have a Grafana panel recording the request times for the HCaptchaEnterpriseHealthChecker::isAvailable() call: https://grafana-rw.wikimedia.org/d/441b2def-52e9-49d6-acad-91f5bb748989/hcaptcha-reverse-proxy-proxoid?orgId=1&from=now-3h&to=now&timezone=utc&var-instance=$__all&var-site=eqiad&var-wiki=ptwiki&var-wiki=jawiki&var-wiki=idwiki&var-wiki=zhwiki&var-wiki=trwiki&var-wiki=fawiki&var-wiki=frwiki&viewPanel=panel-36 . If the panel value drops to 0, the service can be considered to have failed and we should generate an alert.
The alerts can go to:
- #psi-alerts channel in Slack
- a team in SRE (TBD)
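As a rough illustration of the two conditions above, here is a minimal Python sketch. It is not the actual alert definition, which would presumably live in the Logstash and Grafana alerting configuration. The LogEntry type, the function names, and the alert names are all hypothetical. The sketch also assumes that an empty sample window means "no data" rather than "failed", which the real alert may want to treat differently.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Message text taken from the task description.
FAILOVER_MESSAGE = "hCaptcha is unavailable, falling back to FancyCaptcha"


@dataclass
class LogEntry:
    """Hypothetical, simplified shape of a log record."""
    channel: str
    level: str
    message: str


def failover_logged(entries: Sequence[LogEntry]) -> bool:
    """True if any warning in the captcha channel carries the failover message."""
    return any(
        e.channel == "captcha"
        and e.level.lower() == "warning"
        and FAILOVER_MESSAGE in e.message
        for e in entries
    )


def health_check_flatlined(samples: Sequence[float]) -> bool:
    """True if every sample in a non-empty window is 0.

    Assumption: an empty window is treated as "no data", not as a failure.
    """
    return len(samples) > 0 and all(s == 0 for s in samples)


def alerts_to_fire(entries: Sequence[LogEntry], samples: Sequence[float]) -> List[str]:
    """Return the (hypothetical) names of the alerts whose condition holds."""
    alerts = []
    if failover_logged(entries):
        alerts.append("hcaptcha_failover_logged")
    if health_check_flatlined(samples):
        alerts.append("hcaptcha_health_check_zero")
    return alerts
```

Each returned alert name would then be routed to the destinations listed above.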
Acceptance criteria
- Alert is defined for the Logstash entry "hCaptcha is unavailable, falling back to FancyCaptcha"
- Alert is defined for when the isAvailable() Grafana panel drops to 0