hCaptcha: Implement alerts
Closed, Resolved (Public)

Description

Summary

When ConfirmEdit declares that it is in "failover mode", we want to alert the Product Safety and Integrity team and SRE.

Technical notes

The alerts can go to:

  • #psi-alerts channel in Slack
  • a team in SRE (TBD)

Acceptance criteria

  • Alert is defined for the Logstash entry "hCaptcha is unavailable, falling back to FancyCaptcha"
  • Alert is defined for when the isAvailable() Grafana panel drops to 0

Event Timeline

This looks good to me, thank you! @colewhite would you be able to also set up an alert for the Logstash entry "hCaptcha is unavailable, falling back to FancyCaptcha"?

@ssingh do you have any requests for additional alerts based on the Grafana dashboard?

[Adding Raine @kamila as well.]

Additional thoughts:

IIUC, hcaptcha.execute() errors seem like another useful signal. It's rather catch-all, so it'd need to be routed to both dev and SRE, but seems good to have. Or would that overlap with the fallback?

As for latency alerts, those would likely be coming from hCaptcha, so they would likely not be actionable by us and thus I don't think they'd be valuable.

We could alert on hcaptcha.execute() errors, but I think we would have to set a pretty high threshold (e.g. > 100 in the last hour). Generally, a low number of these are expected.

You're right, and on second thought, I'm not sure it's a good idea. Giving the internet an unintentional Klaxon is probably bad :D

@colewhite would you be able to also set up an alert for the Logstash entry "hCaptcha is unavailable, falling back to FancyCaptcha"?

From the code, it looks like that log entry hinges on $services->getService( 'HCaptchaEnterpriseHealthChecker' )->isAvailable(). Since we have access to this code, I'd like to propose a change to the ConfirmEdit extension to back this alert.

Change #1198151 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: enhance isAvailable tracking

https://gerrit.wikimedia.org/r/1198151

Change #1198151 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: enhance isAvailable tracking

https://gerrit.wikimedia.org/r/1198151

Dreamy_Jazz subscribed.

It seems there is nothing to review here. Tentatively moving to Done, since I don't know where else this should go.

@colewhite it seems like an alert should have fired given these logs https://logstash.wikimedia.org/goto/6aee202d1a5f1feb5ac0504d4e2dbcd6 , but I do not see one in #psi-alerts.

You're right, an alert did not fire. From the previous alert definition, this hcaptcha failure was considered partial and the alert was tuned for a complete failure (rate <= 0) on the overall state of is_available. This was not taking into account the result label we added.

I've used this event to adjust based on a known failure condition. The alert now fires when: sum(increase(mediawiki_ConfirmEdit_hcaptcha_enterprise_health_checker_is_available_seconds_count{result="false"}[5m])) > 0
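
To illustrate why the earlier rule stayed silent during a partial failure, here is a minimal Python sketch. It is not the actual alerting configuration: the function names, the sample windows, and the exact reading of the earlier rule (fire only when no check in the window reports hCaptcha as available) are assumptions made for illustration. Only the adjusted condition, any result="false" increase in the window, comes from the expression above.

```python
from collections import Counter


def count_results(samples):
    """Count health-check outcomes, keyed like the `result` label ("true"/"false")."""
    return Counter("true" if ok else "false" for ok in samples)


def old_rule_fires(samples):
    # Assumed reading of the earlier rule: fire only on complete failure,
    # i.e. when no check in the window reported hCaptcha as available.
    return count_results(samples)["true"] <= 0


def new_rule_fires(samples):
    # Mirrors the adjusted condition: any result="false" sample in the window.
    return count_results(samples)["false"] > 0


# Hypothetical 5-minute windows of isAvailable() outcomes.
healthy = [True] * 10
partial = [True] * 7 + [False] * 3
total = [False] * 10

for name, window in [("healthy", healthy), ("partial", partial), ("total", total)]:
    print(name, "old:", old_rule_fires(window), "new:", new_rule_fires(window))
```

Under these assumptions, the partial window trips only the adjusted rule, while a total outage trips both.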

I'm going to optimistically mark this as resolved since the events are fairly clear in Prometheus now. Please do reach out if we need to adjust further.