
Investigate options for automatic fallback to FancyCAPTCHA
Closed, Resolved, Public

Description

Summary

In the event that the proxy is unreachable, or that hCaptcha is unreachable from the proxy, we need a graceful way to automatically fall back to FancyCAPTCHA. In this task, we should discuss options.

Technical notes

Option 1: Per request roundtrip ping

  1. In ConfirmEdit's Hooks::getInstance(), make a request (with a low timeout limit) to the proxy, or to a specific endpoint behind the proxy (e.g. https://docs.hcaptcha.com/#integration-testing-test-keys). If there's no response within the timeout limit, log a message in Logstash, emit an event to Prometheus, and fall back to FancyCAPTCHA.
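The per-request check above can be sketched as follows. This is a minimal illustration in Python, not the actual ConfirmEdit code; `pick_captcha_class`, the `ping` callable, and the 200 ms timeout are all hypothetical names and values.

```python
# Sketch of Option 1: a per-request health ping with a low timeout.
# All names here are illustrative, not the real ConfirmEdit API.

def pick_captcha_class(ping, timeout_ms=200):
    """Return the CAPTCHA class name to use for this request.

    `ping` is a callable that performs the round trip to the proxy
    (or an endpoint behind it) and raises on timeout/failure.
    """
    try:
        ping(timeout_ms)
        return "HCaptcha"
    except (TimeoutError, ConnectionError):
        # In the real implementation we would also log to Logstash
        # and emit a Prometheus event at this point.
        return "FancyCaptcha"

# Usage with stub pings:
assert pick_captcha_class(lambda timeout_ms: None) == "HCaptcha"

def down(timeout_ms):
    raise TimeoutError("proxy unreachable")

assert pick_captcha_class(down) == "FancyCaptcha"
```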

Option 2: Track errors and successes in Memcache

  • Keep track of request timeouts in a global Memcache key with a low TTL (30 minutes?). If there are more than N errors, fall back to FancyCaptcha on page load, and generate alerts for further review and possible manual disabling.
  • Keep track of successful siteverify calls in a global Memcache key with a somewhat higher TTL (3 hours?). If there are zero successes, fall back to FancyCaptcha on page load, and generate alerts for further review and possible manual disabling.
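Option 2 can be sketched like this. A plain dict stands in for Memcache, and the threshold `MAX_ERRORS` is a placeholder for the "N" above; the TTLs are the illustrative values from the bullets.

```python
import time

# Sketch of Option 2: global error/success counters with TTLs.
# `cache` stands in for Memcache; thresholds and TTLs are the
# illustrative values from the task description.

ERROR_TTL = 30 * 60        # 30 minutes
SUCCESS_TTL = 3 * 60 * 60  # 3 hours
MAX_ERRORS = 10            # "N" -- placeholder threshold

cache = {}  # key -> (value, expires_at)

def incr(key, ttl):
    now = time.time()
    value, expires = cache.get(key, (0, 0))
    if expires < now:
        value = 0  # expired: restart the count
    cache[key] = (value + 1, now + ttl)

def get(key):
    value, expires = cache.get(key, (0, 0))
    return value if expires >= time.time() else 0

def should_fall_back():
    # Note: a cold cache reads as zero successes, so the real
    # system would need a warm-up or default policy on top of this.
    too_many_errors = get("hcaptcha:errors") > MAX_ERRORS
    no_successes = get("hcaptcha:successes") == 0
    return too_many_errors or no_successes
```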

Acceptance criteria

  • A plan exists for how to automatically fall back to FancyCAPTCHA when needed

Event Timeline

Based on some imperfect tests from the deployment server, I think we'd be looking at adding anywhere from ~10ms (if we just ping /healthz on the proxy) to ~60ms (if we go through the proxy to hCaptcha to see if secure-api.js can be fetched). Note that we may end up self-hosting secure-api.js (T403829) in which case we'd need to roundtrip to api.js on hCaptcha's server, and that seems to be more like ~60-120ms.

From the MediaWiki side, I was thinking that in Hooks::getInstance(), ConfirmEdit can invoke a new hook, onConfirmEditGetCaptchaClassFallback( &$className ). In operations/mediawiki-config, we can implement this hook and do a request to hCaptcha via the proxy, with an upper limit of 200 (?) ms. If the request fails or times out, then we log an error in Logstash, and we change $className from HCaptcha to FancyCaptcha.

Since Hooks::getInstance() is called multiple times per request, we'd need to make sure we're doing the uptime check to hCaptcha a single time in a request, and for each subsequent invocation of ::getInstance(), load a cached FancyCaptcha instance. That should be doable, but we would need to modify some of the logic in ::getInstance().
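The once-per-request requirement amounts to memoizing the result of the health check. A minimal sketch (the class and method names are illustrative stand-ins for the real ::getInstance() logic):

```python
# Sketch of doing the uptime check a single time per request:
# ::getInstance() is called multiple times, so we memoize the
# chosen CAPTCHA instance. Names are illustrative, not the
# actual ConfirmEdit code.

class CaptchaFactory:
    def __init__(self, health_check):
        self._health_check = health_check  # callable -> bool
        self._instance = None

    def get_instance(self):
        if self._instance is None:
            if self._health_check():
                self._instance = "HCaptcha instance"
            else:
                self._instance = "FancyCaptcha instance"
        return self._instance

calls = []
def check():
    calls.append(1)
    return False  # simulate hCaptcha being unreachable

factory = CaptchaFactory(check)
factory.get_instance()
factory.get_instance()
# The health check ran only once despite two getInstance() calls.
assert len(calls) == 1
assert factory.get_instance() == "FancyCaptcha instance"
```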

Another idea, briefly discussed with @jijiki yesterday, would be to keep track of successes and errors in a Memcached key and fall back to FancyCaptcha if the errors are above a certain threshold, or if the successes are below a certain threshold.

Change #1189821 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] WIP hCaptcha: Implement service health checks

https://gerrit.wikimedia.org/r/1189821

My current proposal for an initial version of this is here https://gerrit.wikimedia.org/r/c/1189821, if SRE would like to review and comment.

The proposed implementation would be for the health check maintenance script to run every ~5 minutes, and to set the TTL on the status key to ~10 minutes.
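The maintenance-script approach can be sketched as a cronjob body that writes a status key with a TTL, plus a read path that treats a missing or expired key as "hCaptcha is up" (the default discussed later in this task). The key name and intervals below are illustrative.

```python
import time

# Sketch of the maintenance-script approach: the script runs every
# ~5 minutes and writes a status key with a ~10 minute TTL; readers
# treat a missing/expired key as "assume hCaptcha is up".
# Names and values are illustrative.

CHECK_INTERVAL = 5 * 60   # how often the cronjob runs
STATUS_TTL = 10 * 60      # TTL on the health status key

cache = {}  # key -> (value, expires_at); stands in for memcached

def run_health_check(probe):
    """The cronjob body: probe hCaptcha and record the result."""
    healthy = probe()
    cache["hcaptcha:healthy"] = (healthy, time.time() + STATUS_TTL)

def captcha_class():
    """The read path, called on page load."""
    value, expires = cache.get("hcaptcha:healthy", (None, 0))
    if value is None or expires < time.time():
        return "HCaptcha"  # missing or expired key: assume up
    return "HCaptcha" if value else "FancyCaptcha"
```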

kostajh triaged this task as High priority. Sep 22 2025, 1:14 PM

Change #1190297 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] Hooks: Enable overriding the hook instance

https://gerrit.wikimedia.org/r/1190297

I'm not SRE but @JTweed-WMF said you were looking for someone to review, so I had a look and left some comments on the changes (I'm still looking at the other patches). The general approach makes sense to me and I'm glad that we're planning something like this when introducing a dependency on an external service.


Thanks @matmarex! Does the 5 minute cadence for running the maintenance script and a TTL for the healthcheck key of 10 minutes (T404204#11200831) seem reasonable to you?

It seems reasonable. I don't really know how to choose this value, but this seems low enough that the disruption to the sites would be short, and high enough that it can't possibly cause any weird load on anything. And the TTL naturally must be longer than the interval at which we run the script for it to work.

My concerns with the cronjob approach are the following:

  • Potentially the time it takes to spawn a pod to run the cronjob is longer than the time it takes to run a check
    • upstream latency appears to have a p50 of ~60ms and a p99 of ~500ms
  • 5 minutes is a very long time to assume that hCaptcha is up, and, in a similar manner, that it is down
  • If the memcached node holding this key fails, the key will be lost
    • The cluster will failover pretty quickly to a spare memcached node, but this node will be cold
    • Until the next cronjob, we will be operating using the default (which one?)

I am not a developer, so take the following with a grain of salt. I suggest we give some thought to approaches such as:

  • getWithSetCallback(), to have automatic "regeneration on miss"
    • already has a mechanism to prevent multiple instances from regenerating the data
    • maybe use sister keys too
    • provides useful metrics we could potentially alert on
  • investigate if submitting a job to the jobrunners would make sense
  • maybe using events?
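The getWithSetCallback() idea in the first bullet (regeneration on miss, with protection against multiple instances regenerating at once) can be sketched like this. This is a simplified stand-in, not MediaWiki's actual WANObjectCache API; the lock plays the role of its stampede protection.

```python
import threading
import time

# Sketch of "regeneration on miss" in the spirit of MediaWiki's
# getWithSetCallback(): the first reader that finds the key missing
# regenerates it, while a lock keeps concurrent readers from all
# hitting upstream at once. Simplified stand-in, not the real API.

cache = {}  # key -> (value, expires_at)
lock = threading.Lock()

def get_with_set_callback(key, ttl, regenerate):
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    with lock:
        # Re-check under the lock: another thread may have
        # regenerated the value while we waited.
        entry = cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        value = regenerate()
        cache[key] = (value, time.time() + ttl)
        return value

calls = []
def probe():
    calls.append(1)
    return True

assert get_with_set_callback("hcaptcha:healthy", 600, probe) is True
assert get_with_set_callback("hcaptcha:healthy", 600, probe) is True
assert len(calls) == 1  # regenerated only on the first miss
```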

One more thing to put in the mix: if the endpoint is up, but the latency we measured while doing the check exceeds a threshold we define (based on our SLO, most likely), an option would be to, again, switch to FancyCaptcha.

  • we could put a very short TTL on the "status" key so as to have a shorter interval between this and the next check
  • choose a sensible timeout that will not degrade the user experience
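The latency-based variant described above can be sketched as: measure the check's round-trip time, and fall back not only when the endpoint is down, but also when it is up yet slower than an SLO-derived threshold. The 200 ms threshold below is an illustrative placeholder.

```python
import time

# Sketch of the latency-based fallback: even if the endpoint is up,
# fall back when the measured check latency exceeds an SLO-derived
# threshold. The threshold value is illustrative.

LATENCY_THRESHOLD_MS = 200

def check_with_latency(probe):
    start = time.monotonic()
    try:
        probe()
    except Exception:
        return "FancyCaptcha"  # down: fall back
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > LATENCY_THRESHOLD_MS:
        return "FancyCaptcha"  # up, but too slow: fall back
    return "HCaptcha"

assert check_with_latency(lambda: None) == "HCaptcha"
assert check_with_latency(lambda: time.sleep(0.3)) == "FancyCaptcha"
```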

Lastly, regardless of which path we choose, we should be thoughtful when defining what we will use as the default, i.e. when we are unsure of the availability of hCaptcha.

  • fail open
    • may cause flood of new accounts
  • use fancycaptcha
    • the status quo, we know its ins and outs
  • use hcaptcha
    • we may be keeping users on hold while attempting to query a dead endpoint

My concerns with the cronjob approach are the following:

  • Potentially the time it takes to spawn a pod to run the cronjob is longer than the time it takes to run a check
    • upstream latency appears to have a p50 of ~60ms and a p99 of ~500ms

It would be useful to know how long it takes to start the pod for a cron job. I am not worried about the health checks taking much time once the maintenance script runs in the pod.

  • 5 minutes is a very long time to assume that hCaptcha is up, and, in a similar manner, that it is down

I'm open to changing this, but it seems like a good balance to me in terms of providing value (in automatic fallback, or resuming service) and allowing time for manual response.

  • If the memcached node holding this key fails, the key will be lost
    • The cluster will failover pretty quickly to a spare memcached node, but this node will be cold
    • Until the next cronjob, we will be operating using the default (which one?)

If the memcached key isn't found, we would assume that hCaptcha is up.

I am not a developer, so take the following with a grain of salt. I suggest we give some thought to approaches such as:

  • getWithSetCallback(), to have automatic "regeneration on miss"
    • already has a mechanism to prevent multiple instances from regenerating the data
    • maybe use sister keys too

For this particular approach, I'd like to avoid mixing status updates from the maintenance script (scheduled, predictable) and end user requests (unscheduled / unpredictable). I would be happy to explore adding in production requests to the overall health check system, but think it would be easier to start with something simple, using just the maintenance script for setting status.

  • provides useful metrics we could potentially alert on

The patch I've drafted contains a log message that we can use for alerting.

  • investigate if submitting a job to the jobrunners would make sense
  • maybe using events?

One more thing to put in the mix: if the endpoint is up, but the latency we measured while doing the check exceeds a threshold we define (based on our SLO, most likely), an option would be to, again, switch to FancyCaptcha.

Yes, we could consider that.

Lastly, regardless of which path we choose, we should be thoughtful when defining what we will use as the default, i.e. when we are unsure of the availability of hCaptcha.

  • fail open
    • may cause flood of new accounts
  • use fancycaptcha
    • the status quo, we know its ins and outs
  • use hcaptcha
    • we may be keeping users on hold while attempting to query a dead endpoint

If the hCaptcha/proxy availability check fails, we would fall back to FancyCaptcha but we would not allow a request to proceed without some kind of CAPTCHA check.

For this particular approach, I'd like to avoid mixing status updates from the maintenance script (scheduled, predictable) and end user requests (unscheduled / unpredictable). I would be happy to explore adding in production requests to the overall health check system, but think it would be easier to start with something simple, using just the maintenance script for setting status.

But it's already a pattern we use elsewhere, for example https://github.com/wikimedia/mediawiki-extensions-TorBlock/blob/master/includes/TorExitNodes.php#L70-L91 where we will wait on the request, while we load from an external service.

T229736: Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script suggests we moved away from using maintenance scripts like this in 2019 - rETOR1de471d9feb4: Convert getExitNodes() from $wgMainStash to the WAN cache (and there are other cases if we look around).

You can also inline bumping of the error count when a backend PHP call to hCaptcha returns a 5xx or similar, further augmenting monitoring of upstream services.
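The inline error-count bump can be sketched as: on every backend response from hCaptcha (e.g. a siteverify call), increment the shared error counter used by the health check when the status is a 5xx. Names are illustrative.

```python
# Sketch of inlining the error-count bump: whenever a backend call
# to hCaptcha (e.g. siteverify) returns a 5xx, increment the shared
# error counter that the health check reads. In production this
# would be a memcached incr on a TTL'd key, not a local dict.

errors = {"count": 0}

def record_response(status_code):
    if 500 <= status_code < 600:
        errors["count"] += 1

record_response(200)  # success: not counted
record_response(503)
record_response(502)
assert errors["count"] == 2
```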

Maintenance scripts are predictable until they're not, and they stop running. And they do break in production, as we often see, sometimes for reasons unrelated to MediaWiki. The script can be useful for development purposes, so it could stay to that extent, and it could be useful for other logging/alerting.

It adds complexity and overhead.

MediaWiki expecting a job to be running in the background, without checking whether it has "run recently" (which would add more complexity, so I'm definitely not suggesting that), is not good health checking.

The lack of a key saying there's an error doesn't mean there are no errors; it just means nothing has recorded one. So we could continue to display a broken hCaptcha because the script hasn't run, which puts us in no better a situation than we are in now, where failover is manual.

Doing something like we do in TorBlock means that when we check the status, if it's not up to date, we refresh it.

A simple check like the one being proposed should be pretty quick, and if it's not, well, that too is a sign there's probably an issue.
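The TorBlock-style pattern described here can be sketched as a check-on-read: on each read, if the cached status is stale, refresh it in the request path, and treat a slow refresh itself as a sign of trouble. This is a simplified stand-in (the actual TorBlock code uses MediaWiki's WAN cache); the key name, TTL, and timeout are illustrative.

```python
import time

# Sketch of the TorBlock-style pattern: refresh the health status
# in the request path whenever the cached value is stale, and count
# a slow probe as unhealthy. Simplified stand-in for the WAN-cache
# approach used in TorBlock; names and values are illustrative.

cache = {}  # key -> (value, expires_at)
STATUS_TTL = 600  # seconds

def hcaptcha_is_healthy(probe, timeout_ms=200):
    entry = cache.get("hcaptcha:healthy")
    if entry and entry[1] > time.time():
        return entry[0]  # fresh cached status
    start = time.monotonic()
    try:
        probe()
        # An over-threshold probe counts as unhealthy too.
        healthy = (time.monotonic() - start) * 1000 <= timeout_ms
    except Exception:
        healthy = False
    cache["hcaptcha:healthy"] = (healthy, time.time() + STATUS_TTL)
    return healthy

assert hcaptcha_is_healthy(lambda: None) is True
```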

For this particular approach, I'd like to avoid mixing status updates from the maintenance script (scheduled, predictable) and end user requests (unscheduled / unpredictable). I would be happy to explore adding in production requests to the overall health check system, but think it would be easier to start with something simple, using just the maintenance script for setting status.

But it's already a pattern we use elsewhere, for example https://github.com/wikimedia/mediawiki-extensions-TorBlock/blob/master/includes/TorExitNodes.php#L70-L91 where we will wait on the request, while we load from an external service.

Yeah, I would like to avoid introducing a maintenance script dependency if possible. I'll have a look at handling this in the request path, then.

Change #1190297 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] Hooks: Enable overriding the hook instance per action

https://gerrit.wikimedia.org/r/1190297

For this particular approach, I'd like to avoid mixing status updates from the maintenance script (scheduled, predictable) and end user requests (unscheduled / unpredictable). I would be happy to explore adding in production requests to the overall health check system, but think it would be easier to start with something simple, using just the maintenance script for setting status.

But it's already a pattern we use elsewhere, for example https://github.com/wikimedia/mediawiki-extensions-TorBlock/blob/master/includes/TorExitNodes.php#L70-L91 where we will wait on the request, while we load from an external service.

Yeah, I would like to avoid introducing a maintenance script dependency if possible. I'll have a look at handling this in the request path, then.

I quite agree, which is why I thought we ought to give getWithSetCallback() a go before putting any more effort into a new maintenance job. There's another issue with the maintenance scripts which I should have mentioned earlier: they only run on the primary data centre, meaning that we would only get the upstream status from the main DC.

When we placed the proxy behind a load-balanced service, it by definition became available to traffic reaching both core datacentres. That is also why there was no need for any further action during DC switchover week, since it was depooled along with all the other services.

Doing something like we do in TorBlock means that when we check the status, if it's not up to date, we refresh it.

@Reedy thanks for finding this, it is what I had in mind

Change #1192148 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.45.0-wmf.20] Hooks: Enable overriding the hook instance per action

https://gerrit.wikimedia.org/r/1192148

Change #1192148 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.45.0-wmf.20] Hooks: Enable overriding the hook instance per action

https://gerrit.wikimedia.org/r/1192148

Mentioned in SAL (#wikimedia-operations) [2025-09-30T06:27:20Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]]

Mentioned in SAL (#wikimedia-operations) [2025-09-30T06:33:29Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-09-30T06:42:29Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:1192148|Hooks: Enable overriding the hook instance per action (T405239 T404204)]] (duration: 15m 09s)

Change #1189821 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type

https://gerrit.wikimedia.org/r/1189821

Change #1194666 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.45.0-wmf.22] hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type

https://gerrit.wikimedia.org/r/1194666

Change #1194671 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/mediawiki-config@master] ConfirmEdit/hCaptcha: Implement automatic failover

https://gerrit.wikimedia.org/r/1194671

Change #1194666 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.45.0-wmf.22] hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type

https://gerrit.wikimedia.org/r/1194666

Mentioned in SAL (#wikimedia-operations) [2025-10-09T07:17:17Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-09T07:22:12Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-09T07:29:11Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] (duration: 11m 54s)

Change #1194671 merged by jenkins-bot:

[operations/mediawiki-config@master] ConfirmEdit/hCaptcha: Implement automatic failover

https://gerrit.wikimedia.org/r/1194671

Mentioned in SAL (#wikimedia-operations) [2025-10-09T07:54:33Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-09T07:59:09Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-09T08:07:48Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] (duration: 13m 14s)