
Improve resilience of hCaptcha API URL loading to transient network issues
Closed, Resolved (Public)

Description

Summary

On page load, we need to know whether hCaptcha is available. We determine that by checking whether the secure-api.js file ($wgHCaptchaApiUrl) can be loaded. Over the last week there were 167 instances where the error count exceeded the threshold we tolerate for a given window of time, which resulted in us switching back to FancyCaptcha.
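To make the failover condition concrete, here is a minimal sketch of a sliding-window error threshold. It is written in Python for illustration only and is not the ConfirmEdit implementation; the class name `FailoverMonitor` and the parameters `max_errors` and `window_seconds` are hypothetical, and the real threshold and window values are not stated in this task.

```python
import time
from collections import deque


class FailoverMonitor:
    """Hypothetical sketch: count apiUrl errors inside a sliding time window."""

    def __init__(self, max_errors, window_seconds, clock=time.monotonic):
        self.max_errors = max_errors          # errors tolerated per window (assumed)
        self.window_seconds = window_seconds  # window length in seconds (assumed)
        self.clock = clock                    # injectable so the logic can be tested
        self.errors = deque()                 # timestamps of recent errors

    def record_error(self):
        """Record one failed attempt to load the API URL."""
        self.errors.append(self.clock())

    def should_fail_over(self):
        """Return True once errors inside the window exceed the tolerated count."""
        cutoff = self.clock() - self.window_seconds
        # Drop errors that have aged out of the window.
        while self.errors and self.errors[0] < cutoff:
            self.errors.popleft()
        return len(self.errors) > self.max_errors
```

With a model like this, a short burst of transient network errors is enough to trip the failover, which is why retrying before counting an error matters.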

Observations

  • We retry the API URL download once, with no delay between the first attempt and the retry. Should we add a delay? Should we retry more than once?
  • "hCaptcha unavailable due to apiUrl errors:" appears within the same couple of seconds on enwiki (three times), zhwiki (twice) and fawiki (once). In theory this should be a global check, and should just need to be calculated on one wiki (and have the result shared across all wikis). So something may be off with our cache implementation.
  • If we could self-host the secure-api.js code (T403829), we should be in much better shape to handle transient network errors, because our secureEnclave.js code already supports retries due to network issues. (There might be a bit more work to do, though.) But if the network link between the proxy and hCaptcha is having issues, then we're put in the risky position of saying that hCaptcha is available when in fact it isn't, which means edits/account creations don't go through.
    • Self-hosting was problematic from a proprietary code point of view, but perhaps we could cache the contents of secure-api.js in memcache for a long period of time?
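The first two observations can be sketched as follows: retry the availability check a configurable number of times with a delay between attempts, and store the result under a single global cache key so that one wiki's check is shared by all wikis. This is a hedged Python illustration, not the ConfirmEdit code. The function names, the cache key, the TTL, and the use of a plain dict in place of a shared cache are all assumptions; the defaults of three attempts and 200 ms are illustrative only.

```python
import time


def check_with_retries(fetch, max_attempts=3, delay_seconds=0.2, sleep=time.sleep):
    """Call fetch() up to max_attempts times, pausing between attempts.

    fetch is any callable that returns True when the API URL loaded and may
    raise OSError on a network failure (hypothetical contract).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if fetch():
                return True
        except OSError:
            pass  # treat a network error as one failed attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # no sleep after the final attempt
    return False


def cached_global_check(cache, fetch, ttl_seconds=60, now=time.time, **retry_kwargs):
    """Return the availability result, reusing a cached value while it is fresh.

    cache is any dict-like store; the key is deliberately not per-wiki so the
    result is computed once and shared (assumed design, key name hypothetical).
    """
    key = "global:hcaptcha:api-url-available"
    entry = cache.get(key)
    if entry is not None and entry[1] > now():
        return entry[0]
    result = check_with_retries(fetch, **retry_kwargs)
    cache[key] = (result, now() + ttl_seconds)
    return result
```

A per-wiki cache key would produce exactly the symptom described above, with each wiki logging its own failure within the same few seconds. Caching a positive result also carries the risk noted in the third observation: hCaptcha may be reported as available after the link has gone down, so the TTL should stay short.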

Acceptance criteria

  • We do not have more than one failover incident per month

Event Timeline

kostajh updated the task description.

Change #1244726 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay

https://gerrit.wikimedia.org/r/1244726

Change #1244726 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay

https://gerrit.wikimedia.org/r/1244726

Change #1246904 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.17] HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay

https://gerrit.wikimedia.org/r/1246904

Change #1246904 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.17] HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay

https://gerrit.wikimedia.org/r/1246904

Mentioned in SAL (#wikimedia-operations) [2026-03-02T08:51:52Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-02T08:57:53Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-02T09:08:01Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]] (duration: 16m 09s)

Change #1255046 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Cache API URL verification to tolerate transient network failures

https://gerrit.wikimedia.org/r/1255046

Change #1255046 abandoned by Kosta Harlan:

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Cache API URL verification to tolerate transient network failures

https://gerrit.wikimedia.org/r/1255046

dom_walden subscribed.

I tested some of this as part of T416817. In the logs, I can see three attempts to access the hCaptcha service, with a reported (not verified) delay of 200ms between attempts.

Dreamy_Jazz subscribed.

This can probably be resolved.