hCaptcha: Stop using urldownloader for health checks of the secure-api.js file
Open, Medium, Public

Description

Summary

https://wikitech.wikimedia.org/wiki/Url-downloader warns: "Please, never use this functionality to reach out to endpoints belonging to Wikimedia. You should be able to utilize the service mesh/services proxy to reach most endpoints internally in a safer, more reliable and more performing way. If you can't find out how, please reach out to SRE".

We are currently using urldownloader here for health checks on the server-side to verify that MW -> WMF hCaptcha proxy -> hCaptcha's secure-api.js is reachable.

urldownloader currently sits between MW and the WMF hCaptcha proxy; we should use the "service mesh/services proxy" there instead.

Note that usage of urldownloader for POST requests to hCaptcha from MW (here) when verifying a token is correct, and should not be changed. Those requests do not go through the WMF hCaptcha proxy.

Acceptance criteria

  • HCaptchaEnterpriseHealthChecker uses service mesh / services proxy for the secure-api.js health check

Event Timeline

If you can't find out how, please reach out to SRE

I don't know how to do this and need SRE's guidance.

kostajh updated the task description.

I feel like I lack some context here and the documentation (https://wikitech.wikimedia.org/wiki/HCaptcha) is exceptionally sparse.

AIUI checkApiUrl() is supposed to verify whether hCaptcha can be used or whether the code needs to fall back to other captcha methods? This means it needs to check whether the hCaptcha JavaScript can (in theory) be loaded by the user's browser, and it currently does so by fetching the JS (via urldownloader) from the WMF hCaptcha proxy?

We need to know when rendering a page whether hCaptcha is ready and available, and if not, we should fall back to FancyCaptcha. One of the checks is to make sure that secure-api.js is reachable and matches the expected checksum. The request for this goes MW -> urldownloader -> WMF proxy -> js.hcaptcha.com.
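
As a rough illustration of the "reachable and matches the expected checksum" check, here is a minimal Python sketch. It is not the ConfirmEdit implementation: the function name `select_captcha`, the injected `fetch` callable, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Callable


def select_captcha(fetch: Callable[[], bytes], expected_sha256: str) -> str:
    """Pick the captcha mode based on a reachability + checksum check.

    `fetch` is a hypothetical callable that downloads secure-api.js and
    returns its bytes; it is expected to raise OSError on network failure.
    SHA-256 is an assumption; the task only says "expected checksum".
    """
    try:
        body = fetch()
    except OSError:
        # Script not reachable: fall back.
        return "fancycaptcha"
    digest = hashlib.sha256(body).hexdigest()
    # Script reachable but content differs from what we expect: fall back.
    return "hcaptcha" if digest == expected_sha256 else "fancycaptcha"
```

Injecting `fetch` keeps the decision logic independent of whichever transport (urldownloader, service mesh, or something else) ends up carrying the request.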

AIUI, we should remove urldownloader from the equation.

(A related but somewhat tangential task is T403829, but we should probably try to deal with that separately, since it has a bunch of other questions to resolve.)

JMeybohm added a subscriber: jijiki.

From my naive PoV this is something the client should be doing (verifying the hCaptcha JS can be loaded properly and falling back if not), since it can absolutely be that MW is able to fetch the JS while the client is not. But maybe there are restrictions at play that make this infeasible in our setup.
Maybe @Raine and/or @jijiki know more about this.

I'll triage this for next Q

JMeybohm triaged this task as Medium priority. Mar 30 2026, 10:07 AM

FancyCaptcha and hCaptcha have substantially different setup processes, so we need to know, when returning the HTML to the client, whether we are setting up for an hCaptcha or a FancyCaptcha session. More generally, the projects need to be in a given mode (e.g. imagine a bad actor supplying a bogus POST request with a solved FancyCaptcha while claiming that hCaptcha is offline).
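
The "bogus POST" concern can be sketched as follows: the server, not the client, decides which mode is active, and a submission for any other mode is rejected. This is a hypothetical Python illustration; `CaptchaMode` and `accept_submission` are invented names, not part of ConfirmEdit.

```python
from enum import Enum


class CaptchaMode(Enum):
    HCAPTCHA = "hcaptcha"
    FANCYCAPTCHA = "fancycaptcha"


def accept_submission(server_mode: CaptchaMode, submitted_type: str) -> bool:
    """Accept a solution only if it is for the mode the server selected.

    A client claiming "hCaptcha is offline" cannot switch the mode itself,
    because `server_mode` comes from server-side state only.
    """
    return submitted_type == server_mode.value
```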

From my naive PoV this is something the client should be doing (verifying the hCaptcha JS can be loaded properly and falling back if not). Since it can absolutely be that MW is able to fetch the JS while the client is not.

We could split the availability checks into two parts. One check can be that the WMF proxy is up and receiving traffic. That does not necessarily need to be done via MW.

The second check is that hCaptcha's secure-api.js file is downloadable. That also doesn't need to happen via MW, but we probably can't rely entirely on client reported errors here either (since someone could easily fake a bunch of those to make it look like hCaptcha is unavailable).

I'll triage this for next Q

This issue has been causing substantial instability since late December, so the sooner we can debug what is going on, the better.

It seems like the simplest way forward with the existing setup is to figure out how to reach the WMF proxy via the service mesh instead of using urldownloader, but I could not find documentation on how to do this (perhaps the warning at the top of https://wikitech.wikimedia.org/wiki/Url-downloader should link to some examples).

Having discussed this a bit more with @Dreamy_Jazz just now, we're going to drop the MW -> urldownloader -> proxy -> secure-api.js check entirely. Rationale:

  • The secure-api.js sits on a CDN provided by Cloudflare, so checking for its availability on its own is not very meaningful
  • the connection between MW and the proxy could be measured by querying the /healthz endpoint from MW, and this could form part of the HCaptchaEnterpriseHealthChecker checks. But we could also not do this, given that the user doesn't go from MW -> proxy, but instead from proxy -> hCaptcha
  • the proxy health could be measured by implementing anomaly alerts on the req/s, perhaps
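
To make the "anomaly alerts on the req/s" idea concrete, here is one possible shape of such a check, sketched in Python. The z-score approach, the threshold of 3.0, and the name `is_anomalous` are assumptions for illustration; a real alert would more likely be defined in the monitoring stack than in application code.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag `current` req/s if it deviates strongly from recent history.

    Uses a simple z-score against the population standard deviation of
    `history`. The threshold is an assumed, tunable value.
    """
    if len(history) < 2:
        # Not enough data to judge.
        return False
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

For example, with a history hovering around 100 req/s, a sudden drop to 0 would be flagged, while 101 would not.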

We can skip the check, I guess, but I have two thoughts:

  • The urldownloader issue still needs to be resolved, given that it still sits in between any requests from MW -> CDN (or the proxy VMs in this case). Even if we move the check away from secure-api.js, that communication path still needs to exist for all other communication, so we have to do it properly. (I don't know how to.)
  • We are observing network issues along this path. Even if we move the health check, we should still figure out what is causing those issues (beyond a certain expected limit) and work on resolving them. Given that we can't see anything on the proxy VMs and that Cloudflare issues, while they do happen, are still not that frequent, we should spend some time identifying them.

I was hesitant to suggest something (adding service mesh) just for the sake of health checking, but what I read into @ssingh's comment is that MW needs to fetch something from the hcaptcha proxy which is not just health checks. What I don't understand rn is why those requests are proxied through the urldownloader. Can't we just reach out to the hcaptcha proxy via its discovery record?
If we can, adding service mesh support is as easy as https://wikitech.wikimedia.org/wiki/Envoy#Add_a_new_service_(listener), enabling the new listener in mw deployments and switching to calling the localhost endpoint instead of urldownloader.
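
The difference between the two request paths can be sketched like this in Python. Every concrete value here is a placeholder: `MESH_LISTENER_PORT`, `PROXY_HOST`, and `URL_DOWNLOADER` are hypothetical and do not reflect actual WMF configuration.

```python
from typing import NamedTuple, Optional

# All three values are hypothetical placeholders, not real configuration.
MESH_LISTENER_PORT = 6999
PROXY_HOST = "hcaptcha-proxy.example"
URL_DOWNLOADER = "http://url-downloader.example:8080"


class Route(NamedTuple):
    url: str                    # URL that MW would request
    http_proxy: Optional[str]   # forward proxy to use, if any


def route_for(path: str, use_mesh: bool) -> Route:
    """Describe how MW would reach the hCaptcha proxy for `path`."""
    if use_mesh:
        # Service mesh: talk to the local Envoy listener, no forward proxy.
        return Route(f"http://localhost:{MESH_LISTENER_PORT}{path}", None)
    # Current setup: request the proxy's hostname through urldownloader.
    return Route(f"https://{PROXY_HOST}{path}", URL_DOWNLOADER)
```

With the mesh, the caller only needs the local port; routing to the backend is handled by the listener configuration rather than by the application.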

I was hesitant to suggest something (adding service mesh) just for the sake of health checking, but what I read into @ssingh's comment is that MW needs to fetch something from the hcaptcha proxy which is not just health checks.

At a more basic level, MW doesn't need to do this. We decided to do that as a "health check" (T404204). But clients don't actually go from MW -> proxy/CDN -> hCaptcha, they go from client -> proxy/CDN -> hCaptcha.

So, what we will do is:

  • remove the existing health check that verifies secure-api.js is available
  • follow up with a new health check that ensures the proxy is running, probably by calling /healthz on the proxy (and this will require the Envoy service listener)
  • not worry about whether secure-api.js is available, because that is a static file hosted on a Cloudflare CDN. Longer term, we'd really prefer to just self host that as well (T403829)
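
A follow-up health check of the kind described in the second bullet might look roughly like the Python sketch below. It assumes /healthz answers with HTTP 200 when the proxy is up, which this task does not state; `proxy_is_healthy` and the injectable `opener` parameter are invented for the example, and the real check would live in HCaptchaEnterpriseHealthChecker (PHP).

```python
import http.client
import urllib.request
from typing import Callable


def proxy_is_healthy(base_url: str, timeout: float = 2.0,
                     opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if GET {base_url}/healthz answers with HTTP 200.

    `base_url` would be the local service mesh listener (an assumption).
    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and urllib's URLError/HTTPError
        # (both are OSError subclasses), plus malformed HTTP responses.
        return False
```

Treating every failure as "unhealthy" keeps the caller simple; whether a single failed probe should flip the captcha mode is a separate policy question.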

Change #1266992 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Remove apiUrl health check from HCaptchaEnterpriseHealthChecker

https://gerrit.wikimedia.org/r/1266992