
Investigate and evaluate hCaptcha to replace Wikimedia's Fancy Captcha
Open, High, Public

Description

This is not complete and, as such, should be considered a WIP. Comments/questions and such below are welcome.

Following on from T249854: Add support for hCaptcha, and as a potential solution to T241921: Fix Wikimedia captchas (and the various older incantations).

hCaptcha is an alternative to reCaptcha, without the usual privacy concerns that come with it. Cloudflare is currently in the process of moving from reCaptcha to hCaptcha.

It may still require a change to Wikimedia's Privacy Policy, as it requires loading JS from an external website and submitting data back to them, but hCaptcha's Privacy Policy is seemingly more in line with what we'd want (IANAL, and it would need WMF-Legal review obviously). They're more interested in aggregate data rather than individual data, and try to discard other data as soon as they can.

hCaptcha is offering to donate websites' "earnings" from captchas being solved to the Wikimedia Foundation rather than keeping them for itself. While I imagine this won't solve all of Wikimedia's funding problems, it's nice that we're considered a good solution for the problem. Obviously, there's the potential of this resulting in captcha solves on Wikimedia sites also helping generate income.


The implementation is similar to reCaptcha, selecting images of a certain type etc.

Localisation is done for ~150 languages, and they're planning on open-sourcing the UI translations on GitHub, so there's a chance to expand that further and help support more languages (which is one goal of the Captcha replacement project, T7309: Localize captcha images, though removing the text strings to be identified and typed out does make that task kinda redundant).

There's also a labelling service we could potentially use with MachineVision instead of the Google services. It would potentially be possible to use our own captchas to help label our own images from Commons; somewhat a mix of T87598: Create a CAPTCHA that is also a useful micro edit and T34695: Implement, Review and Deploy Wikicaptcha.

Questions:

  • Does this image matching captcha solution help our Accessibility issues?

Known caveats/issues:

  • No "no JS" solution (currently)
    • Can't serve captcha through API without expecting clients to load JS etc
    • Possibility of specifically allowing bots
  • Browser support versions will differ from ours - https://docs.hcaptcha.com/faq
  • Not FOSS
    • However, Wikimedia can get access to JS source for auditing purposes

Useful links:

Event Timeline

I strongly oppose:

  • It is not free or open-source
  • It is an external service over which we have no control

If we still want to try it:

  • At least we should be able to (or ask them to allow us to) set up a proxy to that service, so that all traffic goes through a Wikimedia server (c.f. it seems it's not possible to create a proxy to reCaptcha) and not directly to hCaptcha; hCaptcha should not have access to users' cookies or IPs
  • Try to make it available in Cloud Services, and replace current usage of reCaptcha (this would be a possible first step, as reCaptcha is currently in use, whether or not it is appropriate - see T196395: utrs.wmflabs.org loads lot of external content)

Thanks for the comments.

This is out of the scope of this task. While I obviously would welcome something like that, the Foundation does not generally maintain "random" tools, so that is somewhat up to the tool maintainers. I left a comment on their task suggesting they remove/replace reCaptcha. I could probably find some time to help them out to some extent, if they wanted. But as the API is very similar to reCaptcha's, it shouldn't require much effort to replace it. And the results could be used to help inform this.

  • At least we should be able to (or ask them to allow us to) set up a proxy to that service, so that all traffic goes through a Wikimedia server (c.f. it seems it's not possible to create a proxy to reCaptcha) and not directly to hCaptcha; hCaptcha should not have access to users' cookies or IPs

There's scope to send the IP on the request from PHP, but it's not required (I think it's used for extra analysis), which helps reduce linking between the JS request and the solving of the captcha.

It should be noted, though, that if we're not giving the service the user's IP, we might as well not even bother trying the alternative captcha solution, as it's not going to make much/any difference. This information is used (and kept for very short periods) to work out whether the requests are legit. Just solving the captcha (successfully) doesn't give the service enough information as to whether you're a bot or not. They use stats like how many requests/solves of captchas are coming from a particular source, etc.
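For reference, the server-side half of the flow described above can be sketched as follows. Python is used here as a stand-in for the PHP that MediaWiki would actually use; the endpoint and field names follow hCaptcha's public siteverify API, while the helper names are illustrative:

```python
# Sketch of server-side captcha verification against hCaptcha's
# documented "siteverify" endpoint. Passing the client IP ("remoteip")
# is optional, which is what enables the linkage-reduction discussed
# above - verification still works without it, at some cost to
# hCaptcha's risk analysis.
import json
import urllib.parse
import urllib.request

def build_siteverify_fields(secret, token, remote_ip=None):
    """Form fields for the verification POST; remote_ip may be omitted."""
    fields = {"secret": secret, "response": token}
    if remote_ip is not None:
        # Optional: feeds hCaptcha's backend risk scoring.
        fields["remoteip"] = remote_ip
    return fields

def verify_hcaptcha(secret, token, remote_ip=None):
    """POST the solved-captcha token to hCaptcha; return its verdict."""
    data = urllib.parse.urlencode(
        build_siteverify_fields(secret, token, remote_ip)).encode()
    req = urllib.request.Request("https://hcaptcha.com/siteverify", data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("success", False)
```

The point of splitting out `build_siteverify_fields` is that the IP is just one more optional form field on our outbound request, not something the widget sends on its own.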

Note: T250314: Investigate Privacy Pass for Wikimedia Sites might help here for those more privacy conscious.

  • It is an external service over which we have no control

Neither are the Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. There are probably more too, without digging too deeply. I obviously understand that interaction with those is more optional, whereas a captcha as part of the login flow (and other flows) is not so optional. Also, in those cases, users aren't directly interacting with Google services; they're doing it via a "proxy" app/API. But in those cases, information like the IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

I would also say "no control" is {{cn}}, depending on what you mean by control.

I strongly oppose:

  • It is not free or open-source

So first off, this is not a requirement; while it's something we strive for, it's not essential. See the Wikimedia Foundation Guiding Principles - I know some community members and Foundation staff do think this should be an absolute, but it's currently not the case.

As an organization, we strive to use open source tools over proprietary ones, although we use proprietary or closed tools (such as software, operating systems, etc.) where there is currently no open-source tool that will effectively meet our needs.

Similarly, Open source is a draft, not a policy.

So, it's about finding the best tool for the job: preferring FOSS, but not requiring it.

I will note that hCaptcha are more than happy to provide Wikimedia the full JS source for auditing purposes, which helps alleviate some of the issues of using closed-source software. https://hcaptcha.com/1/api.js doesn't actually seem to be really obfuscated, just the obvious minification for performance reasons.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of required knowledge/experience of how to improve our captcha whilst keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility, and, even more importantly, don't have the time and resources to work on the project in a capacity to make significant headway *at the same time as all the other work we have to do*.

So, in the same way, we don't use coreboot on our servers, and we use proprietary software, switches, and routers (I could continue), because of a lack of appropriate alternatives. And of course, we don't use FOSS hardware; again, for the same reasons. It's just not practical.

See also the Google services (including gApps at the Foundation) mentioned above. Again, a lack of alternatives that effectively meet our needs. Or at least, that was the case at the time of the last evaluation. And moving major services like that requires a lot of time and effort, for potentially little to no gain. That doesn't mean we shouldn't do it, but in a cost/benefit analysis...

And do bear in mind that many community members don't feel as strongly as you do (or, in many cases, even care). How many use Windows? And therefore IE or Edge? Mac? Safari? iPhone? Non-free drivers and binaries on Linux systems? In some cases they're forced to (work machines etc.), but many by choice. Granted, that's consuming resources using non-FOSS, but it's a similar vein of argument.

I'm not saying we're going to use hCaptcha for definite. Maybe we will, maybe we won't. But evaluating options that don't fit the FOSS bill (as has happened for reCaptcha - if it were literally the only solution, we would've found a way to make it work) is something we should be doing as part of due diligence, in an attempt to unblock the process. Very much a case of "where there is currently no open-source tool that will effectively meet our needs". Do we want to be in the same position with our Captcha in 1, 5, 10 years' time? Probably not. It also doesn't have to be a permanent solution. If we find something better down the road, we can definitely switch.

Noting this is an effort to try and help our overworked Stewards and global and local sysops, by having something that helps stop spam from even happening in the first place.

Neither are the Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. There are probably more too, without digging too deeply. I obviously understand that interaction with those is more optional, whereas a captcha as part of the login flow (and other flows) is not so optional. Also, in those cases, users aren't directly interacting with Google services; they're doing it via a "proxy" app/API. But in those cases, information like the IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit, and we have plenty of alternatives. Also, the requests go through proxies.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of required knowledge/experience of how to improve our captcha whilst keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility, and, even more importantly, don't have the time and resources to work on the project in a capacity to make significant headway *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

So, in the same way, we don't use coreboot on our servers, and we use proprietary software, switches, and routers (I could continue), because of a lack of appropriate alternatives. And of course, we don't use FOSS hardware; again, for the same reasons. It's just not practical.

But we do have control over our servers. We do not have control over hCaptcha's. At the least, it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the Captcha should work without them.

And do bear in mind that many community members don't feel as strongly as you do (or, in many cases, even care). How many use Windows? And therefore IE or Edge? Mac? Safari? iPhone? Non-free drivers and binaries on Linux systems? In some cases they're forced to (work machines etc.), but many by choice. Granted, that's consuming resources using non-FOSS, but it's a similar vein of argument.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

Across all of Wikimedia there are very few places where external scripts are loaded - content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here, hCaptcha could theoretically inject arbitrary scripts into Wikimedia pages.

We have a current effort to replace any external resources, for privacy concerns. See also T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki

We have a current effort to replace any external resources, for privacy concerns. See also T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki

Yes, I'm aware of this. But CSP has a whitelisting system for this particular kind of issue. CSP is there to stop unwanted and not specifically allowed things from being loaded, not to stop the wanted things that make things work.
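As an illustration of that allow-listing, a policy could explicitly permit hCaptcha's origins while leaving everything else blocked. The origin list below is a sketch based on hCaptcha's public hostnames, not a vetted Wikimedia policy (the header is shown wrapped for readability; it is sent as a single line):

```
Content-Security-Policy:
    default-src 'self';
    script-src  'self' https://hcaptcha.com https://*.hcaptcha.com;
    frame-src   https://hcaptcha.com https://*.hcaptcha.com
```

Any script or frame from an origin not listed here would be refused by the browser, which is exactly the "stop the unwanted, keep the wanted" behaviour described above.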

Across all of Wikimedia there are very few places where external scripts are loaded - content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here, hCaptcha could theoretically inject arbitrary scripts into Wikimedia pages.

And their functionality and data requirements are different.

Yes, hCaptcha could inject arbitrary scripts (hell, we've seen it happen on wikis enough times too; sure, it doesn't always last long, but it happens), either purposefully, or accidentally due to some breach. But that's what contracts are for; if they are breached, there are legal ramifications.

Neither are the Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. There are probably more too, without digging too deeply. I obviously understand that interaction with those is more optional, whereas a captcha as part of the login flow (and other flows) is not so optional. Also, in those cases, users aren't directly interacting with Google services; they're doing it via a "proxy" app/API. But in those cases, information like the IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit, and we have plenty of alternatives. Also, the requests go through proxies.

Again, what they do and how they work are different. Solving the captcha (i.e. the action/work) is only part of the process. Removing information the backend works with, such as the IP, makes the service mostly useless. Please read my original responses.

Also, in most cases, most users will not see a Captcha. Certainly, I imagine long-registered users won't have seen one on Wikimedia (unless creating an additional account, for example) in a long time.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of required knowledge/experience of how to improve our captcha whilst keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility, and, even more importantly, don't have the time and resources to work on the project in a capacity to make significant headway *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

Great. But I've already answered this question. We only have limited time and resources. Your task was also explicitly declined, same as many other ideas where people suggest we should branch out and do X.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of required knowledge/experience of how to improve our captcha whilst keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility, and, even more importantly, don't have the time and resources to work on the project in a capacity to make significant headway *at the same time as all the other work we have to do*.

But we do have control over our servers. We do not have control over hCaptcha's. At the least, it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the Captcha should work without them.

Again, read my answer about how the captcha works. Passing things through our servers removes that useful information, so we might as well not bother.

How much control do we necessarily have over the proprietary firmware etc. on them? How many Intel Management Engine-type exploits are out there? Sure, we can limit that by controlling egress, but that doesn't necessarily remove it completely.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

And in the same way you think that not using a FOSS solution is a big problem, other people do not. I suspect a decent number of people who use Wikipedia don't know what this means, nor do they care. They'll happily use it on other sites, which are doing whatever with their data. That doesn't mean you're wrong, but it certainly doesn't mean you're right either.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

This is explicitly not true; a huge number of businesses (I would argue "almost all", though obviously I don't have any hard statistics to back that up) force their employees to use Windows, for a variety of reasons (it's what the tech support on-hand is familiar with; apps the company relies on were written for Windows and it'd be expensive to update or replace them; the company values paid technical support; etc). You can argue that any or all of these should be non-concerns for any business, but you're screaming into an empty amphitheater in that case. Even ignoring this, pretty much any public computer is going to be Windows just because it has the broadest software support and the general public is by far most likely to already be familiar with it.

Many companies have a volume license of Windows, but that is not the case for the WMF.

Many companies have a volume license of Windows, but that is not the case for the WMF.

@Bugreporter: It is entirely irrelevant what WMF folks use on their machines. Please move off-topic Windows license discussions somewhere else. Thanks!

I don't think they would need the IP address. If all they want are statistics on the number of requests/solves from an IP address, they could be given an HMAC of the IP address with a secret salt, plus probably the AS and country of the IP, since I'm sure that's also part of their risk analysis. They couldn't combine requests from WMF users with those from third parties - Wikimedia sites would be on their own island - but that's the goal. We have a big enough user base that I doubt combining would really be needed. That, plus proxying the actual image loads (and not letting them insert arbitrary JavaScript, but using a known-good copy), I think would work wrt privacy. Still not ideal from a FOSS philosophical POV, though.
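The blinding idea above can be shown concretely. This is a minimal sketch (the function name and salt handling are illustrative, not an agreed design):

```python
# HMAC the client IP with a secret salt so the captcha service can
# still count requests and solves per source without ever learning the
# address itself. The AS and country lookups mentioned above would be
# done separately on our side and passed along as coarse fields.
import hashlib
import hmac

def blind_ip(ip: str, salt: bytes) -> str:
    """Stable, non-reversible per-IP token under a given secret salt."""
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()
```

Because the mapping is deterministic for a fixed salt, per-source statistics still work; rotating the salt severs all accumulated history at once, which is itself a useful privacy lever.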

From an operational perspective, a concern I have is the dependency that is created by using a single vendor for a service like this. If in 5-10 years' time, after several mergers and acquisitions, the captcha provider decided to stop providing the service under acceptable terms for us (e.g. they change their terms and are no longer willing to respect user privacy at all, in order to monetize users), what would we do? It's not like we could stop requiring captchas without an impact. At the very least, the current implementation would have to be kept at an appropriate level, so we could easily fall back to it in such a case (or, simply, if the vendor had an outage).

I don't think they would need the IP address. If all they want are statistics on the number of requests/solves from an IP address, they could be given an HMAC of the IP address with a secret salt.

hCaptcha does indeed support such a paradigm by allowing clients to pass blinded end-user IPs to their backend, where they are isolated from the rest of the statistical reputation-scoring hCaptcha performs within the context of their large pool of client data. I cannot find any public-facing documentation for this feature, but I can confirm that it exists and would be a requirement for any proposed Wikimedia implementation.

They couldn't combine requests from wmf users with those from third parties, wikimedia sites would be on its own island, but that's the goal. We have a big enough user base, that I doubt it combining it would really be needed.

There would be a potential downgrade of the performance of hCaptcha's reputation-scoring relative to their standard implementation, but this would still be a vast improvement over FancyCaptcha, which essentially has none.

That, plus proxying the actual image loads (and not letting them insert arbitrary javascript, but using a known-good copy), I think would work wrt privacy. Still not ideal from a FOSS philosophical POV, though.

hCaptcha provides both first-party hosting and full-proxy options for their primary JavaScript widget and related resources, the latter of which should alleviate all user privacy issues within the context of Wikimedia's current privacy policy. In discussions with hCaptcha, they are also extremely comfortable with Wikimedia/WMF having as much access to relevant source code as possible for audit purposes. As you mentioned, this isn't fully in alignment with certain FOSS philosophies, but is likely the best outcome possible for such a vendor relationship. By contrast, Google currently does not and would likely be unwilling to satisfy any of these requirements with reCaptcha.

From an operational perspective, a concern I have is the dependency that is created by using a single vendor for a service like this. If in 5-10 years' time, after several mergers and acquisitions, the captcha provider decided to stop providing the service under acceptable terms for us (e.g. they change their terms and are no longer willing to respect user privacy at all, in order to monetize users), what would we do? It's not like we could stop requiring captchas without an impact. At the very least, the current implementation would have to be kept at an appropriate level, so we could easily fall back to it in such a case (or, simply, if the vendor had an outage).

This is indeed a concern, and one that the Security-Team addressed within a recent WMF-internal risk assessment. FancyCaptcha (or similar) would need to be maintained to some extent as either a fallback captcha system (in the case of service outages) or as a temporary replacement if hCaptcha's terms and/or ethos ever departed significantly from current expectations. This would all likely be codified via contractual agreements between the WMF and hCaptcha, if this option were to move forward.

sbassett moved this task from Back Orders to Watching on the Security-Team board.

From an operational perspective, a concern I have is the dependency that is created by using a single vendor for a service like this. If in 5-10 years' time, after several mergers and acquisitions, the captcha provider decided to stop providing the service under acceptable terms for us (e.g. they change their terms and are no longer willing to respect user privacy at all, in order to monetize users), what would we do? It's not like we could stop requiring captchas without an impact. At the very least, the current implementation would have to be kept at an appropriate level, so we could easily fall back to it in such a case (or, simply, if the vendor had an outage).

This is indeed a concern, and one that the Security-Team addressed within a recent WMF-internal risk assessment. FancyCaptcha (or similar) would need to be maintained to some extent as either a fallback captcha system (in the case of service outages) or as a temporary replacement if hCaptcha's terms and/or ethos ever departed significantly from current expectations. This would all likely be codified via contractual agreements between the WMF and hCaptcha, if this option were to move forward.

Can hCaptcha allow us to create a custom version of the service that could be hosted on WMF servers? This would significantly reduce the risk of outage and suspension. A non-revocable legal agreement on running the service may also be needed. Note that even with this, it may still be much more controversial than T272111.

Can hCaptcha allow us to create a custom version of the service that could be hosted on WMF servers? This would significantly reduce the risk of outage and suspension. A non-revocable legal agreement on running the service may also be needed. Note that even with this, it may still be much more controversial than T272111.

If hCaptcha were to be implemented within Wikimedia production, part of that process would involve creating a custom service that managed the proxied transmission of fully-anonymized data to hCaptcha for evaluation. Ideally, said service would give us more flexibility in migrating to separate or fallback captcha systems, such as FancyCaptcha, if the need arose. I do not believe there would be a way to avoid sending any data to hCaptcha, as that is not possible with their current architecture. But as previously discussed, there are a number of ways (technical, legal, etc.) to make such transactions as secure and private as possible and fully compliant with the current Wikimedia privacy policy.

Update: this is a fairly interesting blog post from Cloudflare discussing their migration from reCaptcha to hCaptcha. They had many similar concerns over user privacy.