Investigate and evaluate hCaptcha to replace Wikimedia's Fancy Captcha
Open, Needs Triage · Public

Description

This is not complete and, as such, should be considered a WIP. Comments and questions below are welcome.

Following on from T249854: Add support for hCaptcha, and as a potential solution to T241921: Fix Wikimedia captchas (and the various older incarnations).

hCaptcha is an alternative to reCaptcha, without the usual privacy concerns that come with it. Cloudflare are currently in the process of moving from reCaptcha to hCaptcha.

It may still require a change to Wikimedia's Privacy Policy, as it requires loading JS from an external website and submitting data back to them, but hCaptcha's Privacy Policy is seemingly more in line with what we'd want (IANAL, and it would need WMF-Legal review, obviously). They're more interested in the aggregate data rather than individual data, and try to discard other data as soon as they can.

hCaptcha are offering websites the option of donating their "earnings" from captchas being solved to the Wikimedia Foundation rather than keeping them for themselves. While I imagine this won't solve all of Wikimedia's funding problems, it's nice that we're considered a good solution for the problem. Obviously, there's the potential of this resulting in captcha solves on Wikimedia sites also helping generate income.

The implementation is similar to reCaptcha, selecting images of a certain type etc.

Localisation is done to ~150 languages, and they're planning on open-sourcing the UI translations on GitHub, so there is a chance to expand that further and to help support more languages (which is one goal of the Captcha replacement project, T7309: Localize captcha images, though removing the text strings to be identified and typed out does make that task kinda redundant).

There's also a labelling service we could potentially use with MachineVision instead of the Google services. It would potentially be possible to use our own captchas to help label our own images from Commons, somewhat a mix of T87598: Create a CAPTCHA that is also a useful micro edit and T34695: Implement, Review and Deploy Wikicaptcha.

Questions:

  • Does this image matching captcha solution help our Accessibility issues?

Known caveats/issues:

  • No "no JS" solution (currently)
    • Can't serve captcha through API without expecting clients to load JS etc
    • Possibility of whitelisting bots
  • Browser support versions will differ from ours - https://docs.hcaptcha.com/faq
  • Not FOSS
    • However, Wikimedia can get access to JS source for auditing purposes

Useful links:

Event Timeline

Reedy created this task. · Apr 14 2020, 8:31 PM
Restricted Application added a subscriber: Aklapper. · Apr 14 2020, 8:31 PM
Reedy updated the task description. · Apr 14 2020, 11:12 PM
Reedy updated the task description. · Apr 14 2020, 11:20 PM
sbassett added a subscriber: sbassett.
Bugreporter added a subscriber: Bugreporter. · Edited · Apr 15 2020, 11:39 AM

I strongly oppose:

  • It is not free or open-source
  • It is an external service over which we have no control

If we still want to try it:

  • At least we should be able to (or ask them to allow us to) set up a proxy to that service, so that all traffic goes through a Wikimedia server (cf. it seems not possible to create a proxy to reCaptcha) and not to hCaptcha directly; hCaptcha should not have access to the user's cookies or IP
  • Try to make it available in Cloud Services, replacing the current usage of reCaptcha (this would be a possible first step, as reCaptcha is currently in use there, whether or not that is appropriate - see T196395: utrs.wmflabs.org loads lot of external content)
Reedy added a comment. · Edited · Apr 16 2020, 1:31 PM

Thanks for the comments.

This is out of the scope of this task. While I obviously would welcome something like that, the Foundation does not generally maintain "random" tools, so that is somewhat up to the tool maintainers. I left a comment suggesting it on their task to remove/replace reCaptcha. I could probably find some time to help them out to some extent, if they wanted. But as the API is very similar to reCaptcha, it shouldn't require much effort to replace it. And the results could be used to help inform this.

  • At least we should be able to (or ask them to allow us to) set up a proxy to that service, so that all traffic goes through a Wikimedia server (cf. it seems not possible to create a proxy to reCaptcha) and not to hCaptcha directly; hCaptcha should not have access to the user's cookies or IP

There's scope to send the IP on the request from PHP, but not required (I think it's used for extra analysis). Which helps reduce linking between the JS request and the solving of the captcha.
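To make the optional IP concrete, here is a minimal sketch of the server-side verification request body, written in Python rather than PHP purely for illustration. The endpoint and the field names (`secret`, `response`, `remoteip`) are as I understand them from hCaptcha's documentation and should be checked there before relying on them; `build_siteverify_body` is a hypothetical helper name, not part of any library.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed verification endpoint; confirm against hCaptcha's docs before use.
SITEVERIFY_URL = "https://hcaptcha.com/siteverify"


def build_siteverify_body(secret: str, token: str,
                          remote_ip: Optional[str] = None) -> str:
    """Build the form-encoded POST body for the verification call.

    `secret` is the site's secret key and `token` is the value the
    widget submitted with the form. `remote_ip` is only included
    when the caller chooses to share the user's IP.
    """
    fields = {"secret": secret, "response": token}
    if remote_ip is not None:
        fields["remoteip"] = remote_ip
    return urlencode(fields)
```

Calling `build_siteverify_body(secret, token)` leaves the IP out entirely; passing a third argument opts in to sharing it, which is the trade-off being discussed here.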

It should be noted, though, that if we're not giving the service the user's IP, we might as well not even bother trying the alternative captcha solution, as it's not going to make much/any difference. This information is used (and kept for very short periods) to work out whether the requests are legit. Just solving the captcha (successfully) doesn't give the service enough information as to whether you're not a bot. They use stats like how many requests/solves of captchas are coming from a particular source etc.

Note, T250314: Investigate Privacy Pass for Wikimedia Sites might help here for those more privacy conscious.

  • It is an external service over which we have no control

Neither are Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. I think there's probably more too, without digging too deeply. I obviously understand interaction with those is more optional, where a captcha as part of the login flow (and other flows) is not so optional. And also in those cases, the users aren't directly interacting with Google Services, they're doing it via a "proxy" app/API. But in those cases, information like IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

I would also say "no control" is {{cn}}, depending on what you mean by control.

I strongly oppose:

  • It is not free or open-source

So first off, this is not a requirement; while it's something we strive for, it's not essential. See Wikimedia Foundation Guiding Principles - I know some community members and foundation staff do think this should be an absolute, but it's currently not the case.

As an organization, we strive to use open source tools over proprietary ones, although we use proprietary or closed tools (such as software, operating systems, etc.) where there is currently no open-source tool that will effectively meet our needs.

Similarly, Open source is a draft, not a policy.

So, it's finding the best tool for the job. Preferring FOSS, but not requiring them.

I will note that hCaptcha are more than happy to provide Wikimedia the full JS source for auditing purposes, which helps alleviate some of the issues of using closed source stuff. https://hcaptcha.com/1/api.js doesn't actually seem to be really obfuscated, just the obvious minification for performance reasons.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge/experience required to improve our captcha whilst getting the benefits of l10n/i18n (which generally is something we do do quite well) and accessibility, and even more importantly don't have the time and resources to work on the projects in a capacity to make significant headway *at the same time as all the other work we have to do*.

So in the same way we don't use coreboot on our servers, and we use proprietary software switches and routers (I could continue), because of lack of appropriate alternatives. And of course, we don't use FOSS hardware; again, for the same reasons. It's just not practical.

See also Google services (including gApps by the Foundation) mentioned above too. Again, lack of alternatives that effectively meet our needs. Or at least, was the case at the time of last evaluation. And the moving of major services like that requires a lot of time and effort, for potentially little to no gain. That doesn't mean we shouldn't do it, but in a cost/benefit analysis...

And do bear in mind many community members don't feel as strongly (or in many cases, even care) as you do. How many use Windows? And therefore IE or Edge? Mac? Safari? iPhone? Non free drivers and binaries on Linux systems? In some cases they're forced to (work machines etc), but many by choice. Granted, it's consuming resources using non FOSS, but it's a vein of a similar argument.

I'm not saying we're going to use hCaptcha for definite. Maybe we will, maybe we won't. But evaluating other options (like has happened for reCaptcha - if it was literally the only solution, we would've found a way to make it work) that don't fit the FOSS bill is something we should be doing as part of due diligence, in an attempt to unblock the process. Very much a case of "where there is currently no open-source tool that will effectively meet our needs". Do we want to be in the same position with our Captcha in 1, 5, 10 years' time? Probably not. It also doesn't have to be a permanent solution. If we find something better down the road, we can definitely switch.

Noting this is an effort to try and help our overworked Stewards, global and local sysops, by having something that helps stop spam even happening in the first place.

Neither are Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. I think there's probably more too, without digging too deeply. I obviously understand interaction with those is more optional, where a captcha as part of the login flow (and other flows) is not so optional. And also in those cases, the users aren't directly interacting with Google Services, they're doing it via a "proxy" app/API. But in those cases, information like IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit and we have plenty of alternatives. Also the requests go through proxies.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge/experience required to improve our captcha whilst getting the benefits of l10n/i18n (which generally is something we do do quite well) and accessibility, and even more importantly don't have the time and resources to work on the projects in a capacity to make significant headway *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

So in the same way we don't use coreboot on our servers, and we use proprietary software switches and routers (I could continue), because of lack of appropriate alternatives. And of course, we don't use FOSS hardware; again, for the same reasons. It's just not practical.

But we do have control over our servers. We do not have control over hCaptcha's. At least it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the Captcha should work without them.

And do bear in mind many community members don't feel as strongly (or in many cases, even care) as you do. How many use Windows? And therefore IE or Edge? Mac? Safari? iPhone? Non free drivers and binaries on Linux systems? In some cases they're forced to (work machines etc), but many by choice. Granted, it's consuming resources using non FOSS, but it's a vein of a similar argument.

Again, nobody requires users to use Windows. But Wikimedia may be going to require (at least new) users to use a non-free third-party service.

In the whole of Wikimedia there are very few places where external scripts are loaded - content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here hCaptcha may theoretically inject arbitrary scripts into Wikimedia pages.

We have a current effort to replace any external resources, for privacy concerns. See also T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki

Reedy added a comment. · Apr 16 2020, 6:27 PM

We have a current effort to replace any external resources, for privacy concerns. See also T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki

Yes, I'm aware of this. But CSP has a whitelisting system for this particular kind of issue. CSP is there to stop unwanted and not specifically allowed things from being loaded, not to stop the wanted things that make things work.
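As a rough illustration of that whitelisting, the sketch below assembles a Content-Security-Policy header value that keeps everything same-origin except for an explicit list of allowed hosts. The directive names are standard CSP; which hosts and directives hCaptcha actually needs is an assumption on my part and would have to come from their documentation, and `build_csp` is a hypothetical helper, not how MediaWiki builds its headers.

```python
from typing import Iterable


def build_csp(allowed_hosts: Iterable[str]) -> str:
    """Return a CSP header value allowing only 'self' plus allowed_hosts.

    The extra hosts are added to the directives a third-party widget
    would typically need: scripts, frames, styles and XHR/fetch calls.
    """
    hosts = list(allowed_hosts)
    directives = {
        "default-src": ["'self'"],
        "script-src": ["'self'"] + hosts,
        "frame-src": ["'self'"] + hosts,
        "style-src": ["'self'"] + hosts,
        "connect-src": ["'self'"] + hosts,
    }
    return "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )


# Hosts listed here are an assumption for illustration only.
header_value = build_csp(["https://hcaptcha.com", "https://*.hcaptcha.com"])
```

Anything not listed stays blocked, so the external script is an explicit, reviewable exception rather than a hole in the policy.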

In the whole of Wikimedia there are very few places where external scripts are loaded - content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here hCaptcha may theoretically inject arbitrary scripts into Wikimedia pages.

And their functionality and data requirements are different.

Yes, hCaptcha could inject arbitrary scripts (hell, we've seen it happen on wikis enough times too; sure, it doesn't always last long, but it happens), either purposefully or accidentally due to some breach. But that's what contracts are for: if they are breached, there are legal ramifications.

Neither are Google services we use for MachineVision, and for Google Translate in Content Translation, along with other services from other companies for translation. I think there's probably more too, without digging too deeply. I obviously understand interaction with those is more optional, where a captcha as part of the login flow (and other flows) is not so optional. And also in those cases, the users aren't directly interacting with Google Services, they're doing it via a "proxy" app/API. But in those cases, information like IP address serves no benefit. A translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit and we have plenty of alternatives. Also the requests go through proxies.

Again, what they do and how they work are different. Solving the captcha (i.e. the action/work) is only part of the process. Removing information the backend works with, such as IP, makes the service mostly useless. Please read my original responses.

Also, in most cases, most users will not see a Captcha. Certainly, I imagine long registered users won't have seen one on Wikimedia (unless creating an additional account for example) in a long time.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge/experience required to improve our captcha whilst getting the benefits of l10n/i18n (which generally is something we do do quite well) and accessibility, and even more importantly don't have the time and resources to work on the projects in a capacity to make significant headway *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

Great. But I've already answered this question. We only have limited time and resources. Your task was also explicitly declined. Same as many other ideas where people suggest we should branch out and do X.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the Captcha problem (unless I and everyone else involved have missed someone posting something that is enlightening and solves the problem). And as is clear from the general lack of progress on our own Captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge/experience required to improve our captcha whilst getting the benefits of l10n/i18n (which generally is something we do do quite well) and accessibility, and even more importantly don't have the time and resources to work on the projects in a capacity to make significant headway *at the same time as all the other work we have to do*.

But we do have control over our servers. We do not have control over hCaptcha's. At least it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the Captcha should work without them.

Again, read my answer about how the captcha works. Passing things through our servers removes that useful information, so we might as well not bother.

How much control do we necessarily have with proprietary firmware etc on them? How many Intel Management Engine type exploits are there out there? Sure, we can limit that by controlling egress, but that doesn't necessarily remove it completely.

Again, nobody requires users to use Windows. But Wikimedia may be going to require (at least new) users to use a non-free third-party service.

And in the same way you think that not using a FOSS solution is a big problem, other people do not. I suspect a decent number of people who use Wikipedia don't know what this means, nor do they care. They'll happily use it on other sites they use, which are doing whatever with their data. Doesn't mean you're wrong, but certainly doesn't mean you're right either.

Again, nobody requires users to use Windows. But Wikimedia may be going to require (at least new) users to use a non-free third-party service.

This is explicitly not true; a huge number of businesses (I would argue "almost all", though obviously I don't have any hard statistics to back that up) force their employees to use Windows, for a variety of reasons (it's what the tech support on-hand is familiar with; apps the company relies on were written for Windows and it'd be expensive to update or replace them; the company values paid technical support; etc). You can argue that any or all of these should be non-concerns for any business, but you're screaming into an empty amphitheater in that case. Even ignoring this, pretty much any public computer is going to be Windows just because it has the broadest software support and the general public is by far most likely to already be familiar with it.

Reedy updated the task description. · Apr 16 2020, 9:40 PM

Many companies have a volume license of Windows, but that is not the case for WMF.

Many companies have a volume license of Windows, but that is not the case for WMF.

@Bugreporter: It is entirely irrelevant what WMF folks use on their machines. Please move off-topic Windows license discussions somewhere else. Thanks!

Florian added a subscriber: Florian.
Reedy moved this task from Incoming to Back Orders on the Security-Team board. · Apr 27 2020, 3:06 PM