Evaluate Cloudflare Turnstile as alternative to FancyCaptcha at Wikimedia
Open, Needs Triage, Public

Description

This is just a suggestion and I hope it will be useful to us all.

https://blog.cloudflare.com/turnstile-private-captcha-alternative/

Turnstile is a proof-of-work CAPTCHA service by Cloudflare. It works by running a series of non-interactive JavaScript tests. It also supports an interactive mode for better bot identification, in which users perform a simple click to prove they are human. Cloudflare describes it as "an invisible alternative to CAPTCHA" and claims that it collects less user data and is more UX- and privacy-friendly than other alternatives.

Pros

  • Minimal user interaction
  • Avoids the culture-gap issue [1], since there is no need to distinguish between images
  • Much better accessibility support (T6845)
  • Easy to translate into the 300+ languages Wikipedia uses, since the only string to translate is something like "Prove you are human"
  • Cloudflare has a generally good record of privacy protection, and they once helped WMF fight off a massive DDoS attack

Cons

  • No non-JS alternative
  • Not open source (there is no better open source alternative, though)
  • Requires loading JS from a third party, which possibly violates the Privacy Policy (a reverse proxy should probably be set up to send only anonymized data to Cloudflare)
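
To make the reverse-proxy idea in the last point more concrete, here is a minimal sketch of the kind of filtering such a proxy might apply before forwarding a request to the vendor. Everything in it is an assumption for illustration: the header allowlist, the function names, and the choice to coarsen the client IP to a /24 or /48 network are hypothetical, and it is not known whether Turnstile would work behind such a proxy. It also only covers network-level metadata; whatever the vendor's JavaScript measures inside the browser is a separate question.

```python
import ipaddress

# Hypothetical allowlist: headers the proxy would still pass to the vendor.
FORWARDED_HEADERS = {"content-type", "content-length", "accept"}


def truncate_ip(ip: str) -> str:
    """Coarsen an address to its /24 (IPv4) or /48 (IPv6) network."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def anonymize_request(headers: dict[str, str], client_ip: str) -> dict[str, str]:
    """Return the headers the proxy would forward upstream.

    Cookies, User-Agent, Referer and anything else not on the allowlist
    are dropped; the client IP is replaced by its truncated form.
    """
    kept = {k: v for k, v in headers.items() if k.lower() in FORWARDED_HEADERS}
    kept["X-Forwarded-For"] = truncate_ip(client_ip)
    return kept
```

Whether a truncated IP counts as sufficiently anonymized is a policy question that the sketch does not answer.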

Helpful links
https://www.cloudflare.com/products/turnstile/
https://developers.cloudflare.com/turnstile/
https://developers.cloudflare.com/turnstile/frequently-asked-questions/

[1] https://news.ycombinator.com/item?id=25226805

Event Timeline

Same concerns as expressed in the main ticket vis-à-vis privacy, and Cloudflare being a for-profit corporation that does not have the best reputation for it.

The website description

we built a platform to test many alternatives and rotate new challenges in and out as they become more or less effective. With Turnstile, we adapt the actual challenge outcome to the individual visitor/browser. First we run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. Those challenges include proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser-quirks and human behavior. As a result, we can fine-tune the difficulty of the challenge to the specific request.

To me, reading a bit between the lines, this sounds primarily browser-fingerprinting based, with a small proof-of-work component that is played up because that is trendy (probably a good thing too; pure proof of work is generally a terrible captcha technology).
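
For readers unfamiliar with the term, a proof-of-work challenge can be sketched roughly as below. This is a generic hashcash-style illustration, not Cloudflare's actual implementation, which is not public; the function names, the use of SHA-256, and the "leading zero bits" difficulty measure are all assumptions chosen for the example.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    as_int = int.from_bytes(digest, "big")
    return len(digest) * 8 - as_int.bit_length()


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    nonce = solve(b"example-challenge", 12)  # about 2**12 hashes on average
    print(nonce, verify(b"example-challenge", nonce, 12))
```

The sketch also shows why proof of work on its own is weak as a captcha: the only cost imposed is CPU time, which a bot operator can buy cheaply while users on slow devices pay the most.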

It sounds like their main privacy arguments are:

  • Cloudflare's business interests don't really depend on violating privacy, so they have no motive, unlike Google
  • Privacy Pass basically lets you skip captchas if you have already solved one previously, without revealing any private data/cookies/etc.

I mean, I suppose that is nice compared to reCAPTCHA, but it's never going to convince a hardcore privacy advocate.

Cloudflare's business interests do depend on violating privacy; the US DHS has commented that the data it has (via their MITM proxy, for example) is "valuable" and offered to purchase it (source).

In T333770#8749150, @EpicPupper wrote:

Cloudflare's business interests do depend on violating privacy; the US DHS has commented that the data it has (via their MITM proxy, for example) is "valuable" and offered to purchase it (source).

This will probably rapidly dove-tail into off-topicness. I was more trying to just summarize what their claims are than to evaluate their truthfulness. I think it goes without saying that they occupy an extremely privileged position in the internet, and could certainly collect and sell much valuable data if they were so inclined. However, what you said doesn't really contradict their statement. They are claiming that they currently do not sell such data as part of their business and as such are less likely to be tempted. That is very different from the claim that they could start a private-data selling side business if they felt like it.

@Tgr This ticket is for evaluating Turnstile's use at Wikimedia websites, not just implementing it in MediaWiki.

Cloudflare has a generally good record of privacy protection, and they once helped WMF fight off a massive DDoS attack

It should be noted that their anti-DDoS Magic Transit worked on the network (BGP) layer, which restricts how much information they can collect, even in theory, even if evil. What we are talking about here is JavaScript, which is a whole different privacy ballgame.

If we are going to resort to device fingerprinting and super cookies as a captcha service, I think the movement would be better served by building its own device reputation service.

Krinkle renamed this task from Evaluate Cloudflare Turnstile as a potential alternative to Wikimedia Fancy Captcha to Evaluate Cloudflare Turnstile as alternative to FancyCaptcha at Wikimedia. Dec 16 2023, 9:33 PM

If we are going to resort to device fingerprinting and super cookies as a captcha service, I think the movement would be better served by building its own device reputation service.

Perhaps, though like many larger projects, the general idea of "improving Wikimedia captchas" has stalled for over a decade now. I don't think it's impossible for a group of volunteers and maybe a WMF engineering team to eventually break that cycle, but it hasn't happened organically. I think there is also, potentially, some middle ground involving a vendor partnership where we don't have to reinvent the wheel at a sizable cost. I would personally be doubtful that such a partnership could exist with Cloudflare, but we did reach some agreements with hCaptcha a couple of years ago where they were theoretically willing to satisfy most of our privacy concerns (controlling and anonymizing via proxy layers any user data sent to them, enforceable contracts around the collection of said data [it would be used and immediately discarded], access to and auditing of any relevant source code on their end, etc.).

I think in the past, efforts have mostly stalled around disagreements between different stakeholders. Essentially, I think captcha efforts need a product manager. I think it's possible for a volunteer to fulfil that role, but traditionally there haven't been a huge number of examples of volunteers fulfilling that type of role in Wikimedia. I suspect that is because to do it effectively you have to know who everyone is, but most of the volunteer devs who are connected enough to do that type of bureaucratic work don't actually like that type of work.

In general, IMO, we have a tendency towards privacy absolutism that doesn't serve us well. Yes, a captcha operator can collect some sensitive data if they are evil, but they probably aren't. A WMF staff member with server access being evil would be both more dangerous and more likely (while still being unlikely; but a single individual is easier to compromise than an entire organization which is not in the user tracking business and whose business success is largely staked on its reputation). The point of risk management is not to have zero risk (there is exactly one way to do that, by shutting the servers down, and then shredding them and burning the remains) but to find the right balance between risks and costs paid for avoiding risks.

That said, in this specific case, we are looking at a set of somewhat overlapping user tracking problems: we want to determine client reputation on the fly (captchas, login throttling, probably useful for some account security measures too), but also after the fact (for sockpuppet investigations), and also track manually assigned reputational flags (blocking). Historically we have relied on IP addresses for the latter two, but that's becoming less and less viable. I don't think a third-party service can fulfill all of those use cases, so there would be value in rolling our own device reputation service.

Yes, a captcha operator can collect some sensitive data if they are evil, but they probably aren't

I mean, that's pretty debatable. Turnstile literally comes with an analytics dashboard. Their privacy policy is super vague (unless I missed something). The language is all "we will not sell your data or use your data to target advertisements", which is great, but they say nothing about what data they do or don't collect. I certainly trust them more than Google reCAPTCHA, but fundamentally, captcha products that aren't actually interactive user challenges are going to involve collecting data.

To be clear, I agree with you generally that viewing risks as absolute things instead of trade-offs is counterproductive. There are quite reasonable arguments that the trade-off in this case would be worth it. I just don't think we should describe it as no user data being collected as long as they are not evil. This would probably be the most user data collected by any vendor Wikimedia uses.

We've been letting the perfect be the enemy of the good for years.

Yes, a captcha operator can collect some sensitive data if they are evil, but they probably aren't

I mean, that's pretty debatable. Turnstile literally comes with an analytics dashboard. Their privacy policy is super vague (unless I missed something). The language is all "we will not sell your data or use your data to target advertisements", which is great, but they say nothing about what data they do or don't collect. I certainly trust them more than Google reCAPTCHA, but fundamentally, captcha products that aren't actually interactive user challenges are going to involve collecting data.

It's also Cloudflare, one of the most fundamental Internet companies around and thus by definition a target of the three-letter spy organisations. The recent revelation that Apple had been required to secretly share push notification data with the US government is a good reminder not to make too many assumptions about data sharing by such large companies.

In T250227 there are concerns about vendor dependency. If possible, can we implement Turnstile and hCaptcha simultaneously, so that if a vendor fails (or stops providing service at an acceptable level) we can switch to another seamlessly?

In T250227 there are concerns about vendor dependency. If possible, can we implement Turnstile and hCaptcha simultaneously, so that if a vendor fails (or stops providing service at an acceptable level) we can switch to another seamlessly?

Perhaps, but that would involve twice the vendor engagement and hoping that both companies would be willing to address all of the reasonable technical and privacy concerns that the WMF and Community might have. What might be a simpler approach (though likely only marginally simpler) is what @Tgr was sort-of suggesting above by building out a Wikimedia-specific reputation-checking service. This would be a fairly complex product and require legitimate engineering stewardship, so bridging the development of such a service might be possible by temporarily leveraging a vendor service, and perhaps even supplementing it with a simpler version of a Wikimedia-specific reputation-checking service. I know that hCaptcha, at least, did allow for its customers to override their IP-based reputation-checking services, if it was known on that customer's end that an IP was not problematic.
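
As a rough illustration of what the "switch to another seamlessly" part of the question could look like on our side, here is a minimal sketch of a failover selector. It assumes that a response token is tied to the vendor whose widget issued it, so the switch has to happen when deciding which widget to render, not at verification time. The class name, the vendor labels, and the failure threshold are hypothetical, and nothing here addresses the vendor-engagement and privacy work described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProviderPool:
    """Ordered list of captcha vendors, each with a simple failure counter."""

    names: list[str]
    max_failures: int = 3  # hypothetical threshold before a vendor is skipped
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, name: str, ok: bool) -> None:
        """Reset the counter on a successful vendor call, increment it on an error."""
        self.failures[name] = 0 if ok else self.failures.get(name, 0) + 1

    def active(self) -> Optional[str]:
        """Return the first vendor still under the failure threshold, if any."""
        for name in self.names:
            if self.failures.get(name, 0) < self.max_failures:
                return name
        return None  # every vendor is failing; the caller decides the policy


if __name__ == "__main__":
    pool = ProviderPool(["turnstile", "hcaptcha"])
    for _ in range(3):
        pool.record("turnstile", ok=False)  # simulate a vendor outage
    print(pool.active())
```

A real deployment would also need time-based recovery instead of waiting for a success that can never arrive once a vendor is skipped, plus a decision on what to do when `active()` returns `None`.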