Deal with Google Chrome User-Agent deprecation
Open, Medium, Public

Description

Background

Google Chrome is changing the way it shares user-agents for increased privacy of users. You can read more about it here: https://www.chromestatus.com/feature/5704553745874944

Google Chrome has released Client Hints to provide device information. This first release “is intended to allow for developers to experiment and provide feedback”: https://groups.google.com/a/chromium.org/g/blink-dev/c/-2JIRNMWJ7s/m/u-YzXjZ8BAAJ

Technical practicalities

How it works (simple overview)

  • A user sends a request to our site via their browser (e.g. “show me an article”)
  • Our server sends a response that includes the article and a header that asks the browser to send some user data on the next request
  • If the user makes subsequent requests (e.g. “show me another article” or “show me the editor so I can edit this article”) they will also include this user data
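
The exchange above can be sketched in a few lines of Python. This is an illustration only: the `ToyBrowser` class, the placeholder User-Agent value, and the particular hint names requested are assumptions made for the example, not part of any real browser or of MediaWiki. The hint names follow the draft specification and may change.

```python
# Hint names the server asks for; this particular set is an assumption.
REQUESTED_HINTS = ["Sec-CH-UA-Platform", "Sec-CH-UA-Platform-Version", "Sec-CH-UA-Arch"]


def response_headers():
    """Headers a server could attach to a response to ask for hints."""
    return {"Accept-CH": ", ".join(REQUESTED_HINTS)}


class ToyBrowser:
    """Hypothetical client that sends hints only after a server has asked for them."""

    def __init__(self, hints):
        self.hints = hints      # what this browser is willing to share
        self.accepted = set()   # hint names the server has requested so far

    def request_headers(self):
        # Placeholder User-Agent value, not a real frozen string.
        headers = {"User-Agent": "Mozilla/5.0 (frozen)"}
        headers.update({k: v for k, v in self.hints.items() if k in self.accepted})
        return headers

    def receive(self, headers):
        # Remember which hints the server listed in Accept-CH.
        names = headers.get("Accept-CH", "")
        self.accepted.update(n.strip() for n in names.split(",") if n.strip())


browser = ToyBrowser({"Sec-CH-UA-Platform": '"Windows"'})
first = browser.request_headers()    # no hints yet
browser.receive(response_headers())  # server asks via Accept-CH
second = browser.request_headers()   # hints included from now on
```

Here `first` carries only the User-Agent, while `second` also carries `Sec-CH-UA-Platform`. A browser that declines to share a hint would simply leave it out of `second`.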

Differences from receiving the user agent string

  • The site asks explicitly for the information, meaning that this can be flagged up to the user
  • The site specifies which information it needs, out of this list
  • Browsers may legitimately decline to send the information (e.g. if considered unnecessary or if the site is asking for too much)
  • If the user only ever sends one request, we will not receive any extra data

Timeline

Client hints is an experimental feature on Chrome 84, meaning that the browser will only send client hint data if the user has enabled Experimental Web Platform features (disabled by default).

Google Chrome Stable Version | Stable promotion | What happens then?
Chrome 84 | July 14 2020 | Sec-CH-UA Client Hints
Chrome 86 (?) | October 6 2020 | Reduce User Agent string information

Deprecation of the user agent string has been deferred until at least 2021.

Implications on CheckUser

User-agent strings are important pieces of information for checkusers and stewards in their work of detecting and blocking sock accounts. To continue to get that important data, we should implement support for client-hints on our end.

Even with client hints, the fingerprinting data may become unavailable to CheckUser in ways beyond our control (see Differences from receiving the user agent string). This should be discussed with checkusers.

Implications on privacy awareness

By actively asking for data, we expose Wikimedia to scrutiny over when/why we're asking for it. Anti-vandalism is an important reason. The vast majority of requests to our site don't result in making changes stored in CheckUser.

Fingerprinting for fighting vandalism is considered a legitimate but unfortunate use case, and may not always be supported in the future: https://github.com/WICG/ua-client-hints#fingerprinting

This project is being discussed with the Trust-and-Safety and WMF-Legal teams. The Anti-Harassment tools team has been tasked with executing the technical work on this project.

Investigations
Further reading

https://github.com/WICG/ua-client-hints
https://web.dev/user-agent-client-hints/

Event Timeline

Based on the announcement, I imagine this will be delayed.

I am working on an announcement for Checkusers to inform them of this upcoming change. Can someone more familiar with checkuser and useragents explain what the impact of the three specific events (as listed in the table in the task description) would be to the checkuser tool as it stands currently? Assuming we don't do anything special to mitigate this.

All Chrome User-Agent strings will look the same across devices. Having no mitigation is unacceptable from the CU point of view, and would make the CheckUser extension useless given Google Chrome's semi-monopoly.

All Chrome User-Agent strings will look the same across devices.

Can you elaborate on this more?

I'm trying to understand the effects before we make any decisions about mitigation. Above all, I want to make sure I explain this change to checkusers accurately.

Same as in, like this.

This means everyone using Chrome will automatically be set to the above UA regardless of their actual UA; even if you are on Win7, macOS, some Linux flavor, or some BSD, you will get a Win10 UA with a specific frozen Chrome version. The same goes for Android. It seems that Google has yet to set a value for iOS, but expect some patches there as well.

This means all CheckUser data will return the same UA for anyone using Chrome, unless CheckUser consults Google's 'shiny new' alternatives.

@revi I see. Can you also explain what the shiny new alternatives are? Basically what can we do to mitigate the issue?

Can someone more familiar with checkuser and useragents explain what the impact of the three specific events (as listed in the table in the task description) would be to the checkuser tool as it stands currently?

Chrome 81 | Mid March 2020 | Deprecate access to navigator.userAgent

We don't use navigator.userAgent in CU. No effect.

Chrome 83 | Early June 2020 | Freeze browser version and unify OS versions

Chromium-based browsers will report the same browser version (not a big surprise with evergreen browsers that already report the latest version), also report the same OS version (Windows 10, Android 9). A bit less information to work with, particularly in case of mobiles.

Chrome 85 | Mid September 2020 | Unify desktop OS string as a common value for desktop browsers. / Unify mobile OS/device strings as a similarly common value for those at Chrome 85 [For the mobile value, we

Can't distinguish Linux from Windows anymore.

Practically only the IP address and mobile vs. desktop will be used for decision-making after Chrome 83 (early June). Old browsers unaffected, still sending UA.
IP address is and was the primary information (unaffected by these changes), so the CU tool will be still usable, albeit with less certainty.

Note that the spec for the replacement, User-Agent Client Hints, is still in draft.

Basically what can we do to mitigate the issue?

CU is currently a passive logging solution. In the future, determining the browser and OS will become an active step that requires either sending new HTTP headers or running JavaScript code to query the user agent. It is possible that browsers will ask users for permission (opt-in) to share this information, in the form of a setting or a popup similar to how location information is shared.
I think whether MediaWiki will ask for this information depends on how browsers will implement opt-in. After all, the reason to freeze the useragent is exactly to disable this form of fingerprinting (identification) of users.

It seems to me that the core issue here is that we are in a world where the browsers are going to increasingly stop telling web sites so much about their users. This is on top of whatever data regulations may come into effect in various locales.

Things like IP addresses and User Agent strings and device types and such are simply going to no longer be available for a whole host of reasons. For now, we are trying to improve the existing tool so it can be valuable until a time where we devise an entirely new solution.

IP isn’t going away (at least in a similar sense) unless a proxy is used. The underlying servers need it for the internet to work.

Per https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/A4wxFpvqUfA/yBjL6B0QDgAJ it seems the timeline is subject to change, and specifically that they "did not hit the 81 milestone" (presumably for deprecating navigator.userAgent).

I don't know that anyone has done a thorough investigation, but it seems Client Hints will still provide the granularity we need (see https://wicg.github.io/ua-client-hints/#http-ua-hints). I am hoping that in the short term we can just concatenate these values and store the result as if it were the UA, so that we can keep CU in line with how it works now with minimal effort. So try Client Hints first, and if there's no information available, use the UA (which as I understand it isn't going away, it's just being frozen). Of course, if there is a permission dialog then we're in trouble.
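
A minimal sketch of that "Client Hints first, User-Agent as fallback" idea, in Python for illustration only. The function name and the `name=value` joining format are hypothetical; nothing here reflects how CheckUser actually stores data.

```python
def client_description(headers):
    """Return a UA-like string: joined Client Hints if any were sent, else User-Agent."""
    # Collect every Sec-CH-UA* header, sorted so the output is stable.
    hint_names = sorted(n for n in headers if n.lower().startswith("sec-ch-ua"))
    if hint_names:
        return "; ".join(f"{n}={headers[n]}" for n in hint_names)
    # No hints at all: fall back to the (possibly frozen) User-Agent string.
    return headers.get("User-Agent", "")
```

A request with no hints yields the plain User-Agent value, so older browsers would be recorded exactly as they are today.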

As Reedy notes, Client Hints are a proposed W3C specification, so once that matures other browsers are likely to use the same technology.

Of course, if there is a permission dialog then we're in trouble.

From what I understand, this appears to be a mechanism that makes it easier for extensions like Privacy Badger to send part of the User-Agent, without modifying the actual UA (and thus breaking a lot of websites). Therefore it seems like the permission is somewhat global (as in, for all sites you visit in your browser) at this point. But to your point, that could change in the future. I guess we'll cross that bridge when/if we come to it.

I am working on an announcement for Checkusers to inform them of this upcoming change. Can someone more familiar with checkuser and useragents explain what the impact of the three specific events (as listed in the table in the task description) would be to the checkuser tool as it stands currently? Assuming we don't do anything special to mitigate this.

To summarize: CheckUser will start reporting inaccurate UAs when this change lands in Chrome, unless we mitigate the problem by implementing the alternative (which, as mentioned, is also subject to change).

Thanks all! This is helpful. Since the timeline seems to have changed on Google's end, it gives us some more time to think what we should do here.

I understand this is a fluid situation and no firm answer can be given to the following question, but is the idea that going forward, we would always send Client Hint headers in our response, or that we will only send it if the request headers indicate we are dealing with a Chromium-based browser?

Also, I don't fully understand how the Client Hint situation works. The browser sends a request, the server sends a response back with the client hint header, the browser sends a new request with more headers, and only then does the server send the actual content? This two-phase response delivery is unusual, is it not?

@Huji, if the extra hints are "accepted" by the server via the "Accept-CH" header in the HTTP response, then they will be included in *subsequent* requests by the same client. No extra hints will be provided in the first request from a given client. If MediaWiki includes the "Accept-CH" header on every response, then at minimum every POST request will have the hints included, as it must have been preceded by a GET (to load the form, get a CSRF token, etc).

The "Accept-CH" header should be included in all responses; it will be ignored by browsers that do not use it. Given that browsers may stop including useful information in User-Agent, we cannot rely on User-Agent to decide whether to send Accept-CH.

This would probably require a schema change to add a new column, which might store all of the new headers in a JSON dictionary. Any given HTTP request might include:

  • Only a User-Agent
  • A User-Agent and a Sec-CH-UA
  • A User-Agent, a Sec-CH-UA, and one or more of the more detailed Sec-CH-UA-* headers
  • (way down the road) only Sec-CH-UA and Sec-CH-UA-* headers, with no User-Agent
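
As a rough sketch of what "all of the new headers in a JSON dictionary" could look like, the Python below keeps only the User-Agent and `Sec-CH-UA*` headers from a request and serializes them. The function names and the lower-cased keys are assumptions for illustration, and the column itself is hypothetical.

```python
import json


def collect_client_headers(headers):
    """Pick out User-Agent and any Sec-CH-UA* headers, keyed by lower-cased name."""
    kept = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered == "user-agent" or lowered.startswith("sec-ch-ua"):
            kept[lowered] = value
    return kept


def to_column_value(headers):
    """Serialize the kept headers as a JSON string for a single (hypothetical) column."""
    return json.dumps(collect_client_headers(headers), sort_keys=True)
```

The same two functions cover all four cases in the list above, because whichever headers are absent are simply missing from the dictionary.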

@ST47 the GET/POST distinction was enlightening; thank you!

I'm afraid another solution exists, which is to piece together the Sec-CH-UA headers into a string and concatenate that to the end of the User-Agent string. This wouldn't need a schema change and might be easier to implement. It is far from perfect though: we would be taking structured data and turning it into free text.

I wouldn't recommend it, as we may want that structured data to remain structured for the purposes of filtering. Plus, that field is only 255 characters long.
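
To make the length concern concrete, here is a small Python sketch of the concatenation approach with a check against the 255-character field mentioned above. The `name=value` format and the function names are assumptions, and the check counts characters, which may differ from how the real field measures length.

```python
FIELD_LIMIT = 255  # length of the existing field, as stated in the thread


def concatenated(user_agent, hints):
    """Append each hint as 'name=value' to the end of the User-Agent string."""
    parts = [user_agent] + [f"{k}={v}" for k, v in hints.items()]
    return " ".join(parts)


def fits(user_agent, hints):
    """True if the concatenated string would fit in the existing field."""
    return len(concatenated(user_agent, hints)) <= FIELD_LIMIT
```

A typical User-Agent string is already over a hundred characters, so a handful of hints could plausibly push the combined value past the limit.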

Yeah, you have to at least load the edit page (at a minimum, to get the CSRF token). On that response we can request the client hints, which should then be returned on the following request.

I would prefer to add the header to every response, because someone could use the API to make an edit on any page. For the API to get the client hints, we need to have already requested them (via the page load).

I think it would be wise to move in the direction of structuring this data in the database. I think it would be difficult to reconstruct a UA string from what is provided (i.e. it won't contain things like "Mozilla/5.0" or "WebKit" etc.). This might be a challenge to present in the UI. :)
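
For a sense of what the structured form might contain, the Python sketch below pulls brand and version pairs out of a `Sec-CH-UA` value such as `"Chromium";v="84", "Google Chrome";v="84"`. It is a simplified regular-expression approach written for illustration; the header is a structured field, so a real implementation would more likely use a proper structured-field parser. The function name and output keys are hypothetical.

```python
import re

# Matches one list member of the form "Brand";v="Version".
BRAND = re.compile(r'"((?:[^"\\]|\\.)*)"\s*;\s*v="((?:[^"\\]|\\.)*)"')


def parse_brands(value):
    """Return a list of {'brand', 'version'} dicts found in a Sec-CH-UA value."""
    return [{"brand": b, "version": v} for b, v in BRAND.findall(value)]
```

Stored this way, the brand and version become separate values that can be filtered on, rather than fragments of one long string.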

Last I checked, upstream vendors have backed off from this change as it became clear that it would not improve privacy. The User-Agent header is here to stay, so I don't think we need to change anything here?

It is true that browsers have been and will be reducing the amount of detail provided through them. For example, Apple is no longer exposing information about the device model number to the web. However note that that decision applies no matter where or how (it's not in UA hints either). I think it's safe to assume that anything browsers are comfortable sending through UA hints will be in the UA string as well, at least for the foreseeable future.

As I understand it, the frozen string will continue to reflect device type (desktop-ish vs mobile-ish) and broad browser family (e.g. Gecko/Firefox, Chrome, Safari). It will, however, not reflect the specific browser version, device model, or OS version going forward.

Perhaps in combination with the IP data that suffices for the CheckUser purposes? If not, I think it would be good to also re-evaluate what those purposes are exactly. Most of this was set up over a decade ago; it's quite possible user expectation/intent has drifted by now and that trying to make the same thing work again isn't as useful as it once was.

I haven't used CU in prod, but my limited understanding is that the IP and UA data are used to make a best-effort guess of whether someone might have (un)intentionally switched IPs and created a new account but is in fact the same person based solely on them using a similar enough browser. Is that a fair summary of its purpose?

Perhaps in combination with the IP data that suffices for the CheckUser purposes? If not, I think it would be good to also re-evaluate what those purposes are exactly. Most of this was set up over a decade ago; it's quite possible user expectation/intent has drifted by now and that trying to make the same thing work again isn't as useful as it once was.

Based on the mocks in T237593 and the investigation I performed in T175587, I would say that the current User Agent does not suffice for the purposes of CheckUser and arguably has never done so. Reducing the information that can be retrieved seems to be the opposite of the direction we're heading in.

As I mentioned in T258105#6313730, storing this data in a structured manner would effectively gain us the feature we were looking for in T175587 for "free".

Regardless, it appears (from what I can tell) that checkusers on our wikis are looking for better ways to surface insights from the data. Reducing the amount of data collected appears to make achieving that goal more difficult. While we may not need to surface the exact browser version or OS version, being able to surface which users have the exact same version seems really helpful, and may become even more important when anonymous editors' IP addresses are no longer exposed (see T133452).
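
As a rough illustration of "surfacing which users have the exact same version" without showing the version itself, the Python sketch below groups accounts whose structured client values match exactly. The row format, field names, and function name are all hypothetical.

```python
from collections import defaultdict


def group_by_client(rows):
    """Map each distinct set of client values to the accounts that reported it.

    `rows` is an iterable of (account, client_dict) pairs; only sets of values
    shared by more than one account are returned.
    """
    groups = defaultdict(set)
    for account, client in rows:
        # Sort the items so key order in the dict does not matter.
        key = tuple(sorted(client.items()))
        groups[key].add(account)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

The caller only needs to know that two accounts landed in the same group, so the UI could show a match without exposing the underlying values.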

@Niharika does that seem accurate to you? Please correct me if I'm wrong. :)

@dbarratt Thanks for that summary. Seems accurate to me.

Stewards and checkusers have been talking about this project for a while (ever since Google's initial announcement). Trust-and-Safety requested that the Anti-Harassment Tools team pick up this project, given the importance of this work in our anti-vandalism workflows.

Niharika triaged this task as Medium priority. Jul 16 2020, 9:06 PM
Niharika updated the task description.

I think a decision for Wikipedia to start actively fingerprinting is a really significant one, and should be driven by the need for that data in order to carry out anti-vandalism work. I think it would be helpful to have the discussion in T242825#6313858 publicly documented.

We have a really helpful perspective in T242825#6000120 and it would be great to hear from more check users and Trust-and-Safety experts on this task. In particular:

  • Do check users need any of the client hint information?
  • Do check users need all of the client hint information (e.g. full browser version, operating system version, CPU architecture)?

To be completely frank, I'm concerned that anyone questioning this decision in the future from a privacy perspective can see that it came from a clear need from our anti-vandalism experts rather than a judgement from the individuals on AHT, who themselves don't use the CheckUser tool.

That is a thoughtful question. But let me ask you a question in response: given that client hints are not even in use right now, what practical approach do we have to decide which of the client hint information will be useful and which will not, in terms of CU use cases?

FWIW, in the current state where UAs contain essentially all of those hints in one fat string, I look at all of them.

@dbarratt Thanks for that summary. Seems accurate to me.

Stewards and checkusers have been talking about this project for a while (ever since Google's initial announcement). Trust-and-Safety requested that the Anti-Harassment Tools team pick up this project, given the importance of this work in our anti-vandalism workflows.

I just wanted to quickly jump in and confirm this. We liaise quite frequently with community groups like the Stewards and other CheckUsers who consistently express concerns about the potential loss of this data, especially given this may jeopardise their existing anti-vandalism workflows.

Task description has been updated with more context following further discussions with AHT.

I'm flagging this for Analytics. This deprecation will probably impact how device classification for browsers works in a bunch of our stats tools, like this one.