Deal with Google Chrome User-Agent deprecation
Open, Medium, Public

Description

Background

Google Chrome is changing the way it shares user-agents for increased privacy of users. You can read more about it here: https://www.chromestatus.com/feature/5704553745874944

Google Chrome has released Client Hints to provide device information. This first release “is intended to allow for developers to experiment and provide feedback”: https://groups.google.com/a/chromium.org/g/blink-dev/c/-2JIRNMWJ7s/m/u-YzXjZ8BAAJ

Technical practicalities

How it works (simple overview)

  • A user sends a request to our site via their browser (e.g. “show me an article”)
  • Our server sends a response that includes the article and a header that asks the browser to send some user data on the next request
  • If the user makes subsequent requests (e.g. “show me another article” or “show me the editor so I can edit this article”) they will also include this user data
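
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MediaWiki code: `REQUESTED_HINTS`, `add_accept_ch` and `SimulatedBrowser` are hypothetical names, and the particular set of hints requested is an arbitrary choice. A real browser may also decline to send hints it has been asked for.

```python
# Hints the (hypothetical) server asks for. Which hints to request is an assumption here.
REQUESTED_HINTS = ("Sec-CH-UA-Platform-Version", "Sec-CH-UA-Arch", "Sec-CH-UA-Model")


def add_accept_ch(response_headers):
    """Return a copy of the response headers with an Accept-CH header added."""
    headers = dict(response_headers)
    headers["Accept-CH"] = ", ".join(REQUESTED_HINTS)
    return headers


class SimulatedBrowser:
    """Toy stand-in for a browser that honours Accept-CH on later requests."""

    def __init__(self, available_hints):
        # Hints this browser is able and willing to share, keyed by header name.
        self.available_hints = available_hints
        self.opted_in = []

    def build_request_headers(self):
        headers = {"User-Agent": "ExampleBrowser/1.0"}
        for name in self.opted_in:
            if name in self.available_hints:
                headers[name] = self.available_hints[name]
        return headers

    def receive_response(self, response_headers):
        # Remember which hints the site asked for, to send on subsequent requests.
        accept_ch = response_headers.get("Accept-CH", "")
        self.opted_in = [h.strip() for h in accept_ch.split(",") if h.strip()]


browser = SimulatedBrowser({"Sec-CH-UA-Arch": '"x86"'})
first_request = browser.build_request_headers()    # no extra hints yet
browser.receive_response(add_accept_ch({"Content-Type": "text/html"}))
second_request = browser.build_request_headers()   # now carries Sec-CH-UA-Arch
```

In this sketch the first request carries only the User-Agent, and the second carries only the one hint the simulated browser has available, even though three were requested.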

Differences from receiving the user agent string

  • The site asks explicitly for the information, meaning that this can be flagged up to the user
  • The site specifies which information it needs, out of this list
  • Browsers may legitimately decline to send the information (e.g. if considered unnecessary or if the site is asking for too much)
  • If the user only ever sends one request, we will not receive any extra data

Rollout plan

Client hints is an experimental feature on Chrome 84, meaning that the browser will only send client hint data if the user has enabled Experimental Web Platform features (disabled by default).

Google Chrome Stable Version | Stable promotion | What happens then?
Chrome 84 | July 14, 2020 | Sec-CH-UA Client Hints
Chrome 92 | October 6, 2020 | Audit site to understand where migration may be necessary
Chrome 95 | October 19, 2020 | Origin trial to experiment with Client Hints and provide feedback
Chrome 100 | March 29, 2022 | Deprecation trial (opt-in)
Chrome 101 | April 26, 2022 | Reduced Chrome version number rollout
Chrome 107 | October 25, 2022 | Reduced Desktop User-agent string rollout
Chrome 110 | Feb 7, 2023 | Reduced Mobile User-agent string rollout
Chrome 115 | May 2, 2023 | Deprecation trial ends. Everyone receives reduced user-agents

Chrome versions release schedule

Implications on CheckUser

User-agent strings are important pieces of information for checkusers and stewards in their work of detecting and blocking sock accounts. To continue to get that important data, we should implement support for client-hints on our end.

Even with client hints, the fingerprinting data may become unavailable to CheckUser in ways beyond our control (see Differences from receiving the user agent string). This should be discussed with checkusers.

Implications on privacy awareness

By actively asking for data, we expose Wikimedia to scrutiny over when/why we're asking for it. Anti-vandalism is an important reason. The vast majority of requests to our site don't result in making changes stored in CheckUser.

Fingerprinting for fighting vandalism is considered a legitimate but unfortunate use case, and may not always be supported in the future: https://github.com/WICG/ua-client-hints#fingerprinting

Investigations
Further reading

Event Timeline

Thanks all! This is helpful. Since the timeline seems to have changed on Google's end, it gives us some more time to think about what we should do here.

I understand this is a fluid situation and no firm answer can be given to the following question, but is the idea that going forward, we would always send Client Hint headers in our response, or that we would only send them if the request headers indicate we are dealing with a Chromium-based browser?

Also, I don't fully understand how the Client Hint situation works. The browser sends a request, the server sends a response back with the client hint request, the browser sends a new request with more headers, and only then does the server send the actual content? This two-phase response delivery is unusual, is it not?

@Huji, if the extra hints are "accepted" by the server via the "Accept-CH" header in the HTTP response, then they will be included in *subsequent* requests by the same client. No extra hints will be provided in the first request from a given client. If MediaWiki includes the "Accept-CH" header on every response, then at minimum every POST request will have the hints included, as it must have been preceded by a GET (to load the form, get a CSRF token, etc).

The "Accept-CH" header should be included in all responses; it will be ignored by browsers that do not use it. Given that browsers may stop including useful information in User-Agent, we cannot rely on User-Agent to decide whether to send Accept-CH.

This would probably require a schema change to add a new column, which might store all of the new headers in a JSON dictionary. Any given HTTP request might include:

  • Only a User-Agent
  • A User-Agent and a Sec-CH-UA
  • A User-Agent, a Sec-CH-UA, and one or more of the more detailed Sec-CH-UA-* headers
  • (way down the road) only Sec-CH-UA and Sec-CH-UA-* headers, with no User-Agent
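
As a rough illustration of the JSON-dictionary idea, the sketch below splits a request's headers into the existing User-Agent string and a JSON blob holding whatever Sec-CH-UA* headers were sent. The function name `build_ua_record` and the field names `user_agent` and `client_hints` are hypothetical and do not reflect any actual CheckUser schema.

```python
import json


def build_ua_record(request_headers):
    """Split request headers into a UA string and a JSON blob of client hints.

    Header names are compared case-insensitively. Either field may be None,
    which covers all four cases listed above.
    """
    normalized = {name.lower(): value for name, value in request_headers.items()}
    hints = {
        name: value
        for name, value in normalized.items()
        if name.startswith("sec-ch-ua")
    }
    return {
        "user_agent": normalized.get("user-agent"),
        # sort_keys gives a stable serialisation, so identical hint sets compare equal.
        "client_hints": json.dumps(hints, sort_keys=True) if hints else None,
    }


# Only a User-Agent: the hints field stays empty.
ua_only = build_ua_record({"User-Agent": "ExampleBrowser/1.0"})

# A User-Agent plus Sec-CH-UA and one detailed hint.
ua_and_hints = build_ua_record({
    "User-Agent": "ExampleBrowser/1.0",
    "Sec-CH-UA": '"Chromium";v="84"',
    "Sec-CH-UA-Arch": '"x86"',
})
```

Keeping the User-Agent in its own field and the hints in a separate dictionary means older rows with only a User-Agent need no migration in this sketch.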

@ST47 the GET/POST distinction was enlightening; thank you!

I'm afraid another solution exists, which is to piece together the Sec-CH-UA headers into a string and concatenate that to the end of the User-Agent string. This won't need a schema change and might be easier to implement. It is far from perfect though: we would be taking structured data and turning it into free text.

I wouldn't recommend it, as we may want that structured data to remain structured for the purposes of filtering. Plus, that field is only 255 characters long.
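
To make the filtering point concrete, here is a small sketch of reading a Sec-CH-UA value into brand/version pairs and matching on them. It assumes the header looks like `"Chromium";v="84", "Google Chrome";v="84"` and uses a simplified regular expression rather than a full structured-field parser; `parse_sec_ch_ua` and `has_brand_version` are hypothetical helper names.

```python
import re

# Matches one `"Brand";v="version"` entry. Simplified: it does not handle
# escaped quotes or other structured-field edge cases.
_BRAND_RE = re.compile(r'"([^"]*)"\s*;\s*v="([^"]*)"')


def parse_sec_ch_ua(value):
    """Return the brand/version pairs found in a Sec-CH-UA header value."""
    return [
        {"brand": brand, "version": version}
        for brand, version in _BRAND_RE.findall(value)
    ]


def has_brand_version(value, brand, version):
    """True if the header value lists exactly this brand and version."""
    return any(
        entry["brand"] == brand and entry["version"] == version
        for entry in parse_sec_ch_ua(value)
    )


header = '"Chromium";v="84", "Google Chrome";v="84"'
entries = parse_sec_ch_ua(header)
matches = has_brand_version(header, "Google Chrome", "84")
```

With the pairs kept as structured data, an exact-match filter like this is a simple comparison. Once the same values are concatenated into a free-text field, the same filter needs substring matching and may be cut off by the length limit.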

Yeah, you have to at least load the edit page (at a minimum, to get the CSRF token). On that response we can request the client hints, which the browser should then include on subsequent requests.

I would prefer to add the headers to every request because someone could use the API to make an edit on any page. For the API to get the client hints, we need to have already requested them (via the page load).

I think it would be wise to move in the direction of structuring this data in the database. I think it would be difficult to reconstruct a UA string from what is provided (i.e. it won't contain things like "Mozilla/5.0" or "WebKit" etc.). This might be a challenge to present in the UI. :)

Last I checked, upstream vendors have backed off from this change as it became clear that it would not improve privacy. The User-Agent header is here to stay, so I don't think we need to change anything here?

It is true that browsers have been and will be reducing the amount of detail provided through the User-Agent header. For example, Apple is no longer exposing information about the device model number to the web. Note, however, that this decision applies no matter where or how the data would be sent (it's not in UA hints either). I think it's safe to assume that anything browsers are comfortable sending through UA hints will be in the UA string as well, at least for the foreseeable future.

As I understand it, the frozen string will continue to reflect device type (desktop-ish vs mobile-ish) and broad browser family (e.g. Gecko/Firefox, Chrome, Safari). It will, however, not reflect the specific browser version, device model, or OS version going forward.

Perhaps in combination with the IP data that suffices for the CheckUser purposes? If not, I think it would be good to also re-evaluate what those purposes are exactly. Most of this was set up over a decade ago; it's quite possible user expectation/intent has drifted by now and that trying to make the same thing work again isn't as useful as it once was.

I haven't used CU in prod, but my limited understanding is that the IP and UA data are used to make a best-effort guess of whether someone might have (un)intentionally switched IPs and created a new account but is in fact the same person based solely on them using a similar enough browser. Is that a fair summary of its purpose?

Perhaps in combination with the IP data that suffices for the CheckUser purposes? If not, I think it would be good to also re-evaluate what those purposes are exactly. Most of this was set up over a decade ago; it's quite possible user expectation/intent has drifted by now and that trying to make the same thing work again isn't as useful as it once was.

Based on the mocks in T237593 and the investigation I performed in T175587, I would say that the current User Agent does not suffice for the purposes of CheckUser and arguably has never done so. Reducing the information that can be retrieved seems to be the opposite of the direction we're heading in.

As I mentioned in T258105#6313730, storing this data in a structured manner would effectively gain us the feature we were looking for in T175587 for "free".

Regardless, it appears (from what I can tell) that checkusers on our wikis are looking for better ways to surface insights from the data. Reducing the amount of data collected appears to make achieving that goal more difficult. While we may not need to surface the exact browser version or OS version, being able to surface which users have the exact same version seems really helpful, and may become even more important when anonymous editors' IP addresses are no longer exposed (see T133452).

@Niharika does that seem accurate to you? Please correct me if I'm wrong. :)

@dbarratt Thanks for that summary. Seems accurate to me.

Stewards and checkusers have been talking about this project for a while (ever since Google's initial announcement). Trust-and-Safety requested that the Anti-Harassment Tools team pick up this project, given the importance of this work in our anti-vandalism workflows.

Niharika triaged this task as Medium priority. Jul 16 2020, 9:06 PM
Niharika updated the task description.

I think a decision for Wikipedia to start actively fingerprinting is a really significant one, and should be driven by the need for that data in order to carry out anti-vandalism work. I think it would be helpful to have the discussion in T242825#6313858 publicly documented.

We have a really helpful perspective in T242825#6000120 and it would be great to hear from more check users and Trust-and-Safety experts on this task. In particular:

  • Do check users need any of the client hint information?
  • Do check users need all of the client hint information (e.g. full browser version, operating system version, CPU architecture)?

To be completely frank, I want anyone questioning this decision in the future from a privacy perspective to be able to see that it came from a clear need from our anti-vandalism experts rather than a judgement from the individuals on AHT, who themselves don't use the CheckUser tool.

That is a thoughtful question. But let me ask you a question in response: given that client hints are not even in use right now, what practical approach do we have to decide which of the client hint information will be useful and which will not, in terms of CU use cases?

FWIW, in the current state where UAs contain essentially all of those hints in one fat string, I look at all of them.

@dbarratt Thanks for that summary. Seems accurate to me.

Stewards and checkusers have been talking about this project for a while (ever since Google's initial announcement). Trust-and-Safety requested that the Anti-Harassment Tools team pick up this project, given the importance of this work in our anti-vandalism workflows.

I just wanted to quickly jump in and confirm this. We liaise quite frequently with community groups like the Stewards and other CheckUsers who consistently express concerns about the potential loss of this data, especially given this may jeopardise their existing anti-vandalism workflows.

Task description has been updated with more context following further discussions with AHT.

I'm flagging this for Analytics. This deprecation will probably impact how device classification for browsers works in a bunch of our stats tools, like this one.

@ovasileva @SCherukuwada @Jdlrobson @SWakiyama @CBogen @MarkTraceur @DVrandecic @CBlanton @Jdforrester-WMF probably good to have in at least a watching column. I got a ping to raise awareness. Others such as TProgM (hi @LGoto I saw you were already triaging a related task) or Product Analytics (hi @kzimmerman and others on task !) may broach this as well, but doing my part and raising awareness in case there are UX or feature detection or instrumentation pieces requiring attention.

I have updated the task to reflect the latest timelines as published by the Google Chrome team.