Page MenuHomePhabricator

Implement storage for User-Agent Client Hints header data
Closed, ResolvedPublic

Description

This task tracks the work to implement database storage for the User-Agent Client Hints header data.


Original description

Problem
When the Sec-CH-UA-* headers are part of the request, the User-Agent header is unreliable as it may be frozen to a previous version of the browser (See T242825)

Proposed Solution
When Sec-CH-UA is present use that value instead of User-Agent. This is actually a combination of multiple headers. As a stop-gap solution, we'll concatenate the data together and insert into the existing cu_changes.cuc_agent field (later we'll split this into multiple fields).

It can be concatenated into whatever order makes the most sense, but will probably be something like this order:

  1. Sec-CH-UA
  2. Sec-CH-UA-Full-Version
  3. Sec-CH-UA-Platform
  4. Sec-CH-UA-Platform-Version
  5. Sec-CH-UA-Arch
  6. Sec-CH-UA-Model
  7. Sec-CH-UA-Mobile

and I imagine they will be concatenated by ; which is how the info in Sec-CH-UA is conatinated.

NOTE: It's possible that any (or all) of the headers are missing or empty. The empty/missing values should be filtered out before concatenation.
IMPORTANT: The Sec-CH-UA header will contain more than one value in random order, at least one of which will be a randomly generated and completely arbitrary (read: garbage). See https://wicg.github.io/ua-client-hints/#grease

Related Objects

Event Timeline

Reedy renamed this task from Use Sec-CH-UA-* when availble instead of User-Agent to Use Sec-CH-UA-* when available instead of User-Agent.Jul 15 2020, 10:11 PM
Niharika triaged this task as Medium priority.Jul 15 2020, 10:17 PM
Niharika moved this task from Untriaged to Triage/To be Estimated on the Anti-Harassment-Team board.

I would encourage you to store this in addition to the User-Agent, not instead of it. The User-Agent field may remain useful in many circumstances. I understand that it is extremely complex to make a schema change on the WMF cluster. I submit that it is worth doing so, for these reasons:

  • Some browsers may begin to ship with incomplete or buggy support for the Sec-CH-UA features, and discarding the User-Agent in favor of this not-yet-standardized and still-experimental feature may cause us to throw out important information, which would be detrimental.
  • Some users may configure their browser - either through settings or through plugins - to refuse to provide any client hints. During the transition period, even the "frozen" User-Agent string may provide useful distinctions that the Sec-CH-UA string does not, particularly because different browser vendors may "freeze" their User-Agent strings at different times or to different values.
  • CheckUsers are often familiar enough with many sockpuppeteers to have remembered or stored the User-Agent strings that are characteristic of them. We have no experience with Client Hints. Displaying the User-Agent alongside the Client Hints is important to allow us to learn what Client Hints correspond to specific User-Agents, and during the transition period, we will be able to use the User-Agent to compare to historical accounts while beginning to learn the corresponding Client Hints for future comparisons. Making a hard cutoff prevents us from learning these equivalencies, making it difficult to compare accounts created just after MediaWiki's cut-over against accounts that were blocked prior to the cut-over.

Further, the frankly ridiculous history of the User-Agent header provides argument enough to store structured data as structured data - perhaps encoding it to JSON for the database - and not to simply concatenate a series of values into a string:

  • Concatenating using ; would be ambiguous if any of the component fields are allowed to contain ; .
  • Concatenating generally can be ambiguous depending on how you handle Client Hint headers that are absent or empty.

Both of these would inhibit machine parsing and comparison.

Both the User-Agent header and the Client Hints data can be displayed together, perhaps by presenting the User-Agent with an smaller font size and an additional level of indentation directly below the Client-Hints data.

I would encourage you to store this in addition to the User-Agent, not instead of it. The User-Agent field may remain useful in many circumstances. I understand that it is extremely complex to make a schema change on the WMF cluster.

I think you're absolutely correct. It would be better in many ways to start storing both. We've previously talked about parsing User-Agent strings to improve the CheckUser UI in T175587. The new Client-Hints basically gives us this feature for "free" (for browsers that support it). All we have to do is store it in a structured way (and figure out the best way to surface both in the UI).

Instead of storing it in a "structured" blob like you've suggested (which doesn't gain us all that much other than I guess being able to store both), I would actually recommend that we add the following 8 columns to the existing cuc_changes table:

  1. ua_brand
  2. ua_significant_version
  3. ua_full_version
  4. ua_platform
  5. ua_platform_version
  6. ua_architecture
  7. ua_model
  8. ua_mobile
NOTE: I have updated the proposed schema in T258105#6315682

At first I was thinking we could just add these in another table, but since there is a 1:1 relationship between this data and the cuc_changes record, that seems incorrect.

Therefore, the choices are to either add these 8 columns as strings (with the exception of ua_mobile which is a bool) or to create a new table for each one (which would reduce the amount of duplicate data stored/indexed). However, even if we were to create a new table for each one, we'd still need to add 8 reference columns from the cuc_changes table to these 8 tables, so that doesn't seem to gain anything other than a reduction in data storage (if we care about that).

Therefore, the proper way to do this would be to make a schema change and modify the cuc_changes table. However, as you mentioned, this takes a lot of time (though we may have that time given T257893#6313244).

As a stop-gap solution, since the data in CheckUser is only stored for 90 days anyways, we thought shoving it into the cuc_agent field might be a good temporary solution just so that the data stays fresh (if the UAs do in fact get frozen or truncated). However, if that's not an acceptable solution, then I think we should probably move forward with the schema changes.

@Niharika What do you think? Do you think we should implement a temporary stop-gap or do you think we have the time (and resources) to implement a more comprehensive solution?

@dbarratt thanks for the thoughtful analysis. While the idea of normalizing the ua_* columns to save on storage is being discussed, I want to point out that it has an important downside with regard to the retention plan. Those client hints are still private data, and if we store them in a separate table in a normalized fashion, figuring out which one is not used in the last 90 days of data and can/should be deleted might become a more involved task.

My only concern with adding 8 columns is whether we know for sure that these are the right 8 columns. In other words, I'm not sure if the client hints definition is solidified or might evolve between now and when it gets deployed by Chromium-based browsers. I read somewhere that it is a W3C standard, but cannot find such standard. Obviously, we don't want to keep changing our DB schema, especially for large tables like cu_changes.

Based on that last point, I want to suggest a potential third solution: to keep the data in 8 columns of a table, but a separate table (i.e. vertical partitioning of cu_changes). We could call the second table cu_client_hints, with a 9th column called cuch_cuc_id that would join with the cuc_id column in cu_changes. That way, any potential schema changes would happen in a separate table that is smaller in size; the joins should be nearly instantaneous given the 1:1 relationship and at least one of them being a primary key.

My only concern with adding 8 columns is whether we know for sure that these are the right 8 columns. In other words, I'm not sure if the client hints definition is solidified or might evolve between now and when it gets deployed by Chromium-based browsers.

This feature was deployed into the latest stable version of Chrome which is Chrome 84:
https://www.chromestatus.com/feature/5995832180473856

I read somewhere that it is a W3C standard, but cannot find such standard. Obviously, we don't want to keep changing our DB schema, especially for large tables like cu_changes.

Here is the draft specification:
https://wicg.github.io/ua-client-hints/

It is a draft, but seeing that it has already been implemented in Chrome stable, I don't see it changing drastically.

My proposed schema above will need to be adjusted because the ua_brand and us_significant_version can have multiple values as described in https://web.dev/user-agent-client-hints/#user-agent-response-and-request-headers
and more specifically in the spec:
https://wicg.github.io/ua-client-hints/#sec-ch-ua
and this is indeed the case in my testing.

This creates a one-to-many relationship between the cuc_change and the Sec-CH-UA header (which represents the brand and the significant version)

Therefore, it may make the most sense to do something like this:

  1. Alter the existing cuc_changes table by adding the following fields (alternatively, this could be a separate table as @Huji has proposed, but will be distinct from the table below):
    1. cuc_ua_full_version (string)
    2. cuc_ua_platform (string)
    3. cuc_ua_platform_version (string)
    4. cuc_ua_architecture (string)
    5. cuc_ua_model (string)
    6. cuc_ua_mobile (bool)
  2. Create a new table cuc_changes_ua with the following fields:
    1. cuc_ua_id (int)
    2. cuc_id (int)
    3. cuc_ua_brand (string)
    4. cuc_ua_version (string)
dbarratt updated the task description. (Show Details)

Simply inserting rows into cu_changes won't work well now as cu_log_event and cu_private_event now exist which would require these columns. I suggest that using a separate table in a similar manner to the comment table (some kind of comparison method using a hash) would be best.

kostajh renamed this task from Use Sec-CH-UA-* when available instead of User-Agent to Implement storage for User-Agent Client Hints header data.Jun 1 2023, 10:46 AM
kostajh updated the task description. (Show Details)

Simply inserting rows into cu_changes won't work well now as cu_log_event and cu_private_event now exist which would require these columns. I suggest that using a separate table in a similar manner to the comment table (some kind of comparison method using a hash) would be best.

@Dreamy_Jazz do you mind elaborating on this idea–what table schema do you have in mind?

I was advised at Wikimania-Hackathon-2023 that some kind of elastic search method could work (can't remember who suggested this at the moment), especially if it would be possible to search for results by matching client hint data. This was because the SQL queries could get quite complicated/expensive, and elastic search could better handle the data. If a system using elastic search is followed, I suggest that for compatibility the raw client hint data is also stored in the DB using the alternative method for installs without the ability to use elastic search.

The initial idea using tables I have is as follows:

  • A table similar in structure to the log_search table is created that stores each part of the client hint data
    • ls_field would hold the name of each part of client hint data
    • ls_value would store the associated value
    • The ls_log_id would not be added (as this would be done by the other table detailed below)
  • A many-to-many table is created that links each client hint data part to each row in the three tables
    • One row in this many-to-many table would reference one row from cu_changes, cu_log_event or cu_private_event
      • A tinyint column could indicate which table this ID comes from using a integer enum system
      • Alternatively, three columns could be used which are each assigned one of the three tables
    • A given row only references one client hint row in the client hint data table
    • Multiple rows would exist for one edit/log action
      • The exact number would depend on how much the client hint data is split up

The rough diagram for this is shown below (column and table names are not thought out and would probably change):

Untitled Diagram.drawio.png (422×702 px, 25 KB)

To search by client hint data, a query would first find the ID(s) for the matching client hint data and then search the many-to-many table for matching rows. The IDs for the rows in the cu_changes, cu_log_event or cu_private_event tables from the matching many-to-many rows would then be selected for the results.

To display the client hint data, the row IDs would be searched for in the many-to-many table. These rows would then be used to find the matching client hint data, which is then showed in a pre-determined layout.

To purge expired data, when deleting rows from the three tables the associated many-to-many row would also be deleted. The row IDs for the client hint data table stored in these deleted many-to-many rows would be stored and then once the deletion operation is finished a query would check if any still existing many-to-many rows reference the client hint data rows. If not, they would be deleted.

Searching by client hint data is likely to be a highly requested feature. T146837 was filed to search by the user agent string and now that client hint data is structured it could be possible to search using IDs and therefore no need for SQL LIKE queries which should make this relatively fast.

While I am concerned that the size of the many-to-many table could get huge, the table would at maximum store four columns that store IDs. At minimum this could be two columns storing IDs and one that is used as an enum to indicate which of cu_changes, cu_log_event or cu_private_event is referenced. Example:

Untitled Diagram.drawio(1).png (133×162 px, 4 KB)

(Where cur_id references an ID in one of the three result tables, cur_enum is a tinyint or enum column using pre-determined integers to indicate which table the ID comes from and cuh_id indicates the ID of the data in the client hint table).


Alternatively, if searching by specific parts of the client hint data is not going to be added using SQL the client hint data could be combined and saved into the comment table. In theory, comment_data could be used to make this translatable. This would avoid duplication and should not violate the 3 month expiry as the data would no longer be tied to the user after 3 months. However, if the client hint data contained information that would tie it back to a user this could be a problem from a legal standpoint (as it wouldn't be deleted).

Thanks for this @Dreamy_Jazz.

I'm imagining that this structure would allow features like "show all users with the same value for <some client hint>" on Special:Investigate. (It can already add others with the same IP, but we weren't able to implement this for user agent because of our inability to perform an indexed search.)

Would it be possible to allow access to F37092353?

Thanks for this @Dreamy_Jazz.

I'm imagining that this structure would allow features like "show all users with the same value for <some client hint>" on Special:Investigate. (It can already add others with the same IP, but we weren't able to implement this for user agent because of our inability to perform an indexed search.)

One thing I wanted to note is that Chrome will send a "fake" value for one of the "brand" properties as part of its GREASE implementation:

User agents MUST include more than a single value in brands, where one of these values is an arbitrary value.

The document then says "One approach to minimize variance for caching and analytics could be to determine the GREASE parts of the UA set at build time, and keep them identical throughout the lifetime of the user agent's significant version.", but it doesn't specify if that is the approach that Chrome is taking. In other words, it's possible that Chrome adds an arbitrary "brand" value in every request, or only rotates it when upgrading from e.g. Chrome 114 to 115.

Currently, my brands list looks like this:

[
    {
        "brand": "Not.A/Brand",
        "version": "8"
    },
    {
        "brand": "Chromium",
        "version": "114"
    },
    {
        "brand": "Google Chrome",
        "version": "114"
    }
]

so if we are storing each of the brand lists separately, then one would be able to query for entries with e.g. "Google Chrome 114".

Thanks for this @Dreamy_Jazz.

I'm imagining that this structure would allow features like "show all users with the same value for <some client hint>" on Special:Investigate. (It can already add others with the same IP, but we weren't able to implement this for user agent because of our inability to perform an indexed search.)

That's what I was hoping too.

Would it be possible to allow access to F37092353?

Didn't notice it wasn't attached. I have now done that, so it should be visible.

kostajh changed the task status from Open to In Progress.Jun 21 2023, 7:29 AM
kostajh claimed this task.

Change 931875 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/CheckUser@master] schema: Introduce cu_useragent_clienthints table

https://gerrit.wikimedia.org/r/931875

Change 931958 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/CheckUser@master] [WIP] clienthints: Map hints to CheckUser tables

https://gerrit.wikimedia.org/r/931958

Change 931958 abandoned by Kosta Harlan:

[mediawiki/extensions/CheckUser@master] [WIP] clienthints: Map hints to CheckUser tables

Reason:

https://gerrit.wikimedia.org/r/931958

Change 931875 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] clienthints: Introduce database tables to store data

https://gerrit.wikimedia.org/r/931875

Change 938242 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Remove tables.sh and add version for addition of clienthints tables

https://gerrit.wikimedia.org/r/938242

Change 938242 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Remove tables.sh and add version for addition of clienthints tables

https://gerrit.wikimedia.org/r/938242

dom_walden subscribed.

Client-side:

  • Tested saving edits via the source editor, VisualEditor and Discussion Tools to see whether the browser sent a request to the new REST endpoint with client hints data.
  • Tested a number of different browsers including various versions of Chrome, Edge and Opera, Firefox 102 and Safari 15.6.
  • Firefox and Safari do not send a request on save.
  • Chrome, Edge and Opera do (although see T342729).
  • Nothing bad seems to happen when the REST request returns an error (it just fails silently as far as the user can see). This makes sense as I don't think we do anything with the response on the client-side.
  • Also briefly tested saving edits to ProofreadPage, which also sends a request (e.g. Page and Index).

User types:

  • I tested client hints can be sent for anonymous (IP), temporary and named users.

Permissions:

  • You can only save client hints for revisions you have created yourself (or your IP has created, which might belong to different people) and only when the author of the revision is publicly viewable.
  • For example, you cannot save client hints for a suppressed revision. Even a user with the rights to view suppressed revisions cannot save client hints for them. I cannot see why they would need to.
  • You also cannot save client hints for a deleted revision (i.e. revisions of a deleted page)
  • You can save client hints for a revision even if that revision no longer has an associated entry in cu_changes. There does not seem to be a time limit.

DB-side:

  • I only briefly tested whether the client hint data was being saved correctly to the database.
  • I spent more time testing whether the data could be made to be inconsistent (e.g. rows in cu_useragent_clienthints_map without the corresponding rows in cu_useragent_clienthints)
    • I did find one case where data could be made to be stored inconsistently. We are currently investigating this.

Pruning:

  • I did some brief testing of the maintenance/purgeOldData.php script and the pruning job that gets fired in the background when users visit the site.
  • More was done as part of T337943 and T340959.

Test environments:

  • local docker using sqlite database CheckUser 2.5 (02f071c) 05:10, 24 July 2023
  • local bare-metal using mysql database CheckUser 2.5 (02f071c) 05:10, 24 July 2023

Mentioned in SAL (#wikimedia-operations) [2023-07-26T13:05:58Z] <James_F> Created cu_useragent_clienthints.sql and cu_useragent_clienthints_map.sql on testwiki for T258105

Permissions:

  • You can only save client hints for revisions you have created yourself (or your IP has created, which might belong to different people) and only when the author of the revision is publicly viewable.
  • For example, you cannot save client hints for a suppressed revision. Even a user with the rights to view suppressed revisions cannot save client hints for them. I cannot see why they would need to.
  • You also cannot save client hints for a deleted revision (i.e. revisions of a deleted page)

The API call should be done soon enough after that deletion and/or suppression should not be an issue. However, I think that it should be possible to save client hints data even if the revision is deleted. This is just in the case that an edit goes through to a page and then it's deleted just after the edit but before the API request is sent.

  • You can save client hints for a revision even if that revision no longer has an associated entry in cu_changes. There does not seem to be a time limit.

This should be solved in T342134: Limit the ability for other users after the fact storing client hints data using the REST API (as previously discussed)

The API call should be done soon enough after that deletion and/or suppression should not be an issue. However, I think that it should be possible to save client hints data even if the revision is deleted. This is just in the case that an edit goes through to a page and then it's deleted just after the edit but before the API request is sent.

Ticket filed at T342790: Allow storage of client hints for deleted revisions and revisions with revision deleted performer.

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:09:29Z] <TheresNoTime> [samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql for T258105

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:14:51Z] <TheresNoTime> [samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints_map.sql for T258105