WE4.2.10 Add more browser signals to client hints pipeline to generate unique device identifier
Closed, ResolvedPublic

Description

If we add two more data points to the client hints collection pipeline, we will be able to measure and confirm the increased entropy in unique device identification.
Context: This hypothesis continues from WE4.2.1. …, where we identified that the current Chrome client hints data is a reliable replacement for the user agent string, which will be discontinued.

This hypothesis directly contributes to the KR by providing new signals (browser canvas fingerprint, list of fonts) that will allow CheckUsers to more precisely target sockpuppets and accounts attempting to evade bans.

Note: this is joint work across (at least) two quarters between the Research and Trust and Safety Product teams.

User stories

  • As a CheckUser, I should be able to see a device similarity score, to help with identifying sockpuppetry and ban evasion
  • As a functionary, when creating an indefinite block of a user, I should be able to block unique device identifiers associated with the user
  • ...

Scope of work (subtasks to be created):

  • Update ext.checkUser.clientHints to obtain list of fonts and generate a canvas fingerprint
  • Update CheckUser client hints APIs to allow for intake of list of fonts and canvas fingerprint hash
  • Build a class for generating a locality-sensitive hash from client hints, fonts and canvas fingerprint, something like this
  • Update CheckUser UI to display the locality-sensitive hash of client hints, fonts and canvas fingerprint
  • Update CheckUser UI to be able to show matches with similarity score above some configurable threshold
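The locality-sensitive hashing step in the scope above could be sketched roughly as follows. This is a SimHash-style approach; the feature names, weighting, and hash width are illustrative assumptions, not the actual CheckUser implementation:

```python
import hashlib

def simhash(features, bits=64):
    """SimHash: similar feature sets yield hashes with a small Hamming distance."""
    weights = [0] * bits
    for feature in features:
        # Hash each feature (a font name, a client-hint value, ...) to `bits` bits.
        h = int.from_bytes(hashlib.sha256(feature.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the sign of the accumulated weight across all features.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def similarity(a, b, bits=64):
    """Fraction of bits two hashes share (1.0 = identical)."""
    return 1 - bin(a ^ b).count("1") / bits

# Two hypothetical devices differing only in hardware model:
device_a = ["Arial", "Helvetica", "Menlo", "canvas:ab12", "model:MacBookPro18,2"]
device_b = ["Arial", "Helvetica", "Menlo", "canvas:ab12", "model:MacBookPro18,3"]
print(f"{similarity(simhash(device_a), simhash(device_b)):.2%}")
```

Because only the sign of each accumulated bit is stored, near-identical inputs tend to produce hashes that differ in few bits, which is what makes a configurable similarity threshold possible.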

Legal approval

Details

Due Date
Feb 28 2025, 5:00 AM

Event Timeline

Note the relevant L3SC discussion is happening here: https://app.asana.com/0/0/1208478147018066/f

Per conversation with @XiaoXiao-WMF and discussions in https://app.asana.com/0/0/1208478147018066/f, would like to put this on hold (or at least, keep it as something we are thinking about and discussing rather than actively working on) while we are working out if integration with hCaptcha is possible.

OK, I thought this through some more and discussed with @SCherukuwada; my updated view is that we should move forward with this hypothesis in Q2/Q3, because:

  • even though hCaptcha will have a superior unique device identifier mechanism, we hedge our bets by not relying entirely on hCaptcha
  • the technical implementation seems feasible within a quarter of work
  • we can have a working solution in production sooner, based on established infrastructure (client hints collection + CheckUser)
  • we can work through legal and privacy concerns in this hypothesis that will apply to other work (e.g. hCaptcha)

I'd suggest removing the ml-model-requests tag; AIUI, no model would be needed here.

kostajh renamed this task from WE4.2.10 Add more signals to client hints pipeline for unique device identifier research to WE4.2.10 Add more browser signals to client hints pipeline to generate unique device identifier locality-sensitive hash. Dec 4 2024, 9:42 AM
kostajh updated the task description. (Show Details)

Thank you for starting this. As an enwiki CU, I appreciate the effort WMF is devoting to Trust and Safety Tools. Would it be possible for project CUs to weigh in more on their needs for such a tool? In particular, I will note that project CUs don't normally limit themselves to looking for exact matches (IP and User-Agent exactly matching), and so if this ever comes to supplant the raw availability of user-agents or client hints, the MVP of the unique device identifier locality-sensitive hash would then have to include:

  • The ability to tell how similar one UA (or UA-equivalent) is to another (not just if it is identical)
  • The ability to tell how typical or atypical the UA is to the overall pool (for example, to be able to detect very old UAs, which is one common but not universal hallmark of UA spoofing, which is quite common on the projects)
  • Some ability to "look under the hood" to understand what drove the similarity score
  • More advanced ability to provide context on how the ISP or geography handles UAs, IP sharing, etc.

I'd be glad to chat more if that's helpful. Thank you!

Thank you for starting this. As an enwiki CU, I appreciate the effort WMF is devoting to Trust and Safety Tools. Would it be possible for project CUs to weigh in more on their needs for such a tool?

Yes, definitely! This is work intended to support the efforts of CUs. We're still in the early stages of seeking legal, security, privacy and safety center approval, but it's good to start talking about details of implementation and requests.

In particular, I will note that project CUs don't normally limit themselves to looking for exact matches (IP and User-Agent exactly matching), and so if this ever comes to supplant the raw availability of user-agents or client hints

I don't think there are plans to replace legacy user agent, user agent client hints, or IP address with the work described in this task. That would continue as-is.

, the MVP of the unique device identifier locality-sensitive hash would then have to include:

  • The ability to tell how similar one UA (or UA-equivalent) is to another (not just if it is identical)
  • The ability to tell how typical or atypical the UA is to the overall pool (for example, to be able to detect very old UAs, which is one common but not universal hallmark of UA spoofing, which is quite common on the projects)
  • Some ability to "look under the hood" to understand what drove the similarity score
  • More advanced ability to provide context on how the ISP or geography handles UAs, IP sharing, etc.

Looking under the hood would be challenging (from a privacy/legal point of view). If this is approved, what we'd likely end up doing is hashing e.g. the list of fonts and the canvas fingerprint and incorporating them into a locality-sensitive hash using a salt, and storing that.

To give an idea of what that might look like, some very rough, not-reviewed proof of concept code generates output like this:

Hash 1: c2f92a630e01080b
Hash 2: d2f92a630e01080b
Similarity Score: 96.88%

Changes in input:
- Removed fonts: 1
- Canvas hash: unchanged
- Model changed: MacBookPro18,2 -> MacBookPro18,3
- Platform version changed: 13.2.1 -> 13.2.2
- Browser version changed in fullVersionList

so, by changing one item in the font list, leaving canvas fingerprint the same, and slightly modifying the client hints, you'll get two hashes that look very similar, and are visually identifiable as being similar.
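The salting mentioned above could look something like this: each raw signal (a font name, a canvas fingerprint) is passed through a keyed hash before it feeds the locality-sensitive hash, so the original values are never stored. A minimal sketch, assuming an HMAC with a private salt; the constant and function names are hypothetical:

```python
import hashlib
import hmac

SALT = b"per-wiki-secret-salt"  # hypothetical; would live in private configuration

def salted_token(value: str) -> str:
    """Keyed hash of a single raw signal (a font name, a canvas fingerprint, ...).
    Without the salt, the token cannot be reversed or cross-referenced elsewhere."""
    return hmac.new(SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

fonts = ["Arial", "Helvetica Neue", "Menlo"]
tokens = [salted_token(f) for f in fonts]  # these tokens, not the names, feed the LSH
```

Rotating the salt would invalidate old tokens, which is one knob for limiting how long this data stays linkable.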

I'd be glad to chat more if that's helpful. Thank you!

Sure, would be happy to. Perhaps on Discord on the Wikimedia Community server? Or we could set up a video meeting if you prefer, just let me know.

Update ext.checkUser.clientHints to obtain list of fonts and generate a canvas fingerprint

Have we exhausted all avenues of passive fingerprinting? Canvas and font fingerprinting feel like a massive overreach in terms of violating a user's privacy in a way that a user cannot explicitly opt out of. (Outside of ceasing to edit Wikipedia)

Update ext.checkUser.clientHints to obtain list of fonts and generate a canvas fingerprint

Have we exhausted all avenues of passive fingerprinting? Canvas and font fingerprinting feel like a massive overreach in terms of violating a user's privacy in a way that a user cannot explicitly opt out of. (Outside of ceasing to edit Wikipedia)

Yes, we will also be looking at other features within CheckUser along the same lines of thought, e.g. hashing, or other ways to reduce entropy.

XiaoXiao-WMF renamed this task from WE4.2.10 Add more browser signals to client hints pipeline to generate unique device identifier locality-sensitive hash to WE4.2.10 Add more browser signals to client hints pipeline to generate unique device identifier. Dec 13 2024, 6:57 PM
XiaoXiao-WMF triaged this task as High priority.

Update ext.checkUser.clientHints to obtain list of fonts and generate a canvas fingerprint

Have we exhausted all avenues of passive fingerprinting? Canvas and font fingerprinting feel like a massive overreach in terms of violating a user's privacy in a way that a user cannot explicitly opt out of. (Outside of ceasing to edit Wikipedia)

If you have suggestions for alternatives, I would be happy to hear them. I'm also not sure where to draw the line between passive/active fingerprinting, and if that is the important thing to focus on in regard to privacy. For example, obtaining user agent http-client-hints data happens via JS that fires after an edit is saved. I suppose that would be active fingerprinting?

Some assumptions for this task:

  • abuse from repeat visitors is a significant problem
  • IP blocking sometimes causes collateral damage
  • IP blocking is sometimes ineffective
  • being able to mitigate actions based on device similarity scores allows for more precisely targeting bad actors
XiaoXiao-WMF changed the task status from Open to In Progress. Jan 6 2025, 9:26 PM
XiaoXiao-WMF claimed this task.
XiaoXiao-WMF changed the status of subtask T383061: Algorithm creation from Open to In Progress.
XiaoXiao-WMF set Due Date to Feb 28 2025, 5:00 AM.

Moving from the quarterly lane to in-progress as I'm closing the quarterly lane. Please set/update the deadline for the task.

Can we please get some version of the report publicized?

Can we please get some version of the report publicized?

Hi @Izno, yes of course, I've pasted a copy of the report below.

The tl;dr from this work is that Research folks worked on defining an algorithm for generating a locality-sensitive hash, allowing a variety of inputs to be turned into hashes that can be compared for similarity against each other. We'll write more about next steps for experimenting with that algorithm soon (cc @sgrabarczuk).


Summary

Created and tested various hashing algorithms to compare client hints and other browser signal data.

Provided recommendations for a working algorithm and similarity measure that is ready to be converted into production usage.

Provided suggestions, in the context of production, to best protect the privacy of users and editors.

Potential use cases

  1. Integration with CheckUser: Assist CheckUsers in comparing Client Hints data and other browser signals, reducing their cognitive load.
  2. Integration with LoginNotify: Automatically identify registered users that have been blocked previously.

Algorithm

The algorithm is hashing-based. It takes client hints data and other browser signals and produces one hash per record. The hash protects the privacy of the browser signatures while preserving the similarity between the original data points.
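"Preserving similarity" means two such hashes can be compared directly, for example by bitwise Hamming similarity. The sketch below applies that metric to the two hashes from the proof-of-concept output earlier in this task; note that plain Hamming similarity on those hashes gives 98.44%, not the 96.88% shown there, so the PoC evidently uses a weighted or otherwise different scoring, which is an assumption worth confirming:

```python
def hamming_similarity(hex_a: str, hex_b: str) -> float:
    """Share of identical bits between two equal-length hex-encoded hashes."""
    a, b = int(hex_a, 16), int(hex_b, 16)
    bits = 4 * len(hex_a)  # each hex digit encodes 4 bits
    return 1 - bin(a ^ b).count("1") / bits

# The two example hashes from the proof-of-concept output:
print(f"{hamming_similarity('c2f92a630e01080b', 'd2f92a630e01080b'):.2%}")  # → 98.44%
```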

Analysis

We performed a number of analyses around the weighted hashing algorithm and distance metrics. Because real canvas hashes and font lists were not yet being collected, we created synthetic data for analytic purposes.

Output usage and benchmarking (will be finalized in Q4): https://gitlab.wikimedia.org/repos/research/research-datasets/-/blob/mnz/fp-samples/notebooks/ua_hash_samples.ipynb