
WE4.2.1 - Unique Device
Closed, ResolvedPublic

Description

Hypothesis statement: If we explore and define Wikimedia-specific methods for a unique device identification model, we will be able to define the collection and storage mechanisms that we can later implement in our anti-abuse workflows to enable more targeted blocking of bad actors.

Three main steps

  1. How will unique device identification help reduce the false positive rate caused by IP blocking?
  2. Which features, from all the data we collect today, will help identify unique devices, and can we build a dataset with a collection and storage mechanism? (https://phabricator.wikimedia.org/T360195 shows additional data we just started to collect.)
  3. Can we build a model around the features that are deemed important?

Details

Due Date
Sep 30 2024, 4:00 AM
Other Assignee
fkaelin

Event Timeline

XiaoXiao-WMF triaged this task as High priority.
XiaoXiao-WMF updated Other Assignee, added: Pablo.

Weekly update

  • Started review of existing data sources. For research on readers, the main approach is the hash of IP+user_agent (+language-variant?), productionized as actor_signature in pageview-actor table:
  • As the pageview-actor table is limited to pageviews, started an initial exploration of the cu_changes_table, which contains events from the recentchanges table and other selected events with the IP and UA.
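The hashing approach mentioned above can be illustrated with a minimal sketch. The exact definition of actor_signature in the production table is not given here, so the hash function (SHA-256), the field separator, and the optional `extra` field are assumptions made for illustration only.

```python
import hashlib

def actor_signature(ip: str, user_agent: str, extra: str = "") -> str:
    """Hash IP + user agent (+ optional extra field) into one opaque key.

    Hypothetical helper: the hash function and separator are assumptions,
    not the production definition.
    """
    # Join with a control character so that ("ab", "c") and ("a", "bc")
    # do not collapse into the same input string.
    raw = "\x1f".join([ip, user_agent, extra])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Example values use a documentation-only IP range.
sig_a = actor_signature("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)")
sig_b = actor_signature("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)")
sig_c = actor_signature("203.0.113.8", "Mozilla/5.0 (X11; Linux x86_64)")
```

Identical inputs give the same signature, while a change of IP gives a different one, which is why a signature that includes the IP cannot follow a device across IP changes.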
XiaoXiao-WMF changed the task status from Open to In Progress. Jul 9 2024, 3:40 PM
XiaoXiao-WMF updated Other Assignee, added: fkaelin; removed: Pablo.

Weekly update

Less focus this week because of deadlines from other projects. However, there were some relevant discussions indicating that stylometry may not be a good source of information for modeling unique devices (features from stylometry could be added on top for specific purposes). In addition, IP addresses may not be usable for modeling unique devices, as functionaries might be interested in blocking abusers even if they intentionally or unintentionally change their IP address.

Weekly update

  • Started review of CU UA client hints data. For example, the most common combination ({"architecture": "x86", "bitness": "64", "mobile": "0", "platform": "Windows", "platformVersion": "10.0.0", "brands": ["Not/A)Brand 8", "Chromium 126", "Google Chrome 126"], "fullVersionList": ["Not/A)Brand 8.0.0.0", "Chromium 126.0.6478.127", "Google Chrome 126.0.6478.127"]}) is found for 215K edits by 20K different actors from 30K different IPs. As one could expect, the UA is identical for over 99% of these entries.
  • Next efforts will focus on looking for signals other than client hints to approximate unique devices without having to rely on IPs.

Weekly update

  • Analysis of the number of unique actors and entropy of revisions per unique actor for both CH and CH+IP (see stat1009.eqiad.wmnet:/home/paragon/unique-devices/client_hints.ipynb). Results suggest that CH+IP could be a reasonable approach to identify unique devices with the data we already have (only 55.85% of CH combinations are associated with one single actor, but this value is 98.41% for CH combinations + IP).
  • Exploratory inspection of data for user stories:
    • groups of registered users with the same CH+IP with at least one user blocked and another non-blocked -> sockpuppets were found.
    • groups of users with the same CH+IP -> spambots and sockpuppets were found, but also (good-faith) editathon participants and users editing articles related to Oceania and/or flights.
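The comparison between CH and CH+IP keys can be sketched as follows. This is not the notebook's code: the rows are made-up toy data, the function names are hypothetical, and reading "entropy" as the Shannon entropy of the edit distribution over actors for each key is an assumption.

```python
import math
from collections import Counter, defaultdict

# Hypothetical rows: (client_hints, ip, actor), one row per edit.
edits = [
    ("ch_A", "ip_1", "alice"),
    ("ch_A", "ip_1", "alice"),
    ("ch_A", "ip_2", "bob"),
    ("ch_B", "ip_3", "carol"),
    ("ch_B", "ip_3", "dave"),
]

def single_actor_share(rows, key_fn):
    """Fraction of distinct keys whose edits all come from a single actor."""
    actors = defaultdict(set)
    for row in rows:
        actors[key_fn(row)].add(row[2])
    return sum(len(a) == 1 for a in actors.values()) / len(actors)

def actor_entropy(rows, key_fn):
    """Shannon entropy (in bits) of edits over actors, computed per key."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[key_fn(row)][row[2]] += 1
    result = {}
    for key, c in counts.items():
        total = sum(c.values())
        result[key] = -sum(n / total * math.log2(n / total) for n in c.values())
    return result

ch_only = single_actor_share(edits, lambda r: r[0])         # key = CH
ch_ip = single_actor_share(edits, lambda r: (r[0], r[1]))   # key = CH+IP
entropy_ch_ip = actor_entropy(edits, lambda r: (r[0], r[1]))
```

In the toy data no CH key maps to a single actor, while two of the three CH+IP keys do; an entropy of 0 for a key means all of its edits come from one actor.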

Weekly update

  • User stories data is already available in a spreadsheet:
    • user_story_1a: Groups of registered users with the same CH+IP, editing the same article, with at least one user blocked and another non-blocked. I manually inspected each case and they all seem to be ban evasions, with the following blocking reasons:
      • Promotional username: 17 groups
      • Sockpuppetry: 16 groups
      • Username violation: 12 groups
      • Copyright violations: 12 groups
      • Spam: 10 groups
      • Vandalism: 8 groups
      • Not here to build an encyclopedia: 6 groups
      • Arbitration enforcement sanction: 3 groups
      • Abuse: 2 groups
      • Persistent addition of unsourced content: 1 group
    • user_story_1b (IP+CH): Top 100 groups of registered users with the same CH+IP. The largest groups typically involve users with the same username pattern and very few edits, many related to Oceania. Therefore, I extended the IP data with AbuseIPDB and confirmed that these were Australian IPs. I am not sure if this is due to sockpuppetry or IP scarcity in the region. I then continued exploring IPs from other countries, finding cases like:
      • Editors from USA IPs having similar username patterns and very few edits, many inserting spam.
      • Editors from Portuguese IPs having similar username patterns and very few edits, many related to keyboards and alphabets.
      • Editors from Israeli IPs having similar username patterns and very few edits, many related to religion.
    • user_story_1b (IP+CH+article): Top 100 groups of registered users with the same CH+IP who edited the same article. This is similar to user_story_1a but without any blocking-related restriction (i.e., it includes cases where all users are already blocked or none are blocked, e.g., editathon participants).
  • I have generated two CSV files following the user_story_1b structure but also incorporating the ratio of blocked users in each group, in order to easily build the user_story_1a spreadsheet tab. The datasets are available at: stat1009.eqiad.wmnet:/home/paragon/unique-devices/ch_ip__actors.csv and stat1009.eqiad.wmnet:/home/paragon/unique-devices/ch_ip_title__actors.csv (14,202 and 7,946 rows respectively). With these datasets a prototype could be built to notify CheckUsers of groups of editors that likely warrant investigation.
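The grouping described above can be sketched in a few lines. The column layout of the actual CSV files is not shown in this task, so the row shape, the toy data, and the function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (client_hints, ip, username, is_blocked).
users = [
    ("ch_A", "ip_1", "user1", True),
    ("ch_A", "ip_1", "user2", False),
    ("ch_B", "ip_2", "user3", False),
    ("ch_B", "ip_2", "user4", False),
    ("ch_C", "ip_3", "user5", True),
]

def blocked_ratio_by_group(rows):
    """Map each CH+IP group with 2+ users to (group size, blocked ratio)."""
    groups = defaultdict(dict)
    for ch, ip, name, blocked in rows:
        groups[(ch, ip)][name] = blocked
    return {
        key: (len(members), sum(members.values()) / len(members))
        for key, members in groups.items()
        if len(members) > 1  # a single user is not a group
    }

def mixed_groups(ratios):
    """Groups with at least one blocked and one non-blocked user."""
    return [key for key, (_, ratio) in ratios.items() if 0 < ratio < 1]

ratios = blocked_ratio_by_group(users)
candidates = mixed_groups(ratios)
```

Filtering on a ratio strictly between 0 and 1 yields the user_story_1a style groups, while the unfiltered mapping corresponds to the user_story_1b structure with the ratio attached.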

Weekly update

  • Findings have been shared in the meeting with @XiaoXiao-WMF and @kostajh. It was decided that
    • Next efforts will focus on examining open source fingerprinting approaches.
    • A task will be filed for the CheckUser idea “Group of users with same IP, same client hints, and one has been investigated, notify that a specific group of users requires an investigation”.

Filed as T372651: Recommend users for further investigation based on similarity scores of unique device identifiers

Thanks @kostajh for filing T372651! Please let me know how I can support you with that recommendation.

Weekly update

  • I created a spreadsheet to start collecting open source approaches to fingerprinting.
    • There is also a tab for recent surveys conducted by academics. I am planning to review the TWEB article Browser fingerprinting: A survey, which was co-authored by the developer of amiunique.org (it is worth considering contacting him to exchange ideas).

Weekly update

  • @kostajh @Dreamy_Jazz and I had a call with the author of the TWEB article Browser fingerprinting: A survey (notes).
  • He provided good food for thought, e.g., how to prioritize new attributes to be collected, how to store them to enable fuzzy hashing, etc.
  • It was also mentioned that there is almost no literature on how industry organizations identify unique devices in practice.
    • In my opinion, this research could be a secondary (but very impactful) contribution of this project, once our system is deployed and tested.
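To make the storage point concrete: an exact hash over all attributes changes completely when a single attribute changes (for example, a browser update), whereas attributes stored individually can still be compared approximately. The sketch below uses a simple share-of-matching-attributes score as a stand-in; it is not a fuzzy hashing algorithm, and the attribute names and function are hypothetical.

```python
def attribute_similarity(a: dict, b: dict) -> float:
    """Share of attributes (present in either record) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)

# Hypothetical device records, stored attribute by attribute.
before = {
    "platform": "Windows",
    "architecture": "x86",
    "bitness": "64",
    "browser_major": "126",
}
# Same device after a browser update: only one attribute differs.
after = dict(before, browser_major="127")

score = attribute_similarity(before, after)
```

With per-attribute storage, the two records above still score as mostly similar, which an exact hash of the concatenated attributes could not express.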

The research work for Q1 has concluded.