Access to aggregate User Agent statistics
Open, Medium, Public

Description

I asked about this on IRC, and joal suggested that I file a ticket. I am interested in getting routine access to aggregate User Agent (UA) data to support a script for CheckUsers. The idea is that, when a CheckUser runs a check, the script would annotate each UA in the results with its prevalence over the past $time_period (my initial thought is three months). For example, they might see a UA of Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0 with the annotation "5% of all WM traffic in the past three months had that useragent".

I recognize that UAs are PII, but all CheckUsers have signed the ANPDP (though not the full NDA) and deal with UAs very frequently. The use case is that knowing how prevalent a UA is lets a CheckUser make a more informed decision about how distinctive it is: if a given UA shows up a lot, it is a less useful fingerprint for comparing two users.

I imagine that I would set up some kind of API endpoint on Toolforge that my theoretical userscript would query on behalf of CUs, with access protected by API keys granted only to CheckUsers. I can see two ways I could get the data:

  • My tool would have direct access to whatever backend database has UA data, and would periodically generate the statistics I'm looking for (not sure if this is possible from Toolforge?)
  • An analytics query could generate the statistics and the results could be handed over to me periodically (something like https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/, but with the full UAs instead of browser + OS)

If it would help with data privacy, the data could be scrubbed so that useragents below a certain threshold (either raw number of hits or % of total hits) are excluded from the dataset. We would probably also want to filter automated traffic out, but I'm not certain of that. My experience is entirely with CU data, which comes from logged actions rather than browsing, and so I'm not sure how the browsing traffic will compare.
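
To make the thresholding idea concrete, here is a rough sketch in Python of how the statistics might be computed and then used for the annotation. It assumes the input is simply an iterable of UA strings, one per hit; the function names (`ua_prevalence`, `annotate`), the `min_share` parameter, and the output wording are all hypothetical and do not describe any existing analytics pipeline or schema.

```python
from collections import Counter


def ua_prevalence(hits, min_share=0.001):
    """Return {user_agent: share_of_all_hits} for UAs at or above min_share.

    `hits` is assumed to be an iterable of UA strings, one per request.
    Shares are computed against the total of ALL hits, including those
    from UAs that end up excluded by the threshold.
    """
    counts = Counter(hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        ua: n / total
        for ua, n in counts.items()
        if n / total >= min_share  # drop rare UAs for privacy
    }


def annotate(ua, prevalence, period="three months"):
    """Build the annotation a CheckUser would see next to a UA."""
    share = prevalence.get(ua)
    if share is None:
        # Either never seen or scrubbed by the threshold; the two cases
        # are deliberately indistinguishable in the published data.
        return f"{ua}: below the reporting threshold for the past {period}"
    return f"{ua}: {share:.1%} of traffic in the past {period}"
```

With this shape, a UA missing from the dataset still tells the CheckUser something useful (it is rare), without the dataset itself exposing rare UAs. A raw hit-count threshold could be swapped in for the percentage threshold by comparing `n` instead of `n / total`.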

This is my first time requesting anything involving non-public analytics data, so I could be making some wildly off-base assumptions or something; corrections are welcome. Thank you!