Provide aggregated user device data per-country
Open, Needs Triage | Public | 9 Estimated Story Points | Feature

Description

Feature summary

The server logs contain IP and User Agent data for every edit. Please provide aggregate statistics showing usage of each device type per-country (based on IP geolocation). Since device popularity changes over time, the statistics should also be per-year.

While this is highly confidential information when connected to a specific account, there should be no issues with providing aggregated statistics, since this would contain no PII (Personally Identifiable Information). To avoid the effects of small sample sizes, statistics could be dropped for countries or device types below some usage threshold.
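A minimal sketch of the aggregation and thresholding described above, for illustration only. It assumes each edit has already been reduced to a `(country, year, device)` tuple by IP geolocation and User Agent parsing; the function name `aggregate_device_counts` and the `min_count` parameter are hypothetical, and the default threshold of 100 is an arbitrary placeholder, not a recommended value.

```python
from collections import Counter


def aggregate_device_counts(records, min_count=100):
    """Count edits per (country, year, device) and drop small buckets.

    `records` is an iterable of (country, year, device) tuples. The tuple
    layout and `min_count` are assumptions made for this sketch; they are
    not taken from any existing schema.
    """
    counts = Counter(records)
    # Suppress buckets below the threshold so that rare combinations,
    # which could single out an individual, never appear in the output.
    return {key: n for key, n in counts.items() if n >= min_count}
```

With `min_count=2`, for example, a bucket seen three times is kept and a bucket seen once is dropped from the result entirely.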

Use case(s)

Checkusers can see what device type a user has, and this can be an important piece of information when determining if two accounts are controlled by the same person. If two users are using a very common device, that gives almost no information, because the probability that two different people would both be using it is high. On the other hand, if it's a very unusual device type, that's a strong indication that the two accounts are the same person. Unfortunately, except for the most obvious cases ("Windows NT 10.0; Win64; x64" for example), most checkusers don't really know which devices are common and which are rare, especially in countries where they don't live.

As a more concrete example, there is a discussion going on right now on the checkuser IRC channel about whether a Samsung A20 phone is a rare or common device in India. People are taking stabs at it from publicly available market data, but those are just vague guesses. Having more accurate data would be very useful.
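To make the "common vs. rare" reasoning concrete, here is a rough sketch of how a per-country device share could be turned into a weight for a device match. It rests on simplifying assumptions that are not part of this request: unrelated users pick devices independently according to the country-wide share, and one person always shows up with the same device. The function name `match_likelihood_ratio` is hypothetical, and any share passed to it would have to come from the requested statistics.

```python
def match_likelihood_ratio(device_share):
    """How much more likely a device match is for one person than for two.

    `device_share` is the fraction of users in the country using the
    device (0 < share <= 1). Under the assumptions above, a match is
    certain for the same person and happens with probability
    `device_share` for an unrelated user, so the ratio is 1 / share.
    """
    if not 0 < device_share <= 1:
        raise ValueError("device_share must be in the interval (0, 1]")
    return 1.0 / device_share
```

A device used by half of a country's editors gives a ratio of 2, which is almost no evidence, while a device used by one editor in a thousand gives a ratio of 1000.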

Benefits (why should this be implemented?):

It will provide an additional tool for checkusers to make more accurate determinations. This will both allow CUs to place blocks in cases where they would otherwise have insufficient confidence to justify a block, and also prevent erroneous blocks based on incorrect assumptions about how significant a device match is.

Having this sort of data will also be helpful to people working on user interface improvements. Better knowledge about the device capabilities of our users in various countries can inform decisions about what features to deploy or not deploy.

Event Timeline

See also T298912, which is a broader request for similar data (but less aggregated)

mpopov added subscribers: Htriedman, Niharika, mpopov.

Tagging @Htriedman who has been working on releasing another dataset aggregated per-country and @Niharika as the PM for Anti-Harassment Tools.

EChetty set the point value for this task to 9. (Jan 16 2023, 4:36 PM)

If it makes it easier, this data can be gleaned from the cu_changes, cu_log_event and cu_private_event tables. Data there is stored for 3 months; if a longer period is needed, the server logs will need to be inspected manually.