
Analyse Client Hints data collected on WMF wikis to determine entropy
Closed, Resolved · Public

Description

A method of collecting data about what Client Hints contain the most entropy is needed. This method should:

  1. Collect the data in an anonymised form
  2. Only be runnable, and its results only inspectable, by someone with access to inspect the DB and/or run maintenance scripts

A maintenance script makes the most sense for this. It should return the rows in cu_useragent_clienthints along with the number of uses of each row in cu_useragent_clienthints_map. The script should also be runnable on all wikis at once.
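As a rough sketch of the aggregation such a script could perform, here is a minimal Python sketch against a hypothetical local copy of the tables; the column names uach_name, uach_value, uach_id and uachm_uach_id are assumptions inferred from the table names, not confirmed against the actual CheckUser schema:

import sqlite3  # stand-in for the production DB connection

# Count how many times each (uach_name, uach_value) row is referenced
# in the map table. Column names are assumed, not confirmed.
QUERY = """
SELECT uach_name, uach_value, COUNT(*) AS uses
FROM cu_useragent_clienthints
JOIN cu_useragent_clienthints_map ON uachm_uach_id = uach_id
GROUP BY uach_name, uach_value
ORDER BY uses DESC
"""

conn = sqlite3.connect("checkuser.db")  # hypothetical local copy
for name, value, uses in conn.execute(QUERY):
    print(f"{name}={value!r}: {uses} uses")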

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 23 2023, 8:51 AM

Change 951904 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] clienthints: Create maintenance script to analyse client hints data

https://gerrit.wikimedia.org/r/951904

Change 951904 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] clienthints: Create maintenance script to collate client hints data

https://gerrit.wikimedia.org/r/951904

@Dreamy_Jazz

  • I am not sure I know what we mean by entropy here. Do we want to work out which value of uach_value is rarest? (Or perhaps which combination of values is rarest?) Or, which uach_name has the largest variety of uach_values associated with it? Or something else?
  • The mapTableRowCountBreakdown array might be a lot of data to digest. Could we instead just return a summary for each uach_name, such as the total number of different values?
  • If instead we are expecting the person analysing this data to do some further processing, could we return relatively "raw" data which they can analyse as they see fit? Do we know who is going to be analysing this data, and can we ask them what they think?
  • averageItemsPerNamePerReferenceId always seems to return 1 for the fields other than brands and fullVersionList.
  • Further to the above, should we include information about missing client hints data for reference IDs? It might be interesting to know, for example, that only x% of reference IDs actually include the mobile field in their client hints.
  • It occurred to me that we currently cannot distinguish between client hints data which is the empty string and those which are not included in the JSON at all. I don't know if it would be interesting to start collecting this data.

For example, after editing, Chromium 116 on Debian Bullseye sends:

{"architecture":"","bitness":"64","brands":[{"brand":"Not)A;Brand","version":"24"},{"brand":"Chromium","version":"116"}],"fullVersionList":[{"brand":"Not)A;Brand","version":"24.0.0.0"},{"brand":"Chromium","version":"116.0.5845.140"}],"mobile":false,"model":"","platform":"Linux","platformVersion":"5.10.0"}

Chromium 90 on Mac Ventura sends:

{"architecture":"x86","model":"","platform":"macOS","platformVersion":"10_15_7"}
  • For mobile, I see:
"mobile": [
      "155",
      "46"
    ],

I assume that is supposed to be something like this, but was changed when converting to JSON:

"mobile": {
      "0" => "155",
      "1" => "46"
    },


  • I am not sure I know what we mean by entropy here. Do we want to work out which value of uach_value is rarest? (Or perhaps which combination of values is rarest?) Or, which uach_name has the largest variety of uach_values associated with it? Or something else?

What I want to determine is what uach_names have the most variety of data and therefore would be the best to display. For example, brands is probably more useful than architecture for display to a CU, but I wanted to empirically determine this.
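A minimal sketch of one way to quantify that variety, as Shannon entropy of the value distribution per uach_name (the counts below are made-up examples, not real data):

import math
from collections import Counter

# Hypothetical uses per uach_value, grouped by uach_name.
counts = {
    "architecture": Counter({"x86": 900, "arm": 90, "": 10}),
    "platform": Counter({"Windows": 600, "macOS": 250, "Linux": 150}),
}

def shannon_entropy(counter):
    """Entropy in bits of one uach_name's value distribution."""
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values())

# Higher entropy suggests the field is more useful to display to a CU.
for name, counter in sorted(counts.items(),
                            key=lambda kv: shannon_entropy(kv[1]),
                            reverse=True):
    print(f"{name}: {shannon_entropy(counter):.2f} bits")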

  • The mapTableRowCountBreakdown array might be a lot of data to digest. Could we instead just return a summary for each uach_name, such as the total number of different values?

I wanted to avoid abstracting this down to how many rows exist with a given uach_name, as GREASE data could skew the numbers.
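For example, GREASE entries deliberately randomise brand strings (such as "Not)A;Brand" in the payload above), so a plain distinct-value count treats them as genuine variety. A hedged heuristic for filtering them out (the pattern is my assumption, not an official definition of GREASE):

import re

# GREASE brands vary the punctuation around "Not", "A" and "Brand";
# this pattern is a heuristic assumption, not a spec.
GREASE_RE = re.compile(r"not.?a.?brand", re.IGNORECASE)

brands = ["Not)A;Brand", "Chromium", "Google Chrome", "Not/A)Brand"]
real_brands = [b for b in brands if not GREASE_RE.search(b)]
print(real_brands)  # ['Chromium', 'Google Chrome']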

  • If instead we are expecting the person analysing this data to do some further processing, could we return relatively "raw" data which they can analyse as they see fit? Do we know who is going to be analysing this data, and can we ask them what they think?

The maintenance script attempts to export the data in an anonymised form, such that I wouldn't be able to make a link between a user/edit and a data point returned by the script. I intend to process this data to determine entropy and therefore what is the most useful data to display. I wanted to avoid, at least at first, having this be fully computed by a maintenance script, since if the maintenance script fails for any reason the "raw" data is still there for the next run. I intend to use Python to analyse the data.
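A sketch of what that follow-up Python analysis could look like, assuming the script's JSON output contains the mapTableRowCountBreakdown section keyed by uach_name with value-to-count maps (the file name and the exact output shape here are my assumptions):

import json

# Assumed shape of the maintenance script's output; the real
# structure may differ.
with open("client_hints_analysis.json") as f:
    report = json.load(f)

for name, breakdown in report["mapTableRowCountBreakdown"].items():
    # Some entries serialize as lists rather than maps (see the mobile
    # example below), so normalise to a list of counts first.
    values = breakdown.values() if isinstance(breakdown, dict) else breakdown
    total = sum(int(n) for n in values)
    print(f"{name}: {len(breakdown)} distinct values over {total} uses")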

  • averageItemsPerNamePerReferenceId always seems to return 1 for the fields other than brands and fullVersionList.

Yeah. I did think about filtering it down to only brands and fullVersionList, but I didn't do that in the end. It should always return 1 for the entries other than brands and fullVersionList, and if it isn't 1 then that is an indication something isn't working in production.

  • Further to the above, should we include information about missing client hints data for reference IDs? It might be interesting to know, for example, that only x% of reference IDs actually include the mobile field in their client hints.

I had thought about this, but did not end up implementing it.

  • It occurred to me that we currently cannot distinguish between client hints data which is the empty string and those which are not included in the JSON at all. I don't know if it would be interesting to start collecting this data.

It might be interesting, but I think the idea isn't to display empty strings, so it would only be useful for analysis.

  • For mobile, I see:
"mobile": [
      "155",
      "46"
    ],

I assume that is supposed to be something like this, but was changed when converting to JSON:

"mobile": {
      "0" => "155",
      "1" => "46"
    },

Yes. When converting to JSON the "0" and "1" keys are converted to their integer type. However, the Python script I will write will treat those as booleans (i.e. the first as false and the second as true).
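In other words, because PHP's json_encode turns an array keyed 0 and 1 into a plain list, the Python side can recover the booleans by position (a minimal sketch of that mapping):

# As emitted in the JSON output: index 0 is the count for uach_value "0"
# (not mobile), index 1 for "1" (mobile).
mobile_counts = ["155", "46"]

by_boolean = {bool(index): int(count)
              for index, count in enumerate(mobile_counts)}
print(by_boolean)  # {False: 155, True: 46}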

@dom_walden do my thoughts make sense?

Yes, they do, thanks. If you are the one analysing the data then as long as you are happy that is all that matters.

Was there anything in particular you wanted me to test? I was just going to check whether the data reported was accurate.


Ideally, that what I'm doing makes sense from your point of view. Your suggestions regarding the maintenance script make sense.


For what you want to do with the data, it makes sense.

I also compared the values from the mapTableRowCountBreakdown section in the JSON output to the database and they matched.

I will move this to Done.

Results from this have been generated, but may not be shared publicly. A ranking of this data is likely to be shown to WMF CheckUsers by placing it on https://checkuser.wikimedia.org.