
Analyse Client Hints data collected on WMF wikis to determine entropy
Closed, Resolved · Public

Description

A method of collecting data about what Client Hints contain the most entropy is needed. This method should:

  1. Collect the data in an anonymised form
  2. Only be runnable, and its results only inspectable, by someone with access to inspect the DB and/or run maintenance scripts

A maintenance script makes the most sense for this. It should return the rows in cu_useragent_clienthints along with the number of uses of each row in cu_useragent_clienthints_map. The script should also be runnable on all wikis at once.
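As a rough sketch of the aggregation such a script could perform, here is a minimal Python sketch against a hypothetical local copy of the tables; the column names uach_name, uach_value, uach_id and uachm_uach_id are assumptions inferred from the table names, not confirmed against the actual CheckUser schema:

import sqlite3  # stand-in for the production DB connection

# Count how many times each (uach_name, uach_value) row is referenced
# in the map table. Column names are assumed, not confirmed.
QUERY = """
SELECT uach_name, uach_value, COUNT(*) AS uses
FROM cu_useragent_clienthints
JOIN cu_useragent_clienthints_map ON uachm_uach_id = uach_id
GROUP BY uach_name, uach_value
ORDER BY uses DESC
"""

conn = sqlite3.connect("checkuser.db")  # hypothetical local copy
for name, value, uses in conn.execute(QUERY):
    print(f"{name}={value!r}: {uses} uses")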

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 23 2023, 8:51 AM

Change 951904 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] clienthints: Create maintenance script to analyse client hints data

https://gerrit.wikimedia.org/r/951904

Change 951904 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] clienthints: Create maintenance script to collate client hints data

https://gerrit.wikimedia.org/r/951904

@Dreamy_Jazz

  • I am not sure I know what we mean by entropy here. Do we want to work out which value of uach_value is rarest? (Or perhaps which combination of values is rarest?) Or, which uach_name has the largest variety of uach_values associated with it? Or something else?
  • The mapTableRowCountBreakdown array might be a lot of data to digest. Could we instead just return a summary for each uach_name, such as the total number of different values?
  • If instead we are expecting the person analysing this data to do some further processing, could we return relatively "raw" data which they can analyse as they see fit? Do we know who is going to be analysing this data, and can we ask them what they think?
  • averageItemsPerNamePerReferenceId always seems to return 1 for the fields other than brands and fullVersionList.
  • Further to the above, should we include information about missing client hints data for reference IDs? It might be interesting to know, for example, that only x% of reference IDs actually include the mobile field in their client hints.
  • It occurred to me that we currently cannot distinguish between client hints data which is the empty string and those which are not included in the JSON at all. I don't know if it would be interesting to start collecting this data.

For example, after editing, Chromium 116 on Debian Bullseye sends:

{"architecture":"","bitness":"64","brands":[{"brand":"Not)A;Brand","version":"24"},{"brand":"Chromium","version":"116"}],"fullVersionList":[{"brand":"Not)A;Brand","version":"24.0.0.0"},{"brand":"Chromium","version":"116.0.5845.140"}],"mobile":false,"model":"","platform":"Linux","platformVersion":"5.10.0"}

Chromium 90 on Mac Ventura sends:

{"architecture":"x86","model":"","platform":"macOS","platformVersion":"10_15_7"}
  • For mobile, I see:
"mobile": [
      "155",
      "46"
    ],

I assume that is supposed to be something like this, but was changed when converting to JSON:

"mobile": {
      "0" => "155",
      "1" => "46"
    },


  • I am not sure I know what we mean by entropy here. Do we want to work out which value of uach_value is rarest? (Or perhaps which combination of values is rarest?) Or, which uach_name has the largest variety of uach_values associated with it? Or something else?

What I want to determine is what uach_names have the most variety of data and therefore would be the best to display. For example, brands is probably more useful than architecture for display to a CU, but I wanted to empirically determine this.
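A minimal sketch of one way to quantify that variety, as Shannon entropy of the value distribution per uach_name (the counts below are made-up examples, not real data):

import math
from collections import Counter

# Hypothetical uses per uach_value, grouped by uach_name.
counts = {
    "architecture": Counter({"x86": 900, "arm": 90, "": 10}),
    "platform": Counter({"Windows": 600, "macOS": 250, "Linux": 150}),
}

def shannon_entropy(counter):
    """Entropy in bits of one uach_name's value distribution."""
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values())

# Higher entropy suggests the field is more useful to display to a CU.
for name, counter in sorted(counts.items(),
                            key=lambda kv: shannon_entropy(kv[1]),
                            reverse=True):
    print(f"{name}: {shannon_entropy(counter):.2f} bits")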

  • The mapTableRowCountBreakdown array might be a lot of data to digest. Could we instead just return a summary for each uach_name, such as the total number of different values?

I wanted to avoid abstracting this down to how many rows exist with a given uach_name, as GREASE data could skew the numbers.
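For example, GREASE entries deliberately randomise brand strings (such as "Not)A;Brand" in the payload above), so a plain distinct-value count treats them as genuine variety. A hedged heuristic for filtering them out (the pattern is my assumption, not an official definition of GREASE):

import re

# GREASE brands vary the punctuation around "Not", "A" and "Brand";
# this pattern is a heuristic assumption, not a spec.
GREASE_RE = re.compile(r"not.?a.?brand", re.IGNORECASE)

brands = ["Not)A;Brand", "Chromium", "Google Chrome", "Not/A)Brand"]
real_brands = [b for b in brands if not GREASE_RE.search(b)]
print(real_brands)  # ['Chromium', 'Google Chrome']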

  • If instead we are expecting the person analysing this data to do some further processing, could we return relatively "raw" data which they can analyse as they see fit? Do we know who is going to be analysing this data, and can we ask them what they think?

The maintenance script attempts to export the data in an anonymised form, such that I wouldn't be able to make a link between a user/edit and a data point returned by the script. I intend to process this data to determine entropy and therefore what is the most useful data to display. I wanted to avoid, at least at first, having this be fully computed by a maintenance script, since if the maintenance script fails for any reason the "raw" data is still there for the next run. I intend to use Python to analyse the data.
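A sketch of what that follow-up Python analysis could look like, assuming the script's JSON output contains the mapTableRowCountBreakdown section keyed by uach_name with value-to-count maps (the file name and the exact output shape here are my assumptions):

import json

# Assumed shape of the maintenance script's output; the real
# structure may differ.
with open("client_hints_analysis.json") as f:
    report = json.load(f)

for name, breakdown in report["mapTableRowCountBreakdown"].items():
    # Some entries serialize as lists rather than maps (see the mobile
    # example below), so normalise to a list of counts first.
    values = breakdown.values() if isinstance(breakdown, dict) else breakdown
    total = sum(int(n) for n in values)
    print(f"{name}: {len(breakdown)} distinct values over {total} uses")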

  • averageItemsPerNamePerReferenceId always seems to return 1 for the fields other than brands and fullVersionList.

Yeah. I did think about filtering it down to only brands and fullVersionList, but I didn't do that in the end. It should always return 1 for the entries other than brands and fullVersionList, and if it isn't 1 then that is an indication something isn't working in production.

  • Further to the above, should we include information about missing client hints data for reference IDs? It might be interesting to know, for example, that only x% of reference IDs actually include the mobile field in their client hints.

I had thought about this, but did not end up implementing it.

  • It occurred to me that we currently cannot distinguish between client hints data which is the empty string and those which are not included in the JSON at all. I don't know if it would be interesting to start collecting this data.

It might be interesting, but I think the idea isn't to display empty strings, so it would only be useful for analysis.

  • For mobile, I see:
"mobile": [
      "155",
      "46"
    ],

I assume that is supposed to be something like this, but was changed when converting to JSON:

"mobile": {
      "0" => "155",
      "1" => "46"
    },

Yes. When converting to JSON the "0" and "1" keys are converted to their integer type. However, the Python script I will write will treat those as booleans (i.e. the first as false and the second as true).
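In other words, because PHP's json_encode turns an array keyed 0 and 1 into a plain list, the Python side can recover the booleans by position (a minimal sketch of that mapping):

# As emitted in the JSON output: index 0 is the count for uach_value "0"
# (not mobile), index 1 for "1" (mobile).
mobile_counts = ["155", "46"]

by_boolean = {bool(index): int(count)
              for index, count in enumerate(mobile_counts)}
print(by_boolean)  # {False: 155, True: 46}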

@dom_walden do my thoughts make sense?

Yes, they do, thanks. If you are the one analysing the data then as long as you are happy that is all that matters.

Was there anything in particular you wanted me to test? I was just going to check whether the data reported was accurate.


Ideally, that what I'm doing makes sense from your point of view. Your suggestions regarding the maintenance script make sense.


For what you want to do with the data, it makes sense.

I also compared the values from the mapTableRowCountBreakdown section in the JSON output to the database and they matched.

I will move this to Done.

Results from this have been generated, but may not be shared publicly. A ranking of this data is likely to be shown to WMF CheckUsers by placing it on https://checkuser.wikimedia.org.