
Add a user-agent parser to CheckUser
Open, Medium, Public

Description

Goal

User story: As a user, it would be helpful if I could see a human-readable version of the User-agent string in the CheckUser interface so that I could easily see which OSs and browsers a user is using.

Often, standard user-agent strings can be parsed into human-understandable information. For instance, Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 indicates that the user is using Firefox version 55.0 on a computer with the 64-bit version of Windows 10. Some of the other parts (such as rv:55.0, Gecko/20100101, etc.) typically give no additional information for the purposes of CheckUser, and can be ignored.
We should investigate adding a user-agent parser (similar to what http://www.useragentstring.com/ does) that would present this basic information in a human-understandable form, making it easier to interpret the UAs.

Acceptance criteria

  • Given a UA string, split it into OS and browser
  • Display the parsed information
    • Add option to see complete UA if needed (mock tbd)
  • If the UA is non-standard and cannot be split, display complete UA

Implementation Strategies

There may not be a single strategy that works perfectly, so using a combination of strategies might be best.

Parsing Libraries

One strategy could be to utilize a parsing library. Sadly, a lot of user agents lie, so it might not be completely accurate. Some example libraries that seem well maintained and might pass a security review are WhichBrowser (Parser-PHP) and DeviceDetector, which are compared further down in this task.
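For illustration, a minimal sketch of what this strategy could look like with the WhichBrowser Parser-PHP library (untested; the method names follow that library's documentation, but treat the exact API as an assumption to verify):

<?php
// Minimal sketch: parse a UA into a human-readable browser/OS summary,
// falling back to the raw string when nothing can be detected.
require_once __DIR__ . '/vendor/autoload.php'; // composer require whichbrowser/parser

function formatUserAgent( string $ua ): string {
	$result = new WhichBrowser\Parser( $ua );
	if ( !$result->isDetected() ) {
		// Non-standard UA that can't be split: display the complete string.
		return $ua;
	}
	// e.g. "Firefox 55.0 on Windows 10"
	return $result->toString();
}

echo formatUserAgent( 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0' );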

UA Databases

Another strategy would be to utilize a database of User Agents. There does not appear to be a freely licensed database. :(

Theoretically, this database could be built on Wikidata, but as far as I know it doesn't currently exist. There is currently an RFC for how to manage software versions. It seems reasonable to add a user agent property that could be used on software versions. We could then probably build a bot that would take the user agents from the major browsers' websites and insert them into Wikidata...

Event Timeline

Huji created this task. Sep 11 2017, 3:28 PM
Restricted Application added a subscriber: Aklapper. Sep 11 2017, 3:28 PM
0x010C added a subscriber: 0x010C. Oct 31 2018, 5:09 PM
Niharika triaged this task as Medium priority. Nov 7 2019, 12:20 AM
Niharika updated the task description.
Restricted Application added a subscriber: MGChecker. Tue, Nov 12, 8:44 PM
dbarratt updated the task description. Thu, Nov 14, 8:34 PM

Thanks for doing an investigation on this, @dbarratt.

One strategy could be to utilize a parsing library. Sadly, a lot of user agents lie, so it might not be completely accurate.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm... since it's three parts, I guess we are really talking about how accurate each part is. I'm not sure you can tell this without looking at it.

For instance, throwing the UA generated in T237669#5661529, WikipediaApp/2.7.50302-r-2019-11-13 (Android 8.1.0; Phone; Nexus 6P Build/OPM6.171019.030.E1) Alpha Channel, into https://whichbrowser.net/tryout/ we get: Android Browser on a Huawei Nexus 6P running Android 8.1.0. Two of those parts are correct, but one isn't.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm... since it's three parts, I guess we are really talking about how accurate each part is. I'm not sure you can tell this without looking at it.
For instance, throwing the UA generated in T237669#5661529, WikipediaApp/2.7.50302-r-2019-11-13 (Android 8.1.0; Phone; Nexus 6P Build/OPM6.171019.030.E1) Alpha Channel, into https://whichbrowser.net/tryout/ we get: Android Browser on a Huawei Nexus 6P running Android 8.1.0. Two of those parts are correct, but one isn't.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

Makes sense. I'm not sure how we would figure that out other than taking some random UAs and testing them manually.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

Makes sense. I'm not sure how we would figure that out other than taking some random UAs and testing them manually.

Yeah. I think picking a couple dozen or so actual UAs from our databases and running a manual check is probably the way to go.

The full UA can be displayed as a mouseover popup of the "Browser" field (title="<UA>") and also copied to the clipboard for further analysis when the field is clicked.
Whenever pattern-matching the UA fails, the "Browser" column can display "Unknown", or the first ~20 characters of the UA in a different color; the user can mouseover to see the actual UA.
This might become tedious if more than a few UAs (say, more than 3) fail parsing. In that case (or in all cases) an extra column can be shown with the full UA strings.
That column would likely overflow the screen and require horizontal scrolling.
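For illustration, a rough sketch of the tooltip approach described above, using MediaWiki core's Html::element() helper (the function name renderBrowserCell and the CSS class are hypothetical, not existing CheckUser code):

<?php
// Sketch only: render the "Browser" cell with the full UA available on mouseover
// via the title attribute. renderBrowserCell() and the CSS class are made up here.
function renderBrowserCell( string $ua, ?string $parsed ): string {
	if ( $parsed === null ) {
		// Parsing failed: show a truncated, differently styled UA; full UA on hover.
		$text = mb_substr( $ua, 0, 20 ) . '…';
		return \Html::element( 'span', [ 'title' => $ua, 'class' => 'cu-ua-unparsed' ], $text );
	}
	return \Html::element( 'span', [ 'title' => $ua ], $parsed );
}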

I retrieved 20 random User Agent strings from the cu_changes table on eswiki and put them into a spreadsheet and compared the parsing of devicedetector.net and whichbrowser.net.

The results can be seen here:
https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing

whichbrowser.net (and their free library) seems a lot better than the alternative (on average).

It seems like UAs are difficult to parse just by looking at them, so machine parsing is helpful even if it's not 100% accurate. What are checkusers looking for when they look at UAs?

AronManning added a comment (edited). Thu, Nov 21, 4:19 PM

As a non-wm CU my order of evidence strength is (decreasing):

  1. Obviously fabricated UA (uncommon, only trolls do that)
  2. Same UA
  3. Browser version increased (updated)
  4. Same device and OS, different browser
  5. Different UA

The strength of the CU match adds up with the strength of the IP match:

  1. Same IP in a short timeframe, or a static IP
  2. Same subnet, if there aren't many users from that subnet/ISP
  3. Accounts using IPs from random countries (unlikely to be due to travel)
  4. IPs reported as proxy/VPN

Note: an IP reported as proxy/VPN is a strong clue to investigate, which is why I suggested showing it in an additional column.

I retrieved 20 random User Agent strings from the cu_changes table on eswiki and put them into a spreadsheet and compared the parsing of devicedetector.net and whichbrowser.net.
The results can be seen here:
https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing
whichbrowser.net (and their free library) seems a lot better than the alternative (on average).

It would be helpful to quantify how often we got accurate results from both the libraries. Looking at the table it seems to me that whichbrowser was more specific about the browser and OS versions.

It seems like UAs are difficult to parse just by looking at them, so machine parsing is helpful even if it's not 100% accurate. What are checkusers looking for when they look at UAs?

Makes sense. I think @AronManning listed the use cases well. The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

It would be helpful to quantify how often we got accurate results from both the libraries.

What do you mean? errr... how do you define accuracy? Can you provide an example?

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Niharika added a comment (edited). Thu, Nov 21, 5:07 PM

It would be helpful to quantify how often we got accurate results from both the libraries.

What do you mean? errr... how do you define accuracy? Can you provide an example?

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

AronManning added a comment (edited). Thu, Nov 21, 9:03 PM

Parsing looks good. Reliable enough for the purpose: only distinguishing different agents, versions, and OSes matters; 100% accuracy is not needed.
WhichBrowser returns more detailed version numbers, which is a plus. It stores its patterns in PHP files.
DeviceDetector stores its pattern data in YAML and requires a YAML parser, so it can be expected to load more slowly.

To parse
WikipediaApp/2.7.50296-r-2019-09-25 (Android 5.0.2; Phone) Google Play
add an entry to the Constants\BrowserType::APP_EDITOR => [ ... ] array in
https://github.com/WhichBrowser/Parser-PHP/blob/master/data/applications-others.php

To parse the version with date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/([0-9.r\-]*)/u' ],

without date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/([0-9.]*)/u' ],

only date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/[0-9.]*-(r[0-9\-]*)/u' ],

(not tested)
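For what it's worth, the "version with date" pattern above can be checked standalone against the example UA (illustrative snippet, not part of the library):

<?php
// Standalone check of the regexp proposed above.
$ua = 'WikipediaApp/2.7.50296-r-2019-09-25 (Android 5.0.2; Phone) Google Play';
if ( preg_match( '/WikipediaApp\/([0-9.r\-]*)/u', $ua, $m ) ) {
	echo $m[1]; // expected: 2.7.50296-r-2019-09-25
}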

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

I updated the spreadsheet: https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing

I gave Green to the best answer, Yellow to an answer that is acceptable but not complete (from what I could tell), Blue to an answer that is acceptable but maybe not accurate, and Red to an answer that is wrong. I only subtracted credit for the completely wrong answers.

I think it's important to remember that UA parsing is a process of reverse engineering. There is no way to know with 100% certainty what originally produced it.

For instance, the UA string:

JUC (Linux; U; 2.3.6; zh-cn; GT-I8150; 480*800) UCWEB8.7.4.225/145/800

is apparently for UC Browser 8.7 on a Samsung Galaxy W running Android 2.3.6. There is no way to tell that by looking at it.

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

Since we expect the "Compare" tab to have a potentially large amount of data on it (and to be paginated because of that), the only way to sort or filter this data would be to index it in the database. I think it would be really useful to have that kind of filtering/sorting, but it means we would need to make a schema change and run a script to parse the existing User Agents and index them in a table. If you would like to do that, I suggest proposing the schema change as soon as possible.

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

I updated the spreadsheet: https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing
I gave Green to the best answer, Yellow to an answer that is acceptable but not complete (from what I could tell), Blue to an answer that is acceptable but maybe not accurate, and Red to an answer that is wrong. I only subtracted credit for the completely wrong answers.

This is great! So it looks like we should attempt to use whichbrowser.net for our purposes.

I think it's important to remember that UA parsing is a process of reverse engineering. There is no way to know with 100% certainty what originally produced it.
For instance, the UA string:

JUC (Linux; U; 2.3.6; zh-cn; GT-I8150; 480*800) UCWEB8.7.4.225/145/800

is apparently for UC Browser 8.7 on a Samsung Galaxy W running Android 2.3.6. There is no way to tell that by looking at it.

Gotcha. Good to know.
Rhetorical question: Why couldn't they have simplified this by making the UA simply be a concatenation of the three parts? 🤔

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

Since we expect the "Compare" tab to have a potentially large amount of data on it (and to be paginated because of that), the only way to sort or filter this data would be to index it in the database. I think it would be really useful to have that kind of filtering/sorting, but it means we would need to make a schema change and run a script to parse the existing User Agents and index them in a table. If you would like to do that, I suggest proposing the schema change as soon as possible.

I see what you mean. We'd need three more columns for the browser, device and OS in the cu_changes table, I assume?

Gotcha. Good to know.
Rhetorical question: Why couldn't they have simplified this by making the UA simply be a concatenation of the three parts? 🤔

There is a long history that explains why. :)

I see what you mean. We'd need three more columns for the browser, device and OS in the cu_changes table, I assume?

Right... or we'd make a new table so we don't repeat the data over and over again (as we do now). We might want more columns if we want to be more granular (i.e. do you want to filter for "Chrome" or "Chrome 77"? "Windows" or "Windows 10"? etc.). We might need to figure out the best way to normalize version strings (if possible) if we wanted to go that route.
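As a rough sketch of that idea (the table and column names below are hypothetical, and the WhichBrowser method names should be verified against the library), the parsed parts could be flattened into a row for a lookup table keyed by the raw UA:

<?php
// Sketch: flatten a parsed UA into fields for a hypothetical cu_useragent table,
// so Compare results could be filtered/sorted without re-parsing every row.
function normalizeUserAgent( string $ua ): array {
	$result = new WhichBrowser\Parser( $ua );
	return [
		'cuua_agent'           => $ua,
		'cuua_browser'         => $result->browser->getName(),    // e.g. "Chrome"
		'cuua_browser_version' => $result->browser->getVersion(), // e.g. "77"; may need normalizing
		'cuua_os'              => $result->os->getName(),         // e.g. "Windows"
		'cuua_device'          => $result->device->model ?? '',   // e.g. "Nexus 6P"
	];
}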

Reedy added a subscriber: Reedy. Tue, Dec 10, 8:12 PM

I do note this isn't particularly i18n-friendly, etc.

https://github.com/WhichBrowser/Parser-PHP/blob/3810e9aceed8acab7f27d1145579cae30eb59ac0/src/Model/Main.php#L257-L316

It hard-codes "on a", among other words. I would suggest basically re-implementing the toString method in CheckUser (subclass and override the function, or just have your own utility function), using standard MW i18n functions and a handful of messages, which then helps support RTL etc. too.

I don't know if Upstream would necessarily be interested in this, but you'd need to ask them :)
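A rough sketch of that suggestion (the message key and function name are made up; the sub-model toString() calls follow what WhichBrowser's Main.php uses internally, so verify them):

<?php
// Sketch: build the display string in CheckUser with MediaWiki i18n messages
// instead of WhichBrowser's hard-coded English Main::toString().
function formatParsedUserAgent( WhichBrowser\Parser $result ): string {
	$browser = $result->browser->toString();
	$os = $result->os->toString();
	if ( $browser === '' || $os === '' ) {
		return $browser . $os; // only one part detected (or none)
	}
	// Hypothetical message, e.g. "checkuser-useragent-browser-on-os" => "$1 on $2",
	// so translators control word order and RTL rendering.
	return wfMessage( 'checkuser-useragent-browser-on-os' )
		->params( $browser, $os )
		->text();
}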

Reedy added a comment. Tue, Dec 10, 8:15 PM

I also note that they've been doing some updates (including a few new releases) recently to support new devices etc., which might help with some of your false positives (and it would basically be an ongoing battle to keep the library up to date with upstream).

They're fairly responsive to bug reports for other stuff from my experience in the last couple of weeks (doing a bit of repo maintenance)...

Make sure you report bugs/issues Upstream to try and get them fixed ;)