
Add a user-agent parser to CheckUser
Open, Medium, Public

Description

Goal

User story: As a user, it would be helpful if I could see a human-readable version of the User-agent string in the CheckUser interface so that I could easily see which OSs and browsers a user is using.

Often, standard user-agent strings can be parsed into human-understandable information. For instance, Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 indicates that the user is using Firefox version 55.0 on a computer with the 64-bit version of Windows 10. Some of the other parts (such as rv:55.0, Gecko/20100101, etc.) typically give no additional information for the purposes of CheckUser, and can be ignored.
We should investigate adding a user-agent parser (similar to what http://www.useragentstring.com/ does) that would present this basic information in a human-understandable form, making it easier to interpret the UAs.

Acceptance criteria

  • Given a UA string, split it into OS and browser
  • Display the parsed information
    • Add option to see complete UA if needed (mock tbd)
  • If the UA is non-standard and cannot be split, display complete UA

Implementation Strategies

There may not be a single strategy that works perfectly, so using a combination of strategies might be best.

Parsing Libraries

One strategy could be to utilize a parsing library. Sadly, a lot of user agents lie, so it might not be completely accurate. Some example libraries that seem well maintained and might pass a security review are WhichBrowser (Parser-PHP) and DeviceDetector, which are compared further down in this task.
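For illustration, a minimal sketch of what this strategy could look like with the WhichBrowser Parser-PHP library (untested; the method names follow that library's documentation, but treat the exact API as an assumption to verify):

<?php
// Minimal sketch: parse a UA into a human-readable browser/OS summary,
// falling back to the raw string when nothing can be detected.
require_once __DIR__ . '/vendor/autoload.php'; // composer require whichbrowser/parser

function formatUserAgent( string $ua ): string {
	$result = new WhichBrowser\Parser( $ua );
	if ( !$result->isDetected() ) {
		// Non-standard UA that can't be split: display the complete string.
		return $ua;
	}
	// e.g. "Firefox 55.0 on Windows 10"
	return $result->toString();
}

echo formatUserAgent( 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0' );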

UA Databases

Another strategy would be to utilize a database of User Agents. There does not appear to be a freely licensed database. :(

Theoretically, this database could be built on Wikidata, but as far as I know it doesn't currently exist. There is currently an RFC for how to manage software versions. It seems reasonable to add a user agent property that could be used on software versions. We could then probably build a bot that would take the user agents from the major browsers' websites and insert them into Wikidata...

Event Timeline

Huji created this task. Sep 11 2017, 3:28 PM
Restricted Application added a subscriber: Aklapper. Sep 11 2017, 3:28 PM
0x010C added a subscriber: 0x010C. Oct 31 2018, 5:09 PM
Niharika triaged this task as Medium priority. Nov 7 2019, 12:20 AM
Niharika updated the task description.
Restricted Application added a subscriber: MGChecker. Tue, Nov 12, 8:44 PM
dbarratt updated the task description. Thu, Nov 14, 8:34 PM

Thanks for doing an investigation on this, @dbarratt.

One strategy could be to utilize a parsing library. Sadly, a lot of user agents lie, so it might not be completely accurate.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm... since it's three parts, I guess we are really talking about how accurate each part is. I'm not sure you can tell this without looking at it.

For instance, throwing the UA generated in T237669#5661529, WikipediaApp/2.7.50302-r-2019-11-13 (Android 8.1.0; Phone; Nexus 6P Build/OPM6.171019.030.E1) Alpha Channel, into https://whichbrowser.net/tryout/ we get: Android Browser on a Huawei Nexus 6P running Android 8.1.0. Two of those parts are correct, but one isn't.

Hmm, if we can have a sense of how inaccurate a parsed UA is (say, 80% or 20%) then based on that we can decide whether to display the complete UA or not. One of the use cases of this is to also know when the user might have spoofed their UA.

Hmm... since it's three parts, I guess we are really talking about how accurate each part is. I'm not sure you can tell this without looking at it.
For instance, throwing the UA generated in T237669#5661529, WikipediaApp/2.7.50302-r-2019-11-13 (Android 8.1.0; Phone; Nexus 6P Build/OPM6.171019.030.E1) Alpha Channel, into https://whichbrowser.net/tryout/ we get: Android Browser on a Huawei Nexus 6P running Android 8.1.0. Two of those parts are correct, but one isn't.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

Makes sense. I'm not sure how we would figure that out other than taking some random UAs and testing them manually.

I see. I think we need some baseline estimate of how often it is wrong. If it's like 1 in 10 cases, then maybe we could split up the UA and give them a disclaimer that it might be wrong and give them an option to see the complete UA. If it's 1 in 2 cases, we probably don't want to do the splitting.

Makes sense. I'm not sure how we would figure that out other than taking some random UAs and testing them manually.

Yeah. I think picking a couple dozen or so actual UAs from our databases and running a manual check is probably the way to go.

The full UA can be displayed as a mouseover popup of the "Browser" field (title="<UA>") and also copied to the clipboard for further analysis when the field is clicked.
Whenever pattern-matching the UA fails, the "Browser" column can display "Unknown", or the first ~20 characters of the UA in a different color; the user can mouseover to see the actual UA.
This might become tedious if more than a few UAs (say, more than 3) fail parsing. In that case (or in all cases) an extra column can be shown with the full UA strings.
That column would likely overflow the screen and require horizontal scrolling.
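For illustration, a rough sketch of the tooltip approach described above, using MediaWiki core's Html::element() helper (the function name renderBrowserCell and the CSS class are hypothetical, not existing CheckUser code):

<?php
// Sketch only: render the "Browser" cell with the full UA available on mouseover
// via the title attribute. renderBrowserCell() and the CSS class are made up here.
function renderBrowserCell( string $ua, ?string $parsed ): string {
	if ( $parsed === null ) {
		// Parsing failed: show a truncated, differently styled UA; full UA on hover.
		$text = mb_substr( $ua, 0, 20 ) . '…';
		return \Html::element( 'span', [ 'title' => $ua, 'class' => 'cu-ua-unparsed' ], $text );
	}
	return \Html::element( 'span', [ 'title' => $ua ], $parsed );
}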

I retrieved 20 random User Agent strings from the cu_changes table on eswiki and put them into a spreadsheet and compared the parsing of devicedetector.net and whichbrowser.net.

The results can be seen here:
https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing

whichbrowser.net (and their free library) seems a lot better than the alternative (on average).

It seems like UAs are difficult to parse just by looking at them, so machine parsing is helpful even if it's not 100% accurate. What are checkusers looking for when they look at UAs?

AronManning added a comment (edited). Thu, Nov 21, 4:19 PM

As a non-wm CU my order of evidence strength is (decreasing):

  1. Obviously fabricated UA (uncommon, only trolls do that)
  2. Same UA
  3. Browser version increased (updated)
  4. Same device and OS, different browser
  5. Different UA

The strength of the CU match adds up with the strength of the IP match:

  1. Same IP in a short timeframe, or a static IP
  2. Same subnet, if there aren't many users from that subnet/ISP
  3. Accounts using IPs from random countries (unlikely to be due to travel)
  4. IPs reported as proxy/VPN

Note: an IP reported as proxy/VPN is a strong clue to investigate, which is why I suggested showing it in an additional column.

I retrieved 20 random User Agent strings from the cu_changes table on eswiki and put them into a spreadsheet and compared the parsing of devicedetector.net and whichbrowser.net.
The results can be seen here:
https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing
whichbrowser.net (and their free library) seems a lot better than the alternative (on average).

It would be helpful to quantify how often we got accurate results from both the libraries. Looking at the table it seems to me that whichbrowser was more specific about the browser and OS versions.

It seems like UAs are difficult to parse just by looking at them, so machine parsing is helpful even if it's not 100% accurate. What are checkusers looking for when they look at UAs?

Makes sense. I think @AronManning listed the use cases well. The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

It would be helpful to quantify how often we got accurate results from both the libraries.

What do you mean? errr... how do you define accuracy? Can you provide an example?

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Niharika added a comment (edited). Thu, Nov 21, 5:07 PM

It would be helpful to quantify how often we got accurate results from both the libraries.

What do you mean? errr... how do you define accuracy? Can you provide an example?

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

AronManning added a comment (edited). Thu, Nov 21, 9:03 PM

Parsing looks good. Reliable enough for the purpose: only distinguishing different agents, versions, and OSes matters; 100% accuracy is not needed.
WhichBrowser returns more detailed version numbers, which is a plus. It stores its patterns in PHP files.
DeviceDetector stores its pattern data in YAML and requires a YAML parser, so it can be expected to load more slowly.

To parse
WikipediaApp/2.7.50296-r-2019-09-25 (Android 5.0.2; Phone) Google Play
add an entry to the Constants\BrowserType::APP_EDITOR => [ ... ] array in
https://github.com/WhichBrowser/Parser-PHP/blob/master/data/applications-others.php

To parse the version with date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/([0-9.r\-]*)/u' ],

without date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/([0-9.]*)/u' ],

only date:

[ 'name' => 'Wikipedia',            'id'    => 'wikipedia',   'regexp' =>'/WikipediaApp\/[0-9.]*-(r[0-9\-]*)/u' ],

(not tested)
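For what it's worth, the "version with date" pattern above can be checked standalone against the example UA (illustrative snippet, not part of the library):

<?php
// Standalone check of the regexp proposed above.
$ua = 'WikipediaApp/2.7.50296-r-2019-09-25 (Android 5.0.2; Phone) Google Play';
if ( preg_match( '/WikipediaApp\/([0-9.r\-]*)/u', $ua, $m ) ) {
	echo $m[1]; // expected: 2.7.50296-r-2019-09-25
}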

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

I updated the spreadsheet: https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing

I gave Green to the best answer, Yellow to an answer that is acceptable but not complete (from what I could tell), Blue to an answer that is acceptable but maybe not accurate, and Red to an answer that is wrong. I only subtracted credit for the completely wrong answers.

I think it's important to remember that UA parsing is a process of reverse engineering. There is no way to know with 100% certainty what originally produced it.

For instance, the UA string:

JUC (Linux; U; 2.3.6; zh-cn; GT-I8150; 480*800) UCWEB8.7.4.225/145/800

is apparently for UC Browser 8.7 on a Samsung Galaxy W running Android 2.3.6. There is no way to tell that by looking at it.

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

Since we expect the "Compare" tab to have a potentially large amount of data on it (and to be paginated because of that), the only way to sort or filter this data would be to index it in the database. I think it would be really useful to have that kind of filtering/sorting, but it means we would need to make a schema change and run a script to parse the existing User Agents and index them in a table. If you would like to do that, I suggest proposing the schema change as soon as possible.

For each UA we have the three parts. For 22 records, that is 66 pieces. So for each library, how many of those were right/wrong would be helpful to know.

I updated the spreadsheet: https://docs.google.com/spreadsheets/d/1Hgv3waJxzBMuHiFIDyWBSNJqPO5ILHUYhWLnYfAwvGM/edit?usp=sharing
I gave Green to the best answer, Yellow to an answer that is acceptable but not complete (from what I could tell), Blue to an answer that is acceptable but maybe not accurate, and Red to an answer that is wrong. I only subtracted credit for the completely wrong answers.

This is great! So it looks like we should attempt to use whichbrowser.net for our purposes.

I think it's important to remember that UA parsing is a process of reverse engineering. There is no way to know with 100% certainty what originally produced it.
For instance, the UA string:

JUC (Linux; U; 2.3.6; zh-cn; GT-I8150; 480*800) UCWEB8.7.4.225/145/800

is apparently for UC Browser 8.7 on a Samsung Galaxy W running Android 2.3.6. There is no way to tell that by looking at it.

Gotcha. Good to know.
Rhetorical question: Why couldn't they have simplified this by making the UA simply be a concatenation of the three parts? 🤔

The biggest strength we would get from splitting the UA is the ability to sort results by OS/Device/Browser separately.

Do you mean sorting, filtering, or both?

Both.

Since we expect the "Compare" tab to have a potentially large amount of data on it (and to be paginated because of that), the only way to sort or filter this data would be to index it in the database. I think it would be really useful to have that kind of filtering/sorting, but it means we would need to make a schema change and run a script to parse the existing User Agents and index them in a table. If you would like to do that, I suggest proposing the schema change as soon as possible.

I see what you mean. We'd need three more columns for the browser, device and OS in the cu_changes table, I assume?

Gotcha. Good to know.
Rhetorical question: Why couldn't they have simplified this by making the UA simply be a concatenation of the three parts? 🤔

There is a long history that explains why. :)

I see what you mean. We'd need three more columns for the browser, device and OS in the cu_changes table, I assume?

Right... or we'd make a new table so we don't repeat the data over and over again (as we do now). We might want more columns if we want to be more granular (i.e. do you want to filter for "Chrome" or "Chrome 77"? "Windows" or "Windows 10"? etc.). We might need to figure out the best way to normalize version strings (if possible) if we wanted to go that route.
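As a rough sketch of that idea (the table and column names below are hypothetical, and the WhichBrowser method names should be verified against the library), the parsed parts could be flattened into a row for a lookup table keyed by the raw UA:

<?php
// Sketch: flatten a parsed UA into fields for a hypothetical cu_useragent table,
// so Compare results could be filtered/sorted without re-parsing every row.
function normalizeUserAgent( string $ua ): array {
	$result = new WhichBrowser\Parser( $ua );
	return [
		'cuua_agent'           => $ua,
		'cuua_browser'         => $result->browser->getName(),    // e.g. "Chrome"
		'cuua_browser_version' => $result->browser->getVersion(), // e.g. "77"; may need normalizing
		'cuua_os'              => $result->os->getName(),         // e.g. "Windows"
		'cuua_device'          => $result->device->model ?? '',   // e.g. "Nexus 6P"
	];
}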

Reedy added a subscriber: Reedy. Tue, Dec 10, 8:12 PM

I do note this isn't particularly i18n-friendly, etc.

https://github.com/WhichBrowser/Parser-PHP/blob/3810e9aceed8acab7f27d1145579cae30eb59ac0/src/Model/Main.php#L257-L316

It hard-codes "on a", among other words. I would suggest basically re-implementing the toString method in CheckUser (subclass and override the function, or just have your own utility function), using standard MW i18n functions and a handful of messages, which then helps support RTL etc. too.

I don't know if Upstream would necessarily be interested in this, but you'd need to ask them :)
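A rough sketch of that suggestion (the message key and function name are made up; the sub-model toString() calls follow what WhichBrowser's Main.php uses internally, so verify them):

<?php
// Sketch: build the display string in CheckUser with MediaWiki i18n messages
// instead of WhichBrowser's hard-coded English Main::toString().
function formatParsedUserAgent( WhichBrowser\Parser $result ): string {
	$browser = $result->browser->toString();
	$os = $result->os->toString();
	if ( $browser === '' || $os === '' ) {
		return $browser . $os; // only one part detected (or none)
	}
	// Hypothetical message, e.g. "checkuser-useragent-browser-on-os" => "$1 on $2",
	// so translators control word order and RTL rendering.
	return wfMessage( 'checkuser-useragent-browser-on-os' )
		->params( $browser, $os )
		->text();
}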

Reedy added a comment. Tue, Dec 10, 8:15 PM

I also note that they've been doing some updates (including a few new releases) recently to support new devices etc., which might help with some of your false positives (and it would basically be an ongoing battle to keep the library up to date with upstream).

They're fairly responsive to bug reports for other stuff from my experience in the last couple of weeks (doing a bit of repo maintenance)...

Make sure you report bugs/issues Upstream to try and get them fixed ;)