Add ability to search by user agent from CheckUser interface
Open, Normal, Public

Description

The CheckUser interface currently surfaces user agents when listing users or edits, but there's no way to search by user agent. It would be nice if you could click on a listed user agent and then see all users of, or edits performed by, that user agent/IP combination.

This may be an expensive query, so we may have to introduce a new database column holding hashed user agents that can serve as an index. That would preclude wildcard or prefix searches (for example, from a free-text field), but in almost all cases the user will want to search for a specific user agent anyway, so I think that's a decent trade-off.
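A minimal sketch of the hashed-column idea, assuming SHA-1 as the hash and an in-memory dictionary standing in for the indexed database column (the task does not specify either; `ua_hash`, `users_for` and the sample rows are hypothetical):

```python
import hashlib

def ua_hash(user_agent: str) -> str:
    """Return a fixed-width hex digest usable as an exact-match index key."""
    return hashlib.sha1(user_agent.encode("utf-8")).hexdigest()

# Hypothetical log rows: (ip, user_agent, username)
rows = [
    ("203.0.113.5", "Mozilla/5.0 (X11; Linux x86_64) Firefox/49.0", "Alice"),
    ("203.0.113.5", "WikiTiki 1.0", "TikiBot"),
    ("198.51.100.7", "WikiTiki 1.0", "OtherBot"),
]

# Stand-in for an index on (ip, hashed user agent)
index = {}
for ip, ua, user in rows:
    index.setdefault((ip, ua_hash(ua)), []).append(user)

def users_for(ip: str, user_agent: str) -> list:
    """Exact-match lookup on the IP and user agent combination."""
    return index.get((ip, ua_hash(user_agent)), [])
```

The lookup only succeeds for the complete string: searching for the prefix "WikiTiki" hashes to a different key and finds nothing, which is the trade-off described above.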

We will need to add one or more indexes to the table for this.

The user agent search will only be performed as a combined search for user agent and IP address. (If you just searched for user agent on its own, a common user agent would return too many results to be meaningful.)

There will be a second text field under the existing IP/username field that only appears when you click on a user agent link or a user agent is specified in the query string.

Description of the workflow, Get users:

  • User pastes IP address into "IP/username" field, chooses Get users
  • Results list has usernames, IP address and user agents. All three are links. User clicks on one of the user agent links.
  • On refresh, there is an "IP/username" field with the IP address filled in, and a second "user agent" field with the user agent filled in.
  • User can now search again for Get edits or Get users, and the results will only show this IP and user agent combination.
  • User can also blank the "user agent" field, if they don't want to use it anymore.
  • Alternatively, user can change the IP to a different IP or a CIDR range.
  • Note: The "IP/username" field allows for ranges. We should do the same for the "user agent" field, if possible.
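The filtering in this workflow could be sketched roughly as follows. This is an illustration only, not the CheckUser implementation; the function name `matches` and its parameters are hypothetical, and it treats a blank user agent field as "no user agent filter":

```python
import ipaddress

def matches(row_ip, row_ua, ip_or_range, user_agent=None):
    """Return True if a logged row matches the IP (or CIDR range) and,
    when one is given, the exact user agent string."""
    # A bare IP becomes a single-address network; strict=False tolerates
    # ranges written with host bits set, e.g. "203.0.113.5/24".
    network = ipaddress.ip_network(ip_or_range, strict=False)
    if ipaddress.ip_address(row_ip) not in network:
        return False
    # A blank or missing user agent field means "do not filter on it".
    return not user_agent or row_ua == user_agent
```

For example, `matches("203.0.113.5", "WikiTiki 1.0", "203.0.113.0/24", "WikiTiki 1.0")` is true, while the same row fails to match if a different user agent is supplied.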

Description of the workflow, Get edits:

  • User pastes IP address or username into "IP/username" field, chooses Get edits
  • Results list has log entries, with IP and user agent listed under each entry. IP and user agent are both links. User clicks on user agent link.
  • On refresh, there is an "IP/username" field with the IP address filled in, and a second "user agent" field with the user agent filled in.

Wireframes:

Event Timeline

kaldari created this task. Sep 27 2016, 11:30 PM
Restricted Application added subscribers: JEumerus, Aklapper. Sep 27 2016, 11:30 PM
kaldari updated the task description. Sep 28 2016, 2:21 AM
kaldari added a subscriber: DannyH.
Huji triaged this task as Low priority. Sep 28 2016, 3:35 AM
kaldari updated the task description. Sep 28 2016, 3:36 AM
DannyH updated the task description. Oct 4 2016, 9:58 PM
Huji added a subscriber: Huji. (Edited) Oct 4 2016, 10:48 PM

@DannyH I don't think the wireframes you added are practical: there is no way to know for sure that a user with username "Mozilla/5.0 ..." doesn't exist. Or a user might edit with the user-agent "DannyH".

Perhaps a better option would be to have a dropdown which has three options (IP, user-agent, username) followed by a textbox in which you provide the data.

On a separate note, I think we should also allow searching in the UA using wildcards. I can think of the efficiency counter-arguments, but an "exact match only" search is not going to be that useful.

Hi @Huji,

Well, the user-agent is a long string of data, generated automatically. A full example is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.27604.111 Safari/537.36

So you'd search for this by clicking on a user-agent string that you get from doing a query on an IP address or user name. You wouldn't just type it in; it would be too easy to make mistakes.

So you can't have a user-agent called "DannyH" or anything like that, and if we have a user with a name like "(that whole string of browser names and digits)", then I would like an explanation from that user for why they have such a ridiculous username. :)

I agree that we still have to figure out how useful the exact match will be, compared to wildcards. We need to test out searching for user-agents, and see if we get enough results, or too many.

Huji added a comment. (Edited) Oct 5 2016, 12:06 AM

Well, the user-agent is a long string of data, generated automatically. A full example is:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.27604.111 Safari/537.36

A user-agent is "usually" like that. It does not have to be. To my knowledge, there is no RFC that enforces a specific format for user-agent strings, and even if there is, there is no guarantee that a user would not modify their user-agent to something non-standard. We want the CU tool to know if it is searching for "DannyH" in the user field or the UA field. This cannot be decided just by looking at the input string.

So you can't have a user-agent called "DannyH" or anything like that

Yes you can. It is not illegal. In fact I have seen users with really ridiculous user-agent strings.

So you can't have a user-agent called "DannyH" or anything like that

Yes you can. It is not illegal. In fact I have seen users with really ridiculous user-agent strings.

You could spoof it, but the browser would seemingly only give a standard-ish user agent by default, otherwise it may get unwanted content. E.g. DannyH might result in a blank page with "your browser is not supported". I would not be surprised if the more prolific LTAs are spoofing user agents into something ridiculous like this, though. If that is the case, then being able to search for it would certainly be helpful, given the unlikelihood of others having a similar UA. Also, many browsers update automatically, so you might want to do a wildcard search omitting the browser version to allow for some variation.
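As a rough illustration of a wildcard search that omits the browser version (a sketch only: it uses Python's shell-style `fnmatch` patterns, not whatever matching the CheckUser backend would actually use, and `ua_matches` is a hypothetical helper):

```python
from fnmatch import fnmatchcase

# Pattern with the Chrome version replaced by a wildcard
PATTERN = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/* Safari/537.36")

def ua_matches(user_agent: str, pattern: str = PATTERN) -> bool:
    """Case-sensitive wildcard match of a user agent against a pattern."""
    return fnmatchcase(user_agent, pattern)

# The user agent quoted earlier in this thread
example = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/51.0.27604.111 Safari/537.36")

# The same browser after an automatic update (illustrative version number)
updated = example.replace("Chrome/51.0.27604.111", "Chrome/52.0.0.0")
```

Both `example` and `updated` match the pattern, whereas an unrelated string such as "WikiTiki 1.0" does not.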

I think Huji is right. There's really no way to reliably differentiate between a UA string and a username. In fact a lot of my bots have very simple user agent strings like "WikiTiki 1.0". The suggestion of a drop-down select list sounds like a good idea.
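To illustrate why an explicit selector helps, here is a small sketch in which the search target travels alongside the value instead of being guessed from it. The names `Target`, `Query` and `describe` are hypothetical and only mirror the three drop-down options proposed above:

```python
from dataclasses import dataclass
from enum import Enum

class Target(Enum):
    IP = "IP"
    USERNAME = "username"
    USER_AGENT = "user-agent"

@dataclass(frozen=True)
class Query:
    target: Target  # chosen in the drop-down
    value: str      # typed or pasted into the textbox

def describe(query: Query) -> str:
    """Human-readable summary of which field will be searched."""
    return f"search the {query.target.value} field for {query.value!r}"
```

With this shape, `Query(Target.USERNAME, "DannyH")` and `Query(Target.USER_AGENT, "DannyH")` are distinct searches even though the typed string is identical.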

DannyH added a comment. Oct 5 2016, 9:48 PM

Okay, that can work. Thanks for helping to educate me on user-agents. :)

Huji added a comment. Oct 6 2016, 1:38 AM

Thanks @DannyH. To briefly respond to your second last comment above, using a non-standard UA generally doesn't result in any denial of service.

DannyH updated the task description. Oct 6 2016, 5:48 PM
kaldari updated the task description. Oct 6 2016, 5:51 PM
kaldari updated the task description. Oct 6 2016, 6:03 PM
DannyH updated the task description. Oct 6 2016, 10:14 PM

I've updated the wireframes, with a drop-down for IP address, Username and User agent.

DannyH raised the priority of this task from Low to Normal. Oct 11 2016, 9:25 PM
DannyH updated the task description. Oct 29 2016, 12:36 AM
DannyH updated the task description. Oct 29 2016, 1:22 AM
DannyH updated the task description. Oct 29 2016, 1:28 AM
DannyH added a subscriber: Jalexander.

I've revamped the spec and wireframes, following our meeting with @Jalexander. Now you can only search by user agent and IP together.

Huji updated the task description. Oct 29 2016, 2:34 AM

A few thoughts:

  • If we're only expecting users to provide the user agent by clicking it from a previous search, perhaps there is no input box at all? This could simplify the process.
  • Given the complexity, this system feels prone to displaying error messages. We'll need to list out the messages that could occur and make sure they use clear language.
  • Is there a loading indicator? Should we add one?

Huji added a comment. (Edited) Mar 8 2017, 4:42 AM

A few thoughts:

  • If we're only expecting users to provide the user agent by clicking it from a previous search, perhaps there is no input box at all? This could simplify the process.

But why? I think this is a useful feature. We should allow as many easy ways to search UAs as we can.

Samtar added a subscriber: Samtar. Dec 21 2017, 10:37 PM
Restricted Application added subscribers: MGChecker, alanajjar. Feb 23 2018, 11:49 PM
TBolliger moved this task from Untriaged to Backlog on the Anti-Harassment board. Mar 9 2018, 1:55 PM

Will it be possible to search for IPs using given UA? For example, UA "DannyH" will usually be uncommon and searching for it can be useful.

Will it be possible to search for IPs using given UA? For example, UA "DannyH" will usually be uncommon and searching for it can be useful.

Yes, that is one of the use cases for this tool. It will work similarly to searching by IP address in the CheckUser tool.

0x010C added a subscriber: 0x010C. Oct 31 2018, 5:58 PM

One note on the technical side of things: you can use a hash index. In that case you would lose the ability to do regex searches and pattern matching, but it's pretty fast. This is a classic example of a hash index lookup; I've done this before for other cases.

Huji added a comment. (Edited) Feb 18 2019, 10:26 PM

Please see my related comment in T147894#4962824, in which I explain why, at least for now, it is best to restrict the functionality to searching either by IP or by UA (so if both are provided, we would return an error and ask the user to remove one). It is possible (though I am not sure how likely) that allowing both the IP and the UA to be specified would require a massive index on the database tables, which would be unjustifiable given the few use cases for a joint search. I would rather defer that to a later time, in the interest of having the UA search in production in the near term.
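The either/or restriction could be sketched like this. It is an illustration under assumptions, not the actual CheckUser code: the function name `validate_search`, the error wording, and the decision to reject an empty form are all hypothetical.

```python
def validate_search(ip=None, user_agent=None):
    """Accept exactly one of IP or user agent; raise ValueError otherwise.

    Returns a (field, value) pair naming the single field to search on.
    """
    ip = (ip or "").strip()
    user_agent = (user_agent or "").strip()
    if ip and user_agent:
        # Joint searches are rejected for now, as proposed above.
        raise ValueError("Specify either an IP address or a user agent, not both.")
    if not ip and not user_agent:
        raise ValueError("Specify an IP address or a user agent.")
    return ("ip", ip) if ip else ("user_agent", user_agent)
```

Calling `validate_search(ip="203.0.113.5", user_agent="WikiTiki 1.0")` raises the "not both" error, so the user is asked to remove one of the two values before searching again.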

DannyH removed a subscriber: DannyH. Feb 19 2019, 9:45 PM