Page MenuHomePhabricator

CU 2.0: Filters in the Compare tab
Closed, ResolvedPublic3 Estimated Story Points

Description

Goal

We're introducing some filters in the Compare tab to allow users to be able to narrow down the search results and find patterns in the data.

Mock

https://prtksxna.github.io/wmf-cu-prototype/compare.html

image.png (227×775 px, 15 KB)

Acceptance criteria

To begin with, we'll have one filters in the UI -

  1. Hide the following users or IPs
    • When a username or IP address is being typed in this field, it suggests auto-complete suggestions from the users table.
    • The username/IP picked appears as a pill in the widget.
    • The capsule is blue, whether or not the username is invalid. We ran into technical problems knowing all the users across all the tabs so we aren't doing the validation check on the usernames. T247279
    • As soon as a valid username or IP address is entered or the user picks a suggestion from the auto-complete suggester, the result set re-computes to exclude the user/IP as a target. T247356

Event Timeline

Niharika triaged this task as Medium priority.Nov 21 2019, 1:52 AM
Niharika created this task.
Niharika moved this task from Untriaged to Triage/To be Estimated on the Anti-Harassment board.
From the description:

@Prtksxna should we give an error indicator perhaps?

Yeah, we can make the invalid capsules' text red. Similar to the categories input in Upload Wizard (though that is a redlink):

Screenshot 2019-11-21 at 6.37.14 PM.png (216×1 px, 41 KB)


The other option would be to not let them enter an incorrect input, and don't convert it into a capsule:

Screenshot 2019-11-21 at 6.40.49 PM.png (158×1 px, 28 KB)

Is this a validation we can run? If so I'd prefer we do this.

Taking this out for now, per our estimation meeting:

  1. Hide the following user-agents
    • When a user-agent is being typed in this field, it suggests auto-complete suggestions from the users and IPs in the table below.
    • If a user enters a user-agent that is not in the table below, nothing happens. @Prtksxna should we give an error indicator perhaps?
    • As soon as a valid user-agent is entered or the user picks a suggestion from the auto-complete suggester, the result set re-computes to exclude all rows concerning the users or IPs picked by the user.
    • The user-agent picked appears as a pill in the widget.

We can discuss this once we've figured out how we want to break down the UA and change the database accordingly.

@Prtksxna We should update this ticket after we split the form out.

@Niharika, based on the information on each page this how I am thinking of splitting the filters

CompareTimeline
User / IPUser / IP
User AgentUser Agent
Page
Type of activity

Does that sound alright?

@Niharika, based on the information on each page this how I am thinking of splitting the filters

CompareTimeline
User / IPUser / IP
User AgentUser Agent
Page
Type of activity

Does that sound alright?

Whoops, I missed this. Yes, this sounds about right.

Niharika updated the task description. (Show Details)

Change 574841 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/extensions/CheckUser@master] Add filters to the Compare tab on Special:Investigate

https://gerrit.wikimedia.org/r/574841

@Niharika @Prtksxna - Re-asking a question that @dmaza asked on gerrit, about the precise meaning of this:

  • As soon as a valid username or IP address is entered or the user picks a suggestion from the auto-complete suggester, the result set re-computes to exclude all rows concerning the users or IPs picked by the user.

The current implementation excludes the entered users/IPs from the target list, but not necessarily from the final results. This means:

  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, there could still be table rows for ExampleUser/1.2.3.4 (where 1.2.3.4 has a dark table cell to indicate that it is a target), but no rows for ExampleUser with other IPs.
  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, the table cells for 1.2.3.4 contain (~n from all users), where n includes the edits made by ExampleUser.

If this is acceptable, it's straightforward enough to implement.

Alternatively is the intention that (in the above example) there should be no table rows for ExampleUser/1.2.3.4, even though 1.2.3.4 is a target (i.e. the filter outweighs the target)? Should (~n from all users) exclude ExampleUser's edits?

If the latter, I think there could be some technical challenges. Here are some ways to do it, and the challenges involved:

  • Filter out unwanted users/IPs at the subquery stage, by adding a WHERE to each subquery to ensure that rows don't contain any of the unwanted users/IPs. This might affect performance (e.g. scanning more rows rather than relying on an index alone), and we'd need to check whether it is still acceptable.
  • Filter out unwanted users/IPs at the outer query stage. This should affect performance less, since it will be done on the limited derived table. However, if a large proportion of the results contain the unwanted users/IPs, the final result might be smaller than expected (and the rows we're actually interested in may have been cut off due to the subquery limit). That may well be an edge-case though.
  • Keep the query the same, but filter out unwanted rows in PHP. I wouldn't favour this approach, since it could give us inconsistent table lengths across the pages. (E.g. almost all the rows may be filtered out from one page, but hardly any on the next, etc.)

@Niharika @Prtksxna - Re-asking a question that @dmaza asked on gerrit, about the precise meaning of this:

  • As soon as a valid username or IP address is entered or the user picks a suggestion from the auto-complete suggester, the result set re-computes to exclude all rows concerning the users or IPs picked by the user.

The current implementation excludes the entered users/IPs from the target list, but not necessarily from the final results. This means:

  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, there could still be table rows for ExampleUser/1.2.3.4 (where 1.2.3.4 has a dark table cell to indicate that it is a target), but no rows for ExampleUser with other IPs.
  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, the table cells for 1.2.3.4 contain (~n from all users), where n includes the edits made by ExampleUser.

If this is acceptable, it's straightforward enough to implement.

Yep, this was the intention, sorry for not being clearer in the description.

Yep, this was the intention, sorry for not being clearer in the description.

Does that mean that this should be a select list of the existing targets provided (instead of a user multiselect widget that allows any user/ip to be specified)?

@dbarratt It should be a multiselect in the sense that more than one target can be selected at once.

If we want to restrict to the currently-existing targets, we may be able to pass in allowedValues as a config to the UsersMultiselectWidget.

@Niharika @Prtksxna - Re-asking a question that @dmaza asked on gerrit, about the precise meaning of this:

  • As soon as a valid username or IP address is entered or the user picks a suggestion from the auto-complete suggester, the result set re-computes to exclude all rows concerning the users or IPs picked by the user.

The current implementation excludes the entered users/IPs from the target list, but not necessarily from the final results. This means:

  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, there could still be table rows for ExampleUser/1.2.3.4 (where 1.2.3.4 has a dark table cell to indicate that it is a target), but no rows for ExampleUser with other IPs.
  • If 1.2.3.4 is a target, and ExampleUser is a filtered-out target, but has used IP 1.2.3.4, the table cells for 1.2.3.4 contain (~n from all users), where n includes the edits made by ExampleUser.

If this is acceptable, it's straightforward enough to implement.

Alternatively is the intention that (in the above example) there should be no table rows for ExampleUser/1.2.3.4, even though 1.2.3.4 is a target (i.e. the filter outweighs the target)? Should (~n from all users) exclude ExampleUser's edits?

If the latter, I think there could be some technical challenges. Here are some ways to do it, and the challenges involved:

  • Filter out unwanted users/IPs at the subquery stage, by adding a WHERE to each subquery to ensure that rows don't contain any of the unwanted users/IPs. This might affect performance (e.g. scanning more rows rather than relying on an index alone), and we'd need to check whether it is still acceptable.
  • Filter out unwanted users/IPs at the outer query stage. This should affect performance less, since it will be done on the limited derived table. However, if a large proportion of the results contain the unwanted users/IPs, the final result might be smaller than expected (and the rows we're actually interested in may have been cut off due to the subquery limit). That may well be an edge-case though.
  • Keep the query the same, but filter out unwanted rows in PHP. I wouldn't favour this approach, since it could give us inconsistent table lengths across the pages. (E.g. almost all the rows may be filtered out from one page, but hardly any on the next, etc.)

I'm sorry, I rushed to respond to this. I misunderstood a range to have individual IPs as targets and not the range itself. There can well be a use case where the user wants to remove a specific IP/username from the results but that IP/username may not be a target itself. Like if I do a lookup for User A and User B:

User A1.1.1.1
User A1.1.1.2
User B1.1.1.1
User B1.1.1.3

I should be able to remove 1.1.1.1 from the list of results. In a fairly large data set it is possible the user tries to narrow down on suspicious IPs.

So from what @Tchanders laid out, I think we want to go with the latter. I am okay if we want to iterate, starting with the former idea (since targets are a subset of all users+IPs) and making a new ticket if we want to experiment with the latter.

I am okay if we want to iterate, starting with the former idea (since targets are a subset of all users+IPs) and making a new ticket if we want to experiment with the latter.

Here's a follow up task to improve the way the filter works: T247281

Change 574841 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Add filters to the Compare tab on Special:Investigate

https://gerrit.wikimedia.org/r/574841

dom_walden added a subscriber: dom_walden.

Doing an investigation of $target_a and $target_b and filtering out $target_b is equivalent to doing an investigation on $target_a. I checked this by comparing the SQL queries performed (and to a lesser extent the output). This includes the full 100k limit getting applied to the remaining target.

The filters persist across pages and tabs.

I can submit the filter form with a non-existent target (where the input field goes red), but it just gets ignored.