Page MenuHomePhabricator

CU 2.0: Create queries required for the Timeline tab [medium]
Closed, ResolvedPublic2 Estimated Story PointsMar 25 2020

Description

As a step towards implementing the mockup detailed at the parent task T237595, this task concentrates on developing the required queries to fetch the timeline data.

Acceptance criteria:

  • Create the query needed to fetch a list of contributions of the given users and IPs under investigation, ordered by time
  • Output the results in a standardized way for the Frontend to pull this data for display (display itself will be a followup task)
  • The output should note each entry's type (log, edit, page creation) in a sensible way for (later) the frontend to be able to adjust its formatting.
  • Create integration test for the queries to make sure the data is outputted correctly

Details

Due Date
Mar 25 2020, 4:00 AM

Event Timeline

Niharika triaged this task as Medium priority.Mar 13 2020, 10:21 PM
Niharika moved this task from Untriaged to Cards ready to be discussed on the Anti-Harassment board.
Niharika updated the task description. (Show Details)
ARamirez_WMF renamed this task from CU 2.0: Create queries required for the Timeline tab to CU 2.0: Create queries required for the Timeline tab [medium].Mar 18 2020, 4:39 PM
ARamirez_WMF set the point value for this task to 2.
ARamirez_WMF set Due Date to Mar 25 2020, 4:00 AM.
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

Looking at the current "edits" option on CheckUser, there is no pagination and a limit of 5000 rows.

  1. Is this what we want here? No pagination and a sensible limit?
  2. The old implementation only allows 1 user at a time; should the 5000 limit be split by the number of users in the current investigation or do we want to allow 5000 results per user?

From the acceptance criteria, I'm unsure what the following means. Is it to display the raw data without format/links/date grouping/etc the intention?

  • Output the results in a standardized way for the Frontend to pull this data for display (display itself will be a followup task)

Is this what we want here? No pagination and a sensible limit?

When we discussed T247640 in estimation we planned to make a pager, though it looks like this hasn't been documented in the task yet.

I would avoid doing things that have no (or huge) limit. Honestly, I think it's a product problem more than a performance one, but since we are checking more than one target (unlike the original) I'd be slightly more careful, and go with pagination.

We did talk about it during estimation, I think it was talked about as if it's obvious and maybe it isn't :) we should add it in. I don't want to change the acceptance criteria while you're working on it, @dmaza in case it doesn't make sense with the way you're implementing things -- does it sound okay to add a pager to this task? or will it change the entire estimation or make it too large/different?

@Mooeypoo @dmaza Adding the pager to this task would make this task a lot larger than the others, and would be a departure from what we did for the Preliminary Check and Compare tabs.

When we discussed breaking up the work in estimation we broadly planned to do as follows:

This breaks up the first two tasks nice and evenly. As we did previously, we can write TimelineService::getQueryInfo etc knowing that they will be called by a pager.

Is this what we want here? No pagination and a sensible limit?

When we discussed T247640 in estimation we planned to make a pager, though it looks like this hasn't been documented in the task yet.

A pager makes sense, but do we want to limit the number of records to fetch for each user? To quote the inline comment on SpecialCheckUser:747

// Ordered in descent by timestamp. Can cause large filesorts on range scans.
// Check how many rows will need sorting ahead of time to see if this is too big.
// Also, if we only show 5000, too many will be ignored as well.

@dmaza in case it doesn't make sense with the way you're implementing things -- does it sound okay to add a pager to this task? or will it change the entire estimation or make it too large/different?

Not really. I'm adding a pager

Change 583511 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/CheckUser@master] Add TimelineService queries

https://gerrit.wikimedia.org/r/583511

So to summarize what has been discussed. This task is only to generate the query that will be consumed by the pager. It does not involved any data massaging or any extra handled on the pager. That work will be done as part of T247640: CU 2.0: Add the timeline tab with a list of generic structured results [medium]
With that said, is it fair to say that the last 3 bullet points on the Acceptance Criteria can be removed since they'll be handled in T247640?

@dmaza I think we can keep the last three acceptance criteria, but change the last one slightly. The way I interpret the other two, the patch already does them:

  • Output the results in a standardized way for the Frontend to pull this data for display (display itself will be a followup task)

I think this one is saying return the results in a way that the pager can interpret.

  • The output should note each entry's type (log, edit, page creation) in a sensible way for (later) the frontend to be able to adjust its formatting.

I think this one is similar to the above - making sure the type is passed onto the pager.

  • Create integration test for the queries to make sure the data is outputted correctly

I agree integration tests and output formatting tests need to wait for T247640. Could we change this here to:

  • Create unit test for the queries to make sure the query is built correctly

(As the patch already does.)

Tested the query that I'd expect the pager to build based on this patch on testwiki on the 5 most prolific users and 5 most prolific IPs - it runs instantaneously on a dataset that size.

I think we should do further tests as part of T248588 (which should be a release blocker) before enabling on larger wikis. We may need to do something along the lines of T245499, as pointed out by @dmaza.

Change 583511 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Add TimelineService queries

https://gerrit.wikimedia.org/r/583511

dom_walden added a subscriber: dom_walden.

Nothing to test on the frontend. For the backend work, I can look at the output of TimelineService::getQueryInfo(), and it looks fine.

E.g. output for an IP, IP range and username:

TimelineService::getQueryInfo( ["1.2.3.4", "1.2.3.4/16", "User1"] );                                                                                                                                                 
=> [
     "tables" => "cu_changes",
     "fields" => [
       "cuc_namespace",
       "cuc_title",
       "cuc_user",
       "cuc_user_text",
       "cuc_comment",
       "cuc_actiontext",
       "cuc_timestamp",
       "cuc_minor",
       "cuc_page_id",
       "cuc_type",
       "cuc_this_oldid",
       "cuc_last_oldid",
       "cuc_ip",
       "cuc_xff",
       "cuc_agent",
     ],
     "conds" => "cuc_ip_hex = '01020304' OR ((cuc_ip_hex >= '01020000') AND (cuc_ip_hex <='0102FFFF')) OR cuc_user = 5",
   ]

Acceptance criteria:

  • Create the query needed to fetch a list of contributions of the given users and IPs under investigation, ordered by time

The above query doesn't order by time, but I guess this will be done by whatever is going to call getQueryInfo().

  • Output the results in a standardized way for the Frontend to pull this data for display (display itself will be a followup task)
  • The output should note each entry's type (log, edit, page creation) in a sensible way for (later) the frontend to be able to adjust its formatting.

Above query returns things like cuc_type, cuc_minor, cuc_actiontext from which this can be derived.

  • Create integration test for the queries to make sure the data is outputted correctly

This new class has unit tests. Not sure about integration tests. Thalia notes this above.