Add a parameter Usercontribs API to limit results to contributions to pages that a list of users have edited
Open, LowPublic

Description

Problem
In T182548: Interaction Timeline returns only partial results on complex queries we discovered that we had exceeded the use case of the current Usercontribs API. We're also making hundreds/thousands of unnecessary requests.

Solution
Add a param to specify the other user(s) (single user for now, multiple users possibly later) who must have edited the page to include in the query.

Example
User:Apples edits Bananas, Oranges, and Cucumbers
User:Bananas edits Bananas, Avocados, and Oranges.

If I request the user contribs of User:Apples, I should be able to specify that I only contribs that are to pages that User:Bananas has edited. In that case I would get all of User:Apples contribs to Bananas and Oranges, but the contribs to Cucumbers would be excluded.

Work Around
Create Your Own API™ in T185459: Create user interaction endpoint

dbarratt created this task.Feb 5 2018, 7:51 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 5 2018, 7:51 PM
dbarratt updated the task description. (Show Details)Feb 5 2018, 8:02 PM
Anomie triaged this task as Low priority.Feb 5 2018, 10:11 PM
Anomie added a subscriber: Anomie.

This seems like a strange special case, but as stated it would be possible with caveats. I have to reject the direct approach unless the DBAs tell me it's ok if the query potentially has to touch millions of rows in some cases (i.e. bugs like T97797).

If the caveats are not acceptable and you can't propose a specific alternative plan, we may as well decline this task.

Add a param to specify the other users who must have edited the page to include in the query.

This is unclear. If you want contributions of Alice and specify Bob, Carol, and Dave as the "other users", which case are you intending?

  • Alice's contributions to pages that all of Bob, Carol, and Dave also edited.
  • Alice's contributions to pages that any one of Bob, Carol, or Dave also edited.
  • When you said "other users", you actually meant "one other user".

One possible approach would be:

  1. Fetch the uclimit contributions for the named users, as the module does already.
  2. Extract the set of pages in those contributions.
  3. Fetch the set of those pages also edited by the specified users (a SELECT on revision targeting the page_user_timestamp index, or in the future the page_actor_timestamp index)
  4. Filter the contributions from step 1 (in PHP code) to only return the pages in set 3.

Caveat: This will likely return fewer than uclimit results, possibly zero. The client will have to continue the query to get a "full" set of results (whatever it considers a "full" set to be).
Caveat: Until T167246 is complete, the "other users" must be registered accounts. The named users (for whom the contributions are being fetched) can include IPs.


A second, more complex possible approach would require a new index to the revision table on (rev_user, rev_page), or in the future (rev_actor, rev_page), or a new table that contains just the distinct rev_page and rev_user/rev_actor pairs.

  1. Fetch the next 5000 pages that the named users contributed to (a SELECT on revision targeting the new index, or on the new table).
  2. Fetch the set of those 5000 pages also edited by the specified users (a SELECT on revision targeting the page_user_timestamp (or in the future the page_actor_timestamp) index, or a SELECT on the new table)
  3. Fetch up to uclimit contributions for the named user to the set of pages in step 2 (targeting the page_user_timestamp index, or in the future the page_actor_timestamp index).

Caveat: The results will be sorted by user, then page_id, then by timestamp. The results are normally ordered by user and timestamp. Or, if T180153 happens, the results would be sorted by page_id then timestamp.
Caveat: Until T167246 is complete, the named users and the "other users" must all be registered accounts.
Caveat: You might still get fewer than uclimit results, possibly zero, although the risk is less than with the other plan.
Caveat: This probably can't work with ucuserprefix once T167246 is complete.

dbarratt added subscribers: dmaza, TBolliger.EditedFeb 5 2018, 11:15 PM

Add a param to specify the other users who must have edited the page to include in the query.

This is unclear. If you want contributions of Alice and specify Bob, Carol, and Dave as the "other users", which case are you intending?

  • Alice's contributions to pages that all of Bob, Carol, and Dave also edited.
  • Alice's contributions to pages that any one of Bob, Carol, or Dave also edited.
  • When you said "other users", you actually meant "one other user".

For now, we are only doing one other user, but that will not always be the case. I'm not sure if we would use an AND or an OR. For now it would be acceptable to only allow a single user. I'll update the description to reflect this.

One possible approach would be:

Basically what we are doing now is we select the first page of contribs for the user, then we use Revisions API to determine if the other user(s) have ever edited that page (i.e. zero revisions to a page indicate they've never edited it).

However, this creates a huge number of requests as described in T182548: Interaction Timeline returns only partial results on complex queries. To resolve this we'll query the replicas in Toolforge on T185459: Create user interaction endpoint.

This ticket is an "upstream" ticket to expand the usecase of the API. This may be a special use case, but I think there should be some way to do this with the API. If there's a better way to do this (rather than using a new parameter) please feel free to update.

dbarratt updated the task description. (Show Details)Feb 5 2018, 11:15 PM
Anomie added a comment.Feb 6 2018, 1:55 AM

The problem isn't a parameter, the problem is doing this in a way that won't let one request blow up the databases.

dbarratt added a comment.EditedFeb 6 2018, 5:12 AM

The problem isn't a parameter, the problem is doing this in a way that won't let one request blow up the databases.

well then, I'm really interested to see what @dmaza comes up with in T185459: Create user interaction endpoint

Anomie added a comment.Feb 6 2018, 3:29 PM

Note that the Toolforge replicas, particularly the "analytics" replicas, have greater leeway for inefficient queries than the production databases do. He probably can get away with the straightforward query that filters at the database level but might touch millions of rows there.