
Interaction Timeline returns only partial results on complex queries
Closed, Resolved (Public)

Description

Blocked by T184146: Sub-epic ⚡️ : Create an API service for InteractionTimeline


Problem

The Interaction Timeline returns only partial results for queries on users who have high edit counts.


Acceptance criteria

  • The tool should continue to paginate as needed to return all results for a query. If we need to transition to "Load more" button-based pagination, we should.

Event Timeline

This seems to be sort-of fixed now. Results are returned, but only some initial results. It seems to keep making API requests but not display any results beyond the first batch or so. For example, at https://tools.wmflabs.org/interaction-timeline/?wiki=enwiki&user=Missvain&user=Hullaballoo%20Wolfowitz it displays results up to August 2006, but nothing after that. Compare with https://tools.wmflabs.org/sigma/editorinteract.py?users=Missvain&users=Hullaballoo+Wolfowitz&users=&startdate=&enddate=&ns=&server=enwiki.

TBolliger renamed this task from Interaction Timeline Returns Empty Result to Interaction Timeline returns only partial results on complex queries. Dec 13 2017, 10:14 PM
TBolliger updated the task description.

Currently if I try to load https://tools.wmflabs.org/interaction-timeline?wiki=enwiki&user=Missvain&user=Funcrunch, it displays no results, but keeps making API requests indefinitely. I probably let it run for at least 10 minutes before killing it. If I load the same query in Editor Interaction Analyser (https://tools.wmflabs.org/sigma/editorinteract.py?users=Missvain&users=Funcrunch&users=&startdate=&enddate=&ns=&server=enwiki) it takes less than a minute to run. Feel free to steal whatever logic they are using :)

This is due to the way we use the combination of pages and revisions to get the interaction. Users with many edits across many pages require a lot of API calls due to the pagination limits on the API.

I think it might be time to work on an API for the timeline and query the db directly for the results.

Yeah, I think it will be quite common that this tool will be used for users that have high edit counts (as such users often accumulate enemies over the years).

@TBolliger: Wanna create a new ticket for building an API proxy for InteractionTimeline? Doing this without one just isn't going to scale. Lemme know if I can help!

Went ahead and created a task for the API service: T184146. Feel free to rewrite as needed.

TBolliger changed the task status from Open to Stalled. Jan 4 2018, 5:45 PM
TBolliger moved this task from Triage/To be Estimated to Analytics on the Anti-Harassment board.
TBolliger updated the task description.

It seems like the problem occurs if there is a high number of edits, but a low number of common pages that have been edited.

If Apples and Bananas have both made 10,000 revisions, but they have been in their own corners of the wiki until recently, we'll have to page through their revisions until we get to enough common revisions. Since these haven't happened until recently, it could take a while.

Worst case scenario is that the users have no common pages they have edited. In this case all of the revisions of both users will end up being loaded.
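To put rough numbers on that worst case, here is a back-of-the-envelope sketch. It only counts the `list=usercontribs` paging and ignores the per-page revision lookups, so the real request count would be higher. The batch sizes are assumptions: as far as I know, `uclimit` defaults to 10 and is capped at 500 for ordinary users.

```python
import math
from typing import Iterable


def worst_case_requests(edit_counts: Iterable[int], batch_size: int) -> int:
    """Requests needed to page through every contribution of every user."""
    return sum(math.ceil(count / batch_size) for count in edit_counts)


# Apples and Bananas with 10,000 revisions each and no common pages:
default_batches = worst_case_requests([10_000, 10_000], batch_size=10)
maximum_batches = worst_case_requests([10_000, 10_000], batch_size=500)
```

Raising the batch size cuts the number of round trips a lot, but the total amount of data paged through stays the same.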

Sadly, there isn't really a way to mitigate this with MediaWiki's API and I agree we should resolve this with T184146 (or this could be resolved upstream, but I don't see why this API would exist in core).

When we add deleted revisions to the timeline, we'll have to query the MediaWiki API for those (since they are not available in the Replicas), but those can be queried independently and then we'll just merge the results together within the API.
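A minimal sketch of that merge step, assuming both sources hand back revisions already sorted by timestamp. The `Revision` tuple and its field names are hypothetical, not the service's actual data model.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Revision(NamedTuple):
    # Hypothetical shape; the real service would likely carry more fields.
    timestamp: str  # ISO 8601 UTC, e.g. "2017-03-01T12:00:00Z" (sorts as text)
    user: str
    page_id: int
    deleted: bool


def merge_timelines(
    live: Iterable[Revision], deleted: Iterable[Revision]
) -> Iterator[Revision]:
    """Lazily merge two chronologically sorted streams into one timeline."""
    return heapq.merge(live, deleted, key=lambda rev: rev.timestamp)
```

Because `heapq.merge` is lazy, the two sources can be paged independently and the combined timeline is produced one revision at a time.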

Why does it show results for one of the users if there are no common pages edited? Shouldn't it say "No interactions found" or something like that?
See https://tools.wmflabs.org/interaction-timeline/?wiki=enwiki&user=Oshwah&user=Drewmutt

@Niharika There are common pages, see https://tools.wmflabs.org/interaction-timeline/?wiki=enwiki&user=Oshwah&user=Drewmutt&startDate=1488387600.
If there is no interaction it currently says No Results. Perhaps we should rephrase that message.

@TBolliger, @dbarratt
This example does bring another issue(?). Should we display only one user's edit so far back before an actual interaction happened?

@dmaza If I go to https://tools.wmflabs.org/interaction-timeline/?wiki=enwiki&user=Oshwah&user=Drewmutt (without a start date), I see -

[Screenshot: Screen Shot 2018-02-05 at 5.27.11 PM.png]

Some more similar records for first user after this. Nothing for the second user.

Yup, this is because edits are sorted chronologically, and in this case you'll only see Oshwah's edits until Drewmutt makes the first edit on a common page, which can take several pages of scrolling.

In my previous comment I said

This example does bring another issue(?). Should we display only one user's edit that far back before an actual interaction happened?

To illustrate this better: Oshwah has been making edits to their talk page since 2007 and Drewmutt made one edit in 2017, so you will see hundreds of edits from Oshwah until you get to Drewmutt's. Does that make sense?

Not to be that guy, but is this a useful behaviour?

The revision displays if the other user has ever edited that page. In the case of the first entry:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=8786141&rvuser=Drewmutt&formatversion=2

but you haven't gotten to those revisions yet because you're still in March of 2017 and Drewmutt's edits start in 2006
https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=Drewmutt&ucdir=newer&formatversion=2

So your browser has to page through all of Drewmutt's edits (starting in 2006) until we can find common edits (since we don't know when that starts). A better way to do this would be to get the first revision of each user and internally set the start date to the later of the two.
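A rough sketch of that idea, assuming one `list=usercontribs` request per user with `ucdir=newer` and `uclimit=1` to get the oldest edit. The helper names are made up for illustration, and the HTTP call itself is left out.

```python
from datetime import datetime, timezone
from typing import Dict


def first_edit_params(user: str) -> Dict[str, str]:
    """Query parameters asking list=usercontribs for a user's oldest edit only."""
    return {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,
        "ucdir": "newer",  # oldest first
        "uclimit": "1",
        "ucprop": "timestamp",
        "format": "json",
        "formatversion": "2",
    }


def parse_ts(ts: str) -> datetime:
    """Parse an API timestamp such as "2017-03-01T17:00:00Z" as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def effective_start(first_edit_a: str, first_edit_b: str) -> int:
    """Unix timestamp of the later of the two users' first edits."""
    return int(max(parse_ts(first_edit_a), parse_ts(first_edit_b)).timestamp())
```

The result could be used internally in the same way as the `startDate` query parameter, so nothing before it ever needs to be paged through.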

Not to be that guy, but is this a useful behaviour?

Going with the solution above, the only problem is that you might care about edits before the first time the second person edited, but I don't think there is any value in the time before the second person (whichever user that might be) ever edited. So I'm good with ignoring all edits before the second person's first edit.

This example does bring another issue(?). Should we display only one user's edit so far back before an actual interaction happened?

I think there's little value in edits before the first person's edit, but I don't know if interaction is a good measure since they could have interacted on a page and that interaction could be years apart (which could bring you to the same problem).

@jrbs I don't think it is useful, that's why I'm raising the question if we should display so many edits that far apart from the first interaction. One solution to this would be to add more filters, like display X edits before/after each interaction.

That sounds like a great solution to me! Sorry to butt in.

How do you define "interaction"?

Okay, I get that now. But wait, so if I don't scroll down, it doesn't load interactions? As in, if I leave the tab open while I do something else and go back to it, it wouldn't have loaded all of the interactions in the meantime because I didn't scroll? That's what seems to be happening, from what I can tell. The loading indicator also stops appearing after a certain point.

Oh no, it will keep loading until it finishes loading the first "page". If you open the network tab in your console, you'll see it keep running queries until it gets the first "page". I got to 650 requests before I killed my tab.

Hmm, I'm still not sure what first "page" means. For that example, it repeatedly kept "dying" unless I scrolled and then it would pull more records into view. My console had 96 errors like this one:

[Error] Failed to load resource: The Internet connection appears to be offline. https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=Drewmutt&ucdir=newer&format=json&origin=*&formatversion=2&ucstart=1488387600&uccontinue=20170405035245%7C773907587 (api.php, line 0)

(This happened on Safari.)

It seems to have stopped pulling records after 2008-09-08, where Drewmutt edited Tardigrade. It doesn't pull any more records, even if I scroll. The loading indicator flashes for a couple of seconds and then goes away. I can do a screen recording to show you. As in, https://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=8786141&rvuser=Drewmutt&formatversion=2 never shows up.

uhh, a page would be enough revisions to fill your viewport. If you zoom out (or scroll) it will have to make many more requests to fill the viewport.

Can we make it keep going until it has at least one record for each user? Would that be too much? Like in this example, one page fetched only 12 records for User:Oshwah but none for Drewmutt.
My expectation was that if I'd move away from the tab, it'll keep loading interactions in the background and I can go back to it after a few minutes to see a full record of interactions. I'm not sure if that's just me though.
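For what it's worth, "keep going until there is at least one record for each user" could look roughly like the sketch below. Everything here is hypothetical: `fetch_batch` stands in for whatever function performs one paginated request and returns the revisions plus a continuation token, and the `max_requests` cap is an assumption added so the loop cannot run forever.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# One paginated response: (revisions, continuation token or None when exhausted)
Batch = Tuple[List[Dict[str, str]], Optional[str]]


def load_until_all_users_seen(
    fetch_batch: Callable[[Optional[str]], Batch],
    users: Set[str],
    max_requests: int = 50,
) -> List[Dict[str, str]]:
    """Keep requesting batches until every user has at least one revision."""
    revisions: List[Dict[str, str]] = []
    seen: Set[str] = set()
    token: Optional[str] = None
    for _ in range(max_requests):
        batch, token = fetch_batch(token)
        revisions.extend(batch)
        seen.update(rev["user"] for rev in batch if rev["user"] in users)
        # Stop once every user has shown up, or when there is nothing left.
        if seen >= users or token is None:
            break
    return revisions
```

This only changes when the initial load stops. It does nothing about the number of requests needed to get there, which is the cost raised elsewhere in this thread.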

I like that. I created T186637: Interaction Timeline should continue initial pagination until at least one revision is retrieved for each user to further discuss that. However, would it be better to just ignore those revisions?

Get the first revision of User A and User B and set the "startDate" (internally) to the more recent of the two?

However, would it be better to just ignore those revisions?

That's pretty much what I'm proposing here with a filter.

Auto-paginating until the first interaction feels like a workaround for the problem. Plus, it can take forever and make many unnecessary requests, and it won't be the best user experience.

What if someone doesn't specify the filter? You'd be back to the same problem.

I'm wondering if it's better to just do it for them (Don't Make Me Think™) and not have the filter (we can always add it if people request it).

Alternatively, I guess we could set the filter's default to 0 or 1 and I suppose we could also have a checkbox on the whole option if you really want every revision in that timeframe (or all time).

Both options are valid as long as we are transparent on how they affect the results

@jrbs I don't think it is useful, that's why I'm raising the question if we should display so many edits that far apart from the first interaction. One solution to this would be to add more filters, like display X edits before/after each interaction.

That sounds like a great solution to me! Sorry to butt in.

I second this. T186637 will get us part of the way there, but for discrepancies as wide as 2007 and 2017 it won't make as much sense. T184147: Interaction Timeline: Add a sane default value for the start date is our cheat for queries like this for now. I'm going to hold off on creating a task until we see if this solves the issue.

A quick suggestion: fetch the registration time of the two users. No interaction can occur before the latest of these two dates, so you can strictly limit fetching contribs of both users to after this time without affecting the results.

Nice, this is another great idea that could solve this problem.

It's only a partial solution -- if you've got two users that have been around since (say) 2006 with lots of edits it won't help much. But this improvement costs you just one API request. (You also get other metadata, such as whether the user is currently blocked, for free.)
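A small sketch of that single request and how its result might be used. It assumes `list=users` with `usprop=registration|blockinfo` and a `formatversion=2` response; the function names are made up, and the handling of a missing or null `registration` value (which I believe can happen for very old accounts) is an assumption.

```python
from typing import Any, Dict, List, Optional


def registration_params(users: List[str]) -> Dict[str, str]:
    """Query parameters for one list=users request covering all users."""
    return {
        "action": "query",
        "list": "users",
        "ususers": "|".join(users),
        "usprop": "registration|blockinfo",
        "format": "json",
        "formatversion": "2",
    }


def latest_registration(response: Dict[str, Any]) -> Optional[str]:
    """Latest known registration timestamp, or None if none is known.

    Users without a registration value are skipped: the latest known
    date is still a valid lower bound for any interaction.
    """
    stamps = [
        user["registration"]
        for user in response.get("query", {}).get("users", [])
        if user.get("registration")
    ]
    # Timestamps in the same ISO 8601 UTC format sort correctly as text.
    return max(stamps) if stamps else None
```

The returned timestamp could then serve as the earliest point from which to fetch both users' contribs.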

That was my suggestion! :) I was just thinking first revision rather than registration time, but registration time works too (and is probably better since then you won't miss the edits before the user's first edit).

TBolliger claimed this task.

This is going to be a really hard one to test (no human can cross-reference every edit by users with this many edits) but I think we're good.

I looked at all the examples in the description and they scrolled and scrolled and scrolled and scrolled. I set specific date ranges and the Timeline was zippy in updating. I looked at one example to see if all the expected edits appeared for a misc. month on both users' Special:Contributions pages, and they did.

A very noticeable improvement! Great work!