Page MenuHomePhabricator

API:Usercontribs has inconsistent ordering
Closed, InvalidPublic

Description

During T178715 we found some inconsistencies to how the Usercontribs API orders things when there are multiple users.

If you look at a query like this:
https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=Davidwbarratt|Kaldari&uclimit=500&ucdir=older
I would expect Kaldari's edits to be almost all of the results, however, this result is in there:

{
   "userid": 15989147,
   "user": "Davidwbarratt",
   "pageid": 2212828,
   "revid": 468108124,
   "parentid": 429907077,
   "ns": 0,
   "title": "Palmer Theological Seminary",
   "timestamp": "2011-12-28T16:46:01Z",
   "comment": "",
   "size": 6288
}

Kaldari has many many many more edits that are newer than this.

However, if you reverse the query:
https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=Davidwbarratt|Kaldari&uclimit=500&ucdir=newer
you do not get any results from Davidwbarratt which is what I would expect.

What is going on here? I thought at first it was running two different queries and then stacking the results, but then the reverse would not have had the same effect.

I expected the query to be run with multiple users, something like IN ('Davidwbarratt', 'Kaldari'), but that does not seem to be what is happening.

If we can't resolve this, I think it might be better to not support more than one user in a query.

Work Around
Make a tool on Toolforge that queries the database and returns the results in JSON format. In other words: Make Your Own API™

Event Timeline

Anomie subscribed.

TL;DR: The ordering is entirely consistent, it's just not as you expected. In this case, the results are first ordered by user ID then by timestamp. Thus, I'm closing this as Invalid. See below for a different task you might file if you want to propose changing the ordering to match what you expected.


To be at all efficient, the results need to be ordered in a manner that matches the database indexes used to fetch the data. Otherwise the database is going to have to fetch all the potentially-millions of results and sort them to pick the first 500 instead of being able to stop as soon as it has 500.

In the case of your request here, there are a few possible options for what the database could do:

  1. Look through over 700,000,000 revisions ordered by timestamp (index rev_timestamp) to find the handful that were made by one of these two users. At least it can stop once you find 501, but it'll probably go through millions of rows before it finds 501 that match.
  2. Fetch all the revisions by either user (around 60042 in this case, but choose some of the users from the top of this list or this list and you can see that could easily reach the millions), then order them by timestamp, then throw away all but the first 501.
  3. Look through the revisions by each user in turn, ordered by timestamp (index user_timestamp), stopping when it gets 501.
  4. Look through the revisions by each individual user, ordered by timestamp (index user_timestamp again), stopping for each when it gets 501, then merge those lists together and re-order by timestamp and throw away all but the first 501. We'd have to experiment to see if the worst case (a user with apihighlimits supplying the maximum of 500 users to fetch contributions for means 2505000 rows fetched in total) still has satisfactory performance.

#1 is obviously not going to work, that would take forever. #2 falls over when someone queries a bot's contributions. Between #3 and #4, #3 is much simpler and more efficient so that's what is being done. Thus, you get results ordered by user first and then by timestamp for each user (and then by revision ID if the same user made multiple revisions in 1 second).

The database query for #3 looks something like

SELECT rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted, rev_page, page_latest, rev_len, rev_minor_edit, rev_parent_id, rev_comment AS `rev_comment_text`, NULL AS `rev_comment_data`, NULL AS `rev_comment_cid` FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user IN ('15989147', '59944') ORDER BY rev_user DESC, rev_timestamp DESC, rev_id DESC LIMIT 501

Which is probably about what you expected, except that rev_user DESC is at the front of the ORDER BY clause. Removing that would give the query for #1 or #2 instead (the database would choose between the two plans depending on which it thought was best).

The database query for #4 would look something like this instead.

(SELECT rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted, rev_page, page_latest, rev_len, rev_minor_edit, rev_parent_id, rev_comment AS `rev_comment_text`, NULL AS `rev_comment_data`, NULL AS `rev_comment_cid` FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user = '15989147' ORDER BY rev_user DESC, rev_timestamp DESC, rev_id DESC LIMIT 501)
UNION
(SELECT rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted, rev_page, page_latest, rev_len, rev_minor_edit, rev_parent_id, rev_comment AS `rev_comment_text`, NULL AS `rev_comment_data`, NULL AS `rev_comment_cid` FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user = '59944' ORDER BY rev_user DESC, rev_timestamp DESC, rev_id DESC LIMIT 501)
ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 501

And if there are 500 users being queried instead of 2, that becomes a lot longer since it has to repeat the parenthesized select for each one. It's certainly doable, and if you want to propose that feel free to file a separate task requesting it. But expect it to be postponed until after https://gerrit.wikimedia.org/r/#/c/380669/ is merged and the transitional code being added there is removed.

I thought at first it was running two different queries and then stacking the results, but then the reverse would not have had the same effect.

The original query did Davidwbarratt first and then Kaldari, since 15989147 comes before 59944 in descending order.

The reversed query looks like

SELECT rev_id, rev_timestamp, page_namespace, page_title, rev_user, rev_user_text, rev_deleted, rev_page, page_latest, rev_len, rev_minor_edit, rev_parent_id, rev_comment AS `rev_comment_text`, NULL AS `rev_comment_data`, NULL AS `rev_comment_cid` FROM `page`, `revision` WHERE (page_id=rev_page) AND ((rev_deleted & 4) != 4) AND rev_user IN ('15989147', '59944') ORDER BY rev_user ASC, rev_timestamp ASC, rev_id ASC LIMIT 501

Here it does Kaldari first, since 59944 comes before 15989147 in ascending order. And since Kaldari has more than 500 edits, it doesn't get to Davidwbarratt yet. If you follow the continuation far enough, you'll find Davidwbarratt's edits are included later on.

P.S. To save you some surprise, if you include a non-registered user in the list or use ucuserprefix then it will sort by user name instead of user ID. In the future the ordering will be by actor ID instead of user ID or name in the non-ucuserprefix case.

MZMcBride renamed this task from API:Usercontribs has inconsistant ordering to API:Usercontribs has inconsistent ordering.Nov 9 2017, 4:59 PM

@Anomie well that's fascinating. Thank you for the detailed explanation.

I guess my question is:

Should we provide an option to order the results by timestamp or should we encourage people to run their own queries (as you provided) on Toolforge?

It looks like T37349 was closed by the proposer after their patch ran into problems. As I mentioned, it's doable (although we might have to reduce the maximum uclimit and/or maximum number of ucusers when it's being done) for the non-prefix case, but the necessary logic to actually do it is complex. It'd probably be best to start fresh in a new task if we want to pursue it.