Page MenuHomePhabricator

usercontribs API: sort results by timestamp,user or user,timestamp
Closed, DeclinedPublic

Description

Author: ismael.bouya

Description:
patch for usercontribs

Hi,
This is a very small patch (10 lines including description and comments) for the "usercontribs" api, that adds the possibility to sort results by timestamp,user or user,timestamp (depending on a new param "mixed") instead of only user,timestamp.

Default behavior is not modified if the param mixed is not given, thus we keep ascendant compatibility.

Would it be possible to review and submit the patch?

Thanks a lot!


Version: unspecified
Severity: enhancement

attachment patch ignored as obsolete

Details

Reference
bz35349

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 12:17 AM
bzimport set Reference to bz35349.
bzimport added a subscriber: Unknown Object (MLST).

Bryan.TongMinh wrote:

This might have performance implications if the sort key is not indexed. Could you run the SELECT query with EXPLAIN in front of it, and post the output here?

sumanah wrote:

Thanks for the patch! Also, you might be interested in getting a Git account so you can submit future patches more easily: https://labsconsole.wikimedia.org/wiki/Help:Access

ismael.bouya wrote:

Hi, thanks for your interest!
The output of the explain select is exactly the same if I run with or without "mixed", are you sure it is the correct test for checking performance ?

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE page ALL PRIMARY NULL NULL NULL 1 Using temporary; Using filesort
1 SIMPLE revision range usertext_timestamp usertext_timestamp 257 NULL 2 Using where; Using join buffer

(and 257 becomes 271 if there is a uccontinue condition)

By the way, I deleted an important part of the patch when I cleared all the changes I made in the file, I just resubmitted - and triple checked that I didn't delete anything else - another one.

@Sumana: The "developper" page said that Git will only work from tomorrow, otherwise I would have tried to use it instead of the instructions with SVN :) . I'll think about it, but maybe I will only be an occasionnal contributor to wikimedia

ismael.bouya wrote:

second patch, corrected

attachment patch ignored as obsolete

What about a parameter ucsort with the values "user" (as default) and "timestamp" like list=categorymembers with "sortkey" and "timestamp" has, instead of parameter "mixed"?

ismael.bouya wrote:

If there is something already existant like that it would be a good thing to do that yes, do you want me to adapt the patch or will the maintainer do it? (it's only a matter of minutes to do that...)

(In reply to comment #6)

If there is something already existant like that it would be a good thing to do
that yes, do you want me to adapt the patch or will the maintainer do it? (it's
only a matter of minutes to do that...)

It would be good if you would since it makes your patch immediately more usable to us.

ismael.bouya wrote:

New ucsort key instead of mixed

Attached:

ismael.bouya wrote:

No problem, just resubmitted it :)

Since I didn't have any clue that any of you where maintainers or simple users I didn't know whether I reached the correct audience or not, and whether you where interested in the patch or not.

Bryan.TongMinh wrote:

(In reply to comment #3)

Hi, thanks for your interest!
The output of the explain select is exactly the same if I run with or without
"mixed", are you sure it is the correct test for checking performance ?

id select_type table type possible_keys key key_len ref
rows Extra
1 SIMPLE page ALL PRIMARY NULL NULL NULL 1 Using
temporary; Using filesort
1 SIMPLE revision range usertext_timestamp usertext_timestamp
257 NULL 2 Using where; Using join buffer

(and 257 becomes 271 if there is a uccontinue condition)

Yes, EXPLAIN is normally used to get an indication of the database performance.
From the indices on the revision table it looks like this should work, but I think somebody who knows more about query performance should have a look at this.

ismael.bouya wrote:

Should I resubmit it to gerrit now that it is open or will it do there?

sumanah wrote:

Immae, go ahead and submit your change to Git via Gerrit. Thanks!

ismael.bouya wrote:

(was that the correct adress I was supposed to give? It wasn't very clear in the documentation, certainly because of my not-so-good english)

(In reply to comment #10)

(In reply to comment #3)

Hi, thanks for your interest!
The output of the explain select is exactly the same if I run with or without
"mixed", are you sure it is the correct test for checking performance ?

id select_type table type possible_keys key key_len ref
rows Extra
1 SIMPLE page ALL PRIMARY NULL NULL NULL 1 Using
temporary; Using filesort
1 SIMPLE revision range usertext_timestamp usertext_timestamp
257 NULL 2 Using where; Using join buffer

(and 257 becomes 271 if there is a uccontinue condition)

Yes, EXPLAIN is normally used to get an indication of the database performance.
From the indices on the revision table it looks like this should work, but I
think somebody who knows more about query performance should have a look at
this.

Both "ALL" and "Using filesort" are red flags.

Here are my explains from the toolserver:

mysql> explain select page_namespace, page_title, rev_id, rev_timestamp from page, revision where page_id=rev_page and rev_user_text LIKE 'J%' order by rev_user_text, rev_timestamp limit 51 \G

  • 1. row ******* id: 1 select_type: SIMPLE table: revision type: range

possible_keys: PRIMARY,page_timestamp,usertext_timestamp

    key: usertext_timestamp
key_len: 257
    ref: NULL
   rows: 46784760
  Extra: Using where; Using index
  • 2. row ******* id: 1 select_type: SIMPLE table: page type: eq_ref

possible_keys: PRIMARY

    key: PRIMARY
key_len: 4
    ref: enwiki.revision.rev_page
   rows: 1
  Extra:

2 rows in set (0.00 sec)

mysql> explain select page_namespace, page_title, rev_id, rev_timestamp from page, revision where page_id=rev_page and rev_user_text LIKE 'J%' order by rev_timestamp, rev_user_text limit 51 \G

  • 1. row ******* id: 1 select_type: SIMPLE table: revision type: range

possible_keys: PRIMARY,page_timestamp,usertext_timestamp

    key: usertext_timestamp
key_len: 257
    ref: NULL
   rows: 46784760
  Extra: Using where; Using index; Using filesort
  • 2. row ******* id: 1 select_type: SIMPLE table: page type: eq_ref

possible_keys: PRIMARY

    key: PRIMARY
key_len: 4
    ref: enwiki.revision.rev_page
   rows: 1
  Extra:

2 rows in set (0.00 sec)

Here's what happened when I ran these queries on the toolserver:

mysql> select page_namespace, page_title, rev_id, rev_timestamp from page, revision where page_id=rev_page and rev_user_text LIKE 'J%' order by rev_user_text, rev_timestamp limit 51;
[...]
51 rows in set (0.00 sec)

mysql> select page_namespace, page_title, rev_id, rev_timestamp from page, revision where page_id=rev_page and rev_user_text LIKE 'J%' order by rev_timestamp, rev_user_text limit 51;
[after about a minute]
^CCtrl-C -- sending "KILL QUERY 9922979" to server ...
Ctrl-C -- query aborted.
ERROR 1028 (HY000): Sort aborted

This makes sense when you consider how MySQL is executing this query: it fetches all rows whose user starts with a J (according to EXPLAIN there are approximately 46 million of these), then sorts those by (timestamp,user) using a slow filesort (unindexed sort).

So yes, this patch has major performance implications. An API query like ucuserprefix=J&ucsort=timestamp would result in an SQL query that takes over a minute to execute.

ismael.bouya wrote:

Hi,
Thanks for your explanations.

Would it be the same if we use ucuser instead of ucuserprefix? (i.e. we forbid to use it on ucuserprefix?)

Maybe the discussion should continue on gerrit since I reposted it there?

ismael.bouya wrote:

If I understood well what you said, the problem is when mysql tries to sort the result, which is much more smaller when there are only a limited number of users?
How long would a query take in that last case?

(In reply to comment #16)

Hi,
Thanks for your explanations.

Would it be the same if we use ucuser instead of ucuserprefix? (i.e. we forbid
to use it on ucuserprefix?)

Maybe the discussion should continue on gerrit since I reposted it there?

With a single user it would work, but then sorting by (user,timestamp) produces the same output as sorting by (timestamp,user) because the value of 'user' is always the same. As soon as you allow multiple users, you run into the slow query problem: the 500 most active users on enwiki probably account for a couple million edits.

ismael.bouya wrote:

Ok thanks.
I'll try to close the "bugreport" if I manage to...