Page MenuHomePhabricator

Create (rev_timestamp,rev_user_text) index
Closed, DeclinedPublic

Description

Currently, the revision table has the usertext_timestamp (rev_user_text,rev_timestamp) index. I was kinda forced to use this for the API's ucuserprefix feature:

SELECT stuff FROM revision LEFT JOIN page ON page_id=rev_page WHERE rev_deleted = 0 AND rev_user_text LIKE '123.456.%' ORDER BY rev_user_text DESC, rev_timestamp DESC

The ORDER BY part is weird, but required to be able to se the usertext_timestamp index. It results in some counterintuitive behavior though: the API sorts contributions by username (Z to A) first, then by timestamp. Sorting the other way around (first by timestamp, then by username) would be more useful and more intuitive, but is currently impossible without killing performance.

I therefore suggest another index be created that sorts the other way, e.g. timestamp_usertext (rev_timestamp, rev_user_text).


Version: 1.13.x
Severity: enhancement

Details

Reference
bz13622

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 10:05 PM
bzimport set Reference to bz13622.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 4789
Patch for tables.sql

Attached:

Hmm, two problems.

First, this is going to be a pretty inefficient query, especially on a site like English Wikipedia where there are many thousands of contributions each day. Since only a tiny handful of them are going to match your user prefix, you have to sort through *every edit to the entire wiki* until you've matched X number of rows on the user portion of the index.

Second, the user portion of this index would only come into play *if there are multiple edits made in the same second from the same username prefix*.

If making this query, you may as well simply use the existing rev_timestamp index -- it'll be more efficient since it's a smaller index, and will give the same results in nearly all cases. But it'll still be very inefficient, because you have to sift through tens or hundreds of thousands of rows to get a handful of matches.

(In reply to comment #2)

Hmm, two problems.

First, this is going to be a pretty inefficient query, especially on a site
like English Wikipedia where there are many thousands of contributions each
day. Since only a tiny handful of them are going to match your user prefix, you
have to sort through *every edit to the entire wiki* until you've matched X
number of rows on the user portion of the index.

Isn't that also true for the [[Special:Contributions]] query?

Second, the user portion of this index would only come into play *if there are
multiple edits made in the same second from the same username prefix*.

If making this query, you may as well simply use the existing rev_timestamp
index -- it'll be more efficient since it's a smaller index, and will give the
same results in nearly all cases. But it'll still be very inefficient, because
you have to sift through tens or hundreds of thousands of rows to get a handful
of matches.

That's true, but I thought rev_user_text had to be indexed somewhere too as I'm using it in a WHERE clause (not a DB optimization expert).

(In reply to comment #3)

Isn't that also true for the [[Special:Contributions]] query?

Special:Contributions does only exact username matches, which is efficient -- it hits the first part of the username_timestamp index (the username), then goes on to the timestamp part (the date sorting).

If making this query, you may as well simply use the existing rev_timestamp
index -- it'll be more efficient since it's a smaller index, and will give the
same results in nearly all cases. But it'll still be very inefficient, because
you have to sift through tens or hundreds of thousands of rows to get a handful
of matches.

That's true, but I thought rev_user_text had to be indexed somewhere too as I'm
using it in a WHERE clause (not a DB optimization expert).

Indexing isn't required; sometimes it helps you, sometimes it doesn't. Where it helps is when you can avoid having to look at a large portion of your data set -- the index lets you skip a lot of rows that won't match.

A timestamp_username index would only help this query on the timestamp part. Since the timestamps differ between almost every edit, you will almost never have the chance to compare the second, user part of that index.

(In reply to comment #4)

Indexing isn't required; sometimes it helps you, sometimes it doesn't. Where it
helps is when you can avoid having to look at a large portion of your data set

  • the index lets you skip a lot of rows that won't match.

A timestamp_username index would only help this query on the timestamp part.
Since the timestamps differ between almost every edit, you will almost never
have the chance to compare the second, user part of that index.

OK, so let's generalize: is there any way to optimize or change the current query (pasted below) so it doesn't sort in a counter-intuitive way while not killing server kittens?

SELECT stuff FROM revision WHERE rev_deleted=0 AND rev_user_text LIKE 'Abc%' ORDER BY rev_user_text DESC, rev_timestamp DESC

I'd like to sort this just by rev_timestamp, but that causes a filesort, as does sorting the other way (rev_timestamp before rev_user_text). That's why I thought having a reverse index would help as that would ostensibly allow ORDER BY rev_timestamp DESC, rev_user_text DESC

I don't believe there's an efficient way to do this, no.

The index you request would theoretically help with a sort BY rev_timestamp DESC, rev_user_text DESC, *however*:

  1. It would *not* help with the query -- it would fail to cut down the data set to a manageable size for your WHERE clause, because it cannot optimize the LIKE match
  1. The difference in sorting results would rarely, if ever, be any different from sorting BY rev_timestamp DESC alone

Now, if your result set ends up being *small*, then you can just use the user_timestamp index for the matching and do the sorting afterwards relatively efficiently. But if it's large (say, your username prefix matches thousands or tens of thousands or millions of edits), then that's not going to work, so you've got no general solution.

I think to generalize, you'd have to add 254 new indexes: one for each possible length of prefix. Since these would be.... well, kind of *huge* :) it wouldn't be generally feasible.

Hmm, that sucks. I think I'm gonna disable ucuserprefix and support CIDR ranges then (because that's what they're used for)