Page MenuHomePhabricator

"Last 10 editors" function fails on PostgreSQL
Closed, ResolvedPublic

Description

PostgreSQL complains about the following query:

SELECT DISTINCT rev_user_text FROM revision WHERE rev_page = 42 ORDER BY rev_timestamp DESC LIMIT 10

with the following error:

ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

This makes sense: Consider if user A edits the page first, then users B to Z edit, then A edits again. Should the above query sort A at the beginning or the end of the list? It seems MySQL only gets the expected result here by chance, BTW: see http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html and http://archives.postgresql.org/pgsql-sql/2007-02/msg00169.php for more info. When I was testing this on a simplified table, MySQL was giving the "wrong" answer until I added a "rev_page" column and the page_timestamp index.

Unfortunately, I can't think of an alternate query that won't make MySQL filesort.


Version: unspecified
Severity: normal

Details

Reference
bz18078

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.
StatusSubtypeAssignedTask
InvalidNone
Resolvedaaron

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:32 PM
bzimport added a project: AbuseFilter.
bzimport set Reference to bz18078.
bzimport added a subscriber: Unknown Object (MLST).

See $db->implicitGroupby() as a way to write two queries for cases like this.

implicitGroupby would only fix the problem here accidentally. The real problem is:

  1. The behavior of "SELECT DISTINCT foo FROM baz ORDER BY bar" is not well defined (except in the case that (foo,bar) could be a unique key for the result set). MySQL ignores the undefinedness and arbitrarily chooses the value of bar from the first row fetched, while PostgreSQL throws an error. The SQL standard seems to call for PostgreSQL's behavior, BTW.
  2. MySQL filesorts for any well-defined variation of the query that I can think of, for example "SELECT foo, MAX(bar) as max_bar FROM baz GROUP BY foo ORDER BY max_bar". Unless the rules are different for AbuseFilter than for API queries, a query that filesorts will bring the wrath of domas upon us.

As you implied, one possible "fix" for the problem is to continue using the accidentally-working query for MySQL and the correct query for PostgreSQL. implicitGroupby() could act as a "is this MySQL?" flag when we consider only MySQL versus PostgreSQL, but it makes as much sense as using cascadingDeletes(), cleanupTriggers(), strictIPs(), implicitOrderby(), realTimestamps(), searchableIPs(), or functionalIndexes() for the same purpose since the problem has nothing to do with whether the database sorts the result rows to implement GROUP BY. Better IMO would be to explicitly check $wgDBtype == 'mysql', since then it's clearly marked as being a MySQL-specific hack.

BTW, as far as I can tell the reason the query works in MySQL is because it uses the page_timestamp index to fetch the rows, and the "ORDER BY rev_timestamp DESC LIMIT 10" somehow causes it to use the index in reverse order. This makes its arbitrary choice of which rev_timestamp to keep be the maximum timestamp, so the later application of the ORDER BY does what we wanted. If anything changes to make the arbitrary choice not be the maximum rev_timestamp, it will start giving incorrect results. In fact, I found in my testing that simply leaving out the "LIMIT 10" seems to make MySQL use the page_timestamp index in forward order, so the ORDER BY sorts by each user's earliest edit rather than their most recent.

Interesting. Well, if this is another MySQLisms, the proper thing to do here (other than finding a way to write this in a standard way that makes MySQL happy) is to create another "implicitGroupBy()" like attribute and set it true for MySQL and false for the rest. This avoids hardcoding the '== mysql' bit and makes future changes easier in the off chance that some other database has the same behavior as MySQL. It would also be great to document this heavily in the code in question, or at least add a pointer to this thread.

The other thing that could be done is to just drop the DISTINCT.

(In reply to comment #4)

The other thing that could be done is to just drop the DISTINCT.

That would change how the "Last 10 editors" function works, though. If User:Example is one of those people who doesn't use the preview button, you could easily end up with that returning "Example, Example, Example, Example, Example, Example, Example, Example, Example, Example" rather than Example and 9 other editors.

(batch change)

Minor bugs that nevertheless need looking into

aaron claimed this task.
aaron subscribed.

This query doesn't have DISTINCT anymore.

Jdforrester-WMF subscribed.

Migrating from the old tracking task to a tag for PostgreSQL-related tasks.