API uses non-unique value for paging for some modules
Closed, ResolvedPublic

Description

When listing recentchanges, the API uses rctimestamp for paging, which is not a unique index:

<query-continue>

<recentchanges rcstart="2010-07-14T14:21:31Z" />

</query-continue>

This is quite bad: say, there where 23 changes in 14:21:39, but the first call only retrieved 10. Using 14:21:39 for rcstart for the next page will return the *same* 10 again, not any of the 134 remaining records. This might even put the client into an infinite loop.

Using 14:21:40 would not be better: it go to the next timestamp, thus effectively skipping the 13 items we didn not yet read for 14:21:39.

<query-continue> absultely MUST use a unique index. The only unique index for recentchanges is rcid - but is it guaranteed to increase monotonously? Maybe <query-continue> could use a compund index too, like <recentchanges rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ?

The problem is also how to change this in a B/C fashion. In theory, the client should simply use whatever attriubute it finds in <query-continue><recentchanges>. But can we be sure non has the timestamp hard-coded for that? Would it be ok to break clients that rely on rcstart being a timestamp?


Version: 1.17.x
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46508
https://bugzilla.wikimedia.org/show_bug.cgi?id=65251

bzimport added a project: MediaWiki-API.Via ConduitNov 21 2014, 11:09 PM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz24782.
daniel created this task.Via LegacyAug 13 2010, 2:42 PM
Reedy added a comment.Via ConduitAug 13 2010, 3:37 PM

I don't think we'd use 2 seperate attributes, the normal "api way" is to use either a globally unique thing, or a composite key "id|somefield|anotherfield" or something to that effect

Catrope added a comment.Via ConduitAug 13 2010, 3:45 PM

(In reply to comment #0)

<query-continue> absultely MUST use a unique index.

Yes, and the fact that it doesn't is very bad.

The only unique index for
recentchanges is rcid - but is it guaranteed to increase monotonously?

I'm not sure.

Maybe
<query-continue> could use a compund index too, like <recentchanges
rcstart="2010-07-14T14:21:31Z" rcid="667455"/> ?

That would be the nicest solution, probably, but it requires adding rc_id to the indexes we use (I think there's only two so that wouldn't be too bad). It would be encoded differently though, as rcontinue="2010-07-14T14:21:31Z|667455", to be consistent with other modules paging on multiple fields.

The problem is also how to change this in a B/C fashion. In theory, the client
should simply use whatever attriubute it finds in
<query-continue><recentchanges>. But can we be sure non has the timestamp
hard-coded for that? Would it be ok to break clients that rely on rcstart being
a timestamp?

Yes, this is definitely OK. Clients are supposed to simply use whatever's in query-continue and pass it back verbatim. Nevertheless, we would still announce it, but not as a breaking change.

Reedy added a comment.Via ConduitAug 13 2010, 7:58 PM

It's actually a widerspread problem.

Watchlist and blocks (for 2), do the same sort of just timestamp continue.

Category members if sorted by timestamp

Deletedrevs if !( $mode == 'all' || $mode == 'revs' )

Logevents, protected titles

ImageInfo

bzimport added a comment.Via ConduitOct 24 2010, 2:17 PM

Bryan.TongMinh wrote:

(In reply to comment #2)

> The problem is also how to change this in a B/C fashion. In theory, the client
> should simply use whatever attriubute it finds in
> <query-continue><recentchanges>. But can we be sure non has the timestamp
> hard-coded for that? Would it be ok to break clients that rely on rcstart being
> a timestamp?
Yes, this is definitely OK. Clients are supposed to simply use whatever's in
query-continue and pass it back verbatim. Nevertheless, we would still announce
it, but not as a breaking change.

It is not necessary to break b/c here. If the new format is "$timestamp|$rcid", then we can make the API simply accept "$timestamp" as well, which would then default to "$timestamp|0"

Reedy added a comment.Via ConduitJun 5 2011, 8:39 PM

Do we need to worry here that we don't have a rc_timestamp, rc_id index?

I'm presuming, that we shouldn't get that many results simultaneously for one second?

I know adding the index would be teh slow

We could use rc_timestamp, rc_title, but then what if a title gets 2 hits in a second. Bleh

Reedy added a comment.Via ConduitJun 5 2011, 8:39 PM

Created attachment 8616
recentchanges

Is this actually even right? I'm getting tired, and can't really think

Attached: bug_24782_recentchanges.patch

Reedy added a comment.Via ConduitJun 5 2011, 8:50 PM

(In reply to comment #3)

It's actually a widerspread problem.

Watchlist and blocks (for 2), do the same sort of just timestamp continue.

Category members if sorted by timestamp

Deletedrevs if !( $mode == 'all' || $mode == 'revs' )

Logevents, protected titles

ImageInfo

ImageInfo is fine (no idea why I said otherwise)

Deletedrevs if !( $mode == 'all' || $mode == 'revs' ) == ($mode == 'user')

Logevents, protected titles

Category members if sorted by timestamp

Watchlist and blocks

Reedy added a comment.Via ConduitJun 5 2011, 8:59 PM

Blocks, we can just use timestamp and id

Watchlist is essentially the same as recent changes..

Category members if by timestamp....?

Log events use timestamp and id

Protected titles, timestamp and title?

Deleted revs?

edwardspec added a comment.Via ConduitAug 8 2011, 2:39 PM

This is indeed very bad.

I just failed to download a ruwiki recent blocklog because of the adminbot that blocks proxies (>500 per second).

list=logevents API is unusable under these circumstances.

Catrope added a comment.Via ConduitNov 20 2011, 2:52 PM

(In reply to comment #6)

Created attachment 8616 [details]
recentchanges

Is this actually even right? I'm getting tired, and can't really think

It's got the right idea, but it needs an index on (rc_timestamp, rc_id) or something else unique. Otherwise MySQL's stupidity makes the query slow.

Attached: bug_24782_recentchanges.patch

duplicatebug added a comment.Via ConduitFeb 18 2012, 6:07 PM
  • Bug 34029 has been marked as a duplicate of this bug. ***
duplicatebug added a comment.Via ConduitApr 7 2012, 1:59 PM
  • Bug 35786 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitJun 6 2012, 12:47 PM

sumanah wrote:

Are folks still experiencing this problem?

daniel added a comment.Via ConduitSep 5 2012, 6:12 PM

Fixed the patch a bit and submitted it to gerrit as Icc43b62f

daniel added a comment.Via ConduitSep 5 2012, 6:20 PM

Remaining issues:

As per comment #3 and #8, there are several more API modules with similar problems that still need to be addressed.

As per comment #2, rc_id is not part of the database index currently. Need to investigate whether that is an actual issue.

Anomie added a comment.Via ConduitJan 23 2013, 5:14 PM

Regarding the indexes, it looks like it's sort of an issue.

Normally, MySQL will insist on filesorting due to the extra field in the ORDER BY, which is certainly an issue for recentchanges because it doesn't seem like it's smart enough to fetch just the first 50 by the index plus only those extra are "tied" for 50th place in the non-unique index.[1]

But with the way InnoDB indexes work, one index (usually the primary key) is effectively appended to every other index[2] and MySQL will take advantage of this when performing queries. So for the recentchanges changes table, since we use InnoDB and the primary key is (rc_id), the rc_timestamp index is secretly (rc_timestamp,rc_id) even though it is defined as just (rc_timestamp). So a query SELECT ... FROM recentchanges WHERE rc_timestamp > '...' ORDER BY rc_timestamp, rc_id LIMIT 50 will not have to filesort under these conditions, and the EXPLAIN output reflects this.

Whether we want to *rely* on this behavior, I don't know.

[1]: https://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html
[2]: https://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

Aklapper added a comment.Via ConduitMar 25 2013, 9:15 AM

(In reply to comment #14)

Fixed the patch a bit and submitted it to gerrit as Icc43b62f

Patch committed on March 05th, 2013, removing keyword due to remaining issues in comment 15.

gerritbot added a comment.Via ConduitDec 24 2013, 9:00 PM

Change 103589 had a related patch set uploaded by Anomie:
API: Make more continuations unique

https://gerrit.wikimedia.org/r/103589

gerritbot added a comment.Via ConduitApr 11 2014, 6:29 PM

Change 103589 merged by jenkins-bot:
API: Make more continuations unique

https://gerrit.wikimedia.org/r/103589

Anomie added a comment.Via ConduitApr 11 2014, 7:28 PM

With the merge of Gerrit change 103589, this should be fixed now for all modules in core. If I've missed any, please let me know.

If this bug is present in any extensions, please open a new bug against the relevant extensions.

Add Comment

Column Prototype
This is a very early prototype of a persistent column. It is not expected to work yet, and leaving it open will activate other new features which will break things. Press "\" (backslash) on your keyboard to close it now.