Page MenuHomePhabricator

Continue may skip or repeat entries for iwlinks or langlinks
Closed, ResolvedPublic

Description

Patch to change the ORDER BY in the queries to match that expected by the continue handling

For iwlinks, the continue parameter implies ORDER BY iwl_from, iwl_prefix, iwl_title. However, this does not match the actual ordering in two of the three cases:

  1. When iwprefix is not given, the query is ORDER BY iwl_from, iwl_prefix (or ORDER BY iwl_prefix when iwl_from is constant). iwl_title is not sorted, so if the database would want to return three titles in the order A, C, B, with iwlimit=1 it would return A and set continue to C. When continuing, it would skip B because C > B.
  1. When iwprefix is given but iwtitle is not given, the query is ORDER BY iwl_title, iwl_from, which is backwards (iwl_prefix need not be included, as it is constant). This is very easy to see the problem: just create pages A and B where B has an interwiki link that sorts between two interwiki links in A. https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=iwlinks&titles=A|C&iwprefix=wikisource&iwlimit=1 illustrates this at the moment.
  1. When both iwprefix and iwltitle are specified, the query is ORDER BY iwl_from. This is fine because iwl_prefix and iwl_title are constant in the query.

The langlinks module has a similar structure, but here case 1 is not a problem because (ll_from, ll_prefix) is a unique key. Case 2 *is* still an issue.

The simple fix would be to change the queries so the ordering matches that implied by the continue parameter in all cases (see patch). I don't know if this will make MySQL filesort, however.


Version: 1.20.x
Severity: normal

attachment diff ignored as obsolete

Details

Reference
bz36400

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:24 AM
bzimport set Reference to bz36400.

sumanah wrote:

Thanks for the patch, Brad. It'll get reviewed faster if you use Developer access

https://www.mediawiki.org/wiki/Developer_access

to put it directly into the Git source code repository:

https://www.mediawiki.org/wiki/Git/Workflow

(In reply to comment #1)

Thanks for the patch, Brad. It'll get reviewed faster if you use Developer
access

I wasn't sure if I should do that with a patch I wasn't sure was right (re the potential filesort issue). I'll do that in the future.

The first part of your patch makes the query unindexed:

  • Was: iwl_prefix = const, ORDER BY iwl_title, iwl_from - covered by index iwl_prefix_title_from
  • Your patch: iwl_prefix = const, ORDER BY iwl_from, iwl_title - no suitable index.

Note: although the number of items with the same iwl_prefix isn't very high (in the hundreds), due to the size of every element filesort is still possible. CCing Asher who has the final word in the field of DB performance.

(In reply to comment #3)

The first part of your patch makes the query unindexed:

  • Was: iwl_prefix = const, ORDER BY iwl_title, iwl_from - covered by index

iwl_prefix_title_from

  • Your patch: iwl_prefix = const, ORDER BY iwl_from, iwl_title - no suitable

index.

I was afraid of something like that.

afeldman wrote:

The langlink portion looks desirable.

(In reply to comment #3)

The first part of your patch makes the query unindexed:

  • Was: iwl_prefix = const, ORDER BY iwl_title, iwl_from - covered by index

iwl_prefix_title_from

  • Your patch: iwl_prefix = const, ORDER BY iwl_from, iwl_title - no suitable

index.

I forgot to update the bug until now, but I talked to Roan at the Berlin Hackathon two months ago and we concluded that the query in my patch is correct, and we should have a iwl_prefix_from_title index to cover it (keep iwl_prefix_title_from, ApiQueryIWBacklinks uses that).

The other possibility we considered was changing the processing of the continue parameter so ORDER BY iwl_title, iwl_from would be correct for that case. The problem there is inconsistency: if the module needs to return (prefix,title,from) triplets of ('de','foo',1), ('de','bar',1), and ('de','bar',2), and the API limit is 1, it would return page_id 1 with link de:bar, then page_id 2 with link de:bar, then page_id 1 again with link de:foo. All other API modules that return results by page like this would return page_id 1 with link de:bar, then page_id 1 with link de:foo, and only then page_id 2. Some clients may depend on this behavior.

I finally got around to putting this in Gerrit:
Gerrit changeset 43389

  • Bug 46136 has been marked as a duplicate of this bug. ***

Change merged. It should be deployed to WMF wikis with 1.22wmf2, see https://www.mediawiki.org/wiki/MediaWiki_1.22/Roadmap for the schedule.