Page MenuHomePhabricator

Optimize Special:SupportedLanguages
Closed, ResolvedPublic1 Estimated Story Points

Description

https://gerrit.wikimedia.org/r/78660

Special:SupportedLanguages had to be disabled on Wikimedia wikis due to database queries taking over 90,000 seconds. No idea which query or why. Perhaps a missing index?


Version: master
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:06 AM
bzimport set Reference to bz52728.
bzimport added a subscriber: Unknown Object (MLST).

The query would have been of the form:

SELECT user_name, user_editcount, MAX(rev_timestamp) as lastedit
FROM user, revision
WHERE user_id = rev_user
GROUP BY user_name

MAX() is optimised in various contexts, but I guess this isn't one of them. EXPLAIN indicates that this query scans the entire revision table, which would explain the long query times. This ungrouped query is efficient:

SELECT MAX(rev_timestamp)
FROM revision,user
WHERE rev_user=user_id AND user_name='Tim Starling';

So I guess it is the grouping that stops it from hitting a special-case optimisation of MAX(). The traditional way to retrieve a row from the end of a range, which does not rely on MAX() optimisations, is with ORDER BY and LIMIT:

SELECT rev_timestamp
FROM revision,user
WHERE rev_user=user_id AND user_name='Tim Starling'
ORDER BY rev_timestamp DESC
LIMIT 1

But even if that was done, it would still be extremely inefficient and would not work. All the special page wants to do is display statistics about translators listed on e.g. [[Portal:Fr/translators]] for French translators, and there are no such translator lists on Commons, so actually it is trying to calculate statistics about nobody in order to display an empty table. In order to display that empty table, it is fetching edit count statistics for all 3 million users on the wiki. Even if it managed to get the queries done, the memcached set would fail due to the value size being larger than 1MB.

Obviously, it should do a single ungrouped query like the one above for each translator, not for every user on the wiki.

Change 105736 had a related patch set uploaded by Nikerabbit:
Optimize Special:SupportedLanguages

https://gerrit.wikimedia.org/r/105736

Even the solution without GROUP BY could be slow for very active users. The index used is probably always revision user_timestamp (rev_user, rev_timestamp). Since rev_timestamp is in second place MAX() must still do a range scan on the user's portion of the btree.

As we're already proposing to query separately for each user then pulling the user data out first in bulk and removing the join in the individual user queries would be predictable and faster:

SELECT user_id, user_editcount FROM user WHERE user_name in (...);

foreach $user_id:

SELECT MAX(rev_timestamp) FROM revision WHERE rev_user = $user_id;
  • 1. row ******* id: 1 select_type: SIMPLE table: NULL type: NULL

possible_keys: NULL

    key: NULL
key_len: NULL
    ref: NULL
   rows: NULL
  Extra: Select tables optimized away

1 row in set (0.29 sec)

Each user needs a single Handler_read_key hit on revision rev_timestamp (rev_timestamp) index regardless of how many revisions they have authored.

Worth considering.

Actually, s/range scan/ref access/, but still potentially thousands of Handler_read_next.

It is actually doing individual queries now, though joining with user table as I didn't have user_id available without more refactoring.

I'm happy to do more refactoring to make it faster, but would like to avoid building dependent patchsets.

Change 105736 merged by jenkins-bot:
Optimize Special:SupportedLanguages

https://gerrit.wikimedia.org/r/105736

This is not an enhancement, it's a bug because it completely prevents the feature from being used on Wikimedia wikis.

On translatewiki.net, time for purge went from 60+ s (to get a timeout) to about 30 (to actually get it).
http://www.webpagetest.org/result/140123_M7_FB4/
http://www.webpagetest.org/result/140124_8E_8RV/
Is the performance of the query used on the other wikis (i.e. without translators lists in portals) proportional to this? Maybe to re-enable it we only need to be sure the really-worst case is eliminated (queries 90 000 s long).

I was discussing this with Siebrand and we were planning to drop our special case for using portal pages, but currently that seems to throw us over the 60+ limit again so further work is needed. On the other hand this is not on top of my todo list.

  • Bug 59497 has been marked as a duplicate of this bug. ***
gerritbot subscribed.

Change 189228 had a related patch set uploaded (by Nikerabbit):
Split language details to subpage on Special:SupportedLanguages

https://gerrit.wikimedia.org/r/189228

Patch-For-Review

Change 189228 merged by jenkins-bot:
Split language details to subpage on Special:SupportedLanguages

https://gerrit.wikimedia.org/r/189228

@Springle @tstarling I would appreciate your review on the updated code and guidance what to do if it is not yet good enough to be re-enabled on WMF sites.

Arrbee moved this task from In Review to Blocked on the LE-Sprint-83 board.

I tried this query on a commonswiki slave:

select rev_user_text,count(*) from page,revision where rev_page=page_id and page_namespace=1198 and page_title like '%/de' group by rev_user_text;

That's 5434 revisions, 3920 pages, which doesn't sound like much, but it did take 3 seconds, and we can expect that to grow linearly as the number of revisions increases. I'd be happier if you removed the purge parameter (or allowed it to be disabled in configuration) and wrapped the query in a PoolCounterWorkViaCallback, then it wouldn't be a DoS vulnerability.

Not sure if there is a documented policy on acceptable limits for slow queries, or if Sean wants to decree one.

Change 194104 had a related patch set uploaded (by Nikerabbit):
Wrap slow query in PoolCounterWorkViaCallback

https://gerrit.wikimedia.org/r/194104

Change 194104 merged by jenkins-bot:
Wrap slow query in PoolCounterWorkViaCallback

https://gerrit.wikimedia.org/r/194104

Arrbee moved this task from Backlog to In Review on the LE-Sprint-84 board.

Change 199263 had a related patch set uploaded (by Nikerabbit):
Add pool counter config for Translate

https://gerrit.wikimedia.org/r/199263

Change 203202 had a related patch set uploaded (by Aaron Schulz):
Optimized translator list loading

https://gerrit.wikimedia.org/r/203202

Change 199263 merged by jenkins-bot:
Add pool counter config for Translate

https://gerrit.wikimedia.org/r/199263

Change 203202 merged by jenkins-bot:
Optimized translator list loading

https://gerrit.wikimedia.org/r/203202

All patches (including improvements by Aaron, thanks) related to this have been merged. Can the special page now be re-enabled?

All patches (including improvements by Aaron, thanks) related to this have been merged. Can the special page now be re-enabled?

I think so.

Change 203859 had a related patch set uploaded (by Nemo bis):
Avoid sorting in languageCloud() and fixed IDE error

https://gerrit.wikimedia.org/r/203859

Change 203859 merged by jenkins-bot:
Avoid sorting in languageCloud() and fixed IDE error

https://gerrit.wikimedia.org/r/203859

Change 204032 had a related patch set uploaded (by Nikerabbit):
Re-enable Special:SupportedLanguages

https://gerrit.wikimedia.org/r/204032

Change 204032 merged by jenkins-bot:
Re-enable Special:SupportedLanguages

https://gerrit.wikimedia.org/r/204032

Thank you everyone, especially Aaron for help.

Change 274900 had a related patch set uploaded (by Siebrand):
Fix incorrect variable name

https://gerrit.wikimedia.org/r/274900

Change 274900 merged by jenkins-bot:
Fix incorrect variable name

https://gerrit.wikimedia.org/r/274900