
ORES on Watchlist causes big slowdown—especially with 'Last revision' filter turned on
Closed, ResolvedPublic

Description

Exercise #1

With the beta activated:

  1. Go to Recent Changes.
  2. Open the console to time page load.
  3. Use the default filters and set results to 250/7 days.
  4. Load the page and note the time (mine loads in approx. 3 secs).
  5. Now add the "May have problems" filter to the mix.
  6. Load the page and note the time. The load time with and without the "May have problems" filter is pretty similar.
  7. Now go to Watchlist and repeat steps 2-6 (on Watchlist, the default filters include "Last revisions," so turn that off, just to make the comparison more exact).

Expected result: the page load-time with "May have problems" and without should be similar, as it is on Recent Changes.
Actual result: I get about a 5-second difference when adding "May have problems" on Watchlist.

Exercise #2

  • Repeat the test in step 7 above, but this time include the "Last revision" filter as well.

Expected result: about the same as above?
Actual result: The delay now runs, for me, between 35 and 75 seconds extra.

(Note that I have a very small Watchlist.)

Related Objects

Event Timeline

jmatazzoni updated the task description.

In T176456#3640455, @awight wrote:

This may be a duplicate of T168096.

No, this is not a duplicate of T168096. That, as its title states, is a "complex query," and it looks for a rare combination. In the example above, all I did was add "May have problems" to the default filters. That should be an extremely common combination and should find lots of results easily, which is why this is much more alarming.

I told @jmatazzoni I would try to take a look at this, by capturing the query on my mw-vagrant dev wiki and then explaining the query plan on production MySQL. However, I'm stuck at making RCFilters appear on my Watchlist page. I've enabled $wgStructuredChangeFiltersOnWatchlist, I see new filters on my Recent Changes page, but nothing on WL yet. Anyone able to help me?

There's a shortcut: add ?rcfilters=1 to the URL.

Yass, shortcut for the win. It's working for me now, thanks!

I captured a query and checked the execution plan on stat1006:

EXPLAIN SELECT /*! STRAIGHT_JOIN */
    rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, 
    rc_minor, rc_bot, rc_new, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type,
    rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted,
    rc_logid, rc_log_type, rc_log_action, rc_params, 
    rc_comment AS `rc_comment_text`,
    NULL AS `rc_comment_data`,
    NULL AS `rc_comment_cid`,
    page_latest,
    wl_notificationtimestamp,
    -- (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags`,
    ores_damaging_cls.oresc_probability AS `ores_damaging_score`
FROM `recentchanges`
INNER JOIN `watchlist`
    ON (wl_user = '513988' AND (wl_namespace=rc_namespace) AND (wl_title=rc_title))
LEFT JOIN `page`
    ON ((rc_cur_id=page_id))
LEFT JOIN `ores_model` `ores_damaging_mdl`
    ON (ores_damaging_mdl.oresm_is_current = '1' AND ores_damaging_mdl.oresm_name = 'damaging')
LEFT JOIN `ores_classification` `ores_damaging_cls`
    ON ((ores_damaging_cls.oresc_model = ores_damaging_mdl.oresm_id) AND (rc_this_oldid = ores_damaging_cls.oresc_rev) AND ores_damaging_cls.oresc_class = '1')
WHERE
    ((ores_damaging_cls.oresc_probability BETWEEN 0.2 AND 1))
    AND (rc_type NOT IN ('3','5'))
    AND (rc_timestamp >= '20171002024806')
ORDER BY rc_timestamp DESC
LIMIT 250

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: recentchanges
         type: range
possible_keys: rc_timestamp,rc_namespace_title,rc_ns_usertext,tmp_1,tmp_3,rc_name_type_patrolled_timestamp
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 2497770
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: watchlist
         type: eq_ref
possible_keys: wl_user,namespace_title
          key: wl_user
      key_len: 265
          ref: const,enwiki.recentchanges.rc_namespace,enwiki.recentchanges.rc_title
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.recentchanges.rc_cur_id
         rows: 1
        Extra: 
*************************** 4. row ***************************
           id: 1
  select_type: PRIMARY
        table: ores_damaging_mdl
         type: ref
possible_keys: PRIMARY,oresm_version,ores_model_status
          key: ores_model_status
      key_len: 35
          ref: const,const
         rows: 1
        Extra: Using index condition
*************************** 5. row ***************************
           id: 1
  select_type: PRIMARY
        table: ores_damaging_cls
         type: eq_ref
possible_keys: oresc_rev_model_class,oresc_rev_predicted_model
          key: oresc_rev_model_class
      key_len: 7
          ref: enwiki.recentchanges.rc_this_oldid,enwiki.ores_damaging_mdl.oresm_id,const
         rows: 1
        Extra: Using index condition; Using where
*************************** 6. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: change_tag
         type: ref
possible_keys: ct_rc_id,change_tag_rc_tag
          key: ct_rc_id
      key_len: 5
          ref: enwiki.recentchanges.rc_id
         rows: 352320
        Extra: Using index

The first run took 3'39", but unfortunately something gets cached and I wasn't able to get an empirical sense of what's hurting performance, just by bisecting the query. I don't get it yet.

There's a shortcut: add ?rcfilters=1 to the URL.

Sorry for interrupting, but is there some shortcut for a beta version too? This one does not include Live Updates. Thank you.

I didn't realize ?rcfilters=1 doesn't include beta features like Live Updates. This may change in the future. In the meantime, you can add &liveupdate=1 to the URL to get live updates.

Thanks a lot.

I'll just admit right here that I'm not particularly good at query optimization. But still, I'll go ahead and make the obvious comment. I'm not sure we're using multi-column indexes correctly. For example, this is the only index that covers the oresc_probability column:

CREATE INDEX /*i*/oresc_model_class_prob ON /*_*/ores_classification (oresc_model, oresc_class, oresc_probability);

The order of columns matters here, so in my understanding, we can only filter by probability after we've already narrowed by model and class. Similarly, we can't filter by class until model is locked in. Something like that. So I'm imagining that our multi-column key is causing the conditions to be evaluated in a difficult order.

It would be great if someone with more DB chops could take a look at this.
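
As a hypothetical illustration of the leftmost-prefix behaviour being described (assuming the oresc_model_class_prob index above existed; the literal values for oresc_model and oresc_class are made up):

-- Can use the whole index: model and class are fixed equality conditions,
-- so the probability range can be used to narrow the index scan.
SELECT oresc_rev
FROM ores_classification
WHERE oresc_model = 5
  AND oresc_class = 1
  AND oresc_probability BETWEEN 0.2 AND 1;

-- Can only use the (oresc_model) prefix: oresc_class is skipped, so the
-- probability range can no longer narrow the scan (at best it is checked
-- per row via index condition pushdown).
SELECT oresc_rev
FROM ores_classification
WHERE oresc_model = 5
  AND oresc_probability BETWEEN 0.2 AND 1;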

I don't think that index exists in production (not sure; I don't remember it from when I was looking about an hour ago).

Anyway, you are right that order matters; however, in this case the order is fine. At the time the join is executed, model and class are known constants and probability is a known range. However, that index would never be used, since a (constant) revision is present at the time of the join, so the index containing the revision is much more specific and (correctly) used instead.

The index containing oresc_rev is generally the best we can do for that query as written, since it narrows things down to a single row. If probability were also in the index, it would save a tiny bit of I/O, because index condition pushdown or a covering index would stop us from having to look at the underlying table, but the speed savings of that is small (albeit possibly still worth it; I'm not sure how much of a difference it would make).
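
As a sketch of that last point (this index does not exist in the ORES schema; the name is made up), appending the probability column to the revision-based lookup index would let the BETWEEN condition and the selected probability be read from the index alone, skipping the fetch of the base-table row:

-- Hypothetical covering index for the per-revision lookup.
CREATE INDEX oresc_rev_model_class_prob
    ON ores_classification (oresc_rev, oresc_model, oresc_class, oresc_probability);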

I don't think that index exists in production (not sure; I don't remember it from when I was looking about an hour ago).

I double checked. On the enwiki slave I looked at (db1083) there are only 3 indexes on the ores_classification table: PRIMARY, oresc_rev_predicted_model and oresc_rev_model_class. oresc_model_class_prob does not exist on production, or at least not on the enwiki slaves.

However, that index would never be used, since a (constant) revision is present at the time of the join, so the index containing the revision is much more specific and (correctly) used instead.

I missed one case here. If the query was not using STRAIGHT_JOIN, and the total number of revisions in the appropriate probability range for that oresc_model and that oresc_class over all time was less than the number of pages on the user's watchlist (if this is a watchlist query), and was significantly less than the number of entries in the recentchanges table over the specified time range, then the order of the joins would be rearranged and it would use the oresc_model_class index (if that index existed). In the short term there might be some cases where this is more efficient, since ORES is a relatively new feature. However, in the long term it's unlikely this would be the case, as the set of all revisions of all time within the specified probability range is going to get bigger as more and more edits get scored by ORES (compared to the number of RC entries or the number of watchlist entries, which stays relatively constant over time).

The ORES tables are purged along with the RC table, so the slippage that you're talking about won't happen.

Oh, well in that case, if a particular filter is very selective it may make sense to allow query plans where the base table is ores_classification, with some sort of index like (oresc_model, oresc_class, oresc_probability). If we can narrow down the range of revisions (for options of the form "only show the last three days"), putting oresc_rev at the end of that index may make sense too, for index condition pushdown. This of course only makes sense if the filter is very selective, as the required filesort has a high overhead.

Usually these filters are pretty selective, but it's possible for them to not be. Thankfully, the kinds of rows that bloat the RC table (Wikidata and categorization) are also among the kinds of rows that don't have ORES scores associated with them. So for wikis with bloated RC tables, the filter will be selective, and typical selections for these filters ("only show bad edits") are quite selective too (<10% on enwiki IIRC). However, the user could choose the reverse filter ("only show good edits") and that wouldn't be very selective.

You were spot on with your observation about STRAIGHT_JOIN in the Watchlist query, thanks a lot for that. I'm submitting a patch that will remove it on Watchlist (but continue to apply it on RC, where it does help). When diving into that I learned that MySQL also has a feature where you can force the join order between two tables without forcing the order entirely for the whole query: you can do foo STRAIGHT_JOIN bar JOIN baz, and that will tell it to always read foo before bar, while baz may still be moved around at the optimizer's discretion; as opposed to /*!STRAIGHT_JOIN*/ foo JOIN bar JOIN baz which forces the order to always be first foo, then bar, then baz. If we find more performance problems after removing the query-wide STRAIGHT_JOIN, we could consider experimenting with that to prevent bad query plans without forcing an even worse one.
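
For reference, a minimal sketch of the two forms (foo, bar and baz are the placeholder table names from the comment above; the join columns are likewise invented):

-- Query-wide hint, as in the captured Watchlist query: the optimizer must
-- read foo, then bar, then baz, in exactly the listed order.
SELECT /*! STRAIGHT_JOIN */ *
FROM foo
JOIN bar ON bar.foo_id = foo.id
JOIN baz ON baz.bar_id = bar.id;

-- Per-join hint: foo is always read before bar, but baz may still be moved
-- around at the optimizer's discretion.
SELECT *
FROM foo
STRAIGHT_JOIN bar ON bar.foo_id = foo.id
JOIN baz ON baz.bar_id = bar.id;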

Change 383868 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/ORES@master] Don't use STRAIGHT_JOIN for Watchlist queries

https://gerrit.wikimedia.org/r/383868

Change 383868 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Don't use STRAIGHT_JOIN for Watchlist queries

https://gerrit.wikimedia.org/r/383868

Change 383885 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.3] Don't use STRAIGHT_JOIN for Watchlist queries

https://gerrit.wikimedia.org/r/383885

Change 383885 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.3] Don't use STRAIGHT_JOIN for Watchlist queries

https://gerrit.wikimedia.org/r/383885

jmatazzoni added a subscriber: Etonkovidova.

Hey @awight, I think @Bawolff's brilliant suggestion just enabled @Catrope to totally fix this problem. Following the protocol above, even with Latest Revision on, I get a 2.19 second load time—with 30 days selected! Good job gang!

Following the protocols in T176445, I get:

"Typical ORES filter set"

https://en.wikipedia.org/wiki/Special:Watchlist?highlight=1&limit=250&days=7&urlversion=2&hidepreviousrevisions=1&hidecategorization=1&hideWikibase=1&damaging=maybebad

2.71 secs at 7 days.

"'Rare' ORES filter set"

https://en.wikipedia.org/wiki/Special:Watchlist?hidebots=1&hidecategorization=1&hideWikibase=1&hidelog=1&highlight=1&limit=50&days=7&urlversion=2&damaging__verylikelybad_color=c4&damaging=likelybad&goodfaith=likelygood&hidepreviousrevisions=1

2.13 secs at 7 days.

@Etonkovidova, moving this to QA. @Trizek-WMF, if confirmed, we should prominently announce that we've fixed the bug that caused ORES on Watchlist to load very slowly. This bug predated the New Filters release, so anyone who has ever dropped ORES on Watchlist in the past because of slowness should give it another try. (@Jdforrester-WMF, just pinging you because I think this will make you happy.)

Keep in mind the new query will scale differently. The old version's speed depended a lot on how many "days" the watchlist was set to, and how rare the filter was. The number of items on your watchlist didn't affect it much. The new one is much less affected by the number of days and the rarity of the filter, but is much more sensitive to the number of items on your watchlist. So you should also test people with different watchlist sizes.

Ah. That sounds right. Because it never made sense that my piddly WL would be so slow. So does that mean we think that advanced users will now suffer? Because they likely a) search shorter time periods and b) have much bigger lists.

That could be a problem. It would be a bad trade if we make things better for casual users and worse for power users.

@Etonkovidova and @Catrope, this is something we need to judge carefully.

It will still probably be better than the old query for people with large but not insane-sized watchlists, but the savings just won't be as spectacular for big watchlists.

The new query leaves much more up to MariaDB - there will come a point for huge watchlists where MariaDB will go back to the old query plan (often MariaDB seems to underestimate the cost of large filesorts, so the issue, and the reason the straight join was there for the recentchanges case, is that MariaDB doesn't switch to the other query plan "soon" enough. Killing all the wikidata stuff will also help with that problem). Also keep in mind the old query somewhat reverse-scales with watchlist size - the old query is faster the less filtering is involved, so it's faster for people with big watchlists as they have less filtering.

Anyway, if it turns out MariaDB is making poor optimization decisions for users with very large watchlists, we can always re-add the straight join dynamically, only for users with big watchlists (hacky, but I guess that's life). Hopefully it wouldn't come to that.

@awight Btw, some debugging tips for next time:

I told @jmatazzoni I would try to take a look at this, by capturing the query on my mw-vagrant dev wiki and then explaining the query plan on production MySQL. However, I'm stuck at making RCFilters appear on my Watchlist page. I've enabled $wgStructuredChangeFiltersOnWatchlist, I see new filters on my Recent Changes page, but nothing on WL yet. Anyone able to help me?

Any timed-out query (and queries that are just slow but under the timeout) gets logged to logstash. If all you're trying to do is find the query in question to optimize, often the easiest way is to search logstash for something like ("SlowTimer" "Read timeout is reached") AND "column_or_table_you_know_is_in_the_query". E.g., to find the ORES slow queries you might search for something like ("SlowTimer" "Read timeout is reached") AND "ores_classification" - https://logstash.wikimedia.org/goto/1890300d4943223ea88934c4c6f31a53

The first run took 3'39", but unfortunately something gets cached and I wasn't able to get an empirical sense of what's hurting performance, just by bisecting the query. I don't get it yet.

Generally when optimizing a query, the best thing to do is first look at the EXPLAIN and see if the chosen plan looks correct (e.g. the join order is sane, the key field (which index is used) is correct, the key_len (how much of the index is used) is what you expect it to be, and there's no "Using filesort"). Important: the rows field is a rough guess and often misleading, so ignore that field of the EXPLAIN other than to note whether it's 1 or a number > 1.

To bisect different changes to a query, the best thing to do is use handler stats. To do that:

  1. Open the mysql client
  2. Run the query
  3. Run SHOW STATUS LIKE 'Ha%';
  4. Close the mysql client and repeat (you have to close the client because the status applies to all queries you've made thus far, so closing the client clears the status).

The different fields mean different things and some are worse than others, but basically try to minimize all the fields starting with Handler_read.
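
A minimal sketch of one such measurement session (the COUNT query is only a stand-in for whichever variant of the Watchlist query is being bisected):

-- The query under test.
SELECT COUNT(*)
FROM ores_classification
WHERE oresc_probability BETWEEN 0.2 AND 1;

-- Session-scoped by default, so this reflects only the queries run on this
-- connection; the Handler_read_* counters are the ones to minimize.
SHOW STATUS LIKE 'Handler_read%';

-- Disconnect before trying the next variant so the counters start from zero.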

Usually these filters are pretty selective, but it's possible for them to not be. Thankfully, the kinds of rows that bloat the RC table (Wikidata and categorization) are also among the kinds of rows that don't have ORES scores associated with them. So for wikis with bloated RC tables, the filter will be selective, and typical selections for these filters ("only show bad edits") are quite selective too (<10% on enwiki IIRC). However, the user could choose the reverse filter ("only show good edits") and that wouldn't be very selective.

Ok, so it sounds like there are three main cases to consider here:

  1. No filtering, or the filtering is not very selective, and the watchlist is very large (or we're on Special:Recentchanges).
  2. The watchlist is small (for an unclear definition of small).
  3. The filtering is very selective but the watchlist is medium-sized or large.

Cases 1 and 2 are now well handled. So that leaves case 3. I believe that in case 3, adding an index on (oresc_model, oresc_class, oresc_probability, oresc_rev) [oresc_rev plopped on the end for a covering index, since it only adds 4 bytes to the index; potentially, if we have a range of revision ids based on the days parameter of the watchlist, some rows could be eliminated before the join using the oresc_rev part of the index, if an oresc_rev range was specified] and removing the STRAIGHT_JOIN may make sense sometimes (but only for high-specificity filters).
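
A sketch of the index shape described above (it does not exist in the ORES schema; the name is invented for illustration):

-- Hypothetical index for case 3: model, class and probability narrow the
-- rows first, and oresc_rev at the end keeps the lookup covering and lets a
-- revision-id range (derived from the "days" option) be checked within the
-- index before joining back to recentchanges.
CREATE INDEX oresc_model_class_prob_rev
    ON ores_classification (oresc_model, oresc_class, oresc_probability, oresc_rev);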

However, for that to be worth it, the most specific filter would have to be very specific (by itself; having the intersection of the filters be specific is not enough). For example, damaging on enwiki set to probability 0.828 or higher still gives 149,024 results, which is a bit on the high end for this scheme to actually be useful. If you add a restriction based on revision time (for, say, a 7-day period), that goes down to 34,146 (the DB would still have to scan all 149,024 rows, but it could eliminate most before the filesort. Even better, if the condition was on oresc_rev it could eliminate them before the join, which would save a little bit of I/O. This is probably better than the equivalent of scanning 2.3 million RC rows without the filesort for the same 7-day period - but the RC scan is likely to stop early, so it's unclear if there are actual savings there). Such an index is unlikely to be useful in the watchlist case unless the user had at least 100,000 items on their watchlist, maybe more.

So umm, in conclusion - it's unclear how much benefit an index with oresc_probability would bring. It may have savings in the case of a very specific ORES filter, but even the most specific filter available currently is probably not specific enough to see real improvements in the watchlist case unless the watchlist is very big (say 500,000 items), and there is a risk of performance degradation if a query plan is chosen using ores_classification as the base table when it's not a good idea to do so. For the recentchanges case, it's probably a similar story. It depends on how much wikidata spam is in the table, and would probably require trying real queries and looking at how many rows are actually touched to see if it's worth it.

If the situation changes, and filters are added that are, say, an order of magnitude more specific, it would probably then be worth it.

However, if we can make more drastic changes to the query, the filesort could be removed by ordering on oresc_rev instead of rc_timestamp, which should give the same order as rc_timestamp, since all non-edit-related entries are removed and revision ids increase monotonically with time (that is a lot of assuming, though). It would probably be difficult to modify the query to change the sort order like that, but if we could, that would make the filtering case with rare filters much, much faster (probably as fast as recentchanges with no filtering enabled). [Edit: This is not true. In my excitement, I forgot that oresc_probability was a range in this query. If we could somehow have the same info expressed as an integer (e.g. converting probability ranges into distinct classes and shoving the class into the db as an int), then this could work, but only in the case where we have an intersection of filters and not a union of filters, and I'm not sure how possible that is.]

Checked in betalabs (big watchlist (>700 pages) and very few updates happening) and production (few pages on the Watchlist and big volume of updates) - the new query works fast in both cases.

Just to clarify, I wouldn't consider a watchlist big unless it had at least 50,000 pages. I know users who have almost a million pages on their watchlist. 700 is definitely in the small category. If any performance problems do show up, it would probably only be on a wiki with both a big watchlist and a big RC volume at the same time.