It seems hidenondamaging in wikidatawiki is slow, probably because there is no index on oresc_probability
This is the query that activating ORES issues. It takes between 20 seconds and 1 minute to execute, includes a DEPENDENT SUBQUERY, and reads over 5 million rows:
MariaDB db1026 wikidatawiki > EXPLAIN SELECT /* SpecialRecentChanges::doMainQuery Ladsgroup */
    rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, rc_comment, rc_minor, rc_bot, rc_new,
    rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len,
    rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, wl_user, wl_notificationtimestamp, page_latest,
    (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id) AS `ts_tags`,
    oresc_probability, 0.8 AS ores_threshold
  FROM `recentchanges`
  LEFT JOIN `watchlist` ON (wl_user = '28859' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace))
  LEFT JOIN `page` ON ((rc_cur_id=page_id))
  LEFT JOIN `ores_classification` ON ((rc_this_oldid = oresc_rev AND oresc_is_predicted = 1 AND oresc_class = 1))
  LEFT JOIN `ores_model` ON ((oresc_model = oresm_id AND oresm_is_current = 1))
  WHERE rc_bot = '0' AND (rc_type != '6') AND (rc_timestamp >= '20160608000000') AND (rc_source != 'wb')
    AND ((oresm_name = 'damaging' OR oresm_name IS NULL)) AND oresc_is_predicted = '1'
    AND (oresc_probability > '0.8') AND rc_patrolled = '0' AND rc_new IN ('0','1')
  ORDER BY rc_timestamp DESC LIMIT 50\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: recentchanges
         type: ref
possible_keys: rc_timestamp,new_name_timestamp,tmp_2
          key: tmp_2
      key_len: 1
          ref: const
         rows: 5700598
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: watchlist
         type: eq_ref
possible_keys: wl_user,namespace_title,wl_user_notificationtimestamp
          key: wl_user
      key_len: 265
          ref: const,wikidatawiki.recentchanges.rc_namespace,wikidatawiki.recentchanges.rc_title
         rows: 1
        Extra:
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: ores_classification
         type: ref
possible_keys: oresc_rev,oresc_winner
          key: oresc_rev
      key_len: 4
          ref: wikidatawiki.recentchanges.rc_this_oldid
         rows: 1
        Extra: Using where
*************************** 4. row ***************************
           id: 1
  select_type: PRIMARY
        table: ores_model
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: wikidatawiki.ores_classification.oresc_model
         rows: 1
        Extra: Using where
*************************** 5. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: wikidatawiki.recentchanges.rc_cur_id
         rows: 1
        Extra:
*************************** 6. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: change_tag
         type: ref
possible_keys: change_tag_rc_tag
          key: change_tag_rc_tag
      key_len: 5
          ref: wikidatawiki.recentchanges.rc_id
         rows: 1
        Extra: Using index
Creating a better query is a blocker for deploying this functionality. You should be aiming at a few dozen milliseconds, not seconds. I would recommend checking the code of the other recentchanges filters.
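As a toy illustration of the missing-index idea in the task title (using SQLite rather than MariaDB, with a fabricated two-column version of ores_classification), this sketch shows how the query plan for a probability filter changes once an index on oresc_probability exists:

```python
# Toy sketch (SQLite, not MariaDB; simplified schema and fabricated data)
# of how an index on oresc_probability changes the access path.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ores_classification (
    oresc_rev INTEGER, oresc_probability REAL)""")
conn.executemany("INSERT INTO ores_classification VALUES (?, ?)",
                 [(i, i / 1000.0) for i in range(1000)])

query = "SELECT oresc_rev FROM ores_classification WHERE oresc_probability > 0.8"

# Without an index: the planner has no choice but a full table scan.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][-1])  # e.g. "SCAN ores_classification"

# With an index, the filter becomes a range search on the index.
conn.execute("CREATE INDEX idx_oresc_probability "
             "ON ores_classification (oresc_probability)")
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][-1])  # e.g. "SEARCH ... USING INDEX idx_oresc_probability ..."
```

The index name `idx_oresc_probability` is made up for the demo; the production fix the thread converges on is a join-order change rather than this index, but the plan difference above is what the title is pointing at.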
Change 294693 had a related patch set uploaded (by Ladsgroup):
Performance boost on hidenondamaging
By using INNER JOIN instead of LEFT JOIN in the hidenondamaging case, everything becomes many times faster.
Change 294712 had a related patch set uploaded (by Ladsgroup):
Performance boost on hidenondamaging
Can you copy me the potential SQL queries generated by your code so I can run EXPLAIN on them in production to test them? Thank you.
EXPLAIN SELECT /* SpecialRecentChanges::doMainQuery Ladsgroup */
    rc_id, rc_timestamp, rc_user, rc_user_text, rc_namespace, rc_title, rc_comment, rc_minor, rc_bot, rc_new,
    rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len,
    rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params, wl_user, wl_notificationtimestamp, page_latest,
    (SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id) AS `ts_tags`,
    oresc_probability, 0.8 AS ores_threshold
  FROM `recentchanges`
  LEFT JOIN `watchlist` ON (wl_user = '28859' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace))
  LEFT JOIN `page` ON ((rc_cur_id=page_id))
  INNER JOIN `ores_classification` ON ((rc_this_oldid = oresc_rev AND oresc_is_predicted = 1 AND oresc_class = 1))
  LEFT JOIN `ores_model` ON ((oresc_model = oresm_id AND oresm_is_current = 1))
  WHERE rc_bot = '0' AND (rc_type != '6') AND (rc_timestamp >= '20160608000000') AND (rc_source != 'wb')
    AND ((oresm_name = 'damaging' OR oresm_name IS NULL))
    AND (oresc_probability > '0.8') AND rc_patrolled = '0' AND rc_new IN ('0','1')
  ORDER BY rc_timestamp DESC LIMIT 50\G
That's the query you copied, but with my changes (using INNER JOIN instead of LEFT JOIN).
I think the changes look good: 30K-100K rows visited, using an index. It could be better, but for that I will have to talk to the tags maintainer (not an ORES issue).
Did you check what I briefly mentioned, which is using a specific recentchanges-role slave?
@jcrespo I replied there; I hope I understood you correctly. We can't get that threshold from the DB directly; we need to get the user preferences and then determine the threshold based on that value.
No, I am talking about using a particular role when connecting to the database and selecting a DB_SLAVE. I get 5x better performance when using a recentchanges slave instead of a main-traffic one. Maybe your code already does this, but I can't tell.
Oh, this hook does not change the DB slave role; it only adds conditions and such, then sends them back to mediawiki/core to run the DB query.