
SecurePoll's populateEditCount should not make queries getting editcount of bots
Open, Needs Triage, Public

Description

I have been depooling a host for maintenance, but the traffic is not draining.
The query that has been running for a very long time against production (on Wikidata) is this:

SELECT /* populateEditCount  */  COUNT(*)  FROM `revision` JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id))   WHERE (temp_rev_user.revactor_actor = 122) AND (revactor_timestamp < '20220207000000')  LIMIT 1

It comes from this script:
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/SecurePoll/+/400511a1afbe029b2e7f5b248b5aba02f4a658a7/cli/wm-scripts/ucoc/populateEditCount.php

According to EXPLAIN, that query scans 62M rows, and the user whose edit count it is trying to get is KrBot. Given that bots can't vote, which the script itself explicitly mentions, can it at least avoid checking the edit count for bots?

Event Timeline

This comment was removed by Zabe.

This seems a reasonable change to the code.

At first, my thought was to modify these lines and add an ANTI-JOIN to the query to exclude users who are in the bot group. However, that is inefficient in that it will have to query user_groups once per user.

A better approach is to run a single query against user_groups first to get the list of user IDs for all bots, and then check against this cached list within the PHP code. The chance of an account being promoted to bot while the script is running is slim to none.
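
For illustration only, here is a minimal sketch of the cached-list idea. It is written in Python against an in-memory SQLite database rather than in the script's PHP, so it is not the SecurePoll code. The table is a simplified stand-in for user_groups, and the helper names `load_bot_user_ids` and `users_to_count` are hypothetical.

```python
import sqlite3


def load_bot_user_ids(conn):
    """Run one query against user_groups and cache the bot user IDs in a set."""
    rows = conn.execute(
        "SELECT ug_user FROM user_groups WHERE ug_group = 'bot'"
    )
    return {row[0] for row in rows}


def users_to_count(user_ids, bot_ids):
    """Return only the users whose edit count still needs to be queried."""
    return [uid for uid in user_ids if uid not in bot_ids]


# Toy data standing in for the real table (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT)")
conn.executemany(
    "INSERT INTO user_groups VALUES (?, ?)",
    [(1, "sysop"), (2, "bot"), (3, "bot"), (2, "sysop")],
)

bot_ids = load_bot_user_ids(conn)                  # one query, done up front
eligible = users_to_count([1, 2, 3, 4], bot_ids)   # in-memory check per user
print(eligible)
```

The membership test against the set is a constant-time lookup per user, so the expensive revision count is simply never issued for bot accounts.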

Should I make a patch?

Change 767231 had a related patch set uploaded (by Huji; author: Huji):

[mediawiki/extensions/SecurePoll@master] Exclude bots from the populateEditCount queries

https://gerrit.wikimedia.org/r/767231

@PleaseStand addressed this in a different way in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/813346/ by capping the query at the maximum number of edits needed to verify eligibility (300 this year). So for bots or super active users, it should never have to scan more than 300 rows (plus the other 20 rows).
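
As a rough sketch of the capping idea (not the actual patch, whose implementation may differ): the `LIMIT 1` in the original query only limits the single row that `COUNT(*)` returns, so it does not bound the scan. Putting the limit inside a subquery does. The example below uses Python with SQLite and a simplified single-table schema. The columns `rev_actor` and `rev_timestamp` on this toy table and the helper `capped_edit_count` are assumptions made for the illustration.

```python
import sqlite3


def capped_edit_count(conn, actor_id, before_ts, cap):
    """Count an actor's revisions before a timestamp, stopping at `cap` rows."""
    sql = (
        "SELECT COUNT(*) FROM ("
        "  SELECT 1 FROM revision"
        "  WHERE rev_actor = ? AND rev_timestamp < ?"
        "  LIMIT ?"  # the inner LIMIT is what bounds the work
        ")"
    )
    return conn.execute(sql, (actor_id, before_ts, cap)).fetchone()[0]


# Toy data (assumption: simplified schema, not the production tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_actor INTEGER, rev_timestamp TEXT)"
)
# A very active account (think of a bot) with 1000 revisions before the cutoff.
conn.executemany(
    "INSERT INTO revision (rev_actor, rev_timestamp) VALUES (?, ?)",
    [(122, "20210101000000")] * 1000,
)
# A less active account: 5 revisions before the cutoff and 1 after it.
conn.executemany(
    "INSERT INTO revision (rev_actor, rev_timestamp) VALUES (?, ?)",
    [(7, "20210101000000")] * 5 + [(7, "20220301000000")],
)

cutoff = "20220207000000"
busy_count = capped_edit_count(conn, 122, cutoff, 300)
quiet_count = capped_edit_count(conn, 7, cutoff, 300)
print(busy_count, quiet_count)
```

A capped count is enough here because eligibility only asks whether the user has reached the threshold, not what the exact total is.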