
Optimize the query for fawiki elections eligible voters
Closed, Declined · Public

Description

Every year we run a query on Quarry to retrieve the list of eligible voters for that year's Supervisory Council elections (the SC is fawiki's equivalent of enwiki's ArbCom). Given the recent changes to the databases, namely the addition of the actor table, the query has gotten slower over time, and it now no longer finishes within the time limit Quarry imposes (which I believe is 20 minutes).

While we could break this into multiple subqueries and combine the results externally, we would much prefer a single query that runs publicly on Quarry, so that the fawiki community can see its results transparently.

The latest version of the query is below. Here is the logic in plain terms:

  • Given that the nomination period starts at 20201012000000
  • Find all users whose account was created at least 3 months prior to the start of the nomination period
    • Old accounts don't have an account creation log entry, so accept those too if they have an edit at least 3 months prior to the start of the nomination period
  • AND who have had at least 500 edits in the article namespace any time before the nomination period
  • AND who have at least 100 edits in the main namespace in the 12 months leading up to the nomination period
  • AND are not bots

I tried my best to use the alternative (_userindex) views, but that does not seem to be sufficient. Is it possible to optimize this further?

use fawiki_p;
select
  concat('[[User:', user_name, '|]]'),
  concat(left(log_timestamp, 4), '-', mid(log_timestamp, 5, 2), '-', mid(log_timestamp, 7, 2)) as acct_created,
  concat(left(first_timestamp, 4), '-', mid(first_timestamp, 5, 2), '-', mid(first_timestamp, 7, 2)) as first_edit,
  concat('{{formatnum:', fivehundred.cnt, '}}') as article_edits,
  concat('{{formatnum:', onehundred.cnt, '}}') as recent_edits,
  '' as voter_status
from user
join actor_logging
  on actor_user = user_id
left join logging_userindex
  on log_actor = actor_id
  and log_type = 'newusers'
  and log_action in ('create', 'autocreate') -- creating main account
  and log_timestamp < '20200712000000' -- occurred 3 or more months prior to nomination start date
join
(
  select
    actor_user,
    min(rev_timestamp) as first_timestamp
  from revision_userindex
  join actor_revision
    on rev_actor = actor_id
  group by actor_user
) first_edit
  on user_id = first_edit.actor_user
join 
(
  select
    actor_user,
    count(*) cnt
  from revision_userindex
  join actor_revision
    on rev_actor = actor_id
  join page
    on page_id = rev_page
  where
    page_namespace = 0
    and rev_timestamp < '20201012000000' -- edits prior to nomination start date
  group by actor_user
  having count(*) >= 500
) fivehundred -- at least 500 edits in the article namespace
  on user_id = fivehundred.actor_user
join
(
  select
    actor_user,
    count(*) cnt
  from revision_userindex
  join actor_revision
    on rev_actor = actor_id
  join page
    on page_id = rev_page
  where
    page_namespace = 0
    and rev_timestamp > '20191012000000' -- last 12 months leading to nomination start date
    and rev_timestamp < '20201012000000'
  group by actor_user
  having count(*) >= 100
) onehundred -- at least 100 edits in the main namespace in the last twelve months leading to the nomination start date
  on user_id = onehundred.actor_user
left join
(
  select ug_user
  from user_groups
  where ug_group = 'bot'
) bots
  on bots.ug_user = user_id
where
  bots.ug_user is null
  and
  (
    log_timestamp is not null
    or 
    first_timestamp < '20200712000000' -- occurred 3 or more months prior to nomination start date
  )
order by user_name

Event Timeline

@bd808 I tried EXPLAIN on a portion of that query, shown below:

select count(*)
from user
join actor_logging
  on actor_user = user_id
left join logging_userindex
  on log_actor = actor_id
  and log_type = 'newusers'
  and log_action in ('create', 'autocreate') -- creating main account
  and log_timestamp < '20200712000000' -- occurred 3 or more months prior to nomination start date
where log_timestamp is not null
order by user_name

In the EXPLAIN results, ipblocks is listed. Any idea why?

(u3518@fawiki.analytics.db.svc.eqiad.wmflabs) [fawiki_p]> show create view user\G
*************************** 1. row ***************************
                View: user
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `user` AS select `fawiki`.`user`.`user_id` AS `user_id`,`fawiki`.`user`.`user_name` AS `user_name`,`fawiki`.`user`.`user_real_name` AS `user_real_name`,NULL AS `user_password`,NULL AS `user_newpassword`,NULL AS `user_email`,NULL AS `user_options`,NULL AS `user_touched`,NULL AS `user_token`,NULL AS `user_email_authenticated`,NULL AS `user_email_token`,NULL AS `user_email_token_expires`,`fawiki`.`user`.`user_registration` AS `user_registration`,NULL AS `user_newpass_time`,`fawiki`.`user`.`user_editcount` AS `user_editcount`,NULL AS `user_password_expires` from `fawiki`.`user` where isnull((select 1 from `fawiki`.`ipblocks` where ((`fawiki`.`ipblocks`.`ipb_auto` = 0) and (`fawiki`.`ipblocks`.`ipb_deleted` = 1) and (`fawiki`.`ipblocks`.`ipb_user` = `fawiki`.`user`.`user_id`))))
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.00 sec)

Oh, right. Filtering oversighted accounts ...
Thanks!
By the way, I managed to rewrite most of the query into something that runs in ~10 minutes: https://quarry.wmflabs.org/query/48142
This does not incorporate the logic for "AND who have had at least 500 edits in the article namespace any time before the nomination period", but I think we are going to settle on a two-step process. If we had CTE support, we could wrap 48142 in a CTE, and that might gain some efficiency, I suppose. But we won't have that until MariaDB is upgraded.
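
For the record, the CTE version would look roughly like the sketch below. This is hypothetical only: it needs MariaDB 10.2+ (WITH support), which the replicas don't have yet, and the CTE body here merely approximates query 48142 using the 100-recent-edits filter from the description.

use fawiki_p;
with candidates as (
  select
    actor_user,
    actor_id
  from revision_userindex
  join actor_revision
    on rev_actor = actor_id
  join page
    on page_id = rev_page
  where
    page_namespace = 0
    and rev_timestamp > '20191012000000'
    and rev_timestamp < '20201012000000'
  group by actor_user, actor_id
  having count(*) >= 100 -- stand-in for the real filtering done in 48142
)
select
  actor_user,
  count(*) as article_edits
from candidates
join revision_userindex
  on rev_actor = candidates.actor_id
join page
  on page_id = rev_page
where
  page_namespace = 0
  and rev_timestamp < '20201012000000' -- edits prior to nomination start date
group by actor_user
having count(*) >= 500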

So, now I have an initial query that reduces the search space to ~700 users: https://quarry.wmflabs.org/query/48142
But the subsequent query, which tries to see which of these ~700 users has at least 500 edits in the main namespace, fails: https://quarry.wmflabs.org/query/48153
Any way to make that last query more efficient in SQL?

My alternative is to write a script that hits Special:Contributions through the API. But if Special:Contributions can return all edits by a user in a given namespace so quickly, why doesn't the SQL query above do the same?

Special:Contributions only queries for one contributor at a time, and it does not do the count(*) operation that your subquery does. Your subquery is functionally equivalent to opening Special:Contributions 700 times and paging through all 700 lists of edits until every page for each user has been seen. Special:Contributions even has a note in its source code that limiting the results to a given namespace makes the query expensive for a single user, even without any aggregation.
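
For comparison, a single Special:Contributions page load boils down to roughly the following shape (an approximation, not the literal MediaWiki-generated SQL; the actor_id is a hypothetical placeholder): one contributor, one page of rows, no count(*).

select rev_id, rev_timestamp, rev_page
from revision_userindex
where rev_actor = 12345 -- hypothetical actor_id of a single contributor
order by rev_timestamp desc
limit 50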

Fair point.

The key difference is that for the HAVING COUNT(*) >= 500 the database needs to compute COUNT(*) in full, whereas with the API I can just ask for 500 rows; if I get that many back, I know the user has 500+ edits and I don't need to page through the rest of their contribs. I cannot think of a way to do this in SQL (can you have a LIMIT statement that is partitioned by some other field?), and frankly, the two-part solution (where the second part is an API-based query) is good enough. So I am going to close this task.
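
For reference, one way to approximate a "per-user LIMIT" in SQL is a correlated EXISTS that only asks whether a 500th qualifying row exists, so the engine can stop at row 500 instead of counting every edit. This is a hedged sketch reusing the table names from the queries above; whether it is actually faster depends on the plan the optimizer picks.

select user_id, user_name
from user
join actor_revision
  on actor_user = user_id
where exists (
  select 1
  from revision_userindex
  join page
    on page_id = rev_page
  where rev_actor = actor_id
    and page_namespace = 0
    and rev_timestamp < '20201012000000'
  limit 1 offset 499 -- row #500, if it exists
)
order by user_name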