The query attempts to calculate an "h-index" type metric for each user, but instead of the count of citations, it looks at the count of interwiki links. The complete query can be found here and used to run in about 15 minutes. It started to time out around the time the actor table was introduced. It can certainly benefit from using some alternative views. However, I think the actual issue lies somewhere else and has nothing to do with the actor table.
At its core, this query depends on a subquery like this:
SELECT actor_name, rev_actor, rev_page, COUNT(ll_lang) AS iwlinks FROM revision_userindex JOIN actor_revision on rev_actor = actor_id JOIN page ON rev_page = page_id AND page_namespace = 0 AND page_is_redirect = 0 JOIN langlinks ON ll_from = rev_page WHERE rev_parent_id = 0 GROUP BY actor_name, rev_actor, rev_page HAVING COUNT(ll_lang) > 5
It will then join that with itself and apply some additional logic and aggregations, all of which can also be done in the python script that runs the query. The only part that we absolutely need to be in SQL is the short query above. However, even that query does not finish within 30 minutes. Alas, even the following version which doesn't even join actor does not finish within 30 minutes:
SELECT rev_actor, rev_page, COUNT(ll_lang) AS iwlinks FROM revision_userindex JOIN page ON rev_page = page_id AND page_namespace = 0 AND page_is_redirect = 0 JOIN langlinks ON ll_from = rev_page WHERE rev_parent_id = 0 GROUP BY rev_actor, rev_page HAVING COUNT(ll_lang) > 5
Do you have any advice on how to optimize the query?
Also, this is one of the several queries we run which looks at who created an article (i.e. joins page and revision and filters on page_namespace and rev_parent_id). I wonder if we should create a table called page_creator that just contains the information (perhaps with three columns page_id, create_actor_id, create_rev_id)