
[wikireplicas] slow query runs every hour, but never completes
Closed, Resolved · Public

Description

On clouddb1019, this query is running every hour, and always ends up being killed by wmf-pt-kill after 3 hours (10800 sec):

select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
fnegri@clouddb1019:~$ sudo zgrep "select  cl_from" /var/log/wmf-pt-kill/wmf-pt-kill-s4.log.* |tail -n 10
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T14:17:08 KILL 12895571 (Query 10806 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T15:17:08 KILL 12905210 (Query 10807 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T16:17:09 KILL 12915009 (Query 10807 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T17:17:09 KILL 12924295 (Query 10808 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T18:17:10 KILL 12934082 (Query 10808 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T19:17:10 KILL 12944736 (Query 10808 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T20:17:10 KILL 12954814 (Query 10809 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T21:17:11 KILL 12964440 (Query 10809 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T22:17:11 KILL 12973862 (Query 10809 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
/var/log/wmf-pt-kill/wmf-pt-kill-s4.log.4:# 2025-08-09T23:17:01 KILL 12983150 (Query 10800 sec) select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id

The query is running as user u1115 (@dschwen) from host 10.64.151.2 (cloudlb1001.eqiad.wmnet).

*************************** 11. row ***************************
      Id: 18406341
    User: u1115
    Host: 10.64.151.2:45334
      db: commonswiki_p
 Command: Query
    Time: 771
   State: Creating sort index
    Info: select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id
Progress: 0.000

While the query does not by itself cause table locks, it can hold up an ALTER TABLE if the two overlap, causing replication lag and leaving other queries stuck in Waiting for table metadata lock. Example from today:

|       36 | system user  |                      | commonswiki   | Slave_SQL |   21885 | Waiting for table metadata lock  | ALTER TABLE  categorylinks   DROP cl_to,   DROP cl_collation                                         |    0.000 |

[...]

| 18311052 | u1115        | 10.64.151.2:55786    | commonswiki_p | Query     |   10580 | Creating sort index              | select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=1 |    0.000 |
| 18316193 | u1115        | 10.64.151.2:57474    | commonswiki_p | Query     |    6979 | Creating sort index              | select  cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=1 |    0.000 |

[...]

| 18323283 | s51786       | 10.64.151.2:46044    | commonswiki_p | Query     |    1767 | Waiting for table metadata lock  | SELECT count(*) FROM templatelinks INNER JOIN linktarget ON tl_target_id = lt_id WHERE lt_title = 'I |    0.000 |

Event Timeline

fgiunchedi triaged this task as Medium priority.Sep 4 2025, 10:16 AM

It's slightly more complicated, as it does sometimes complete before the 3 hours. This is the number of times it was killed per day in the last month:

fnegri@clouddb1019:~$ for i in $(seq -f "%02g" 01 31); do echo -n "2025-08-$i "; zgrep "select  cl_from" /var/log/wmf-pt-kill/wmf-pt-kill-s4.log.* |grep 2025-08-$i |wc -l; done
2025-08-01 0
2025-08-02 0
2025-08-03 24
2025-08-04 24
2025-08-05 24
2025-08-06 24
2025-08-07 24
2025-08-08 24
2025-08-09 24
2025-08-10 21
2025-08-11 24
2025-08-12 24
2025-08-13 24
2025-08-14 24
2025-08-15 24
2025-08-16 24
2025-08-17 24
2025-08-18 24
2025-08-19 24
2025-08-20 24
2025-08-21 23
2025-08-22 13
2025-08-23 4
2025-08-24 11
2025-08-25 4
2025-08-26 15
2025-08-27 2
2025-08-28 8
2025-08-29 7
2025-08-30 0
2025-08-31 0

Ok, I have set this to run once a day, and I think I can split it up into many small queries by querying ranges of page IDs. Would that help?

Yes, that would help thank you!

The other thing that would help is making sure that a new query is not started while the previous one is still running. At the moment, up to 3 copies of the query can be running at the same time, so even after the first one is killed at the 3-hour mark, the others are still running and the server is constantly busy.

Question: where are those queries running? Is it on a VM?

It's done. Let's see if that helps.

Yes, they are running on a wmcloud VM. With spacing them out to run daily there should not be any overlap, but I can add a lock to my script to prevent overlapping execution.
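A minimal sketch of such a lock, using an exclusive non-blocking flock(2) on a lock file; the lock-file path and exit message here are illustrative, not the actual FastCCI script:

```python
import fcntl
import sys

# Hypothetical lock-file path; any writable location on the VM would do.
LOCK_PATH = "/tmp/fastcci_dump.lock"

# Keep the file object alive for the whole run: the lock is released
# automatically when the process exits (even if it crashes or is killed).
lock_file = open(LOCK_PATH, "w")
try:
    # LOCK_NB makes this fail immediately instead of waiting if another
    # instance of the script already holds the lock.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("previous run still in progress, exiting")
    sys.exit(0)

# ... run the batched queries here ...
```

This avoids overlapping runs without any coordination beyond the local filesystem, which is enough since the script runs on a single VM.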

The new query template is

sql = (
    "SELECT /* SLOW_OK */ cl_from, page_id, "
    "CAST(cl_type AS CHAR) AS cl_type "
    "FROM categorylinks "
    "JOIN page ON page_id >= %s AND page_id < %s "
    "  AND page_namespace = 14 "
    "  AND page_title = cl_to "
    "WHERE cl_type != 'page' "
    "ORDER BY page_id;"
)
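For context, a sketch of how the template above could be driven over page-ID windows; the batch size, upper bound, and cursor interface (PyMySQL-style) are assumptions, not the actual script:

```python
# The template posted above, parameterized by a half-open page_id window.
SQL = (
    "SELECT /* SLOW_OK */ cl_from, page_id, "
    "CAST(cl_type AS CHAR) AS cl_type "
    "FROM categorylinks "
    "JOIN page ON page_id >= %s AND page_id < %s "
    "  AND page_namespace = 14 "
    "  AND page_title = cl_to "
    "WHERE cl_type != 'page' "
    "ORDER BY page_id;"
)

def page_id_ranges(max_page_id, batch):
    """Yield half-open [start, end) windows covering 1..max_page_id."""
    for start in range(1, max_page_id + 1, batch):
        yield start, start + batch

def run_batches(cursor, max_page_id, batch=100_000):
    """Run the template once per window; each query stays small and fast."""
    for start, end in page_id_ranges(max_page_id, batch):
        cursor.execute(SQL, (start, end))
        yield from cursor.fetchall()
```

Because the windows are half-open and contiguous, every page_id is covered exactly once regardless of batch size.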

let me know if you see any issues or optimization potential

ChatGPT suggests

SELECT /* SLOW_OK */
  cl.cl_from,
  p.page_id,
  cl.cl_type
FROM page AS p
STRAIGHT_JOIN categorylinks AS cl
  ON p.page_namespace = 14
 AND cl.cl_to        = p.page_title
WHERE p.page_id >= %s AND p.page_id < %s
  AND cl.cl_type IN ('subcat','file')   -- instead of != 'page'
ORDER BY p.page_id;

With spacing them out to run daily there should not be any overlap

Yes that should be fine without a lock, if you split into smaller more frequent scripts a lock could be useful. But if daily is fine it's probably not needed.

let me know if you see any issues or optimization potential

I'll check that later and get back to you.

Yes, they are running on a wmcloud VM.

Can I ask you the VM name? I was trying to reverse-engineer it but it's hard because there's a proxy in between :)

fnegri changed the task status from Open to In Progress.Sep 4 2025, 2:48 PM
fnegri claimed this task.

Can I ask you the VM name? I was trying to reverse-engineer it but it's hard because there's a proxy in between :)

Oh yes, of course. This is part of the FastCCI project and runs on fastcci-master.

A test run finished in just under 4h.

I double checked and there were no more queries getting killed because they reached 3 hours, which I think means that the batch approach is working and the batch is small enough.

I would aim for a batch size where a single query completes in under 10 minutes; a bit more is not a problem. The important thing is that the average runtime stays well below 3 hours, so even if the server is a bit slower you don't risk hitting the limit and having queries killed.

Maybe there is some potential for further optimizing the query and the overall workflow, but for now I think this is good enough.

Thanks for the prompt response! I'll mark this as Resolved.

Ok, the total runtime right now is

START Fri Sep 5 02:17:01 UTC 2025
SUCCESS Fri Sep 5 06:06:43 UTC 2025

~230 min. Right now I'm splitting that into ~1700 queries (the exact count depends on the number of Commons pages), so each query completes well under a minute. I could make the batch window a bit larger if there is a sweet spot between reducing the total query count and keeping single-query time low.

It's hard to predict whether larger batch windows would be faster overall; feel free to experiment if you have time. Otherwise I think you can leave things as they are for the moment.