Today, June 20th, 2024, we got paged because of MariaDB replica lag on the s1 cluster, specifically on db1206.
The configured paging threshold is 300 seconds, and the actual value was 300.70 seconds, ever so slightly above the limit. Shortly after, the issue resolved itself without manual intervention.
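For context, lag like this can be eyeballed directly on the replica with the statement below (a minimal sketch; as far as I know the alert's slave_sql_lag metric is derived from a heartbeat table rather than from this status output, so treat it as illustrative):

-- On db1206: how far the replica's SQL thread is behind the primary.
-- A Seconds_Behind_Master at or above 300 corresponds to the paging threshold.
SHOW SLAVE STATUS\G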
First investigation by the DBA showed:
18:50 < Amir1> db1206 starts lagging a lot when dumps start, something in the query is either wrong or it bombard the replica. Either way, it needs to be investigated.
18:55 < Amir1> > | 236130644 | wikiadmin2023 | 10.64.0.157:37742 | enwiki | Query | 1 | Creating sort index | SELECT /* WikiExporter::dumpPages */ /*! .. STRAIGHT_JOIN */ re
The alert that paged us:

Notification Type: PROBLEM
Service: MariaDB Replica Lag: s1 #page
Host: db1206 #page
Address: 10.64.16.89
State: CRITICAL
Date/Time: Thu Jun 20 18:15:09 UTC 2024
Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica
Acknowledged by:
Additional Info:
CRITICAL slave_sql_lag Replication lag: 300.70 seconds
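One way to find the statement responsible is to filter the replica's processlist for long-running queries from the dump user, roughly what the DBA did in the IRC excerpt above (a sketch; wikiadmin2023 is the user visible in that processlist row):

-- Long-running dump statements currently executing on the replica.
SELECT ID, USER, HOST, DB, TIME, STATE, LEFT(INFO, 100) AS query
FROM information_schema.PROCESSLIST
WHERE USER = 'wikiadmin2023' AND COMMAND = 'Query'
ORDER BY TIME DESC;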
The query:
SELECT /* WikiExporter::dumpPages */ /*! STRAIGHT_JOIN */
  rev_id, rev_page, rev_actor,
  actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`,
  rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1,
  comment_rev_comment.comment_text AS `rev_comment_text`,
  comment_rev_comment.comment_data AS `rev_comment_data`,
  comment_rev_comment.comment_id AS `rev_comment_cid`,
  page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len,
  slot_revision_id, slot_content_id, slot_origin, slot_role_id,
  content_size, content_sha1, content_address, content_model
FROM `revision`
JOIN `page` ON ((rev_page = page_id))
JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor))
JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id))
JOIN `slots` ON ((slot_revision_id = rev_id))
JOIN `content` ON ((content_id = slot_content_id))
WHERE (page_id >= 673734 AND page_id < 673816)
  AND ((rev_page > 0 OR (rev_page = 0 AND rev_id > 0)))
ORDER BY rev_page ASC, rev_id ASC
LIMIT 50000
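The "Creating sort index" state seen in the processlist suggests the ORDER BY is being resolved with a sort rather than by walking an index. One way to check would be to EXPLAIN a trimmed version of the dump query on the replica (a sketch; the actual plan depends on the schema and index statistics):

EXPLAIN SELECT /* WikiExporter::dumpPages */ /*! STRAIGHT_JOIN */ rev_id, rev_page
FROM `revision` JOIN `page` ON ((rev_page = page_id))
WHERE (page_id >= 673734 AND page_id < 673816)
ORDER BY rev_page ASC, rev_id ASC
LIMIT 50000;
-- "Using filesort" in the Extra column would match the observed state.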
In addition, dump generation for English Wikipedia also caused network saturation in eqiad:
<Amir1> yup it's dumps: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535
<Amir1> it seems to be only snapshot1012 and that host has enwiki dump running
This network saturation had severe consequences, including a full outage for editing that persisted for more than half an hour.