
Slow watchlist queries due to large and expensive temporary table construction
Closed, Resolved (Public)

Description

Example query on enwiki:

SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,rc_actor,recentchanges_actor.actor_user AS `rc_user`,recentchanges_actor.actor_name AS `rc_user_text`,recentchanges_comment.comment_text AS `rc_comment_text`,recentchanges_comment.comment_data AS `rc_comment_data`,recentchanges_comment.comment_id AS `rc_comment_id`,we_expiry,page_latest,wl_notificationtimestamp,(SELECT  GROUP_CONCAT(ctd_name SEPARATOR ',')  FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))   WHERE (ct_rc_id=rc_id)  ) AS `ts_tags`,fp_stable,fp_pending_since,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,ores_goodfaith_cls.oresc_probability AS `ores_goodfaith_score`  FROM `recentchanges` STRAIGHT_JOIN `actor` `recentchanges_actor` ON ((actor_id=rc_actor)) STRAIGHT_JOIN `comment` `recentchanges_comment` ON ((comment_id=rc_comment_id)) JOIN `watchlist` ON (wl_user = 48747569 AND (wl_namespace=rc_namespace) AND (wl_title=rc_title)) LEFT JOIN `watchlist_expiry` ON ((wl_id = we_item)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `flaggedpages` ON ((fp_page_id = rc_cur_id)) LEFT JOIN `ores_classification` `ores_damaging_cls` ON (ores_damaging_cls.oresc_model = 59 AND (ores_damaging_cls.oresc_rev=rc_this_oldid) AND ores_damaging_cls.oresc_class = 1) LEFT JOIN `ores_classification` `ores_goodfaith_cls` ON (ores_goodfaith_cls.oresc_model = 60 AND (ores_goodfaith_cls.oresc_rev=rc_this_oldid) AND ores_goodfaith_cls.oresc_class = 1)   WHERE (rc_type != 6) AND (rc_source != 'wb') AND (rc_timestamp >= '20250730013715') AND ((we_expiry IS NULL OR we_expiry > '20250829013715')) AND ((rc_type != 3 OR (rc_deleted & 1) != 1))  ORDER BY rc_timestamp DESC LIMIT 1000  

# Query_time: 2.845925  Lock_time: 0.000491  Rows_sent: 1000  Rows_examined: 695804

select count(*) from watchlist where wl_user=48747569;
2288

select count(*) from recentchanges join watchlist on (wl_user = 48747569 AND (wl_namespace=rc_namespace) AND (wl_title=rc_title));
61515

Explain:

id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | watchlist | ref | wl_user,wl_user_notificationtimestamp,wl_namespace_title | wl_user_notificationtimestamp | 4 | const | 2333 | Using temporary; Using filesort
1 | PRIMARY | watchlist_expiry | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.watchlist.wl_id | 1 | Using where
1 | PRIMARY | recentchanges | ref | rc_timestamp, rc_name_type_patrolled_timestamp, rc_ns_actor, rc_actor, rc_namespace_title_timestamp, rc_source_name_timestamp, rc_name_source_patrolled_timestamp | rc_namespace_title_timestamp | 261 | enwiki.watchlist.wl_namespace, enwiki.watchlist.wl_title | 1 | Using index condition; Using where
1 | PRIMARY | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_cur_id | 1 |
1 | PRIMARY | flaggedpages | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_cur_id | 1 |
1 | PRIMARY | ores_damaging_cls | eq_ref | oresc_rev_model_class | oresc_rev_model_class | 7 | enwiki.recentchanges.rc_this_oldid,const,const | 1 |
1 | PRIMARY | ores_goodfaith_cls | eq_ref | oresc_rev_model_class | oresc_rev_model_class | 7 | enwiki.recentchanges.rc_this_oldid,const,const | 1 |
1 | PRIMARY | recentchanges_actor | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.recentchanges.rc_actor | 1 |
1 | PRIMARY | recentchanges_comment | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.recentchanges.rc_comment_id | 1 |
2 | DEPENDENT SUBQUERY | change_tag | ref | ct_rc_tag_id,ct_tag_id_id | ct_rc_tag_id | 9 | enwiki.recentchanges.rc_id | 1 | Using index
2 | DEPENDENT SUBQUERY | change_tag_def | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.change_tag.ct_tag_id | 1 |

The user is asking for the first 1k of 61k rows, but 695k rows are examined. The multiplier comes from the large number of joins that collect data needed for display: a temporary table of 61k rows, with all joins resolved, is constructed before being truncated to 1k rows. This issue probably affects other ChangesListSpecialPage subclasses.
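The size of the multiplier can be sanity-checked with some back-of-the-envelope arithmetic from the counts above (a rough model, not an exact accounting of MariaDB's row counter):

```python
# Rough check of the rows-examined multiplier, using the counts above.
# The interpretation (one extra row examined per joined table per row)
# is an assumption based on the EXPLAIN output, not a measured fact.
joined_rows = 61515      # watchlist x recentchanges matches
rows_examined = 695804   # from the slow log entry
lookups_per_row = rows_examined / joined_rows
print(f"~{lookups_per_row:.1f} rows examined per joined row")  # ~11.3
```

That is roughly one lookup per joined table in the EXPLAIN, consistent with every one of the 61k rows being fully resolved before the LIMIT is applied.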

Applying the filters in a subquery in the FROM clause, and doing the joins for display in the outer query, reduces the number of rows examined to ~140k:

SELECT  rc.rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,rc_actor,recentchanges_actor.actor_user AS `rc_user`,recentchanges_actor.actor_name AS `rc_user_text`,recentchanges_comment.comment_text AS `rc_comment_text`,recentchanges_comment.comment_data AS `rc_comment_data`,recentchanges_comment.comment_id AS `rc_comment_id`,we_expiry,page_latest,wl_notificationtimestamp,(SELECT  GROUP_CONCAT(ctd_name SEPARATOR ',')  FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))   WHERE (ct_rc_id=rc_sq.rc_id)  ) AS `ts_tags`,fp_stable,fp_pending_since,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,ores_goodfaith_cls.oresc_probability AS `ores_goodfaith_score`  FROM (SELECT rc_id FROM `recentchanges` JOIN `watchlist` ON (wl_user = 48747569 AND (wl_namespace=rc_namespace) AND (wl_title=rc_title)) LEFT JOIN `watchlist_expiry` ON ((wl_id = we_item)) WHERE (rc_type != 6) AND (rc_source != 'wb') AND (rc_timestamp >= '20250730013715') AND ((we_expiry IS NULL OR we_expiry > '20250829013715')) AND ((rc_type != 3 OR (rc_deleted & 1) != 1))  ORDER BY rc_timestamp DESC LIMIT 1000) AS rc_sq JOIN recentchanges AS rc ON rc.rc_id=rc_sq.rc_id JOIN `watchlist` ON (wl_user = 48747569 AND (wl_namespace=rc_namespace) AND (wl_title=rc_title)) LEFT JOIN `watchlist_expiry` ON ((wl_id = we_item)) STRAIGHT_JOIN `actor` `recentchanges_actor` ON ((actor_id=rc_actor)) STRAIGHT_JOIN `comment` `recentchanges_comment` ON ((comment_id=rc_comment_id))  LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `flaggedpages` ON ((fp_page_id = rc_cur_id)) LEFT JOIN `ores_classification` `ores_damaging_cls` ON (ores_damaging_cls.oresc_model = 59 AND (ores_damaging_cls.oresc_rev=rc_this_oldid) AND ores_damaging_cls.oresc_class = 1) LEFT JOIN `ores_classification` `ores_goodfaith_cls` ON (ores_goodfaith_cls.oresc_model = 60 AND 
(ores_goodfaith_cls.oresc_rev=rc_this_oldid) AND ores_goodfaith_cls.oresc_class = 1)

# Query_time: 0.429881  Lock_time: 0.000557  Rows_sent: 1000  Rows_examined: 140392

If the timestamp range is known or can be guessed, the resulting query is several times faster still, with only 16k rows examined:

SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,rc_actor,recentchanges_actor.actor_user AS `rc_user`,recentchanges_actor.actor_name AS `rc_user_text`,recentchanges_comment.comment_text AS `rc_comment_text`,recentchanges_comment.comment_data AS `rc_comment_data`,recentchanges_comment.comment_id AS `rc_comment_id`,we_expiry,page_latest,wl_notificationtimestamp,(SELECT  GROUP_CONCAT(ctd_name SEPARATOR ',')  FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))   WHERE (ct_rc_id=rc_id)  ) AS `ts_tags`,fp_stable,fp_pending_since,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,ores_goodfaith_cls.oresc_probability AS `ores_goodfaith_score`  FROM `recentchanges` STRAIGHT_JOIN `actor` `recentchanges_actor` ON ((actor_id=rc_actor)) STRAIGHT_JOIN `comment` `recentchanges_comment` ON ((comment_id=rc_comment_id)) JOIN `watchlist` ON (wl_user = 48747569 AND (wl_namespace=rc_namespace) AND (wl_title=rc_title)) LEFT JOIN `watchlist_expiry` ON ((wl_id = we_item)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `flaggedpages` ON ((fp_page_id = rc_cur_id)) LEFT JOIN `ores_classification` `ores_damaging_cls` ON (ores_damaging_cls.oresc_model = 59 AND (ores_damaging_cls.oresc_rev=rc_this_oldid) AND ores_damaging_cls.oresc_class = 1) LEFT JOIN `ores_classification` `ores_goodfaith_cls` ON (ores_goodfaith_cls.oresc_model = 60 AND (ores_goodfaith_cls.oresc_rev=rc_this_oldid) AND ores_goodfaith_cls.oresc_class = 1)   WHERE (rc_type != 6) AND (rc_source != 'wb') AND rc_timestamp BETWEEN '20250829142829' AND '20250829235200' AND ((we_expiry IS NULL OR we_expiry > '20250829013715')) AND ((rc_type != 3 OR (rc_deleted & 1) != 1))  ORDER BY rc_timestamp DESC LIMIT 1000

# Query_time: 0.103520  Lock_time: 0.000531  Rows_sent: 1000  Rows_examined: 16077

Event Timeline

PEPE1234.13 triaged this task as Unbreak Now! priority.
Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage.

The subquery is harder to integrate, because several extensions add joins, and they would need to tell us which part of the query to attach their joins to. It might not be possible to maintain backwards compatibility with the ChangesListSpecialPageQuery and ChangesListSpecialPageStructuredFilters hooks.

The subquery could be used to fix T406676 since it's easier to add all the fields to the GROUP BY when the only field in the relevant subquery is rc_id.

Not many things care about the timestamp range, so that option is more self-contained. The timestamp range may be hard to guess, and there is a risk of a performance regression if we guess too low and run many more queries than necessary. But it is certain to reduce the rate of slow queries.

The timestamp range option offers 4x better performance than the subquery option, for the example query in the task description.

The options are not exclusive; both can be done, which would provide better performance than either alone.

I think I am leaning towards trying the timestamp option, and leaving the subquery for later.

Change #1195097 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] recentchanges: Optionally partition queries by timestamp

https://gerrit.wikimedia.org/r/1195097

Test wiki created on Patch demo by KGraessle-WMF using patch(es) linked to this task:
https://efd674b1a9.catalyst.wmcloud.org/w/

Test wiki on Patch demo by KGraessle-WMF using patch(es) linked to this task was deleted:

https://efd674b1a9.catalyst.wmcloud.org/w/

Change #1195097 merged by jenkins-bot:

[mediawiki/core@master] recentchanges: Optionally partition queries by timestamp

https://gerrit.wikimedia.org/r/1195097

Change #1197748 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] recentchanges: QueryRateEstimator improvements

https://gerrit.wikimedia.org/r/1197748

Change #1197748 merged by jenkins-bot:

[mediawiki/core@master] recentchanges: QueryRateEstimator improvements

https://gerrit.wikimedia.org/r/1197748

Change #1198178 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.45.0-wmf.24] recentchanges: QueryRateEstimator improvements

https://gerrit.wikimedia.org/r/1198178

Change #1198178 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.24] recentchanges: QueryRateEstimator improvements

https://gerrit.wikimedia.org/r/1198178

Mentioned in SAL (#wikimedia-operations) [2025-10-23T02:05:51Z] <tstarling@deploy2002> Started scap sync-world: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-23T02:11:11Z] <tstarling@deploy2002> tstarling: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-23T02:21:22Z] <tstarling@deploy2002> Finished scap sync-world: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]] (duration: 15m 30s)

testwiki with days=30, limit=20:

Beginning partition request with rate=0.00024305555555556, density=1, period=2592000 {"period":2592000,"limit":20,"rate":0.00024305555555555555,"density":1,"rcSize":3525,"context.special_page_name":"Recentchanges","context.special_page_subpage":""} 
Finished partition request: got 20 rows in 1 queries, period=34557 {"queryCount":1,"actualPeriod":34557,"queryPeriod":90515,"actualRows":20,"context.special_page_name":"Recentchanges","context.special_page_subpage":""}

again:

Beginning partition request with rate=0.00023131307478501, density=1, period=2592000 {"period":2592000,"limit":20,"rate":0.00023131307478500943,"density":1,"rcSize":3528,"context.special_page_name":"Recentchanges","context.special_page_subpage":""} 
DEBUG: Finished partition request: got 20 rows in 1 queries, period=29595 {"queryCount":1,"actualPeriod":29595,"queryPeriod":95110,"actualRows":20,"context.special_page_name":"Recentchanges","context.special_page_subpage":""}

The performance of the partitioner can be judged by:

  • queryPeriod/period (here 4%). This is how much scanning the partitioner saved compared to not having it, assuming MariaDB would have scanned to the end of the specified period.
  • actualPeriod/queryPeriod (here 31%). This is how close it got to requiring another query, i.e. how many rows were unnecessarily scanned compared to the optimal period. For small row counts it is biased towards avoiding a second query. For large row counts with rate estimation, the target is roughly 1/INFORMED_INITIAL_FACTOR = 91%.
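Both ratios can be reproduced directly from the values in the second log entry above:

```python
# Partitioner metrics from the second testwiki log entry (seconds).
period = 2592000      # requested window: days=30
query_period = 95110  # window the partitioner actually queried
actual_period = 29595 # window that contained the 20 returned rows

print(f"queryPeriod/period = {query_period / period:.0%}")               # 4%
print(f"actualPeriod/queryPeriod = {actual_period / query_period:.0%}")  # 31%
```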

There is an overhead due to loading the rate estimate; judging by log timestamp deltas, it's less than 3ms. The annual plan project is about logged slow queries, not average performance.

With days=30, limit=200:

2025-10-23 02:28:00.860083 [d1451968-3460-4fe2-92a9-b54e7e6cdb67] mw-debug.codfw.pinkunicorn-99c98cfd-nbp78 testwiki 1.45.0-wmf.24 ChangesListQuery DEBUG: Beginning partition request with rate=0.00024246493584005, density=1, period=2592000 {"period":2592000,"limit":200,"rate":0.00024246493584004775,"density":1,"rcSize":3528,"context.special_page_name":"Recentchanges","context.special_page_subpage":""} 
2025-10-23 02:28:00.879117 [d1451968-3460-4fe2-92a9-b54e7e6cdb67] mw-debug.codfw.pinkunicorn-99c98cfd-nbp78 testwiki 1.45.0-wmf.24 ChangesListQuery DEBUG: Finished partition request: got 200 rows in 1 queries, period=302025 {"queryCount":1,"actualPeriod":302025,"queryPeriod":907348,"actualRows":200,"context.special_page_name":"Recentchanges","context.special_page_subpage":""}

again:

2025-10-23 02:28:35.908516 [076bee2b-d9c2-4a46-9216-d3771b1fd262] mw-debug.codfw.pinkunicorn-99c98cfd-nbp78 testwiki 1.45.0-wmf.24 ChangesListQuery DEBUG: Beginning partition request with rate=0.00056415102322892, density=1, period=2592000 {"period":2592000,"limit":200,"rate":0.0005641510232289184,"density":1,"rcSize":3528,"context.special_page_name":"Recentchanges","context.special_page_subpage":""} 
2025-10-23 02:28:35.925089 [076bee2b-d9c2-4a46-9216-d3771b1fd262] mw-debug.codfw.pinkunicorn-99c98cfd-nbp78 testwiki 1.45.0-wmf.24 ChangesListQuery DEBUG: Finished partition request: got 200 rows in 1 queries, period=302060 {"queryCount":1,"actualPeriod":302060,"queryPeriod":389967,"actualRows":200,"context.special_page_name":"Recentchanges","context.special_page_subpage":""}

Here queryPeriod/period = 15%, actualPeriod/queryPeriod = 75%.

I tested the watchlist in the task description by calling SpecialWatchlist::getRows() from eval.php. The first time with no partitioning, the main query took 17.5 seconds. The second time, after warmup, it took 3.4 seconds. With partitioning, there were two queries for a total query time of 0.19 seconds.

For a new enwiki slow log entry, user 3116621, baseline after warmup was 1.3s, partitioned time was 0.5s over three queries.

For a commonswiki slow log entry, user 611, baseline was 4.1s and partitioned time was 2.5s, despite the limit being exhausted, so it should have scanned the same rows.

enwiki user 32651679, 1.3s -> 0.4s.

In general I expect it to be more impactful on enwiki than commonswiki, since commonswiki slow log entries tend to have more watchlist rows with fewer recentchanges matches.

This is all with naive rates, since the rate estimator is broken for watchlists: it includes the we_expiry > $now condition in the key.

Typical eval.php script:

$userId = 32651679;
$query = wfCgiToArray('action=render&days=30&enhanced=0&hideWikibase=1&hidecategorization=1&limit=1000&title=Special%3AWatchlist&urlversion=2');

$sv = MediaWiki\MediaWikiServices::getInstance();
$user = $sv->getUserFactory()->newFromId($userId);
$request = new MediaWiki\Request\FauxRequest($query);
$context = RequestContext::getMain();
$context->setUser($user);
$context->setRequest($request);
$page = $sv->getSpecialPageFactory()->getPage('Watchlist');
$context->setTitle($page->getPageTitle());
// Baseline run without partitioning
$page->getRows();

// Second run with partitioning enabled
$query['enable_partitioning'] = 1;
$request = new MediaWiki\Request\FauxRequest($query);
$context->setRequest($request);
$page = $sv->getSpecialPageFactory()->getPage('Watchlist');
$page->getRows();

For the other special pages, we can only expect a performance advantage when a table other than recentchanges is first in the join order. Most Special:RecentChanges slow queries are due to ORES, which is not suitable for placing first in the join order, so there is no advantage there. I checked the EXPLAIN for a few other RC queries, but they all put recentchanges first.

For a large commonswiki RecentChangesLinked (Category:CC-BY-SA-3.0-migrated), the baseline was 1.9s and the partitioned time was 0.4s over three queries, so that's a decent saving.

In this case QueryRateEstimator did work, but regressed the query time back to 1.8s by asking for the full timestamp range in the first query, so it needs some tuning. Running the query with a high limit, then running the original query again (with limit 50), fixed that issue and gave a query time of 0.3s. QueryRateEstimator is bad at estimating the rate from a prior query that covers a timestamp range much smaller than a day: it keeps returning bad results until someone runs a query that returns results from a significant portion of the day, and this pattern would repeat daily.
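The failure mode can be illustrated with invented numbers (the counts and windows below are hypothetical, not taken from the estimator's actual code or logs): if the prior query covered only a short burst of activity, extrapolating its per-second rate to a full day badly overestimates the true rate.

```python
# Hypothetical illustration of rate extrapolation from a short window.
rows_in_prior_query = 50
prior_window = 300   # seconds covered by the prior query (a burst)
day = 86400

burst_rate = rows_in_prior_query / prior_window  # rows/second
predicted_daily = burst_rate * day               # naive extrapolation
actual_daily = 500                               # hypothetical true daily count

print(f"predicted {predicted_daily:.0f} rows/day vs ~{actual_daily} actual")
```

An estimate this far off leads the partitioner to request a far smaller (or, in the reverse case, far larger) timestamp range than needed.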

I could just delete QueryRateEstimator and enable partitioning on Watchlist and RCL only. That's feasible within a day of work, which is how much time I have for this at the moment.

Change #1198421 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] recentchanges: Back out QueryRateEstimator

https://gerrit.wikimedia.org/r/1198421

Change #1198422 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] recentchanges: Limit partitioning to only watchlist and RCL

https://gerrit.wikimedia.org/r/1198422

Change #1198421 merged by jenkins-bot:

[mediawiki/core@master] recentchanges: Back out QueryRateEstimator

https://gerrit.wikimedia.org/r/1198421

Change #1198422 merged by jenkins-bot:

[mediawiki/core@master] recentchanges: Limit partitioning to only watchlist and RCL

https://gerrit.wikimedia.org/r/1198422

Change #1199890 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on mediawikiwiki

https://gerrit.wikimedia.org/r/1199890

Change #1199891 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on enwiki and commonswiki

https://gerrit.wikimedia.org/r/1199891

Change #1199892 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on all wikis

https://gerrit.wikimedia.org/r/1199892

@tstarling

Let us know when the partitioning is live.

@Ladsgroup

When all of the changes are live, please let us know the impact on the slow query rate.

Change #1199890 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on mediawikiwiki

https://gerrit.wikimedia.org/r/1199890

Mentioned in SAL (#wikimedia-operations) [2025-10-30T22:15:37Z] <tstarling@deploy2002> Started scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-30T22:42:04Z] <tstarling@deploy2002> tstarling: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-30T22:55:58Z] <tstarling@deploy2002> Finished scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] (duration: 40m 21s)

Change #1199891 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on enwiki and commonswiki

https://gerrit.wikimedia.org/r/1199891

Mentioned in SAL (#wikimedia-operations) [2025-10-30T23:21:07Z] <tstarling@deploy2002> Started scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-30T23:25:30Z] <tstarling@deploy2002> tstarling: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-30T23:35:41Z] <tstarling@deploy2002> Finished scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] (duration: 14m 33s)

I made a Grafana dashboard for the relevant metrics: https://grafana.wikimedia.org/goto/jL2bYqRDg?orgId=1

At the time of the enwiki+commonswiki deployment at 23:35, there was a drop in the p90 response time for the watchlist, but it's hard to say whether it's statistically significant. For my Logstash dashboard I'm using a weekly count, and even then it is quite noisy.

In any case, the site seems to be up, and it's not slower than before, so I'll deploy the feature flag to the remaining wikis.

Change #1199892 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable ChangesListQuery partitioning on all wikis

https://gerrit.wikimedia.org/r/1199892

Mentioned in SAL (#wikimedia-operations) [2025-10-31T02:08:07Z] <tstarling@deploy2002> Started scap sync-world: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-31T02:12:27Z] <tstarling@deploy2002> tstarling: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-31T02:48:08Z] <tstarling@deploy2002> Finished scap sync-world: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]] (duration: 40m 01s)

Watchlist slow query count -- deployment was at 23:35

watchlist slow queries.png (240×899 px, 20 KB)

An average over the past 3 days, compared to a week earlier, suggests this intervention reduced slow queries (in terms of total duration per day) by 59% for Watchlist and 77% for RecentChangesLinked.

A lot of the remaining Watchlist slow queries have very large watchlists, over 500k items, so scanning the watchlist table alone is enough to make it slow. That's a problem which is not covered by this task.