
s2 primary master getting reads?
Closed, Resolved · Public

Assigned To
Authored By
Marostegui
Feb 2 2026, 2:01 PM

Description

Since 1st Feb it seems that the s2 primary master (db2204 as of today) has been getting reads (https://grafana.wikimedia.org/goto/RFXwixHvg?orgId=1)

screenshot.png (256×783 px, 57 KB)

There seems to be an increase in read activity on it, which doesn't make much sense, as it is configured with weight 0 in dbctl, so no reads should be arriving there.
This was part of the investigation into some lag spikes on two codfw replicas, db2175 and db2148.

The amount of read_key and read_next went through the roof on 1st Feb.
There seems to be no related entry in SAL (it was a Sunday).

I guess I can enable the slow query log and check what's arriving there, unless something obvious turns up on the MW side of things.
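For reference, a minimal way to do that on MariaDB would be something like the following (a sketch, assuming admin rights on the host; these are the standard server variables, not anything WMF-specific):

```sql
-- Enable the slow query log at runtime and temporarily log every query.
-- This is expensive; remember to restore long_query_time afterwards.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0;                -- log all queries, not just slow ones
SET GLOBAL log_slow_verbosity = 'query_plan';  -- MariaDB: include plan info per query
```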

Event Timeline

I'm seeing this:

| 823987454 | wikiuser2023    | 10.194.182.235:41778 | idwiki       | Query       |        0 | Sending data                                                  | SELECT /* GrowthExperiments\UserImpact\ComputedUserImpactLookup::getCreatedArticleCount  */  COUNT(* |    0.000 |

And more like that:

Sending data                                                  | SELECT /* GrowthExperiments\UserImpact\ComputedUserImpactLookup::getCreatedArticleCount  */  log_pag |    0.000 |

The only place where ComputedUserImpactLookup::getExpensiveUserImpact() is called with READ_LATEST (which would explain primary reads) is HomepageHooks::onLocalUserCreated(). I do not understand how that code can be responsible for increasing the number of reads, as there doesn't seem to be any change in the number of signups:

image.png (824×1 px, 188 KB)

(generated via P88662)

Even when filtering to just wikipedias, there is no difference:

image.png (836×1 px, 188 KB)

(generated via P88663)

We can try removing the READ_LATEST flag just in case, but I do not understand how this code can be responsible for this change.

Is it possible to see queries directed to primary somewhere?

Thanks! I think it is something else causing the issues, and we found this along the way; it shouldn't be happening regardless. This happens quite a lot when I'm debugging a production issue: I investigate why something is the case, and the current state is not helping :D


I'll try to capture some of the reads tomorrow.

The amount of SELECTs it is receiving is pretty crazy when I tail the log. Some examples captured:

		3508928133 Query	SELECT /* MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE rev_id = 44633665  LIMIT 1
		3508928133 Query	SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecordsFromDb  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = 44633665
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('enwiktionary:LinksUpdate:atomicity:pageid:1010942',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\CategoryLinksTable::fetchExistingRows  */  lt_title,cl_sortkey_prefix  FROM `categorylinks` JOIN `linktarget` ON ((cl_target_id=lt_id))   WHERE cl_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  el_to_domain_index,el_to_path  FROM `externallinks`    WHERE el_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `existencelinks` JOIN `linktarget` ON ((exl_target_id=lt_id))   WHERE exl_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  il_to  FROM `imagelinks`    WHERE il_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  iwl_prefix,iwl_title  FROM `iwlinks`    WHERE iwl_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  ll_lang,ll_title  FROM `langlinks`    WHERE ll_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `pagelinks` JOIN `linktarget` ON ((pl_target_id=lt_id))   WHERE pl_from = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  pp_propname,pp_value  FROM `page_props`    WHERE pp_page = 1010942
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `templatelinks` JOIN `linktarget` ON ((tl_target_id=lt_id))   WHERE tl_from = 1010942
		3508928125 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ RELEASE_LOCK('enwiktionary:LinksUpdate:job:pageid:1011537') AS released
		3508928071 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ RELEASE_LOCK('enwiktionary:LinksUpdate:job:pageid:1010759') AS released
		3508928133 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ RELEASE_LOCK('enwiktionary:LinksUpdate:atomicity:pageid:1010942') AS released
		3508928132 Query	SELECT /* MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE rev_id = 89417135  LIMIT 1
		3508928139 Query	SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecordsFromDb  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = 88944372
		3508928140 Query	SELECT /* MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE rev_id = 81954531  LIMIT 1
		3508928140 Query	SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecordsFromDb  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = 81954531
		3508928139 Query	SELECT /* MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE rev_id = 88916567  LIMIT 1
		3508928139 Query	SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecordsFromDb  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = 88916567
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('enwiktionary:LinksUpdate:atomicity:pageid:1010794',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\CategoryLinksTable::fetchExistingRows  */  lt_title,cl_sortkey_prefix  FROM `categorylinks` JOIN `linktarget` ON ((cl_target_id=lt_id))   WHERE cl_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('enwiktionary:LinksUpdate:atomicity:pageid:1011984',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  el_to_domain_index,el_to_path  FROM `externallinks`    WHERE el_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\CategoryLinksTable::fetchExistingRows  */  lt_title,cl_sortkey_prefix  FROM `categorylinks` JOIN `linktarget` ON ((cl_target_id=lt_id))   WHERE cl_from = 1011984
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `existencelinks` JOIN `linktarget` ON ((exl_target_id=lt_id))   WHERE exl_from = 1010794
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  il_to  FROM `imagelinks`    WHERE il_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  el_to_domain_index,el_to_path  FROM `externallinks`    WHERE el_from = 1011984
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  iwl_prefix,iwl_title  FROM `iwlinks`    WHERE iwl_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `existencelinks` JOIN `linktarget` ON ((exl_target_id=lt_id))   WHERE exl_from = 1011984
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  il_to  FROM `imagelinks`    WHERE il_from = 1011984
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  ll_lang,ll_title  FROM `langlinks`    WHERE ll_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  iwl_prefix,iwl_title  FROM `iwlinks`    WHERE iwl_from = 1011984
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `pagelinks` JOIN `linktarget` ON ((pl_target_id=lt_id))   WHERE pl_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  ll_lang,ll_title  FROM `langlinks`    WHERE ll_from = 1011984
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  pp_propname,pp_value  FROM `page_props`    WHERE pp_page = 1010794
		3508928140 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `templatelinks` JOIN `linktarget` ON ((tl_target_id=lt_id))   WHERE tl_from = 1010794
		3508928139 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\GenericPageLinksTable::fetchExistingRows  */  lt_namespace AS `ns`,lt_title AS `title`  FROM `pagelinks` JOIN `linktarget` ON ((pl_target_id=lt_id))   WHERE pl_from = 1011984

I just checked the s1 master and the traffic is pretty much the same: an insane amount of SELECTs arriving there too, with similar (if not the same) types of queries. Is this really expected to run on (all) masters?

Having SELECTs on the master is not unexpected per se. Usually it's a code path that is going to do a write immediately afterwards and wants the most up-to-date data to avoid race conditions. That is acceptable as long as: 1) the volume is low, and 2) the queries are properly indexed.
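A hypothetical sketch of that pattern (the table is real, but the values and the specific query are illustrative, not taken from the capture): the read has to see the freshest data, so it goes to the primary, inside the same transaction as the write:

```sql
-- Read-latest-then-write: SELECT on the primary, locking the row we are
-- about to change, then write within the same transaction.
BEGIN;
SELECT pp_value FROM page_props
  WHERE pp_page = 42 AND pp_propname = 'displaytitle'
  FOR UPDATE;
UPDATE page_props SET pp_value = 'New title'
  WHERE pp_page = 42 AND pp_propname = 'displaytitle';
COMMIT;
```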

It isn't just s2. It's all of them. Since the switchover.

image.png (911×1 px, 569 KB)

Actually, this just in, it looks like it was even worse in eqiad before?

image.png (946×1 px, 527 KB)

None of the reads in @Marostegui's capture come from GrowthExperiments. However, @CDanis's note that this lines up with wmf.6 makes me question https://gerrit.wikimedia.org/r/q/b698a96b even more, especially given we have a similar DB issue directly tied to user impact (T414080).

After looking a bit more into how the user impact computation works, it makes a lot of sense that the patch increases the volume of primary reads significantly. After every edit where GrowthExperimentsUserImpactUpdater::userIsInCohort returns true, we call ComputedUserImpactLookup::getExpensiveUserImpact( $userIdentity, READ_LATEST, ... ), which results in a bunch of primary reads. Funnily enough, we then schedule RefreshUserImpactJob to compute the exact same data _again_. This is apparently done so that the data computable using SQL queries can be updated immediately, with pageviews getting updated once the job runs. I think we should just simplify all this to schedule the job immediately, and avoid the primary read altogether...

In theory, removing READ_LATEST from GrowthExperimentsUserImpactUpdater::refreshUserImpactData might be sufficient. It runs within the same request, so I'd hope the ChronologyProtector would ensure we read from a non-lagged replica. Or maybe that doesn't work when operating within the same request (since the cookie hasn't been set yet). But I don't think having some of the data updated immediately is very important (and even if it is, we should probably find a different way than computing the same data twice on every edit), so I vote for just removing the precomputing code.

That said, I'm still curious why READ_LATEST is necessary there, just to understand this part a bit more.

Change #1237866 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] GrowthExperimentsUserImpactUpdater: Do not compute data on every edit

https://gerrit.wikimedia.org/r/1237866

> None of the reads in @Marostegui's capture come from GrowthExperiments. However, @CDanis's note that this lines up with wmf.6 makes me question https://gerrit.wikimedia.org/r/q/b698a96b even more, especially given we have a similar DB issue directly tied to user impact (T414080).

Keep in mind that it was just a very small extract of the capture. If you want some specific patterns, let me know and I can capture and analyze more traffic.

I think I have a working theory on what might be happening here, thanks @Marostegui. I'll try what I have and let you know if more traffic info is needed.

Change #1237866 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] GrowthExperimentsUserImpactUpdater: Do not compute data on every edit

https://gerrit.wikimedia.org/r/1237866

Change #1237931 had a related patch set uploaded (by Ladsgroup; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.14] GrowthExperimentsUserImpactUpdater: Do not compute data on every edit

https://gerrit.wikimedia.org/r/1237931

Change #1237931 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.14] GrowthExperimentsUserImpactUpdater: Do not compute data on every edit

https://gerrit.wikimedia.org/r/1237931

Mentioned in SAL (#wikimedia-operations) [2026-02-09T15:58:27Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1237931|GrowthExperimentsUserImpactUpdater: Do not compute data on every edit (T416171)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-09T16:00:30Z] <urbanecm@deploy2002> ladsgroup, urbanecm: Backport for [[gerrit:1237931|GrowthExperimentsUserImpactUpdater: Do not compute data on every edit (T416171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-09T16:08:30Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237931|GrowthExperimentsUserImpactUpdater: Do not compute data on every edit (T416171)]] (duration: 10m 02s)

@Marostegui @Ladsgroup I tried something. Would you mind confirming whether it is doing something on your end?

The number of queries hasn't gone down drastically, but the amount of read_rnd_next is now one tenth of the value it was before the deploy:
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2026-02-09T15:19:45.821Z&to=2026-02-09T16:17:08.380Z&timezone=utc&var-job=$__all&var-server=db2207&var-port=9104&refresh=1m&viewPanel=panel-3

Matching down to the minute

grafik.png (919×1 px, 121 KB)

Then my guess is that it was a rather slow query

I am still seeing an insane amount of selects arriving

# cat db2207.log  | grep -c SELECT
8609

This was in 10 seconds

More than half of them are related to LinksUpdate:

		3534493480 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ RELEASE_LOCK('enwiktionary:LinksUpdate:atomicity:pageid:49576') AS released
		3534493484 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('enwiktionary:LinksUpdate:job:pageid:49579',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3534493433 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  iwl_prefix,iwl_title  FROM `iwlinks`    WHERE iwl_from = 49478
		3534493433 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksTable::fetchExistingRows  */  ll_lang,ll_title  FROM `langlinks`    WHERE ll_from = 49478
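A breakdown like that ("more than half are LinksUpdate") can be produced from the capture by grouping on the /* Caller */ comment MediaWiki embeds in every query. A hypothetical follow-up, assuming the capture file from above:

```shell
# Count captured SELECTs per calling method, using the comment
# MediaWiki adds to each query; db2207.log is the capture from above.
grep -o 'SELECT /\* [^ ]*' db2207.log \
  | sed 's|SELECT /\* ||' \
  | sort | uniq -c | sort -rn | head
```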

Unassigning, as I don't think the remainder is Growth-facing. Happy to help with specifics if there's something I can help with!

> Unassigning, as I don't think the remainder is Growth-facing. Happy to help with specifics if there's something I can help with!

Thanks, it removed a lot of slow select queries from the master.

Change #1238060 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238060

I was worried that, when a user edits and adds or removes links, we wouldn't be able to reflect that, because the lookup happens right after the edit in the context of a deferred update (i.e. right after the main transaction has been committed on the master). BUT the lookup is for the existing values, not the new ones (those come from the ParserOutput object). So this shouldn't cause any regression from deferred updates running too close to the main transaction.

Change #1238060 merged by jenkins-bot:

[mediawiki/core@master] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238060

Change #1238321 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.46.0-wmf.15] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238321

Change #1238322 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.46.0-wmf.14] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238322

Change #1238322 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.14] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238322

Change #1238321 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.15] LinksTable: Use replica database for fetching existing links

https://gerrit.wikimedia.org/r/1238321

Mentioned in SAL (#wikimedia-operations) [2026-02-10T12:32:58Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1238322|LinksTable: Use replica database for fetching existing links (T416171)]], [[gerrit:1238321|LinksTable: Use replica database for fetching existing links (T416171)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-10T12:35:04Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1238322|LinksTable: Use replica database for fetching existing links (T416171)]], [[gerrit:1238321|LinksTable: Use replica database for fetching existing links (T416171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-10T12:40:46Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1238322|LinksTable: Use replica database for fetching existing links (T416171)]], [[gerrit:1238321|LinksTable: Use replica database for fetching existing links (T416171)]] (duration: 07m 48s)

Nice one! Just for the record, there are still lots of these, which I assume are expected:

		3539342846 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('zhwiki:LinksUpdate:atomicity:pageid:3187433',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3539342877 Query	SELECT /* MediaWiki\Page\WikiPage::pageData  */  page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`    WHERE page_namespace = 1 AND page_title = '霍恩菲诺'  LIMIT 1
		3539342877 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ IF(GET_LOCK('zhwiki:LinksUpdate:job:pageid:3187454',1),UNIX_TIMESTAMP(SYSDATE(6)),NULL) AS acquired
		3539342846 Query	SELECT /* MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock  */ RELEASE_LOCK('zhwiki:LinksUpdate:atomicity:pageid:3187433') AS released

That'd be T366938: Reduce relying on database locks. Timo had an idea that we should introduce a central lock-providing service in MediaWiki, which could be backed by PoolCounter in production (or something else, e.g. redis), and then switch all the locking code to that and get rid of this. It's a nice idea and we should implement it.

> I was worried that, when a user edits and adds or removes links, we wouldn't be able to reflect that, because the lookup happens right after the edit in the context of a deferred update (i.e. right after the main transaction has been committed on the master). BUT the lookup is for the existing values, not the new ones (those come from the ParserOutput object). So this shouldn't cause any regression from deferred updates running too close to the main transaction.


These are not reads of existing links for their own sake. These are private methods (afaik) used exclusively as part of the INSERT transaction in the LinksUpdate/RefreshLinks classes. It is essentially a select-for-update. When saving the new links from the ParserOutput object, we determine which links already exist, and only remove/add the difference, in order to efficiently replace the entries in the database without a lot of churn.

It is not obvious why select-for-update would be a problem, because:

  • we control concurrency of RefreshLinks jobs. They should not burst uncontrollably (e.g. via edits to popular templates), and should induce the same volume of primary connections and writes with or without this change. If RefreshLinksJob has too high a frequency, should we lower that?
  • updates to the same page are skipped (page_links_updated) or serialized (due to a per-page lock) thus spreading their queries out.

If select-for-update is a problem, should we remove this optimisation? I suspect we originally had a naive multi-row replacement there, with an atomic delete+insert for each edit. That would churn primary keys and memory buffers, because most links stay the same during a given RefreshLinks. But maybe that's tolerable nowadays? I imagine we'd still prefer a primary read over a delete/insert.

The scenario it protects against is when processing two concurrent edits. I expect we want (and today, do) allow most of those edits to happen concurrently, thanks to wikitext/diff3 merge and mysql transactions (with the links update portion presumably serialized, given the page_id lock). When the "existing" links are stale, the thinking is that the second edit may end up neglecting to add or delete links. If true, then this change would be a frustrating kind of subtle break, because it does not self-correct. I mean, maybe some weeks/months later if the page gets edited or refreshed again by something else, but that is not covered by eventual-consistency in my mind. That's the kind of thing that will cause things like T221795: Refactor Category::refreshCounts logic to a job and simplify with all the doubt, external workarounds, and busy-work that that encourages.

So, can the race happen?

  • We do have a number of other protections in place. If MediaWiki didn't support merge conflict resolution, then a stale read in LinksUpdate would very easily cause damage by adding recently deleted links back, or deleting links recently added. But, we'd also have much worse problems with the wikitext itself (i.e. the second edit would completely revert previous edits if they started later, even when the edits don't overlap in the changed area). So, given this isn't reality and we do have merge conflict resolution in wikitext, this doesn't matter.
  • For direct edits that simply add links, diff3 merges and edit conflict detection enforce linear serialization on the wikitext. The links updates also have their own per-page lock, acquired to avoid race conditions. Basically, if the edits don't logically conflict (different areas of the page), then LinksUpdate and the wikitext save merge in a similar way.
    • If we don't see the first edit's new links yet, we also can't accidentally erase them.
    • If we do see them in time, then we must also have the wikitext (which is saved before LinksUpdate), and thus we expect and preserve them.
    • If we run after the first wikitext save and LinksUpdate, but read stale link data, then the second LinksUpdate will also try to insert them. The IGNORE option will silently skip the insertion, because they already exist in the primary database and are rejected by unique constraints on the table (e.g. pl_from, pl_target_id).
  • For anything relating to propagating template edits via the JobQueue, this shouldn't matter, because RefreshLinksJobs are essentially idempotent (they don't correspond to a given wikitext blob). They just take whatever the current revision is and ensure the links are updated. They wait for replication, deduplicate via page_links_updated, and lock/no-op by page ID without concurrency, per T366938.
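The IGNORE behaviour mentioned above can be sketched like this (the values are illustrative; the dedup relies on the table's unique key over pl_from and pl_target_id):

```sql
-- A second LinksUpdate re-inserts a link the first one already wrote;
-- the unique key makes the duplicate a silent no-op instead of an error.
INSERT IGNORE INTO pagelinks (pl_from, pl_from_namespace, pl_target_id)
VALUES (1010942, 0, 555);   -- row already present: affected rows = 0
```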

So direct edits with independent removals and additions work themselves out, due to the "diff"-ing and selective updates.

For it to get out of sync, we'd need edits that perform contradictory actions. For example, edit 1 deletes link A in one paragraph, and edit 2 adds a link to A elsewhere. When edit 2 is processing the updates, if it acts on stale data, it will neglect to perform the insert for A, and that would not correct itself. Given that these are independent edits (i.e. not reverting vandalism), they can happen concurrently and would cause corruption even when replication lag is <0.2s, right?

Perhaps we should move LinksUpdate exclusively to the job queue?

> These are not reads of existing links for their own sake. These are private methods (afaik) used exclusively as part of the INSERT transaction in the LinksUpdate/RefreshLinks classes. It is essentially a select-for-update.
>
> It is not obvious why select-for-update would be a problem, because:

Because it's not a select-for-update. As I said in the commit message, the ratio of SELECTs to write queries (all of them) on a master was 20 to 1 when I made the patch: twenty selects for each write. The majority of those selects come from this code path. Even if we assume every single update on the database is updating the links tables, 90% of these "select-for-update" queries still won't trigger any update whatsoever, because most edits or reparses don't change the links tables.
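That ratio can be estimated from the server's command counters. A sketch, assuming the output of `SHOW GLOBAL STATUS LIKE 'Com_%'` was saved to a file named status.txt (the Com_* names are the standard MySQL/MariaDB counters):

```shell
# Rough SELECT-to-write ratio from Com_* counters in a saved
# SHOW GLOBAL STATUS dump (two whitespace-separated columns).
awk '$1 == "Com_select"                   { s = $2 }
     $1 ~ /^Com_(insert|update|delete)$/  { w += $2 }
     END { printf "select/write ratio: %.1f\n", s / w }' status.txt
```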

> For it to get out of sync, we'd need edits that perform contradictory actions. For example, edit 1 deletes link A in one paragraph, and edit 2 adds a link to A elsewhere. When edit 2 is processing the updates, if it acts on stale data, it will neglect to perform the insert for A, and that would not correct itself. Given that these are independent edits (i.e. not reverting vandalism), they can happen concurrently and would cause corruption even when replication lag is <0.2s, right?

Sure. But as I said in the commit message, this data is generally out of sync due to refreshlinks delays and many other reasons (as you said, the concurrency of refreshlinks is limited). For example, if you remove a template A that is used in template B, which is used in many articles, the row for the usage of template A can stay in the database, sometimes for weeks. These tables have always been out of sync to a pretty significant degree. We have been cleaning up a lot of templates on Commons and many of them are still not reflected in the links tables. And externallinks rows for ignored domains are still abundant on wikis. The edge case you mention here pales in comparison.

To put it another way: is it really causing user-facing complaints or issues? Have users complained about this? If they do, we fix it. Otherwise, I wouldn't recommend introducing an extremely major scalability bottleneck that is causing issues on the master databases.

> Perhaps we should move LinksUpdate exclusively to the job queue?

I'd say, if you think it's an issue, gather numbers on the percentage of cases where it makes the tables go out of sync. Then we can decide based on those numbers.

Ladsgroup claimed this task.

I'm closing this, as the graphs are clear. For further improvements, we should do T366938: Reduce relying on database locks.