WF memcached service is dc-local but used for dc-global content
Closed, Resolved (Public)

Description

[Split this specific issue out from the parent for clarity.]

mcrouter-wikifunctions is dc-local; values are written to local/wf, without fail-over. This is how the system was built in 2023; the Abstract team (and SRE) did not notice that the planned use was for implicitly dc-global content, feeding into Wikifunctions.org and, later, the parser cache.
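To make the failure mode concrete, here is a minimal sketch of what a dc-local cache does to dc-global content. It is a toy model, not the production setup: the `DcLocalCache` class, the `dc_a`/`dc_b` names, and the key format are all hypothetical.

```python
class DcLocalCache:
    """Toy stand-in for one data centre's cache pool (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def set(self, key, value):
        # Writes stay in this data centre only; nothing is replicated.
        self.store[key] = value

    def get(self, key):
        # Returns None on a miss.
        return self.store.get(key)


# Two independent pools, one per data centre (illustrative names).
dc_a = DcLocalCache("dc_a")
dc_b = DcLocalCache("dc_b")

# A result is rendered and cached by a request served in one data centre...
dc_a.set("wf:call:abc123", "rendered result")

# ...but a later request for the same content may be served by the other.
hit_local = dc_a.get("wf:call:abc123")
hit_remote = dc_b.get("wf:call:abc123")
```

In this model `hit_remote` is a miss even though the content exists, which is the shape of the split-brain problem: each data centre only ever sees what was written locally.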

Options (please edit!):

  • Add a new cache service that's dc-global, shift use over to it, then decommission the old one
  • Somehow retrofit the existing service to work dc-global
  • Use the databases as rendered content caches instead of memcached
  • ???

Details

Related Changes in Gerrit:
Repo                            Branch       Lines +/-
operations/deployment-charts    master       +30 -0
operations/deployment-charts    master       +6 -4
operations/mediawiki-config     master       +15 -8
operations/mediawiki-config     master       +8 -15
operations/deployment-charts    master       +11 -5
mediawiki/extensions/WikiLambda master       +4 -1
operations/mediawiki-config     master       +3 -3
operations/mediawiki-config     master       +7 -3
operations/deployment-charts    master       +43 -0
operations/deployment-charts    master       +12 -0
operations/deployment-charts    master       +62 -2
operations/deployment-charts    master       +24 -0
operations/puppet               production   +2 -10
operations/puppet               production   +55 -1
operations/puppet               production   +27 -22

Event Timeline

Jdforrester-WMF triaged this task as Unbreak Now! priority.

Alex (@akosiaris) briefly looked into this, and it seems to be more than a bug at this time. He plans to set aside some time at the offsite to consult some folks from Security and MediaWiki. He should be able to provide more details once those conversations are done.

FYI, I've continued posting updates in T405461.

akosiaris lowered the priority of this task from Unbreak Now! to High. (Dec 16 2025, 10:22 AM)

Lowering to High while the analysis and recommendation are being discussed in T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

I also think we should merge this one in that parent task. Any objections?

This was meant to be the cross-team discussion task, but never mind.

Re-opening as the doing-task for one approach that might (but might not) solve the parent.

Change #1229229 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] mcrouter: Allow configuring secondary replicated caches

https://gerrit.wikimedia.org/r/1229229

Change #1229232 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] [DNM] memcached: Point to the replicated Wikifunctions cache

https://gerrit.wikimedia.org/r/1229232

Change #1229230 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] [WIP] mcrouter: Configure the Wikifunctions pool as replicated

https://gerrit.wikimedia.org/r/1229230

Change #1229231 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] [DNM] memcached: Drop the local-only Wikifunctions cache route

https://gerrit.wikimedia.org/r/1229231

@Jdforrester-WMF to my understanding, you would like any mw-wf memcached keys we add in one DC, to be replicated to the other DC and vice versa?

Yes, with best-effort fixes, to resolve the current split-brain problem, which has been a blocker to KRs since it was identified in September. In discussions with Alex that concluded in December and January, this seemed to be the only acceptable way forward for SRE and Product, given that the cache is written to in both data centres (due to MW's active-active status) and read from in both (and by many wikis: currently 150, and hopefully soon more than 300).
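As a rough illustration of the behaviour being asked for (write in either data centre, replicate to the other on a best-effort basis, read locally), here is a minimal sketch. It is a toy model with plain dicts standing in for cache pools; `ReplicatedCache` and `FlakyStore` are hypothetical names and do not reflect how mcrouter implements replication.

```python
class FlakyStore(dict):
    """Hypothetical remote pool whose writes always fail."""

    def __setitem__(self, key, value):
        raise ConnectionError("remote data centre unreachable")


class ReplicatedCache:
    """Toy cache: writes go to the local pool and, best-effort, to remotes."""

    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.failed_replications = 0

    def set(self, key, value):
        # The local write always happens first.
        self.local[key] = value
        for remote in self.remotes:
            try:
                remote[key] = value
            except ConnectionError:
                # Best effort: count the failure but do not fail the write.
                self.failed_replications += 1

    def get(self, key):
        # Reads are served from the local pool only.
        return self.local.get(key)


# One pool per data centre; each side replicates to the other.
store_a, store_b = {}, {}
cache_a = ReplicatedCache(store_a, [store_b])
cache_b = ReplicatedCache(store_b, [store_a])

cache_a.set("wf:call:abc123", "rendered result")
seen_from_b = cache_b.get("wf:call:abc123")

# A failed replication leaves the local write intact.
degraded = ReplicatedCache({}, [FlakyStore()])
degraded.set("wf:call:def456", "other result")
```

The trade-off in this model is that a failed replication silently leaves the remote pool without the value, so reads there still miss; "best effort" narrows the split-brain window rather than eliminating it.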

MLechvien-WMF added subscribers: RLazarus, MLechvien-WMF.

@RLazarus is currently getting context/ramping up on this and will be able to support this work moving forward.

Change #1245162 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] [WIP] wikifunctions and friends: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1245162

Change #1229229 abandoned by Jforrester:

[operations/puppet@production] mcrouter: Allow configuring secondary replicated caches

Reason:

Doing this in deployment-charts instead now.

https://gerrit.wikimedia.org/r/1229229

Change #1229230 abandoned by Jforrester:

[operations/puppet@production] [WIP] mcrouter: Configure the Wikifunctions pool as replicated

Reason:

Doing this in deployment-charts instead now.

https://gerrit.wikimedia.org/r/1229230

Change #1229231 abandoned by Jforrester:

[operations/puppet@production] [DNM] memcached: Drop the local-only Wikifunctions cache route

Reason:

Doing this in deployment-charts instead now.

https://gerrit.wikimedia.org/r/1229231

Change #1247677 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] _mediawiki-common_: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1247677

Change #1247678 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] wikifunctions and friends: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1247678

Change #1245162 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1245162

Mentioned in SAL (#wikimedia-operations) [2026-03-03T21:55:17Z] <rzl@deploy2002> rzl: https://gerrit.wikimedia.org/r/1245162 T411807 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-03T21:59:07Z] <rzl@deploy2002> Finished scap sync-world: https://gerrit.wikimedia.org/r/1245162 T411807 (duration: 12m 15s)

Change #1247687 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] mc: Shift the Wikifunctions MC route from /local/wf/ to /<dc>/wf-wan/

https://gerrit.wikimedia.org/r/1247687
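For readers unfamiliar with the route names in that change title, the following sketch shows one way to read a key that carries a routing prefix of the form /<region>/<cluster>/, and what moving from /local/wf/ to /<dc>/wf-wan/ would look like at the key level. Both helper functions are hypothetical illustrations; the actual change is to MediaWiki configuration, not to code like this.

```python
def split_routing_prefix(key):
    """Split '/<region>/<cluster>/<rest>' into (region, cluster, rest).

    Returns (None, None, key) when the key carries no routing prefix.
    """
    if not key.startswith("/"):
        return None, None, key
    parts = key.split("/", 3)  # ['', region, cluster, rest]
    if len(parts) < 4:
        return None, None, key
    _, region, cluster, rest = parts
    return region, cluster, rest


def to_wan_route(key, dc):
    """Rewrite a '/local/wf/' key to '/<dc>/wf-wan/'; leave others untouched."""
    region, cluster, rest = split_routing_prefix(key)
    if (region, cluster) == ("local", "wf"):
        return f"/{dc}/wf-wan/{rest}"
    return key


old_key = "/local/wf/some-cache-key"
new_key = to_wan_route(old_key, "dc-a")  # "dc-a" is an illustrative DC name
```

The point of the sketch is only that the old prefix names a single local pool, while the new one names a data centre explicitly, which is what lets the router address a specific or every data centre.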

Change #1247694 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter

https://gerrit.wikimedia.org/r/1247694

Change #1247694 merged by jenkins-bot:

[operations/deployment-charts@master] mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter

https://gerrit.wikimedia.org/r/1247694

Change #1247677 merged by jenkins-bot:

[operations/deployment-charts@master] _mediawiki-common_: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1247677

Change #1247678 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions and friends: Add /*/wf-wan memcache routes

https://gerrit.wikimedia.org/r/1247678

Mentioned in SAL (#wikimedia-operations) [2026-03-16T22:27:32Z] <jforrester@deploy2002> Started scap sync-world: T411807

Mentioned in SAL (#wikimedia-operations) [2026-03-16T22:28:10Z] <jforrester@deploy2002> jforrester: T411807 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-16T22:37:00Z] <jforrester@deploy2002> Finished scap sync-world: T411807 (duration: 11m 10s)

Change #1229232 abandoned by Jforrester:

[operations/mediawiki-config@master] [DNM] memcached: Point to the replicated Wikifunctions cache

Reason:

For whatever reason I89738e16c3c92edd5d37ddf3e6042e518bdc2730 got pushed.

https://gerrit.wikimedia.org/r/1229232

Change #1255833 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] WikiLambdaServices::buildZObjectStash: Pass in cache params to WANObjectCache

https://gerrit.wikimedia.org/r/1255833

Change #1259222 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] cache.mcrouter: Add replica.remote_read option

https://gerrit.wikimedia.org/r/1259222

Change #1264638 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access"

https://gerrit.wikimedia.org/r/1264638

Change #1264638 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "Wikifunctions: Switch cache from mcrouter-wikifunctions to special access"

https://gerrit.wikimedia.org/r/1264638

Change #1255833 abandoned by Jforrester:

[mediawiki/extensions/WikiLambda@master] WikiLambdaServices::buildZObjectStash: Pass in cache params to WANObjectCache

https://gerrit.wikimedia.org/r/1255833

Change #1266290 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] Wikifunctions: Switch cache from mcrouter-wikifunctions to special access

https://gerrit.wikimedia.org/r/1266290

Change #1266290 merged by jenkins-bot:

[operations/mediawiki-config@master] Wikifunctions: Switch cache from mcrouter-wikifunctions to special access

https://gerrit.wikimedia.org/r/1266290

Mentioned in SAL (#wikimedia-operations) [2026-04-01T15:01:00Z] <jforrester@deploy1003> Started scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-01T15:02:59Z] <jforrester@deploy1003> jforrester: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-01T15:13:54Z] <jforrester@deploy1003> Finished scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] (duration: 12m 53s)

OK, I'm declaring this particular bit Resolved. Wikifunctions's use of memcached now (a) works consistently across DCs (content is triggered in one and read from the other, both ways around), and (b) uses our own dedicated caches again, not polluting MW's general cache.

There's much more work to do to re-assess our content secondary storage / availability strategy and use of memcached, DBs, and other technologies, but let's call this done.

Change #1267915 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-wikifunctions: Set $MCROUTER_SERVER in values-${ENV}.yaml

https://gerrit.wikimedia.org/r/1267915

Change #1269038 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache

https://gerrit.wikimedia.org/r/1269038

Change #1269038 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache

https://gerrit.wikimedia.org/r/1269038