Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?)
Closed, Resolved (Public)

Event Timeline

> $wgObjectCaches['mcrouter'];
= [
    "class" => "MemcachedPeclBagOStuff",
    "serializer" => "php",
    "persistent" => false,
    "servers" => [
      "127.0.0.1:11213",
    ],
    "server_failure_limit" => 1000000000.0,
    "retry_timeout" => -1,
    "loggroup" => "memcached",
    "timeout" => 250000.0,
    "allow_tcp_nagle_delay" => false,
  ]

> $wgWikiLambdaObjectCaches;
= [
    "eqiad" => [
      "host" => "127.0.0.1",
      "port" => "11213",
      "prefix" => "/eqiad/wf-wan/",
    ],
    "codfw" => [
      "host" => "127.0.0.1",
      "port" => "11213",
      "prefix" => "/codfw/wf-wan/",
    ],
  ]

So, the story is that

This is another case where I was going to suggest unifying everything for simplicity -- we need to use our own memcache pool, but that doesn't mean we need to use different mcrouters -- but it turns out we should do it sooner as a bug fix.

The fix here is for $wgWikiLambdaObjectCaches to use the same mcrouter host and port as the rest of MediaWiki -- the routing prefix is enough to ensure that mcrouter will talk to the right pool of memcache hosts. In mw-wikifunctions, we could use the in-pod mcrouter for everything, but (if I'm not missing anything) this seems like a fine time to point them at the shared mcrouters instead -- we don't need to use the in-pod one for validation anymore, although we can always bring it back in the future if we want to test any changes.

A couple of things to check first, to make sure that's okay.

  • For routes, we just need to make sure all the necessary routes are present in both mcrouters (I think that's done, but I'll double-check now).
  • The only other difference between the two mcrouters is that they have different default routing prefixes, applied by mcrouter when a routing prefix isn't given: in the shared mcrouter we default to /eqiad/mw/ or /codfw/mw/ but in mw-wikifunctions we default to /local/wf/ (the old non-replicated prefix).
    • For everything except mw-wikifunctions, we know that's fine: everything already uses the existing default, and the wikilambda object cache adds the wf-wan routing prefixes, so those keys will work as soon as they hit a working mcrouter.
    • For mw-wikifunctions, do we need to keep that /local/wf/ default, or can we revert to /$DC/mw/? That is, are there any remaining applications of $wgWikiLambdaObjectCaches where we want to route to /local/wf/ but have left it out on the PHP side, so we're getting it by default? (Note that $MCROUTER_SERVER even in mw-wikifunctions already points to the shared mcrouter, 10.64.72.12:4442 in eqiad, so MW's non-wikifunctions-related memcache applications are already getting /eqiad/mw/ as they expect -- nothing will change there.)
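The default-prefix behaviour described above can be illustrated with a small Python sketch. It is a simplified model, not mcrouter's actual implementation: it assumes a routing prefix always has the form "/<region>/<cluster>/" at the start of the key, and the key name "enwiki:foo" and the function name resolve_route are made up for illustration.

```python
import re

# Simplified model: a routing prefix is "/<region>/<cluster>/" at the start of a key.
_PREFIX = re.compile(r"^(/[^/]+/[^/]+/)(.*)$", re.DOTALL)


def resolve_route(key: str, default_prefix: str) -> tuple[str, str]:
    """Return (routing_prefix, rest_of_key) for a key sent to an mcrouter.

    If the key carries no explicit routing prefix, the router's configured
    default prefix is used instead (hypothetical helper, for illustration).
    """
    match = _PREFIX.match(key)
    if match:
        return match.group(1), match.group(2)
    return default_prefix, key


# A wf-wan key is routed the same way by either mcrouter, because it names its prefix.
print(resolve_route("/eqiad/wf-wan/WikiLambdaFunctionCall:x", "/eqiad/mw/"))
# A key without a prefix depends on which mcrouter (and which default) it hits.
print(resolve_route("enwiki:foo", "/eqiad/mw/"))
print(resolve_route("enwiki:foo", "/local/wf/"))
```

Under this model, only keys sent without a routing prefix are sensitive to the difference between the two mcrouters, which is why the open question is limited to the /local/wf/ default.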

We need to:

  • Check the above prerequisites.
  • Decide if we want to use the in-pod or shared mcrouter in mw-wikifunctions. If in-pod, we need to merge something like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1267915 to set $MCROUTER_SERVER properly, but also change route_prefix to $DC/mw so the non-WF memcache keys get the same default they're expecting. If shared, we don't need to change anything. (Shared it is.)
  • Update $wgWikiLambdaObjectCaches to use $_SERVER['MCROUTER_SERVER'] for its host and port.
  • If mw-wikifunctions is using the shared mcrouter, remove the now-unused in-pod mcrouter by setting cache.mcrouter.enabled to false.

Caveat:

Happening in mw-wikifunctions, mw-web, and mw-jobrunner.

If I understand the situation, that can't be right: at least from the logs, I see this happening in mw-jobrunner, mw-api-int, and mw-api-ext (which all make sense) and I'd believe it's happening in mw-web too. But it couldn't be happening in mw-wikifunctions, unless there's something else going on. I see mw-api-int and mw-jobrunner in the three logstash links you posted (looking at kubernetes.namespace_name). If you see it in mw-wikifunctions too, can you post more information?

Sorry, yes, you're right, I saw them from wikifunctionswiki and assumed they were mw-wikifunctions instead of mw-api-int.

Note that our Memcached shim takes separate host and port config, rather than the blended form that MCROUTER_SERVER uses elsewhere.

On the prerequisites:

  • Double-checked, and mw-mcrouter has all the routes except /local/wf. That's fine, because...
  • Per @Jdforrester-WMF, we don't need to keep the /local/wf default. Nothing in mw-* namespaces, including mw-wikifunctions, uses it. (The orchestrator, running in the wikifunctions namespace, does, but that's out of scope here.)

On in-pod vs. shared mcrouters, let's go ahead and use the shared ones. It's worth it to simplify the configuration by removing one special snowflake. (As mentioned, we can always reinstate an in-pod mcrouter if we need it for further prototyping down the line.)

That means we don't need to do any prep -- our next to-do item is the update to $wgWikiLambdaObjectCaches, and we can do that right away.

Note that our Memcached shim takes separate host and port config, rather than the blended form that MCROUTER_SERVER uses elsewhere.

Yeah, we'll just need to parse that host:port string and split it up. Quick and easy is to just split on the colon, since it's an IPv4 address. To be really future-proof we can also handle IPv6 hostports like "[2001:db8::1]:4442" and extract "2001:db8::1" as the host. If MCROUTER_SERVER is unset, fall back on 127.0.0.1:11213 just like we do for $wgObjectCaches['mcrouter']['servers'] above.
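As a rough illustration of the parsing described above, here is a minimal Python sketch (the real change would live in the PHP config, and this is not the code that was deployed). The function names split_hostport and mcrouter_host_and_port are hypothetical; the fallback value is the one quoted from $wgObjectCaches['mcrouter']['servers'].

```python
import os

# Same fallback as $wgObjectCaches['mcrouter']['servers'] in the dump above.
DEFAULT_HOSTPORT = "127.0.0.1:11213"


def split_hostport(hostport: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port)."""
    if hostport.startswith("["):
        # Bracketed IPv6 literal, e.g. "[2001:db8::1]:4442".
        host, sep, rest = hostport[1:].partition("]")
        if not sep or not rest.startswith(":"):
            raise ValueError(f"malformed IPv6 hostport: {hostport!r}")
        port = rest[1:]
    else:
        # Hostname or IPv4 address: the last colon separates the port.
        host, sep, port = hostport.rpartition(":")
        if not sep:
            raise ValueError(f"missing port in hostport: {hostport!r}")
    if not host or not port.isdecimal():
        raise ValueError(f"malformed hostport: {hostport!r}")
    return host, int(port)


def mcrouter_host_and_port(env=os.environ) -> tuple[str, int]:
    # An unset (or empty) MCROUTER_SERVER falls back to the local default.
    return split_hostport(env.get("MCROUTER_SERVER") or DEFAULT_HOSTPORT)
```

Note that this sketch returns the port as an integer, whereas the $wgWikiLambdaObjectCaches dump above stores it as a string; a real patch would keep whichever type the shim expects.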

@Jdforrester-WMF Can I leave that config patch to you? When deploying it we'll want to test all four caching cases carefully,

                       $wgWikiLambdaObjectCaches   everything else
in mw-wikifunctions    1                           2
in other namespaces    3                           4

but only 1 and 3 should be affected. 1 should continue working with no change, and 3 should be fixed.
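The reasoning behind "only 1 and 3" can be sketched as a toy model in Python. It rests on assumptions drawn from this thread rather than on inspection of the cluster: that an mcrouter listens on 127.0.0.1:11213 only inside mw-wikifunctions pods, that MCROUTER_SERVER is reachable everywhere, and that mw-web stands in for "other namespaces". All names in the code are hypothetical.

```python
LOCAL = "127.0.0.1:11213"   # in-pod mcrouter address (assumed present only in mw-wikifunctions)
SHARED = "MCROUTER_SERVER"  # shared mcrouter, assumed reachable from every namespace


def target(cache: str, fixed: bool) -> str:
    """Which mcrouter a cache talks to, before and after the config change."""
    if cache == "wikilambda" and not fixed:
        return LOCAL
    return SHARED


def works(namespace: str, cache: str, fixed: bool) -> bool:
    """True if something is listening where the cache tries to connect."""
    listeners = {SHARED}
    if namespace == "mw-wikifunctions":
        listeners.add(LOCAL)
    return target(cache, fixed) in listeners


# The four cases from the table above; "mw-web" represents the other namespaces.
CASES = {
    1: ("mw-wikifunctions", "wikilambda"),
    2: ("mw-wikifunctions", "other"),
    3: ("mw-web", "wikilambda"),
    4: ("mw-web", "other"),
}


def affected_cases() -> list[int]:
    """Cases whose connection target changes with the fix."""
    return sorted(n for n, (_, c) in CASES.items() if target(c, False) != target(c, True))


for n, (ns, cache) in CASES.items():
    print(n, ns, cache, "before:", works(ns, cache, False), "after:", works(ns, cache, True))
```

In this model, case 1 works both before and after, case 3 goes from failing to working, and cases 2 and 4 never change target.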

Change #1271895 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache

https://gerrit.wikimedia.org/r/1271895

Change #1271895 merged by jenkins-bot:

[operations/mediawiki-config@master] mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache

https://gerrit.wikimedia.org/r/1271895

Mentioned in SAL (#wikimedia-operations) [2026-04-16T14:20:09Z] <jforrester@deploy1003> Started scap sync-world: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-16T14:21:58Z] <jforrester@deploy1003> jforrester: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-16T14:29:45Z] <jforrester@deploy1003> Finished scap sync-world: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]] (duration: 09m 36s)

  1. Confirmed: Tested via https://www.wikifunctions.org/wiki/Special:RunFunction?call=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z19661%22%2C%22Z19661K1%22%3A%22Fash%21%22%7D before and after on debug, remains cached and accessed from both mw-debug-eqiad and mw-debug-codfw.

Successful debug log: MediaWiki\Extension\WikiLambda\Cache\MemcachedWrapper::get: cache hit for prefixed /codfw/wf-wan/WikiLambdaFunctionCall::zobject|Z1K1|Z7%23264153,Z7K1|Z19661%23260722,Z19661K1|Fash!,,doValidate|1, from codfw

  3. Confirmed: https://abstract.wikipedia.org/view/de/Q42 was initially not available and re-ran the function, then displayed immediately after a reload.

Secondary confirmed — fragments showing up after a parse: https://test.wikipedia.org/wiki/Wikifunctions

Successful debug log: MediaWiki\Extension\WikiLambda\Cache\MemcachedWrapper::get: cache hit for prefixed /codfw/wf-wan/WikiLambdaAbstractFragment:#56c5d9139f3f1c59c8eaf20589f1dc1acfc217da5c75cfda21fc7da4f94d1684 from codfw

  2) and 4) Tested through regular MW actions on mw-debug; didn't notice any issues.

Jdforrester-WMF changed the task status from Open to In Progress. Edited Apr 16 2026, 2:32 PM

Provisionally this now looks fixed.

Logged cache-write failures suddenly stopped:

[Screenshot 2026-04-16 at 10.46.02.png]

OK, I think there are some follow-up tasks (that don't block this but we should at least write down if not do):

  • Drop /local/wf/ route from production mcrouter for mw-*
  • Drop in-pod mcrouter from mw-wikifunctions pod (?)
  • Add prod config env vars for MCROUTER_HOST and MCROUTER_PORT, derived from MCROUTER_SERVER, and migrate prod MW config to use them instead
  • Others?

Change #1275463 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-wikifunctions: Remove in-pod mcrouter

https://gerrit.wikimedia.org/r/1275463

Change #1275464 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route

https://gerrit.wikimedia.org/r/1275464

Change #1275465 had a related patch set uploaded (by RLazarus; author: RLazarus):

[mediawiki/extensions/WikiLambda@master] MemcachedWrapper: Accept server config key, deprecate host and port

https://gerrit.wikimedia.org/r/1275465
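The patch itself is not reproduced in this thread. Purely as an illustration of what the change title describes (accept a combined server key, keep host and port working but deprecated), a config lookup might look like the following Python sketch. The function name cache_server and the exact key handling are assumptions, not the WikiLambda implementation.

```python
import warnings


def cache_server(config: dict) -> str:
    """Return the 'host:port' string for one cache entry (hypothetical helper).

    Prefers the combined 'server' key; falls back to the deprecated
    'host' and 'port' keys with a DeprecationWarning.
    """
    if "server" in config:
        return config["server"]
    if "host" in config and "port" in config:
        warnings.warn(
            "'host' and 'port' are deprecated; set 'server' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return f"{config['host']}:{config['port']}"
    raise KeyError("cache config needs 'server' (or deprecated 'host' and 'port')")


# New-style entry, using the shared mcrouter address mentioned earlier in the thread.
print(cache_server({"server": "10.64.72.12:4442", "prefix": "/eqiad/wf-wan/"}))
```

A two-step rollout like this lets the config switch to the new key before the old keys are dropped, which matches the order of the related changes listed here.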

Change #1275467 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/mediawiki-config@master] mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches

https://gerrit.wikimedia.org/r/1275467

Change #1275468 had a related patch set uploaded (by RLazarus; author: RLazarus):

[mediawiki/extensions/WikiLambda@master] MemcachedWrapper: Drop support for deprecated host and port config

https://gerrit.wikimedia.org/r/1275468

Change #1275463 merged by jenkins-bot:

[operations/deployment-charts@master] mw-wikifunctions: Remove in-pod mcrouter

https://gerrit.wikimedia.org/r/1275463

Change #1275467 merged by jenkins-bot:

[operations/mediawiki-config@master] mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches

https://gerrit.wikimedia.org/r/1275467

Mentioned in SAL (#wikimedia-operations) [2026-05-07T13:03:46Z] <jforrester@deploy1003> Started scap sync-world: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]]

Mentioned in SAL (#wikimedia-operations) [2026-05-07T13:05:43Z] <jforrester@deploy1003> rzl, jforrester, hartman: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-05-07T13:10:41Z] <jforrester@deploy1003> Finished scap sync-world: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]] (duration: 06m 55s)