Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

In case that should work, here are a few other examples that got stuck:

What happens?:
It displays to me as "Function is being called…" instead of the result.

What should have happened instead?:
Should show "John is the friend of Paul."

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Same thing happens in the following flow:

  • Edit with visual editor
  • Insert or change a function call with values
  • Look at the Preview, shows OK
  • Click on "Apply changes" and the change is never displayed in the text

I think that this seems to just take much longer than I thought, but eventually it does show up.

Also, the cache does not seem to work as I expected it: when there are several function calls in a page, changing one seems to also turn the others back into "Function being called..." state.

E.g. here, try changing a single function call and then storing the page on this one:

https://www.wikifunctions.org/wiki/User:Denny/Test

Here are a few more pages that can be checked for this:

There are three different issues:

  1. Seeing the message "Function being called..." makes it sound like you just have to wait, and it will be automatically and interactively replaced with the result. We should change the message.
  2. It is unclear what the next step now should be for the user in order to get the page into a complete state. Do they need to reload? Do they have to wait? The next step should be made clearer.
  3. The "Function being called..." message should be shown less frequently. We might need to be a bit smarter about caching in order to achieve that.

Here's a demonstration of how the generated fragment immediately disappears:

immediately-disappearing-cached.gif (1,918×701 px, 1 MB)

I'm suspecting a caching issue, but not sure at what point.

Here's a little description of what I'm observing in production, hoping to get some ideas of what can be the issue and if there's anything we can do from our side!!

A. Using Visual Editor in https://www.wikifunctions.org/wiki/User:Geno_(WMF)/Sandbox I insert a function (E.g. "German noun declension table", inputs "Katz" and "English").
Without exiting Visual Editor, I wait till the fragment is ready (wait till the Loading chip disappears and is replaced with the table result of my function).
At this point, we know that the WikifunctionsClientRequestJob has been executed, has requested the function call, received the response, and stored the response in the Object Cache (with the WikiLambdaClientFunctionCall key prefix).

I save my page and get redirected to the read page.

  • ✅ Fragment successful
  • Cache debug info:
<!-- 
NewPP limit report
Parsed by mw‐jobrunner.codfw.main‐5bf9d5d9ff‐f8ttw
Cached time: 20251106142733
Cache expiry: 2592000
Reduced expiry: true
Complications: [has‐async‐content, use‐parsoid]
CPU time usage: 0.039 seconds
Real time usage: 0.046 seconds
Preprocessor visited node count: 3/1000000
Revision size: 35/2097152 bytes
Post‐expand include size: 0/2097152 bytes
Template argument size: 0/2097152 bytes
Highest expansion depth: 1/100
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 0/5000000 bytes
Number of Wikibase entities loaded: 0/500
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%    0.000      1 -total
-->

<!-- Saved in parser cache with key wikifunctionswiki:parsoid-pcache:57698:|#|:idhash:useParsoid=1 and timestamp 20251106142733 and revision id 228091. Rendering was triggered because: parsoidCachePrewarm
 -->

B. Some seconds later I reload the page.

  • ❌ Fragment not ready -- "⏳ Function is being called…"
  • Cache debug info:
<!-- 
NewPP limit report
Parsed by mw‐wikifunctions.eqiad.group1‐7fff8cdc7b‐bcdds
Cached time: 20251106142743
Cache expiry: 60
Reduced expiry: true
Complications: [has‐async‐content, async‐not‐ready, use‐parsoid]
CPU time usage: 0.023 seconds
Real time usage: 0.028 seconds
Preprocessor visited node count: 3/1000000
Revision size: 35/2097152 bytes
Post‐expand include size: 0/2097152 bytes
Template argument size: 0/2097152 bytes
Highest expansion depth: 1/100
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 0/5000000 bytes
Number of Wikibase entities loaded: 0/500
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%    0.000      1 -total
-->

<!-- Saved in parser cache with key wikifunctionswiki:parsoid-pcache:57698:|#|:idhash:useParsoid=1 and timestamp 20251106142743 and revision id 228091. Rendering was triggered because: MediaWiki\Rest\Handler\Helper\HtmlOutputRendererHelper::getParserOutput
 -->

C. Some seconds later I reload the page.

  • ❌ Fragment not ready -- "⏳ Function is being called…"
  • Cache debug info:
<!-- 
NewPP limit report
Parsed by mw‐wikifunctions.eqiad.group1‐7fff8cdc7b‐4fgk2
Cached time: 20251106142855
Cache expiry: 60
Reduced expiry: true
Complications: [has‐async‐content, async‐not‐ready, use‐parsoid]
CPU time usage: 0.017 seconds
Real time usage: 0.022 seconds
Preprocessor visited node count: 3/1000000
Revision size: 35/2097152 bytes
Post‐expand include size: 0/2097152 bytes
Template argument size: 0/2097152 bytes
Highest expansion depth: 1/100
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 0/5000000 bytes
Number of Wikibase entities loaded: 0/500
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%    0.000      1 -total
-->

<!-- Saved in parser cache with key wikifunctionswiki:parsoid-pcache:57698:|#|:idhash:useParsoid=1 and timestamp 20251106142855 and revision id 228091. Rendering was triggered because: page_view
 -->

D. To explore whether the fragment is ready or not, I edit using Visual Editor, and when selecting the same input values, the fragment is loaded immediately, without falling back to the async-not-ready fragment.

  • Edit the page,
  • Select the function,
  • Choose a different language and re-choose the original one (else it won't enable the "save" button)
  • Click save
  • The fragment is rendered immediately, which indicates the fragment is ready and cached in the Object Cache (WikiLambdaClientFunctionCall prefix)

re-edit-existing-function-same-inputs.gif (1,896×964 px, 1 MB)

With or without clicking Save (there are no edits after all), I go back to the read page.

  • ✅ Fragment successful
  • Cache debug info:
<!-- 
NewPP limit report
Parsed by mw‐jobrunner.codfw.main‐6797674669‐jh774
Cached time: 20251106143423
Cache expiry: 2592000
Reduced expiry: true
Complications: [has‐async‐content, use‐parsoid]
CPU time usage: 0.035 seconds
Real time usage: 0.039 seconds
Preprocessor visited node count: 3/1000000
Revision size: 35/2097152 bytes
Post‐expand include size: 0/2097152 bytes
Template argument size: 0/2097152 bytes
Highest expansion depth: 1/100
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 0/5000000 bytes
Number of Wikibase entities loaded: 0/500
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%    0.000      1 -total
-->

<!-- Saved in parser cache with key wikifunctionswiki:parsoid-pcache:57698:|#|:idhash:useParsoid=1 and timestamp 20251106143423 and revision id 228091. Rendering was triggered because: parsoidCachePrewarm
 -->

E. After this, the fragment appears to stay intact and the cache debug info the same.
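To compare renders like the ones in steps A to D without reading every report by hand, here is a minimal sketch that pulls the host, expiry and complication flags out of a NewPP limit report. It is not part of WikiLambda or MediaWiki. The names `RenderInfo` and `parse_limit_report` are hypothetical, and the assumption that the second dot-separated part of the host name is the datacenter is inferred only from the host names shown above.

```python
import re
from dataclasses import dataclass
from typing import List

# Rendered reports use U+2010 (HYPHEN) where one would type ASCII "-".
UNICODE_HYPHEN = "\u2010"


@dataclass
class RenderInfo:
    parsed_by: str
    cache_expiry: int  # seconds
    complications: List[str]

    @property
    def datacenter(self) -> str:
        # Assumption: host names look like "mw-<deployment>.<dc>.<pod>".
        parts = self.parsed_by.split(".")
        return parts[1] if len(parts) > 1 else "unknown"

    @property
    def fragment_ready(self) -> bool:
        return "async-not-ready" not in self.complications


def parse_limit_report(report: str) -> RenderInfo:
    """Extract host, expiry and complications from a NewPP limit report."""
    text = report.replace(UNICODE_HYPHEN, "-")
    parsed_by = re.search(r"^Parsed by (\S+)", text, re.M).group(1)
    expiry = int(re.search(r"^Cache expiry: (\d+)", text, re.M).group(1))
    match = re.search(r"^Complications: \[(.*)\]", text, re.M)
    items = match.group(1).split(",") if match else []
    return RenderInfo(parsed_by, expiry, [i.strip() for i in items if i.strip()])


# Abbreviated copies of the reports from steps A and B (ASCII hyphens).
STEP_A = """NewPP limit report
Parsed by mw-jobrunner.codfw.main-5bf9d5d9ff-f8ttw
Cache expiry: 2592000
Complications: [has-async-content, use-parsoid]
"""
STEP_B = """NewPP limit report
Parsed by mw-wikifunctions.eqiad.group1-7fff8cdc7b-bcdds
Cache expiry: 60
Complications: [has-async-content, async-not-ready, use-parsoid]
"""

for label, report in (("A", STEP_A), ("B", STEP_B)):
    info = parse_limit_report(report)
    print(label, info.datacenter, info.cache_expiry, info.fragment_ready)
```

For reference, the long expiry of 2592000 seconds is 30 days, while the not-ready renders are cached for only 60 seconds.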

One thing that leaps out is that all the successful renders were triggered by the async parsoidCachePrewarm job, while all the unsuccessful renders appear to come from the main thread, although I'm not 100% certain about the REST request in HtmlOutputRendererHelper. I wonder if there's a misconfiguration and the memcache for the async jobs is separate from the memcache used for the main thread, forcing multiple computations of the same result (and potentially races between them).

At first glance things are working as expected, with the exception of an invalidation of the original "parsoidCachePrewarm" render, which is complete and has a long expiry time (2592000). I don't know why that entry is being purged, but everything after that appears correct and consistent with the wikifunctions result not being in the cache: multiple renders each with a short 60s expiry and the async-not-ready flag set.

Summarizing a bit from Slack and IRC

<!-- 
NewPP limit report
Parsed by mw‐jobrunner.codfw.main‐5bf9d5d9ff‐f8ttw
Cached time: 20251106142733
Cache expiry: 2592000

vs

<!-- 
NewPP limit report
Parsed by mw‐wikifunctions.eqiad.group1‐7fff8cdc7b‐4fgk2
Cached time: 20251106142855
Cache expiry: 60

Key differences being the cache expiry value (which is the symptom, I think) and the fact that the 2 things are being served from different datacenters.

My understanding is that the async fragments are being saved by the orchestrator in the memcached instance addressed by the local mcrouter to wikifunctions instance set up in T391986

@gengh Can you please verify this understanding is correct ?

If that is the case, there is no cross DC replication between the 2 DCs. None was ever requested and the nature of things that were mentioned to be stored in it (cached wikidata entities and ZObjects) did not warrant cross-DC replication, which would explain this.

Provided the above is correct, I see at least the following 2 paths:

  • Not store in that cache the async fragment
  • Get cross-DC replication for this memcached

But first, let's make sure this understanding is correct.

@akosiaris

The fragments are being saved in the memcached instance, though not by the orchestrator: they are saved by the WikifunctionsClientRequestJob in the client wiki, which requests the execution from wikifunctions.org via the REST API. All the rest of your assumptions are correct.

My understanding is that the async fragments are being saved by the orchestrator in the memcached instance addressed by the local mcrouter to wikifunctions instance set up in T391986

@gengh Can you please verify this understanding is correct ?

If that is the case, there is no cross DC replication between the 2 DCs. None was ever requested and the nature of things that were mentioned to be stored in it (cached wikidata entities and ZObjects) did not warrant cross-DC replication, which would explain this.

As Geno says, this is not right; the bits of wl-memcached written to and read from the MW land are distinct, and this use of the cache is to store final HTML fragments ready for Parsoid to splice into the document on-request.

  • We are storing HTML fragments in the cache, which is DC-local.
  • We are a content supplier for the ParserCache via Parsoid, which is DC-global.
  • Jobs are being run in the non-active DC and causing write activity on the ParserCache.
  • The mis-match between the scopes of the two caches is causing re-work and unreliable behaviour.

I think the fundamental problem is that one shouldn't connect a DC-local cache to a DC-global cache; either our cache needs to be DC-global (or fake it somehow), or ParserCache needs to be DC-local or at least jobs writing to it are stopped (which seems a bad move, presumably this is done for load).

Update: SRE started looking into this. Unfortunately some unexpected issues prevented us from diving deeper. We want to look into the problem in some more detail next week. For now, we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments, this was unexpected.

we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments, this was unexpected.

Storing the fragments resulting from function calls was the original design objective for having access to the production memcached cluster. It's shown in this late-2020 architectural sketch with this role (implicitly, but hence the 'write' from the orchestrator which we ended up not doing):

https://commons.wikimedia.org/wiki/File:Wikifunctions_-_Top-level_architectural_model.svg

We've been using it this way since we went live in production in 2023, under T307742: Memoize Wikifunction functions calls in memcached.

The new use of the memcached cluster directly from the orchestrator to cache Objects and Wikidata entities that was only added this year is the secondary usage.

we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments, this was unexpected.

Storing the fragments resulting from function calls was the original design objective for having access to the production memcached cluster. It's shown in this late-2020 architectural sketch with this role (implicitly, but hence the 'write' from the orchestrator which we ended up not doing):

https://commons.wikimedia.org/wiki/File:Wikifunctions_-_Top-level_architectural_model.svg

I'll point out that the "implicitly" part here is of pretty big importance. It's definitely not clearly annotated; there is no mention of fragments, nor anything overall saying what will be read/written. I am still kinda surprised.

We've been using it this way since we went live in production in 2023, under T307742: Memoize Wikifunction functions calls in memcached.

Nothing about fragments either here. Yes, again fragments can be implied since function output can be cached and all, but it's definitely not explicit and clear.

I'll also add T297815: New Service Request memcached-wikifunctions to the list of linked tasks; it isn't clearly spelled out there either.

The new use of the memcached cluster directly from the orchestrator to cache Objects and Wikidata entities that was only added this year is the secondary usage.

Thanks, this is now clear to me.

Jdforrester-WMF renamed this task from Embedded function calls show "Function being called..." instead of result to Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem. Dec 4 2025, 5:52 PM
Jdforrester-WMF raised the priority of this task from High to Unbreak Now!.
Jdforrester-WMF added a project: Wikifunctions.

Upgrading to reflect reality.

Summarizing some discussions we had with @ssastry and @mszabo. We are going to have a deeper look tomorrow, but here's a summary of our current understanding (we focused on Geno's GIF posted above):

  • The edit and the submit are write actions so are both being sent to the currently active datacenter, namely codfw
  • The 3 refreshes following the submit are (probably) being directed again to the currently active datacenter (codfw) because of the short lived UseDC Cookie that has the value master.
  • The first failing one is making it to the passive DC (eqiad). That request triggers a full parse in eqiad.
  • That full parse triggers the WikiLambda extension, which reaches out to the dc-local memoization memcached instance to grab the data, but there is none, so a job gets submitted.
  • The job that gets submitted in eqiad that would populate that dc-local memoization memcached is not executed in eqiad but in codfw (because Jobs get executed in the active DC, regardless of where the jobs get submitted).
  • The job executing in codfw, probably finishes successfully and updates the dc-local memcached, which is however NOT the one that is missing the data.
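The flow in the list above can be sketched as a toy model. This is only an illustration of the described behaviour under stated assumptions, not WikiLambda code: the `Cluster` class, its methods and the call key are hypothetical, and each DC-local memoization cache is modelled as a plain dictionary.

```python
from collections import deque

PENDING = "Function is being called..."


class Cluster:
    """Toy model: DC-local memoization caches, jobs run only in the active DC."""

    def __init__(self, active_dc, dcs=("codfw", "eqiad")):
        self.active_dc = active_dc
        self.memo = {dc: {} for dc in dcs}  # one memoization cache per DC
        self.jobs = deque()                 # shared job queue

    def render(self, dc, call):
        """Parse a page in `dc`: use the local cache or enqueue a job."""
        cached = self.memo[dc].get(call)
        if cached is not None:
            return cached
        self.jobs.append(call)  # enqueued from whichever DC did the parse
        return PENDING

    def run_jobs(self):
        """Jobs execute in the active DC regardless of where they were enqueued."""
        while self.jobs:
            call = self.jobs.popleft()
            # The result lands only in the active DC's local cache.
            self.memo[self.active_dc][call] = f"result of {call}"


cluster = Cluster(active_dc="codfw")
call = "declension(Katz, English)"  # hypothetical cache key

print(cluster.render("eqiad", call))  # miss in eqiad, job enqueued
cluster.run_jobs()                    # job runs in codfw, fills codfw's cache
print(cluster.render("eqiad", call))  # eqiad's cache is still empty
print(cluster.render("codfw", call))  # codfw has the result
```

In this model a parse in the passive DC never becomes ready on its own, however many jobs it enqueues, because every job writes to the other DC's cache.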

However, there is the following question. ParserCache is replicated across both DCs and there is a ParserCache entry for the page in the active DC (codfw) as we can see from the 3 refreshes (or reproduce for that matter). Why we end up without a proper ParserCache entry for the page in the passive DC is what we'll take a look at tomorrow.

For what it's worth, we could replicate the memoization memcached across DCs, and it would probably solve the problem, but a) we could end up with split-brain scenarios where the 2 caches have different results, and b) there appears to be a race condition (or something with similar symptoms) that could invalidate that approach in at least some cases.

@Jdforrester-WMF @gengh @DSantamaria @DVrandecic I have a product related question. What are the expectations regarding consistency of what viewers see across the world. i.e. how ok is it that someone in Europe/Africa sees older/newer/different content than someone say in the US?

@Jdforrester-WMF @gengh @DSantamaria @DVrandecic I have a product related question. What are the expectations regarding consistency of what viewers see across the world. i.e. how ok is it that someone in Europe/Africa sees older/newer/different content than someone say in the US?

In terms of the geographic variability content shown from the ParserCache, I don't think we have a product view from the Abstract team; we're "just" a content supplier into MediaWiki, and it would be for MW group's product side to set expectations.

I believe that given that currently MediaWiki has quite a lot of complexity around this question such as PoolCounter etc. the answer is "it depends", and of course I defer to the MW group's actual experts, but I think some short-term inconsistency is fine.

With @ssastry, @cscott and @Krinkle we managed to secure some time on Thursday to look more into this. Aside from @gengh's reproduction above that involves an edit, we managed to reproduce this using one of the pages that Denny has listed above by just issuing simple GET requests (no jobs involved, no edits, etc). In the process of trying to figure out why the active DC parse wasn't sufficient to avoid the issue appearing in the secondary DC, we believe we've uncovered a design flaw in the ParserCache. Either I or @cscott will split this into a separate task once we are back from the offsite, but the TL;DR is that a condition exists where we can end up with stale data in the ParserCache, forcing a reparse in the secondary DC. None of this is probably related to wikifunctions specifically, but it exists nevertheless. If this functioned as I thought it would, it would have been hiding the architectural issue of wikifunctions not being designed for Multi-DC.

Nevertheless, and in agreement with what was posted in the description of T411807: WF memcached service is dc-local but used for dc-global content, wikifunctions was not designed to be Multi-DC. This isn't just about the fact that the memoization cache is dc-local and not dc-global; it also stems from the fact that while it uses MediaWiki Jobs to request content fragments from function-evaluator, it ignores the fact that the Jobs will not run anywhere other than the primary DC.

This is an architectural design flaw in wikifunctions. It was designed to be single DC and the design was never updated when Multi-DC became a thing.

Ways that I see forward.

For now, I am just listing them, off the top of my head, without a feasibility or cost/benefit analysis. It's a Saturday, so I'll update with an analysis next week, but I wanted to have an update anyway.

  1. Force Wikifunctions to be single DC
  2. Make the memoization cache replicate to the secondary DC
  3. Run WikifunctionsClientRequestJob in the local-DC.
  4. Retrofit ParserCache to be dc-global

Some of the ways listed above are brittle, some are probably not easily implementable, some are not very nice architecturally.

To be clear, there is no silver bullet, no easy and nice way out of this.

Feasibility analysis, pros/cons of the proposals above.

Force Wikifunctions to be single DC

This is a stopgap solution, but it should solve the problem. Given that wikifunctions wasn't designed to be compatible with Multi-DC, switching it to single DC will make it behave as it was designed to. The trade-off is that a portion of users will not be able to reap the lower latencies that the closer DC would offer. Looking a bit at the Multi-DC configuration at https://github.com/wikimedia/operations-puppet/tree/production/modules/profile/files/trafficserver (files multi-dc.lua, multi-dc_test.lua and multi-dc.lua.conf), it shouldn't require more than a section like:

domains = {
    ["www.wikifunctions.org"] = {
         mode = "primary"
    }
}

It's noteworthy that they are already not seeing benefits from the PoPs of the CDN, since most responses (for non-static content) that I see emitted from wikifunctions.org have a Cache-Control value of private anyway, which implies that latencies are not yet paramount for users.

One caveat: It will ONLY solve the problem for www.wikifunctions.org. Any project that includes functions, will remain vulnerable to the problems outlined in this task. But it should unblock the team from proceeding with their plans involving www.wikifunctions.org. Any work/goals/hypothesis ofc that involved other projects, will not benefit from this.

Make the memoization cache replicate to the secondary DC

This will probably work in the general case, but it's admittedly brittle (ParserCache being replicated is brittle too, but what we get when it fails is stale content, not broken content). In the cases where, for some reason (e.g. packet loss, mcrouter misbehaving due to high load, high traffic, race conditions), the result hasn't been replicated (yet, or at all), the pages will appear broken for some amount of time. This will not fix the underlying problem; it will just make it appear less often, most probably in edge cases and/or under load. This will in turn mean situations that are more confusing and more difficult to reproduce and debug for users.
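To make the failure window concrete, here is a minimal sketch of asynchronous replication between two caches. It is a toy model under stated assumptions, not how mcrouter replication is implemented: `ReplicatedMemo`, `deliver` and the `drop` parameter are hypothetical names, and a lost write is modelled simply as a key that is skipped on delivery.

```python
PENDING = "Function is being called..."


class ReplicatedMemo:
    """Toy model of a primary cache replicated asynchronously to a secondary."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.in_flight = []  # writes not yet applied to the secondary

    def write(self, key, value):
        """A job in the primary DC stores a result; replication is deferred."""
        self.primary[key] = value
        self.in_flight.append((key, value))

    def deliver(self, drop=frozenset()):
        """Apply in-flight writes to the secondary, losing any key in `drop`."""
        for key, value in self.in_flight:
            if key not in drop:
                self.secondary[key] = value
        self.in_flight.clear()

    def read_secondary(self, key):
        """A parse in the secondary DC sees the placeholder on a miss."""
        return self.secondary.get(key, PENDING)


memo = ReplicatedMemo()

memo.write("call-1", "table for Katz")
print(memo.read_secondary("call-1"))  # read races ahead of replication
memo.deliver()
print(memo.read_secondary("call-1"))  # replicated, now consistent

memo.write("call-2", "table for Hund")
memo.deliver(drop={"call-2"})         # simulate a lost replication write
print(memo.read_secondary("call-2"))  # stays pending until rewritten
```

Both the race and the lost write leave the secondary DC serving the pending placeholder even though the primary already holds the result, which is the "appears less often, but harder to debug" outcome described above.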

I should note that during the offsite we had some first discussions about whether we should stop the ParserCache replication (cc @Ladsgroup, @Marostegui) and rely instead on Jobs to keep the secondary DC warm.

For this mode to work sufficiently well, WikiLambda and Parsoid should be switched to not generate broken content (that "Function is being called" message) but rather return stale content on failure of a function. Which is easier said than done.

Run WikifunctionsClientRequestJob in the local-DC.

A quick look through https://gerrit.wikimedia.org/g/mediawiki/extensions/WikiLambda/+/3658a14386121044bb27f2c2e70a7a48796439cc/includes/Jobs/WikifunctionsClientRequestJob.php says that this job doesn't make any attempts to write to MariaDB, so it might work.

Furthermore, as pointed out above, we might find ourselves wanting to run parsoidCachePrewarm in both DCs to keep the secondary DC warm without relying on ParserCache replication.

That being said, we haven't tested this, nor do we have anything more than plans. It will require testing and changes to jobqueue configuration. It's also unclear how many blockers might show up.

Retrofit ParserCache to be entirely dc-global

I'll link the task here when filed, but this idea entails making not just the second tier (MariaDB) of the ParserCache replicated, but the 1st tier (memcached) as well. This should decrease the chances that the problem shows up, but it suffers from the same architectural problems as the replication of the memoization cache (it's after all the same technique, just employed at a different level). It's brittle, complicated and will in some cases result in broken pages. This will in turn mean situations that are more confusing and more difficult to reproduce and debug for users.

Retrofit WikiLambda to move away from the memoization cache and make it Multi-DC.

I won't add much here, because I don't have the entirety of the picture, but it sounds like a big endeavor to me. It would entail not relying on the memoization cache, but rather switching to a job-based system that constructs content asynchronously and never on a parse request (relying instead on serving stale content). It would require substantial changes to both how parsing works.

Recommendation

My recommendation for now would be to go down the "Force Wikifunctions to be single DC" route and, if/when user latencies are deemed unsatisfactory, embark on either "Run WikifunctionsClientRequestJob in the local-DC" or "Retrofit WikiLambda to move away from the memoization cache and make it Multi-DC".

Also, given most of the content is in this task, should we merge T411807: WF memcached service is dc-local but used for dc-global content back in this one? It's a mostly empty task.

Change #1218338 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] multi-dc: Switch www.wikifunctions to Single-DC

https://gerrit.wikimedia.org/r/1218338

In the process of trying to figure out why the active DC parse wasn't sufficient to avoid the issue appearing in the secondary DC, we believe we've uncovered a design flaw in the ParserCache. Either I or @cscott will split this into a separate task once we are back from the offsite, but the TL;DR is that a condition exists where we can end up with stale data in the ParserCache, forcing a reparse in the secondary DC.

This is now T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated.

Change #1218846 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserCache: improve metric documentation

https://gerrit.wikimedia.org/r/1218846

Thanks @akosiaris for the update and the suggested options.

We discussed our options with the team and decided:

  • Given that the "Force Wikifunctions to be single DC" mitigation strategy would only affect fragments embedded in wikifunctions.org, we don't have a reason to rush this right now (three days before the holiday season), but we might go ahead with it after the holidays so that we can enable working on some example use cases in Q3, while we work on a definite solution.
  • Enter Q3, we will prioritize these efforts, study and plan the best long-term solution.

So, for now, there's no immediate action required, and we will pick this up early January 2026.

akosiaris lowered the priority of this task from Unbreak Now! to High. Dec 22 2025, 3:50 PM

Lowering from UBN to High given the latest update.

Change #1218846 merged by jenkins-bot:

[mediawiki/core@master] ParserCache: improve metric documentation

https://gerrit.wikimedia.org/r/1218846

Thanks for the write-up, Alexandros — it sounds like there was a fair bit of discussion in Lisbon; sorry we missed it. We've given our quick thoughts inline. It feels like there's a lot of confusion here. Perhaps we should discuss this in the Performance working group every second Monday?

  1. Force Wikifunctions to be single DC

[…]

One caveat: It will ONLY solve the problem for www.wikifunctions.org. Any project that includes functions, will remain vulnerable to the problems outlined in this task. But it should unblock the team from proceeding with their plans involving www.wikifunctions.org. Any work/goals/hypothesis ofc that involved other projects, will not benefit from this.

You're right, we agree that this fix would only help the issue for the client mode operating on Wikifunctions.org, whereas this split-brain problem arises on all of the >150 prod wikis that currently have access to embedded Wikifunctions calls (see the dblist), which is why we halted the roll-outs. We worry that making this change will just be another piece of technical debt we'll later have to undo or work around.

  2. Make the memoization cache replicate to the secondary DC

This will probably work in the general case, but it's admittedly brittle (ParserCache being replicated is brittle too, but what we get when it fails is stale content, not broken content). In the cases where, for some reason (e.g. packet loss, mcrouter misbehaving due to high load, high traffic, race conditions), the result hasn't been replicated (yet, or at all), the pages will appear broken for some amount of time. This will not fix the underlying problem; it will just make it appear less often, most probably in edge cases and/or under load. This will in turn mean situations that are more confusing and more difficult to reproduce and debug for users.

Yes, we also think this might work mostly, but not be a perfect solution. Having a single-homed cache service in eqiad or codfw for this would "fix" the issue, but instead make the cache very slow and so defeat much of the point of it, we presume? Also we don't believe that that's an acceptable solution given the wider, still on-going work to make MW actually active-active.

[…]

For this mode to work sufficiently well, WikiLambda and Parsoid should be switched to not generate broken content (that "Function is being called" message) but rather return stale content on failure of a function. Which is easier said than done.

I'm really confused by this comment. We already return stale content when available, and pending content isn't "broken". What is the source of stale content we should be using instead to return objects when there's no hit in the cache? This feels like a conceptual mis-match as to what we each think is "broken" content.

  3. Run WikifunctionsClientRequestJob in the local-DC.

Can you explain what this means? As far as we are aware, the jobs are already being fired into the JobQueue in the datacentre where Parsoid is calling WikiLambda (i.e., both, locally); did we miss some configuration here to flag that this should happen?
That said, we think the split brain problem we're running into is exactly *because* the Parsoid requests and thus the jobs are being run in each datacentre, and so reading from and writing to local-only caches, leading to differing responses to Parsoid (and thus to the ParserCache).
[…]

Furthermore, as pointed out above, we might find ourselves wanting to run parsoidCachePrewarm in both DCs to keep the secondary DC warm without relying on ParserCache replication.

Would this pre-warm job run quickly enough after each edit that the secondary calls coming into the secondary datacentre (which are the proximate cause of the split-brain issue) would wait and be supplied with the same value as the primary DC? If not, won't it be another source of confusion?

[…]

  4. Retrofit ParserCache to be entirely dc-global

T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated looks challenging. Is it likely that this issue will be addressed soon? It's also unclear to us how that would affect the flow of secondary-DC requests into Parsoid and then to WikiLambda.

[…]

  5. Retrofit WikiLambda to move away from the memoization cache and make it Multi-DC.

I won't add much here, because I don't have the entirety of the picture, but it sounds like a big endeavor to me. It would entail not relying on the memoization cache, but rather switching to a job-based system that constructs content asynchronously and never on a parse request (relying instead on serving stale content). It would require substantial changes to both how parsing works.

This is a bit confusing to us. Where would our cached content live? Or if it's not cached at all, how would Parsoid build fragments (available or pending) within a few ms if it has to wait for the job queue to return? The asynchronous content fragments work explicitly depended on there being a fragment content cache available; does all that work need revisiting?

We're not sure what the next steps could be here. For those available for the Monday meeting, it'd be good to discuss there.

As far as I understand it, there's some push back on "fixing" T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated at all, with the idea that the ultimate solution is to split the primary as well as the secondary and have a fully-split parser cache. The only work needed to do that, as I understand it, is to generalize the ParsoidCachePrewarmJob to work with the legacy parser as well, so as to maintain a sufficient "warmness" in the non-primary data center to ensure seamless DC migration. It seems like paying the CPU cost of "duplicate parses" is worth the operational tradeoff, as I understand the SRE thinking.

That said, I think the most straightforward short-term solution for wikifunctions is to replicate the WF cache between data centers ("Make the memoization cache replicate to the secondary DC"). This is "just" a configuration change, and should ensure that parses which occur in both data centers get the computed WF results. There are some potential edge cases around replication failure, but I believe they ought to be sufficiently rare, and any "broken" results which manage to be cached will eventually be turned over when the ParserCache expiry time is reached -- and users already know how to manually purge pages when needed as well.

The options involving splitting the WF cache require jobs to run in both DCs (e.g. "Run WikifunctionsClientRequestJob in the local-DC") and my understanding is that the current infrastructure architecture is based around jobs running only in one DC. Making WF jobs the sole exception doesn't seem sustainable, but if this configuration is more broadly supported in the future it ought to be possible to migrate to a split WF memoization cache at that time with little additional effort.

Thanks for the write-up, Alexandros — it sounds like there was a fair bit of discussion in Lisbon; sorry we missed it. We've given our quick thoughts inline. It feels like there's a lot of confusion here. Perhaps we should discuss this in the Performance working group every second Monday?

  1. Force Wikifunctions to be single DC

[…]

One caveat: it will ONLY solve the problem for www.wikifunctions.org. Any project that includes functions will remain vulnerable to the problems outlined in this task. But it should unblock the team from proceeding with their plans involving www.wikifunctions.org. Any work, goals, or hypotheses involving other projects will, of course, not benefit from this.

You're right, we agree that this fix would only help the issue for the client mode operating on Wikifunctions.org, whereas this split-brain problem arises on all of the >150 prod wikis that currently have access to embedded Wikifunctions calls (see the dblist), which is why we halted the roll-outs. We worry that making this change will just be another piece of technical debt we'll later have to undo or work around.

It is a stopgap for sure. I have a very narrow definition of technical debt, so I wouldn't call it that, but unless there are large architectural changes, it will have to be undone.

  1. Make the memoization cache replicate to the secondary DC

This will probably work in the general case, but it's admittedly brittle (ParserCache being replicated is brittle too, but what we get in the case it fails is stale content, not broken content). In cases where, for some reason (e.g. packet loss, mcrouter misbehaving due to high load, high traffic, race conditions), the result hasn't been replicated (yet, or at all), the pages will appear broken for some amount of time. This will not fix the underlying problem; it will just make it appear less often, most probably in edge cases and/or under load. This will in turn mean more confusing situations for users that are difficult to reproduce and debug.

Yes, we also think this might work mostly, but not be a perfect solution. Having a single-homed cache service in eqiad or codfw for this would "fix" the issue, but instead make the cache very slow and so defeat much of the point of it, we presume? Also we don't believe that that's an acceptable solution given the wider, still on-going work to make MW actually active-active.

Mostly the latter, not the former so much. It will also mean that in case of a large failure in the DC where the single homed cache is, all the projects will be suffering from widespread parse failures.

[…]

For this mode to work sufficiently well, WikiLambda and Parsoid should be switched to not generate broken content (that "Function is being called" message) but rather return stale content on failure of a function. Which is easier said than done.

I'm really confused by this comment. We already return stale content when available, and pending content isn't "broken". What is the source of stale content we should be using instead to return objects when there's no hit in the cache?

I should have been a bit clearer. By stale content here I mean "An already successfully parsed version of the page from the past". Which assumes that it exists and that it is accessible. Which is why I said "easier said than done".

This feels like a conceptual mis-match as to what we each think is "broken" content.

Quite possibly. My arbitrary definition of broken (from the user perspective) here is "a page with >=1 of Function being called... parts"
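To make the two definitions concrete, here is a minimal sketch of the "serve an already successfully parsed version" idea, using the definition of broken given above (a page with at least one pending fragment). All names here (`PENDING`, `is_broken`, `choose_output`) are hypothetical and are not WikiLambda or Parsoid APIs; it also assumes a previous good parse exists and is accessible, which is exactly the hard part.

```python
# Hypothetical sketch: none of these names exist in WikiLambda or Parsoid.
PENDING = "Function being called..."


def is_broken(fragments):
    """A page counts as broken if at least one fragment is still pending."""
    return any(fragment == PENDING for fragment in fragments)


def choose_output(new_parse, last_good_parse):
    """Prefer the new parse; fall back to an older complete parse if one exists."""
    if not is_broken(new_parse):
        return new_parse
    if last_good_parse is not None:
        return last_good_parse  # stale, but complete
    return new_parse  # nothing better available: the reader sees pending parts


fresh_but_broken = ["John is the friend of Paul.", PENDING]
older_complete = ["John is the friend of Paul.", "Paul is the friend of John."]

served_with_history = choose_output(fresh_but_broken, older_complete)
served_without_history = choose_output(fresh_but_broken, None)
```

With a previous complete parse the reader gets stale content; without one, the pending placeholder is still what gets served.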

  1. Run WikifunctionsClientRequestJob in the local-DC.

Can you explain what this means? As far as we are aware, the jobs are already being fired into the JobQueue in the datacentre where Parsoid is calling WikiLambda (i.e., both, locally); did we miss some configuration here to flag that this should happen?

They are being fired in the DC where Parsoid is calling WikiLambda, but the job is being executed in the primary DC in all cases. The jobrunners that the job is being executed against will always be the jobrunners of the primary DC. There is no way currently to influence this from a code point of view.

That said, we think the split brain problem we're running into is exactly *because* the Parsoid requests and thus the jobs are being run in each datacentre, and so reading from and writing to local-only caches, leading to differing responses to Parsoid (and thus to the ParserCache).

The Parsoid requests run in different datacenters from where the jobs are being executed, which is why the dc-local memcached instance never has the content that parsoid needs.
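The mismatch described here can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and function names are hypothetical, the caches are plain dictionaries standing in for the dc-local memcached instances, and the job runs synchronously for simplicity (in production it is asynchronous).

```python
# Toy model of the split-brain behaviour; all names are hypothetical.
PENDING = "Function being called..."


class DataCentre:
    """A data centre with its own, unreplicated memoization cache."""

    def __init__(self, name):
        self.name = name
        self.memo_cache = {}


def execute_job(call, executing_dc):
    """Evaluate the call and store the result only in the executing DC's cache."""
    executing_dc.memo_cache[call] = f"result of {call}"


def parse(call, local_dc, primary_dc):
    """Return the locally cached fragment, or enqueue a job and return a placeholder."""
    cached = local_dc.memo_cache.get(call)
    if cached is not None:
        return cached
    # The job is enqueued in the local DC but executed by primary-DC jobrunners,
    # so the result never reaches the secondary DC's local cache.
    execute_job(call, primary_dc)
    return PENDING


primary = DataCentre("primary")
secondary = DataCentre("secondary")
call = "friend_of(John, Paul)"

primary_renders = [parse(call, primary, primary) for _ in range(2)]
secondary_renders = [parse(call, secondary, primary) for _ in range(3)]
```

In this model the primary DC shows the result on its second parse, while the secondary DC keeps returning the placeholder no matter how often the page is reparsed.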

[…]

Furthermore, as pointed out above, we might find ourselves wanting to run parsoidCachePreWarm in both DCs to keep the secondary DC warm without relying on ParserCache replication.

Would this pre-warm job run quickly enough after each edit that the secondary calls coming into the secondary datacentre (which are the proximate cause of the split-brain issue) would wait and be supplied with the same value as the primary-DC? If not, won't it be another source of confusion?

[…]

  1. Retrofit ParserCache to be entirely dc-global

T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated looks challenging. Is it likely that this issue will be addressed soon? It's also unclear to us how that would affect the flow of secondary-DC requests into Parsoid and then to WikiLambda.

It is challenging indeed, and I think we haven't yet fully reached consensus on that task about which path forward is best. However, I think there is consensus that even if we make ParserCache dc-global, it will still be brittle in some ways, just less brittle than right now.

I should add here for context that the dc-global nature of ParserCache came to be in order to allow Switchover to happen without the secondary DC being cold. It's not a solution to a correctness problem (which is what we arguably have here).

[…]

  1. Retrofit WikiLambda to move away from the memoization cache and make it Multi-DC.

I won't add much here, because I don't have the entirety of the picture, but it sounds like a big endeavor to me. It would entail not relying on the memoization cache, but rather switching to a job-based system that constructs content asynchronously and never on a parse request (relying instead on serving stale content). It would require substantial changes to how parsing works.

This is a bit confusing to us. Where would our cached content live? Or if it's not cached at all, how would Parsoid build fragments (available or pending) within a few ms if it has to wait for the job queue to return? The asynchronous content fragments work explicitly depended on there being a fragment content cache available; does all that work need revisiting?

I did not add much exactly because I don't have the full picture to answer this. But in my head, pages would only be parsed on a schedule (or on some other internally triggered event), never on a request from an end-user. That means that Parsoid can wait substantially more time for things to return, as it would no longer be in the hot path (hence the Job part). It also means we could substantially increase wait times and allow for backlog to accumulate and be served later.

And it also means that if the content isn't available when a reader requests it, they are given a "page not yet complete" message. It's also not a very editor-friendly experience.

We're not sure what the next steps could be here. For those available for the Monday meeting, it'd be good to discuss there.

As a general point, I have stopped thinking of the fragments in the wikifunctions memcached as cached content. I consider them now *Job results*.

As far as I understand it, there's some push back on "fixing" T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated at all, with the idea that the ultimate solution is to split the primary as well as the secondary and have a fully-split parser cache. The only work needed to do that, as I understand it, is to generalize the ParsoidCachePrewarmJob to work with the legacy parser as well, so as to maintain a sufficient "warmness" in the non-primary data center to ensure seamless DC migration.

There is more work arguably. We need to alter the changeprop-jobqueue software (whether it's configuration only or not) to also submit jobs to the secondary DC jobrunners. PC aside, there are open questions and no consensus yet whether this is achievable. It certainly hasn't been tried yet.

It seems like paying the CPU cost of "duplicate parses" is worth the operational tradeoff, as I understand the SRE thinking.

I still haven't gathered SRE consensus on this one. For now, it's just an approach we can consider taking.

That said, I think the most straightforward short-term solution for wikifunctions is to replicate the WF cache between data centers ("Make the memoization cache replicate to the secondary DC"). This is "just" a configuration change, and should ensure that parses which occur in both data centers get the computed WF results. There are some potential edge cases around replication failure, but I believe they ought to be sufficiently rare, and any "broken" results which manage to be cached will eventually be turned over when the ParserCache expiry time is reached -- and users already know how to manually purge pages when needed as well.

Yes, but it is pretty brittle. Updates can be lost for whatever reason, latencies and race conditions will happen, hot pages will be reparsed more. It's going to be a better state than the current situation but definitely not solving the problem. Architecturally speaking, it feels like we are just using some duct tape to be honest.
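The brittleness can be sketched in a few lines. This is an illustration only, not how mcrouter replication is implemented: the function names are hypothetical, the caches are plain dictionaries, and the lost update is chosen by hand rather than modelled on real failure rates.

```python
# Hypothetical sketch of a replicated memoization cache with one lost update.
PENDING = "Function being called..."


def store_result(call, value, primary_cache, secondary_cache, lost_in_transit):
    """Write to the primary cache and replicate, unless the update is lost."""
    primary_cache[call] = value
    if call not in lost_in_transit:
        secondary_cache[call] = value


def render(page_calls, cache):
    """Render every function call on a page from one DC's cache."""
    return [cache.get(call, PENDING) for call in page_calls]


primary_cache, secondary_cache = {}, {}
page = ["call_a", "call_b", "call_c"]
lost_in_transit = {"call_b"}  # assumption: a single replication message is dropped

for call in page:
    store_result(call, f"value of {call}", primary_cache, secondary_cache, lost_in_transit)

primary_view = render(page, primary_cache)
secondary_view = render(page, secondary_cache)
```

Here the primary DC renders the page completely, while the secondary DC shows one pending fragment until the value is recomputed or the page is purged.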

The options involving splitting the WF cache require jobs to run in both DCs (e.g. "Run WikifunctionsClientRequestJob in the local-DC") and my understanding is that the current infrastructure architecture is based around jobs running only in one DC. Making WF jobs the sole exception doesn't seem sustainable, but if this configuration is more broadly supported in the future it ought to be possible to migrate to a split WF memoization cache at that time with little additional effort.

Mostly agreed. This would only make sense from a work point of view, if we split the ParserCache to become per DC, which will also require the ParsoidCachePrewarmJob to be run in both DCs.

If we are not going to move to a fully-split ParserCache, then fixing T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated is worthwhile, since we're currently paying the costs (duplicate parses) of a split parser cache without getting all of the benefits. Fixing T412742 is a little non-trivial. Of course, adding job runners in the secondary data center is non-trivial as well.

Change #1218338 abandoned by Alexandros Kosiaris:

[operations/puppet@production] multi-dc: Switch www.wikifunctions to Single-DC

Reason:

I'll abandon this; the team decided not to go with this. It can always be restored if needed.

https://gerrit.wikimedia.org/r/1218338