
Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”
Open, High · Public · BUG REPORT

Description

Follow-up to T363384: [24Q4] Make it easier for users to debug their code by having each function sub-call return a complete metadata map, and to its subsequent disabling in production.

  • Confirm we know where the problem was
  • Fix said issues, with testing
  • Add better handling of failure state rather than mysterious "Gateway timeout"
  • Re-deploy and check all is well

Steps to replicate the issue (include links if applicable):

What happens?:

  • Both linked tests fail
  • However, Try this implementation works

What should have happened instead?:
The linked tests should always pass for such a critical function. Dependent implementations like https://www.wikifunctions.org/view/en/Z11223 also have failing tests but evaluate with Try this implementation (“4 hours ago”).

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Not all built-in functions have connected tests.

Tests pass on:

Tests fail on:

Community functions generally fail and Try this function also fails (with no metadata) on:

Some exceptions which pass:

I haven’t seen any functions where some implementations pass while others fail, apart from built-in functions like https://www.wikifunctions.org/wiki/Z801.
For https://www.wikifunctions.org/view/en/Z12203, the composition with Z802 fails and the composition wrapping Z13254 succeeds.

It appears to have been a temporary error, but I encountered

IMG_0966.png (2×960 px, 317 KB)
on https://www.wikifunctions.org/view/en/Z13521. Here, three of four implementations were successfully tested by Z13527 4 or 5 hours ago (18:45 UTC) (the fourth always fails). Try this function succeeds for 2 + 11 (Z13573) but fails with 213 + 1179 (no metadata). Using Try this implementation, this sum succeeds with Z13573, Z13529 and Z14759 (which depends on a function that has only Python implementations). Although it is not generally delivering old results, the selected implementation varies from call to call, failing altogether about 10% of the time. The depicted error occurred again at 19:18 UTC, adding 7387 and 6656. It seems to be persisting with similar values even after succeeding with one of its values increased by an order of magnitude. Trying this sum as a new test, the three implementations all pass but the sum still fails in Try this function.

Event Timeline

Jdforrester-WMF renamed this task from Function evaluations are often failing on Wikifunctions.org to Function evaluations are often failing on Wikifunctions.org with "gateway timeout". Jul 2 2024, 1:13 PM
GrounderUK renamed this task from Function evaluations are often failing on Wikifunctions.org with "gateway timeout" to Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable”. Jul 2 2024, 1:25 PM

Change #1051364 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517"

https://gerrit.wikimedia.org/r/1051364

Change #1051364 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517"

https://gerrit.wikimedia.org/r/1051364

Random evaluations:
22:00 UTC
https://www.wikifunctions.org/view/en/Z15386
https://www.wikifunctions.org/view/en/Z17105
https://www.wikifunctions.org/view/en/Z14226 Service Unavailable [passing at 22:50]
https://www.wikifunctions.org/view/en/Z10869
https://www.wikifunctions.org/view/en/Z11208 Service Unavailable [persistent]
https://www.wikifunctions.org/view/en/Z11684
22:30 UTC
https://www.wikifunctions.org/view/en/Z805 (equality function changed in one test)
https://www.wikifunctions.org/view/en/Z14304
23:00 UTC
https://www.wikifunctions.org/view/en/Z13633 Service Unavailable [passing at 00:30 UTC]
https://www.wikifunctions.org/wiki/Z808
⚠️https://www.wikifunctions.org/view/en/Z15251 Three Python implementations but only one passes (Z15377), otherwise Service Unavailable [Z15621 passing after 00:00 UTC; Z15252 has some passes too and is disconnected, probably because it fails for large values – errors are meaningful]
https://www.wikifunctions.org/view/en/Z17111 (with one odd exception: Z17117 with Z17114)

IMG_0967.png (2×960 px, 334 KB)

Change #1051813 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m

https://gerrit.wikimedia.org/r/1051813

@cmassaro thanks for the ping. Unfortunately I have trouble finding the right data to analyze this further. The orchestrator calls the evaluator(s) directly (not via the service mesh) and does not produce metrics for those requests, such as latency or error rates (at least I'm unable to find them). Looking at the logs of the components, I can't see failed requests either.

I think the "Gateway Timeout" message might be produced by the orchestrator (so the timeout happens while calling the evaluator), but as said I have no data to back this up, as I also fail to understand where those errors are produced and if/how they bubble up the chain of services.

From the envoy telemetry latency metrics [1] of the evaluator, I'd say that Python functions sometimes take pretty long (in the range of 20-30s), which might make the call time out. This can't really be correlated with the latency metrics from [2], as the bucket size for request latency seems to be 1s max (it might also make sense to split the dashboard or the panels between orchestrator and evaluators to get a clearer picture).
What does become pretty clear from dashboard [1] is that the orchestrator and the Python evaluator have CPU limits that are too low. I'd suggest doubling both as a first step and seeing how/whether the throttling metric reduces.

[1] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=function-evaluator&var-kubernetes_namespace=wikifunctions&var-destination=LOCAL_function-evaluator-javascript-evaluator&var-destination=LOCAL_function-evaluator-python-evaluator&from=now-2d&to=now
[2] https://grafana-rw.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1&from=now-6h&to=now
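
For illustration only, a minimal sketch of how one could confirm the configured CPU limits on the orchestrator pods before and after such a change; the label selector and container index are assumptions, while the namespace comes from the dashboards above:

# Assumed label selector and container layout; prints each orchestrator pod
# and the CPU limit of its first container.
kubectl -n wikifunctions get pods -l app=function-orchestrator \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.cpu}{"\n"}{end}'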

akosiaris lowered the priority of this task from Unbreak Now! to High. Jul 5 2024, 6:55 AM
akosiaris subscribed.

I am gonna be bold and lower this to "High".

UBN per https://www.mediawiki.org/wiki/Phabricator/Project_management is

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

Per https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions, v1/evaluate sees traffic on the order of 0.1 to 0.2 requests per second, and apparently, per the description and comments in this task, only a currently not well estimated/unknown (correct me if I am wrong) ratio of those is affected. This doesn't look like something that needs to be fixed immediately, setting anything else aside. Nor is it holding the train, of course.

Change #1051813 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m

https://gerrit.wikimedia.org/r/1051813

We're still seeing the error after the above change. I agree that it's coming from the orchestrator (or at least the PHP -> orchestrator request), but the timeout we're getting is not the one implemented in the orchestrator's business logic. One of the Gateway Timeout errors we're getting doesn't even invoke the evaluator (https://www.wikifunctions.org/view/en/Z902), so evaluator latency doesn't explain everything.

Is there a PHP-layer limit on request duration?

I've found more of these errors. They only seem to happen in compositions. This is making me think that maybe the result object is getting too big at some point in the execution. I'll dig into that.
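
As an illustration of that digging (a hedged sketch, not necessarily the method actually used), one could replay a known-failing composition call through the public API and check how often the embedded "Service Unavailable" error comes back; the zobject below is the same Z13717 call exercised with curl further down this task:

# Same composition call used later in this task, sent via the public API.
ZOBJ='{"Z1K1":"Z7","Z7K1":"Z13717","Z13717K1":["Z6","cherry","banana","apple","banana"],"Z13717K2":["Z6","apple","banana","cherry"],"Z13717K3":["Z6","fruit 1","fruit 2","fruit 3"]}'
# Repeat the call 20 times; each line prints 1 if that response contained
# "Service Unavailable", 0 otherwise.
for i in $(seq 1 20); do
  curl -s --get 'https://www.wikifunctions.org/w/api.php' \
    --data-urlencode 'action=wikilambda_function_call' \
    --data-urlencode 'format=json' \
    --data-urlencode "wikilambda_function_call_zobject=${ZOBJ}" \
    | grep -c 'Service Unavailable'
done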

I am gonna be bold and lower this to "High".

UBN per https://www.mediawiki.org/wiki/Phabricator/Project_management is

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

Per https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions, v1/evaluate sees traffic on the order of 0.1 to 0.2 requests per second, and apparently, per the description and comments in this task, only a currently not well estimated/unknown (correct me if I am wrong) ratio of those is affected. This doesn't look like something that needs to be fixed immediately, setting anything else aside. Nor is it holding the train, of course.

Something like 80% of the functionality of Wikifunctions has been offline for over a week now. We don't understand what is actually broken (it seems to be between MW and the k8s cluster, or in handling the request at the boundary, or otherwise), and its errors are challenging to parse. I think this definitely counts as UBN still.

We're still seeing the error after the above change. I agree that it's coming from the orchestrator (or at least the PHP -> orchestrator request), but the timeout we're getting is not the one implemented in the orchestrator's business logic. One of the Gateway Timeout errors we're getting doesn't even invoke the evaluator (https://www.wikifunctions.org/view/en/Z902), so evaluator latency doesn't explain everything.

Is there a PHP-layer limit on request duration?

Yes, it's set to 30s, but we're getting the failure much faster (within 5 seconds mostly).

akosiaris raised the priority of this task from High to Unbreak Now!. Jul 8 2024, 3:17 PM

I am gonna be bold and lower this to "High".

UBN per https://www.mediawiki.org/wiki/Phabricator/Project_management is

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

Per https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions, v1/evaluate sees traffic on the order of 0.1 to 0.2 requests per second, and apparently, per the description and comments in this task, only a currently not well estimated/unknown (correct me if I am wrong) ratio of those is affected. This doesn't look like something that needs to be fixed immediately, setting anything else aside. Nor is it holding the train, of course.

Something like 80% of the functionality of Wikifunctions has been offline for over a week now. We don't understand what is actually broken (it seems to be between MW and the k8s cluster, or in handling the request at the boundary, or otherwise), and its errors are challenging to parse. I think this definitely counts as UBN still.

I see this task isn't tagged with SRE or serviceops as I initially thought, which would put it in the grey area of shared priority definition (in SRE, we define UBN differently apparently). I was wrong to lower it, raising it again.

If the issue is reproducible without calling any evaluator, could you check whether curl'ing the orchestrator directly (i.e. bypassing the wiki) also results in an error? By bypassing the wikis you can also rule out timeout issues, as you're (mostly; there's also one between the TLS terminator and the orchestrator) in control of those.
I would also suggest trying to figure out when the issues started, and rolling back to the release version that was active before, to rule out code changes that happened in between.

From [1] you can see that local calls from the TLS terminator to the orchestrator regularly take way too long, but it has been that way for quite some time.
On [2] you can see (with a lot of patience, sorry for that) the picture from the MediaWiki side: regularly seeing 504/503 for calls to the orchestrator.

[1] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?from=now-90d&orgId=1&to=now&var-app=function-orchestrator&var-datasource=thanos&var-destination=local_service&var-destination=LOCAL_function-orchestrator-main-orchestrator&var-kubernetes_namespace=wikifunctions&var-prometheus=k8s&var-site=eqiad&viewPanel=14
[2] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?from=now-2d&orgId=1&to=now&var-app=mediawiki&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-kubernetes_namespace=mw-wikifunctions&var-destination=wikifunctions

At least some calls to the orchestrator can run successfully with curl, and the response is basically instantaneous.

Some of the failing tests also work if I use curl:

curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z13717","Z13717K1":["Z6","cherry","banana","apple","banana"],"Z13717K2":["Z6","apple","banana","cherry"],"Z13717K3":["Z6","fruit 1","fruit 2","fruit 3"]},"doValidate":false}' --header "Content-type: application/json" -w "\n"

I disabled the passing Python implementation, so this was running the composition version (the one that fails its tests). There was significant lag (8833 ms) and the response object was big (88 KB).
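
A variant of the same curl call (a sketch using curl's standard --write-out variables) that also records the HTTP status, total time and body size, which makes it easier to compare slow or oversized responses against the timeouts discussed above:

# Same request as above, but save the body and print status, latency and size.
curl -s -o /tmp/zresponse.json \
  -w 'status=%{http_code} time=%{time_total}s size=%{size_download}B\n' \
  -X POST 'https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/' \
  --header 'Content-type: application/json' \
  --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z13717","Z13717K1":["Z6","cherry","banana","apple","banana"],"Z13717K2":["Z6","apple","banana","cherry"],"Z13717K3":["Z6","fruit 1","fruit 2","fruit 3"]},"doValidate":false}'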

@Clement_Goubert: I'm debugging this, and the only deployed change in the deployment-charts repo that touches our services in the period is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1050321, which you deployed for T283861; do you think it's possible this might be causing odd network/envoy issues? It seems unlikely, but I'm grasping at straws at this point. Never mind, it looks like the outage preceded anything around this.

There have been changes to envoy itself, most notably https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1052263, but they happened after the disruption started.

If I try running the function call above in the ApiSandbox, I get a "Service Unavailable."

Query: https://www.wikifunctions.org/wiki/Special:ApiSandbox#action=wikilambda_function_call&format=json&wikilambda_function_call_zobject=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z13717%22%2C%22Z13717K1%22%3A%5B%22Z6%22%2C%22cherry%22%2C%22banana%22%2C%22apple%22%2C%22banana%22%5D%2C%22Z13717K2%22%3A%5B%22Z6%22%2C%22apple%22%2C%22banana%22%2C%22cherry%22%5D%2C%22Z13717K3%22%3A%5B%22Z6%22%2C%22fruit%201%22%2C%22fruit%202%22%2C%22fruit%203%22%5D%7D&formatversion=2

Result:

{
    "query": {
        "wikilambda_function_call": {
            "success": false,
            "data": "{\"Z1K1\":\"Z22\",\"Z22K1\":\"Z24\",\"Z22K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z883\",\"Z883K1\":\"Z6\",\"Z883K2\":\"Z1\"},\"K1\":[{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z882\",\"Z882K1\":\"Z6\",\"Z882K2\":\"Z1\"},{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z882\",\"Z882K1\":\"Z6\",\"Z882K2\":\"Z1\"},\"K1\":\"errors\",\"K2\":{\"Z1K1\":\"Z5\",\"Z5K1\":\"Z507\",\"Z5K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z885\",\"Z885K1\":\"Z507\"},\"K1\":{\"Z1K1\":\"Z99\",\"Z99K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z13717\",\"Z13717K1\":[\"Z6\",\"cherry\",\"banana\",\"apple\",\"banana\"],\"Z13717K2\":[\"Z6\",\"apple\",\"banana\",\"cherry\"],\"Z13717K3\":[\"Z6\",\"fruit 1\",\"fruit 2\",\"fruit 3\"]}},\"K2\":{\"Z1K1\":\"Z5\",\"Z5K1\":\"Z500\",\"Z5K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z885\",\"Z885K1\":\"Z500\"},\"K1\":\"Service Unavailable\"}}}}}]}}"
        }
    }
}

Change #1052906 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "wikifunctions: Add addNestedMetadata to production orchestrator config"

https://gerrit.wikimedia.org/r/1052906

Change #1052906 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "wikifunctions: Add addNestedMetadata to production orchestrator config"

https://gerrit.wikimedia.org/r/1052906

If I try running the function call above in the ApiSandbox, I get a "Service Unavailable."

I'm unable to reproduce that. But judging from the data I get back it seems to be a cached response (orchestrationCpuUsage\",\"K2\":\"688.86 ms is always the same for me)

Sorry for repeating the probably naive question, but I still struggle to understand: which component(s) alter the message that is returned from the orchestrator?
My impression is that the response seen in the wikis is not the response provided by the orchestrator, but rather a mutated version created by the WikiLambda extension. Is that correct? If so, is there a chance to validate/log within WikiLambda when it does so?
Judging from all of the above, it seems that at some point the response of something/the orchestrator(?) is passed as-is (without any validation) into a data structure that is returned to the user, which should probably be fixed anyway.

OK: by reverting the addNestedMetadata flag (and so significantly reducing the payload returned), this now appears to be fixed.

Sorry for repeating the probably naive question, but I still struggle to understand: which component(s) alter the message that is returned from the orchestrator?
My impression is that the response seen in the wikis is not the response provided by the orchestrator, but rather a mutated version created by the WikiLambda extension. Is that correct? If so, is there a chance to validate/log within WikiLambda when it does so?
Judging from all of the above, it seems that at some point the response of something/the orchestrator(?) is passed as-is (without any validation) into a data structure that is returned to the user, which should probably be fixed anyway.

Yes, agreed: there's something in the network between the orchestrator and MW that was not happy with what the orchestrator was returning, and was instead transforming it into 503s/504s and the occasional "no healthy upstream". Maybe there's a \0 we're accidentally passing, or it breaks on return blobs above a certain size?
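
To check the "\0 or blob size" theory, a small sketch (assuming the response body was saved to /tmp/zresponse.json as in the earlier curl example) that reports the payload size and counts any NUL bytes in it:

# Overall size of the saved orchestrator response, in bytes.
wc -c /tmp/zresponse.json
# Number of NUL (\0) bytes in the response; anything other than 0 would
# support the stray-\0 theory.
tr -cd '\0' < /tmp/zresponse.json | wc -c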

If I try running the function call above in the ApiSandbox, I get a "Service Unavailable."

Query: https://www.wikifunctions.org/wiki/Special:ApiSandbox#action=wikilambda_function_call&format=json&wikilambda_function_call_zobject=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z13717%22%2C%22Z13717K1%22%3A%5B%22Z6%22%2C%22cherry%22%2C%22banana%22%2C%22apple%22%2C%22banana%22%5D%2C%22Z13717K2%22%3A%5B%22Z6%22%2C%22apple%22%2C%22banana%22%2C%22cherry%22%5D%2C%22Z13717K3%22%3A%5B%22Z6%22%2C%22fruit%201%22%2C%22fruit%202%22%2C%22fruit%203%22%5D%7D&formatversion=2

Result:

{
    "query": {
        "wikilambda_function_call": {
            "success": false,
            "data": "{\"Z1K1\":\"Z22\",\"Z22K1\":\"Z24\",\"Z22K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z883\",\"Z883K1\":\"Z6\",\"Z883K2\":\"Z1\"},\"K1\":[{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z882\",\"Z882K1\":\"Z6\",\"Z882K2\":\"Z1\"},{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z882\",\"Z882K1\":\"Z6\",\"Z882K2\":\"Z1\"},\"K1\":\"errors\",\"K2\":{\"Z1K1\":\"Z5\",\"Z5K1\":\"Z507\",\"Z5K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z885\",\"Z885K1\":\"Z507\"},\"K1\":{\"Z1K1\":\"Z99\",\"Z99K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z13717\",\"Z13717K1\":[\"Z6\",\"cherry\",\"banana\",\"apple\",\"banana\"],\"Z13717K2\":[\"Z6\",\"apple\",\"banana\",\"cherry\"],\"Z13717K3\":[\"Z6\",\"fruit 1\",\"fruit 2\",\"fruit 3\"]}},\"K2\":{\"Z1K1\":\"Z5\",\"Z5K1\":\"Z500\",\"Z5K2\":{\"Z1K1\":{\"Z1K1\":\"Z7\",\"Z7K1\":\"Z885\",\"Z885K1\":\"Z500\"},\"K1\":\"Service Unavailable\"}}}}}]}}"
        }
    }
}

That result was cached; now that I've disabled the nested metadata config, if I try the modified query https://www.wikifunctions.org/wiki/Special:ApiSandbox#action=wikilambda_function_call&format=json&wikilambda_function_call_zobject=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z13717%22%2C%22Z13717K1%22%3A%5B%22Z6%22%2C%22cherry%22%2C%22banana%22%2C%22apple%22%2C%22banana%22%2C%22dog%22%5D%2C%22Z13717K2%22%3A%5B%22Z6%22%2C%22apple%22%2C%22banana%22%2C%22cherry%22%2C%22hound%22%5D%2C%22Z13717K3%22%3A%5B%22Z6%22%2C%22fruit%201%22%2C%22fruit%202%22%2C%22fruit%203%22%2C%22chien%22%5D%7D&formatversion=2 I get a successful "Z22K1":["Z6","fruit 3","fruit 2","fruit 1","fruit 2","dog"].

This comment was removed by JMeybohm.
Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High. Jul 9 2024, 9:24 AM

Downgrading now that a temporary fix is in place.

Again, I have a limited understanding of what I'm looking at here, but https://logstash.wikimedia.org/goto/49bb255fe9a7c07a37389ad8e377b362 seems to capture a portion of the problematic requests. Unfortunately not all of them, because a 503 does not always seem to result in the zObject containing "Service Unavailable".

This aligns nicely with a huge increase in readiness probe failures: https://logstash.wikimedia.org/goto/c85d3e64ea5df538b2659d7df84d97d9 which can ultimately lead to "no healthy endpoints" if there are, well, no healthy endpoints anymore (e.g. all your pods failing the readiness probes at more or less the same time).
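
For anyone following along, a hedged sketch of how those readiness-probe failures could also be checked directly from the cluster (kubectl access is assumed; "Unhealthy" is the event reason Kubernetes records for failed probes):

# List the most recent "Unhealthy" events (failed readiness/liveness probes)
# in the wikifunctions namespace.
kubectl -n wikifunctions get events --field-selector reason=Unhealthy \
  --sort-by=.lastTimestamp | tail -n 20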

Downgrading now that a temporary fix is in place.

Could you please elaborate what the fix is and why it fixes things?
Ah, missed the comment above about addNestedMetadata. But I still don't understand why this helps.

Could you please elaborate what the fix is and why it fixes things?
Ah, missed the comment above about addNestedMetadata. But I still don't understand why this helps.

Our best theory is that the size of responses was causing problems. The recent addNestedMetadata change made responses much, much larger than we had bargained for. In my testing, response size (more than other factors like time) seemed to account for these failures. It's still not clear how this worked: I managed to find a function call that crashed the orchestrator entirely, even when bypassing the frontend and API layers.

Besides size, @gengh also mentioned some escaping that might be going astray.

Another datapoint to add: during the problematic time frame, alerting did catch some of the orchestrator restarts (it only alerts on restarts that are caused by OOM kills): https://logstash.wikimedia.org/goto/aac4d9745b946d83bc486d3300221658
The memory increase is mostly not visible in Grafana, so I suspect it happened quickly (we only scrape data every 30s). Maybe that helps with tracking down the root cause.
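
A quick way to cross-check the OOM-kill theory from the cluster side (sketch only; the label selector is an assumption) is to look at each orchestrator container's last termination reason:

# Prints each orchestrator pod and why its container last terminated;
# "OOMKilled" here would corroborate the alerting above.
kubectl -n wikifunctions get pods -l app=function-orchestrator \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'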

On https://www.wikifunctions.org/view/en/Z17692, the test cases are still failing with a "Gateway timeout".

Jdforrester-WMF renamed this task from Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable” to Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”. Aug 29 2024, 5:13 PM

Change #1075629 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Set 'generateFunctionsMetrics' to true in orchestrator config

https://gerrit.wikimedia.org/r/1075629

Change #1075629 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Set 'generateFunctionsMetrics' to true in orchestrator config

https://gerrit.wikimedia.org/r/1075629