
Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable”
Open, High | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • both linked tests fail
  • Try this implementation works, however

What should have happened instead?:
The linked tests should always pass for such a critical function. Dependent implementations like https://www.wikifunctions.org/view/en/Z11223 also have failing tests but evaluate with Try this implementation (“4 hours ago”).

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Not all built-in functions have connected tests.

Tests pass on:

Tests fail on:

Community functions generally fail and Try this function also fails (with no metadata) on:

Some exceptions which pass:

I haven’t seen any functions where some implementations pass while others fail, apart from built-in functions like https://www.wikifunctions.org/wiki/Z801.
For https://www.wikifunctions.org/view/en/Z12203, the composition with Z802 fails and the composition wrapping Z13254 succeeds

It appears to have been a temporary error, but I encountered the error shown in IMG_0966.png (2×960 px, 317 KB) on https://www.wikifunctions.org/view/en/Z13521. Here, three of four implementations were successfully tested by Z13527 4 or 5 hours ago (18:45 UTC) (the fourth always fails). Try this function succeeds for 2 + 11 (Z13573) but fails with 213 + 1179 (no metadata). Using Try this implementation, this sum succeeds with Z13573, Z13529 and Z14759 (which depends on a function that has only Python implementations). Although it is not generally delivering old results, the selected implementation varies from call to call, failing altogether about 10% of the time. The depicted error occurred again at 19:18 UTC, adding 7387 and 6656. It seems to be persisting with similar values even after succeeding with one of its values increased by an order of magnitude. Trying this sum as a new test, the three implementations all pass but the sum still fails in Try this function.

Event Timeline

Jdforrester-WMF renamed this task from Function evaluations are often failing on Wikifunctions.org to Function evaluations are often failing on Wikifunctions.org with "gateway timeout". Tue, Jul 2, 1:13 PM
GrounderUK renamed this task from Function evaluations are often failing on Wikifunctions.org with "gateway timeout" to Function evaluations are often failing on Wikifunctions.org with "gateway timeout" or “service unavailable”. Tue, Jul 2, 1:25 PM

Change #1051364 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517"

https://gerrit.wikimedia.org/r/1051364

Change #1051364 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517"

https://gerrit.wikimedia.org/r/1051364

Random evaluations:
22:00 UTC
https://www.wikifunctions.org/view/en/Z15386
https://www.wikifunctions.org/view/en/Z17105
https://www.wikifunctions.org/view/en/Z14226 Service Unavailable [passing at 22:50]
https://www.wikifunctions.org/view/en/Z10869
https://www.wikifunctions.org/view/en/Z11208 Service Unavailable [persistent]
https://www.wikifunctions.org/view/en/Z11684
22:30 UTC
https://www.wikifunctions.org/view/en/Z805 (equality function changed in one test)
https://www.wikifunctions.org/view/en/Z14304
23:00 UTC
https://www.wikifunctions.org/view/en/Z13633 Service Unavailable [passing at 00:30 UTC]
https://www.wikifunctions.org/wiki/Z808
⚠️https://www.wikifunctions.org/view/en/Z15251 Three Python implementations but only one passes (Z15377), otherwise Service Unavailable [Z15621 passing after 00:00 UTC; Z15252 has some passes too and is disconnected, probably because it fails for large values – errors are meaningful]
https://www.wikifunctions.org/view/en/Z17111 (with one odd exception: Z17117 with Z17114)

IMG_0967.png (2×960 px, 334 KB)

Change #1051813 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m

https://gerrit.wikimedia.org/r/1051813

@cmassaro thanks for the ping. Unfortunately I have trouble finding the right data to analyze this further. The orchestrator calls the evaluator(s) directly (not via the service mesh) and does not produce metrics for those requests, such as latency or error rates (at least I'm unable to find them). Looking at the logs of the components, I can't see failed requests either.

I think the "Gateway Timeout" message might be produced by the orchestrator (so the timeout happens while calling the evaluator), but as said, I have no real data to back this up, and I also fail to understand where those errors are produced and if/how they bubble up the chain of services.

From the Envoy telemetry latency metrics [1] of the evaluator, I'd say that Python functions sometimes take pretty long (in the range of 20-30s), which might make the call time out. This can't really be correlated with the latency metrics from [2], as the bucket size for request latency seems to top out at 1s (it might also make sense to split the dashboard or the panels between orchestrator and evaluators to get a clearer picture).
What does become pretty clear from the dashboard [1] is that the orchestrator and the Python evaluator have CPU limits that are too low. I'd suggest doubling both as a first step and seeing how/if the throttled metric reduces.

[1] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=function-evaluator&var-kubernetes_namespace=wikifunctions&var-destination=LOCAL_function-evaluator-javascript-evaluator&var-destination=LOCAL_function-evaluator-python-evaluator&from=now-2d&to=now
[2] https://grafana-rw.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1&from=now-6h&to=now
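
To illustrate the bucket point above, here is a minimal sketch of why a latency histogram whose largest bucket is 1s cannot help with 20-30s calls. The bucket boundaries and helper names are hypothetical (the real dashboard's buckets are not listed in this task); only the 1s cap is taken from the comment.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds. Only the 1s maximum comes
# from the comment above; the smaller boundaries are made up for illustration.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0]


def bucket_label(latency_s, bounds=BUCKETS):
    """Return the label of the histogram bucket that a latency falls into."""
    i = bisect_left(bounds, latency_s)
    if i < len(bounds):
        return f"<= {bounds[i]}s"
    # Everything slower than the largest bound lands in one overflow bucket.
    return f"> {bounds[-1]}s"


def histogram(latencies, bounds=BUCKETS):
    """Count how many latencies fall into each bucket."""
    counts = {}
    for latency in latencies:
        label = bucket_label(latency, bounds)
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    # A 2s call and a 25s call are indistinguishable in this histogram.
    print(bucket_label(2.0), "|", bucket_label(25.0))
    print(histogram([0.3, 2.0, 25.0]))
```

With buckets like these, a call that takes 2s and one that takes 25s are counted in the same overflow bucket, so the dashboard cannot show whether slow calls are approaching a timeout.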

akosiaris lowered the priority of this task from Unbreak Now! to High. Fri, Jul 5, 6:55 AM
akosiaris subscribed.

I am gonna be bold and lower this to "High".

UBN per https://www.mediawiki.org/wiki/Phabricator/Project_management is

Unbreak Now! – Something is broken and needs to be fixed immediately, setting anything else aside. This should meet the requirements for issues that hold the train.

Per https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions, v1/evaluate sees traffic on the order of 0.1 to 0.2 requests per second, and per the description and comments in this task, only a ratio of those requests (currently unknown or not well estimated; correct me if I am wrong) is affected. This doesn't look like something that needs to be fixed immediately, setting anything else aside. Nor is it holding the train, of course.
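
For a rough sense of scale, the traffic figures above can be turned into daily request counts. This is only a back-of-the-envelope sketch: the failure ratio is not known, and the 10% used below is an assumption borrowed from the roughly 10% failure rate the description reports for a single function (Z13521), not a measured site-wide value. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(rate_per_second):
    """Convert a steady request rate into requests per day."""
    return rate_per_second * SECONDS_PER_DAY


def affected_per_day(rate_per_second, failure_ratio):
    """Estimate failing requests per day for an assumed failure ratio."""
    return requests_per_day(rate_per_second) * failure_ratio


if __name__ == "__main__":
    ASSUMED_FAILURE_RATIO = 0.10  # assumption, not a measured value
    for rate in (0.1, 0.2):  # requests per second, from the comment above
        total = round(requests_per_day(rate))
        failed = round(affected_per_day(rate, ASSUMED_FAILURE_RATIO))
        print(f"{rate} req/s: about {total} requests/day, "
              f"about {failed} failing at an assumed 10%")
```

Under these assumptions the traffic is in the range of 8,640 to 17,280 evaluations per day, so even a 10% failure ratio would mean on the order of one to two thousand failing requests per day.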