User Details
- User Since: Jan 5 2021, 4:31 PM (185 w, 3 d)
- Availability: Available
- LDAP User: Cory Massaro
- MediaWiki User: CMassaro (WMF)
Thu, Jul 25
Thanks @gengh for pointing me to the call being sent to the orchestrator.
This is now passing on both of the linked tests. The error in the task description points to an executor error, and the empty string after it means that the executor just didn't return anything. Could have been something ephemeral like resource exhaustion or a timeout. Marking as "needs sign-off" for now.
Wed, Jul 24
Tue, Jul 23
Thu, Jul 18
Wed, Jul 17
I am gonna write some tests to "confirm" this and then an MR to (maybe) fix it. I can't truly replicate the problem locally, so we won't be 100% sure until the "fix" lands, but the fix shouldn't do any harm either way.
I have an idea. I think what's happening here is that, when we populate the implementation list, we enter a race condition (due to an unfortunately-structured Promise.all).
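Roughly the shape I have in mind, as a minimal sketch rather than the actual orchestrator code (zids and fetchImplementation are made-up stand-ins):

// Suspected racy shape: callbacks mutate a shared list as each promise
// resolves, so anything that reads the list before Promise.all settles sees a
// partially populated, arbitrarily ordered array.
async function populateImplementationsRacy( zids, fetchImplementation ) {
	const implementations = [];
	await Promise.all( zids.map( async ( zid ) => {
		implementations.push( await fetchImplementation( zid ) );
	} ) );
	return implementations;
}

// Safer shape: let Promise.all collect the results itself; the order matches
// zids and there is never a half-filled list for anything else to observe.
async function populateImplementationsSafe( zids, fetchImplementation ) {
	return Promise.all( zids.map( ( zid ) => fetchImplementation( zid ) ) );
}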
Test call for ApiSandbox:
Tue, Jul 9
Mon, Jul 8
If I try running the function call above in the ApiSandbox, I get a "Service Unavailable."
Some of the failing tests also work if I use curl:
At least some calls to the orchestrator can run successfully with curl, and the response is basically instantaneous.
This is getting more relevant now. Image publication often fails and has to be re-run manually.
I've found more of these errors. They only seem to happen in compositions. This is making me think that maybe the result object is getting too big at some point in the execution. I'll dig into that.
We're still seeing the error after the above change. I agree that it's coming from the orchestrator (or at least the PHP -> orchestrator request), but the timeout we're getting is not the one implemented in the orchestrator's business logic. One of the Gateway Timeout errors we're getting doesn't even invoke the evaluator (https://www.wikifunctions.org/view/en/Z902), so evaluator latency doesn't explain everything.
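To make the distinction concrete, here's a minimal sketch of the caller side (assuming Node 18+ for fetch/AbortController; the URL and the ten-second figure are illustrative, not our real configuration). The transport-level timeout here fires regardless of what the orchestrator's internal timer is doing, and surfaces as a gateway-style error:

async function callOrchestrator( body ) {
	const controller = new AbortController();
	// Transport-level timeout owned by the caller (the PHP -> orchestrator
	// request, or a gateway in between), independent of any business-logic
	// timeout inside the orchestrator itself.
	const timer = setTimeout( () => controller.abort(), 10000 );
	try {
		const response = await fetch( 'http://localhost:6254/1/v1/evaluate/', {
			method: 'POST',
			headers: { 'Content-Type': 'application/json' },
			body: JSON.stringify( body ),
			signal: controller.signal
		} );
		return await response.json();
	} finally {
		clearTimeout( timer );
	}
}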
Jun 20 2024
Jun 18 2024
Jun 17 2024
This is now starting to mess things up. ZID resolution is interfering with our scope attachment (and therefore our reporting of nested metadata): the dereferencer shouldn't wrap returned objects in ZWrappers; it should leave that responsibility to the caller, which has the correct scope information.
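A minimal sketch of the split I mean; the callback and scope arguments are hypothetical stand-ins, not the actual dereferencer/ZWrapper API:

// Current shape: the dereferencer wraps its own result, but it only knows its
// own scope, so nested metadata ends up attached to the wrong place.
async function dereferenceAndWrap( zid, dereference, wrap, dereferencerScope ) {
	return wrap( await dereference( zid ), dereferencerScope );
}

// Proposed shape: the dereferencer hands back the raw ZObject, and the caller,
// which holds the correct scope, does the wrapping.
async function dereferenceRaw( zid, dereference ) {
	return dereference( zid );
}

async function resolveInCallerScope( zid, dereference, wrap, callerScope ) {
	return wrap( await dereferenceRaw( zid, dereference ), callerScope );
}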
Jun 12 2024
Jun 11 2024
Jun 10 2024
Jun 6 2024
May 31 2024
May 30 2024
In /function-orchestrator/app.js, the logging doesn't play well with req, res, or next, the parameters of app.all(): any time I try to log one of them, the logs get held up.
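A minimal sketch of the workaround I'd reach for, assuming the hang comes from trying to serialize those huge, circular request/response objects: log a hand-picked set of scalar fields instead of the objects themselves (the route and fields are illustrative, not the real app.js code):

const express = require( 'express' );
const app = express();

app.all( '*', ( req, res, next ) => {
	// Only small, serializable pieces of the request reach the logger.
	console.log( JSON.stringify( {
		method: req.method,
		url: req.originalUrl,
		contentLength: req.headers[ 'content-length' ]
	} ) );
	next();
} );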
This has apparently been fixed for a while!
This will be an excellent experimental case for the propagation of granular error data once the nestedMetadata stuff is ready; I recommend we take this on AFTER nested metadata is done.
This was due to upstream changes in RustPython. @Jdforrester-WMF has since pinned our RustPython checkout to a specific tagged commit, so we shouldn't see this kind of volatility in future.
May 29 2024
May 28 2024
This is no longer true!
These tests are still failing. As far as I can tell, the "bad user-defined type" test in mswOrchestrateTest is redundant with the "generic type validation error: bad list" test in mockedServicesOrchestrateTest/genericsTest.js.
May 23 2024
ps-node also fails, in an even worse way: it kills the two parent processes but leaves the wasmedge process itself running (in a non-zombie state, even).
May 22 2024
For what it's worth, I am not seeing this issue on the Beta cluster.
root@3ed8d7dddc2f:/srv/service# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.6 0.2 1105088 74724 ? Ssl 16:15 0:01 node server.js
root 17 0.1 0.0 4608 3712 pts/0 Ss 16:18 0:00 /bin/bash
root 25 0.1 0.1 724296 41088 ? Ssl 16:18 0:00 node /srv/service/executor-classes/javaScriptWasmSubprocess.js
root 36 0.0 0.0 2576 1536 ? S 16:18 0:00 /bin/sh -c . /srv/service/programming-languages/wasmedge/wasmedge-binary/env && wasmedge --enable-all-statistics --dir .:. --dir /executo
root 37 99.9 0.1 12746860 42556 ? Sl 16:18 0:14 wasmedge --enable-all-statistics --dir .:. --dir /executors/javascript-wasmedge:./executors/javascript-wasmedge /srv/service/programming-
root 45 0.0 0.0 8480 4352 pts/0 R+ 16:18 0:00 ps aux
root@3ed8d7dddc2f:/srv/service# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.7 0.2 1172928 77528 ? Ssl 16:15 0:01 node server.js
root 17 0.1 0.0 4608 3712 pts/0 Ss 16:18 0:00 /bin/bash
root 36 0.0 0.0 0 0 ? Z 16:18 0:00 [sh] <defunct>
root 37 97.0 0.0 0 0 ? Z 16:18 0:15 [wasmedge] <defunct>
root 50 0.0 0.0 8480 4480 pts/0 R+ 16:18 0:00 ps aux
Whatever happened, it was ephemeral. I tried Wikifunctions.Debugging the arguments, and they were as expected. Removing the cached results and re-running caused the function to execute successfully.
May 21 2024
May 20 2024
I have replicated this locally. There are a few unfortunate things happening here. One is that our last-ditch effort (process.kill) still doesn't work in some cases, and every solution I've tried so far leaves us with zombie processes. It seems like the wasmedge environment does something that messes up process parent relationships, so I'll keep digging into that.
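For reference, the direction I'm exploring looks roughly like this; it's a sketch, not the current executor code, and the sleep stands in for the real shell-wrapped wasmedge invocation. The idea is to spawn the subprocess as its own process-group leader and signal the whole group, so the intermediate shell and wasmedge both get the signal instead of being reparented:

const { spawn } = require( 'child_process' );

// detached: true makes the child the leader of a new process group, so a
// negative PID lets us signal the shell and everything it spawned.
const child = spawn( '/bin/sh', [ '-c', 'sleep 600' ], {
	detached: true,
	stdio: 'ignore'
} );

function killSubprocessTree() {
	try {
		// A negative PID targets the whole process group, not just the shell.
		process.kill( -child.pid, 'SIGKILL' );
	} catch ( err ) {
		// ESRCH means the group is already gone; anything else is a real error.
		if ( err.code !== 'ESRCH' ) {
			throw err;
		}
	}
}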
May 17 2024
Very weird. When I test with the image that production is using, I can repro locally; however, when I use a more recent version of the orchestrator, I can run the function successfully. It's possible that the failure to recognize the Identity key, combined with the expansions of that key, might be responsible for the issue. In any case, I think this will resolve itself once we deploy a new version of the orchestrator on Monday, so we should retry then.