
evaluator pods terminate without crash logs
Open, In Progress, High · Public · Spike

Description

All of the Evaluator service pods appear to terminate without logging anything, so no crash logs can be retrieved. (This has not been an issue with the Orchestrator pods.)

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Dec 10 2025, 2:10 AM
kube_env wikifunctions codfw
kubectl get pods

NAME                                                       READY   STATUS    RESTARTS        AGE
function-evaluator-javascript-evaluator-9d947b667-42dzd    2/2     Running   1 (47h ago)     5d10h
function-evaluator-javascript-evaluator-9d947b667-glhjg    2/2     Running   1 (2d7h ago)    5d10h
function-evaluator-python-evaluator-6957c7dcb8-jktpd       2/2     Running   2 (2d7h ago)    5d10h
function-evaluator-python-evaluator-6957c7dcb8-qq6xd       2/2     Running   1 (2d15h ago)   5d10h
function-orchestrator-main-orchestrator-86569bd79d-cm9fn   4/4     Running   3 (3d12h ago)   5d10h
function-orchestrator-main-orchestrator-86569bd79d-pvqld   4/4     Running   5 (2d4h ago)    5d10h

As mentioned, no crash logs are available for any of the Evaluator pods; for example, kubectl logs function-evaluator-python-evaluator-6957c7dcb8-jktpd function-evaluator-python-evaluator --previous:

...
ion/json","user-agent":"wikifunctions-function-orchestrator/0.0.1","x-request-id":"97df044f-d7d7-4189-b050-9dfadc8c39ba"},"id":"97df044f-d7d7-4189-b050-9dfadc8c39ba","method":"POST","query":{}}},"log.level":"info","message":"Incoming evaluator request","service":{"name":"function-evaluator"},"source":{"ip":"127.0.0.1","port":52800},"url":{"full":"/1/v1/evaluate/","path":"/1/v1/evaluate/"}}
{"@timestamp":"2025-12-06T18:24:11.983Z","ecs.version":"8.10.0","labels":{"requestId":"97df044f-d7d7-4189-b050-9dfadc8c39ba"},"log.level":"info","message":"calling executor in evaluator...","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:11.983Z","ecs.version":"8.10.0","http":{"request":{"id":"97df044f-d7d7-4189-b050-9dfadc8c39ba"}},"log.level":"info","message":"calling execute in evaluator...","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:11.995Z","ecs.version":"8.10.0","http":{"request":{"id":"97df044f-d7d7-4189-b050-9dfadc8c39ba"}},"log.level":"info","message":"calling execute in executor..., original time: 2025-12-06T18:24:11.986344","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:11.995Z","ecs.version":"8.10.0","http":{"request":{"id":"97df044f-d7d7-4189-b050-9dfadc8c39ba"}},"log.level":"info","message":"...finished calling execute in executor, original time: 2025-12-06T18:24:11.993077","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:12.006Z","ecs.version":"8.10.0","http":{"request":{"id":"97df044f-d7d7-4189-b050-9dfadc8c39ba"}},"log.level":"info","message":"...finished calling execute in evaluator","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:12.028Z","ecs.version":"8.10.0","labels":{"requestId":"97df044f-d7d7-4189-b050-9dfadc8c39ba"},"log.level":"info","message":"...finished calling executor in evaluator","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:12.928Z","ecs.version":"8.10.0","http":{"request":{"headers":{"user-agent":"kube-probe/1.31","x-request-id":"ceac0023-db48-47f0-8dba-415fa4ab2081"},"id":"ceac0023-db48-47f0-8dba-415fa4ab2081","method":"GET","query":{}}},"log.level":"info","message":"Incoming evaluator request","service":{"name":"function-evaluator"},"source":{"ip":"10.192.7.17","port":60698},"url":{"full":"/_info","path":"/_info"}}
{"@timestamp":"2025-12-06T18:24:21.228Z","ecs.version":"8.10.0","http":{"request":{"headers":{"content-length":"746","content-type":"application/json","user-agent":"wikifunctions-function-orchestrator/0.0.1","x-request-id":"ae62ba8c-6ae2-4441-b3ed-118d4d159635"},"id":"ae62ba8c-6ae2-4441-b3ed-118d4d159635","method":"POST","query":{}}},"log.level":"info","message":"Incoming evaluator request","service":{"name":"function-evaluator"},"source":{"ip":"127.0.0.1","port":34390},"url":{"full":"/1/v1/evaluate/","path":"/1/v1/evaluate/"}}
{"@timestamp":"2025-12-06T18:24:21.229Z","ecs.version":"8.10.0","labels":{"requestId":"ae62ba8c-6ae2-4441-b3ed-118d4d159635"},"log.level":"info","message":"calling executor in evaluator...","service":{"name":"function-evaluator"}}
{"@timestamp":"2025-12-06T18:24:21.229Z","ecs.version":"8.10.0","http":{"request":{"id":"ae62ba8c-6ae2-4441-b3ed-118d4d159635"}},"log.level":"info","message":"calling execute in evaluator...","service":{"name":"function-evaluator"}}

(That is the end of the output.)
By contrast, here is one of the Orchestrator crash logs, from kubectl logs function-orchestrator-main-orchestrator-86569bd79d-pvqld function-orchestrator-main-orchestrator --previous:

{"request":{"id":"9dffd9e3-0b11-456c-9753-188626b2031e"}},"labels":{"host":"www.wikifunctions.org","url":"http://localhost:6501/w/api.php?action=wikilambda_fetch&format=json&uselang=content&zids=Z6021"},"log.level":"debug","message":"orchestration start fetching ZIDs <Z6021>","service":{"name":"function-orchestrator"}}
{"@timestamp":"2025-12-06T22:18:38.942Z","ecs.version":"8.10.0","http":{"request":{"id":"9dffd9e3-0b11-456c-9753-188626b2031e"}},"labels":{"host":"www.wikifunctions.org","url":"http://localhost:6501/w/api.php?action=wikilambda_fetch&format=json&uselang=content&zids=Z6021"},"log.level":"debug","message":"orchestration start fetching ZIDs <Z6021>","service":{"name":"function-orchestrator"}}

<--- Last few GCs --->

[1:0x3b3bd000] 118247648 ms: Scavenge 1488.0 (1542.0) -> 1483.9 (1542.0) MB, pooled: 0 MB, 10.39 / 0.00 ms  (average mu = 0.218, current mu = 0.295) task; 
[1:0x3b3bd000] 118248854 ms: Mark-Compact (reduce) 1491.3 (1542.2) -> 1440.7 (1492.2) MB, pooled: 0 MB, 132.42 / 0.00 ms  (+ 1035.7 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1206 ms) (average mu = 0.246, cu

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0xe1803a node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
 2: 0x11ec9c0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
 3: 0x11ecc97 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
 4: 0x141a575  [node]
 5: 0x141a5a3  [node]
 6: 0x143367a  [node]
 7: 0x1436848  [node]
 8: 0x1c9c351  [node]
ecarg changed the task status from Open to In Progress. · Dec 10 2025, 2:18 AM
ecarg triaged this task as High priority.

From a Slack thread with @akosiaris (thank you!):

The relevant Kubernetes events for one of the pods are at: https://logstash.wikimedia.org/goto/744de7bab2aecfce9b54c1fd7517117c. A PNG showing the 2 restart cycles is attached as well. There are no pod killing events involved. No failure of livenessProbe, no OOM kills. The lack of logs means that the pod terminated without logging anything. @Grace, can you run kubectl describe pod <pod> and check the reason for termination? It should be one of Completed, OOMKilled, or something similar.

Update: Got it. It's OOMKilled. See https://w.wiki/Ga7p.
@Grace It's absolutely expected that you would not get any logs when the process is OOMKilled. The kernel sends the SIGKILL signal to the process, which cannot be caught or handled. The application is terminated abnormally, with no chance for logging or cleanup. For some reason the app ended up consuming more memory than it has been allocated. This can also be seen at https://w.wiki/Ga82, where the pod reaches the 1GB memory limit.
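
The termination reason that kubectl describe pod prints is also exposed in the pod's JSON status, under status.containerStatuses[].lastState.terminated. Below is a minimal Python sketch of reading it. The function name, the sample document, and the sidecar container name are hypothetical and for illustration only; the exit code 137 in the sample is an assumption based on the usual 128 + 9 (SIGKILL) convention, not a value copied from these pods.

```python
import json


def last_termination_reasons(pod: dict) -> dict:
    """Map container name -> (reason, exit code) of its previous termination."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:  # absent when the container has never been restarted
            reasons[status["name"]] = (
                terminated.get("reason"),
                terminated.get("exitCode"),
            )
    return reasons


# Hypothetical, heavily trimmed pod document for illustration only;
# real input would come from: kubectl get pod <pod> -o json
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "function-evaluator-python-evaluator",
        "restartCount": 2,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      },
      {
        "name": "hypothetical-sidecar",
        "restartCount": 0,
        "lastState": {}
      }
    ]
  }
}
""")

for name, (reason, code) in last_termination_reasons(sample_pod).items():
    print(f"{name}: {reason} (exit code {code})")
```

Against a live cluster, the same function could be fed the parsed output of kubectl get pod <pod> -o json instead of the sample document.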

image.png (690×1 px, 139 KB)

While I am at it: the patterns in that graph are very indicative of a memory leak. I am attaching a picture, but if you zoom out a couple of times you should be able to see the staircase-like pattern: the climb, the spike, and the kill, after which it starts from scratch again.

image (1).png (968×1 px, 89 KB)
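
The staircase-and-reset pattern described above can be checked roughly on exported memory samples. The following Python sketch is a simple heuristic and not how the dashboard works: it treats a drop of at least half the memory limit between consecutive samples as a restart, then checks that usage grows within every segment between restarts. The function names, the 0.5 drop threshold, and the sample values are all assumptions for illustration; only the 1GB limit (taken here as 1024 MiB) comes from the thread.

```python
def find_resets(samples, limit, drop_fraction=0.5):
    """Return indices where usage falls sharply, suggesting a container restart."""
    resets = []
    for i in range(1, len(samples)):
        if samples[i - 1] - samples[i] >= drop_fraction * limit:
            resets.append(i)
    return resets


def looks_like_leak(samples, limit, drop_fraction=0.5):
    """True if at least one reset occurred and usage grows in every segment."""
    resets = find_resets(samples, limit, drop_fraction)
    if not resets:
        return False
    bounds = [0] + resets + [len(samples)]
    for start, end in zip(bounds, bounds[1:]):
        segment = samples[start:end]
        # A segment that does not end higher than it started is not a climb.
        if len(segment) >= 2 and segment[-1] <= segment[0]:
            return False
    return True


# Made-up memory usage in MiB, sampled at regular intervals (illustrative only)
usage_mib = [200, 400, 400, 700, 1000, 150, 350, 350, 650, 990, 180, 300]
LIMIT_MIB = 1024

print(find_resets(usage_mib, LIMIT_MIB))
print(looks_like_leak(usage_mib, LIMIT_MIB))
```

A heuristic like this only flags the shape of the curve; confirming an actual leak would still need heap profiling of the evaluator process.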