mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs
Closed, Resolved · Public

Description

Both mw-content-history-reconcile-enrich and mw-content-history-reconcile-enrich-next JobManagers on dse-k8s-eqiad are experiencing frequent, unexpected restarts.

This is occurring despite the applications being mostly idle. I haven't observed anything suspicious in the pod logs or the Flink JobManager UI.

Grafana shows that pod memory usage increases over time until the container is OOM-killed. However, the Flink JobManager UI reports stable JVM heap allocations of 256MB.
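
One hedged way to quantify the gap between those two numbers (metric names per Flink's REST API; the port-forward and the default port 8081 are assumptions):

# Container-level memory, as the kubelet sees it:
kubectl top pod -n mw-content-history-reconcile-enrich
# JVM heap as the JobManager itself reports it (after kubectl port-forward <pod> 8081):
curl -s 'http://localhost:8081/jobmanager/metrics?get=Status.JVM.Memory.Heap.Used,Status.JVM.Memory.Heap.Max'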

Some context in this Slack thread.

Event Timeline

Related to the spikes observed here: https://phabricator.wikimedia.org/T397854?

We see these restarts when the application is idle. They do not seem at all traffic related, and only affect the Job Manager (not worker nodes).

Change #1172053 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size

https://gerrit.wikimedia.org/r/1172053

Change #1172053 merged by Bking:

[operations/deployment-charts@master] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size

https://gerrit.wikimedia.org/r/1172053
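
For context, a hedged sketch of what such an override looks like in a flink-app values file (the key is Flink's jobmanager.memory.off-heap.size, whose upstream default is 128mb; the surrounding structure and the size shown here are assumptions, and the actual value is in the Gerrit change):

flinkConfiguration:
  # Hedged sketch, not the merged patch: raise the JobManager's off-heap budget.
  jobmanager.memory.off-heap.size: "512mb"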

Ottomata renamed this task from "mediawiki.content_history: flink applications experiencing frequent restarts" to "mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs". Aug 12 2025, 5:08 PM
Ottomata triaged this task as Medium priority. Aug 12 2025, 5:11 PM

Copying over context from T397336#11042604

@gmodena wrote:

We see odd memory allocation patterns for Job Managers of applications deployed on DSE: upon startup, JM memory builds up monotonically until it reaches the container limit. Raising memory limits and manually tuning memory allocations did not help.

While the business logic differs from what we have deployed in Wikikube, the runtime, charts, and Helmfile config match. The issue manifests on the Job Manager, which is a coordinator node and should not perform any workload. To further isolate the issue, we tried disabling HA (and Flink state checkpointing), with no impact on the allocation pattern.

The recommended next steps are:

  • deploy Flink 1.20 and validate if this issue persists
  • install JVM tools (JDK) in the image, so we can collect more fine-grained memory info (e.g. via jmap)
  • execute the Flink app with JVM instrumentation enabled and collect a memory trace
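
On the instrumentation point, a minimal sketch assuming the image ships JDK tooling and that JVM options can be passed through via Flink's env.java.opts.jobmanager option:

# In the Flink configuration (the key is upstream's; the value is an assumption):
#   env.java.opts.jobmanager: "-XX:NativeMemoryTracking=summary"
# Then, from a shell that can see the JobManager process:
jcmd <jobmanager-pid> VM.native_memory summary   # per-category native allocations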

I think I have identified a way of debugging a Flink app without having to build all Flink apps from the JDK image.
The suggestion is outlined here: T400296#11102186

Essentially, we would need to:

  1. Add the following override to the mw-content-history-reconcile-enrich deployment here:
debug:
  enabled: true

This will enable a shared PID namespace between all containers within each pod. It will also start a wmfdebug container, which has a lot of useful tools, but no jmap.

  2. Redeploy the application, because the shared PID namespace can only be enabled at pod creation time.
  3. Add an ephemeral container to the pod like this:
kubectl get pod flink-app-production-7f45bf699b-65cc4 -n mw-content-history-reconcile-enrich -o json | jq '
  .spec.ephemeralContainers += [{
      "name": "jdk-debugger",
      "image": "docker-registry.discovery.wmnet/openjdk-8-jdk:8.452-1-20250810",
      "stdin": true,
      "tty": true,
      "command": ["bash"],
      "securityContext": {
        "allowPrivilegeEscalation": false,
        "capabilities": { "drop": ["ALL"] },
        "runAsNonRoot": true,
        "runAsUser": 1000,
        "seccompProfile": { "type": "RuntimeDefault" }
      }
    }]
' | kubectl replace --raw "/api/v1/namespaces/mw-content-history-reconcile-enrich/pods/flink-app-production-7f45bf699b-65cc4/ephemeralcontainers" -f -
  4. Attach to the running jdk-debugger container like this: kubectl attach -it flink-app-production-7f45bf699b-65cc4 -c jdk-debugger

I haven't managed to do much testing yet, but it looks like it should work, and personally I think it would be preferable to building all Flink apps from the JDK image.
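
For reference, a hedged sketch of what one might run from the attached jdk-debugger container once the shared PID namespace makes the JobManager JVM visible (the PID placeholder is illustrative; jmap may need to run as the same UID as the target process):

# From inside the jdk-debugger ephemeral container:
jps -l                                                         # list visible JVMs and find the JobManager PID
jmap -histo <jobmanager-pid> | head -30                        # object histogram: top classes by instance count
jmap -dump:live,format=b,file=/tmp/jm.hprof <jobmanager-pid>   # full heap dump for offline analysis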

Ahoelzl updated Other Assignee, added: Ottomata.

I'm going to try to upgrade mw-content-history-reconcile-enrich-next to Flink 1.20 to see if it magically fixes the issue, but I won't do any work migrating off deprecated config in this ticket. Even if the issue doesn't get fixed, the update includes a feature that allows us to profile the JobManager using the Flink Web UI, which could be useful.

Side note: Updating to 1.20 would mean running on experimental support for Java 17 (since, from what I can tell, we only have 1.20 on Java 17), which would progress T404340.
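
If the upgrade lands, the Web UI profiler can presumably be switched on via the REST profiling option that upstream added in Flink 1.19 (FLIP-375); a hedged sketch, with the values structure assumed:

flinkConfiguration:
  # Hedged sketch: enable on-demand profiling from the Flink Web UI (Flink >= 1.19).
  rest.profiling.enabled: "true"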

Updating to 1.20 would mean running it on experimental support for Java 17 (since from what I can tell we only have 1.20 on Java 17)

Hm, from T404340 and https://docker-registry.wikimedia.org/flink/tags/ it does look like we only have 1.20.2 on Java 17, but I would hope that isn't a hard requirement. Okay, if it works, let's go for it, but it seems a little risky?

I'll ask folks in that ticket if we could add the Java version to the image name so we can choose more easily.

Side note: Updating to 1.20 would mean running on experimental support for Java 17 (since, from what I can tell, we only have 1.20 on Java 17), which would progress T404340.

I think it would be nice to test out the Java 17 base image, but if you don't want to mix too many variables while troubleshooting your memory issues, you still have 1.20.1-wmf1-20250907, which runs Flink 1.20.1 on Java 11.

Could we please check whether this issue is still affecting us after the kernel upgrade carried out in T405361: JVMs get assigned a max heap size of 1/4th of the node memory instead of 1/4th of the pod max memory in dse-k8s-eqiad?
We believe that the -Xmx values for JVMs were being mistakenly set far higher than the RAM available in the container, but that should now be fixed.
Thanks.
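
A hedged way to verify that from inside a pod (the container name flink-main-container is the operator's default and an assumption here):

# The JVM's computed max heap should now track the cgroup limit, not 1/4 of node memory:
kubectl exec -n mw-content-history-reconcile-enrich <pod-name> -c flink-main-container -- \
  java -XX:+PrintFlagsFinal -version | grep -i maxheapsize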

Hilarious ending to this saga! 😄

Change #1259975 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] dse-k8s - unset some Flink JobManager off-heap.size override

https://gerrit.wikimedia.org/r/1259975

Change 1259975 (dse-k8s - unset some Flink JobManager off-heap.size override) should be deployed to remove the override that this task added while debugging the OOMs.

Change #1259975 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s - unset some Flink JobManager off-heap.size override

https://gerrit.wikimedia.org/r/1259975