
Can't deploy machinetranslation due to exceeding resource quotas
In Progress · High · Public · 8 Estimated Story Points

Description

So, this is weird:

rzl@deploy2002:~$ kube-env machinetranslation eqiad
rzl@deploy2002:~$ kubectl get pod
NAME                                             READY   STATUS                   RESTARTS        AGE
machinetranslation-production-76d6f759cf-6s8mp   0/3     ContainerStatusUnknown   134 (35d ago)   42d
machinetranslation-production-76d6f759cf-bdtlj   0/3     Completed                64              12d
machinetranslation-production-76d6f759cf-dlmdx   3/3     Running                  0               8m8s
machinetranslation-production-76d6f759cf-g8hsf   0/3     ContainerStatusUnknown   145 (35d ago)   42d
machinetranslation-production-76d6f759cf-h5x74   3/3     Running                  4 (35d ago)     35d
machinetranslation-production-76d6f759cf-zvj8k   3/3     Running                  0               4m45s

Some highlights from kubectl describe pod machinetranslation-production-76d6f759cf-6s8mp:

Name:             machinetranslation-production-76d6f759cf-6s8mp
[...]
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 86613984216, available: 83696140Ki. Container machinetranslation-production was using 15553756Ki, request is 0, has larger consumption of ephemeral-storage. Container machinetranslation-production-tls-proxy was using 3320Ki, request is 0, has larger consumption of ephemeral-storage. Container production-metrics-exporter was using 32Ki, request is 0, has larger consumption of ephemeral-storage.
[...]
Containers:
  machinetranslation-production:
    [...]
    State:           Terminated
      Reason:        ContainerStatusUnknown
      Message:       The container could not be located when the pod was terminated
      Exit Code:     137
      Started:       Mon, 01 Jan 0001 00:00:00 +0000
      Finished:      Mon, 01 Jan 0001 00:00:00 +0000
    Last State:      Terminated
      Reason:        OOMKilled
      Exit Code:     137
      Started:       Tue, 21 Oct 2025 01:31:23 +0000
      Finished:      Tue, 21 Oct 2025 03:05:13 +0000
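One thing that makes the eviction message above easy to misread: the threshold is given in bytes while "available" is given in Ki (1024-byte units). A quick sanity check with the numbers copied from the message confirms the node really was below the ephemeral-storage eviction threshold:

```shell
# Numbers copied from the eviction message above.
# Threshold is in bytes; "available" is reported in Ki (1024-byte units).
threshold_bytes=86613984216
available_ki=83696140
available_bytes=$((available_ki * 1024))
echo "available_bytes=${available_bytes}"
# 85704847360 < 86613984216: the kubelet was under ephemeral-storage
# pressure, and pods with an ephemeral-storage request of 0 (like these,
# per the message) are among the first eviction candidates.
echo "below_threshold=$((available_bytes < threshold_bytes))"
```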

The same service in codfw has nothing in ContainerStatusUnknown but a fair number of restarts due to OOMKilled:

rzl@deploy2002:~$ kube-env machinetranslation codfw
rzl@deploy2002:~$ kubectl get pod
NAME                                             READY   STATUS    RESTARTS         AGE
machinetranslation-production-76d6f759cf-8xwtf   3/3     Running   91 (6h50m ago)   12d
machinetranslation-production-76d6f759cf-bg5mp   3/3     Running   326 (31m ago)    26d
machinetranslation-production-76d6f759cf-pl9qq   3/3     Running   93 (35d ago)     40d
machinetranslation-production-76d6f759cf-th6dd   3/3     Running   84 (4h5m ago)    12d

I came across this while trying to deploy the service in eqiad for envoy upgrades, which hit the helm deadline and rolled back (hence the youngest pods in that first get pod list). The risk here, apart from general instability, is that we might not be able to deploy the service in an emergency -- although there's a pretty good chance deleting the pods would work. I left them in place so we can investigate and fix the cause.
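If we do end up needing to clear the evicted pods, the usual approach is deleting the ones stuck in a failed state. As a sketch (run here against a saved copy of the `kubectl get pod` output above, so it is safe to try anywhere; piping the names into `kubectl delete pod` would do the actual cleanup), the candidates can be picked out with awk:

```shell
# Select pods whose STATUS column is ContainerStatusUnknown from saved
# `kubectl get pod` output. Field $3 is STATUS; the header row is skipped
# automatically because its $3 is the literal string "STATUS".
awk '$3 == "ContainerStatusUnknown" { print $1 }' <<'EOF'
NAME                                             READY   STATUS                   RESTARTS        AGE
machinetranslation-production-76d6f759cf-6s8mp   0/3     ContainerStatusUnknown   134 (35d ago)   42d
machinetranslation-production-76d6f759cf-bdtlj   0/3     Completed                64              12d
machinetranslation-production-76d6f759cf-dlmdx   3/3     Running                  0               8m8s
machinetranslation-production-76d6f759cf-g8hsf   0/3     ContainerStatusUnknown   145 (35d ago)   42d
EOF
# Prints the -6s8mp and -g8hsf pod names, one per line.
```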

Event Timeline

RLazarus triaged this task as High priority.

@RLazarus We last deployed MinT on 06 Nov with a37ece7cde26383bba8b3f22519635f3e3b95da5. Is it possible that the resource allocation is mismatched after that?

ContainerStatusUnknown usually happens when a node is down or otherwise in trouble, which seems to have been the case for the two nodes the evicted pods were running on (from kubectl describe <pod>). That can be reviewed in logstash: https://logstash.wikimedia.org/goto/5cb7a0ac7771341e6fc65e8e22de5809

The scheduler will not take these pods into account when counting the replicas, so there is no immediate issue here. They should also not hinder deployments. The deployments were most likely interrupted by too-high resource allocations (exceeding the namespace quota):

https://logstash.wikimedia.org/goto/0ebfeb3f5ba4f7af6ece2d87dbdf6abd

Error creating: pods "machinetranslation-production-76d6f759cf-kmfgt" is forbidden: exceeded quota: quota-compute-resources, requested: limits.memory=33368Mi,requests.memory=32968Mi, used: limits.memory=133472Mi,requests.memory=131872Mi, limited: limits.memory=150Gi,requests.memory=150Gi
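The arithmetic in that error is worth spelling out: the used figure of 133472Mi is exactly four pods at the per-pod limit of 33368Mi, so the namespace is already at full replica count, and the extra surge pod that a rolling update creates is precisely what pushes past the 150Gi quota. A check with the numbers from the message:

```shell
# All figures in Mi, taken from the quota error above (150Gi = 153600Mi).
quota_mi=$((150 * 1024))   # limits.memory quota for the namespace
used_mi=133472             # currently used by running pods
new_pod_mi=33368           # limits.memory requested by the new pod
echo "per_pod=$((used_mi / 4))"        # 33368: used is exactly 4 such pods
echo "peak=$((used_mi + new_pod_mi))"  # 166840: usage during the surge
echo "over_by=$((used_mi + new_pod_mi - quota_mi))"  # 13240Mi over quota
```

So either the per-pod memory limit has to come down or the quota has to go up before a rolling deploy can fit a fifth pod.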

Helm timed out again when I tried to deploy machinetranslation for the next round of envoy upgrades. I'll retitle this task, as the ContainerStatusUnknown pods aren't the cause of the problem, but we still can't run helmfile apply and we should fix that.

RLazarus renamed this task from machinetranslation eqiad pods in state ContainerStatusUnknown to Can't deploy machinetranslation due to exceeding resource quotas. Dec 19 2025, 2:15 AM

I'm still debugging; probably the best way to check is by reverting to the original memory allocation. Patch is coming up.

KartikMistry changed the task status from Open to In Progress. Jan 29 2026, 7:53 AM
KartikMistry claimed this task.


Update: I'm deploying the memory-allocation revert to staging and will report back here.

KartikMistry set the point value for this task to 8. Mon, Mar 2, 12:06 PM

Change #1248388 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] WIP: machinetranslation: Reduce GUNICORN_WORKERS

https://gerrit.wikimedia.org/r/1248388

Change #1248388 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Optimize model loading and memory footprints

https://gerrit.wikimedia.org/r/1248388
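For context on why reducing GUNICORN_WORKERS helps: gunicorn preforks one process per worker, and unless model memory is shared across processes each worker loads its own copy of the translation models, so pod memory grows roughly linearly with the worker count. A rough sketch with hypothetical footprint numbers (the real per-worker cost for MinT is not stated in this task):

```shell
# Hypothetical numbers for illustration only: per_worker_mi is an assumed
# per-worker footprint (including loaded models), base_mi an assumed shared
# overhead. Pod memory ~= base + workers * per_worker.
per_worker_mi=8000
base_mi=1000
for workers in 4 3 2; do
  echo "workers=${workers} approx_mem=$((base_mi + workers * per_worker_mi))Mi"
done
# Dropping from 4 to 2 workers roughly halves the model-driven footprint,
# which is what lets the per-pod limit (and thus the quota math) come down.
```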

Mentioned in SAL (#wikimedia-operations) [2026-03-12T05:24:44Z] <kart_> staging: machinetranslation: Optimize model loading and memory footprints (T411058)