So, this is weird:
rzl@deploy2002:~$ kube-env machinetranslation eqiad
rzl@deploy2002:~$ kubectl get pod
NAME                                             READY   STATUS                   RESTARTS        AGE
machinetranslation-production-76d6f759cf-6s8mp   0/3     ContainerStatusUnknown   134 (35d ago)   42d
machinetranslation-production-76d6f759cf-bdtlj   0/3     Completed                64              12d
machinetranslation-production-76d6f759cf-dlmdx   3/3     Running                  0               8m8s
machinetranslation-production-76d6f759cf-g8hsf   0/3     ContainerStatusUnknown   145 (35d ago)   42d
machinetranslation-production-76d6f759cf-h5x74   3/3     Running                  4 (35d ago)     35d
machinetranslation-production-76d6f759cf-zvj8k   3/3     Running                  0               4m45s
Some highlights from kubectl describe pod machinetranslation-production-76d6f759cf-6s8mp:
Name: machinetranslation-production-76d6f759cf-6s8mp
[...]
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 86613984216, available: 83696140Ki. Container machinetranslation-production was using 15553756Ki, request is 0, has larger consumption of ephemeral-storage. Container machinetranslation-production-tls-proxy was using 3320Ki, request is 0, has larger consumption of ephemeral-storage. Container production-metrics-exporter was using 32Ki, request is 0, has larger consumption of ephemeral-storage.
[...]
Containers:
machinetranslation-production:
[...]
State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was terminated
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 21 Oct 2025 01:31:23 +0000
Finished: Tue, 21 Oct 2025 03:05:13 +0000
The same service in codfw has nothing in ContainerStatusUnknown but a fair number of restarts due to OOMKilled:
rzl@deploy2002:~$ kube-env machinetranslation codfw
rzl@deploy2002:~$ kubectl get pod
NAME                                             READY   STATUS    RESTARTS         AGE
machinetranslation-production-76d6f759cf-8xwtf   3/3     Running   91 (6h50m ago)   12d
machinetranslation-production-76d6f759cf-bg5mp   3/3     Running   326 (31m ago)    26d
machinetranslation-production-76d6f759cf-pl9qq   3/3     Running   93 (35d ago)     40d
machinetranslation-production-76d6f759cf-th6dd   3/3     Running   84 (4h5m ago)    12d
I came across this while trying to deploy the service in eqiad for envoy upgrades, which hit the helm deadline and rolled back (hence the youngest pods in that first get pod list). The risk here, apart from general instability, is that we might not be able to deploy the service in an emergency -- although there's a pretty good chance deleting the pods would work. I left them in place so we can investigate and fix the cause.
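For anyone picking this up, here is a rough sketch of the arithmetic behind the eviction message quoted above. The message mixes units (the threshold is in plain bytes, the other figures are in Ki), so the comparison is easy to misread. The helper name to_bytes is made up for illustration, and it only handles plain byte counts and the Ki suffix that appear in this message, not the full Kubernetes quantity syntax.

```python
import re

# Eviction message copied verbatim from the describe output above.
MESSAGE = (
    "The node was low on resource: ephemeral-storage. "
    "Threshold quantity: 86613984216, available: 83696140Ki. "
    "Container machinetranslation-production was using 15553756Ki, request is 0, "
    "has larger consumption of ephemeral-storage. "
    "Container machinetranslation-production-tls-proxy was using 3320Ki, request is 0, "
    "has larger consumption of ephemeral-storage. "
    "Container production-metrics-exporter was using 32Ki, request is 0, "
    "has larger consumption of ephemeral-storage."
)

KIB = 1024
GIB = 1024 ** 3


def to_bytes(quantity: str) -> int:
    """Convert a quantity to bytes (hypothetical helper: plain bytes or Ki only)."""
    if quantity.endswith("Ki"):
        return int(quantity[:-2]) * KIB
    return int(quantity)


# Pull the node-level numbers out of the message.
threshold = to_bytes(re.search(r"Threshold quantity: (\d+(?:Ki)?)", MESSAGE).group(1))
available = to_bytes(re.search(r"available: (\d+(?:Ki)?)", MESSAGE).group(1))

# Per-container ephemeral-storage usage, keyed by container name.
usage = {
    name: to_bytes(qty)
    for name, qty in re.findall(r"Container (\S+) was using (\d+(?:Ki)?)", MESSAGE)
}

shortfall = threshold - available
biggest = max(usage, key=usage.get)

print(f"threshold: {threshold / GIB:.2f} GiB")
print(f"available: {available / GIB:.2f} GiB")
print(f"shortfall: {shortfall / GIB:.2f} GiB")
for name, used in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {used / GIB:.2f} GiB")
print(f"largest consumer: {biggest}")
```

In bytes, the node had about 85.7 GB available against an 86.6 GB threshold, so it was under the line by roughly 0.9 GB. The main machinetranslation-production container alone was using about 14.8 GiB of ephemeral storage with a request of 0, and the other two containers are negligible by comparison.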