
Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99
Closed, Resolved · Public

Description

There is an overdue (>7 days) warning alert:

Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99:
https://alerts.wikimedia.org/?q=alertname%3DKubeletOperationalLatency&q=team%3Dsre&q=%40receiver%3Ddefault

Could you please fix the underlying cause or adjust the alert? Please also tag the alert with your team name if not already done.

Event Timeline

Very interesting:

Aug 09 00:09:33 ml-serve1001 kubelet[3980749]: E0809 00:09:33.603646 3980749 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="3a00111fb1d6964690d6eb078d1b4759de20a10a2491a8dac65211de01252cbb" cmd=[/usr/bin/check-status -r]
root@deploy1002:~# kubectl get pod -o jsonpath='{range .items[?(@.status.containerStatuses[].containerID=="docker://3a00111fb1d6964690d6eb078d1b4759de20a10a2491a8dac65211de01252cbb")]}{.metadata.name}{end}' -A
calico-kube-controllers-b47dfd47-jdtx4
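The one-liner above can be wrapped in a small helper for reuse. A minimal sketch, assuming the same Docker runtime prefix as in the logs; the function name is hypothetical:

```shell
# pod_for_container <container-id>: print the name of the pod (searched in
# all namespaces) whose containerStatuses reference that Docker container ID.
pod_for_container() {
  cid="$1"
  kubectl get pod -A -o jsonpath="{range .items[?(@.status.containerStatuses[].containerID==\"docker://${cid}\")]}{.metadata.name}{'\n'}{end}"
}
```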

Change 947376 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: increase resources for calico kube-controllers in ml-serve

https://gerrit.wikimedia.org/r/947376

Change 947376 merged by Elukey:

[operations/deployment-charts@master] admin_ng: increase resources for calico kube-controllers in ml-serve

https://gerrit.wikimedia.org/r/947376

Throttling is gone, but I still see elevated exec_sync latency and errors from the kubelet:

Aug 09 15:49:02 ml-serve1001 kubelet[3980749]: E0809 15:49:02.604683 3980749 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = container not running (3a00111fb1d6964690d6eb078d1b4759de20a10a2491a8dac65211de01252cbb)" containerID="3a00111fb1d6964690d6eb078d1b4759de20a10a2491a8dac65211de01252cbb" cmd=[/usr/bin/check-status -r]
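Errors like the one above can be tallied from the kubelet journal to see whether they recur. A minimal sketch; the helper name is hypothetical and it reads log lines from stdin (e.g. piped from journalctl):

```shell
# Count ExecSync failures in kubelet log output fed on stdin, e.g.:
#   sudo journalctl -u kubelet.service --since "1 hour ago" | count_execsync_errors
count_execsync_errors() {
  grep -c 'ExecSync cmd from runtime service failed'
}
```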

Next steps:

  • The above error msg should auto-resolve (hopefully), so the ml-serve1001-specific alert should go away.
  • Will it come back on another ml-serve1* node? (The new calico pods have been relocated after the deployment.)

Mentioned in SAL (#wikimedia-operations) [2023-08-11T08:31:48Z] <elukey> restart kubelet on ml-serve1001 - T343900

elukey claimed this task.

After the kubelet restart the metric cleared!

Change 948091 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: increase resources for calico on wikikube clusters

https://gerrit.wikimedia.org/r/948091

I have done ml2002 and ml2003 today (two machines, to force some pods back onto 2002 and check that it works properly). So far, everything seems fine.

Steps I took:

On deployXXXX, useful things to run in watch (after kube_env ml-serve-XXXX):

  • See node status: kubectl get nodes
  • When waiting for drain to complete: kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet

Steps for drain/resize/undrain:

On ml-serveXXXX, verify you're dealing with a machine that still has a small partition (<30G): df -h /var/lib/kubelet/

On deployXXXX:

  • Drain node: kubectl drain --ignore-daemonsets --delete-emptydir-data ml-serveXXXX.codfw.wmnet
  • Wait until only daemonsets are running (istio etc.); see the get pods command above.

On ml-serveXXXX:

  • Optionally watch the number of docker processes (it will not reach 0, due to daemonsets): watch 'sudo docker ps | wc -l'
  • Stop kubelet, resize partition and filesystem (does not need to be unmounted):
sudo systemctl stop kubelet
sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
  • Verify increased size:
$ df -h /var/lib/kubelet
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-kubelet  107G  480K  102G   1% /var/lib/kubelet
  • Start kubelet: sudo systemctl start kubelet.service
  • Check kubelet logs: sudo journalctl -xefu kubelet.service. There must be no ExecSync errors! Keep watching this log while doing the second machine, to see if there are errors as pods are moved.

On deployXXXX: Undrain/uncordon: kubectl uncordon ml-serveXXXX.codfw.wmnet

If you do two machines one after the other, the first machine will receive new pods during the drain of the second; you can use this to make sure everything is working correctly.
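The drain/resize/undrain steps above can be collected into a single plan. A minimal sketch that only prints the commands (a dry-run generator, not something run on the real hosts); the function name and example node are hypothetical, and the steps run partly on the deploy host and partly on the worker:

```shell
# Print the resize plan for one worker node. The commands themselves match
# the procedure described above; +80g is the size used in this task.
plan_resize() {
  node="$1"
cat <<EOF
# on the deploy host (after kube_env for the right cluster):
kubectl drain --ignore-daemonsets --delete-emptydir-data ${node}
# on ${node} itself, once only daemonsets remain:
sudo systemctl stop kubelet
sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
sudo systemctl start kubelet.service
# back on the deploy host:
kubectl uncordon ${node}
EOF
}

plan_resize ml-serve2002.codfw.wmnet
```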

I might be missing something here, but what issues did you have with the 28G kubelet filesystem? Are you making a lot of use of host path mounts and filling that up?

The problem is only really relevant for LLMs (Large Language Models), since they need more local disk space. Or at least the specific ones we tried did. We have plenty of disk space on our workers so far, so having a bigger kubelet partition/fs is quite feasible.

T339231 has more discussion on the matter.

Change 948091 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: increase resources for calico on wikikube clusters

https://gerrit.wikimedia.org/r/948091


Ah, okay. That comment makes way more sense on that phab task now :-) thanks