While following up on some high-latency alerts for the k8s API, I recently noticed the following weird CPU pattern on the ml-serve-ctrl hosts (Ganeti VMs):
The increase seems to have started when we deployed kubelets on the controller nodes to allow routing between the workers and the controllers (for webhooks, etc.).
From https://grafana.wikimedia.org/d/000000435/kubernetes-api, it seems that several things match the CPU usage trend:
- k8s API latency increasing
- etcd latency increasing
From https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node, other things match as well, mostly metrics related to sync operations for the kube-* daemons (missing metrics should become available after https://gerrit.wikimedia.org/r/c/operations/puppet/+/707235).
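To eyeball the correlation outside of Grafana, something like the following sketch can pull the apiserver p99 latency straight from the Prometheus HTTP API and compare it against the CPU trend. PROM_URL is a placeholder (not a real endpoint), and the metric name assumes a reasonably recent apiserver (>= 1.14):

```python
#!/usr/bin/env python3
"""Rough sketch: fetch the k8s apiserver p99 request latency from
Prometheus over the last 24h, per verb, to eyeball against the
ml-serve-ctrl CPU trend. PROM_URL is a placeholder."""
import time
import requests

PROM_URL = "http://prometheus.example.org"  # placeholder endpoint
QUERY = ('histogram_quantile(0.99, sum by (le, verb) '
         '(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])))')

end = time.time()
start = end - 24 * 3600  # same 24h window as the Grafana dashboards

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "300"},
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    verb = series["metric"].get("verb", "?")
    values = [float(v) for _, v in series["values"] if v != "NaN"]
    if values:
        # worst p99 seen per verb over the window
        print(f"{verb:8s} worst p99 over 24h: {max(values):.3f}s")
```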
Last but not least, on the etcd side I noticed logs like:
Jul 22 15:46:37 ml-etcd1003 etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:7" took too long (321.038398ms) to execute
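To get a sense of how frequent and how bad these slow requests are, a small filter over the journal output can summarize them. This is a sketch assuming the log line shape above (durations in ms or s), fed e.g. via `journalctl -u etcd`:

```python
#!/usr/bin/env python3
"""Summarize etcd "took too long" warnings from stdin, e.g.:
journalctl -u etcd | python3 slow_reqs.py"""
import re
import sys
import statistics

# Go durations in these warnings show up as e.g. "321.038398ms" or "1.2s"
PAT = re.compile(r'took too long \((?P<val>[\d.]+)(?P<unit>ms|s)\) to execute')

lat_ms = []
for line in sys.stdin:
    m = PAT.search(line)
    if m:
        v = float(m.group("val"))
        lat_ms.append(v if m.group("unit") == "ms" else v * 1000.0)

if lat_ms:
    lat_ms.sort()
    n = len(lat_ms)
    print(f"slow requests: {n}")
    print(f"median: {statistics.median(lat_ms):.1f}ms  "
          f"p99: {lat_ms[min(n - 1, int(n * 0.99))]:.1f}ms  "
          f"max: {lat_ms[-1]:.1f}ms")
else:
    print("no 'took too long' lines found")
```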
The etcd metrics were added only yesterday (we were missing the config to collect them) and are now available at https://grafana-rw.wikimedia.org/d/Ku6V7QYGz/jayme-etcd3?orgId=1&var-site=codfw&var-cluster=ml_etcd&var-instance_prefix=ml-etcd&from=now-24h&to=now
Things done so far:
- rolling restart of the kube-* daemons
- removal of Istio from eqiad (but codfw shows the same pattern and we haven't deployed Istio there)
- addition of more Ganeti vCPUs (2 -> 4)
- move from the DRBD disk template to plain for ml-serve-ctrl1002
The last one is still in progress; I'll check latency/CPU after the weekend to see if anything has improved. The rationale for the choice was T224556, basically the same thing we did for the etcd nodes. To add the kubelets I had to mount a /dev/vdb disk (10G) for Docker on all the ml-serve-ctrl VMs. What I want to see is whether Docker's underlying disk somehow causes lag when DRBD replicates it, even if what runs on top shouldn't issue a ton of fsync-like calls (see the sketch below).
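A minimal fsync latency probe could help confirm or rule that out. This sketch assumes /docker is where /dev/vdb is mounted (TARGET_DIR is a placeholder; adjust to the real mount point); it mimics the replication-sensitive pattern of small appends each followed by an fsync, and reports the latency distribution:

```python
#!/usr/bin/env python3
"""Minimal fsync latency probe for the Docker disk. TARGET_DIR is a
placeholder for the real /dev/vdb mount point."""
import os
import time

TARGET_DIR = "/docker"   # placeholder: wherever /dev/vdb is mounted
ITERATIONS = 500
path = os.path.join(TARGET_DIR, "fsync-probe.tmp")

lat_ms = []
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
try:
    for _ in range(ITERATIONS):
        os.write(fd, b"x" * 512)   # small append, WAL-like
        t0 = time.monotonic()
        os.fsync(fd)               # the call DRBD replication would slow down
        lat_ms.append((time.monotonic() - t0) * 1000.0)
finally:
    os.close(fd)
    os.unlink(path)

lat_ms.sort()
n = len(lat_ms)
print(f"fsync p50={lat_ms[n // 2]:.2f}ms  "
      f"p99={lat_ms[min(n - 1, int(n * 0.99))]:.2f}ms  max={lat_ms[-1]:.2f}ms")
```

Running it once on a DRBD-backed VM and once on a plain-disk one should make any replication overhead obvious.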
The other thing to check is why etcd shows a latency regression as well. From VM-level metrics I don't see any contention or clear bottleneck, so I am not sure whether it is a client-side problem (namely the apiserver being extremely slow in communicating with etcd) or something else; a direct probe like the sketch below might help narrow it down.
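One way to split server-side from client-side latency is to hit etcd's /health endpoint directly from a controller node and time it, bypassing the apiserver entirely. The endpoint URL and CA bundle path below are placeholders for the real ml-etcd setup:

```python
#!/usr/bin/env python3
"""Time direct requests to etcd's /health endpoint to separate
server-side latency from apiserver-side slowness. URL and CA path
are placeholders."""
import time
import statistics
import requests

ETCD_URL = "https://ml-etcd1003.example:2379/health"  # placeholder endpoint
CA_CERT = "/etc/ssl/certs/ca-certificates.crt"        # placeholder CA bundle

samples_ms = []
for _ in range(50):
    t0 = time.monotonic()
    r = requests.get(ETCD_URL, verify=CA_CERT, timeout=5)
    samples_ms.append((time.monotonic() - t0) * 1000.0)
    r.raise_for_status()
    time.sleep(0.2)

print(f"/health over 50 probes: median={statistics.median(samples_ms):.1f}ms  "
      f"max={max(samples_ms):.1f}ms")
```

If these direct probes stay fast while the apiserver keeps logging slow range requests, the bottleneck is more likely on the client side (or in the apiserver itself) than in etcd.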