While working on istio on the ml-serve-eqiad cluster, I noticed the following error in the kube-api logs:
Jul 01 09:33:07 ml-serve-ctrl1002 kube-apiserver[421]: E0701 09:33:07.002314 421 dispatcher.go:129] failed calling webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: dial tcp 10.64.77.73:443: i/o timeout
The IP is the ClusterIP of the Service that istiod creates to expose its validation webhook:
elukey@ml-serve-ctrl1001:~$ kubectl describe svc -n istio-system
Name:              istiod
Namespace:         istio-system
[..]
Selector:          app=istiod,istio=pilot
Type:              ClusterIP
IP:                10.64.77.73 <=====================================
Port:              grpc-xds  15010/TCP
TargetPort:        15010/TCP
Endpoints:         10.64.79.74:15010
Port:              https-dns  15012/TCP
TargetPort:        15012/TCP
Endpoints:         10.64.79.74:15012
Port:              https-webhook  443/TCP <==========================
TargetPort:        15017/TCP
Endpoints:         10.64.79.74:15017 <============================
Port:              http-monitoring  15014/TCP
TargetPort:        15014/TCP
Endpoints:         10.64.79.74:15014
Session Affinity:  None
Events:            <none>

elukey@ml-serve-ctrl1001:~$ kubectl describe ep -n istio-system
Name:         istiod
Namespace:    istio-system
[..]
Subsets:
  Addresses:          10.64.79.74
  NotReadyAddresses:  <none>
  Ports:
    Name             Port   Protocol
    ----             ----   --------
    http-monitoring  15014  TCP
    https-webhook    15017  TCP <========================
    grpc-xds         15010  TCP
    https-dns        15012  TCP
The kube-api needs to be able to reach 10.64.77.73, but that IP is reachable only from the k8s worker nodes, where calico runs. The Kubeflow stack is full of webhooks, so it would be great if we could find a way to add calico to the master nodes to enable the extra routing needed.
Some caveats:
- The bird daemon runs in a container (calico-node) in our current worker setup, so it is not sufficient to add profile::calico::kubernetes and add BGP peering to cr* routers.
- We could run calico-node as a Docker container launched by systemd on master nodes, but it would be yet another way of starting it (and another thing to maintain).
- We could run calico-node similar to how we run it on worker nodes, but we'd also need to deploy kubelet on master nodes (currently not running on them).
- Some tweaks in deployment-charts may be needed to deploy calico on master nodes as well.
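
Whichever option is chosen, a quick way to check whether a given node can reach the webhook ClusterIP is a plain TCP dial with a timeout, which is roughly the step that fails in the kube-apiserver log above. The following is a minimal sketch, not part of any existing tooling: the helper name `can_dial` and the default timeout are assumptions made for illustration, and it only tests TCP reachability, not TLS or the webhook itself.

```python
import socket


def can_dial(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only checks that the TCP handshake completes,
    it does not speak TLS or call the /validate endpoint.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on a master node, `can_dial("10.64.77.73", 443)` would be expected to return False for as long as the i/o timeout in the log persists, and True once the node can route to the service IP.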