While working on Istio on the ml-serve-eqiad cluster, I noticed the following error in the kube-apiserver logs:
```
Jul 01 09:33:07 ml-serve-ctrl1002 kube-apiserver[421]: E0701 09:33:07.002314 421 dispatcher.go:129] failed calling webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: dial tcp 10.64.77.73:443: i/o timeout
```
The IP belongs to the Service that istiod creates to expose its validation webhook:
```
elukey@ml-serve-ctrl1001:~$ kubectl describe svc -n istio-system
Name: istiod
Namespace: istio-system
[..]
Selector: app=istiod,istio=pilot
Type: ClusterIP
IP: 10.64.77.73 <=====================================
Port: grpc-xds 15010/TCP
TargetPort: 15010/TCP
Endpoints: 10.64.79.74:15010
Port: https-dns 15012/TCP
TargetPort: 15012/TCP
Endpoints: 10.64.79.74:15012
Port: https-webhook 443/TCP <==========================
TargetPort: 15017/TCP
Endpoints: 10.64.79.74:15017 <============================
Port: http-monitoring 15014/TCP
TargetPort: 15014/TCP
Endpoints: 10.64.79.74:15014
Session Affinity: None
Events: <none>
```
```
elukey@ml-serve-ctrl1001:~$ kubectl describe ep -n istio-system
Name: istiod
Namespace: istio-system
[..]
Subsets:
Addresses: 10.64.79.74
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
http-monitoring 15014 TCP
https-webhook 15017 TCP <========================
grpc-xds 15010 TCP
https-dns 15012 TCP
```
The kube-apiserver needs to be able to reach 10.64.77.73, but that ClusterIP is routable only on the k8s worker nodes where calico runs. The Kubeflow stack is full of webhooks, so it would be great if we could find a way to add calico to the master nodes to provide the extra routing needed.
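To confirm that the timeout is a routing problem rather than istiod being down, a quick TCP probe run from a master node vs. a worker node should show the difference. A minimal sketch (the host/port are the ones from the Service output above):

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both "connection refused" and "i/o timeout" style failures.
        return False


# On a worker node this should print True; on a master node (no calico,
# hence no route to the ClusterIP) it prints False after the timeout.
print(tcp_reachable("10.64.77.73", 443))
```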
Some caveats:
* The bird daemon runs in a container in our current worker setup, so it is not sufficient to add `profile::calico::kubernetes` to the masters and set up BGP peering with the cr* routers.
* We could run bird in a docker container launched by systemd on master nodes, but it would be yet another way of starting it (and another thing to maintain).
* We could run bird the same way we do on worker nodes, but we'd also need to deploy `kubelet` on the master nodes (it doesn't currently run there).
* Some tweaks in `deployment-charts` may be needed to deploy calico on master nodes as well.
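To make the "bird in a docker container launched by systemd" option concrete, the unit could look roughly like this. This is an entirely hypothetical sketch: the unit name, image, config mount, and flags are placeholders, not something we ship today.

```ini
# /etc/systemd/system/bird-calico.service (hypothetical sketch)
[Unit]
Description=BIRD BGP daemon for calico (docker)
After=docker.service
Requires=docker.service

[Service]
# Remove any stale container from a previous run before starting.
ExecStartPre=-/usr/bin/docker rm -f bird-calico
# --net=host so BIRD can speak BGP with the node's own addresses.
ExecStart=/usr/bin/docker run --name bird-calico --net=host \
    -v /etc/calico/bird:/etc/bird:ro \
    our-registry/bird:placeholder-tag
ExecStop=/usr/bin/docker stop bird-calico
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

This illustrates the maintenance concern from the caveat above: it duplicates container lifecycle logic that kubelet would otherwise handle for us.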