
Add the possibility to deploy calico on kubernetes master nodes
Closed, Resolved (Public)

Description

While working on istio on the ml-serve-eqiad cluster, I noticed the following error in the kube-api logs:

Jul 01 09:33:07 ml-serve-ctrl1002 kube-apiserver[421]: E0701 09:33:07.002314     421 dispatcher.go:129] failed calling webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: dial tcp 10.64.77.73:443: i/o timeout

The IP is part of a svc that istiod creates to expose a validation webhook service:

elukey@ml-serve-ctrl1001:~$ kubectl describe svc -n istio-system
Name:              istiod
Namespace:         istio-system
[..]
Selector:          app=istiod,istio=pilot
Type:              ClusterIP

IP:                10.64.77.73           <=====================================

Port:              grpc-xds  15010/TCP
TargetPort:        15010/TCP
Endpoints:         10.64.79.74:15010
Port:              https-dns  15012/TCP
TargetPort:        15012/TCP
Endpoints:         10.64.79.74:15012
Port:              https-webhook  443/TCP  <==========================
TargetPort:        15017/TCP
Endpoints:         10.64.79.74:15017    <============================
Port:              http-monitoring  15014/TCP
TargetPort:        15014/TCP
Endpoints:         10.64.79.74:15014
Session Affinity:  None
Events:            <none>

elukey@ml-serve-ctrl1001:~$ kubectl describe ep -n istio-system
Name:         istiod
Namespace:    istio-system
[..]
Subsets:
  Addresses:          10.64.79.74
  NotReadyAddresses:  <none>
  Ports:
    Name             Port   Protocol
    ----             ----   --------
    http-monitoring  15014  TCP
    https-webhook    15017  TCP           <========================
    grpc-xds         15010  TCP 
    https-dns        15012  TCP

The kube-api needs to be able to call 10.64.77.73, but that IP is reachable only from the k8s worker nodes where calico runs. The Kubeflow stack is full of webhooks, so it would be great if we could find a way to add calico to the master nodes to enable the extra routing needed.
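To make the gap visible (a quick sketch; the ClusterIP is the one from the svc output above and the hostnames are only examples), the same call should time out from a master and get an HTTP answer from a worker node running calico:

elukey@ml-serve-ctrl1002:~$ curl -vk --max-time 5 "https://10.64.77.73:443/validate?timeout=30s"
elukey@ml-serve1001:~$ curl -vk --max-time 5 "https://10.64.77.73:443/validate?timeout=30s"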

Some caveats:

  • The bird daemon runs in a container (calico-node) in our current worker setup, so it is not sufficient to just add profile::calico::kubernetes and set up BGP peering with the cr* routers.
  • We could run calico-node as a docker container launched by systemd on the master nodes, but it would be yet another way of starting it (and another thing to maintain).
  • We could run calico-node similarly to how we run it on the worker nodes, but we'd also need to deploy the kubelet on the master nodes (it currently doesn't run on them).
  • Some tweaks in deployment-charts may be needed to deploy calico on master nodes as well.

Event Timeline

I don't like the idea of having yet another way of running calico-node (it's already complex enough). Because of that I suggest we add profile::kubernetes::node (maybe it needs some tweaks to allow this?) to the masters and taint them (node-role.kubernetes.io/master:NoSchedule) to prevent normal pods from being scheduled there. This will give us the benefit of having the same calico deployment and config everywhere without much extra effort.
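For reference, the taint part could be done with something like this (node name only an example; it could also be set via the kubelet's --register-with-taints flag):

kubectl taint nodes ml-serve-ctrl1001 node-role.kubernetes.io/master=:NoSchedule
kubectl describe node ml-serve-ctrl1001 | grep -i taints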

The downside of this is of course that we need docker and additional k8s deployments on the masters. For production this is a bit more troublesome, as we currently have different Debian versions on masters and nodes. But ML can test this approach without affecting production. :-)

Definitely, it seems like a good way to proceed. The only concern I have is that our kube masters are lightweight VMs (1 virtual CPU, 2G of RAM), so they may need a little revamp before getting docker and everything else. What do you think?

Yeah, maybe. Calico-node runs with a memory limit of 400Mi and a CPU request of 350m, but the other components will also take up some resources of course.
But as you're running in testing mode, you could also just see how it turns out and increase resources afterwards.
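To double check the configured values on a running cluster, something like this should work (assuming calico-node is deployed as a DaemonSet in the kube-system namespace):

elukey@ml-serve-ctrl1001:~$ kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[*].resources}'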

Change 702645 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ml_k8s::master: add profile::kubernetes::node

https://gerrit.wikimedia.org/r/702645

herron triaged this task as Medium priority. Jul 1 2021, 5:17 PM

Change 702898 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: centralize the creation of /etc/kubernetes

https://gerrit.wikimedia.org/r/702898

Change 702898 merged by Elukey:

[operations/puppet@production] kubernetes: centralize the creation of /etc/kubernetes

https://gerrit.wikimedia.org/r/702898

Change 702983 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kubernetes::node: add hiera config to expose puppet certs

https://gerrit.wikimedia.org/r/702983

Next steps:

  • refactor how base::expose_puppet_certs is used in the kubernetes profiles, since deploying profile::kubernetes::node on a master node currently causes a duplicate resource declaration.
  • work on https://gerrit.wikimedia.org/r/702645 to deploy kubelets on master nodes
  • add the network config needed by calico to make everything work (the master nodes will need to be able to peer via BGP with the core routers, etc.).
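Once calico-node is up on the masters, the BGP sessions towards the core routers can be verified with something like (a sketch, assuming calicoctl is available on the host):

elukey@ml-serve-ctrl1001:~$ sudo calicoctl node status

The cr* routers should show up as Established BGP peers.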

base::expose_puppet_certs is used in both master and node profiles, with different settings:

  • on a master, the server.key is owned/readable only by the kube user. This makes sense since the kube-apiserver and the other control-plane daemons run under the kube user.
  • on a node, the server.key is owned/readable only by root (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/328924). This makes sense since the daemon that needs the TLS certs, the kubelet, runs as root.
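For reference, the difference boils down to the ownership of the exposed key; something like the following (hostnames and paths only illustrative, assuming the usual /etc/kubernetes base dir) should show kube:kube on a master and root:root on a worker node:

elukey@ml-serve-ctrl1001:~$ sudo stat -c '%U:%G %a %n' /etc/kubernetes/ssl/server.key
elukey@ml-serve1001:~$ sudo stat -c '%U:%G %a %n' /etc/kubernetes/ssl/server.key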

I am not entirely sure how to resolve this cleanly, so here are some options:

  • use the kube user everywhere, since the kubelet process running as root will be able to read the cert key anyway. Not really clean or future proof if anything changes on the kubelet side (like the user it runs as).
  • make the directory used by expose_puppet_certs in the node profile configurable (defaulting to /etc/kubernetes). Use cases like ML could then expose the puppet certs in a different dir (duplicating the server.key and cert exposure) with different permissions, so that the ML master nodes can run the kubelet daemon without impacting the serviceops clusters.
  • some other solution that is possibly cleaner.

After a chat with Janis we reviewed the master's code and found https://gerrit.wikimedia.org/r/c/operations/puppet/+/343787/2/modules/profile/manifests/kubernetes/master.pp

In theory, for the "prod" use case (i.e. non-Toolforge-related) we don't use the puppet certs, so the hiera flag should not be used. Going to test it on ML-serve and report back.

Change 704071 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::master: avoid exposing puppet certs

https://gerrit.wikimedia.org/r/704071

Change 704071 merged by Elukey:

[operations/puppet@production] role::ml_k8s::master: avoid exposing puppet certs

https://gerrit.wikimedia.org/r/704071

Change 702983 abandoned by Elukey:

[operations/puppet@production] profile::kubernetes::node: add hiera config to expose puppet certs

Reason:

https://gerrit.wikimedia.org/r/702983

Mentioned in SAL (#wikimedia-operations) [2021-07-12T10:11:13Z] <elukey> add 10g disk to ml-serve-ctrl[12]00[12] for T285927

Change 702645 merged by Elukey:

[operations/puppet@production] ml_k8s::master: add profile::kubernetes::node

https://gerrit.wikimedia.org/r/702645

Change 704088 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::master: add docker profiles

https://gerrit.wikimedia.org/r/704088

Change 704088 merged by Elukey:

[operations/puppet@production] role::ml_k8s::master: add docker profiles

https://gerrit.wikimedia.org/r/704088

Change 704104 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add ml-serve-ctrl* nodes to the k8s ML iBGP configs

https://gerrit.wikimedia.org/r/704104

Change 704104 merged by Elukey:

[operations/homer/public@master] Add ml-serve-ctrl* nodes to the k8s ML iBGP configs

https://gerrit.wikimedia.org/r/704104

Change 704131 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Update iBGP neighbor list for the ML k8s clusters

https://gerrit.wikimedia.org/r/704131

Change 704131 merged by Elukey:

[operations/puppet@production] Update iBGP neighbor list for the ML k8s clusters

https://gerrit.wikimedia.org/r/704131

Change 704132 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add k8s iBGP neighbor config to the ML k8s master nodes

https://gerrit.wikimedia.org/r/704132

Change 704132 merged by Elukey:

[operations/puppet@production] Add k8s iBGP neighbor config to the ML k8s master nodes

https://gerrit.wikimedia.org/r/704132

Kubelet / calico / bird are deployed on the ml-serve-ctrl nodes, but the istio webhook svc does not seem to be reachable from them yet. Something is missing, but we should be close!
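For debugging, the routes learned via bird from the BGP peers should be visible on the masters, for example (a sketch; the exact prefixes depend on the pod network ranges assigned to the workers):

elukey@ml-serve-ctrl1001:~$ ip route show proto bird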

elukey claimed this task.

istio bootstrapped, everything worked nicely, thanks a lot to all who helped :)

Change 704831 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kubernetes::master: add comments and improve hiera lookups

https://gerrit.wikimedia.org/r/704831

Change 704831 merged by Elukey:

[operations/puppet@production] profile::kubernetes::master: add comments and improve hiera lookups

https://gerrit.wikimedia.org/r/704831