Refactor calico deploy strategy
Closed, Resolved, Public

Description

Our current deploy strategy is:

  • Deploy via debian packages (calicoctl, calico-cni)
  • Deploy calico-node (as docker container, launched by systemd) via puppet (modules/calico/manifests/init.pp)
  • Deploy calico-policy-controller via helmfile.d/admin (internal_charts/wmf-calico-policy-controller/)

With upcoming k8s and calico updates it would be nice to have this less scattered, like:

  • Deploy via debian packages (calicoctl, calico-cni)
  • Deploy CRDs, RBAC, calico-node, typha, calico-policy-controller via a helm chart and helmfile.d/admin

Unfortunately, this is not easily possible with helm2 and tiller: there is a catch-22, because tiller needs to access the k8s API before the policy-controller is running. Also, deploying calico-node as a DaemonSet would require us to run the pod privileged, which I believe we currently prohibit globally.

Event Timeline

To solve the catch-22 we could deploy the to-be calico helm chart via helm3, which would require us to invest in helm3 integration earlier than we had hoped (at least for the helmfile.d/admin part).

Change 641511 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/kubernetes@future] Add kubernetes-addon-manager

https://gerrit.wikimedia.org/r/641511

Change 641711 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes: Add profile to install addon-manager on masters

https://gerrit.wikimedia.org/r/641711

@akosiaris and I discussed this further and initially decided to give the kubernetes addon-manager a try for rolling out calico components to a freshly bootstrapped cluster. I created a binary package from the kubernetes source package that sets up the addon-manager and added the corresponding puppet code.

While the very static manifests like CRDs and RBAC rules could easily be packaged (from calico source) and installed via addon-manager, the trouble starts with the calico-node DaemonSet and typha Deployment:

  • In "Calico the hard way" certificates are used to authenticate communication between calico-node and typha. If we want to do so as well, we would need to integrate them somehow which would potentially duplicate puppet code we have for helmfile.d already.
  • We need to override the ENV variables for the kubernetes service (kubernetes API) for all calico specific containers because of the IP SAN limitation of puppet CA. So we would need to template the manifests again, which is a bad thing to do in puppet (and we already have helm to fight against in this area).
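
The ENV override from the second bullet can be sketched roughly as below. This is only an illustration of the idea, not the template that ended up in the chart. `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` are the standard variables in-cluster clients read; the function name, the hostname `kubemaster.example.org` and the port are placeholders, and the assumption is that the API has to be addressed by a name the certificate covers.

```python
import copy


def override_api_endpoint(pod_spec, host, port):
    """Return a copy of pod_spec where every container points at host:port.

    Existing KUBERNETES_SERVICE_* entries are replaced, other env vars are kept.
    """
    overrides = {
        "KUBERNETES_SERVICE_HOST": host,
        "KUBERNETES_SERVICE_PORT": str(port),
    }
    spec = copy.deepcopy(pod_spec)
    for container in spec.get("containers", []):
        kept = [e for e in container.get("env", []) if e["name"] not in overrides]
        container["env"] = kept + [
            {"name": name, "value": value} for name, value in overrides.items()
        ]
    return spec


# Hypothetical fragment of a calico-node pod spec
pod_spec = {
    "containers": [
        {
            "name": "calico-node",
            "env": [{"name": "DATASTORE_TYPE", "value": "kubernetes"}],
        },
    ]
}
# Placeholder hostname and port, not the real API endpoint
patched = override_api_endpoint(pod_spec, "kubemaster.example.org", 6443)
```

Doing this for every calico container by hand is the templating work the bullet complains about, which is why a chart value is attractive here.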

Another open question is whether we want to switch from running calico-node manually (docker run via systemd) to the recommended way of running it as a DaemonSet within the cluster.
Personally I would prefer to run it the recommended way, as it seems more straightforward and the benefit of separating it from k8s gets even smaller when we use the kubernetes datastore. The downside is that calico-node needs to run privileged and we have privileged containers disabled. We also lack proper limitations regarding privileged containers in production, see: T228967.

I did a bit of testing and it looks like it is entirely possible to switch helmfile.d/admin to helm3 and get rid of tiller there (i.e. the catch-22) while keeping helm2 + tiller for helmfile.d/services for now (see T268434).

Taking that route instead of addon-manager would also keep the benefit of having everything "inside" the cluster deployed via deployment-charts repo.

Change 641711 abandoned by JMeybohm:
[operations/puppet@production] kubernetes: Add profile to install addon-manager on masters

Reason:

https://gerrit.wikimedia.org/r/641711

Change 641511 abandoned by JMeybohm:
[operations/debs/kubernetes@future] Add kubernetes-addon-manager

Reason:

https://gerrit.wikimedia.org/r/641511

Change 643974 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Add charts for calico and calico-crds

https://gerrit.wikimedia.org/r/643974

Change 644462 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Add calico helm chart

https://gerrit.wikimedia.org/r/644462

Change 643974 merged by jenkins-bot:
[operations/deployment-charts@master] Add helm chart for calico CRDs

https://gerrit.wikimedia.org/r/643974

Change 645317 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Split out RBAC rules and service accounts for typha and CNI

https://gerrit.wikimedia.org/r/645317

Change 645408 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] calico: Bind the calico-cni Role to the calico-cni user

https://gerrit.wikimedia.org/r/645408

Change 645412 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] k8s_infrastructure_users: add calico-cni

https://gerrit.wikimedia.org/r/645412

Change 645412 merged by JMeybohm:
[labs/private@master] k8s_infrastructure_users: add calico-cni

https://gerrit.wikimedia.org/r/645412

Change 645417 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] calico: Add support for calico 3.x with kubernetes datastore

https://gerrit.wikimedia.org/r/645417

Change 645426 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Add tokens for calico::kubernetes cni and ctl

https://gerrit.wikimedia.org/r/645426

Change 645426 merged by JMeybohm:
[labs/private@master] Add tokens for calico::kubernetes cni and ctl

https://gerrit.wikimedia.org/r/645426

Change 646740 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] calico: Remove calico/data

https://gerrit.wikimedia.org/r/646740

Change 646740 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: Remove calico/data

https://gerrit.wikimedia.org/r/646740

Change 644462 merged by jenkins-bot:
[operations/deployment-charts@master] Add calico helm chart

https://gerrit.wikimedia.org/r/644462

Change 645317 merged by jenkins-bot:
[operations/deployment-charts@master] Split out RBAC rules and service accounts for typha and CNI

https://gerrit.wikimedia.org/r/645317

Change 645408 merged by jenkins-bot:
[operations/deployment-charts@master] calico: Bind the calico-cni Role to the calico-cni user

https://gerrit.wikimedia.org/r/645408

The new calico chart is merged, thanks @akosiaris

What is currently missing is a proper RoleBinding for the calicoctl user, as I was not yet sure what permissions it is going to need.
We should not be using the tool for changing calico config; that is to be done via the helm chart now. But we will want to keep the analyze functionality intact. I could not find any docs on that so far, so we may just have to figure it out when we have a node in staging-codfw.

That's fine, but since the tool is also used to run some diagnostics and will only be run from the kubernetes nodes by an SRE, it's probably ok to use the network-admin role that is defined in https://docs.projectcalico.org/getting-started/kubernetes/hardway/end-user-rbac

Yeah. I thought it might be smart to create a read-only role for that (if possible) to prevent (untracked) changes to the config by human error. Let's see when we get there.
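
A read-only role along those lines could look roughly like the sketch below. It is only meant to show the shape of such a role. The `calicoctl` user name is taken from the puppet-private change, but the role name `calicoctl-read-only` and the resource list are assumptions; which resources calicoctl really needs for diagnostics is exactly what still has to be figured out.

```python
RBAC_API = "rbac.authorization.k8s.io"


def read_only_cluster_role(name, api_group, resources):
    """ClusterRole granting only get/list/watch on the given resources."""
    return {
        "apiVersion": RBAC_API + "/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": [api_group],
                "resources": sorted(resources),
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def bind_to_user(role_name, user):
    """ClusterRoleBinding attaching the ClusterRole to a single user."""
    return {
        "apiVersion": RBAC_API + "/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": role_name},
        "roleRef": {"apiGroup": RBAC_API, "kind": "ClusterRole", "name": role_name},
        "subjects": [{"apiGroup": RBAC_API, "kind": "User", "name": user}],
    }


# Illustrative subset of calico CRDs; NOT a verified list of what calicoctl reads
role = read_only_cluster_role(
    "calicoctl-read-only",  # hypothetical role name
    "crd.projectcalico.org",
    ["ippools", "ipamblocks", "ipamhandles", "bgppeers", "felixconfigurations"],
)
binding = bind_to_user("calicoctl-read-only", "calicoctl")
```

Leaving out create, update, patch and delete is what would prevent untracked config changes by hand, at the cost of having to extend the list whenever a diagnostic command hits a forbidden resource.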

Change 645417 merged by JMeybohm:
[operations/puppet@production] calico: Add support for calico 3.x with kubernetes datastore

https://gerrit.wikimedia.org/r/645417

[puppet-private] (487bdca0) (jayme) Add calicoctl and calico-cni kubernetes users

Change 648143 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Add calico releases to admin_ng helmfile

https://gerrit.wikimedia.org/r/648143

Change 648143 merged by jenkins-bot:
[operations/deployment-charts@master] Add calico releases to admin_ng helmfile

https://gerrit.wikimedia.org/r/648143

Change 648245 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] helmfile needs: parameter requires a release namespace

https://gerrit.wikimedia.org/r/648245

Change 648245 merged by jenkins-bot:
[operations/deployment-charts@master] helmfile needs: parameter requires a release namespace

https://gerrit.wikimedia.org/r/648245

This is done: calico is now deployed via puppet (CNI plugins and calicoctl) and helm3 (helmfile.d/admin_ng).
Everything is under version control and there are no more catch-22s during cluster bootstrapping.

Looks like we're missing an RBAC rule for calico-node:

2021-01-28 14:17:19.139 [INFO][31] tunnel-ip-allocator/ipam.go 1325: Releasing all IPs with handle 'wireguard-tunnel-addr-kubestage2002.codfw.wmnet'
2021-01-28 14:17:19.142 [FATAL][31] tunnel-ip-allocator/allocateip.go 537: Error releasing address by handle Handle="wireguard-tunnel-addr-kubestage2002.codfw.wmnet" IP="" error=connection is unauthorized: ipamhandles.crd.projectcalico.org "wireguard-tunnel-addr-kubestage2002.codfw.wmnet" is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot get resource "ipamhandles" in API group "crd.projectcalico.org" at the cluster scope type="wireguardTunnelAddress"
Calico node failed to start

Change 659259 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] calico-node needs read access to ipamhandles resource

https://gerrit.wikimedia.org/r/659259

Change 659259 merged by jenkins-bot:
[operations/deployment-charts@master] calico-node needs read access to ipamhandles resource

https://gerrit.wikimedia.org/r/659259

Fixed in helm chart version 0.1.9.
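
For illustration, the failure above and its fix can be modelled with a small rule check. This is a simplified sketch of how RBAC rules are matched (it ignores resourceNames, subresources and non-resource URLs), and the `before` rule set is made up; only the `ipamhandles` read rule corresponds to what the change is described as adding.

```python
def allows(rules, api_group, resource, verb):
    """True if any rule grants `verb` on `resource` in `api_group`.

    "*" in a rule field matches anything, as in Kubernetes RBAC.
    """
    def match(allowed, wanted):
        return "*" in allowed or wanted in allowed

    return any(
        match(rule.get("apiGroups", []), api_group)
        and match(rule.get("resources", []), resource)
        and match(rule.get("verbs", []), verb)
        for rule in rules
    )


GROUP = "crd.projectcalico.org"

# Hypothetical rule set before the fix: no access to ipamhandles at all
before = [{"apiGroups": [GROUP], "resources": ["ipamblocks"], "verbs": ["get", "list"]}]

# Adding read access to ipamhandles, as the change title describes
after = before + [{"apiGroups": [GROUP], "resources": ["ipamhandles"], "verbs": ["get"]}]

denied_before = not allows(before, GROUP, "ipamhandles", "get")
allowed_after = allows(after, GROUP, "ipamhandles", "get")
```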

2021-01-30 15:45:27.743 [ERROR][8] lookup.go 63: Failed to get Typha endpoint from Kubernetes error=endpoints "calico-typha" is forbidden: User "system:serviceaccount:kube-system:calico-typha" cannot get resource "endpoints" in API group "" in the namespace "kube-system"

Change 660399 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] calico: Typha needs to get endpoints to discover it's instances

https://gerrit.wikimedia.org/r/660399

Change 660399 merged by jenkins-bot:
[operations/deployment-charts@master] calico: Typha needs to get endpoints to discover it's instances

https://gerrit.wikimedia.org/r/660399
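
Both RBAC gaps in this task were found by reading "is forbidden" messages in the logs. As a rough aid, the minimal missing rule can be derived from such a message. The regular expression below is an assumption based only on the two messages quoted in this task, not on a documented message format, and the function name is made up.

```python
import re

# Pattern derived from the two log messages in this task (an assumption,
# not a documented format).
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
    r'(?: in the namespace "(?P<namespace>[^"]+)"| at the cluster scope)'
)


def missing_rule(message):
    """Extract who was denied and the smallest rule that would allow it."""
    m = FORBIDDEN.search(message)
    if m is None:
        return None
    return {
        "user": m["user"],
        "namespace": m["namespace"],  # None means cluster scope
        "rule": {
            "apiGroups": [m["group"]],
            "resources": [m["resource"]],
            "verbs": [m["verb"]],
        },
    }


typha_error = (
    'endpoints "calico-typha" is forbidden: User '
    '"system:serviceaccount:kube-system:calico-typha" cannot get resource '
    '"endpoints" in API group "" in the namespace "kube-system"'
)
typha_gap = missing_rule(typha_error)
```

A namespace in the result suggests a Role and RoleBinding in that namespace are enough, while `None` points at a ClusterRole, as in the earlier ipamhandles case.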

I think it's safe to say this is done now with the admin_ng using helm3 and the updates to puppet.