Improve calico-typha firewall rules
Open, Stalled, Medium, Public

Description

What?

When adding new nodes/control-planes to k8s clusters, we need to run puppet on all nodes of the cluster in order to create ferm rules that allow the new node to connect to typha, which runs on some of the nodes in the node's network namespace.

This is not ideal as it might lead to a situation where a new node is added to the cluster_nodes: list in hiera before DNS resolution works for it. Puppet then adds the node to the ferm rule (/etc/ferm/conf.d/10_calico-typha) but as resolving the A record fails, no iptables rule is created.
When DNS resolution starts working, puppet does not refresh ferm because the node is already in the ferm rule, so the file does not change.

Temporary Bandaid

  • Move from the legacy resolution (where the resolve() function of ferm does the DNS lookup) to the new srange() parameter (where DNS is resolved on the Puppet server side with every Puppet run). Done
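
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the actual Puppet code: the hostnames, addresses, the rule text, and the choice to skip unresolvable names are all assumptions made for the example (5473 is assumed as the Typha port). It shows why a file that only contains hostnames never changes once written, while a file rendered from resolved addresses changes as soon as DNS starts working.

```python
# Hypothetical DNS state: the new node has no A record yet, then gets one.
DNS_BEFORE = {"node1.example": "10.0.0.1"}
DNS_AFTER = {"node1.example": "10.0.0.1", "node2.example": "10.0.0.2"}
NODES = ["node1.example", "node2.example"]


def render_legacy(nodes):
    """Legacy style: hostnames go into the file, ferm resolves them at load time."""
    hosts = " ".join(nodes)
    return f"proto tcp dport 5473 saddr @resolve(({hosts})) ACCEPT;\n"


def render_server_side(nodes, dns):
    """New style: names are resolved at render time (assumption: unresolvable ones are skipped)."""
    addrs = " ".join(dns[n] for n in nodes if n in dns)
    return f"proto tcp dport 5473 saddr ({addrs}) ACCEPT;\n"


def needs_refresh(old, new):
    """The service is only reloaded when the rendered file content changes."""
    return old != new


# The legacy file is identical before and after DNS works, so nothing is refreshed.
legacy_changed = needs_refresh(render_legacy(NODES), render_legacy(NODES))

# The server-side rendering differs once the A record exists, which triggers a refresh.
server_side_changed = needs_refresh(
    render_server_side(NODES, DNS_BEFORE), render_server_side(NODES, DNS_AFTER)
)
print(legacy_changed, server_side_changed)  # False True
```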

Proposal

  • Relax the typha ferm rule in such a way that we don't need a per host access rule
    • This would let us deprecate the cluster_nodes: config structure in hiera completely, as it is used nowhere else.
How?

Enable (if possible) authentication between calico-node and typha. Calico uses mTLS by default between typha and felix (calico-node) when deployed using the tigera-operator

Option 1: certmanager
That is not the case in our setup, so we would have to provide the necessary certificates ourselves. We probably can't do that inside kubernetes (with certmanager) as that would require Pod networking to be up, which is not the case when initially bootstrapping a cluster.

Option 2: Generate the certificates via puppet
We could generate the certificates for typha and felix (calico-node) via puppet on all kubernetes nodes and mount them into the pods by a hostPath volume.

Question: Are typha and felix capable of hot reloading certificates if they change on disk? We assume they are, as they use k8s certificate/secret objects when deployed via the operator.

Now What?

We need 2 certificates that need to be available to Felix and Typha

  • Typha: Common Name: typha-server
    • extended key usage ServerAuth
  • Felix: Common Name: typha-client
    • extended key usage ClientAuth

Those certs should be available to felix and typha respectively.
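
As a rough illustration of what the Common Names are used for, the sketch below checks a peer's CN on each side of the connection. It assumes the arrangement described in the Calico docs linked under Docs (Felix presents typha-client, Typha presents typha-server). The helper names are hypothetical, the certificate dicts are hand-written stand-ins shaped like the result of Python's ssl.SSLSocket.getpeercert(), and the extended key usage check is not modelled here.

```python
def peer_common_name(peercert):
    """Extract the subject CN from a dict shaped like ssl.SSLSocket.getpeercert()."""
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def peer_is_allowed(peercert, expected_cn):
    """Accept the peer only if its certificate carries the expected Common Name."""
    return peer_common_name(peercert) == expected_cn


# Hand-written stand-ins for what getpeercert() would return after a handshake.
felix_cert = {"subject": ((("commonName", "typha-client"),),)}
typha_cert = {"subject": ((("commonName", "typha-server"),),)}
other_cert = {"subject": ((("commonName", "something-else"),),)}

# Typha (server side) checks connecting clients; Felix (client side) checks the server.
typha_accepts_felix = peer_is_allowed(felix_cert, "typha-client")
felix_accepts_typha = peer_is_allowed(typha_cert, "typha-server")
typha_accepts_other = peer_is_allowed(other_cert, "typha-client")
```

With a check like this on both sides, a per-host source address rule would no longer be the only thing keeping unknown clients away from typha.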

Providing certs to pods
  • Secrets
    • The certificates could be secrets which we can then mount as files
  • Files on workers
    • We could have those certs present on all workers and mount them via host path
Open questions
  • certs expiration & renewal
    • How often should those certs expire
  • Can typha and/or calico detect cert changes?
    • If a cert is renewed, will it be immediately available?

Docs
https://docs.tigera.io/calico/3.26/reference/typha/configuration#felix-typha-tls-configuration
https://docs.tigera.io/calico/3.26/network-policy/comms/crypto-auth

Event Timeline

JMeybohm updated the task description.

Change #1035365 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch Typha firewall config to firewall::service

https://gerrit.wikimedia.org/r/1035365

Change #1035365 merged by Muehlenhoff:

[operations/puppet@production] Switch Typha firewall config to firewall::service

https://gerrit.wikimedia.org/r/1035365

The Typha firewall service is now based on firewall::service and does dynamic name resolution on the puppet server side, let's see if this improves things with the next rename.

The issue didn't happen again, but we also did the move vlan in addition to the rename (so the IP changed too).

jijiki changed the task status from Open to In Progress.Jun 13 2024, 7:54 AM
jijiki claimed this task.
jijiki subscribed.
Providing certs to pods
  • Secrets
    • The certificates could be secrets which we can then mount as files

As per our initial discussion, I would rather not do this as it creates an additional (potentially manual) step.

  • Files on workers
    • We could have those certs present on all workers and mount them via host path
Open questions
  • certs expiration?
  • certs renewal

Puppet and PKI will take care of both automatically. The open question here is whether typha and felix are capable of detecting changes to the certs on disk and hot reloading them.

Providing certs to pods
  • Secrets
    • The certificates could be secrets which we can then mount as files

As per our initial discussion, I would rather not do this as it creates an additional (potentially manual) step.

Yes, I was just writing down our options.

  • Files on workers
    • We could have those certs present on all workers and mount them via host path
Open questions
  • certs expiration?
  • certs renewal

That was poorly phrased. I meant: how often would we want the certificates to be renewed?

Puppet and PKI will take care of both automatically. The open question here is whether typha and felix are capable of detecting changes to the certs on disk and hot reloading them.

Rephrased the task description accordingly

That was poorly phrased. I meant: how often would we want the certificates to be renewed?

I would suggest going with the setting we have for kubernetes related certificates (pki_renew_seconds in hieradata/common/kubernetes.yaml). We use short-lived certificates in staging to catch problems early and longer expiration in production clusters.

jijiki changed the task status from In Progress to Stalled.Nov 28 2024, 3:59 PM
jijiki triaged this task as Medium priority.

I keep de-prioritising this, so I am marking it as stalled until I pick it up again.

Change #1112204 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] calico: Create certificates for Typha/Felix mTLS

https://gerrit.wikimedia.org/r/1112204

Change #1112235 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Add support for Typha/Felix mTLS

https://gerrit.wikimedia.org/r/1112235

Change #1112236 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update calico to 0.2.11 in staging-codfw

https://gerrit.wikimedia.org/r/1112236

Change #1112204 merged by JMeybohm:

[operations/puppet@production] calico: Create certificates for Typha/Felix mTLS

https://gerrit.wikimedia.org/r/1112204

Change #1112250 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] calico: mTLS certificate symlinks have to be relative

https://gerrit.wikimedia.org/r/1112250

Change #1112250 merged by JMeybohm:

[operations/puppet@production] calico: mTLS certificate symlinks have to be relative

https://gerrit.wikimedia.org/r/1112250

I've configured staging-codfw to run with mTLS enabled. Certificates have a 72h expiry there, so we should be able to tell how that behaves by Monday at the latest.

Typha and calico-node are unable to hot reload changed certificates. For calico-node this is not that big of a problem, as it reads the certificates from disk every time it connects to typha. So technically the new certificate will be used once the old one is rejected. Typha, on the other hand, needs a full restart in order to pick up the new certificate.
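
The two behaviours can be illustrated with a small sketch. This is only a model of the observation above, not code from Calico: the class names and file contents are made up, and no real TLS is involved.

```python
import tempfile
from pathlib import Path


class LoadOnceServer:
    """Reads its certificate at startup only (the behaviour observed for typha)."""

    def __init__(self, cert_path):
        self._cert = Path(cert_path).read_bytes()

    def current_cert(self):
        return self._cert


class ReadPerConnectClient:
    """Re-reads the certificate for every connection (the behaviour observed for calico-node)."""

    def __init__(self, cert_path):
        self._cert_path = Path(cert_path)

    def current_cert(self):
        return self._cert_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    cert_file = Path(tmp) / "cert.pem"
    cert_file.write_bytes(b"old certificate")
    server = LoadOnceServer(cert_file)
    client = ReadPerConnectClient(cert_file)

    # The certificate is renewed on disk while both processes keep running.
    cert_file.write_bytes(b"renewed certificate")
    server_cert = server.current_cert()
    client_cert = client.current_cert()

print(server_cert)  # b'old certificate'
print(client_cert)  # b'renewed certificate'
```

The process that loaded the file once keeps serving the old certificate until it is restarted, while the one that re-reads it picks up the renewal on its next connection.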

The tigera-operator manages the CA and certificate for mTLS itself (without external dependencies, so no catch-22 when pod networking is not yet up) and solves the reload problem by updating a hash annotation of the typha deployment, triggering a rolling restart. Given that we, in the current implementation, have certificates per host rather than per service, this is not an option as each certificate expires at a different time.
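
A sketch of the hash annotation idea, and of why it does not map onto per-host certificates, could look like this. It is not the tigera-operator's code: the annotation key, host names, and certificate contents are placeholders invented for the example. The only Kubernetes behaviour it relies on is that changing the pod template of a Deployment triggers a rollout.

```python
import hashlib


def cert_hash(cert_pem: bytes) -> str:
    """Stable fingerprint of the certificate content."""
    return hashlib.sha256(cert_pem).hexdigest()


def restart_patch(cert_pem: bytes) -> dict:
    """Pod template annotation patch; a changed value makes the Deployment roll."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Hypothetical annotation key, chosen for this example only.
                    "annotations": {"example.invalid/typha-cert-hash": cert_hash(cert_pem)}
                }
            }
        }
    }


# One certificate per service: a renewal changes the single hash, so all pods roll once.
patch_before = restart_patch(b"shared cert v1")
patch_after = restart_patch(b"shared cert v2")
renewal_triggers_rollout = patch_before != patch_after

# One certificate per host: every host has a different hash, renewed at a different time,
# so no single annotation value can describe the state of all typha pods.
per_host_certs = {"host-a": b"cert for host-a", "host-b": b"cert for host-b"}
distinct_hashes = {cert_hash(pem) for pem in per_host_certs.values()}
single_hash_possible = len(distinct_hashes) == 1

print(renewal_triggers_rollout, single_hash_possible)  # True False
```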

I guess there is no easy way to solve this. Having a typha sidecar restart typha on cert change is probably a bad idea, as this could lead to all typhas restarting into a bad state. Another option would be to implement a controller that basically does the same as the tigera-operator: manage the CA and the certificates and roll-restart typha when the certificate is updated. But that also seems pretty involved... kicking this down the road even more.

JMeybohm updated the task description.

Change #1112782 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Revert "Create certificates for Typha/Felix mTLS"

https://gerrit.wikimedia.org/r/1112782

Change #1112235 abandoned by JMeybohm:

[operations/deployment-charts@master] calico: Add support for Typha/Felix mTLS

Reason:

Does not work this way, see T365687

https://gerrit.wikimedia.org/r/1112235

Change #1112236 abandoned by JMeybohm:

[operations/deployment-charts@master] Update calico to 0.2.11 in staging-codfw and enable mTLS

Reason:

Does not work this way, see T365687

https://gerrit.wikimedia.org/r/1112236

Change #1112782 merged by JMeybohm:

[operations/puppet@production] Revert "Create certificates for Typha/Felix mTLS"

https://gerrit.wikimedia.org/r/1112782