Page MenuHomePhabricator

Integrate cert-manager/issuer in ml-serve clusters
Closed, ResolvedPublic

Description

We are currently using TLS certificates issued by the Puppet CA and created via cergen. They are deployed to the deploy1002 node via puppet, that creates ad-hoc helmfile private configs that we can load in our helm charts.

The TLS certs that we created are:

  1. inference.discovery.wmnet, deployed on the istio gateway pods (in the istio-system namespace) via the knative helm chart (since it handles all L7 configs for istio).
  2. Webhooks certificates for various pods
  3. TLS certificate for the Istio Egress gateway

The SRE team is integrating cert-manager/issuer in their clusters: https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager

We could do the following:

  1. Use the same config to create inference.discovery.wmnet from the Discovery intermediate PKI (see documentation).
  2. Create a new Intermediate PKI to handle our certificates for webhooks and egress gateway. See https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Intermediate_Certificates

This would free us from the Puppet CA completely, and open the door to mTLS between pods.

Event Timeline

I had a chat with Janis today, and IIUC for inference.discovery.wmnet we should:

  1. Create a new signing profile like https://gerrit.wikimedia.org/r/c/operations/puppet/+/745496 (and set the related cfssl-issuer-values.yaml file in deployment-charts)
  2. Set install_cert_manager: true in deployment-charts
  3. Follow https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager#Usage and add a CRD like the one stated in the docs. We can add one in the knative-serving chart, where we define the Istio gateways etc..
  4. Force Istio Gateway to use the new certificate/secret.

In this way we'll use the discovery PKI CA instead of the puppet one (the current cert for inference.d.wmnet we'll need to be revoked etc.. as clean up step).

A similar/different procedure may be needed for the other certs that we deployed (egress gw, webhooks, etc..) but we'll also likely need another dedicated intermediate CA for us.

Change 754885 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::multirootca: add dedicated profile for ml-serve k8s

https://gerrit.wikimedia.org/r/754885

Change 754890 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: deploy cert-manager for ml-serve nodes

https://gerrit.wikimedia.org/r/754890

Change 754885 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add dedicated profile for ml-serve k8s

https://gerrit.wikimedia.org/r/754885

Change 754890 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: deploy cert-manager for ml-serve nodes

https://gerrit.wikimedia.org/r/754890

Change 754981 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add 'cert-manager' namespace to ml-serve

https://gerrit.wikimedia.org/r/754981

Change 755259 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::multirootca: add expiry for k8s_mlserve

https://gerrit.wikimedia.org/r/755259

Change 755259 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add expiry for k8s_mlserve

https://gerrit.wikimedia.org/r/755259

Change 754981 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add 'cert-manager' namespace to ml-serve

https://gerrit.wikimedia.org/r/754981

NAMESPACE                   NAME                                                              READY   STATUS    RESTARTS   AGE
cert-manager                cert-manager-5f788b6dbb-sxz28                                     1/1     Running   0          15m
cert-manager                cert-manager-cainjector-6b48b95c9-c9sbg                           1/1     Running   0          15m
cert-manager                cert-manager-cainjector-6b48b95c9-q875f                           1/1     Running   0          15m
cert-manager                cert-manager-webhook-7496cbfb78-jdjhn                             1/1     Running   0          15m
cert-manager                cert-manager-webhook-7496cbfb78-z4gmw                             1/1     Running   0          15m
cert-manager                cfssl-issuer-8567cb7675-kkj8x                                     1/1     Running   0          67s
cert-manager                cfssl-issuer-8567cb7675-phl57                                     1/1     Running   0          67s

Change 755333 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Expand helmfile_namespace_certs to support the ml-serve use case

https://gerrit.wikimedia.org/r/755333

Change 755334 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add cert-manager settings for the ml-serve clusters

https://gerrit.wikimedia.org/r/755334

Change 755333 merged by Elukey:

[operations/deployment-charts@master] Expand helmfile_namespace_certs to support the ml-serve use case

https://gerrit.wikimedia.org/r/755333

Change 755334 merged by Elukey:

[operations/deployment-charts@master] Add cert-manager settings for the ml-serve clusters

https://gerrit.wikimedia.org/r/755334

Change 755431 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: update tls secret settings for ml-serve

https://gerrit.wikimedia.org/r/755431

Change 755431 merged by Elukey:

[operations/deployment-charts@master] admin_ng: update knative's tls secret settings for ml-serve

https://gerrit.wikimedia.org/r/755431

Change 755441 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: remove the secrets chart from knative-serving

https://gerrit.wikimedia.org/r/755441

Change 755640 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: remove secrets chart from knative-serving

https://gerrit.wikimedia.org/r/755640

Change 755640 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: remove secrets chart from knative-serving

https://gerrit.wikimedia.org/r/755640

Change 755441 merged by Elukey:

[operations/deployment-charts@master] admin_ng: remove the secrets chart from knative-serving

https://gerrit.wikimedia.org/r/755441

Cleaned up all config and certs related to inference.discovery.wmnet (on the puppet private repo and on k8s secrets etc..). The Puppet-based cert has also been revoked.

The remaining puppet-based certificates are:

  • kserve-webhook-server-service.kserve.svc.cluster.local
  • istio-egressgateway.istio-system.svc.cluster.local

Plus of course if we'll enable mTLS in istio in the future, all the ones related to InferenceService etc..

Change 755651 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::root: add the ml_serve intermediate PKI

https://gerrit.wikimedia.org/r/755651

After a chat with John in https://gerrit.wikimedia.org/r/755651 it seems that for our use case we can use the discovery intermediate (without creating a new one).

Next steps: move kserve-webhook-server-service.kserve.svc.cluster.local and istio-egressgateway.istio-system.svc.cluster.local to the discovery intermediate CA.

Change 755651 abandoned by Elukey:

[operations/puppet@production] role::pki::root: add the ml_serve intermediate PKI

Reason:

https://gerrit.wikimedia.org/r/755651

Change 755955 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: move to cert-manager

https://gerrit.wikimedia.org/r/755955

Change 755955 merged by Elukey:

[operations/deployment-charts@master] kserve: move to cert-manager

https://gerrit.wikimedia.org/r/755955

Change 755970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: move egress gateway to cert-manager

https://gerrit.wikimedia.org/r/755970

Change 755970 merged by Elukey:

[operations/deployment-charts@master] knative-serving: move egress gateway to cert-manager

https://gerrit.wikimedia.org/r/755970

Change 755984 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Move ml-services to the new CA cert bundle

https://gerrit.wikimedia.org/r/755984

Change 755984 merged by Elukey:

[operations/deployment-charts@master] Move ml-services to the new CA cert bundle

https://gerrit.wikimedia.org/r/755984

It sems that the kserve webhook doesn't work properly with the new settings, this is what I get when trying to deploy/modify an InferenceService:

Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "enwiki-goodfaith" with kind InferenceService: Internal error occurred: failed calling webhook "inferenceservice.kserve-webhook-server.defaulter": Post https://kserve-webhook-server-service.kserve.svc:443/mutate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s: x509: certificate signed by unknown authority && cannot patch "enwiki-damaging" with kind InferenceService: Internal error occurred: failed calling webhook "inferenceservice.kserve-webhook-server.defaulter": Post https://kserve-webhook-server-service.kserve.svc:443/mutate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s: x509: certificate signed by unknown authority: no NetworkPolicy with the name "kserve-inference-main" found

From https://cert-manager.io/v1.2-docs/concepts/certificate/:

Additionally, if the Certificate Authority is known, the corresponding CA certificate will be stored in the secret with key ca.crt. For example, with the ACME issuer, the CA is not known and ca.crt will not exist in acme-crt-secret.

There is probably something missing our config, the caBundle field in the kserve webhook CRDs is not injected with the Discovery's CA crt value and the k8s api fails to validate TLS connections to the kserve webhook.

This task is currently blocked by T299906

Change 757502 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: add more SANs to the Istio Egress gw's Certificate

https://gerrit.wikimedia.org/r/757502

Change 757502 merged by Elukey:

[operations/deployment-charts@master] knative-serving: add more SANs to the Istio Egress gw's Certificate

https://gerrit.wikimedia.org/r/757502

elukey closed this task as Resolved.EditedJan 26 2022, 7:25 PM

Finally done!

The work in T299906 was enough to make all errors disappearing, the kserve webhook now works fine and we can deploy.

The egress tls certificate is also picked up and valid.