Page MenuHomePhabricator

Experiment with the Istio TLS mesh
Closed, ResolvedPublic

Description

In T294414 we realized that having the Istio sidecar proxy machinery would simplify a lot the configuration of our InferenceService resources. This task tracks a spike to test if it is feasible to turn on the TLS mesh in our ml-serve-{eqiad,codfw} clusters.

We have never tried this road since we thought that cfssl-manager + PKI were both needed, but in reality this is not true. The istiod pod creates a root CA (self signed) to manage certificates in the mesh if nothing is specified, so we could try to use it before other roads. The upstream docs suggest to use an intermediate PKI CA dedicated to istiod, but in our case our cluster is separated from the rest and it will host very similar services (all InferenceService based), so the security gain of using an extra intermediate CA doesn't seem big. In the future we could swap the self signed root CA for a dedicated PKI intermediate, it should be easy enough (if we want).

Details

SubjectRepoBranchLines +/-
operations/deployment-chartsmaster+7 -2
operations/deployment-chartsmaster+1 -0
operations/puppetproduction+29 -30
operations/deployment-chartsmaster+2 -1
operations/deployment-chartsmaster+10 -10
operations/deployment-chartsmaster+7 -3
operations/deployment-chartsmaster+82 -131
operations/deployment-chartsmaster+46 -89
operations/deployment-chartsmaster+47 -5
operations/puppetproduction+30 -0
operations/puppetproduction+22 -0
operations/deployment-chartsmaster+82 -0
labs/privatemaster+6 -0
operations/puppetproduction+58 -66
operations/puppetproduction+1 -0
operations/debs/istiomaster+156 -0
operations/docker-images/production-imagesmaster+32 -0
operations/deployment-chartsmaster+66 -1
Show related patches Customize query in gerrit

Event Timeline

Change 746880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add the istio pod security policy

https://gerrit.wikimedia.org/r/746880

Change 746880 abandoned by Elukey:

[operations/deployment-charts@master] helmfile.d: add the istio pod security policy

Reason:

The better way is to try the calico plugin

https://gerrit.wikimedia.org/r/746880

Change 767924 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] istio: add the install-cni docker file

https://gerrit.wikimedia.org/r/767924

I had a super interesting chat with Janis about https://gerrit.wikimedia.org/r/767924, since we went through a similar path for Calico in the past (we have CNI binaries deployed on all k8s nodes).

We reviewed install-cni and it seems meant to deploy istio cni binaries + configs on the underlying k8s nodes, taking care of the cni chain config as well. If this is the case (so nothing else is influenced at the istioctl level), we could create a new debian package (like the calico one) to deploy the cni binaries and set up the correct cni chain config with calico (see https://github.com/containernetworking/cni/blob/master/SPEC.md). In this way we'd avoid the install-cni daemon set, and also we'd avoid things like https://istio.io/latest/docs/setup/additional-setup/cni/#race-condition-mitigation.

I will try to come up with a proof of concept for this, but it will require some work of course (I hope to time box it to a couple of days).

I tried to do the following experiment on ml-serve-eqiad:

  • cordon all worker nodes except ml-serve1001
  • copy (via scp) the istio-cni and istio-iptables binaries (copied from the istio build docker image on our registry) under ml-serve1001's /opt/cni/bin
  • added the istio-cni user to the ml-serve-ctrl nodes (manually)
  • added /etc/cni/net.d/istio-kubeconfig (with the user's token etc..) and added the following snippet to 10-calico.conflist:
{
  "name": "istio-cni",
  "type": "istio-cni",
  "log_level": "info",
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/istio-kubeconfig",
    "cni_bin_dir": "/opt/cni/bin",
    "exclude_namespaces": [
      "istio-system",
      "kube-system",
      "knative-serving",
      "cert-manager",
      "kserve"
    ]
  }
 }
  • manually added (via kubectl apply -f) the following config:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: istio-cni
  labels:
    app: istio-cni
    operator.istio.io/component: "Cni"
rules:
- apiGroups: [""]
  resources:
  - pods
  - nodes
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: istio-cni
  labels:
    app: istio-cni
    operator.istio.io/component: "Cni"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: istio-cni
subjects:
- kind: User
  name: istio-cni

Then I tried to add sidecar.istio.io/inject: "true" (default false on our InfernceService pods/resources) to one InferenceService (arwiki-goodfaith) and deleted the pod. The kubelet recognized the new cni plugin, but I found the following lines:

Mar 08 07:41:04 ml-serve1001 kubelet[1864845]: {"level":"info","time":"2022-03-08T07:41:04.320626Z","msg":"Pod arwiki-goodfaith-predictor-default-8656s-deployment-5d4c4c2m5zq contains inject annotation: true"}
Mar 08 07:41:04 ml-serve1001 kubelet[1864845]: {"level":"info","time":"2022-03-08T07:41:04.320641Z","msg":"Pod arwiki-goodfaith-predictor-default-8656s-deployment-5d4c4c2m5zq excluded due to not containing sidecar annotation"}

That corresponds to this log in the istio-cni code, that doesn't find the sidecar.istio.io/status annotation.
There should be a webhook injector that takes care of the annotation, so I started looking into the istio helm config (the one embedded inside istioctl) and found a lot of if conditionals for values.istio_cni.enabled, that is the Istio Operator / istioctl flag that allows to enable the CNI plugin. The main problem is that there seems to be only one way to enable/configure the cni plugin, namely via install-cni:

https://github.com/istio/istio/tree/release-1.9/cni

This in turns means that there seem no easy way to deploy the cni binaries without the extra daemonset, since the related config to use the istio-proxy sidecar wouldn't be added automatically.

I'm still not completely sold, sorry. :)
AIUI istio_cni is not strictly required for the service mesh to work (because there is this other way, using init containers). So in theory it can't be that istio_cni.enabled is required to be true - or am I missing something?

I'm still not completely sold, sorry. :)

No need, if we can find a solution that doesn't require the install-cni daemonset I am more than happy :)

AIUI istio_cni is not strictly required for the service mesh to work (because there is this other way, using init containers). So in theory it can't be that istio_cni.enabled is required to be true - or am I missing something?

It is required to be true with the CNI setting that we want to use, the init container is not an option due to the security privileges needed. My point is that I see Values.istio_cni.enabled conditionals in various part of the istio helm manifests, that add or not annotations to pods. For example, https://github.com/istio/istio/blob/release-1.9/manifests/charts/istiod-remote/files/injection-template.yaml#L13

Confirmed with istioctl manifest generate, the daemonset is added by default when the istio cni component is enabled. I have filed https://github.com/istio/istio/issues/37855 to upstream, let's see if they come back.

I also tried to deploy the settings to the ml-serve-eqiad cluster, and the daemonset fails to come up due to missing hostPath volumes capabilities (as expected, it would need a special psp policy for it). One half-baked solution could be to deploy the istio-cni component via istioctl, and leave the daemon-set in that state while we deploy the CNI binaries + /etc config via puppet, but it is not a great solution.

Would you try this hack? :-)
This rewrites the DaemonSet to be a Deployment with 0 replicas. As no Pods will be created, I think this should not trigger PSP and it will not leave us with a "looks broken" DaemonSet on each node.

diff --git a/custom_deploy.d/istio/main/config.yaml b/custom_deploy.d/istio/main/config.yaml
index 3ece4739..afce2dff 100644
--- a/custom_deploy.d/istio/main/config.yaml
+++ b/custom_deploy.d/istio/main/config.yaml
@@ -15,7 +15,21 @@ spec:
     pilot:
       enabled: true
     cni:
-      enabled: false
+      enabled: true
+      k8s:
+        overlays:
+        - apiVersion: apps/v1
+          kind: DaemonSet
+          name: istio-cni-node
+          patches:
+          - path: kind
+            value: Deployment
+          - path: spec.replicas
+            value: 0
+          - path: spec.strategy
+            value:
+              type: Recreate
+          - path: spec.updateStrategy
     egressGateways:
       - enabled: false
         name: istio-egressgateway

Thanks to Janis' patch I am now able to see a pod with the istio-proxy sidecar, together with CNI logs in the kubelet (to inject the iptables rules etc..). Some useful notes to remember:

  1. I had to add a specific annotation to allow the kserve storage initializer to reach thanos-swift. The container is an init one, and it starts when the istio-proxy sidecar is not there, failing to connect to Thanos and returning an error.
global:
  proxy:
    excludeIPRanges: "10.2.2.54/32,10.2.1.54/32" # thanos-swift's IPs
  1. The istiod network policies need to allow traffic to port 15012 also from istio-proxy pods. There seem to be no label to use for it, so we'll have to find something.
  2. In order to inject the istio-proxy sidecar, the isvc needs to have sidecar.istio.io/inject: "true" and the related namespace

I have requested a new istio repository in:
https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests

The new repository will consolidate the istioctl binary and the istio-cni ones.

Change 767924 abandoned by Elukey:

[operations/docker-images/production-images@master] istio: add the install-cni docker file

Reason:

https://gerrit.wikimedia.org/r/767924

Change 771670 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/istio@master] Initial debianization of istio-cni

https://gerrit.wikimedia.org/r/771670

Change 772909 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Refactor Calico's CNI plugin config

https://gerrit.wikimedia.org/r/772909

Change 773565 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add helmfile config for Istio proxy sidecars

https://gerrit.wikimedia.org/r/773565

Next steps:

After the above, test again and add documentation :)

Change 771670 merged by Elukey:

[operations/debs/istio@master] Initial debianization of istio-cni

https://gerrit.wikimedia.org/r/771670

Change 773791 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: add component for istio 1.9.5

https://gerrit.wikimedia.org/r/773791

Change 773791 merged by Elukey:

[operations/puppet@production] aptrepo: add component for istio 1.9.5

https://gerrit.wikimedia.org/r/773791

Change 774424 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::calico::kubernetes: add optional istio-cni config

https://gerrit.wikimedia.org/r/774424

To keep archives happy: this is the config that I have used to allow pods to contact api-ro.discovery. wmnet.

- apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    annotations:
      meta.helm.sh/release-name: knative-serving
      meta.helm.sh/release-namespace: knative-serving
    creationTimestamp: "2021-12-15T09:43:03Z"
    generation: 46
    labels:
      app.kubernetes.io/managed-by: Helm
    name: mediawiki-api-vs
    namespace: knative-serving
    resourceVersion: "253359188"
    selfLink: /apis/networking.istio.io/v1beta1/namespaces/knative-serving/virtualservices/mediawiki-api-vs
    uid: fcc2050a-4ab0-4a77-9f21-68f94d14ee97
  spec:
    gateways:
    - mesh
    hosts:
    - '*.wikipedia.org'
    - api-ro.discovery.wmnet
    tls:
    - match:
      - port: 443
        sniHosts:
        - '*.wikipedia.org'
        - api-ro.discovery.wmnet
      route:
      - destination:
          host: api-ro.discovery.wmnet
          port:
            number: 443
- apiVersion: networking.istio.io/v1beta1
  kind: DestinationRule
  metadata:
    annotations:
      meta.helm.sh/release-name: knative-serving
      meta.helm.sh/release-namespace: knative-serving
    creationTimestamp: "2021-12-15T09:43:03Z"
    generation: 33
    labels:
      app.kubernetes.io/managed-by: Helm
    name: https-api-ro
    namespace: knative-serving
    resourceVersion: "241864545"
    selfLink: /apis/networking.istio.io/v1beta1/namespaces/knative-serving/destinationrules/https-api-ro
    uid: f355e9ab-f271-4f5a-a3db-ef5ef01dea3e
  spec:
    host: api-ro.discovery.wmnet
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 10
apiVersion: v1
items:
- apiVersion: networking.istio.io/v1beta1
  kind: ServiceEntry
  metadata:
    annotations:
      meta.helm.sh/release-name: knative-serving
      meta.helm.sh/release-namespace: knative-serving
    creationTimestamp: "2021-12-15T09:43:03Z"
    generation: 34
    labels:
      app.kubernetes.io/managed-by: Helm
    name: mediawiki-api-ro
    namespace: knative-serving
    resourceVersion: "239884766"
    selfLink: /apis/networking.istio.io/v1beta1/namespaces/knative-serving/serviceentries/mediawiki-api-ro
    uid: 680caa2a-917b-4829-8d96-9fc841967e0a
  spec:
    hosts:
    - api-ro.discovery.wmnet
    location: MESH_EXTERNAL
    ports:
    - name: https-apiro
      number: 443
      protocol: HTTPS
    resolution: DNS

Change 772909 merged by Elukey:

[operations/puppet@production] Refactor Calico's CNI plugin config

https://gerrit.wikimedia.org/r/772909

Change 774865 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add istio-cni fake token to k8s configurations

https://gerrit.wikimedia.org/r/774865

Change 774865 merged by Elukey:

[labs/private@master] Add istio-cni fake token to k8s configurations

https://gerrit.wikimedia.org/r/774865

Change 773565 merged by Elukey:

[operations/deployment-charts@master] Add helmfile config for Istio proxy sidecars

https://gerrit.wikimedia.org/r/773565

Change 774424 merged by Elukey:

[operations/puppet@production] profile::calico::kubernetes: add optional istio-cni config

https://gerrit.wikimedia.org/r/774424

Change 773185 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add istio-cni plugin configs to ml-serve clusters

https://gerrit.wikimedia.org/r/773185

Change 773185 merged by Elukey:

[operations/puppet@production] Add istio-cni plugin configs to ml-serve clusters

https://gerrit.wikimedia.org/r/773185

Change 775343 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Apply the istio sidecar/mesh settings to the ml-serve configs

https://gerrit.wikimedia.org/r/775343

Change 775344 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: refactor support for egress

https://gerrit.wikimedia.org/r/775344

Change 775345 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Move ml-serve pod configs to Istio proxy sidecars

https://gerrit.wikimedia.org/r/775345

Change 775343 merged by Elukey:

[operations/deployment-charts@master] Apply the istio sidecar/mesh settings to the ml-serve configs

https://gerrit.wikimedia.org/r/775343

Change 775344 merged by Elukey:

[operations/deployment-charts@master] knative-serving: refactor support for egress

https://gerrit.wikimedia.org/r/775344

Change 775345 merged by Elukey:

[operations/deployment-charts@master] Move ml-serve pod configs to Istio proxy sidecars

https://gerrit.wikimedia.org/r/775345

Change 775824 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Update helmfie_istio-proxy

https://gerrit.wikimedia.org/r/775824

Change 775825 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Enable istio proxy injection in ml-serve's kserve ns and update settings

https://gerrit.wikimedia.org/r/775825

Change 775824 merged by Elukey:

[operations/deployment-charts@master] Update helmfie_istio-proxy

https://gerrit.wikimedia.org/r/775824

Change 775825 merged by Elukey:

[operations/deployment-charts@master] Enable istio proxy injection in ml-serve's kserve ns and update settings

https://gerrit.wikimedia.org/r/775825

Change 775855 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Update istio mesh settings for ml-serve

https://gerrit.wikimedia.org/r/775855

Change 775855 merged by Elukey:

[operations/deployment-charts@master] Update istio mesh settings for ml-serve

https://gerrit.wikimedia.org/r/775855

Change 775884 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::worker: add calico/istio CNI settings to ml-serve-codfw

https://gerrit.wikimedia.org/r/775884

Change 775884 merged by Elukey:

[operations/puppet@production] role::ml_k8s::worker: add calico/istio CNI settings to ml-serve-codfw

https://gerrit.wikimedia.org/r/775884

Change 776856 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add a namespace selector to helmfile_istio-proxy's config

https://gerrit.wikimedia.org/r/776856

Change 776856 merged by Elukey:

[operations/deployment-charts@master] Add a namespace selector to helmfile_istio-proxy's config

https://gerrit.wikimedia.org/r/776856

We have finally our clusters running with the istio sidecars, a long journey but I hope it will pay off.

Change 778247 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: Allow prometheus to scrape istio sidecar's port

https://gerrit.wikimedia.org/r/778247

Change 778247 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: Allow prometheus to scrape istio sidecar's port

https://gerrit.wikimedia.org/r/778247