
Install Istio on ml-serve cluster
Closed, ResolvedPublic

Description

For the Lift-Wing proof of concept, we want to install KFServing.

Istio is the primary dependency of both KFServing & Knative.

We should be able to install via helm:
https://istio.io/latest/docs/setup/install/helm/
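
For reference, a minimal sketch of what the helm-based install from the linked page could look like, using the charts shipped in the release tarball (chart paths and release names are illustrative, nothing here has been tested on our clusters yet):

kubectl create namespace istio-system
# base chart: CRDs and cluster-wide resources
helm install istio-base manifests/charts/base -n istio-system
# control plane (istiod)
helm install istiod manifests/charts/istio-control/istio-discovery -n istio-system
# ingress gateway
helm install istio-ingress manifests/charts/gateways/istio-ingress -n istio-system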

Event Timeline

After talking with @elukey last week, we both seemed to agree that we should install Istio without the full service mesh (sidecar injection) for our proof of concept. We do not need the full mesh network at this point and it introduces considerable overhead to the MVP.

The KFServing docs also mention this as a quick way to get started: https://github.com/kubeflow/kfserving#prerequisites

Today I followed up on the Kubeflow Slack (there is a kfserving channel) and I got a couple of interesting links:

https://github.com/kubeflow/kfserving/blob/master/hack/quick_install.sh
https://github.com/ajinkya933/Kubeflow-Serving

From the quick install script, it seems that the bare minimum config to make everything work is (sketched below):

  1. An istio namespace
  2. Some basic config for Ingress
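
A rough sketch of those two items (values are illustrative, in the spirit of the quick-install script rather than copied from it):

# 1. the istio namespace
kubectl create namespace istio-system

# 2. a minimal IstioOperator config with only the ingress gateway enabled
cat > istio-minimal-operator.yaml <<'EOF'
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
EOF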

The steps outlined in https://istio.io/latest/docs/setup/install/helm/ seem geared toward a more complete setup. All the Helm charts are available in the release tarball, so hopefully it will not be too hard to test them (I still need to figure out how to run helm install only on our cluster for tests without making a mess elsewhere).

From another angle: https://knative.dev/docs/install/installing-istio/#installing-istio. Our dear Knative needs Istio as well, and it seems better to use 1.8.2 (the latest upstream is 1.9).

Both approaches (except the helm one) use the istioctl command, which IIUC is a binary shipped with each release that automates some of the manual work. There is also a mention of istiod, which should be the Istio daemon needed as the control plane when using the service mesh, but we shouldn't need it for now.

How to make everything work is still a bit unclear to me, but I'll keep the task updated :D

Hey @elukey, this is the script we're using for e2e testing in the kfserving community. It uses the most recent versions of Istio and Knative, with their operators and with sidecar injection disabled as per the requirement above.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled
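
For reference, a spec like this gets applied with istioctl, e.g. (file name is just an example):

istioctl manifest apply -f istio-minimal-operator.yaml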

@Theofpa thanks a lot for following up! I have a generic question about what Istio setup we should try to pursue. My understanding is that a service mesh is not strictly needed to serve simple models via KFServing, but as soon as the complexity increases a bit, the Istio control plane and Envoy sidecars are needed (to allow service -> service communication). Should we start from the beginning with a full service mesh (since it will surely be needed) or do you think that it is not worth it as a first step?

Service-to-service communication can be enhanced with a service mesh if we require, for example, security policies across the services of that cluster. With Istio sidecar injection enabled in a namespace, each pod gets an extra container running the Envoy proxy, which brings access control, logging, tracing, etc. to the services those pods provide.

So, we need to answer the following question:

What type of workloads are we going to have in this cluster?

Model serving only? Model serving AND other services which communicate with each other via access-control?

In case of model serving only, we don't need to use a service mesh; it would only be overhead. We can just use the Istio ingress gateway and the Istio VirtualServices as managed by the kfserving reconciler.

In case we are going to host other services as well (for example, public services which we want to prevent from accessing the model services), we can benefit from the service mesh. We can have Istio manage the communication across namespaces and their services based on roles, and track this communication with metrics and tracing in Istio's Prometheus & Jaeger.

It looks like this is a cluster dedicated to model serving, and any incoming traffic will be managed by northbound interfaces. So my recommendation would be to keep the sidecar injection disabled.
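
For reference, sidecar injection is opt-in per namespace (via the istio-injection=enabled label), so keeping namespaces unlabeled is enough to stay injection-free; a quick check:

# shows the istio-injection label (empty = no sidecar injection) for every namespace
kubectl get namespaces -L istio-injection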

@Theofpa makes sense, our use case is surely only model serving, and the service mesh seemed like overkill to us, so it is good to have it confirmed :) My doubts were related to use cases like:

  1. Fetch the model from storage (if not included in the Docker image), like S3/Swift (we have an internal cluster).
  2. Fetch features from a cache/store/etc. (not sure if needed for the first models, but I am pretty sure the use case will come up).

My fear was that the above use cases would need dedicated micro-services, and hence the Istio service mesh. If that is not the case then I am very happy. I see that Istio offers some Helm charts; it would be great to fit them into our deployment-charts repo, which is what Tobias and I are going to work on in the immediate future (trying also to fit Istio's requirements into the RBAC policies that the SRE team suggests for the kubernetes clusters).

Having said that, please note that I have zero experience in kubernetes and ML, so I hope I haven't written anything totally off!

elukey subscribed.

Today I tried to think about next steps for this task, and I have some thoughts, lemme know :)

From T278194#6964746 it seems that we should target istio 1.6.2 for our current environment. It is almost a year old and not very up to date, but until we upgrade kubernetes it seems better to follow what works best with knative 0.18 (we may have some flexibility for istio versions, so let's say a version close to 1.6.2).

So we can start from https://github.com/kubeflow/kfserving/blob/master/test/scripts/run-e2e-tests.sh#L39-L71, in which we can see a simple example about how and what to deploy to get a minimal istio config:

  • istio gateway
  • istio operator
  • istiod for the control plane

IIUC, all of the above (including pulling images from Docker Hub) is handled in the script by the istioctl binary, shipped with all Istio releases. We want to use helm 3 if possible; the first indication of how to do it was added in the 1.8 docs (https://istio.io/v1.8/docs/setup/install/helm/), but in theory we should be able to make it work on 1.6 without much trouble (famous last words).
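
One low-risk way to check that, assuming the 1.6 tarball has the same charts layout, is to render the charts locally with helm template and inspect the output before applying anything:

# render the istiod chart locally; nothing is applied to the cluster here
helm template istiod manifests/charts/istio-control/istio-discovery \
  -n istio-system > /tmp/istiod-rendered.yaml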

The big missing piece at the moment is the Docker images, which we should somehow end up having in our internal Wikimedia Docker registry. I followed up with Service Ops today and they pointed me to how calico is packaged: we pick a certain release, verify it, and copy the binaries into a deb package. The next step is to figure out what Docker images are needed, and whether we can create them in our Docker registry.

We can also decide later whether it is more convenient to use istioctl or helm (the latter seems more self-descriptive and better for documentation).

I was able to bootstrap minikube with k8s 1.20.2 (the other versions failed due to cgroup issues..)

elukey@wintermute:~/Wikimedia/scratch-dir/istio-1.6.2$ ./bin/istioctl operator init
Using operator Deployment image: docker.io/istio/operator:1.6.2
✔ Istio operator installed                                                                                           
✔ Installation complete

docker@minikube:~$ docker ps | grep istio | grep -v pause
88d852cb49d4   istio/operator         "operator server"        About a minute ago   Up About a minute             k8s_istio-operator_istio-operator-5668d5ddb-kkk9t_istio-operator_7c4aaa0f-f6a6-4f81-81f5-e47d9cb6e887_0

docker@minikube:~$ docker images | grep istio
istio/operator                            1.6.2      69540da46816   10 months ago   223MB

Then:

elukey@wintermute:~/Wikimedia/scratch-dir/istio-1.6.2$ ./bin/istioctl manifest apply -y -f ./istio-minimal-operator.yaml
✔ Istio core installed                                                                                               
✔ Istiod installed                                                                                                   
✔ Ingress gateways installed                                                                                         
✔ Addons installed                                                                                                   
✔ Installation complete  


docker@minikube:~$ docker ps | grep istio | grep -v pause
f4e15ec117a1   istio/proxyv2          "/usr/local/bin/pilo…"   19 seconds ago   Up 18 seconds             k8s_istio-proxy_prometheus-56944b6bd5-x99j8_istio-system_f2e27f41-3d04-40ab-8581-2707d361566a_0
b71a6e34a987   61bf337f2956           "/bin/prometheus --s…"   21 seconds ago   Up 21 seconds             k8s_prometheus_prometheus-56944b6bd5-x99j8_istio-system_f2e27f41-3d04-40ab-8581-2707d361566a_0
34333b758ace   14e45d814562           "/usr/local/bin/pilo…"   22 seconds ago   Up 21 seconds             k8s_discovery_istiod-c4cfbfb6c-l5mzq_istio-system_257abb97-84d2-4b45-9d98-60dd6694e620_0
db82ac640191   1162f09e0728           "/usr/local/bin/pilo…"   23 seconds ago   Up 22 seconds             k8s_istio-proxy_istio-ingressgateway-57bd88c95c-g7v66_istio-system_eb649955-8963-40c8-af27-b8ca297b0bba_0
88d852cb49d4   istio/operator         "operator server"        6 minutes ago    Up 6 minutes              k8s_istio-operator_istio-operator-5668d5ddb-kkk9t_istio-operator_7c4aaa0f-f6a6-4f81-81f5-e47d9cb6e887_0

docker@minikube:~$ docker images | grep istio
istio/proxyv2                             1.6.2      1162f09e0728   10 months ago   304MB
istio/pilot                               1.6.2      14e45d814562   10 months ago   237MB
istio/operator                            1.6.2      69540da46816   10 months ago   223MB

elukey@wintermute:~/Wikimedia/scratch-dir/istio-1.6.2$ kubectl get namespaces
NAME              STATUS   AGE
default           Active   12m
istio-operator    Active   9m40s
istio-system      Active   4m58s

elukey@wintermute:~/Wikimedia/scratch-dir/istio-1.6.2$ kubectl get pods -n istio-operator
NAME                             READY   STATUS    RESTARTS   AGE
istio-operator-5668d5ddb-kkk9t   1/1     Running   0          10m

elukey@wintermute:~/Wikimedia/scratch-dir/istio-1.6.2$ kubectl get pods -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-57bd88c95c-g7v66   1/1     Running   0          4m10s
istiod-c4cfbfb6c-l5mzq                  1/1     Running   0          4m8s
prometheus-56944b6bd5-x99j8             2/2     Running   0          4m8s

Even if I used istioctl for this use case (and not helm), we should have a complete list of Docker images to add to our internal registry. In theory the best thing would be to avoid pulling from Docker Hub directly, and https://github.com/istio/istio/blob/release-1.6/tools/istio-docker.mk looks promising.

Addendum - the istio operator pod is needed only if we want to support istioctl; it seems not to be needed when using helm. As a starting point, we could try to import istio/proxyv2 and istio/pilot into the WMF Docker registry, and then come up with some Helm charts for Istio.
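
As a stop-gap before proper production-images builds, the import could be as simple as a pull/retag/push (destination path is illustrative, and we may well rebuild from source instead):

for img in proxyv2 pilot; do
  docker pull docker.io/istio/${img}:1.6.2
  docker tag docker.io/istio/${img}:1.6.2 docker-registry.wikimedia.org/istio/${img}:1.6.2
  docker push docker-registry.wikimedia.org/istio/${img}:1.6.2
done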

Mapping images -> pods:

elukey@wintermute:~/Wikimedia/minikube$ kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' |sort | grep istio | grep -v knative
istiod-c4cfbfb6c-j2m6k:	docker.io/istio/pilot:1.6.2, 
istio-ingressgateway-57bd88c95c-g7v66:	docker.io/istio/proxyv2:1.6.2, 
istio-operator-5668d5ddb-kkk9t:	docker.io/istio/operator:1.6.2, 
prometheus-56944b6bd5-x99j8:	docker.io/prom/prometheus:v2.15.1, docker.io/istio/proxyv2:1.6.2,

Something interesting that I found today is: https://gcsweb.istio.io/gcs/istio-build/dev/1.6-alpha.3ddc57b6d1e15afebefd725e01c0dc7099f3f6dd/docker/

Istio pushes daily builds to gcsweb, also containing the Docker images that we need. I suppose that we could build the docker dir on deneb as well, and then push the Docker images to our registry. We could also use the above website as a source of truth for Docker images.

Links to start:

https://doc.wikimedia.org/docker-pkg/
https://gerrit.wikimedia.org/r/admin/repos/operations/docker-images/production-images

Joe gave me a nice pointer in production-images, namely the loki multi-stage container example. Basically the idea is to build the Go binaries in one container first, then use them in the official Docker image to push to the registry. If we find a way to build istio (which in theory shouldn't be super difficult), we should also be able to re-use Docker images like https://github.com/istio/istio/blob/master/pilot/docker/Dockerfile.proxyv2 relatively easily (same thing for Knative etc..)

More info about which binaries are executed in the minikube test that I did:

docker@minikube:~$ docker ps --no-trunc | grep istio | grep -v pause | grep istio-system  | cut -d '"' -f 2
/usr/local/bin/pilot-discovery discovery --monitoringAddr=:15014 --log_output_level=default:info --domain cluster.local --trust-domain=cluster.local --keepaliveMaxServerConnectionAge 30m
/usr/local/bin/pilot-agent proxy sidecar --domain istio-system.svc.cluster.local istio-proxy-prometheus --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --controlPlaneAuthPolicy NONE --trust-domain=cluster.local
/usr/local/bin/pilot-agent proxy router --domain istio-system.svc.cluster.local --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --serviceCluster istio-ingressgateway --trust-domain=cluster.local
/bin/prometheus --storage.tsdb.retention=6h --config.file=/etc/prometheus/prometheus.yml

The above doesn't include the istio operator (which handles istioctl commands), since we may not need it if we use helm.

I then tried to clone the istio GitHub repo, check out the 1.6.2 tag in a separate branch, and ran make && make docker to see what the build process looked like. In the out/linux_amd64 dir I found:

elukey@wintermute:~/github/istio$ ls out/linux_amd64/
client  docker_build  docker_temp  envoy  istioctl  istio_is_init  logs  mixc  mixgen  mixs  node_agent  operator  pilot-agent  pilot-discovery  policybackend  release  sdsclient  server

There also seems to be some pre-baked environment/layout to build the docker images:

elukey@wintermute:~/github/istio/out/linux_amd64/docker_build$ ls
docker.app  docker.app_sidecar  docker.istioctl  docker.mixer  docker.mixer_codegen  docker.operator  docker.pilot  docker.proxyv2  docker.test_policybackend
elukey@wintermute:~/github/istio/out/linux_amd64/docker_build$ ls docker.proxyv2/
Dockerfile.proxyv2  envoy  envoy_bootstrap_v2.json  envoy_policy.yaml.tmpl  gcp_envoy_bootstrap.json  metadata-exchange-filter.wasm  pilot-agent  stats-filter.wasm
A first sketch of a build stage (for production-images) that builds the Istio binaries:

FROM docker-registry.wikimedia.org/golang:1.13-3 as build

ENV ISTIO_VERSION=1.6.2
ENV SOURCE_REPO=https://github.com/istio/istio.git
ENV REPO_BASE=/go/github.com/istio/istio

ENV BUILD_WITH_CONTAINER=0
ENV GOARCH=amd64
ENV GOOS=linux

WORKDIR /go

USER root
RUN apt-get update && apt-get install -y curl ca-certificates

USER nobody
RUN mkdir -p $REPO_BASE \
  && cd $REPO_BASE \
  && git clone $SOURCE_REPO \
  && cd istio \
  && git checkout tags/$ISTIO_VERSION

WORKDIR $REPO_BASE/istio
RUN make build-linux

The above seems ok to just build the istio binaries!
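
To complete the multi-stage idea, a second stage could then copy the resulting binaries into a slim runtime image, roughly like this (base image, paths and entrypoint are assumptions, not a tested Dockerfile):

# second stage: package pilot-discovery (istiod) on top of a minimal base image
FROM docker-registry.wikimedia.org/bullseye:latest
COPY --from=build /go/github.com/istio/istio/istio/out/linux_amd64/pilot-discovery /usr/local/bin/pilot-discovery
USER nobody
ENTRYPOINT ["/usr/local/bin/pilot-discovery"]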

Current idea:

  • multi-stage docker build to generate the images to push to our registry
  • light debian packaging for istioctl, to deploy it on the deployment server, to be able to control the istio mesh.

Change 688211 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] WIP - Add istio base images build support

https://gerrit.wikimedia.org/r/688211

https://gerrit.wikimedia.org/r/688211 reached a good stage, but there are still some unclear points about how to deploy istio. IIUC there are two ways:

  • pure helm charts - this is similar to what we use for other services
  • istioctl manifest

The latter seems to be upstream's preferred way, but from https://github.com/istio/istio/blob/master/manifests/charts/UPDATING-CHARTS.md it seems that istioctl / the IstioOperator (which corresponds to a container in the control plane that handles istioctl's commands) gets built with the istio repository's helm charts, so that it can translate an istioctl manifest into a helm chart that then gets deployed (basically the istioctl manifest is expanded into a helm chart behind the scenes). This way of hiding complexity might be good on one side, but it poses some questions on the other. For example, if what I wrote above is true, a custom istio helm chart may force admins to rebuild the binaries before it can be applied via istioctl.

What road we will take depends a lot on the trade offs that we want to make. It is a decision that we should share with SRE to be able to establish a common best practice.

https://github.com/istio/istio/blob/master/operator/ARCHITECTURE.md#manifest-creation is very informative. To refine what I wrote above:

  • istioctl and the istio operator have the base helm charts compiled into their binaries (not entirely sure if both carry the same charts)
  • an istio manifest gets translated into helm, and overrides can be applied as well
  • istioctl basically acts as a simplification layer on top of helm

The same question about next steps stands - should we rely on istioctl or on specific helm charts in our repos?

I was able to test the httpbin example with the Ingress gateway following https://istio.io/latest/docs/tasks/traffic-management/ingress/ingress-control, and I got it working on minikube with the images locally built for https://gerrit.wikimedia.org/r/688211, all good!

By default istioctl uses the default profile, which incorporates a complete Prometheus stack (not just exporters, but the whole polling stack), so the image for prom/prometheus was needed as well. This is of course beyond our goal, so I found the following to prevent it:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  addonComponents:
    prometheus:
      enabled: false

We'll have to think about how to connect our Prometheus k8s infrastructure to Istio's telemetry, but it should hopefully be a matter of helm configuration.
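
For later, a minimal sketch of the scrape side (ports are Istio's defaults: istiod exposes /metrics on 15014 via the http-monitoring service port; job name and discovery details are illustrative):

scrape_configs:
  - job_name: istiod
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [istio-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: istiod;http-monitoring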

Since we're not planning to use the service-mesh functionality of istio but only the ingress gateway, there is no reason to connect istio with prometheus.

I understand that we'll not have Envoy proxy-related metrics, but I'd expect to see some metrics about how istiod works (pilot metrics maybe), and possibly also how many requests are going through the Istio gateway, since IIUC it runs Envoy behind the scenes as well. Flying completely blind about Istio seems strange, but I am not an expert in kubernetes, so if the metrics are already published by other means that will be ok as well. My point is that we should investigate what metrics are offered from the start; bootstrapping a service in production without any trace of what it is doing makes me uncomfortable.

Note on communication style (my 2c): a comment like "there is no reason to" about somebody else's work may come across as a little too direct. In this case I completely understand that you are helping us and we are really grateful, but there is a lot of context that is not self-contained in this task that you may be missing (like the SRE team being interested in Istio and how it works as well, etc..).

Updates:

  • The 1.6.2 Dockerfiles have been reviewed and their images should be available on our docker registry soon (next week).
  • Ideally we'd like to run a more up-to-date version of istio, like 1.9.5, so we'll try to import it as well over the next few days. It requires Go 1.15, so we'll need SRE to add the bullseye-golang build images first (work in progress).
  • After a chat with the SRE team we'll try to import the Istio helm charts in our repository instead of relying on istioctl, I'll try to test them on minikube next week.

@Theofpa so far I tried to follow your guidelines outlined in T278194#6964746, but I am wondering if there is any issue mixing something like the following:

  • istio 1.9.5/1.10.0 + knative 0.18.1 + kfserving 0.5.1 + cert-manager (any version)

The only thing that we should keep "fixed" is knative since we run on k8s 1.16, but it would be nice for example to bump up istio to get the latest security fixes (I see some CVEs out for older versions) and a better helm support (following https://istio.io/latest/docs/setup/install/helm with the charts generated for 1.6.14 doesn't work well).

What do you think? Thanks in advance :)

I understand that we'll not have Envoy proxy-related metrics, but I'd expect to see some metrics about how istiod works (pilot metrics maybe), and possibly also how many requests are going through the Istio gateway, since IIUC it runs Envoy behind the scenes as well. Flying completely blind about Istio seems strange, but I am not an expert in kubernetes, so if the metrics are already published by other means that will be ok as well. My point is that we should investigate what metrics are offered from the start; bootstrapping a service in production without any trace of what it is doing makes me uncomfortable.

Ah, I've overlooked the case of monitoring istio itself, it makes sense.

Note on communication style (my 2c): a comment like "there is no reason to" about somebody else's work may come across as a little too direct. In this case I completely understand that you are helping us and we are really grateful, but there is a lot of context that is not self-contained in this task that you may be missing (like the SRE team being interested in Istio and how it works as well, etc..).

Thanks for that advice, it wasn't my intention to be direct, so sorry, I can be more careful in the future!

@Theofpa so far I tried to follow your guidelines outlined in T278194#6964746, but I am wondering if there is any issue mixing something like the following:

  • istio 1.9.5/1.10.0 + knative 0.18.1 + kfserving 0.5.1 + cert-manager (any version)

The versions I've mentioned are the ones we've tested in the e2e tests of kfserving. It wouldn't be a surprise if other combinations work, although I recall I've always tried to test the latest versions of both istio & knative for each kfserving release, and it didn't always work.

The only thing that we should keep "fixed" is knative since we run on k8s 1.16, but it would be nice for example to bump up istio to get the latest security fixes (I see some CVEs out for older versions) and a better helm support (following https://istio.io/latest/docs/setup/install/helm with the charts generated for 1.6.14 doesn't work well).

It makes sense, security wise.

In the kfserving community meetings, people share that they're not really using istio for anything other than the ingress gateway. There are also some who are evaluating underlying technologies other than knative and istio. The driver for that is the need to be less dependent on a specific and heavy service mesh and serverless technology. Most probably we will end up having multiple flavours of kfserving, with combinations of different dependency stacks.

If the SRE team is evaluating a more general use of Istio across other services as well, I understand that you might want to adopt Istio as the preferred service mesh for lift-wing too; after all, there is no alternative at the moment. But keeping in mind that kfserving might not be tightly coupled with Istio in the future can help you make decisions about the level of Istio adoption you will have vs other meshes/SMI.

Some interesting things discovered recently about istio deployment via helm:

  • istioctl offers a sub-command called manifest generate that emits the generated yaml that a given manifest leads to. It is very useful to inspect before applying a new setting, since it can also show diffs between two generated manifests (see the sketch after this list). Upstream notes that one could be tempted to apply its output directly via kubectl apply -f, but doing so may lead to inconsistencies, since istioctl (via helm) applies resources in a specific order. More info: https://istio.io/latest/docs/setup/install/istioctl/#generate-a-manifest-before-installation
  • istio deployment via helm is still considered alpha, and it doesn't seem to work as expected on 1.6.x (while it does with 1.9.x).
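
For example (file names are placeholders):

# render the full manifest our config expands to, without applying it
istioctl manifest generate -f config.yaml > /tmp/istio-new.yaml
# compare two rendered manifests before rolling out a change
istioctl manifest diff /tmp/istio-current.yaml /tmp/istio-new.yaml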

I also had a chat with Giuseppe about istioctl vs helm, and one interesting idea came up: we could use istioctl instead of helm to be more resilient to future upstream changes (even a switch from helm to something else), but store its configuration in the deployment-charts repo anyway (in a dedicated directory). We could then create a Debian package carrying the various istioctl versions (since, as mentioned, the istio repo's make gen-charts target generates a Go file containing the code to render the helm charts, which gets compiled into the istioctl binary) and use them from the deployment host (like we use helm etc..).

The above idea is surely a good compromise between flexibility and control of what gets deployed. I'll work on a proposal for deployment-charts :)
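
A hypothetical sketch of how that could look from the deployment host (package layout, binary name and paths are illustrative, matching the idea above rather than anything that exists yet):

# versioned istioctl shipped via our .deb, config kept in deployment-charts
istioctl-1.6.14 manifest generate \
  -f deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml > /tmp/rendered.yaml
istioctl-1.6.14 manifest apply \
  -f deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml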

Change 697938 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] [WIP] - Add the operators.d directory with basic Istio config

https://gerrit.wikimedia.org/r/697938

Change 688211 merged by Elukey:

[operations/docker-images/production-images@master] Add istio base images build support

https://gerrit.wikimedia.org/r/688211

Change 699156 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker::builder: add the istio use case to image builders

https://gerrit.wikimedia.org/r/699156

Change 699156 merged by Elukey:

[operations/puppet@production] profile::docker::builder: add the istio use case to image builders

https://gerrit.wikimedia.org/r/699156

Change 699212 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker: add ca config to the build istio config

https://gerrit.wikimedia.org/r/699212

Change 699212 merged by Elukey:

[operations/puppet@production] profile::docker: add ca config to the build istio config

https://gerrit.wikimedia.org/r/699212

There seems to be some consensus about istioctl in https://gerrit.wikimedia.org/r/697938, so the next step is to create a simple .deb that deploys istioctl versions.

In order to create the istioctl gerrit repo, I'd need:

ssh -p 29418 gerrit.wikimedia.org 'gerrit create-project operations/debs/istioctl -d "Debian package for the istioctl command line tool" -o ldap/ops -p operations/debs'

Change 700012 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/istioctl@master] Add initial debianization for istioctl 1.6.14

https://gerrit.wikimedia.org/r/700012

Change 700012 merged by Elukey:

[operations/debs/istioctl@master] Add initial debianization for istioctl 1.6.14

https://gerrit.wikimedia.org/r/700012

Change 700162 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/istioctl@master] Skip the dwz step

https://gerrit.wikimedia.org/r/700162

Change 700162 merged by Elukey:

[operations/debs/istioctl@master] Skip the dwz step

https://gerrit.wikimedia.org/r/700162

Change 700203 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kubernetes::deployment_server: add istioctl package

https://gerrit.wikimedia.org/r/700203

Change 700396 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add istio 1.9.5 images

https://gerrit.wikimedia.org/r/700396

Change 700397 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/istioctl@master] Add support for istioctl 1.9.5

https://gerrit.wikimedia.org/r/700397

Change 700397 merged by Elukey:

[operations/debs/istioctl@master] Add support for istioctl 1.9.5

https://gerrit.wikimedia.org/r/700397

Change 700203 merged by Elukey:

[operations/puppet@production] profile::kubernetes::deployment_server: add istioctl package

https://gerrit.wikimedia.org/r/700203

Change 700396 merged by Elukey:

[operations/docker-images/production-images@master] Add istio 1.9.5 images

https://gerrit.wikimedia.org/r/700396

Change 701067 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] istio: add a more specific Depends target

https://gerrit.wikimedia.org/r/701067

Change 701067 merged by Elukey:

[operations/docker-images/production-images@master] istio: add a more specific build target

https://gerrit.wikimedia.org/r/701067

Change 697938 merged by Elukey:

[operations/deployment-charts@master] Add the custom_deploy.d directory with basic Istio config

https://gerrit.wikimedia.org/r/697938

I tried to deploy the istio config outlined in https://gerrit.wikimedia.org/r/697938 to ml-serve-eqiad and these are the problems that came up:

  • istiod seems to rely on a private CA, and if it doesn't find one saved among its secrets it creates a self-signed one. This is something to think about after the prototype phase (do we want a different solution?)
  • To make the istiod pod work, I had to add an override for the kubernetes api like the following (since we don't have IP SANs in the k8s API's TLS certs):
components:
  pilot:
    k8s:
      env:
      - name: 'KUBERNETES_SERVICE_HOST'
        value: 'kubernetes.default.svc.cluster.local'
      - name: 'KUBERNETES_SERVICE_PORT'
        value: '443'
  • I also had to execute the following manually via kubectl (we will need to add this somewhere):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-restricted-psp
  namespace: istio-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-restricted-psp
subjects:
  - kind: ServiceAccount
    name: istiod-service-account
    namespace: istio-system
  • The istiod pod tries to validate a bad/wrong config as a first step, to test that the webhook works (i.e. that calls from the k8s API to the webhook work fine, etc.). This doesn't work in our cluster; the problem is outlined in T285927
  • I have seen a lot of errors related to the Horizontal Pod Autoscaler configs not finding cpu/memory metrics to use for scaling pods up and down. This is because we don't have a metrics-server infrastructure in production; details in T249929.

Change 704507 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add missing envoy config files to istio proxyv2

https://gerrit.wikimedia.org/r/704507

Change 704507 merged by Elukey:

[operations/docker-images/production-images@master] Add missing envoy config files to istio proxyv2

https://gerrit.wikimedia.org/r/704507

Finally!

elukey@ml-serve-ctrl1001:~$ kubectl get pods -A 
NAMESPACE      NAME                                       READY   STATUS    RESTARTS   AGE
istio-system   cluster-local-gateway-585f96dccc-dtknd     1/1     Running   0          2m7s
istio-system   istio-ingressgateway-657b89d44d-wmqhg      1/1     Running   0          2m7s
istio-system   istiod-68d4cb6c9-4759f                     1/1     Running   0          2m14s

Change 704552 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: improve base config.yaml for ml-serve

https://gerrit.wikimedia.org/r/704552

Change 704552 merged by Elukey:

[operations/deployment-charts@master] istio: improve base config.yaml for ml-serve

https://gerrit.wikimedia.org/r/704552

Things to do before closing:

  1. Do we need to add a custom TLS certificate for istiod? If we don't add one, istiod creates one itself, but it is not clear whether it auto-renews etc..
  2. We should automate the istioctl steps in deployment-charts a bit more (there are some TODOs in the README)
  3. We should find a way to add RBAC rules for Istio. The following was needed (applied manually) to make it work in prod:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-restricted-psp
  namespace: istio-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-restricted-psp
subjects:
  - kind: ServiceAccount
    name: istiod-service-account
    namespace: istio-system

Change 708529 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Change docker images used for knative-serving

https://gerrit.wikimedia.org/r/708529

Change 708529 merged by Elukey:

[operations/deployment-charts@master] Change docker images used for knative-serving

https://gerrit.wikimedia.org/r/708529

Change 708545 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: override KUBERNETES_SERVICE_HOST

https://gerrit.wikimedia.org/r/708545

Change 708545 merged by Elukey:

[operations/deployment-charts@master] knative-serving: override KUBERNETES_SERVICE_HOST and update images

https://gerrit.wikimedia.org/r/708545

elukey claimed this task.

Things to do before closing:

  1. Do we need to add a custom TLS certificate for istiod? If we don't add one, istiod creates one itself, but it is not clear whether it auto-renews etc..
elukey@ml-serve-ctrl1001:~$ kubectl get secrets istio-ca-secret -o jsonpath="{.data.ca-cert\.pem}" -n istio-system  | base64 --decode | openssl x509 -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            9e:e0:ed:94:2b:82:6e:56:2c:ee:6f:79:01:2d:64:9f
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: O = cluster.local
        Validity
            Not Before: Jun 30 09:02:19 2021 GMT
            Not After : Jun 28 09:02:19 2031 GMT

It seems that the default expiry date for the istio CA is far in the future, so I am inclined to mark this as a non-issue for the MVP and open a task to track a better solution (likely using cert-manager).
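
For the record, a quicker check of just the expiry date (same secret as above):

kubectl get secrets istio-ca-secret -o jsonpath="{.data.ca-cert\.pem}" -n istio-system | base64 --decode | openssl x509 -noout -enddate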