
Install Knative on ml-serve cluster
Closed, ResolvedPublic

Description

The Lift-Wing proof of concept requires Knative to be installed in order to run KFServing.

We need Knative Serving: v0.14.3+

The Knative docs say to install via k8s CRDs/Operators:
https://knative.dev/docs/install/any-kubernetes-cluster/

There is also some prior art around creating a custom helm chart (which fits better into the WMF stack):
https://github.com/triggermesh/charts

Note: cluster-local-gateway is required to serve cluster-internal traffic for the transformer and explainer use cases (unless we are running v0.19.0 or newer).
Please follow the instructions here to install the cluster-local gateway.
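For reference, the cluster-local gateway in the Knative/Istio docs of this era is essentially an additional Istio Gateway selecting a cluster-local ingress deployment; a minimal sketch along the lines of the upstream examples (names follow upstream, not necessarily our final config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: cluster-local-gateway
  namespace: knative-serving
spec:
  selector:
    istio: cluster-local-gateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
```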

Event Timeline

Little note from https://knative.dev/docs/install/any-kubernetes-cluster/#before-you-begin

Knative v0.21.0 requires a Kubernetes cluster v1.17 or newer, as well as a compatible kubectl.

We are running 1.16 at the moment; next fiscal year we'll work with SRE on 1.20, but there is no clear timeline yet.

Can we use a less recent version that is compatible with 1.16? Is that a viable path, given that Knative is relatively young and releases roughly every couple of months? (see https://github.com/knative/serving/tags)

@Theofpa Any guidance from you on this would be really helpful :)

I've made a version compatibility matrix from our recent tests (kfserving#1334, kfserving#1482):

kubernetes   istio   knative
1.16         1.3.1   0.17
1.16         1.6.2   0.18
1.17         1.7.1   0.20
1.19         1.8.2   0.21
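As a quick sanity helper, the matrix above can be encoded as a tiny lookup (versions are hard-coded from the tested rows only; anything outside them is not covered):

```shell
# Map a Kubernetes minor version to the newest Knative Serving release we
# tested against it, per the compatibility matrix above.
knative_for_k8s() {
  case "$1" in
    1.16) echo "0.18" ;;  # with istio 1.6.2
    1.17) echo "0.20" ;;  # with istio 1.7.1
    1.19) echo "0.21" ;;  # with istio 1.8.2
    *)    return 1 ;;     # not covered by our tests
  esac
}

knative_for_k8s 1.16   # prints 0.18
```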

The transition from knative <= 0.18 to knative >= 0.19 introduced a change that impacted kfserving: the deprecation of the cluster-local-gateway.

I understand from T272918 that the requirement was for k8s 1.16-1.18 version and the delivered cluster is a k8s-1.16.

It looks like we have two options:

  1. Stay with k8s-1.16 and install knative-0.18
    • Problem: on a future upgrade, we'll have to deal with the migration from cluster-local-gateway to knative-local-gateway.
  2. Get a k8s-1.19 and install knative-0.21
    • Problem: I assume the delivery time will be long and will impact the project delivery. Unless we can request the upgrade of that (currently empty) cluster to k8s 1.17, 1.18 or 1.19?

I would recommend going with the most recent versions, as we are in a greenfield situation and that way we can stay up to date for a longer period.

This is really important information, thanks! I think that for the MVP we can go for 1.16 + 0.18, and then decide later on what to do. IIUC our SRE team is planning to introduce k8s 1.20 later in the year, so we could possibly anticipate the need and be the first ones to test it (before going really live).

I do share the opinion that we should stay as close to upstream as possible, especially to get the latest bugfixes from knative upstream if needed. I'd be worried about ending up stuck on 0.18 with bugs whose fixes only land in later versions (and backporting patches to 0.18 is not a great idea either).

Tried to install knative + istio following https://github.com/kubeflow/kfserving/blob/master/test/scripts/run-e2e-tests.sh#L75-L102 on minikube + k8s 1.20.2 (1.16.0 does not seem to run well with minikube):

elukey@wintermute:~/Wikimedia/minikube$ kubectl apply -f operator.yaml 
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/knativeeventings.operator.knative.dev created
customresourcedefinition.apiextensions.k8s.io/knativeservings.operator.knative.dev created
configmap/config-logging created
configmap/config-observability created
deployment.apps/knative-operator created
clusterrole.rbac.authorization.k8s.io/knative-serving-operator-aggregated created
clusterrole.rbac.authorization.k8s.io/knative-serving-operator created
clusterrole.rbac.authorization.k8s.io/knative-eventing-operator-aggregated created
clusterrole.rbac.authorization.k8s.io/knative-eventing-operator created
clusterrolebinding.rbac.authorization.k8s.io/knative-serving-operator created
clusterrolebinding.rbac.authorization.k8s.io/knative-serving-operator-aggregated created
clusterrolebinding.rbac.authorization.k8s.io/knative-eventing-operator created
clusterrolebinding.rbac.authorization.k8s.io/knative-eventing-operator-aggregated created
serviceaccount/knative-operator created

elukey@wintermute:~/Wikimedia/minikube$ cat knative-serving.yaml 
apiVersion: v1
kind: Namespace
metadata:
 name: knative-serving
 labels:
   istio-injection: enabled
---
apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
elukey@wintermute:~/Wikimedia/minikube$ kubectl apply -f knative-serving.yaml 
namespace/knative-serving created
knativeserving.operator.knative.dev/knative-serving created

docker@minikube:~$ docker ps | grep -v pause | grep knative
9291aafdb339   gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler   "/ko-app/autoscaler"     13 seconds ago   Up 12 seconds             k8s_autoscaler_autoscaler-6cb7c9d4fb-s9qn5_knative-serving_c5a7465e-be7a-455f-bd18-8c82f4092396_0
f09372a2e006   gcr.io/knative-releases/knative.dev/serving/cmd/activator    "/ko-app/activator"      27 seconds ago   Up 26 seconds             k8s_activator_activator-666887556-qfsv9_knative-serving_11e760ce-df4b-4c59-87d8-7cb51c083a54_0
64fbbf25845b   gcr.io/knative-releases/knative.dev/operator/cmd/operator    "/ko-app/operator"       3 minutes ago    Up 3 minutes              k8s_knative-operator_knative-operator-6b6fb7bdf5-tqn94_default_8436e577-af49-4d25-8f16-90601db4c515_0

docker@minikube:~$ docker images | grep knative
gcr.io/knative-releases/knative.dev/operator/cmd/operator    <none>     c9cf5f68657a   4 months ago    70.8MB
gcr.io/knative-releases/knative.dev/serving/cmd/activator    <none>     1d721a5f82f5   5 months ago    64.2MB
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler   <none>     f736a5dbb725   5 months ago    64.3MB
gcr.io/knative-releases/knative.dev/serving/cmd/controller   <none>     514b2f906521   5 months ago    69.4MB

The docs in https://knative.dev/docs/install/install-serving-with-yaml/ look nicely reproducible on minikube:

kubectl apply -f https://github.com/knative/serving/releases/download/v0.18.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/v0.18.0/release.yaml

I don't see official support for Helm; from https://github.com/knative/docs/issues/311#issuecomment-639170274 it seems that upstream prefers operators over Helm (IIUC they had concerns with Tiller and Helm 2.0). There are probably some third-party Helm charts that we can use, or we can create our own if needed; we'll see.

The other thing to do is to add the following images to our registry:

elukey@wintermute:~/Wikimedia/minikube$ kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' |sort | grep knative
activator-79f56666d8-x78ps:	gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:69065cec1c1d57d1b16eb448c1abd895c2c554ef0ec19bedd1c14dc3150d2ff1, 
autoscaler-558966cc68-4nt74:	gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:bc1f5dc5594e880dcb126336d8344f0a87cf22075ef32eebd3280e6548ef22ef, 
controller-6866b6ffdd-brw2z:	gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:8b2b5d06ab5b3bbbe0f40393b3e39f6aceb80542d5cfbab97e89758b59b5ef6e, 
istio-webhook-f66f5d879-pbl6c:	gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e0b6d3928e6b731f21ca17db2ab9020b42850ce6427fedc4bcb728389ce20ee8, 
networking-istio-6f558bfb75-vmhsp:	gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:49bb4cb2224b6d41d07f2259753fd89e8a440cd7bb81eee190faff1e817e7eb9, 
webhook-745db77b96-sdg82:	gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:e65e11bc8711ed619b346f0385de4d266f59dccf0781fe41a416559b85d414ed,
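For the registry import, the naming could be derived from the upstream references along these lines (the registry name and version tag below are placeholders, not the actual WMF values; the docker commands are shown commented out since they need registry access):

```shell
# Sketch: derive a local registry tag from an upstream Knative image reference.
upstream="gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:69065cec1c1d57d1b16eb448c1abd895c2c554ef0ec19bedd1c14dc3150d2ff1"

name="${upstream%%@*}"   # drop the @sha256:... digest
# Placeholder registry and tag, for illustration only:
local_tag="docker-registry.example.org/${name#gcr.io/knative-releases/}:0.18.0"
echo "$local_tag"        # → docker-registry.example.org/knative.dev/serving/cmd/activator:0.18.0

# On a host with docker and registry access, the mirroring itself would be:
#   docker pull "$upstream"
#   docker tag  "$upstream" "$local_tag"
#   docker push "$local_tag"
```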

Change 692899 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add knative serving and net-istio images

https://gerrit.wikimedia.org/r/692899

Updates:

  • The Dockerfiles are currently under review, but they seem to work fine on minikube.
  • Ideally we should be able to import all the CRDs + other configs into our helm chart repository, to have a single place where we configure/deploy our cluster.
  • No official helm charts are provided for Knative, so we'll either need to come up with our own (from the base config yamls provided by upstream) or check what is available from the Knative open source community.
  • We should review and test the Knative operator (https://github.com/knative/operator) as well, possibly keeping support for it.
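If we end up rolling our own chart, a bare-bones starting point could simply wrap the upstream release yamls as templates (the layout and names below are illustrative, not our actual deployment-charts structure):

```shell
# Minimal chart skeleton wrapping the upstream yamls (illustrative layout).
mkdir -p knative-serving-chart/templates
cat > knative-serving-chart/Chart.yaml <<'EOF'
apiVersion: v2
name: knative-serving
description: Thin wrapper around the upstream Knative Serving release yamls
version: 0.1.0
appVersion: "0.18.0"
EOF

# The upstream yamls would then be fetched into templates/, e.g.:
#   curl -sL -o knative-serving-chart/templates/serving-core.yaml \
#     https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
ls knative-serving-chart
```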

Had a chat with SRE today about Helm vs the Knative Operator (https://knative.dev/docs/install/knative-with-operators/). The idea that we discussed was to add basic helm support for the Knative Operator in deployment-charts, and then use kubectl apply to deploy the settings (with a very small manifest).

One thing that makes me doubtful about this approach is this entry: https://github.com/knative/operator/releases/tag/v0.18.1

Bumping k8s to 1.18

As far as I can see from the related pull request it only bumps a Go client package, but from what I gathered they may have bumped the minimum k8s compatibility to 1.18 for the operator. In the meantime I found this nice chart https://github.com/softonic/knative-serving-chart that I am trying to port to deployment-charts; it looks relatively easy and more flexible in my opinion. Will update this task when I have something to share :)

Change 692899 merged by Elukey:

[operations/docker-images/production-images@master] Add knative serving and net-istio images

https://gerrit.wikimedia.org/r/692899

Change 699380 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add support for knative serving

https://gerrit.wikimedia.org/r/699380

elukey changed the task status from Open to Stalled. Jul 1 2021, 1:32 PM

We are currently trying to deploy layer by layer in production, in the following order:

  • istio
  • knative-serving
  • kfserving
  • inference services

Setting this task to STALLED until it is actionable; see the other subtasks of the parent task for more info.

elukey changed the task status from Stalled to Open. Jul 15 2021, 3:57 PM

We were finally able to deploy istio in prod, so this task can proceed!

Next steps:

  1. Work with service ops on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/699380 (knative helm chart)
  2. Figure out how to create the inference.wikimedia.org TLS cert (likely via https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates) and how to add it to the knative helmfile settings (it is a little different from a regular service TLS deployment, so we need to figure out how to do it)
  3. Deploy to Prod and figure out if any RBAC rule is needed.

Change 699380 merged by Elukey:

[operations/deployment-charts@master] Add support for knative serving

https://gerrit.wikimedia.org/r/699380

Change 707408 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add knative-serving in bases list

https://gerrit.wikimedia.org/r/707408

Change 707408 abandoned by Elukey:

[operations/deployment-charts@master] admin_ng: add knative-serving in bases list

Reason:

https://gerrit.wikimedia.org/r/707408

Change 708523 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Use uid for the nobody user in knative-serving's Dockerfiles

https://gerrit.wikimedia.org/r/708523

Change 708523 merged by Elukey:

[operations/docker-images/production-images@master] Use uid for the nobody user in knative-serving's Dockerfiles

https://gerrit.wikimedia.org/r/708523

Change 711111 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] knative-serving: add ca-certificates to the controller's image

https://gerrit.wikimedia.org/r/711111

Change 711111 merged by Elukey:

[operations/docker-images/production-images@master] knative-serving: add wmf-certificates to the controller's image

https://gerrit.wikimedia.org/r/711111

Change 711127 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: force the controller to use ca certificates

https://gerrit.wikimedia.org/r/711127

Change 711127 merged by Elukey:

[operations/deployment-charts@master] knative-serving: force the controller to use ca certificates

https://gerrit.wikimedia.org/r/711127

Change 715006 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: add missing vars for replicaCount

https://gerrit.wikimedia.org/r/715006

Change 715006 merged by Elukey:

[operations/deployment-charts@master] knative-serving: add missing vars for replicaCount

https://gerrit.wikimedia.org/r/715006

Change 715053 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: fix templating for memory limits

https://gerrit.wikimedia.org/r/715053

Change 715053 merged by Elukey:

[operations/deployment-charts@master] knative-serving: fix templating for memory limits

https://gerrit.wikimedia.org/r/715053

elukey claimed this task.

Marking this as completed, metrics will be added in T289841.