Install KFServing standalone
Closed, Resolved (Public)

Description

We need to install KFServing standalone on our ml-serve k8s cluster.

Requirements:

  • k8s cluster with at least 4 cpus and 8Gi memory
  • Istio service mesh
  • Knative Serving (and Eventing if we want transformers/explainers)
  • Cert Manager / LetsEncrypt

Install docs: https://github.com/kubeflow/kfserving#standalone-kfserving-installation
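
The sizing requirement above can be checked against a node's reported capacity. What follows is a minimal Python sketch, assuming the capacity values arrive as Kubernetes quantity strings (for example "4" or "3500m" for CPU, "8Gi" for memory). The function names and sample values are hypothetical, and only a handful of suffixes are handled.

```python
# Subset of the suffixes Kubernetes accepts for memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 1024,
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
    "Ti": 1024 ** 4,
    "k": 1000,
    "M": 1000 ** 2,
    "G": 1000 ** 3,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "4" or "3500m" into a number of cores."""
    if quantity.endswith("m"):  # millicores
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "8Gi" into bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain number of bytes


def meets_kfserving_minimum(cpu: str, memory: str) -> bool:
    """True if the capacity satisfies the 4 CPU / 8Gi minimum listed above."""
    return parse_cpu(cpu) >= 4 and parse_memory(memory) >= parse_memory("8Gi")
```

For example, `meets_kfserving_minimum("2", "16Gi")` is False because the CPU count is too low, even though memory is sufficient.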

Details

Repo                                         Branch       Lines +/-
operations/deployment-charts                 master       +15 -6
operations/deployment-charts                 master       +73 -7
operations/deployment-charts                 master       +15 -6
operations/deployment-charts                 master       +5 -1
operations/docker-images/production-images   master       +7 -1
operations/deployment-charts                 master       +2 -2
operations/deployment-charts                 master       +21 -4
operations/docker-images/production-images   master       +11 -1
operations/docker-images/production-images   master       +13 -0
operations/deployment-charts                 master       +16 -13
operations/docker-images/production-images   master       +8 -1
operations/deployment-charts                 master       +7 -1
operations/docker-images/production-images   master       +28 -0
operations/deployment-charts                 master       +19 -3
operations/deployment-charts                 master       +3 -3
operations/deployment-charts                 master       +6 -1
operations/deployment-charts                 master       +44 -19
operations/deployment-charts                 master       +5 -5
operations/deployment-charts                 master       +9 -10
operations/deployment-charts                 master       +0 -1
operations/deployment-charts                 master       +11 -160
operations/deployment-charts                 master       +17 K -0
operations/deployment-charts                 master       +23 -0
operations/puppet                            production   +2 -0
operations/docker-images/production-images   master       +19 -1
operations/docker-images/production-images   master       +17 -3
operations/docker-images/production-images   master       +57 -0

Event Timeline

Here are my notes from installing KFServing on GKE last quarter: https://etherpad.wikimedia.org/p/kfserving-standalone
TLDR: there is a weird version mismatch issue between Knative & Istio that can potentially prevent us from using the transformer (pre/post-processing) and explainability features in KFServing.

@calbon mentioned in chat:

KFServing's explainability feature (i.e. an API endpoint for each model that provides some insights into how a model came to some conclusion) is a nice-to-have. However, transformers (i.e. a feature that allows api requests to be pre-processed before being submitted to a model for a prediction) is critical because it is part of the current ORES featureset

For now let's try to just follow the directions in the KFServing README and set up Istio to handle cluster-internal traffic so we can use transformers.
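
To illustrate the transformer role described above: a request goes through a pre-processing step before it reaches the predictor, and the prediction can optionally be post-processed on the way back. The following is a plain-Python sketch of that flow only. The class, the toy feature extraction and the fake predictor are hypothetical stand-ins, not the KFServing SDK and not the ORES feature code.

```python
from typing import Callable, Dict


class TransformerSketch:
    """Hypothetical stand-in showing where pre/post-processing sits."""

    def __init__(self, predictor: Callable[[Dict], Dict]):
        # In a real deployment the predictor is a separate service reached
        # over HTTP; here it is just a callable so the sketch stays runnable.
        self.predictor = predictor

    def preprocess(self, request: Dict) -> Dict:
        # Toy feature extraction: character count and word count of the text.
        text = request["text"]
        return {"instances": [[len(text), text.count(" ") + 1]]}

    def postprocess(self, response: Dict) -> Dict:
        # Pass the prediction through unchanged.
        return response

    def handle(self, request: Dict) -> Dict:
        features = self.preprocess(request)
        return self.postprocess(self.predictor(features))


def fake_predictor(body: Dict) -> Dict:
    # Stand-in model: sums the features of the first instance.
    return {"predictions": [sum(body["instances"][0])]}


transformer = TransformerSketch(fake_predictor)
print(transformer.handle({"text": "hello world"}))
```

The point of the split is that the client sends raw input, while the model only ever sees the feature vector produced by `preprocess`.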

You might also find the quick install script interesting: it has more recent versions of Istio and Knative (1.6 and 0.18) than run-e2e-tests.sh (1.3 and 0.17). I'm currently rewriting run-e2e-tests.sh to migrate the test to a Tekton pipeline and will keep you posted. I'm targeting Istio 1.7 and Knative 0.20. One more thing: the helm installation method is nice, but both Istio and Knative also have operators for installation nowadays, which can make future upgrades easier.

I've started a WIP PR which supports knative 0.20 and Istio 1.7.1 here.

Thanks @Theofpa ! We are considering helm as it is part of our SRE stack used across the Foundation, however I can see the operators being very beneficial for long term use.

@ACraze @kevinbazira I was reviewing the kfserving.yaml kubernetes config looking for Docker images to build, and besides the kfserving ones (controller/agent) I also found Docker images (shipped to Dockerhub) based on: https://github.com/kubeflow/kfserving/tree/master/python

It seems mostly related to model servers for various providers, but I have no idea if we need them now or not. Can you shed some light? :D

Change 693644 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add base kubeflow kfserving images and kube-rbac-proxy

https://gerrit.wikimedia.org/r/693644

Thanks a lot for the summary!

To help you answer the question, these images are mostly model servers (tensorflow, pytorch, sklearn, etc) and explainers. The current models built by @kevinbazira and @ACraze are based on a custom image rather than one of the popular model servers. So if you need to have the minimum list of images cloned in your registry to make kfserving work for the current capabilities of the platform, as you said these would be the controller and the agent.

Although from the roadmap of lift-wing I understand that the platform will serve more model types in the future, which most likely will be using one of the popular frameworks like tensorflow, pytorch or sklearn. So my recommendation would be to cache them all.

One more thing, we're planning to move out from docker hub in the near future and have the images hosted in aws ecr-public.

It seems mostly related to model servers for various providers, but I have no idea if we need them now or not. Can you shed some light? :D

@elukey -- mostly echoing @Theofpa: I think all we need right now for the MVP is controller & agent from KFServing. The ORES models will be a custom image that we are still finishing and the other model we are working on is the Outlinks topic model, which is another custom image that runs a fastText model. We might also need the storage-init, but that depends on the outcome of T282802: Implement model storage for enwiki-goodfaith inference service

Long-term we will definitely want most (if not all) of the model servers. I know some teams are using tensorflow and pytorch for upcoming projects, and sklearn is pretty common too.

Change 693644 merged by Elukey:

[operations/docker-images/production-images@master] Add base kubeflow kfserving images

https://gerrit.wikimedia.org/r/693644

Change 700179 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Move knative serving's queue image to a different layout

https://gerrit.wikimedia.org/r/700179

Change 700179 merged by Elukey:

[operations/docker-images/production-images@master] Move knative serving's queue image to a different layout

https://gerrit.wikimedia.org/r/700179

Change 700470 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] WIP - Add kubeflow's kfserving chart

https://gerrit.wikimedia.org/r/700470

elukey changed the task status from Open to Stalled. Jul 1 2021, 1:31 PM

We are currently trying to deploy layer by layer in production, following this order:

  • istio
  • knative-serving
  • kfserving
  • inference services

Setting this task to STALLED until it is actionable; see the other subtasks of the parent for more info.

Change 708783 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Import Kubeflow Kfserving 0.6.0

https://gerrit.wikimedia.org/r/708783

Change 708783 merged by Elukey:

[operations/docker-images/production-images@master] Import Kubeflow Kfserving 0.6.0

https://gerrit.wikimedia.org/r/708783

Change 709014 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add ml-serve-{eqiad,codfw} to kubernetes_clusters

https://gerrit.wikimedia.org/r/709014

Change 709014 merged by Elukey:

[operations/puppet@production] Add ml-serve-{eqiad,codfw} to kubernetes_clusters

https://gerrit.wikimedia.org/r/709014

Change 709494 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add kfserving basic helmfile config under admin_ng

https://gerrit.wikimedia.org/r/709494

Change 700470 merged by Elukey:

[operations/deployment-charts@master] Add kubeflow's kfserving charts

https://gerrit.wikimedia.org/r/700470

Change 709494 merged by Elukey:

[operations/deployment-charts@master] Add kfserving basic helmfile config under admin_ng

https://gerrit.wikimedia.org/r/709494

Change 710226 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Improve the kubeflow-kfserving chart

https://gerrit.wikimedia.org/r/710226

Change 710226 merged by Elukey:

[operations/deployment-charts@master] Improve the kubeflow-kfserving chart

https://gerrit.wikimedia.org/r/710226

Change 710296 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile: allow kubeflow-kfserving to create the kfserving-system ns

https://gerrit.wikimedia.org/r/710296

Change 710296 merged by Elukey:

[operations/deployment-charts@master] helmfile: allow kubeflow-kfserving to create the kfserving-system ns

https://gerrit.wikimedia.org/r/710296

Change 710301 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Kubeflow: fix secrete in chart and update helmfile

https://gerrit.wikimedia.org/r/710301

Change 710301 merged by Elukey:

[operations/deployment-charts@master] Kubeflow: fix secret in chart and update helmfile

https://gerrit.wikimedia.org/r/710301

Change 710307 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving: better handling of Secrets

https://gerrit.wikimedia.org/r/710307

Change 710307 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving: better handling of Secrets

https://gerrit.wikimedia.org/r/710307

Change 710481 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: create a separate chart for its Secret

https://gerrit.wikimedia.org/r/710481

Change 710483 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: add pre-hook ordering for Namespace and Secret

https://gerrit.wikimedia.org/r/710483

Change 710481 abandoned by Elukey:

[operations/deployment-charts@master] kubeflow: create a separate chart for its Secret

Reason:

https://gerrit.wikimedia.org/r/710481

Change 710483 merged by Elukey:

[operations/deployment-charts@master] kubeflow: add pre-hook ordering for Namespace and Secret

https://gerrit.wikimedia.org/r/710483

Change 710486 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: fix the cert name for the webhook TLS certificate

https://gerrit.wikimedia.org/r/710486

Change 710486 merged by Elukey:

[operations/deployment-charts@master] kubeflow: fix the cert name for the webhook TLS certificate

https://gerrit.wikimedia.org/r/710486

Change 710493 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: add container env variables to reach the k8s api

https://gerrit.wikimedia.org/r/710493

Change 710493 merged by Elukey:

[operations/deployment-charts@master] kubeflow: add container env variables to reach the k8s api

https://gerrit.wikimedia.org/r/710493

Finally!

elukey@ml-serve-ctrl1001:~$ sudo kubectl get pods -A
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE
istio-system       cluster-local-gateway-585f96dccc-54kqm     1/1     Running   1          13d
istio-system       istio-ingressgateway-7ffffd874b-67zkp      1/1     Running   1          13d
istio-system       istiod-68d4cb6c9-h84qd                     1/1     Running   1          13d
kfserving-system   kfserving-controller-manager-0             1/1     Running   0          7m59s
knative-serving    activator-867d54cc88-vwdpt                 1/1     Running   2          8d
knative-serving    autoscaler-cfc4cc49f-zbvmv                 1/1     Running   0          8d
knative-serving    controller-784f95f8df-m4djc                1/1     Running   0          8d
knative-serving    istio-webhook-b8854d86f-4lxh2              1/1     Running   0          8d
knative-serving    networking-istio-857f9bbdf6-bdcp5          1/1     Running   0          8d
knative-serving    webhook-5bf64fb48d-qvr8x                   1/1     Running   0          8d
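
A listing like the one above can also be checked programmatically. Below is a rough Python sketch, assuming the whitespace-separated column layout shown (NAMESPACE, NAME, READY, STATUS, ...). `unhealthy_pods` is a hypothetical helper, not part of kubectl or any client library, and it would also flag pods that legitimately end in a Completed state.

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for pods that are not Running or not fully ready.

    Expects the text printed by `kubectl get pods -A`, header line included.
    """
    problems = []
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        namespace, name, ready, status = line.split()[:4]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append((namespace, name))
    return problems


# A few rows copied from the listing above.
sample = """\
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE
istio-system       istiod-68d4cb6c9-h84qd                     1/1     Running   1          13d
kfserving-system   kfserving-controller-manager-0             1/1     Running   0          7m59s
knative-serving    activator-867d54cc88-vwdpt                 1/1     Running   2          8d
"""
print(unhealthy_pods(sample))  # prints []
```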

Change 710584 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add the Kubeflow storage initializer docker image

https://gerrit.wikimedia.org/r/710584

Change 710584 merged by Elukey:

[operations/docker-images/production-images@master] Add the Kubeflow storage initializer docker image

https://gerrit.wikimedia.org/r/710584

Change 711096 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving: move to the Wikimedia storage-initializer

https://gerrit.wikimedia.org/r/711096

elukey changed the task status from Stalled to Open. Aug 10 2021, 7:52 AM

Change 711096 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving: move to the Wikimedia storage-initializer

https://gerrit.wikimedia.org/r/711096

Change 711113 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow,knative: use new controller image and docker registry endpoint

https://gerrit.wikimedia.org/r/711113

Change 711113 merged by Elukey:

[operations/deployment-charts@master] kubeflow,knative: use new controller image and docker registry endpoint

https://gerrit.wikimedia.org/r/711113

Change 711129 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving: add quoting and refactor storage_init limits

https://gerrit.wikimedia.org/r/711129

Change 711151 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] kubeflow: add wmf-certificates to the storage-initializer

https://gerrit.wikimedia.org/r/711151

Change 711151 merged by Elukey:

[operations/docker-images/production-images@master] kubeflow: add wmf-certificates to the storage-initializer

https://gerrit.wikimedia.org/r/711151

Change 711129 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving: update chart

https://gerrit.wikimedia.org/r/711129

Change 711579 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] kubeflow: add workaround for TLS validation in storage-initializer

https://gerrit.wikimedia.org/r/711579

Change 711579 merged by Elukey:

[operations/docker-images/production-images@master] kubeflow: add workaround for TLS validation in storage-initializer

https://gerrit.wikimedia.org/r/711579

Change 712118 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] kubeflow: add the AWS_DEFAULT_REGION env variable to storage-initializer

https://gerrit.wikimedia.org/r/712118

Change 712118 merged by Elukey:

[operations/docker-images/production-images@master] kubeflow: add the AWS_DEFAULT_REGION env variable to storage-initializer

https://gerrit.wikimedia.org/r/712118

Change 712346 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: update storage-init's image and add variable for local gw

https://gerrit.wikimedia.org/r/712346

Change 712346 merged by Elukey:

[operations/deployment-charts@master] kubeflow: update storage-init's image and add variable for local gw

https://gerrit.wikimedia.org/r/712346

We finally have something working!

elukey@ml-serve-ctrl1001:~$ curl http://ml-serve1001.eqiad.wmnet:8081/v1/models/enwiki-goodfaith:predict -X POST -d @input.json -i -H "Host: $SERVICE_HOSTNAME"
HTTP/1.1 200 OK
content-length: 112
content-type: application/json; charset=UTF-8
date: Thu, 12 Aug 2021 13:27:13 GMT
server: istio-envoy
x-envoy-upstream-service-time: 31933

{"predictions": {"prediction": true, "probability": {"false": 0.03387957196040836, "true": 0.9661204280395916}}}
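
The same call can be sketched in Python. The request goes to the gateway address, while the `Host` header carries the service hostname, as in the curl command above. This is a minimal sketch using only the standard library: the helper names, the placeholder hostname and the `rev_id` payload are assumptions (the content of input.json is not shown here), and the JSON content type is a choice of this sketch rather than what curl sends by default.

```python
import json
import urllib.request


def build_predict_request(gateway_url, model_name, service_hostname, payload):
    """Build (without sending) a POST to the model's :predict endpoint."""
    return urllib.request.Request(
        url=f"{gateway_url}/v1/models/{model_name}:predict",
        data=json.dumps(payload).encode("utf-8"),
        # The Host header tells the gateway which inference service is meant.
        headers={"Host": service_hostname, "Content-Type": "application/json"},
        method="POST",
    )


def most_likely_label(response_body: str) -> str:
    """Pick the label with the highest probability from a response like the one above."""
    probabilities = json.loads(response_body)["predictions"]["probability"]
    return max(probabilities, key=probabilities.get)


request = build_predict_request(
    "http://ml-serve1001.eqiad.wmnet:8081",
    "enwiki-goodfaith",
    "enwiki-goodfaith.example.com",  # placeholder for $SERVICE_HOSTNAME
    {"rev_id": 12345},  # hypothetical payload
)
# Sending it would be: urllib.request.urlopen(request, timeout=60)
```

Applied to the response body shown above, `most_likely_label` returns "true".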

Change 714534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving: add missing comma

https://gerrit.wikimedia.org/r/714534

Change 714534 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving: add missing comma

https://gerrit.wikimedia.org/r/714534

Change 714773 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] kubeflow: change storage-init's AWS_DEFAULT_REGION value

https://gerrit.wikimedia.org/r/714773

Change 714773 merged by Elukey:

[operations/docker-images/production-images@master] kubeflow: change storage-init's AWS_DEFAULT_REGION value

https://gerrit.wikimedia.org/r/714773

Change 714800 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: add override in admin_ng for storage-init

https://gerrit.wikimedia.org/r/714800

Change 714800 merged by Elukey:

[operations/deployment-charts@master] kubeflow: add override in admin_ng for storage-init

https://gerrit.wikimedia.org/r/714800

Change 715042 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow: raise cpu limits for the kfserving controller

https://gerrit.wikimedia.org/r/715042

Change 715042 merged by Elukey:

[operations/deployment-charts@master] kubeflow: raise cpu limits for the kfserving controller

https://gerrit.wikimedia.org/r/715042

Change 715747 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving-inference: add Secret specs for Swift

https://gerrit.wikimedia.org/r/715747

Change 715747 abandoned by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving-inference: add Secret specs for Swift

Reason:

This shouldn't be needed if I got https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/719128 right

https://gerrit.wikimedia.org/r/715747

elukey claimed this task.