Page MenuHomePhabricator

Kserve helm chart
Open, MediumPublic

Description

I would like to propose the adoption of the kserve helm chart as a mean to manage and install kserve itself.
Kserve is used to manage machine learning inference workloads on ml specific k8s clusters.

https://github.com/kserve/kserve/tree/master/charts

https://wikitech.wikimedia.org/wiki/Helm/Upstream_Charts/kserve

Event Timeline

tappof triaged this task as Medium priority.Thu, Feb 5, 3:32 PM

We have a rough documentation about our policy and process around adopting upstream helm charts which can be found here: https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_policy

You may start with working through the various points and the best practices outlined there. It might also help to go over other validation processes, like https://wikitech.wikimedia.org/wiki/Helm/Upstream_Charts/cloudnative-pg for example to get an idea about where potential problems might hide.

In any case, since we have a custom knative chart already (from back when upstream was only releasing a bunch of YAML), you should make sure that the upstream chart supports the features/toggles/flags implemented in the custom chart and thing about a migration process for existing deployments.

@MLechvien-WMF i am happy to follow up with Dawid and the review the charts etc.., but as Janis pointed out SRE should not lead the efforts to import the chart :)

Assigning the task to Dawid is probably the best course of action.

@JMeybohm will check and update the ticket, cheers!

It seems that the major difference is the fact that we have a calico network policy but the chart doesn't (unsurprisingly). Perhaps we can supply that out of band.
Our images expect /usr/bin/manager but upstream uses /manager and this is not configurable. We might want to update our images to use the upstream path.

Knative will be ditched since new kserve doesn't need it so that whole part will be gone, however, within the knative chart we add several calico network policies and I'm not sure if any of that will have to be kept and re-adapted.
Istio based ingress can be still used altough gateway api seems to be the new default. We'll keep using istio.

@DPogorzelski-WMF please check https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve/README, there is a little more :) Not sure what it is still required or not, but worth to keep it mind when checking the upstream chart.

Will do but I would argue it's much better to deploy it, test it, see what's broken, fix and iterate until it's working as intended. Quick, small iterations.
It's a big chart and planning this waterfall style it's not going to result in something that works 100% not matter how much one spends evaluating the differences.
In the end, even if we don't import the chart we will end up copying a big portion of it anyways because we still need kserve so one way or another it will end up in the repo, perhaps adapted but still.

On a related note, the fact that we run k8s on bare metal is just another major issue, there is no way to test these things under the same conditions as a production system without actually rolling this thing out to a production system (local kind/minikube is not an option as it doesn't mimic the production context at all).

Will do but I would argue it's much better to deploy it, test it, see what's broken, fix and iterate until it's working as intended. Quick, small iterations.

The README contains things that need to be added to avoid the issues that you want to fix with small iterations, so since we know it beforehand I am not 100% sure why you want to rediscover them another time. Some of them may not be relevant, but checking in the chart first for their presence and/or what alternatives are available is a good sanity measure in my opinion.

It's a big chart and planning this waterfall style it's not going to result in something that works 100% not matter how much one spends evaluating the differences.

It is fine to import the basic version of the chart without being sure 100% that it works, small fixes later on are allowed and expected. The main goal is to avoid importing garbage or testing code in our infrastructure that we don't want to.

On a related note, the fact that we run k8s on bare metal is just another major issue, there is no way to test these things under the same conditions as a production system without actually rolling this thing out to a production system (local kind/minikube is not an option as it doesn't mimic the production context at all).

Sure we cannot be 100% confident, but doing basic tests locally will be a good indication that the chart works in its basic functionalities. Again the goal for the review is not to have something that works 100%, but to review the chart and adapt it to our needs. More iterations are ok later on in the process.

The README contains things that need to be added to avoid the issues that you want to fix with small iterations, so since we know it beforehand I am not 100% sure why you want to rediscover them another time.

I don't want to. In fact what I want is to import the chart and deploy it. Then add what is missing on top using the readme and feedback from the deployment as a guideline. Also because not everything is clear to me so this is also a way of absorbing the internal know how.

From the upstream repo we probably only need kserve-crd and kserve-resources, for now.

Btw the chart does work fine locally for what's worth it. Bartosz also tested it.

I would agree with @elukey in that validating the known "problems" from the README beforehand makes sense. Especially considering the size of the chart. In general this process is more about validating the quality/match with our standard of a chart rather than it's functionality. We, for example, had issues with other charts granting too wide privileges (RBAC) and/or capabilities to some of their components (see https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_policy#Special_considerations_considering_privileges). You may take a look at this review summary for example to get an idea what to look out for.

Regarding the calico policies you may just add an additional template file for those to the chart in a follow up commit. With a proper comment this is easily spotted on the next chart upgrade/patch and it does not interfere with the upstream chart.