
kserve helm status is broken across ml clusters
Closed, ResolvedPublic

Description

Hey folks,

there are currently alerts about kserve's helm status being broken. They started after the Kubernetes 1.31 upgrade; here is an example:

root@deploy2002:~# helm3 -n kserve history kserve
REVISION	UPDATED                 	STATUS    	CHART       	APP VERSION	DESCRIPTION                                                                                                                                                                                                                                                                                                                                                    
1       	Tue Feb 10 11:38:49 2026	deployed  	kserve-0.2.9	0.11.2     	Install complete                                                                                                                                                                                                                                                                                                                                               
2       	Tue Feb 24 13:21:42 2026	superseded	kserve-0.2.9	0.11.2     	Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block 
3       	Tue Feb 24 13:21:47 2026	failed    	kserve-0.2.9	0.11.2     	Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
4       	Fri Feb 27 13:14:58 2026	superseded	kserve-0.2.9	0.11.2     	Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block 
5       	Fri Feb 27 13:15:03 2026	failed    	kserve-0.2.9	0.11.2     	Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block

We don't explicitly override webhookClientConfig.caBundle in our yaml configs (the default is caBundle: Cg==), but we use the following injection in the KServe CRDs:

annotations:
  cert-manager.io/inject-ca-from: kserve/serving-cert

If you check kubectl edit crd inferenceservices.serving.kserve.io on any cluster you'll see that the value is not Cg==, but a valid base64-encoded PEM (the PKI root certificate - the right one). In https://github.com/metallb/metallb/issues/2679 people discuss a similar problem: it seems that K8s 1.31 got really strict about caBundle fields and now wants a valid PEM file.

The main issue IIUC in our case is that helm thinks Cg== is set, while cert-manager injects the right value behind the scenes.
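
To make the error message concrete: Cg== is the base64 encoding of a single newline byte, which matches the []byte{0xa} in the helm output. The Python sketch below decodes the value and runs a rough check for a PEM certificate block. The helper name contains_pem_certificate is made up for illustration, and the regex only approximates what the API server validates; it is not the actual Kubernetes code.

```python
import base64
import re

# Matches one PEM-armoured certificate block (header, base64 body, footer).
PEM_CERT_RE = re.compile(
    rb"-----BEGIN CERTIFICATE-----\s+[A-Za-z0-9+/=\s]+?-----END CERTIFICATE-----"
)


def contains_pem_certificate(ca_bundle_b64: str) -> bool:
    """Rough check: does the base64-encoded caBundle hold a PEM certificate block?"""
    raw = base64.b64decode(ca_bundle_b64)
    return PEM_CERT_RE.search(raw) is not None


placeholder = "Cg=="  # the chart default mentioned in this task
decoded = base64.b64decode(placeholder)
print(decoded)                                # b'\n'
print(contains_pem_certificate(placeholder))  # False
```

Running the same check against the value that cert-manager injected into the live CRD should return True if it really is a PEM certificate.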

We cannot keep things as they are, so we should find a solution:

  • Maybe instead of Cg== we could create a dummy valid PEM string, and see how it goes.
  • Upgrading the KServe control plane to a new version may be more advisable, modulo keeping the compatibility with the current isvcs.

Event Timeline

I'll try a few fixes on the side on staging

What works, but only once:
kubectl delete crd inferenceservices.serving.kserve.io --cascade=true
helmfile -e ml-staging-codfw sync
Then the issue comes back.
I will try to remove caBundle: Cg== from the chart; it decodes to just an empty line (a single newline).

It seems that no matter how you handle it, the result is the same.
This needs more investigation on the cert-manager side.

@DPogorzelski-WMF check https://github.com/kserve/kserve/pull/3890#discussion_r1734596750

So in theory removing caBundle entry from the CRD itself should fix the problem, but I am wondering if we need to nuke kserve first, and then deploy the updated version of the chart. I am not sure what happens to the isvcs if we do it.

I tested it and it always works on the first sync, but the problem comes back on the following syncs.
Will check again.

yea i did test this:

image.png (1×1 px, 248 KB)

i think i'll re-check this after kserve update, could be pointless trying to fix it if we want to update kserve

Change #1251220 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: Remove caBundle occurrences

https://gerrit.wikimedia.org/r/1251220

@DPogorzelski-WMF my idea was to follow what upstream did, namely remove caBundle occurrences in the CRD itself and then re-deploy. I think it is possibly something that we can test even before the upgrade, just to see if we can restore the current good status in production. We cannot leave things broken for so long, since we may need to roll out emergency fixes in the meantime.

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1251220 is my idea, it should be easy enough to test, lemme know if you like the idea.
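
As a rough illustration of the "remove caBundle occurrences" idea (the real change edits the chart in Gerrit; this is not that patch), the Python sketch below drops every caBundle key from an already-parsed manifest, so that only the cert-manager annotation decides the value. The function name strip_ca_bundle and the trimmed CRD dictionary are hypothetical; the nesting simply follows the field path printed in the error message, and the real manifest layout may differ.

```python
from typing import Any


def strip_ca_bundle(node: Any) -> Any:
    """Return a copy of a parsed manifest with every `caBundle` key removed."""
    if isinstance(node, dict):
        return {k: strip_ca_bundle(v) for k, v in node.items() if k != "caBundle"}
    if isinstance(node, list):
        return [strip_ca_bundle(item) for item in node]
    return node


# Hypothetical, heavily trimmed CRD: only the fields mentioned in this task.
crd = {
    "kind": "CustomResourceDefinition",
    "metadata": {
        "name": "inferenceservices.serving.kserve.io",
        "annotations": {"cert-manager.io/inject-ca-from": "kserve/serving-cert"},
    },
    "spec": {"conversion": {"webhookClientConfig": {"caBundle": "Cg=="}}},
}

cleaned = strip_ca_bundle(crd)
print(cleaned["spec"]["conversion"]["webhookClientConfig"])  # {}
print(cleaned["metadata"]["annotations"])  # injection annotation is kept
```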

I can try again, but as per the screenshot above it's something I have tried and then reverted because it didn't have any effect.

Sorry, I captured the wrong change there, but I'm pretty sure I did test removing the whole entry on the side. I can try again though.

Change #1251220 merged by Elukey:

[operations/deployment-charts@master] kserve: Remove caBundle occurrences

https://gerrit.wikimedia.org/r/1251220

So far in staging it looks good:

20              Thu Mar  5 13:57:23 2026        failed          kserve-0.2.9    0.11.2          Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block

21              Fri Mar 13 09:33:53 2026        deployed        kserve-0.3.0    0.11.2          Upgrade complete    <-----------------------------------------------------

I also checked with kubectl get secrets -n kserve kserve-webhook-server-cert -o yaml and the ca.crt entry is correctly populated.

Awesome! Then I must have done something wrong

I think the missing bit was to bump the chart's version, that must be it. I'll deploy to prod on Monday so we can close it, better not to risk it on a Friday :D

Deployed on ml-serve-eqiad:

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helm3 -n kserve history kserve
REVISION        UPDATED                         STATUS          CHART           APP VERSION     DESCRIPTION                                                                                                                                                                                                                                                                                                                                                    
1               Tue Feb 24 10:44:19 2026        superseded      kserve-0.2.9    0.11.2          Install complete                                                                                                                                                                                                                                                                                                                                               
2               Tue Mar  3 12:41:35 2026        superseded      kserve-0.2.9    0.11.2          Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block 
3               Tue Mar  3 12:41:43 2026        failed          kserve-0.2.9    0.11.2          Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
4               Mon Mar 16 09:34:16 2026        deployed        kserve-0.3.0    0.11.2          Upgrade complete

So far all good, I'll do codfw later on.
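
For anyone who wants to script the "is the latest revision deployed?" check instead of eyeballing the table, here is a minimal Python sketch that parses the plain-text helm3 history output. The names Revision, parse_history and latest_is_deployed are invented for this example, the regex assumes the date layout seen in the pastes in this task, and the sample rows are copied from the ml-serve-eqiad output with the column padding shortened.

```python
import re
from typing import List, NamedTuple


class Revision(NamedTuple):
    revision: int
    updated: str
    status: str
    chart: str
    app_version: str
    description: str


# Assumes dates like "Tue Mar  3 12:41:35 2026" (day padded to two characters).
ROW_RE = re.compile(
    r"^(\d+)\s+"
    r"(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_history(text: str) -> List[Revision]:
    """Parse data rows of `helm history` text output; the header row is skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            rev, updated, status, chart, app, desc = match.groups()
            rows.append(Revision(int(rev), updated, status, chart, app, desc.strip()))
    return rows


def latest_is_deployed(text: str) -> bool:
    """True if the highest-numbered revision has status `deployed`."""
    rows = parse_history(text)
    return bool(rows) and max(rows, key=lambda r: r.revision).status == "deployed"


ERR = (
    'Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" '
    "with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "
    '"inferenceservices.serving.kserve.io" is invalid: '
    "spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: "
    "unable to load root certificates: unable to parse bytes as PEM block"
)
SAMPLE = "\n".join([
    "REVISION  UPDATED                   STATUS      CHART         APP VERSION  DESCRIPTION",
    "1         Tue Feb 24 10:44:19 2026  superseded  kserve-0.2.9  0.11.2       Install complete",
    "3         Tue Mar  3 12:41:43 2026  failed      kserve-0.2.9  0.11.2       " + ERR,
    "4         Mon Mar 16 09:34:16 2026  deployed    kserve-0.3.0  0.11.2       Upgrade complete",
])

print(latest_is_deployed(SAMPLE))
```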

elukey claimed this task.
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kube-env admin ml-serve-codfw
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helm3 -n kserve history kserve
REVISION        UPDATED                         STATUS          CHART           APP VERSION     DESCRIPTION                                                                                                                                                                                                                                                                                                                                                    
1               Mon Feb 23 17:00:58 2026        superseded      kserve-0.2.9    0.11.2          Install complete                                                                                                                                                                                                                                                                                                                                               
2               Tue Mar  3 08:31:38 2026        superseded      kserve-0.2.9    0.11.2          Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block 
3               Tue Mar  3 08:31:43 2026        failed          kserve-0.2.9    0.11.2          Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
4               Mon Mar 16 09:51:44 2026        deployed        kserve-0.3.0    0.11.2          Upgrade complete