k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC)
Open, Needs Triage | Public

Description

We have an isvc that would accept kserve v2-style requests using gRPC, but for this to work for clients, we will need to make sure k8s and related components (like Istio) route the request traffic correctly (in and out).

At a minimum, the port v2/gRPC is served on will have to be part of the isvc deployment chart/template (if it's a separate port).

It's also not quite clear yet whether Istio would just accept/route this traffic or whether it needs tweaking. It's probably best to just try it.

We will eventually also have to update our docs (that usually mention curl) to describe how gRPC calls can be made.

Event Timeline

In T423582: Modify article topic model code to support kserve v2 inference protocol we have done some minimal work to enable the v2 open inference protocol which also accepts gRPC requests. Requests with v2 (both REST & gRPC) are documented in https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/outlink_topic_model/README.md
By minimal, I mean that we are not actually using strict typing but rather transforming inputs and outputs in order to allow the service to be backwards compatible and to enable the work required for this task and the changes required on kubernetes.
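To make the "transforming inputs and outputs" idea concrete, here is a minimal sketch of wrapping the legacy JSON payload in a v2-style request body and unwrapping it again. The helper names `wrap_payload` and `unwrap_payload` are hypothetical, and whether the service expects exactly this REST encoding is an assumption; the README linked above remains the authoritative description.

```python
import json


def wrap_payload(payload: dict, request_id: str = "test-123") -> dict:
    """Wrap a legacy JSON payload in a v2-style inference request body.

    Hypothetical helper: the tensor name, shape and datatype mirror the
    gRPC test script in this task, not necessarily the service code.
    """
    return {
        "id": request_id,
        "inputs": [
            {
                "name": "input",
                "shape": [1],
                "datatype": "BYTES",
                # The whole legacy payload travels as one serialized element.
                "data": [json.dumps(payload)],
            }
        ],
    }


def unwrap_payload(v2_body: dict) -> dict:
    """Recover the legacy JSON payload from the first input tensor."""
    first = v2_body["inputs"][0]
    if first["datatype"] != "BYTES":
        raise ValueError(f"unexpected datatype: {first['datatype']}")
    return json.loads(first["data"][0])


body = wrap_payload({"page_id": 5355, "lang": "en"})
print(json.dumps(body, indent=2))
```

The point of the sketch is that the model code keeps seeing the same dictionary it always did, while the transport envelope changes around it.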
When making a gRPC request from a stathost with the script below, we currently get an error:

import grpc
import json

from kserve.protocol.grpc import grpc_predict_v2_pb2, grpc_predict_v2_pb2_grpc

channel = grpc.insecure_channel('inference-staging.svc.codfw.wmnet:30443')
request = grpc_predict_v2_pb2.ModelInferRequest()
request.model_name = "outlink-topic-model"
request.id = "test-123"

# Add host header metadata
metadata = [('host', 'outlink-topic-model.articletopic-outlink.wikimedia.org')]

input_data = json.dumps({"page_id": 5355, "lang": "en"}).encode("utf-8")
tensor = request.inputs.add()
tensor.name = "input"
tensor.shape.extend([1])
tensor.datatype = "BYTES"
tensor.contents.bytes_contents.append(input_data)

stub = grpc_predict_v2_pb2_grpc.GRPCInferenceServiceStub(channel)
response = stub.ModelInfer(request, metadata=metadata)

# Parse the response
result = json.loads(response.outputs[0].contents.bytes_contents[0].decode("utf-8"))
print(json.dumps(result, indent=2))

We get the following error, which makes total sense since this service doesn't export port 8081 at the moment.

 python grpc_test.py
Traceback (most recent call last):
  File "/srv/home/isaranto/grpc_test/grpc_test.py", line 22, in <module>
    response = stub.ModelInfer(request, metadata=metadata)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/home/isaranto/grpc_test/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1159, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/home/isaranto/grpc_test/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 990, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.2.1.58:30443: Socket closed"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.2.1.58:30443: Socket closed"}"
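When a gRPC call ends in StatusCode.UNAVAILABLE, a plain TCP probe is one way to separate "nothing reachable on that port" from failures higher up the stack (TLS, HTTP/2, routing). The sketch below uses only the standard library; the helper name `tcp_port_open` is hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example call (not executed here), using the host and port from the script above:
#   tcp_port_open("inference-staging.svc.codfw.wmnet", 30443)
```

A True result only says that something accepted the TCP connection; it says nothing about whether that endpoint speaks gRPC or terminates TLS.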

Change #1277043 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] Add gRPC support to Istio ingress gateway for ML services

https://gerrit.wikimedia.org/r/1277043

Change #1277043 merged by jenkins-bot:

[operations/deployment-charts@master] Add gRPC support to Istio ingress gateway for ML services

https://gerrit.wikimedia.org/r/1277043

Change #1277436 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] Fix gRPC Gateway protocol to allow TLS termination

https://gerrit.wikimedia.org/r/1277436

After syncing patch ✅Add gRPC support to Istio ingress gateway for ML services (operations/deployment-charts~1277043) we are getting the errors below on syncing changes to admin_ng

Error: UPGRADE FAILED: release knative-serving failed, and has been rolled back due to atomic being set: cannot patch "knative-ingress-gateway" with kind Gateway: admission webhook "validation.istio.io" denied the request: configuration is invalid: server cannot have TLS settings for plain text HTTP ports

Small status update from debugging efforts.

Root cause of the admission webhook error above: protocol: HTTP2 combined with a tls block is not valid in an Istio Gateway. Switching to protocol: HTTPS resolved it. The required changes are in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1277436, which also updates config_1.24.2.yaml, the file actually used for staging.
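The rule behind that webhook message can be approximated with a small pre-flight check on a Gateway server entry. This is a simplified sketch, not Istio's actual validation code: the set of plain-text protocols, the treatment of httpsRedirect, the helper name `check_gateway_server` and the example entries are all assumptions made for illustration.

```python
from typing import List

# Assumption: protocols treated as plain-text HTTP for this check.
PLAINTEXT_HTTP_PROTOCOLS = {"HTTP", "HTTP2", "GRPC"}


def check_gateway_server(server: dict) -> List[str]:
    """Return a list of problems found in one Gateway servers[] entry."""
    problems = []
    protocol = str(server.get("port", {}).get("protocol", "")).upper()
    tls = server.get("tls") or {}
    # Assumption: httpsRedirect is the one TLS field that is meaningful
    # on a plain-text port, so it is ignored here.
    tls_fields = set(tls) - {"httpsRedirect"}
    if protocol in PLAINTEXT_HTTP_PROTOCOLS and tls_fields:
        problems.append(
            f"protocol {protocol} is plain text but tls sets {sorted(tls_fields)}"
        )
    return problems


# Hypothetical server entries, not copied from the charts.
rejected = {
    "port": {"number": 8081, "name": "grpc", "protocol": "HTTP2"},
    "tls": {"mode": "SIMPLE", "credentialName": "example-cert"},
}
accepted = {
    "port": {"number": 8081, "name": "grpc", "protocol": "HTTPS"},
    "tls": {"mode": "SIMPLE", "credentialName": "example-cert"},
}

print(check_gateway_server(rejected))
print(check_gateway_server(accepted))
```

With protocol: HTTPS the gateway terminates TLS and can still negotiate HTTP/2 with the client, which is what gRPC needs.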

After manually deploying the changes above on staging, gRPC still fails. We found two problems:

  1. The stathosts can reach LVS on 30443, but not on 38181, as LVS doesn't know about that port. Adding a new load-balanced service requires changes in the operations/puppet repo; see https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service for documentation. For now, we're testing directly against an ml-staging node, bypassing LVS.
  2. We're hitting the error below, caused by defining a second Gateway server for port 8081 in ml-serve.yaml while net_istio.yaml already has a match on the same port; Envoy rejects the resulting listener. We're exploring two options: use a different port for gRPC to avoid the duplicate matcher issue, or share 443/30443 with a different hostname (which could avoid the LVS work from point 1 entirely).

Duplicate matcher error:

{"time":"2026-04-27T13:54:38.112947Z","level":"warning","scope":"envoy config","msg":"gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) 0.0.0.0_8081: error adding listener '0.0.0.0:8081': filter chain '' has the same matching rules defined as ''. duplicate matcher is: {}\n","caller":"external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138","thread":"21"}
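A quick way to spot this class of conflict before deploying is to compare the port numbers claimed by Gateway servers across values files. The sketch below is a deliberate over-approximation: Envoy's real filter chain matching also looks at things like SNI, so two servers on one port do not always collide. The helper name `find_port_conflicts` and the example data are hypothetical; only the file names and port 8081 come from this task.

```python
from collections import defaultdict
from typing import Dict, List


def find_port_conflicts(sources: Dict[str, List[dict]]) -> Dict[int, List[str]]:
    """Map each port number that is defined more than once to its sources."""
    seen = defaultdict(list)
    for source_name, servers in sources.items():
        for server in servers:
            seen[server["port"]["number"]].append(source_name)
    return {port: names for port, names in seen.items() if len(names) > 1}


# Hypothetical, heavily trimmed server lists for the two files named above.
gateway_servers = {
    "ml-serve.yaml": [{"port": {"number": 8081, "protocol": "HTTPS"}}],
    "net_istio.yaml": [
        {"port": {"number": 8080, "protocol": "HTTP"}},
        {"port": {"number": 8081, "protocol": "HTTP"}},
    ],
}

conflicts = find_port_conflicts(gateway_servers)
print(conflicts)
```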

Change #1277436 merged by jenkins-bot:

[operations/deployment-charts@master] Fix gRPC Gateway protocol to allow TLS termination

https://gerrit.wikimedia.org/r/1277436

Change #1279327 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] Add 50051 to istio ingressgateway ports for ml-staging-codfw.

https://gerrit.wikimedia.org/r/1279327

Change #1279327 merged by jenkins-bot:

[operations/deployment-charts@master] Add 50051 to istio ingressgateway ports for ml-staging-codfw.

https://gerrit.wikimedia.org/r/1279327

Change #1279360 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Use gRPC port for staging outlink-topic-model.

https://gerrit.wikimedia.org/r/1279360

Change #1279360 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Use gRPC port for staging outlink-topic-model.

https://gerrit.wikimedia.org/r/1279360

Change #1279375 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] Enable Knative HTTP/2 auto-detection on ml-staging-codfw.

https://gerrit.wikimedia.org/r/1279375

Change #1279375 merged by jenkins-bot:

[operations/deployment-charts@master] Enable Knative HTTP/2 auto-detection on ml-staging-codfw.

https://gerrit.wikimedia.org/r/1279375

Update on current state of things.

After merging a few fixes (gRPC moved to 50051 with nodePort 30051, port 50051 added to the ingress gateway), the path from the stathost to staging works at the TLS layer (verified by curl -kvv "https://ml-staging2003.codfw.wmnet:30051" -H "Host: outlink-topic-model.articletopic-outlink.wikimedia.org" --http2), but gRPC requests to the ISVC still fail.

Progress on the ISVC side:

  1. In https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1279360 we added an h2c-named container port (8081) to the staging chart values so Knative routes traffic to the gRPC endpoint instead of the HTTP one. This currently breaks REST on staging, but in the future we'll have 2 separate deployments for HTTP and gRPC.
  2. Added autodetect-http2: "enabled" in Knative: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1279375. Confirmed pods now have ENABLE_HTTP2_AUTO_DETECTION=true.

What currently works:

  • KServe gRPC server starts on [::]:8081.
  • outlink-topic-model service exposes port 81 with appProtocol: kubernetes.io/h2c, targetPort is 8013, endpoints contain the new pod IP.
  • knative-local-gateway service has 3 healthy endpoints.
  • Knative VirtualServices route correctly: ingressgateway > knative-local-gateway > predictor svc.

ISVC request still fails with

Traceback (most recent call last):
  File "/home/bwojtowicz/grpc_test.py", line 39, in <module>
    response = stub.ModelInfer(request, metadata=metadata)
  File "/srv/home/bwojtowicz/myenv/lib/python3.9/site-packages/grpc/_channel.py", line 1159, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/srv/home/bwojtowicz/myenv/lib/python3.9/site-packages/grpc/_channel.py", line 990, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "upstream connect error or disconnect/reset before headers. reset reason: connection timeout"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection timeout", grpc_status:14}"

Change #1280202 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] kserve-inference: allow ingress on queue-proxy port 8013.

https://gerrit.wikimedia.org/r/1280202

Change #1280202 merged by jenkins-bot:

[operations/deployment-charts@master] kserve-inference: allow ingress on queue-proxy port 8013.

https://gerrit.wikimedia.org/r/1280202

Happy update!

We've managed to make gRPC work. The issues above stemmed from Knative's queue-proxy listening on port 8013 when h2c is used, whereas all previous HTTP deployments used port 8012. The fix was to add port 8013 to the relevant NetworkPolicy: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1280202.
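The port mismatch can be expressed as a tiny check: given the ports a NetworkPolicy allows and the protocols a deployment serves, report which queue-proxy ports are still blocked. The port numbers come from this task; the mapping keys and the helper name `missing_ingress_ports` are hypothetical.

```python
from typing import Iterable, Set

# Queue-proxy ports as described in this task:
# 8012 for the existing HTTP deployments, 8013 when h2c is used.
QUEUE_PROXY_PORTS = {"http1": 8012, "h2c": 8013}


def missing_ingress_ports(allowed_ports: Set[int], protocols: Iterable[str]) -> Set[int]:
    """Return the queue-proxy ports needed by `protocols` but not allowed."""
    needed = {QUEUE_PROXY_PORTS[name] for name in protocols}
    return needed - allowed_ports


# Before the fix: only 8012 allowed, so an h2c deployment is unreachable.
print(missing_ingress_ports({8012}, ["h2c"]))
# After the fix: both ports allowed, nothing is missing.
print(missing_ingress_ports({8012, 8013}, ["http1", "h2c"]))
```

A blocked port of this kind shows up at the client as a connection timeout rather than a refusal, which matches the "upstream connect error ... connection timeout" seen earlier.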

One can successfully test the gRPC endpoint from stat host using the script below:

import grpc
import json

from kserve.protocol.grpc import grpc_predict_v2_pb2, grpc_predict_v2_pb2_grpc

# Bypass LVS until pybal config for port 30051 lands; istio routes internally
# so any ml-staging-codfw node works.
# Force IPv4 — gRPC otherwise picks the AAAA record first and the IPv6 path
# from the analytics subnet doesn't currently route to ml-staging-codfw.
TARGET = "10.192.21.9:30051"
AUTHORITY = "outlink-topic-model.articletopic-outlink.wikimedia.org"

with open("/etc/ssl/certs/ca-certificates.crt", "rb") as f:
    ca_bundle = f.read()

channel = grpc.secure_channel(
    TARGET,
    grpc.ssl_channel_credentials(root_certificates=ca_bundle),
    options=(
        ("grpc.default_authority", AUTHORITY),
        ("grpc.ssl_target_name_override", AUTHORITY),
    ),
)

request = grpc_predict_v2_pb2.ModelInferRequest()
request.model_name = "outlink-topic-model"
request.id = "test-123"

metadata = [("host", AUTHORITY)]

input_data = json.dumps({"page_id": 5355, "lang": "en"}).encode("utf-8")
tensor = request.inputs.add()
tensor.name = "input"
tensor.shape.extend([1])
tensor.datatype = "BYTES"
tensor.contents.bytes_contents.append(input_data)

stub = grpc_predict_v2_pb2_grpc.GRPCInferenceServiceStub(channel)
response = stub.ModelInfer(request, metadata=metadata)

result = json.loads(response.outputs[0].contents.bytes_contents[0].decode("utf-8"))
print(json.dumps(result, indent=2))

The main next step is making LVS aware of port 30051, as described in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service. Currently we're testing by talking directly to the staging node.

Change #1282328 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] lvs: expose grpc port on ml-serve staging

https://gerrit.wikimedia.org/r/1282328

Change #1283745 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] 1/3 ml-serve(grpc): etcd data for DNS Discovery

https://gerrit.wikimedia.org/r/1283745

Change #1283746 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] 2/3 ml-serve(grpc): add entry to service catalog

https://gerrit.wikimedia.org/r/1283746

Change #1283747 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] 3/3 ml-serve(grpc): add service to k8s pools

https://gerrit.wikimedia.org/r/1283747

Change #1282328 abandoned by Dpogorzelski:

[operations/puppet@production] lvs: expose grpc port on ml-serve staging

Reason:

closing this in favor of 3 separate changes

https://gerrit.wikimedia.org/r/1282328

I've reached out to #wikimedia-serviceops to get reviews for the attached patches to expose the grpc port on the ml-serve clusters (starting with ml-staging-codfw)

@DPogorzelski-WMF & @klausman please also coordinate with #wikimedia-traffic about this because they are responsible for LVS https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service

achou updated Other Assignee, added: klausman.

I probably don't have all the details but in general istio is capable of handling HTTP/1.1 and HTTP/2 on the same port. Since gRPC runs over HTTP/2 it works on the same port as well. We run setups like this in wikikube and aux clusters.

One thing to mention is that Istio usually establishes HTTP/1.1 connections to upstream, regardless of the protocol the client used. This can be changed by either setting the useClientProtocol option in the Istio DestinationRule or by specifying the protocol at the K8s Service object (via the appProtocol field). See https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/. I'm not sure how much of that is true for knative ofc.
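For illustration, the two options could look roughly like the manifests below, written as Python dictionaries so they can be dumped to JSON. All names, hosts and port numbers are hypothetical; the appProtocol value is the one observed on the predictor service earlier in this task, and whether a Knative-managed setup honours either option is an open question, as noted above.

```python
import json

# Option 1: DestinationRule asking Istio to keep the client's protocol upstream.
destination_rule = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "DestinationRule",
    "metadata": {"name": "example-use-client-protocol"},  # hypothetical name
    "spec": {
        "host": "example-svc.example-ns.svc.cluster.local",  # hypothetical host
        "trafficPolicy": {
            "connectionPool": {"http": {"useClientProtocol": True}},
        },
    },
}

# Option 2: declare the protocol on the Kubernetes Service port itself.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "example-svc"},  # hypothetical name
    "spec": {
        "ports": [
            {
                "name": "grpc",
                "port": 8081,  # hypothetical port
                "targetPort": 8081,
                "appProtocol": "kubernetes.io/h2c",
            }
        ],
    },
}

print(json.dumps(destination_rule, indent=2))
print(json.dumps(service, indent=2))
```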

If you really need an additional LVS service and you're already coordinating with the traffic team you may ask them to review LVS related patches if you feel unsafe with only inter-team reviews.

Change #1283745 merged by Ssingh:

[operations/puppet@production] ml-serve(grpc): step 1, etcd data for DNS Discovery

https://gerrit.wikimedia.org/r/1283745

Change #1283746 merged by Ssingh:

[operations/puppet@production] ml-serve(grpc): step 2, add entry to service catalog

https://gerrit.wikimedia.org/r/1283746

Change #1293120 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] ml-serve(grpc): step 4, change lvs state

https://gerrit.wikimedia.org/r/1293120

Change #1283747 merged by Ssingh:

[operations/puppet@production] ml-serve(grpc): step 3, add service to k8s pools

https://gerrit.wikimedia.org/r/1283747

Change #1293121 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/puppet@production] ml-serve(grpc): step 5, change lvs state

https://gerrit.wikimedia.org/r/1293121

Change #1293120 merged by Ssingh:

[operations/puppet@production] ml-serve(grpc): step 4, change lvs state

https://gerrit.wikimedia.org/r/1293120

Mentioned in SAL (#wikimedia-operations) [2026-05-25T13:57:59Z] <sukhe> sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049": NOOP change, since service is codfw only

Mentioned in SAL (#wikimedia-operations) [2026-05-25T14:00:16Z] <sukhe> sudo cumin 'A:lvs and A:lvs-secondary-codfw' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049"'

Mentioned in SAL (#wikimedia-operations) [2026-05-25T14:02:18Z] <sukhe> sukhe@lvs2014:~$ sudo systemctl restart pybal.service": T424049

Mentioned in SAL (#wikimedia-operations) [2026-05-25T14:03:54Z] <sukhe> sudo cumin 'A:lvs and A:lvs-low-traffic-codfw' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049"'

Mentioned in SAL (#wikimedia-operations) [2026-05-25T14:05:11Z] <sukhe> sukhe@lvs2013:~$ sudo systemctl restart pybal.service: T424049

Mentioned in SAL (#wikimedia-operations) [2026-05-25T14:06:15Z] <sukhe> curl localhost:9090/pools/inference-staging-grpc_30051 shows ml-staging200[1-3].codfw.wmnet as enabled and pooled: T424049

Change #1293121 merged by Ssingh:

[operations/puppet@production] ml-serve(grpc): step 5, change lvs state

https://gerrit.wikimedia.org/r/1293121

Thanks to the changes to LVS, I successfully tested the gRPC connection on staging with the script below! 🎉
We will now do the gRPC integration with Hoarde on staging and, if everything goes smoothly, we can merge similar LVS changes to production. Thank you for all the work here!


Script used for testing:

import grpc
import json

from kserve.protocol.grpc import grpc_predict_v2_pb2, grpc_predict_v2_pb2_grpc

TARGET = "inference-staging.svc.codfw.wmnet:30051"
AUTHORITY = "outlink-topic-model.articletopic-outlink.wikimedia.org"

with open("/etc/ssl/certs/ca-certificates.crt", "rb") as f:
    ca_bundle = f.read()

channel = grpc.secure_channel(
    TARGET,
    grpc.ssl_channel_credentials(root_certificates=ca_bundle),
    options=(
        ("grpc.default_authority", AUTHORITY),
        ("grpc.ssl_target_name_override", AUTHORITY),
    ),
)

request = grpc_predict_v2_pb2.ModelInferRequest()
request.model_name = "outlink-topic-model"
request.id = "test-123"

metadata = [("host", AUTHORITY)]

input_data = json.dumps({"page_id": 5355, "lang": "en"}).encode("utf-8")
tensor = request.inputs.add()
tensor.name = "input"
tensor.shape.extend([1])
tensor.datatype = "BYTES"
tensor.contents.bytes_contents.append(input_data)

stub = grpc_predict_v2_pb2_grpc.GRPCInferenceServiceStub(channel)
response = stub.ModelInfer(request, metadata=metadata)

result = json.loads(response.outputs[0].contents.bytes_contents[0].decode("utf-8"))
print(json.dumps(result, indent=2))

@BWojtowicz-WMF @DPogorzelski-WMF did you see what Janis suggested in T424049#11940457? Given the current config I think that the second Gateway is not needed, so all the LVS patches should be avoidable. Before proceeding further, could you please test routing traffic in staging through a single Ingress, like we do in other clusters? It would be nice to keep a single standard across clusters, and a single endpoint as well.

@elukey
Thanks, I indeed missed it! Initially I thought that a 2nd gateway might be a way to overcome the single-port limitation of Knative's auto-managed Gateway, but now I see that this is not the case.
I just tested the script above, pointing at port 30443 directly instead of the new port 30051, and it succeeded, which proves that the LVS changes were indeed not needed(?).

@BWojtowicz-WMF I think that is really awesome, we'll avoid prod changes on LVS, which is always a good thing :) Rolling back the work shouldn't be too bad, we can do it anytime. Thanks for confirming!

@DPogorzelski-WMF when you have a moment could you please work with Traffic to roll back the staging changes? I know, bad timing :(

We confirmed that the gRPC endpoint works via the standard 30443 port on the production server without the LVS changes. Glad that we found that out, thank you @JMeybohm and @elukey!

@isarantopoulos o/ before closing this task let's make sure that the staging endpoint changes for port 30051 are rolled back :)