
ML Sandbox Transformer Configuration
Closed, Resolved (Public)

Description

We have an ML development sandbox running the WMF KServe stack on cloudvps.
Info: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/ML-Sandbox

Currently we can run an inference service with a predictor and things work well. However, when we add a transformer to the isvc spec, requests fail with a 503 error.

Digging a bit deeper, it seems that we can reach both the predictor and transformer endpoints directly
(e.g. enwiki-articlequality-predictor-default.kserve-test.example.com).

However, when we use the standard top-level service hostname (e.g. enwiki-articlequality.kserve-test.example.com), we get a 503.

This is most likely due to how we have our cluster-local-gateway configured in istio. The top-level kserve isvc should be able to route the incoming request to the transformer, which should then communicate with the predictor.
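
The three hostnames involved can be derived from the isvc name. This is a minimal sketch, assuming the `<name>-<component>-default.<namespace>.<domain>` pattern seen in the examples in this task holds in general; the function name `isvc_hostnames` is hypothetical and not part of KServe.

```python
def isvc_hostnames(name: str, namespace: str, domain: str) -> dict:
    """Return the top-level and per-component hostnames for an isvc.

    The naming pattern is inferred from the hostnames quoted in this task,
    not taken from the KServe source.
    """
    suffix = f"{namespace}.{domain}"
    return {
        "top_level": f"{name}.{suffix}",
        "transformer": f"{name}-transformer-default.{suffix}",
        "predictor": f"{name}-predictor-default.{suffix}",
    }


hosts = isvc_hostnames("enwiki-articlequality", "kserve-test", "example.com")
for role, host in hosts.items():
    print(f"{role}: {host}")
```

In the failing setup, requests to the `transformer` and `predictor` hostnames succeed while the `top_level` one returns the 503.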

Event Timeline

ACraze changed the task status from Open to In Progress. (Jan 26 2022, 9:14 PM)
ACraze updated the task description.
ACraze added subscribers: kevinbazira, elukey, klausman.

I can reach the transformer endpoint directly, but the request fails with a 503 error. When I inspect the transformer logs, I see the following:

[E 220124 20:50:32 web:2243] 500 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 157.61ms
[E 220124 20:52:14 web:1793] Uncaught exception POST /v1/models/enwiki-articlequality:predict (127.0.0.1)
    HTTPServerRequest(protocol='http', host='enwiki-articlequality-transformer-default.kserve-test.example.com', method='POST', uri='/v1/models/enwiki-articlequality:predict', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/opt/lib/python/site-packages/tornado/web.py", line 1704, in _execute
        result = await result
      File "/opt/lib/python/site-packages/kserve/handlers/http.py", line 70, in post
        response = await model(body)
      File "/opt/lib/python/site-packages/kserve/kfmodel.py", line 59, in __call__
        response = (await self.predict(request)) if inspect.iscoroutinefunction(self.predict) \
      File "/opt/lib/python/site-packages/kserve/kfmodel.py", line 153, in predict
        body=json.dumps(request)
      File "/opt/lib/python/site-packages/tornado/simple_httpclient.py", line 344, in run
        source_ip=source_ip,
      File "/opt/lib/python/site-packages/tornado/tcpclient.py", line 265, in connect
        addrinfo = await self.resolver.resolve(host, port, af)
      File "/opt/lib/python/site-packages/tornado/netutil.py", line 399, in resolve
        None, _resolve_addr, host, port, family
      File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/opt/lib/python/site-packages/tornado/netutil.py", line 382, in _resolve_addr
        addrinfo = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
      File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno -5] No address associated with hostname
[E 220124 20:52:14 web:2243] 500 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 212.39ms

The predictor_host is None, so the feature values never get sent to the predictor.

Looking at the KServe transformer docs: https://kserve.github.io/website/modelserving/v1beta1/transformer/torchserve_image_transformer/#build-transformer-image

[...] when predictor_host is passed the predict handler by default makes a HTTP call to the predictor url and gets back a response which then passes to postprocess handler. KServe automatically fills in the predictor_host for Transformer and handle the call to the Predictor
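
To illustrate why an unset predictor_host could end in a DNS failure like the one in the traceback, here is a minimal sketch. The URL template and the `predictor_url` helper are assumptions for illustration, not the actual KServe code. If the host is formatted into a URL without a check, the hostname literally becomes "None", which cannot be resolved.

```python
# Assumed URL template for illustration; the real KServe code may differ.
PREDICTOR_URL_FORMAT = "http://{0}/v1/models/{1}:predict"


def predictor_url(predictor_host, model_name):
    """Build the predictor URL, refusing an unset host instead of formatting it."""
    if not predictor_host:
        raise ValueError("predictor_host is not set; cannot reach the predictor")
    return PREDICTOR_URL_FORMAT.format(predictor_host, model_name)


# What plain string formatting does with a missing host:
unchecked = PREDICTOR_URL_FORMAT.format(None, "enwiki-articlequality")
print(unchecked)

# The guarded helper fails early with a readable message instead.
try:
    predictor_url(None, "enwiki-articlequality")
except ValueError as err:
    print(err)
```

A guard like this would surface the misconfiguration before any network call is attempted, rather than as a `socket.gaierror` deep in the HTTP client.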

I've been rebuilding the sandbox cluster using the install script with the updated charts for knative and kserve. The KServe stack loads with all containers running fine. However, when I now deploy a new isvc (e.g. enwiki-articlequality) in a custom namespace like kserve-test, the images cannot be pulled from the WMF docker registry:

kserve-test       enwiki-articlequality-predictor-default-xwb8n-deployment-88mwlg   0/2     ImagePullBackOff   0          56m
kserve-test       enwiki-articlequality-transformer-default-dlgvf-deploymentmq8qx   0/2     ImagePullBackOff   0          56m

I double checked the knative config-deployment configmap and ensured that I included docker-registry.wikimedia.org in registriesSkippingTagResolving:
https://gitlab.wikimedia.org/accraze/ml-sandbox-cfg/-/blob/main/install.sh#L53

@elukey: are there any sort of network policy changes in the recent chart updates? I'm trying to understand why Knative is unable to pull images and make a Revision now. Could this be cert related?

edit: nvm - i was using an incorrect image tag that didn't exist, false alarm!

I think I've got the networking issue solved. The top-level isvc was unable to route to the transformer, because my cluster-local-gateway did not have the ports configured correctly in the Istio Operator.

I needed to update the Istio components with the following settings:

components:
  ingressGateways:
    - name: istio-ingressgateway
      enabled: true
    - name: cluster-local-gateway
      enabled: true
      label:
        istio: cluster-local-gateway
        app: cluster-local-gateway
      k8s:
        service:
          type: ClusterIP
          ports:
          - port: 15020
            targetPort: 15021
            name: status-port
          - port: 80
            name: http2
            targetPort: 8080
          - port: 443
            name: https
            targetPort: 8443

The install/config script has been updated as well:
https://gitlab.wikimedia.org/accraze/ml-sandbox-cfg/-/blob/main/istio-minimal-operator.yaml
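
The port mapping in the settings above can be sanity-checked with a small script. This is a sketch under assumptions: the helper `missing_ports` is hypothetical, the port list is copied by hand from the settings above rather than read from the cluster, and treating the http2 and https entries as the required ones is my assumption (the task does not say which entry was originally missing).

```python
# Service port -> targetPort pairs treated as required (an assumption, see above).
REQUIRED_PORTS = {80: 8080, 443: 8443}


def missing_ports(service_ports, required=REQUIRED_PORTS):
    """Return required (port, targetPort) pairs absent from a service's port list."""
    present = {(p.get("port"), p.get("targetPort")) for p in service_ports}
    return sorted(pair for pair in required.items() if pair not in present)


# Ports of the cluster-local-gateway service, copied from the settings above.
cluster_local_gateway_ports = [
    {"port": 15020, "targetPort": 15021, "name": "status-port"},
    {"port": 80, "targetPort": 8080, "name": "http2"},
    {"port": 443, "targetPort": 8443, "name": "https"},
]

print(missing_ports(cluster_local_gateway_ports))
```

An empty list means every required pair is present; any pair that is printed would need to be added to the gateway's service ports.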

I am now able to hit the enwiki-articlequality isvc deployed to the ml-sandbox. The request reaches the transformer, which fetches the article text and passes it to the predictor, which responds with a prediction!

@kevinbazira - can you try hitting enwiki-articlequality on the ml-sandbox to confirm the transformer routing works for you too? I have a test script in my home_dir if you want to use it:

root@ml-sandbox:/srv/home/accraze/isvcs/articlequality# ./test-aq.sh
enwiki-articlequality.kserve-test.wikimedia.org
* Expire in 0 ms for 6 (transfer 0x55beeadd7fb0)
*   Trying 192.168.49.2...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55beeadd7fb0)
* Connected to 192.168.49.2 (192.168.49.2) port 30702 (#0)
> POST /v1/models/enwiki-articlequality:predict HTTP/1.1
> Host: enwiki-articlequality.kserve-test.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> Content-Length: 71
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 71 out of 71 bytes
< HTTP/1.1 200 OK
< content-length: 225
< content-type: application/json; charset=UTF-8
< date: Wed, 02 Feb 2022 22:51:06 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 205
<
* Connection #0 to host 192.168.49.2 left intact
{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}
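
The response body above can be checked programmatically. This is a minimal sketch, assuming the JSON shape shown in the output; the helper name `summarize_prediction` is hypothetical. It confirms that the reported label is also the class with the highest probability.

```python
import json


def summarize_prediction(response_body: str):
    """Return (reported label, most probable label, probability sum) for a response body."""
    predictions = json.loads(response_body)["predictions"]
    probabilities = predictions["probability"]
    most_probable = max(probabilities, key=probabilities.get)
    return predictions["prediction"], most_probable, sum(probabilities.values())


# Response body copied from the curl output above.
body = (
    '{"predictions": {"prediction": "Stub", "probability": '
    '{"B": 0.017382693143129683, "C": 0.011305576384229396, '
    '"FA": 0.002078191955918339, "GA": 0.0029161293780774434, '
    '"Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}'
)

reported, most_probable, total = summarize_prediction(body)
print(reported, most_probable)
```

A check like this could be appended to a test script so that a malformed or inconsistent response is caught, not just a non-200 status.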

Thank you for working on this, @ACraze.

I logged into the ml-sandbox and first checked whether the enwiki-articlequality isvc is up and running:

root@ml-sandbox:/srv/home/kevinbazira/isvcs/articlequality# kubectl get po -A
NAMESPACE         NAME                                                              READY   STATUS    RESTARTS   AGE
.
.
.
kserve-test       enwiki-articlequality-predictor-default-7dpqn-deployment-dqkp9l   2/2     Running   0          9h
kserve-test       enwiki-articlequality-transformer-default-9vgwg-deploymentd5wjt   2/2     Running   0          9h
.
.
.

Then I went ahead and ran the test script:

root@ml-sandbox:/srv/home/kevinbazira/isvcs/articlequality# ./test-aq.sh
enwiki-articlequality.kserve-test.wikimedia.org
* Expire in 0 ms for 6 (transfer 0x558d15720fb0)
*   Trying 192.168.49.2...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x558d15720fb0)
* Connected to 192.168.49.2 (192.168.49.2) port 30702 (#0)
> POST /v1/models/enwiki-articlequality:predict HTTP/1.1
> Host: enwiki-articlequality.kserve-test.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> Content-Length: 71
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 71 out of 71 bytes
< HTTP/1.1 200 OK
< content-length: 225
< content-type: application/json; charset=UTF-8
< date: Thu, 03 Feb 2022 07:59:18 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 220
< 
* Connection #0 to host 192.168.49.2 left intact
{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}

I confirm that the transformer routing works for me too. Thanks again for digging into this.

Excellent, networking issues have been resolved and we can now run transformers on ml-sandbox.
Marking this as RESOLVED.