
Create a LB service for inference.discovery.wmnet
Open, Needs Triage, Public

Description

The ML serve clusters will need an LB endpoint to reach the istio ingress gateway pods, which in turn (using HTTP headers) will route requests to the correct pod backends.

We should do the following:

  1. Add a TLS certificate for inference.discovery.wmnet to the istio ingress pods, configured via knative (which holds the L4/L7 configuration for istio in its config). This should be similar to what we do for other services.
  2. Remove port 80 from istioctl's config
  3. Add an LVS endpoint in front of the secure port.

At the end we should be able to get scores for revisions using https://inference.discovery.wmnet + related HTTP headers.
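For reference, the end state on the Istio/Knative side should look roughly like the Gateway sketched below. This is a hand-written illustration, not the actual deployment-charts values; the secret name is made up. The point is a single HTTPS server terminating TLS with the inference.discovery.wmnet certificate, and no plain HTTP server on port 80:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway
  servers:
    # HTTPS only: no port-80 server is defined (step 2 above).
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "*"
      tls:
        mode: SIMPLE
        # Kubernetes TLS secret holding the inference.discovery.wmnet
        # certificate and key (step 1 above); the name is illustrative.
        credentialName: inference-discovery-wmnet-certs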

Event Timeline

Change 716235 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Introduce the secrets helm chart

https://gerrit.wikimedia.org/r/716235

Change 717069 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: add serving chart to helmfile config

https://gerrit.wikimedia.org/r/717069

Change 716235 merged by Elukey:

[operations/deployment-charts@master] Introduce the secrets helm chart

https://gerrit.wikimedia.org/r/716235

Change 717069 merged by Elukey:

[operations/deployment-charts@master] knative-serving: add secrets chart to helmfile config

https://gerrit.wikimedia.org/r/717069

Change 717438 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: improve the helmfile configuration

https://gerrit.wikimedia.org/r/717438

Change 717438 merged by Elukey:

[operations/deployment-charts@master] knative-serving: improve the helmfile configuration

https://gerrit.wikimedia.org/r/717438

After some tests I was able to get a nice 404 via https:

elukey@ml-serve-ctrl1001:~$ curl --resolve inference.discovery.wmnet:18443:`dig +short ml-serve1001.eqiad.wmnet` "https://inference.discovery.wmnet:18443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.elukey-test.example.com" -i
HTTP/2 404 
date: Fri, 03 Sep 2021 15:39:49 GMT
server: istio-envoy

The proxy listeners look good:

root@deploy1002:/home/elukey# istioctl-1.9.5 -n istio-system proxy-config listener deploy/istio-ingressgateway
ADDRESS PORT  MATCH                          DESTINATION
0.0.0.0 8080  ALL                            Route: http.80
0.0.0.0 8081  ALL                            Route: http.8081
0.0.0.0 8443  SNI: inference.discovery.wmnet Route: https.443.https.knative-ingress-gateway.knative-serving
0.0.0.0 15021 ALL                            Inline Route: /healthz/ready*
0.0.0.0 15090 ALL                            Inline Route: /stats/prometheus*

But what's currently not working is the route configuration:

NAME                                                        DOMAINS                                                              MATCH                  VIRTUAL SERVICE
[...]
https.443.https.knative-ingress-gateway.knative-serving     *                                                                    /*                     404

It should be a matter of knative/istio configs!

Change 717482 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: change node port for HTTPS

https://gerrit.wikimedia.org/r/717482

Change 717482 merged by Elukey:

[operations/deployment-charts@master] istio: change node port for HTTPS

https://gerrit.wikimedia.org/r/717482

Change 719035 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving,istio: allow only https for ml-serve

https://gerrit.wikimedia.org/r/719035

Change 719035 merged by Elukey:

[operations/deployment-charts@master] knative-serving,istio: allow only https for ml-serve

https://gerrit.wikimedia.org/r/719035

Found the problem: the hosts attribute in the Knative serving config for the istio ingress gateway lists the backend hosts / istio virtual services that are allowed to be targeted, and I had put inference.discovery.wmnet in there instead (which of course makes no sense). I also removed the port 80 / HTTP config from istio/knative on ml-serve-eqiad, and:

elukey@ml-serve-ctrl1001:~$ curl --resolve inference.discovery.wmnet:18443:`dig +short ml-serve1001.eqiad.wmnet` "https://inference.discovery.wmnet:18443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.elukey-test.example.com" 
HTTP/2 200 
content-length: 113
content-type: application/json; charset=UTF-8
date: Mon, 06 Sep 2021 06:56:21 GMT
server: istio-envoy
x-envoy-upstream-service-time: 454

{"predictions": {"prediction": false, "probability": {"false": 0.8884455065938814, "true": 0.11155449340611859}}}

Change 719038 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: set min protocol version to TLSV1_2

https://gerrit.wikimedia.org/r/719038

Change 719038 merged by Elukey:

[operations/deployment-charts@master] knative-serving: set min protocol version to TLSV1_2

https://gerrit.wikimedia.org/r/719038

Change 719225 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] conftool-data: add worker nodes to ml_serve

https://gerrit.wikimedia.org/r/719225

Change 719226 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] conftool-data: add new inference discovery service

https://gerrit.wikimedia.org/r/719226

Allocated two IPs for the new service in netbox:

https://netbox.wikimedia.org/ipam/ip-addresses/8994/
https://netbox.wikimedia.org/ipam/ip-addresses/8995/ (the codfw one just reserved since the cluster is not up yet)

Change 719227 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add inference eqiad service record

https://gerrit.wikimedia.org/r/719227

Change 719235 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: change ingress gateway nodeport to 4688

https://gerrit.wikimedia.org/r/719235

Change 719239 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::worker: add LVS configuration for the inference svc

https://gerrit.wikimedia.org/r/719239

Change 719235 merged by Elukey:

[operations/deployment-charts@master] istio: change ingress gateway nodeport to 4688

https://gerrit.wikimedia.org/r/719235

Change 719225 merged by Elukey:

[operations/puppet@production] conftool-data: add worker nodes to ml_serve

https://gerrit.wikimedia.org/r/719225

Change 719226 merged by Elukey:

[operations/puppet@production] conftool-data: add new inference discovery service

https://gerrit.wikimedia.org/r/719226

Change 719227 merged by Elukey:

[operations/dns@master] Add inference eqiad service record

https://gerrit.wikimedia.org/r/719227

An interesting thing happened while trying to configure https://gerrit.wikimedia.org/r/719227 - in Pybal we cannot specify one port for the health checks and a different port for the service HTTP traffic. This is how the Istio Ingress Gateway is currently configured:

root@deploy1002:~# istioctl-1.9.5 -n istio-system proxy-config listeners deploy/istio-ingressgateway
ADDRESS PORT  MATCH DESTINATION
0.0.0.0 8081  ALL   Route: http.8081
0.0.0.0 8443  ALL   Route: https.443.https.knative-ingress-gateway.knative-serving
0.0.0.0 15021 ALL   Inline Route: /healthz/ready*
0.0.0.0 15090 ALL   Inline Route: /stats/prometheus*

We expose the 8443 port via a node port, and the same goes for 15021. Since Istio + Knative already have their own health check mechanisms, we could simply resolve the problem by having Pybal just check whether a TCP connection to the service port can be made (so basically whether the pods are up at the L4 level). It should be enough for the moment, but there may be more/different things that we could do in the future to improve this (for example it would be nice to use an L7 health check via /healthz/ready).
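A plain TCP check maps to Pybal's IdleConnection monitor. In the service::catalog hiera entry this would look roughly as follows (a sketch only, assuming the usual hieradata/common/service.yaml layout; the IP placeholder, conftool service name and monitor tunables are illustrative and may differ from what we actually commit):

inference:
  description: ML inference service (Knative/KServe behind the Istio ingress gateway)
  encryption: true
  ip:
    eqiad: <IP allocated in netbox>   # placeholder
  port: 30443                         # ingress gateway node port
  sites:
    - eqiad
  lvs:
    enabled: true
    class: low-traffic
    scheduler: wrr
    protocol: tcp
    conftool:
      cluster: ml_serve
      service: istio-ingressgateway   # illustrative name
    monitors:
      # L4-only check: Pybal just verifies that it can open (and keep open)
      # a TCP connection to the node port; no HTTP request is made.
      IdleConnection:
        timeout-clean-reconnect: 3
        max-delay: 300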

@klausman lemme know your thoughts!

Change 719465 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] charts/secrets: chart type is not valid in apiVersion v1

https://gerrit.wikimedia.org/r/719465

Change 719465 merged by jenkins-bot:

[operations/deployment-charts@master] charts/secrets: chart type is not valid in apiVersion v1

https://gerrit.wikimedia.org/r/719465

Change 720009 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set lvs_setup status for the inference service

https://gerrit.wikimedia.org/r/720009

Change 730419 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom.deploy.d: change the ml-serve' istio ingress node port

https://gerrit.wikimedia.org/r/730419

Change 730419 merged by Elukey:

[operations/deployment-charts@master] custom.deploy.d: change the ml-serve' istio ingress node port

https://gerrit.wikimedia.org/r/730419

Change 719239 merged by Elukey:

[operations/puppet@production] role::ml_k8s::worker: add LVS configuration for the inference svc

https://gerrit.wikimedia.org/r/719239

Change 720009 merged by Elukey:

[operations/puppet@production] Set lvs_setup status for the inference service

https://gerrit.wikimedia.org/r/720009

Mentioned in SAL (#wikimedia-operations) [2021-10-13T14:36:00Z] <elukey> restart pybal on lvs1016 (low-traffic secondary) to pick up new config for inference.discovery.wmnet - T289835

Mentioned in SAL (#wikimedia-operations) [2021-10-13T14:50:17Z] <elukey> restart pybal on lvs1015 (low-traffic primary) to pick up new config for inference.discovery.wmnet - T289835

Change 730541 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet

https://gerrit.wikimedia.org/r/730541

elukey@ml-serve-ctrl1001:~$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" -X POST -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality.wikimedia.org" --http1.1
HTTP/1.1 200 OK
content-length: 112
content-type: application/json; charset=UTF-8
date: Wed, 13 Oct 2021 15:00:58 GMT
server: istio-envoy
x-envoy-upstream-service-time: 302

{"predictions": {"prediction": true, "probability": {"false": 0.06715093098078351, "true": 0.9328490690192165}}}
real	0m0.332s
user	0m0.017s
sys	0m0.008s

The eqiad endpoint works! To get the .discovery.wmnet one enabled we'll need to complete the steps outlined in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service, but since we don't want to page SRE while we are testing, I am inclined to wait a little bit longer :)