
Create a LB service for inference.discovery.wmnet
Closed, Resolved (Public)

Description

The ML serve clusters will need an LB endpoint to reach the istio ingress gateway pods, which in turn (using HTTP headers) will route requests to the correct pod backends.

We should do the following:

  1. Add a TLS certificate for inference.discovery.wmnet to the istio ingress pods, configured via knative (that holds the L4/7 configurations for istio in its config). This should be similar to what we do for other services.
  2. Remove port 80 from istioctl's config.
  3. Add an LVS endpoint in front of the secure port.

In the end, we should be able to get scores for revisions using https://inference.discovery.wmnet plus the related HTTP headers.
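As a rough illustration of the routing scheme described above (a sketch only, not the actual istio implementation, which uses Gateway/VirtualService objects): the LB terminates at a single endpoint, and the ingress gateway dispatches to a backend based on the HTTP Host header. The hostnames and backend names below are examples, not real configuration.

```python
# Sketch of Host-header based dispatch, as done by the istio ingress
# gateway behind inference.discovery.wmnet. Illustrative only: the real
# gateway is configured via istio VirtualService/Gateway objects.

# Hypothetical map of Host header -> backend service.
ROUTES = {
    "enwiki-damaging.example.org": "enwiki-damaging-predictor",
    "enwiki-goodfaith.example.org": "enwiki-goodfaith-predictor",
}

def route(host_header: str) -> str:
    """Return the backend for a request, or '404' when no route matches."""
    return ROUTES.get(host_header, "404")

print(route("enwiki-goodfaith.example.org"))  # -> enwiki-goodfaith-predictor
print(route("unknown.example.com"))           # -> 404
```

This is also why all the curl tests in this task pass an explicit `-H "Host: ..."`: the LB endpoint is shared, and only the Host header selects the model backend.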

Event Timeline

Change 716235 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Introduce the secrets helm chart

https://gerrit.wikimedia.org/r/716235

Change 717069 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: add serving chart to helmfile config

https://gerrit.wikimedia.org/r/717069

Change 716235 merged by Elukey:

[operations/deployment-charts@master] Introduce the secrets helm chart

https://gerrit.wikimedia.org/r/716235

Change 717069 merged by Elukey:

[operations/deployment-charts@master] knative-serving: add secrets chart to helmfile config

https://gerrit.wikimedia.org/r/717069

Change 717438 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: improve the helmfile configuration

https://gerrit.wikimedia.org/r/717438

Change 717438 merged by Elukey:

[operations/deployment-charts@master] knative-serving: improve the helmfile configuration

https://gerrit.wikimedia.org/r/717438

After some tests I was able to get a nice 404 over HTTPS:

elukey@ml-serve-ctrl1001:~$ curl --resolve inference.discovery.wmnet:18443:`dig +short ml-serve1001.eqiad.wmnet` "https://inference.discovery.wmnet:18443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.elukey-test.example.com"
HTTP/2 404 
date: Fri, 03 Sep 2021 15:39:49 GMT
server: istio-envoy

The proxy listeners look good:

root@deploy1002:/home/elukey# istioctl-1.9.5 -n istio-system proxy-config listener deploy/istio-ingressgateway
ADDRESS PORT  MATCH                          DESTINATION
0.0.0.0 8080  ALL                            Route: http.80
0.0.0.0 8081  ALL                            Route: http.8081
0.0.0.0 8443  SNI: inference.discovery.wmnet Route: https.443.https.knative-ingress-gateway.knative-serving
0.0.0.0 15021 ALL                            Inline Route: /healthz/ready*
0.0.0.0 15090 ALL                            Inline Route: /stats/prometheus*

But what's currently not working is the route config:

NAME                                                        DOMAINS                                                              MATCH                  VIRTUAL SERVICE
[...]
https.443.https.knative-ingress-gateway.knative-serving     *                                                                    /*                     404

It should just be a matter of Knative/Istio configuration!

Change 717482 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: change node port for HTTPS

https://gerrit.wikimedia.org/r/717482

Change 717482 merged by Elukey:

[operations/deployment-charts@master] istio: change node port for HTTPS

https://gerrit.wikimedia.org/r/717482

Change 719035 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving,istio: allow only https for ml-serve

https://gerrit.wikimedia.org/r/719035

Change 719035 merged by Elukey:

[operations/deployment-charts@master] knative-serving,istio: allow only https for ml-serve

https://gerrit.wikimedia.org/r/719035

Found the problem: the hosts attribute in the Knative serving config for the istio ingress gateway lists the backend hosts / istio virtual services that are allowed to be targeted, and I had mistakenly put inference.discovery.wmnet in there (which of course makes no sense). I also removed the port 80 / HTTP config from istio/knative on ml-serve-eqiad, and:

elukey@ml-serve-ctrl1001:~$ curl --resolve inference.discovery.wmnet:18443:`dig +short ml-serve1001.eqiad.wmnet` "https://inference.discovery.wmnet:18443/v1/models/enwiki-damaging:predict" -X POST -d @input.json -i -H "Host: enwiki-damaging.elukey-test.example.com" 
HTTP/2 200 
content-length: 113
content-type: application/json; charset=UTF-8
date: Mon, 06 Sep 2021 06:56:21 GMT
server: istio-envoy
x-envoy-upstream-service-time: 454

{"predictions": {"prediction": false, "probability": {"false": 0.8884455065938814, "true": 0.11155449340611859}}}

Change 719038 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: set min protocol version to TLSV1_2

https://gerrit.wikimedia.org/r/719038

Change 719038 merged by Elukey:

[operations/deployment-charts@master] knative-serving: set min protocol version to TLSV1_2

https://gerrit.wikimedia.org/r/719038

Change 719225 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] conftool-data: add worker nodes to ml_serve

https://gerrit.wikimedia.org/r/719225

Change 719226 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] conftool-data: add new inference discovery service

https://gerrit.wikimedia.org/r/719226

Allocated two IPs for the new service in netbox:

https://netbox.wikimedia.org/ipam/ip-addresses/8994/
https://netbox.wikimedia.org/ipam/ip-addresses/8995/ (the codfw one just reserved since the cluster is not up yet)

Change 719227 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add inference eqiad service record

https://gerrit.wikimedia.org/r/719227

Change 719235 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: change ingress gateway nodeport to 4688

https://gerrit.wikimedia.org/r/719235

Change 719239 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::worker: add LVS configuration for the inference svc

https://gerrit.wikimedia.org/r/719239

Change 719235 merged by Elukey:

[operations/deployment-charts@master] istio: change ingress gateway nodeport to 4688

https://gerrit.wikimedia.org/r/719235

Change 719225 merged by Elukey:

[operations/puppet@production] conftool-data: add worker nodes to ml_serve

https://gerrit.wikimedia.org/r/719225

Change 719226 merged by Elukey:

[operations/puppet@production] conftool-data: add new inference discovery service

https://gerrit.wikimedia.org/r/719226

Change 719227 merged by Elukey:

[operations/dns@master] Add inference eqiad service record

https://gerrit.wikimedia.org/r/719227

An interesting thing came up while trying to configure https://gerrit.wikimedia.org/r/719227 - in Pybal we cannot specify one port for health checks and a different port for service HTTP traffic. This is how the Istio Ingress Gateway currently works:

root@deploy1002:~# istioctl-1.9.5 -n istio-system proxy-config listeners deploy/istio-ingressgateway
ADDRESS PORT  MATCH DESTINATION
0.0.0.0 8081  ALL   Route: http.8081
0.0.0.0 8443  ALL   Route: https.443.https.knative-ingress-gateway.knative-serving
0.0.0.0 15021 ALL   Inline Route: /healthz/ready*
0.0.0.0 15090 ALL   Inline Route: /stats/prometheus*

We expose port 8443 via a node port, and the same for 15021. Since Istio + Knative already have their own health check mechanisms, we could simply resolve the problem by having Pybal just check whether a TCP connection to the service port can be made (so, basically, whether the pods are up at the L4 level). That should be enough for the moment, but there are more/different things we could do in the future to improve this (for example, it would be nice to use an L7 health check via /healthz/ready).
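The L4 check proposed above can be sketched in a few lines (illustrative only, not Pybal's actual code): the monitor just verifies that a TCP connection to the service's node port can be established.

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (L4 'up').

    Sketch of an L4 health check like the one Pybal would run against
    the node port; it says nothing about whether the service answers
    correctly at L7 (that is what /healthz/ready would add).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage against a worker's node port:
# tcp_check("ml-serve1001.eqiad.wmnet", 30443)
```

The trade-off is exactly the one noted above: a TCP connect only proves the listener is reachable, while an L7 check via /healthz/ready would also exercise Envoy's readiness logic.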

@klausman lemme know your thoughts!

Change 719465 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] charts/secrets: chart type is not valid in apiVersion v1

https://gerrit.wikimedia.org/r/719465

Change 719465 merged by jenkins-bot:

[operations/deployment-charts@master] charts/secrets: chart type is not valid in apiVersion v1

https://gerrit.wikimedia.org/r/719465

Change 720009 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set lvs_setup status for the inference service

https://gerrit.wikimedia.org/r/720009

Change 730419 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom.deploy.d: change the ml-serve' istio ingress node port

https://gerrit.wikimedia.org/r/730419

Change 730419 merged by Elukey:

[operations/deployment-charts@master] custom.deploy.d: change the ml-serve' istio ingress node port

https://gerrit.wikimedia.org/r/730419

Change 719239 merged by Elukey:

[operations/puppet@production] role::ml_k8s::worker: add LVS configuration for the inference svc

https://gerrit.wikimedia.org/r/719239

Change 720009 merged by Elukey:

[operations/puppet@production] Set lvs_setup status for the inference service

https://gerrit.wikimedia.org/r/720009

Mentioned in SAL (#wikimedia-operations) [2021-10-13T14:36:00Z] <elukey> restart pybal on lvs1016 (low-traffic secondary) to pick up new config for inference.discovery.wmnet - T289835

Mentioned in SAL (#wikimedia-operations) [2021-10-13T14:50:17Z] <elukey> restart pybal on lvs1015 (low-traffic primary) to pick up new config for inference.discovery.wmnet - T289835

Change 730541 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet

https://gerrit.wikimedia.org/r/730541

elukey@ml-serve-ctrl1001:~$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" -X POST -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality.wikimedia.org" --http1.1
HTTP/1.1 200 OK
content-length: 112
content-type: application/json; charset=UTF-8
date: Wed, 13 Oct 2021 15:00:58 GMT
server: istio-envoy
x-envoy-upstream-service-time: 302

{"predictions": {"prediction": true, "probability": {"false": 0.06715093098078351, "true": 0.9328490690192165}}}
real	0m0.332s
user	0m0.017s
sys	0m0.008s

The eqiad endpoint works! To get the .discovery.wmnet one enabled we'll need to complete the steps outlined in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service, but since we don't want to page SRE while we are still testing, I am inclined to wait a little longer :)
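Once the DNS records are in place, clients resolve the service name like any other host. A minimal sketch (the service name below is the one from this task; resolving it only works from inside the production network):

```python
import socket

def resolve_v4(name: str, port: int = 443) -> list[str]:
    """Return the sorted set of IPv4 addresses a service name resolves to."""
    infos = socket.getaddrinfo(
        name, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})

# Inside production this would return the LVS service IP allocated in netbox:
# resolve_v4("inference.svc.eqiad.wmnet")
```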

Things to do before closing:

Change 741924 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/dns@master] Add inference codfw service record

https://gerrit.wikimedia.org/r/741924

Change 741924 merged by Klausman:

[operations/dns@master] Add inference codfw service record

https://gerrit.wikimedia.org/r/741924

Change 741925 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] conftool-data: add new inference discovery service in codfw

https://gerrit.wikimedia.org/r/741925

Change 741925 merged by Klausman:

[operations/puppet@production] conftool-data: add new inference discovery service in codfw

https://gerrit.wikimedia.org/r/741925

$ sudo confctl select 'cluster=ml_serve,service=kubesvc' set/pooled=yes:weight=1
The selector you chose has selected the following objects:
{"/codfw/ml_serve/kubesvc": ["ml-serve2002.codfw.wmnet", "ml-serve2004.codfw.wmnet", "ml-serve2001.codfw.wmnet", "ml-serve2003.codfw.wmnet"], "/eqiad/ml_serve/kubesvc": ["ml-serve1003.eqiad.wmnet", "ml-serve1004.eqiad.wmnet", "ml-serve1002.eqiad.wmnet", "ml-serve1001.eqiad.wmnet"]}
Ok to continue? [y/N]
confctl>y
codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2004.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2001.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2003.codfw.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1003.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1004.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1002.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1001.eqiad.wmnet: weight changed 1 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubesvc
$
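As a rough illustration of what pooled=yes with weight=1 means for traffic distribution (a sketch only, not IPVS's actual scheduler): with equal weights, each pooled backend receives an equal share of new connections.

```python
import itertools

def weighted_rr(backends: dict[str, int]):
    """Yield backends in proportion to their weight.

    Equal weights reduce to plain round-robin. This is only a sketch of
    LVS-style weighted round-robin; the real IPVS 'wrr' scheduler
    interleaves picks more evenly.
    """
    expanded = [b for b, w in backends.items() for _ in range(w)]
    return itertools.cycle(expanded)

# With the weights set above, the two example hosts alternate:
pool = {"ml-serve1001.eqiad.wmnet": 1, "ml-serve1002.eqiad.wmnet": 1}
picker = weighted_rr(pool)
print([next(picker) for _ in range(4)])
```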

Change 741934 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] role::ml_k8s::worker: Activate LVS config for inference in codfw

https://gerrit.wikimedia.org/r/741934

Change 741934 merged by Klausman:

[operations/puppet@production] role::ml_k8s::worker: Activate LVS config for inference in codfw

https://gerrit.wikimedia.org/r/741934

Mentioned in SAL (#wikimedia-operations) [2021-11-25T15:57:03Z] <vgutierrez> restarting pybal on lvs2010 - T289835

Mentioned in SAL (#wikimedia-operations) [2021-11-25T16:10:19Z] <klausman> restarting pybal on lvs2009 T289835

Change 741939 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add discovery record support for the inference LVS

https://gerrit.wikimedia.org/r/741939

Change 741940 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] service::catalog: set inference as active-active

https://gerrit.wikimedia.org/r/741940

Change 741940 merged by Elukey:

[operations/puppet@production] service::catalog: set inference as active-active

https://gerrit.wikimedia.org/r/741940

Change 741939 merged by Elukey:

[operations/dns@master] Add discovery record support for the inference LVS

https://gerrit.wikimedia.org/r/741939

Next steps to finish:

  1. Move the LVS monitoring config to "production" (a prerequisite for the next step). This will page SRE if the endpoint is down, so we should do it only once we are sure we won't run more tests that could cause instability. See T298976 for an example.
  2. Create the inference.discovery.wmnet record/configuration.

Steps outlined in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service

Change 753461 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom_deploy.d: improve istio ingress gateway's config for ml-serve

https://gerrit.wikimedia.org/r/753461

Change 753461 merged by Elukey:

[operations/deployment-charts@master] custom_deploy.d: improve istio ingress gateways' config for ml-serve

https://gerrit.wikimedia.org/r/753461

Change 755715 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set inference.discovery.wmnet to production stage

https://gerrit.wikimedia.org/r/755715

Change 755715 merged by Elukey:

[operations/puppet@production] Set inference.discovery.wmnet to production stage

https://gerrit.wikimedia.org/r/755715

Change 730541 merged by Elukey:

[operations/dns@master] Add dns discovery settings for inference.svc.{eqiad,codfw}.wmnet

https://gerrit.wikimedia.org/r/730541

The inference endpoint is up, but there is still an error with the monitoring:

PROBLEM - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: TCP CRITICAL - Invalid hostname

The alert doesn't page, since we've set the critical flag to false. Once we're ready for prime time (after the MVP), we'll turn it on.

elukey claimed this task.