Create the ml-serve-staging k8s cluster
Closed, Resolved, Public

Description

In T294946, DCops racked and configured the ml-staging200[12] nodes. We should do the following:

  1. Reimage both nodes as Bullseye (with overlay partitions, etc.)
  2. Create ml-serve-staging-etcd200[1-3] VMs and the related etcd cluster
  3. Create ml-serve-staging-ctrl200[1-2] VMs (control plane nodes)
  4. Allocate network resources.
  5. Bootstrap the ml-serve-staging k8s cluster
  6. Add the inference-staging.svc.codfw.wmnet endpoint (or a discovery endpoint, if that makes sense; probably yes, for consistency).

The above plan is very high level; it will surely require more work. More details at https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New
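The bracket notation used in the plan (ml-staging200[12], ml-serve-staging-etcd200[1-3], ml-serve-staging-ctrl200[1-2]) can be spelled out with a small helper. This is only an illustrative sketch: the function name `expand_hosts` is hypothetical, and it assumes a single bracket group containing single digits or single-digit ranges, which is all this task uses.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group, e.g. 'ml-staging200[12]' or 'etcd200[1-3]'."""
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: a single host name
    prefix, body, suffix = match.groups()
    digits = []
    i = 0
    while i < len(body):
        # 'a-b' is an inclusive range of single digits; anything else is a literal digit
        if i + 2 < len(body) and body[i + 1] == "-":
            digits.extend(str(d) for d in range(int(body[i]), int(body[i + 2]) + 1))
            i += 3
        else:
            digits.append(body[i])
            i += 1
    return [f"{prefix}{d}{suffix}" for d in digits]


# Host groups named in the task description.
for group in ("ml-staging200[12]",
              "ml-serve-staging-etcd200[1-3]",
              "ml-serve-staging-ctrl200[1-2]"):
    print(group, "->", ", ".join(expand_hosts(group)))
```

Under that reading, the plan covers two worker nodes, three etcd VMs, and two control plane VMs.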

Details

Repo                          Branch      Lines +/-
operations/deployment-charts  master      +36 -0
operations/deployment-charts  master      +9 -0
operations/puppet             production  +29 -0
operations/deployment-charts  master      +13 -0
operations/deployment-charts  master      +1 -1
operations/puppet             production  +2 -1
labs/private                  master      +3 -0
operations/puppet             production  +15 -0
operations/puppet             production  +3 -1
operations/puppet             production  +32 -1
operations/dns                master      +4 -0
operations/deployment-charts  master      +5 -0
operations/puppet             production  +6 -7
labs/private                  master      +0 -54
operations/deployment-charts  master      +10 -5
operations/puppet             production  +12 -0
operations/puppet             production  +1 -1
operations/puppet             production  +0 -1
operations/deployment-charts  master      +71 -0
operations/puppet             production  +1 -1
operations/dns                master      +1 -0
operations/puppet             production  +238 -1
operations/puppet             production  +84 -1
operations/puppet             production  +159 -1
labs/private                  master      +0 -0
labs/private                  master      +52 -0

Event Timeline

ml-staging200x nodes reimaged with bullseye!

elukey triaged this task as Medium priority.

Change 772417 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Add ML staging k8s ctrl node config

https://gerrit.wikimedia.org/r/772417

Change 772866 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera: Add k8s dummy tokens for ML staging env

https://gerrit.wikimedia.org/r/772866

Change 772866 merged by Klausman:

[labs/private@master] hiera: Add k8s dummy tokens for ML staging env

https://gerrit.wikimedia.org/r/772866

Change 772871 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] labs: Add dummy keyfile for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/772871

Change 772871 merged by Klausman:

[labs/private@master] labs: Add dummy keyfile for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/772871

Change 774488 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Add ML staging k8s role

https://gerrit.wikimedia.org/r/774488

Change 774488 merged by Klausman:

[operations/puppet@production] hiera: Add ML staging k8s role

https://gerrit.wikimedia.org/r/774488

Change 775860 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera/modules: Add config for ML staging k8s workers

https://gerrit.wikimedia.org/r/775860

Change 775860 merged by Klausman:

[operations/puppet@production] hiera/modules: Add config for ML staging k8s workers

https://gerrit.wikimedia.org/r/775860

Change 772417 abandoned by Klausman:

[operations/puppet@production] hiera: Add ML staging k8s ctrl node config

Reason:

All covered in smaller CLs

https://gerrit.wikimedia.org/r/772417

Change 786319 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] Switch ML staging control plane to lvs_setup

https://gerrit.wikimedia.org/r/786319

Change 786320 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/dns@master] Add service IP for ML staging k8s ctrl plane

https://gerrit.wikimedia.org/r/786320

Change 786320 merged by Klausman:

[operations/dns@master] Add service IP for ML staging k8s ctrl plane

https://gerrit.wikimedia.org/r/786320

Change 786319 merged by Klausman:

[operations/puppet@production] hiera: Switch ML staging control plane to lvs_setup

https://gerrit.wikimedia.org/r/786319

Change 786808 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] admin_ng: Add config for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/786808

Change 786808 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add config for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/786808

Mentioned in SAL (#wikimedia-operations) [2022-05-31T07:27:39Z] <elukey> add profile k8s_mlstaging + authkey for ml-staging k8s - T302195

Change 801662 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move the ml-staging cluster under ml-serve's definition

https://gerrit.wikimedia.org/r/801662

Change 801662 merged by Elukey:

[operations/puppet@production] Move the ml-staging cluster under ml-serve's definition

https://gerrit.wikimedia.org/r/801662

Change 801742 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set cluster group to ml-serve for ml-staging control plane nodes

https://gerrit.wikimedia.org/r/801742

Change 801742 merged by Elukey:

[operations/puppet@production] Set cluster group to ml-serve for ml-staging control plane nodes

https://gerrit.wikimedia.org/r/801742

Change 801744 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::multirootca: add settings for the ml-staging cluster

https://gerrit.wikimedia.org/r/801744

Change 801766 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set cfssl-issuer's values for ml-serve clusters

https://gerrit.wikimedia.org/r/801766

Change 801744 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add settings for the ml-staging cluster

https://gerrit.wikimedia.org/r/801744

Change 801766 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set cfssl-issuer's values for ml-serve clusters

https://gerrit.wikimedia.org/r/801766

Change 802159 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] profile::kubernetes: remove ml-staging specific bits

https://gerrit.wikimedia.org/r/802159

We moved the ml-staging configs under the ml-serve umbrella, since the staging cluster will mostly be used to test serving workloads.

elukey@deploy1002:~$ ls /etc/helmfile-defaults/private/ml-serve_services/cfssl-issuer/
ml-serve-codfw.yaml  ml-serve-eqiad.yaml  ml-staging-codfw.yaml

The helmfile-specific configs are now populated as well.

Change 802159 merged by Elukey:

[labs/private@master] profile::kubernetes: remove ml-staging specific bits

https://gerrit.wikimedia.org/r/802159

Change 803295 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::prometheus: enable settings for k8s ml-staging

https://gerrit.wikimedia.org/r/803295

Change 803295 merged by Elukey:

[operations/puppet@production] role::prometheus: enable settings for k8s ml-staging

https://gerrit.wikimedia.org/r/803295

Completed the basic networking work (calico, eventrouter, coredns) + BGP config.

Next step: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Istio

Some notes:

  1. We should create an LVS endpoint called inference-staging.svc.codfw.wmnet to support this cluster
  2. Let's pay attention to what we have to configure for cert-manager to issue the certificate for the above endpoint. With the standard ml-serve settings, I am afraid that it will try to retrieve the inference.svc TLS cert from cfssl, which is probably not what we want. We should check what the serviceops team did and try to follow it.

The Istio config and (most of) the cert-manager config have been applied. For cert-manager, I need to sync up with Luca regarding the part of that config that refers to the ml-serve endpoints.

Change 805127 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-staging-codfw: Add override for cert names

https://gerrit.wikimedia.org/r/805127

Change 805127 merged by jenkins-bot:

[operations/deployment-charts@master] ml-staging-codfw: Add override for cert names

https://gerrit.wikimedia.org/r/805127

Change 805135 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/dns@master] Add inference-staging service IP (10.2.1.58)

https://gerrit.wikimedia.org/r/805135

Change 805135 merged by Klausman:

[operations/dns@master] Add inference-staging service IP (10.2.1.58)

https://gerrit.wikimedia.org/r/805135

Change 805329 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] service::catalog: Add inference-staging service

https://gerrit.wikimedia.org/r/805329

Change 805329 merged by Elukey:

[operations/puppet@production] service::catalog: Add inference-staging service

https://gerrit.wikimedia.org/r/805329

Change 807096 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] net: Add network config setup for ML staging k8s

https://gerrit.wikimedia.org/r/807096

Change 807133 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Switch ML staging inference endpoint to lvs_setup

https://gerrit.wikimedia.org/r/807133

Change 807133 merged by Klausman:

[operations/puppet@production] hiera: Switch ML staging inference endpoint to lvs_setup

https://gerrit.wikimedia.org/r/807133

Change 807096 merged by Klausman:

[operations/puppet@production] net: Add network config setup for ML staging k8s

https://gerrit.wikimedia.org/r/807096

Change 807502 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] pki: Add ML staging k8s to list of CAs

https://gerrit.wikimedia.org/r/807502

Change 807520 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] Add dummy secrets for ML staging k8s CA

https://gerrit.wikimedia.org/r/807520

Change 807520 merged by Klausman:

[labs/private@master] Add dummy secrets for ML staging k8s CA

https://gerrit.wikimedia.org/r/807520

Change 807502 merged by Klausman:

[operations/puppet@production] pki: Add ML staging k8s to list of CAs

https://gerrit.wikimedia.org/r/807502

Additional things done:

  • Firewall rules so Staging k8s can talk to the WMF PKI machines
  • PKI setup so the new cluster can be its own CA
  • Deployed knative
  • Deployed kserve

Still needs to be done:

  • Prometheus
  • Deploy a model and test if it works

Prometheus is now correctly set up with its own volumes (we hadn't done that yet), and I managed to save the old data.

Change 809146 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-staging: Add inference services for testing

https://gerrit.wikimedia.org/r/809146

Change 809149 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add ml-staging-codfw among the helmfile envs to test

https://gerrit.wikimedia.org/r/809149

Change 809149 merged by Elukey:

[operations/deployment-charts@master] Add ml-staging-codfw among the helmfile envs to test

https://gerrit.wikimedia.org/r/809149

Change 809146 merged by Klausman:

[operations/deployment-charts@master] ml-staging: Add inference services for testing

https://gerrit.wikimedia.org/r/809146

Change 809534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::worker::staging: add calico-cni config

https://gerrit.wikimedia.org/r/809534

Change 809534 merged by Elukey:

[operations/puppet@production] role::ml_k8s::worker::staging: add calico-cni config

https://gerrit.wikimedia.org/r/809534

articlequality pods up and running! The swift credentials are working as expected.

Next steps: add some more pods as well (editquality, etc.).

elukey@ml-serve-ctrl1001:~$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict" -X POST -d @input.json -i -H "Host: enwiki-articlequality.revscoring-articlequality.wikimedia.org" --http1.1
HTTP/1.1 200 OK
content-length: 225
content-type: application/json; charset=UTF-8
date: Tue, 05 Jul 2022 08:53:20 GMT
server: istio-envoy
x-envoy-upstream-service-time: 317

{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}
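For scripted checks, the same request can be assembled in Python. This is a minimal sketch, not the team's tooling: the function name and the `namespace` parameter are hypothetical, and the Host header pattern `<model>.<namespace>.wikimedia.org` is inferred from the curl calls in this task. The request body (the contents of input.json) is not shown in this task, so it is left as a parameter. Nothing is sent over the network here.

```python
import urllib.request

# Endpoint and port copied from the curl call above.
ENDPOINT = "https://inference-staging.svc.codfw.wmnet:30443"


def build_predict_request(model: str, namespace: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST that the curl command issues.

    The Host header is what routes the request to the right inference
    service behind the shared endpoint; its pattern is an assumption
    inferred from the curl calls in this task.
    """
    return urllib.request.Request(
        url=f"{ENDPOINT}/v1/models/{model}:predict",
        data=payload,
        headers={"Host": f"{model}.{namespace}.wikimedia.org"},
        method="POST",
    )


# Placeholder body: the real input.json contents are not part of this task.
request = build_predict_request("enwiki-articlequality", "revscoring-articlequality", b"{}")
print(request.get_method(), request.full_url)
print("Host:", request.get_header("Host"))
```

Passing the resulting object to `urllib.request.urlopen` would send it, which only works from a host that can reach the staging endpoint.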

Change 811304 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services: add single draftquality inference service to staging

https://gerrit.wikimedia.org/r/811304

Change 811304 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add single draftquality inference service to staging

https://gerrit.wikimedia.org/r/811304

Now also running draftquality for enwiki:

ml-serve-ctrl1001 ~ $ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-draftquality:predict" -X POST -d @input.json -i -H "Host: enwiki-draftquality.revscoring-draftquality.wikimedia.org" --http1.1;echo
HTTP/1.1 200 OK
content-length: 174
content-type: application/json; charset=UTF-8
date: Tue, 05 Jul 2022 13:37:02 GMT
server: istio-envoy
x-envoy-upstream-service-time: 230

{"predictions": {"prediction": "OK", "probability": {"OK": 0.6755321636163237, "attack": 0.048572863418591725, "spam": 0.13317597668241205, "vandalism": 0.1427189962826724}}}
ml-serve-ctrl1001 ~ $
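A quick sanity check on responses like the one above can catch a misbehaving model server early. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the "prediction" field is the class with the highest probability and that the probabilities sum to 1, which holds for the responses shown in this task but is not stated anywhere as a guarantee.

```python
import json
import math

# Response body copied from the draftquality call above.
DRAFTQUALITY_BODY = (
    '{"predictions": {"prediction": "OK", "probability": '
    '{"OK": 0.6755321636163237, "attack": 0.048572863418591725, '
    '"spam": 0.13317597668241205, "vandalism": 0.1427189962826724}}}'
)


def looks_consistent(body: str, tolerance: float = 1e-6) -> bool:
    """Check that probabilities sum to ~1 and the prediction is the top class."""
    result = json.loads(body)["predictions"]
    probabilities = result["probability"]
    sums_to_one = math.isclose(sum(probabilities.values()), 1.0, abs_tol=tolerance)
    top_class = max(probabilities, key=probabilities.get)
    return sums_to_one and result["prediction"] == top_class


print(looks_consistent(DRAFTQUALITY_BODY))
```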

Change 811313 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services: add some more revscoring services to staging

https://gerrit.wikimedia.org/r/811313

Change 811313 merged by Klausman:

[operations/deployment-charts@master] ml-services: add some more revscoring services to staging

https://gerrit.wikimedia.org/r/811313