Create the ml-serve-staging k8s cluster
Closed, Resolved (Public)

Assigned To: klausman
Authored By: elukey, Feb 21 2022, 8:54 AM

Description

In T294946, DCops racked and configured the ml-staging200[12] nodes. We should now do the following:

Reimage both nodes with Bullseye (with overlay partitions, etc.)
Create ml-serve-staging-etcd200[1-3] VMs and the related etcd cluster
Create ml-serve-staging-ctrl200[1-2] VMs (control plane nodes)
Allocate network resources.
Bootstrap the ml-serve-staging k8s cluster
Add the inference-staging.svc.codfw.wmnet endpoint (or a discovery one if that makes sense; probably yes, for consistency).

The above plan is very high level and will surely require more work. More details are in https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New
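
The bracketed host names in the description (ml-staging200[12], ml-serve-staging-etcd200[1-3], ml-serve-staging-ctrl200[1-2]) are shorthand for several hosts each. A minimal Python sketch of how they expand, assuming the brackets are read like a shell character class (single digits and single-digit ranges). The helper name expand_hosts is hypothetical and is not part of any WMF tooling:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group, e.g. 'ml-staging200[12]' -> two host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([0-9-]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern already names a single host.
        return [pattern]
    prefix, body, suffix = match.groups()
    digits: list[str] = []
    # Each token is either a range such as '1-3' or a single digit such as '1'.
    for start, end, single in re.findall(r"(\d)-(\d)|(\d)", body):
        if single:
            digits.append(single)
        else:
            digits.extend(str(d) for d in range(int(start), int(end) + 1))
    return [f"{prefix}{d}{suffix}" for d in digits]


for pattern in (
    "ml-staging200[12]",
    "ml-serve-staging-etcd200[1-3]",
    "ml-serve-staging-ctrl200[1-2]",
):
    print(pattern, "->", ", ".join(expand_hosts(pattern)))
```

Under that reading, the plan covers two worker nodes, three etcd VMs, and two control plane VMs.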

Details

Subject	Repo	Branch	Lines +/-
ml-services: add some more revscoring services to staging	operations/deployment-charts	master	+36 -0
ml-services: add single draftquality inference service to staging	operations/deployment-charts	master	+9 -0
role::ml_k8s::worker::staging: add calico-cni config	operations/puppet	production	+29 -0
ml-staging: Add inference services for testing	operations/deployment-charts	master	+13 -0
Add ml-staging-codfw among the helmfile envs to test	operations/deployment-charts	master	+1 -1
pki: Add ML staging k8s to list of CAs	operations/puppet	production	+2 -1
Add dummy secrets for ML staging k8s CA	labs/private	master	+3 -0
net: Add network config setup for ML staging k8s	operations/puppet	production	+15 -0
hiera: Switch ML staging inference endpoint to lvs_setup	operations/puppet	production	+3 -1
service::catalog: Add inference-staging service	operations/puppet	production	+32 -1
Add inference-staging service IP (10.2.1.58)	operations/dns	master	+4 -0
ml-staging-codfw: Add override for cert names	operations/deployment-charts	master	+5 -0
role::prometheus: enable settings for k8s ml-staging	operations/puppet	production	+6 -7
profile::kubernetes: remove ml-staging specific bits	labs/private	master	+0 -54
admin_ng: set cfssl-issuer's values for ml-serve clusters	operations/deployment-charts	master	+10 -5
role::pki::multirootca: add settings for the ml-staging cluster	operations/puppet	production	+12 -0
Set cluster group to ml-serve for ml-staging control plane nodes	operations/puppet	production	+1 -1
Move the ml-staging cluster under ml-serve's definition	operations/puppet	production	+0 -1
admin_ng: Add config for ML staging k8s in codfw	operations/deployment-charts	master	+71 -0
hiera: Switch ML staging control plane to lvs_setup	operations/puppet	production	+1 -1
Add service IP for ML staging k8s ctrl plane	operations/dns	master	+1 -0
hiera: Add ML staging k8s ctrl node config	operations/puppet	production	+238 -1
hiera/modules: Add config for ML staging k8s workers	operations/puppet	production	+84 -1
hiera: Add ML staging k8s role	operations/puppet	production	+159 -1
labs: Add dummy keyfile for ML staging k8s in codfw	labs/private	master	+0 -0
hiera: Add k8s dummy tokens for ML staging env	labs/private	master	+52 -0

Related Objects

Status	Assigned	Task
Resolved	None	T272917 Lift Wing proof of concept
Resolved	klausman	T302195 Create the ml-serve-staging k8s cluster
Resolved	klausman	T302197 Create etcd cluster for ml-serve-staging k8s
Resolved	klausman	T302503 New VMs for ML staging cluster in codfw
Resolved	klausman	T302198 Create ml-serve-staging k8s's control plane VMs
Duplicate	klausman	T302504 New control plane VMs for ML staging cluster in codfw

Event Timeline

elukey created this task. Feb 21 2022, 8:54 AM

ml-staging200x nodes reimaged with bullseye!

elukey assigned this task to klausman. Mar 2 2022, 8:22 AM

elukey triaged this task as Medium priority.

klausman added a subtask: T302504: New control plane VMs for ML staging cluster in codfw. Mar 2 2022, 10:18 AM

klausman closed subtask T302504: New control plane VMs for ML staging cluster in codfw as Resolved. Mar 2 2022, 11:19 AM

klausman closed subtask T302198: Create ml-serve-staging k8s's control plane VMs as Resolved.

klausman closed subtask T302197: Create etcd cluster for ml-serve-staging k8s as Resolved. Mar 15 2022, 5:54 PM

Change 772417 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Add ML staging k8s ctrl node config

https://gerrit.wikimedia.org/r/772417

gerritbot added a project: Patch-For-Review. Mar 21 2022, 3:03 PM

Change 772866 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera: Add k8s dummy tokens for ML staging env

https://gerrit.wikimedia.org/r/772866

Change 772866 merged by Klausman:

[labs/private@master] hiera: Add k8s dummy tokens for ML staging env

https://gerrit.wikimedia.org/r/772866

klausman mentioned this in rLPRIbd2fb2724109: hiera: Add k8s dummy tokens for ML staging env. Mar 22 2022, 3:51 PM

Change 772871 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] labs: Add dummy keyfile for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/772871

Change 772871 merged by Klausman:

[labs/private@master] labs: Add dummy keyfile for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/772871

klausman mentioned this in rLPRI899d25e97d8d: labs: Add dummy keyfile for ML staging k8s in codfw. Mar 22 2022, 4:06 PM

Change 774488 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Add ML staging k8s role

https://gerrit.wikimedia.org/r/774488

Change 774488 merged by Klausman:

[operations/puppet@production] hiera: Add ML staging k8s role

https://gerrit.wikimedia.org/r/774488

Change 775860 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera/modules: Add config for ML staging k8s workers

https://gerrit.wikimedia.org/r/775860

Change 775860 merged by Klausman:

[operations/puppet@production] hiera/modules: Add config for ML staging k8s workers

https://gerrit.wikimedia.org/r/775860

Change 772417 abandoned by Klausman:

[operations/puppet@production] hiera: Add ML staging k8s ctrl node config

Reason:

All covered in smaller CLs

https://gerrit.wikimedia.org/r/772417

Maintenance_bot removed a project: Patch-For-Review. Mar 31 2022, 4:11 PM

elukey moved this task from Parked to In Progress on the Machine-Learning-Team (Active Tasks) board. Apr 26 2022, 7:34 AM

Change 786319 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] Switch ML staging control plane to lvs_setup

https://gerrit.wikimedia.org/r/786319

gerritbot added a project: Patch-For-Review. Apr 26 2022, 2:04 PM

Change 786320 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/dns@master] Add service IP for ML staging k8s ctrl plane

https://gerrit.wikimedia.org/r/786320

Change 786320 merged by Klausman:

[operations/dns@master] Add service IP for ML staging k8s ctrl plane

https://gerrit.wikimedia.org/r/786320

Change 786319 merged by Klausman:

[operations/puppet@production] hiera: Switch ML staging control plane to lvs_setup

https://gerrit.wikimedia.org/r/786319

Change 786808 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] admin_ng: Add config for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/786808

Change 786808 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add config for ML staging k8s in codfw

https://gerrit.wikimedia.org/r/786808

Mentioned in SAL (#wikimedia-operations) [2022-05-31T07:27:39Z] <elukey> add profile k8s_mlstaging + authkey for ml-staging k8s - T302195

Change 801662 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move the ml-staging cluster under ml-serve's definition

https://gerrit.wikimedia.org/r/801662

Change 801662 merged by Elukey:

[operations/puppet@production] Move the ml-staging cluster under ml-serve's definition

https://gerrit.wikimedia.org/r/801662

Change 801742 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set cluster group to ml-serve for ml-staging control plane nodes

https://gerrit.wikimedia.org/r/801742

Change 801742 merged by Elukey:

[operations/puppet@production] Set cluster group to ml-serve for ml-staging control plane nodes

https://gerrit.wikimedia.org/r/801742

Change 801744 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::multirootca: add settings for the ml-staging cluster

https://gerrit.wikimedia.org/r/801744

Change 801766 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set cfssl-issuer's values for ml-serve clusters

https://gerrit.wikimedia.org/r/801766

Change 801744 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add settings for the ml-staging cluster

https://gerrit.wikimedia.org/r/801744

Change 801766 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set cfssl-issuer's values for ml-serve clusters

https://gerrit.wikimedia.org/r/801766

Change 802159 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] profile::kubernetes: remove ml-staging specific bits

https://gerrit.wikimedia.org/r/802159

We moved the ml-staging configs under the ml-serve umbrella, since the staging cluster will mostly be used to test model serving.

elukey@deploy1002:~$ ls /etc/helmfile-defaults/private/ml-serve_services/cfssl-issuer/
ml-serve-codfw.yaml  ml-serve-eqiad.yaml  ml-staging-codfw.yaml

The helmfile-specific configs are now populated as well.

Next steps:

get a final review of https://gerrit.wikimedia.org/r/c/operations/homer/public/+/802072
merge and proceed with https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Networking

Change 802159 merged by Elukey:

[labs/private@master] profile::kubernetes: remove ml-staging specific bits

https://gerrit.wikimedia.org/r/802159

elukey mentioned this in rLPRI12e7c00f2a31: profile::kubernetes: remove ml-staging specific bits. Jun 1 2022, 4:30 PM

Change 803295 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::prometheus: enable settings for k8s ml-staging

https://gerrit.wikimedia.org/r/803295

Change 803295 merged by Elukey:

[operations/puppet@production] role::prometheus: enable settings for k8s ml-staging

https://gerrit.wikimedia.org/r/803295

Completed the basic networking work (calico, eventrouter, coredns) + BGP config.

Next step: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Istio

Some notes:

We should create an LVS endpoint called inference-staging.svc.codfw.wmnet to support this cluster
Let's pay attention to what we have to configure for cert-manager to issue the certificate for that endpoint. With the standard ml-serve settings, I am afraid that it will try to retrieve the inference.svc TLS cert from cfssl, which is probably not what we want. We should check what the serviceops team did and try to follow it.

The Istio config and (most of) the cert-manager config have been applied. For cert-manager, I need to sync up with Luca about the part of that config that refers to the ml-serve endpoints.

Change 805127 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-staging-codfw: Add override for cert names

https://gerrit.wikimedia.org/r/805127

Change 805127 merged by jenkins-bot:

[operations/deployment-charts@master] ml-staging-codfw: Add override for cert names

https://gerrit.wikimedia.org/r/805127

Change 805135 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/dns@master] Add inference-staging service IP (10.2.1.58)

https://gerrit.wikimedia.org/r/805135

Change 805135 merged by Klausman:

[operations/dns@master] Add inference-staging service IP (10.2.1.58)

https://gerrit.wikimedia.org/r/805135

Change 805329 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] service::catalog: Add inference-staging service

https://gerrit.wikimedia.org/r/805329

Change 805329 merged by Elukey:

[operations/puppet@production] service::catalog: Add inference-staging service

https://gerrit.wikimedia.org/r/805329

Change 807096 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] net: Add network config setup for ML staging k8s

https://gerrit.wikimedia.org/r/807096

Change 807133 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Switch ML staging inference endpoint to lvs_setup

https://gerrit.wikimedia.org/r/807133

Change 807133 merged by Klausman:

[operations/puppet@production] hiera: Switch ML staging inference endpoint to lvs_setup

https://gerrit.wikimedia.org/r/807133

Change 807096 merged by Klausman:

[operations/puppet@production] net: Add network config setup for ML staging k8s

https://gerrit.wikimedia.org/r/807096

Change 807502 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] pki: Add ML staging k8s to list of CAs

https://gerrit.wikimedia.org/r/807502

Change 807520 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] Add dummy secrets for ML staging k8s CA

https://gerrit.wikimedia.org/r/807520

Change 807520 merged by Klausman:

[labs/private@master] Add dummy secrets for ML staging k8s CA

https://gerrit.wikimedia.org/r/807520

klausman mentioned this in rLPRI45bed6f9e285: Add dummy secrets for ML staging k8s CA. Jun 22 2022, 12:28 PM

Change 807502 merged by Klausman:

[operations/puppet@production] pki: Add ML staging k8s to list of CAs

https://gerrit.wikimedia.org/r/807502

Additional things done:

Firewall rules so Staging k8s can talk to the WMF PKI machines
PKI setup so the new cluster can be its own CA
Deployed knative
Deployed kserve

Still needs to be done:

Prometheus
Deploy a model and test if it works

Prometheus is now correctly set up with its own volumes (we hadn't done that yet), and I managed to save the old data.

Change 809146 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-staging: Add inference services for testing

https://gerrit.wikimedia.org/r/809146

Change 809149 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add ml-staging-codfw among the helmfile envs to test

https://gerrit.wikimedia.org/r/809149

Change 809149 merged by Elukey:

[operations/deployment-charts@master] Add ml-staging-codfw among the helmfile envs to test

https://gerrit.wikimedia.org/r/809149

Change 809146 merged by Klausman:

[operations/deployment-charts@master] ml-staging: Add inference services for testing

https://gerrit.wikimedia.org/r/809146

Change 809534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::worker::staging: add calico-cni config

https://gerrit.wikimedia.org/r/809534

Change 809534 merged by Elukey:

[operations/puppet@production] role::ml_k8s::worker::staging: add calico-cni config

https://gerrit.wikimedia.org/r/809534

articlequality pods up and running! The swift credentials are working as expected.

Next steps: add pods for editquality and the other model types as well.

elukey@ml-serve-ctrl1001:~$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict" -X POST -d @input.json -i -H "Host: enwiki-articlequality.revscoring-articlequality.wikimedia.org" --http1.1
HTTP/1.1 200 OK
content-length: 225
content-type: application/json; charset=UTF-8
date: Tue, 05 Jul 2022 08:53:20 GMT
server: istio-envoy
x-envoy-upstream-service-time: 317

{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}

Change 811304 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services: add single draftquality inference service to staging

https://gerrit.wikimedia.org/r/811304

Change 811304 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add single draftquality inference service to staging

https://gerrit.wikimedia.org/r/811304

Now also running draftquality for enwiki:

ml-serve-ctrl1001 ~ $ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-draftquality:predict" -X POST -d @input.json -i -H "Host: enwiki-draftquality.revscoring-draftquality.wikimedia.org" --http1.1;echo
HTTP/1.1 200 OK
content-length: 174
content-type: application/json; charset=UTF-8
date: Tue, 05 Jul 2022 13:37:02 GMT
server: istio-envoy
x-envoy-upstream-service-time: 230

{"predictions": {"prediction": "OK", "probability": {"OK": 0.6755321636163237, "attack": 0.048572863418591725, "spam": 0.13317597668241205, "vandalism": 0.1427189962826724}}}
ml-serve-ctrl1001 ~ $
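
As a quick sanity check on a response like the one above, the body can be parsed to confirm that the reported prediction is the most probable class and that the probabilities add up to roughly 1. A minimal Python sketch, using the draftquality response body verbatim; the function name summarize is hypothetical:

```python
import json


def summarize(body: str) -> tuple[str, float, float]:
    """Return (top label, its probability, sum of all probabilities)."""
    predictions = json.loads(body)["predictions"]
    probabilities = predictions["probability"]
    top_label = max(probabilities, key=probabilities.get)
    return top_label, probabilities[top_label], sum(probabilities.values())


# Response body copied from the draftquality curl output above.
draftquality_body = (
    '{"predictions": {"prediction": "OK", "probability": '
    '{"OK": 0.6755321636163237, "attack": 0.048572863418591725, '
    '"spam": 0.13317597668241205, "vandalism": 0.1427189962826724}}}'
)

label, probability, total = summarize(draftquality_body)
print(f"top class: {label} ({probability:.3f}), probabilities sum to {total:.6f}")
```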

Change 811313 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services: add some more revscoring services to staging

https://gerrit.wikimedia.org/r/811313

Change 811313 merged by Klausman:

[operations/deployment-charts@master] ml-services: add some more revscoring services to staging

https://gerrit.wikimedia.org/r/811313

calbon closed this task as Resolved. Jul 19 2022, 2:36 PM

calbon moved this task from In Progress to Completed on the Machine-Learning-Team (Active Tasks) board.
