
Create ml-serve k8s cluster
Open, High, Public

Description

We need to create a k8s cluster on the ml-serve1xxx boxes for the Lift Wing proof of concept.

Note: k8s must be v1.16-1.18; KFServing does not work on v1.19.

Details

Project                        Branch       Lines +/-   Subject
operations/puppet              production   +1 -0
operations/puppet              production   +13 -0
operations/puppet              production   +2 -2
operations/puppet              production   +10 -0
operations/puppet              production   +2 -8
operations/homer/public        master       +92 -0
operations/puppet              production   +5 -1
operations/puppet              production   +3 -1
operations/deployment-charts   master       +19 -0
labs/private                   master       +6 -0
operations/puppet              production   +100 -14
operations/puppet              production   +1 -1
labs/private                   master       +0 -0
operations/puppet              production   +32 -32
operations/puppet              production   +1 -1
operations/puppet              production   +7 -0
operations/puppet              production   +25 -0
operations/dns                 master       +2 -1
operations/puppet              production   +1 -1
operations/puppet              production   +1 -5
operations/puppet              production   +36 -36
operations/puppet              production   +99 -1
labs/private                   master       +0 -0
operations/puppet              production   +1 -46

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2021-02-24T15:01:51Z] <klausman@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve[1001-1004].eqiad.wmnet with reason: Reimaging for T272918

Icinga downtime set by klausman@cumin1001 for 2:00:00 4 host(s) and their services with reason: Reimaging for T272918

ml-serve[1001-1004].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-02-24T15:01:58Z] <klausman@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve[1001-1004].eqiad.wmnet with reason: Reimaging for T272918

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241504_klausman_30714.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241520_klausman_28826.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241521_klausman_32638.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241528_klausman_28643.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241528_klausman_28329.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241537_klausman_31980.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['ml-serve1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102241537_klausman_31843.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

['ml-serve2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102250912_klausman_26655.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

['ml-serve2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26916.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

['ml-serve2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26972.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

['ml-serve2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26998.log.

Completed auto-reimage of hosts:

['ml-serve2001.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ml-serve2004.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ml-serve2003.codfw.wmnet']

and were ALL successful.

Change 668723 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] ml-ctrl: Add dummy keys for ML k8s control plane

https://gerrit.wikimedia.org/r/668723

Change 668723 merged by Klausman:
[labs/private@master] ml-ctrl: Add dummy keys for ML k8s control plane

https://gerrit.wikimedia.org/r/668723

Change 668075 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] modules/roles: Add k8s config for ML team machines

https://gerrit.wikimedia.org/r/668075

Change 668075 merged by Klausman:
[operations/puppet@production] modules/roles: Add k8s config for ML team machines

https://gerrit.wikimedia.org/r/668075

Change 670196 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] files/ssl/: Update ML k8s certs

https://gerrit.wikimedia.org/r/670196

Change 670196 merged by Klausman:
[operations/puppet@production] files/ssl/: Update ML k8s certs

https://gerrit.wikimedia.org/r/670196

Change 670198 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] manifest: Move ml-serve1002 insetup -> ml_k8s::master

https://gerrit.wikimedia.org/r/670198

Change 670198 merged by Klausman:
[operations/puppet@production] manifest: Move ml-serve1002 insetup -> ml_k8s::master

https://gerrit.wikimedia.org/r/670198

Change 670214 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] manifest: Mov ML k8s machines in codfw to prod

https://gerrit.wikimedia.org/r/670214

Change 670214 merged by Klausman:
[operations/puppet@production] manifest: Mov ML k8s machines in codfw to prod

https://gerrit.wikimedia.org/r/670214

@klausman I just acked some Prometheus alerts with NaN in Icinga related to the ml-serve-ctrl nodes (the Puppet profiles enable them automatically). When we have the k8s nodes set up, let's remember to:

  • add prometheus k8s cluster tokens in puppet private's hiera (see profile::prometheus::kubernetes::cluster_tokens entries for other clusters)
  • add prometheus k8s cluster details in puppet public's hiera (see profile::prometheus::kubernetes::clusters entries for other clusters)

The profile that drives the k8s Prometheus master config is profile::prometheus::k8s.
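
For illustration, here is a rough sketch of the two hiera entries described above. The exact key structure should be copied from the existing clusters' entries; every key name, endpoint and value below is a placeholder, not the real configuration:

# Public hiera (operations/puppet) -- structure is illustrative only
profile::prometheus::kubernetes::clusters:
  k8s-mlserve:                                        # hypothetical cluster key
    api_server: https://ml-ctrl.svc.eqiad.wmnet:6443  # placeholder endpoint
# Private hiera (real token lives in the private repo, dummy in labs/private)
profile::prometheus::kubernetes::cluster_tokens:
  k8s-mlserve: dummy-prometheus-token                 # placeholder value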

elukey triaged this task as High priority.

Change 670444 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] service catalog: Add entry for ML Team k8s control plane

https://gerrit.wikimedia.org/r/670444

Change 670797 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/dns@master] Add DNS RR for ML team k8s control plane

https://gerrit.wikimedia.org/r/670797

Change 670797 merged by Klausman:
[operations/dns@master] Add DNS RR for ML team k8s control plane

https://gerrit.wikimedia.org/r/670797

Change 670444 merged by Klausman:
[operations/puppet@production] service catalog: Add entry for ML Team k8s control plane

https://gerrit.wikimedia.org/r/670444

Change 670816 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: Add LVS realserver config for ML k8s

https://gerrit.wikimedia.org/r/670816

Change 670816 merged by Klausman:
[operations/puppet@production] hiera: Add LVS realserver config for ML k8s

https://gerrit.wikimedia.org/r/670816

Change 670818 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: Switch ml-ctrl service to lvs_setup

https://gerrit.wikimedia.org/r/670818

Change 670818 merged by Klausman:
[operations/puppet@production] hiera: Switch ml-ctrl service to lvs_setup

https://gerrit.wikimedia.org/r/670818

Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:49:29Z] <klausman> restarting pybal on lvs2010 T272918

Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:50:15Z] <klausman> restarting pybal on lvs1016 T272918

Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:55:38Z] <klausman> restarting pybal on lvs2009 T272918

Mentioned in SAL (#wikimedia-operations) [2021-03-11T15:02:47Z] <klausman> restarting pybal on lvs1015 T272918

Change 670835 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] ssl: update ml-ctrl certs (fixed altname)

https://gerrit.wikimedia.org/r/670835

Change 670835 merged by Klausman:
[operations/puppet@production] ssl: update ml-ctrl certs (fixed altname)

https://gerrit.wikimedia.org/r/670835

Change 672402 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera/modules: Add role for ML k8s workers

https://gerrit.wikimedia.org/r/672402

Change 672455 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] hiera: add dummy secrets for ML k8s workers

https://gerrit.wikimedia.org/r/672455

Change 672455 merged by Klausman:
[labs/private@master] hiera: add dummy secrets for ML k8s workers

https://gerrit.wikimedia.org/r/672455

Change 672457 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] hiera: move ML k8s worker secrets into the correct location

https://gerrit.wikimedia.org/r/672457

Change 672457 merged by Klausman:
[labs/private@master] hiera: move ML k8s worker secrets into the correct location

https://gerrit.wikimedia.org/r/672457

Some details about what we'll have to deploy on top of the new Kubernetes cluster.

One thing is worth noting in my opinion:

If you want to get up running Knative quickly or you do not need service mesh, we recommend installing Istio without service mesh(sidecar injection).

Do we need service mesh for serving?

https://istio.io/latest/docs/ops/deployment/architecture/ - gives a very high-level view of Istio and how it works

https://istio.io/latest/docs/ops/deployment/deployment-models/ - explains deployment models in depth, especially in relation to the number of clusters, networks, etc.

After reading the last document (really nice and well done), it feels as if we are using Istio (a Ferrari) to go grocery shopping (something that could be done with a simple Fiat Panda :D). Jokes aside, knowing what KFServing really needs from Istio will probably shorten the bootstrap time a lot. I see the following in the docs:

KFServing currently depends on Istio Ingress Gateway to route requests to inference services.

@ACraze I recall that we discussed Istio during one of your presentations, but I don't recall if the service mesh was needed for Train Wing (so the Kubeflow pipelines etc.) or also for Lift Wing. Can you shed some light on this topic? :)
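
To make that Istio Ingress Gateway dependency concrete, a minimal InferenceService of the kind the KFServing docs use looks roughly like this (v1alpha2 API; the name and storageUri are the upstream sample values, purely illustrative):

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: sklearn-iris                 # upstream sample name, illustrative only
spec:
  default:
    predictor:
      sklearn:
        storageUri: "gs://kfserving-samples/models/sklearn/iris"

External requests reach the predictor through the Istio ingress gateway (addressed via the Host header KFServing assigns to the service), which is why the ingress gateway is a hard dependency even when no sidecars are injected.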

it feels as if we are using Istio (a Ferrari) to go grocery shopping (something that could be done with a simple Fiat Panda :D)

@elukey nice analogy :) It will likely feel like this as we get started, although I can see a service mesh becoming extremely helpful further down the road.

Technically we don't need Istio for serving, although the burden of discovery, traceability, and allowing microservices to communicate securely with each other would then be on us.

In many ways, each model will be a microservice, and each will also need pre- and/or post-processing, which would be yet another microservice. If we want to A/B test different versions of models, deploy a shadow challenger, or use tools like Explainer or bias/fairness testing, these would all be additional microservices that need to communicate with the other models in our cluster in a secure manner. I've had to manage microservices by hand in the past and it quickly became unsustainable, which is why I'm in favor of using a service mesh to handle these cross-cutting concerns.

I'm not 100% sure whether a service mesh will be needed for Train Wing yet; it is unclear if Istio is a hard dependency for KF Pipelines, although Pipelines does use Argo to orchestrate workflows, which has a mechanism for sidecar injection using a service mesh.
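
As a sketch of the A/B-testing scenario Andy mentions above, KFServing's v1alpha2 API lets a single InferenceService split traffic between a default and a canary predictor; the name and storage URIs below are hypothetical:

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: enwiki-damaging              # hypothetical model name
spec:
  default:
    predictor:
      sklearn:
        storageUri: "s3://models/enwiki-damaging/v1"   # placeholder
  canary:
    predictor:
      sklearn:
        storageUri: "s3://models/enwiki-damaging/v2"   # placeholder
  canaryTrafficPercent: 10           # send 10% of traffic to the challenger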

@ACraze thanks for the explanation; it seems we have good motivation to set up Istio with service-mesh support. Medium to long term it will be needed, so it makes sense to proceed now :)

FWIW, ServiceOps decided against using full mesh networking for our services because we considered Istio to be both very complex and not really needed for our level of complexity.

Before we use it in production, we will need to study its failure modes carefully, and how/what to monitor. I would in general suggest starting with the simple solution (where "simple" is Knative + Istio without service mesh, which is already quite complex) and moving to full mesh once we feel the need.

FWIW, ServiceOps decided against using full mesh networking for our services because we considered Istio to be both very complex and not really needed for our level of complexity.

Before we use it in production, we will need to study its failure modes carefully, and how/what to monitor. I would in general suggest starting with the simple solution (where "simple" is Knative + Istio without service mesh, which is already quite complex) and moving to full mesh once we feel the need.

Yep, makes sense. My only doubt right now is whether Kubeflow's KFServing relies heavily on Istio being configured as a service mesh or not. For example, in the case of a simple model that needs pre/post-processing, we'll have 2+ microservices communicating with each other (as Andy mentioned), and I have no idea if Kubeflow relies entirely on Istio to make the communication between those microservices happen, or if there is another (more manual) way. If Kubeflow is not flexible enough we might need to move to a service mesh ASAP, otherwise we won't be able to run basic models. I agree though that we could go step by step and check the non-mesh solution first; I was just trying to understand the medium/long-term use case :)
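
For reference, a rough sketch of what the "Istio without service mesh" starting point could look like with the IstioOperator API: just the control plane and the ingress gateway that KFServing needs, with no namespaces opted into sidecar injection. This is illustrative, not the configuration we ended up deploying:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ml-serve-istio               # hypothetical name
spec:
  profile: default                   # istiod plus the default ingress gateway
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
    egressGateways:
      - name: istio-egressgateway
        enabled: false
# Sidecar injection stays opt-in per namespace (istio-injection=enabled label),
# so leaving the ML namespaces unlabeled keeps workloads out of the mesh.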

Change 672402 merged by Klausman:
[operations/puppet@production] hiera/modules: Add role for ML k8s workers

https://gerrit.wikimedia.org/r/672402

Change 673044 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: change docker package for ML worker nodes to docker.io

https://gerrit.wikimedia.org/r/673044

Change 673044 merged by Klausman:
[operations/puppet@production] hiera: change docker package for ML worker nodes to docker.io

https://gerrit.wikimedia.org/r/673044

Change 673227 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/deployment-charts@master] helm: Make ML k8s clusters visible to helm

https://gerrit.wikimedia.org/r/673227

Change 673452 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cumin: fix ml-serve aliases and add new ones

https://gerrit.wikimedia.org/r/673452

Change 673227 merged by jenkins-bot:
[operations/deployment-charts@master] helm: Make ML k8s clusters visible to helm

https://gerrit.wikimedia.org/r/673227

Change 673452 merged by Elukey:
[operations/puppet@production] cumin: fix ml-serve aliases and add new ones

https://gerrit.wikimedia.org/r/673452

Change 673457 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cumin: fix ml-serve alias and add newer ones

https://gerrit.wikimedia.org/r/673457

Change 673457 merged by Elukey:
[operations/puppet@production] cumin: fix ml-serve alias and add newer ones

https://gerrit.wikimedia.org/r/673457

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['ml-serve2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103191100_elukey_4442.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

['ml-serve2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103191217_klausman_19980.log.

Completed auto-reimage of hosts:

['ml-serve2002.codfw.wmnet']

and were ALL successful.

All worker nodes are now up and visible in both DCs:

ml-serve-ctrl1001:~$ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve1001.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.0.41     <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1002.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.16.183   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1003.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.32.81    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1004.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.48.50    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve-ctrl2001 ~ $ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve2001.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.0.21    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2002.codfw.wmnet   Ready    <none>   9m57s   v1.16.15   10.192.16.43   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2003.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.32.29   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2004.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.48.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1

Change 661055 merged by Elukey:
[operations/homer/public@master] Add BGP configuration for the new ML Serve eqiad/codfw clusters

https://gerrit.wikimedia.org/r/661055

Change 673985 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable coredns for k8s ml-serve clusters

https://gerrit.wikimedia.org/r/673985

Change 673985 merged by Elukey:
[operations/puppet@production] Enable coredns for k8s ml-serve clusters

https://gerrit.wikimedia.org/r/673985

Mentioned in SAL (#wikimedia-operations) [2021-03-23T07:36:37Z] <elukey> create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918

Change 674258 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] prometheus: add the ml-serve clusters settings

https://gerrit.wikimedia.org/r/674258

Change 674258 merged by Elukey:
[operations/puppet@production] prometheus: add the ml-serve clusters settings

https://gerrit.wikimedia.org/r/674258

Change 674273 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] prometheus: change port for k8s-mlserve clusters

https://gerrit.wikimedia.org/r/674273

Change 674273 merged by Elukey:
[operations/puppet@production] prometheus: change port for k8s-mlserve clusters

https://gerrit.wikimedia.org/r/674273

Change 674279 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add config for prometheus@k8s-mlserve

https://gerrit.wikimedia.org/r/674279

Change 674279 merged by Elukey:
[operations/puppet@production] Add config for prometheus@k8s-mlserve

https://gerrit.wikimedia.org/r/674279

Change 674313 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::ml_k8s::master: add prometheus instance endpoint

https://gerrit.wikimedia.org/r/674313

Change 674313 merged by Elukey:
[operations/puppet@production] role::ml_k8s::master: add prometheus instance endpoint

https://gerrit.wikimedia.org/r/674313

The two clusters are up and running, together with Prometheus monitoring. I created two subtasks to deal with the namespace override problem and the shared user credentials; I'll follow up with SRE over the next few days.

Moritz told me today that two ml-etcd2xxx nodes are on the same Ganeti host; this is my bad since I created the cluster (so there is no real three-row redundancy). I'll destroy and re-create ml-etcd2001 on a host in a different row tomorrow to fix the problem.

Fixed the ml-etcd200x row issue (re-created the VM and updated the etcd cluster).

The remaining things to work on are the two subtasks mentioned above (namespace override and shared user credentials). In theory they are not blocking us from testing Istio and Knative, but I'll leave the task open :)

Quick note about Prometheus metrics: the SRE team is going to review/create the Kubernetes dashboards shortly, so I didn't spend too much time adding our use case to them for the moment. I verified that some metrics were showing up and decided to do another pass after the SRE team's work :)

Reporting a chat I had with Alex and Janis on IRC. I wasn't able to create a simple deployment in the default namespace from ml-serve-ctrl1001; the pods were not coming up because they were hitting PSP policies (https://kubernetes.io/docs/concepts/policy/pod-security-policy). Janis gave me https://phabricator.wikimedia.org/P15078, which with s/jayme/elukey/ worked nicely (namely, kubectl apply -f blabla.yaml ended up creating the elukey namespace with a tiller pod running in it).
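
For context, this is roughly the shape of the PodSecurityPolicy/RBAC pairing that gates pod admission on these clusters. It only illustrates the mechanism (it is not the content of P15078) and all names are made up:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-example           # hypothetical name
spec:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes: ["configMap", "secret", "emptyDir"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-example       # hypothetical name
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["restricted-example"]
    verbs: ["use"]

A pod is only admitted if the service account (or user) creating it is allowed to "use" at least one PodSecurityPolicy through such a role binding; without that, deployments fail in the way described above.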