We need to create a k8s cluster on the ml-serve1xxx boxes for the Lift Wing proof of concept.
Note: k8s must be v1.16-1.18; KFServing does not work on v1.19
Status | Assigned | Task
---|---|---
Resolved | None | T272917 Lift Wing proof of concept
Resolved | klausman | T272918 Create ml-serve k8s cluster
Resolved | klausman | T273071 Create etcd VMs for use with ML platform
Resolved | elukey | T275630 eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster
Resolved | elukey | T278208 Allow namespaces to be overriden in deployment-chart's admin_ng
Resolved | elukey | T278224 Allow k8s clusters to have their own k8s_infrastructure_users in puppet
Resolved | elukey | T278238 Recreate ml-etcd2002 in a different row
Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:
['ml-serve1002.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102241528_klausman_28643.log.
Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:
['ml-serve1001.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102241528_klausman_28329.log.
Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:
['ml-serve1002.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102241537_klausman_31980.log.
Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:
['ml-serve1001.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102241537_klausman_31843.log.
Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:
['ml-serve2001.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102250912_klausman_26655.log.
Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:
['ml-serve2004.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26916.log.
Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:
['ml-serve2003.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26972.log.
Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:
['ml-serve2002.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202102250913_klausman_26998.log.
Completed auto-reimage of hosts:
['ml-serve2001.codfw.wmnet']
and were ALL successful.
Completed auto-reimage of hosts:
['ml-serve2004.codfw.wmnet']
and were ALL successful.
Completed auto-reimage of hosts:
['ml-serve2003.codfw.wmnet']
and were ALL successful.
Change 668723 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] ml-ctrl: Add dummy keys for ML k8s control plane
Change 668723 merged by Klausman:
[labs/private@master] ml-ctrl: Add dummy keys for ML k8s control plane
Change 668075 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] modules/roles: Add k8s config for ML team machines
Change 668075 merged by Klausman:
[operations/puppet@production] modules/roles: Add k8s config for ML team machines
Change 670196 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] files/ssl/: Update ML k8s certs
Change 670196 merged by Klausman:
[operations/puppet@production] files/ssl/: Update ML k8s certs
Change 670198 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] manifest: Move ml-serve1002 insetup -> ml_k8s::master
Change 670198 merged by Klausman:
[operations/puppet@production] manifest: Move ml-serve1002 insetup -> ml_k8s::master
Change 670214 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] manifest: Mov ML k8s machines in codfw to prod
Change 670214 merged by Klausman:
[operations/puppet@production] manifest: Mov ML k8s machines in codfw to prod
@klausman I just acked some Prometheus alerts with NaN in Icinga related to the ml-serve-ctrl nodes; the puppet profiles enable them automatically. Once the k8s nodes are set up, let's remember to follow up on this.
The profile that runs the k8s Prometheus master config is profile::prometheus::k8s.
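For reference, the kind of scrape job such a profile renders relies on Prometheus' Kubernetes service discovery. A generic sketch only (the job name, API server URL and credential paths below are placeholders, not the values used by profile::prometheus::k8s):

```yaml
# Generic Prometheus scrape job using Kubernetes service discovery.
# All names, URLs and file paths here are placeholders for illustration only.
scrape_configs:
  - job_name: 'k8s-mlserve-nodes'
    scheme: https
    tls_config:
      ca_file: /etc/ssl/certs/example-k8s-ca.pem        # placeholder CA bundle
    bearer_token_file: /srv/prometheus/example-token     # placeholder service-account token
    kubernetes_sd_configs:
      - role: node
        api_server: https://ml-ctrl.example.wmnet:6443   # placeholder control-plane URL
        tls_config:
          ca_file: /etc/ssl/certs/example-k8s-ca.pem
        bearer_token_file: /srv/prometheus/example-token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)         # copy node labels onto metrics
```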
Change 670444 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] service catalog: Add entry for ML Team k8s control plane
Change 670797 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/dns@master] Add DNS RR for ML team k8s control plane
Change 670797 merged by Klausman:
[operations/dns@master] Add DNS RR for ML team k8s control plane
Change 670444 merged by Klausman:
[operations/puppet@production] service catalog: Add entry for ML Team k8s control plane
Change 670816 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: Add LVS realserver config for ML k8s
Change 670816 merged by Klausman:
[operations/puppet@production] hiera: Add LVS realserver config for ML k8s
Change 670818 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: Switch ml-ctrl service to lvs_setup
Change 670818 merged by Klausman:
[operations/puppet@production] hiera: Switch ml-ctrl service to lvs_setup
Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:49:29Z] <klausman> restarting pybal on lvs2010 T272918
Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:50:15Z] <klausman> restarting pybal on lvs1016 T272918
Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:55:38Z] <klausman> restarting pybal on lvs2009 T272918
Mentioned in SAL (#wikimedia-operations) [2021-03-11T15:02:47Z] <klausman> restarting pybal on lvs1015 T272918
Change 670835 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] ssl: update ml-ctrl certs (fixed altname)
Change 670835 merged by Klausman:
[operations/puppet@production] ssl: update ml-ctrl certs (fixed altname)
Change 672402 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera/modules: Add role for ML k8s workers
Change 672455 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] hiera: add dummy secrets for ML k8s workers
Change 672455 merged by Klausman:
[labs/private@master] hiera: add dummy secrets for ML k8s workers
Change 672457 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] hiera: move ML k8s worker secrets into the correct location
Change 672457 merged by Klausman:
[labs/private@master] hiera: move ML k8s worker secrets into the correct location
Some details about what we'll have to deploy on top of the new Kubernetes cluster.
One thing worth noting, in my opinion:
If you want to get up running Knative quickly or you do not need service mesh, we recommend installing Istio without service mesh (sidecar injection).
Do we need service mesh for serving?
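The "without service mesh (sidecar injection)" install quoted above essentially means deploying only istiod and the Istio ingress gateway while disabling automatic sidecar injection. A minimal sketch using the IstioOperator API (the exact components and values depend on the Istio release we pick, so treat this as an assumption rather than a final config):

```yaml
# Sketch of an Istio install without sidecar injection (ingress gateway only).
# Component names and values paths assume the IstioOperator API; verify against
# the Istio version we actually deploy.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled        # no automatic sidecar injection -> no mesh
  components:
    pilot:
      enabled: true                 # istiod, needed for ingress routing config
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true               # only the ingress gateway is deployed
```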
https://istio.io/latest/docs/ops/deployment/architecture/ - gives a very high-level view of Istio and how it works
https://istio.io/latest/docs/ops/deployment/deployment-models/ - explains deployment models in depth, especially in relation to the number of clusters, networks, etc.
After reading the last document (really, really nice and well done) it feels as if we are using Istio (a Ferrari) to go and shop for groceries (that could be done with a simple Fiat Panda :D). Jokes aside, knowing what KFServing really needs from Istio will probably cut the bootstrap time down a lot. I see the following in the docs:
KFServing currently depends on Istio Ingress Gateway to route requests to inference services.
@ACraze I recall that we discussed Istio during one of your presentations, but I don't recall if the service mesh was needed for Train Wing (so the Kubeflow pipelines etc.) or also for Lift Wing. Can you shed some light on this topic? :)
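To make the dependency concrete: the unit we'd deploy with KFServing is an InferenceService, and the HTTP route to it is exposed through the Istio ingress gateway (the sidecar mesh is a separate concern). A minimal sketch, where the name, namespace, framework and storage URI are purely illustrative and the apiVersion depends on the KFServing release:

```yaml
# Minimal KFServing InferenceService sketch; all names and URIs are placeholders.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: example-model
  namespace: example-ns
spec:
  predictor:
    sklearn:
      storageUri: "s3://example-bucket/models/example-model"   # placeholder model location
```

Requests for this service would then enter the cluster via the Istio ingress gateway, which is what the docs quote above refers to.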
it feels as if we are using Istio (a Ferrari) to go and shop for groceries (that could be done with a simple Fiat Panda :D)
@elukey nice analogy :) it will likely feel like this as we get started, although I can see a service mesh becoming extremely helpful down the road.
Technically we don't need Istio for serving, although the burden of service discovery, traceability, and letting microservices communicate securely with each other would then be on us.
In many ways, each model will be a microservice, and each will also need pre- (and/or post-) processing, which would be yet another microservice. If we want to A/B test different versions of models, deploy a shadow challenger, or use tools like Explainer or bias/fairness testing, these would all be additional microservices that need to communicate with other models in our cluster in a secure manner. I've had to manage microservices by hand in the past and it quickly became unsustainable, so that's why I'm in favor of using a service mesh to handle cross-cutting concerns.
I'm not 100% sure if a service mesh will be needed for TrainWing yet; it is unclear whether Istio is a hard dependency for KF Pipelines, although it does use Argo to orchestrate workflows, which has a mechanism for sidecar injection using a service mesh.
@ACraze thanks for the explanation, it seems that we have a good motivation to set up Istio with service mesh support: medium to long term it will be needed, so it makes sense to proceed now :)
FWIW, ServiceOps decided against using full mesh networking for our services because we considered Istio to be both very complex and not really needed at our level of complexity.
Before we use it in production, we will need to study its failure modes carefully, and how/what to monitor. In general I would suggest starting with the simple solution (where "simple" is Knative + Istio without service mesh, which is already quite complex), and moving to full mesh once we feel the need.
Yep, makes sense. My only doubt right now is whether Kubeflow's KFServing relies heavily on Istio being configured as a service mesh or not. For example, in the case of a simple model that needs pre/post processing, we'll have 2+ microservices communicating with each other (as Andy mentioned), and I have no idea if Kubeflow relies entirely on Istio to make the communication between the microservices happen, or if there is another (more manual) way. If Kubeflow is not flexible enough we might need to go to service mesh asap, otherwise we won't be able to run basic models. I agree though that we could go step by step and check the non-mesh solution first; I was just trying to understand the medium/long-term use case :)
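On the pre/post-processing case specifically: KFServing can attach a transformer component in front of the predictor and handles the routing between the two itself, so a basic "model plus pre/post-processing" pair may not require sidecar injection. A hedged sketch (the API version, image and URIs are placeholders, not a tested configuration):

```yaml
# InferenceService with a separate pre/post-processing (transformer) step.
# KFServing routes requests transformer -> predictor; names and images are placeholders.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: example-model
spec:
  transformer:
    containers:
      - name: kfserving-container
        image: example-registry/example-transformer:latest    # placeholder pre/post-processing image
  predictor:
    sklearn:
      storageUri: "s3://example-bucket/models/example-model"  # placeholder model location
```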
Change 672402 merged by Klausman:
[operations/puppet@production] hiera/modules: Add role for ML k8s workers
Change 673044 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] hiera: change docker package for ML worker nodes to docker.io
Change 673044 merged by Klausman:
[operations/puppet@production] hiera: change docker package for ML worker nodes to docker.io
Change 673227 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/deployment-charts@master] helm: Make ML k8s clusters visible to helm
Change 673452 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cumin: fix ml-serve aliases and add new ones
Change 673227 merged by jenkins-bot:
[operations/deployment-charts@master] helm: Make ML k8s clusters visible to helm
Change 673452 merged by Elukey:
[operations/puppet@production] cumin: fix ml-serve aliases and add new ones
Change 673457 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cumin: fix ml-serve alias and add newer ones
Change 673457 merged by Elukey:
[operations/puppet@production] cumin: fix ml-serve alias and add newer ones
Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:
['ml-serve2002.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202103191100_elukey_4442.log.
Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:
['ml-serve2002.codfw.wmnet']
The log can be found in /var/log/wmf-auto-reimage/202103191217_klausman_19980.log.
Completed auto-reimage of hosts:
['ml-serve2002.codfw.wmnet']
and were ALL successful.
All worker nodes are now up and visible in both DCs:
ml-serve-ctrl1001:~$ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve1001.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.0.41     <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1002.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.16.183   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1003.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.32.81    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1004.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.48.50    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1

ml-serve-ctrl2001 ~ $ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve2001.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.0.21    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2002.codfw.wmnet   Ready    <none>   9m57s   v1.16.15   10.192.16.43   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2003.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.32.29   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2004.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.48.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
Change 661055 merged by Elukey:
[operations/homer/public@master] Add BGP configuration for the new ML Serve eqiad/codfw clusters
Change 673985 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable coredns for k8s ml-serve clusters
Change 673985 merged by Elukey:
[operations/puppet@production] Enable coredns for k8s ml-serve clusters
Mentioned in SAL (#wikimedia-operations) [2021-03-23T07:36:37Z] <elukey> create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918
Change 674258 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] prometheus: add the ml-serve clusters settings
Change 674258 merged by Elukey:
[operations/puppet@production] prometheus: add the ml-serve clusters settings
Change 674273 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] prometheus: change port for k8s-mlserve clusters
Change 674273 merged by Elukey:
[operations/puppet@production] prometheus: change port for k8s-mlserve clusters
Change 674279 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add config for prometheus@k8s-mlserve
Change 674279 merged by Elukey:
[operations/puppet@production] Add config for prometheus@k8s-mlserve
Change 674313 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::ml_k8s::master: add prometheus instance endpoint
Change 674313 merged by Elukey:
[operations/puppet@production] role::ml_k8s::master: add prometheus instance endpoint
The two clusters are up and running, together with Prometheus monitoring. I created two subtasks to deal with the namespace override problem and the shared user credentials; I'll follow up with SRE over the next few days.
Moritz told me today that two ml-etcd2xxx nodes are on the same Ganeti host; this is my bad since I created the cluster (so there is no real three-row redundancy). I'll destroy and re-create ml-etcd2001 on a host in a different row tomorrow to fix the problem.
Quick note about Prometheus metrics: the SRE team is going to review/create the Kubernetes dashboards soon, so I didn't spend too much time adding our use case to them for the moment. I verified that some metrics were displayed and decided to do another pass after the SRE team's work :)
Reporting a chat that I had with Alex and Janis on IRC. I wasn't able to create a simple deployment in the default namespace from ml-serve-ctrl1001: the pods were not coming up because they were hitting PSP policies (https://kubernetes.io/docs/concepts/policy/pod-security-policy). Janis gave me https://phabricator.wikimedia.org/P15078, which with s/jayme/elukey worked nicely (namely, kubectl apply -f blabla.yaml ended up creating the elukey namespace with a tiller pod running in it).
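For anyone hitting the same issue: PodSecurityPolicy admission only admits a pod if the creating user or service account can `use` (via RBAC) at least one policy that matches the pod spec. An illustrative example of a restrictive policy of that kind (generic, not the actual policy deployed on these clusters):

```yaml
# Illustrative restrictive PodSecurityPolicy; not the policy actually deployed here.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example-restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot        # pods running as root are rejected
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  volumes:
    - configMap
    - emptyDir
    - secret
```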
The base clusters (eqiad/codfw) are up and running; all the work is now tracked in the KFServing standalone deployment efforts. Closing, but re-open if anything is missing!
Change 724933 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters
Change 724933 merged by Elukey:
[operations/puppet@production] network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters
Change 817201 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] prometheus: add config for the k8s ml-staging codfw cluster
Change 817201 merged by Elukey:
[operations/puppet@production] prometheus: add config for the k8s ml-staging codfw cluster