K8 DSE Kubernetes Cluster
Closed, Resolved (Public)

Description

Request Status: New Request
Request Type: Infrastructure Request

Request Title: K8 DSE Kubernetes Cluster

  • Request Description:

This project originally went by the codename of the Train Wing cluster and was proposed by the Machine-Learning-Team.

We in the Data-Engineering team have offered to help with this project and we will be moving forward with building the cluster.
It has now been renamed the DSE cluster, to make clear that it is intended as a natural home for Data Science and Engineering (plus Machine Learning and Analytics) workloads.

Some key use cases include:

  • Running a full Kubeflow stack for training ML models
  • Hosting the Data Warehouse and enabling trusted datasets
  • Providing a place to host and register Data-Team applications
  • Integrating S3 and/or Swift compatible object storage as a back-end for analytics and similar workloads

In this phase we are only looking at building this platform in eqiad, although we should always consider how it would scale to a multi-DC and/or cross-DC design.

  • Indicate Priority Level: High
  • Main Requestors: Data Engineering
  • Ideal Delivery Date: Q1 of the 2022/2023 financial year
  • Stakeholders: Data Engineering, Machine Learning, Platform Engineering

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T310195: Ceph Data Infrastructure Request
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | No | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Design Doc | Yes | Design Document - DSE K8S Cluster

Event Timeline

[Please add appropriate project tags if possible - thanks!]

BTullis subscribed.

I have begun work on a Design Document for the DSE K8S Cluster and linked it from the description above, replacing the draft proposal slides.

Change 824163 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add new admin_ng values for the dse-k8s-eqiad cluster

https://gerrit.wikimedia.org/r/824163

Change 824694 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the necessary configuration to enable the dse-k8s control plane

https://gerrit.wikimedia.org/r/824694

Change 824695 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy tokens for dse_k8s cluster

https://gerrit.wikimedia.org/r/824695

Change 824695 merged by Btullis:

[labs/private@master] Add dummy tokens for dse_k8s cluster

https://gerrit.wikimedia.org/r/824695

Change 824699 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy infrastructure_users for dse-k8s cluster

https://gerrit.wikimedia.org/r/824699

Change 824723 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new signing profile for the dse_k8s cfssl-issuer

https://gerrit.wikimedia.org/r/824723

Change 824725 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add a dummy auth_key for the dse_k8s cluster cfssl-issuer

https://gerrit.wikimedia.org/r/824725

Change 824699 merged by Btullis:

[labs/private@master] Add dummy infrastructure_users for dse-k8s cluster

https://gerrit.wikimedia.org/r/824699

Change 824767 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add a dummy certificate for dse_k8s

https://gerrit.wikimedia.org/r/824767

Change 824767 merged by Btullis:

[labs/private@master] Add a dummy certificate for dse_k8s

https://gerrit.wikimedia.org/r/824767

Change 824723 merged by Btullis:

[operations/puppet@production] Add a new signing profile for the dse_k8s cfssl-issuer

https://gerrit.wikimedia.org/r/824723

Change 824163 merged by jenkins-bot:

[operations/deployment-charts@master] Add new admin_ng values for the dse-k8s-eqiad cluster

https://gerrit.wikimedia.org/r/824163

Change 824694 merged by Btullis:

[operations/puppet@production] Add the necessary configuration to enable the dse-k8s control plane

https://gerrit.wikimedia.org/r/824694

Change 825329 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/825329

Change 824725 merged by Btullis:

[labs/private@master] Add a dummy auth_key for the dse_k8s cluster cfssl-issuer

https://gerrit.wikimedia.org/r/824725

Change 825329 merged by Btullis:

[operations/dns@master] Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/825329

Change 826836 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a helmfile configuration for the dse-k8s-eqiad cluster

https://gerrit.wikimedia.org/r/826836

Change 826836 merged by jenkins-bot:

[operations/deployment-charts@master] Add a helmfile configuration for the dse-k8s-eqiad cluster

https://gerrit.wikimedia.org/r/826836

I believe that this ticket should be closed now. We're moving to a phase where we need to look at workloads for this cluster on a case-by-case basis.

Change 840186 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/cookbooks@master] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook

https://gerrit.wikimedia.org/r/840186

Change 843932 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add cumin aliases for dse-k8s in eqiad

https://gerrit.wikimedia.org/r/843932

Change 840186 merged by jenkins-bot:

[operations/cookbooks@master] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook

https://gerrit.wikimedia.org/r/840186

Change 845028 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/cookbooks@master] Fix the sre.k8s.reboot-nodes cookbook for dse-k8s

https://gerrit.wikimedia.org/r/845028

Change 845028 abandoned by Btullis:

[operations/cookbooks@master] Fix the sre.k8s.reboot-nodes cookbook for dse-k8s

Reason:

Functionality being added in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845430 instead.

https://gerrit.wikimedia.org/r/845028

Change 843932 abandoned by Btullis:

[operations/puppet@production] Add cumin aliases for dse-k8s in eqiad

Reason:

Not currently necessary

https://gerrit.wikimedia.org/r/843932

BTullis moved this task from QA/Review to Done on the Foundational Technology Requests board.

I'm being bold and stating that this ticket is done, since we requested resources for a K8S cluster, the hardware was duly purchased/reallocated and the cluster was built.

We're now at the point where we are starting to test various DSE-related workloads on this cluster.