Request Title: DSE Cluster Experiment
There is an industry trend to use Kubernetes for data engineering and machine learning workloads. Given the nature of Wikimedia hardware, networking, and data infrastructures, a discovery project is required to understand the specific possibilities, constraints, and limitations with regard to how industry-recognized best practices should be applied.
The DSE (Data Science and Engineering) Kubernetes (k8s) Cluster Experiment is a joint effort by Machine Learning and Data Engineering, with the support of SRE, to build out a k8s cluster that meets the specific storage, security, and networking requirements of data workloads.
In parallel but outside this experiment, the Data Engineering team, with the support of SRE, is investing in a new software-defined storage platform based on Ceph. One of the core goals of this new storage platform is to evaluate its potential as a replacement for HDFS in the long term.
This document defines the scope of the DSE K8s Cluster Experiment. Broad interest in running data applications and workloads has led to scope creep. To keep the experiment from growing beyond its original purpose, this document lists its goals as user-defined user stories; the shared DSE Cluster Experiment is complete when these user stories are met. Note that some of the goals depend on the progress of the Ceph cluster project.
The user story goals of the DSE Cluster Experiment are listed in order of planned completion:
- Address Kerberos T327257
- Make Compute available T327258
- Make Block Storage Available T327259
- Machine Learning Use Case T327262
Experiment Outcomes & Learnings:
Deploying Kubeflow on a self-managed cluster can present challenges that we wish to overcome in this experiment. These include:
- Compatibility issues: Kubeflow requires certain versions of Kubernetes and other components to work correctly. If our self-managed Kubernetes cluster does not meet these requirements, we may need to learn to rapidly upgrade or reconfigure the cluster.
- Networking and security considerations: Kubeflow requires specific networking and security configurations to work properly. If our self-managed Kubernetes cluster does not have these configurations in place, we may need to change our network and security settings before deploying Kubeflow.
- Resource constraints: Deploying Kubeflow on a self-managed Kubernetes cluster can consume significant resources, including CPU, memory, and storage. We need to make sure that our cluster has sufficient resources to run Kubeflow before deploying it.
- Maintenance and upgrades: If we deploy Kubeflow on our own cluster, we will be responsible for maintaining and upgrading it. This includes keeping the cluster and its components up to date, as well as troubleshooting and resolving any issues that may arise.
- Monitoring and logging: To effectively monitor and troubleshoot our Kubeflow deployment, we will need to establish a monitoring and logging infrastructure. This may require additional setup and configuration work to get it running properly.
- Knowledge and Expertise: The team should develop the necessary knowledge and expertise to deploy, configure, and maintain Kubeflow on a self-managed Kubernetes cluster.
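
To illustrate the resource-constraints point above, one common guardrail is a namespace-level ResourceQuota that caps what a Kubeflow deployment may consume. This is a minimal sketch, not a tested configuration for this cluster: the namespace name and the specific limits below are hypothetical placeholders and would need to be sized against actual hardware capacity.

```yaml
# Hypothetical ResourceQuota for a Kubeflow namespace.
# All limit values are illustrative; tune them to the cluster's real capacity.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: kubeflow-quota
  namespace: kubeflow        # assumes Kubeflow is installed in its own namespace
spec:
  hard:
    requests.cpu: "32"       # total CPU requests across all pods in the namespace
    requests.memory: 128Gi   # total memory requests
    limits.cpu: "64"         # total CPU limits
    limits.memory: 256Gi     # total memory limits
    requests.storage: 1Ti    # total persistent volume claim storage
```

A manifest like this would be applied with `kubectl apply -f quota.yaml`, after which the API server rejects pod or PVC creation that would exceed the caps, surfacing resource exhaustion as an explicit scheduling error rather than node pressure.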
- Indicate Priority Level: High
- Main Requestors: Data-Engineering Machine-Learning-Team
- Ideal Delivery Date: End of Q4
- Stakeholders: Data-Engineering Machine-Learning-Team Product-Analytics Research
|Related PHAB Tickets|Yes|T327257 T327258 T327259 T327262|
|Product One Pager|Yes|https://docs.google.com/document/d/1cDFc_cGlP9qDFWilBAuQtIMZr_EfASfo4AwFfm_3mXY/|