Create a DSE Kubernetes cluster with support for persistent storage from Ceph
Closed, Resolved (Public)

Description

Request Title: DSE Cluster Experiment

Request Description:

There is an industry trend to use Kubernetes for data engineering and machine learning workloads. Given the nature of Wikimedia hardware, networking, and data infrastructures, a discovery project is required to understand the specific possibilities, constraints, and limitations with regard to how industry-recognized best practices should be applied.

The DSE (Data Science and Engineering) Kubernetes (k8s) Cluster Experiment is a joint effort by Machine Learning and Data Engineering, with the support of SRE, to build a k8s cluster that meets the specific storage, security, and networking requirements of data workloads.

In parallel but outside this experiment, the Data Engineering team, with the support of SRE, is investing in a new software-defined storage platform based on Ceph. One of the core goals of this new storage platform is to evaluate its potential as a replacement for HDFS in the long term.

This document defines the scope of the DSE k8s Cluster Experiment. Strong interest in running data applications and workloads on the cluster has led to scope creep. To prevent the experiment from growing beyond its original purpose, this document states the goals of the DSE Cluster Experiment as user stories. The shared DSE Cluster Experiment is complete when these user stories are met. Note that some of the goals depend on the progress of the Ceph cluster project.

The user story goals of the DSE Cluster Experiment are listed in order of planned completion:

  1. Address Kerberos T327257
  2. Make Compute available T327258
  3. Make Block Storage Available T327259
  4. Machine Learning Use Case T327262

Experiment Outcomes & Learnings:

Deploying Kubeflow on a self-managed cluster presents challenges that we wish to overcome in this experiment. These include:

  1. Compatibility issues: Kubeflow requires certain versions of Kubernetes and other components to work correctly. If our self-managed Kubernetes cluster does not meet these requirements, we may need to learn to rapidly upgrade or reconfigure our cluster.
  2. Networking and security considerations: Kubeflow requires specific networking and security configurations to work properly. If our self-managed Kubernetes cluster does not have these configurations in place, we may need to change our network and security settings before deploying Kubeflow.
  3. Resource constraints: Deploying Kubeflow on a self-managed Kubernetes cluster can consume significant resources, including CPU, memory, and storage. We need to make sure that our cluster has sufficient resources to run Kubeflow before deploying it.
  4. Maintenance and upgrades: If we deploy Kubeflow on our own cluster, we will be responsible for maintaining and upgrading it. This includes keeping the cluster and its components up to date, as well as troubleshooting and resolving any issues that may arise.
  5. Monitoring and logging: To effectively monitor and troubleshoot our Kubeflow deployment, we will need to establish a monitoring and logging infrastructure. This may require additional setup and configuration work to get it running properly.
  6. Knowledge and expertise: The team should develop the necessary knowledge and expertise to deploy, configure, and maintain Kubeflow on a self-managed Kubernetes cluster.
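
Items 1 and 3 in the list above (version compatibility and resource headroom) lend themselves to an automated pre-flight check before a deployment is attempted. The following is a minimal sketch in Python, not a description of how the experiment was actually run. The function names, the `Resources` fields, and all version ranges and resource figures are hypothetical placeholders, not Kubeflow's real requirements.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, int]:
    """Parse a version string such as 'v1.25.4' or '1.25' into (major, minor)."""
    parts = text.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


@dataclass
class Resources:
    """Hypothetical resource summary for a cluster or a workload."""
    cpu_cores: float
    memory_gib: float
    storage_gib: float


def preflight(cluster_version: str, supported_min: str, supported_max: str,
              free: Resources, needed: Resources) -> list[str]:
    """Return a list of human-readable problems; an empty list means 'go'."""
    problems = []

    # Item 1: the cluster version must fall inside the supported range.
    version = parse_version(cluster_version)
    if not parse_version(supported_min) <= version <= parse_version(supported_max):
        problems.append(
            f"Kubernetes {cluster_version} is outside the supported range "
            f"{supported_min} to {supported_max}"
        )

    # Item 3: every resource dimension must have enough free capacity.
    for name in ("cpu_cores", "memory_gib", "storage_gib"):
        have, want = getattr(free, name), getattr(needed, name)
        if have < want:
            problems.append(f"{name}: {have} available, {want} required")

    return problems


if __name__ == "__main__":
    # All figures below are invented for illustration only.
    issues = preflight(
        cluster_version="v1.23.6",
        supported_min="1.24",
        supported_max="1.26",
        free=Resources(cpu_cores=4, memory_gib=64, storage_gib=500),
        needed=Resources(cpu_cores=8, memory_gib=32, storage_gib=100),
    )
    for issue in issues:
        print(issue)
```

In practice the inputs would come from the cluster itself (for example, the API server version and node capacity) rather than from hard-coded values; that wiring is left out here because it depends on tooling the task does not describe.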
Status

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T327257 T327258 T327259 T327262
Product One Pager | Yes | https://docs.google.com/document/d/1cDFc_cGlP9qDFWilBAuQtIMZr_EfASfo4AwFfm_3mXY/

Details

Title | Reference | Author | Source Branch | Dest Branch
Add data-engineering/ceph-csi to trusted runners | repos/releng/gitlab-trusted-runner!69 | btullis | add_ceph_csi | main
Add the build pipeline for the ceph-csi container | repos/data-engineering/ceph-csi!1 | btullis | add_csi_container | main

Related Objects

Status | Assigned
Resolved | Gehel
Duplicate | None
Declined | None
Resolved | elukey
Resolved | Gehel
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Declined | BTullis
Resolved | BTullis
Resolved | EChetty
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | EChetty
Resolved | BTullis
Resolved | EChetty
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis
Resolved | BTullis

Event Timeline

EChetty changed the task status from Open to In Progress. (Jan 18 2023, 12:13 PM)

Removing inactive assignee (please do so as part of team offboarding!).

BTullis renamed this task from Data Science and Engineering Kubernetes Cluster Experiment to Create a DSE Kubernetes cluster with support for persistent storage from Ceph. (Jul 18 2023, 12:31 PM)
Gehel lowered the priority of this task from High to Low. (Oct 11 2023, 8:48 AM)
Gehel claimed this task.
Gehel subscribed.

Completed! We're continuing work on experimenting with Airflow running on k8s with Ceph as a storage backend (T362788).