Request Title: DSE Cluster Experiment
There is an industry trend to use Kubernetes for data engineering and machine learning workloads. Given the nature of Wikimedia hardware, networking, and data infrastructures, a discovery project is required to understand the specific possibilities, constraints, and limitations with regard to how industry-recognized best practices should be applied.
The DSE (Data Science and Engineering) Kubernetes (k8s) Cluster Experiment is a joint effort by Machine Learning and Data Engineering, with the support of SRE, to build out a k8s cluster that meets the specific storage, security, and networking requirements of data workloads.
In parallel but outside this experiment, the Data Engineering team, with the support of SRE, is investing in a new software-defined storage platform based on Ceph. One of the core goals of this new storage platform is to evaluate its potential as a replacement for HDFS in the long term.
This document defines the scope of the DSE K8s Cluster Experiment. Broad interest in running data applications and workloads has led to scope creep. To keep the experiment from growing beyond its original purpose, this document lists its goals as user-defined user stories; the shared DSE Cluster Experiment is complete when these user stories are met. Note that some of the goals depend on the progress of the Ceph cluster project.
The user story goals of the DSE Cluster Experiment are listed in order of planned completion:
- Address Kerberos T327257
- Make Compute available T327258
- Make Block Storage Available T327259
- Machine Learning Use Case T327262
Experiment Outcomes & Learnings:
Deploying Kubeflow on a self-managed cluster can present challenges that we wish to overcome in this experiment. These include:
- Compatibility issues: Kubeflow requires certain versions of Kubernetes and other components to work correctly. If our self-managed Kubernetes cluster does not meet these requirements, we may need to learn to rapidly upgrade or reconfigure the cluster.
- Networking and security considerations: Kubeflow requires specific networking and security configurations to work properly. If our self-managed Kubernetes cluster does not have these configurations in place, we may need to change our network and security settings before deploying Kubeflow.
- Resource constraints: Deploying Kubeflow on a self-managed Kubernetes cluster can consume significant resources, including CPU, memory, and storage. We need to make sure that our cluster has sufficient resources to run Kubeflow before deploying it.
- Maintenance and upgrades: If we deploy Kubeflow on our own cluster, we will be responsible for maintaining and upgrading it. This includes keeping the cluster and its components up to date, as well as troubleshooting and resolving any issues that may arise.
- Monitoring and logging: To effectively monitor and troubleshoot our Kubeflow deployment, we will need to establish a monitoring and logging infrastructure. This may require additional setup and configuration work to get it running properly.
- Knowledge and Expertise: The team should develop the necessary knowledge and expertise to deploy, configure, and maintain Kubeflow on a self-managed Kubernetes cluster.
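
To illustrate the resource-constraints point above, one common guardrail is a namespace-level ResourceQuota that caps what a Kubeflow deployment may consume. This is a minimal sketch, not a tested configuration for this cluster: the namespace name and the specific limits below are hypothetical placeholders and would need to be sized against actual hardware capacity.

```yaml
# Hypothetical ResourceQuota for a Kubeflow namespace.
# All limit values are illustrative; tune them to the cluster's real capacity.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: kubeflow-quota
  namespace: kubeflow        # assumes Kubeflow is installed in its own namespace
spec:
  hard:
    requests.cpu: "32"       # total CPU requests across all pods in the namespace
    requests.memory: 128Gi   # total memory requests
    limits.cpu: "64"         # total CPU limits
    limits.memory: 256Gi     # total memory limits
    requests.storage: 1Ti    # total persistent volume claim storage
```

A manifest like this would be applied with `kubectl apply -f quota.yaml`, after which the API server rejects pod or PVC creation that would exceed the caps, surfacing resource exhaustion as an explicit scheduling error rather than node pressure.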
- Indicate Priority Level: High
- Main Requestors: Data-Engineering Machine-Learning-Team
- Ideal Delivery Date: End of Q4
- Stakeholders: Data-Engineering Machine-Learning-Team Product-Analytics Research
|Related PHAB Tickets|Yes|T327257 T327258 T327259 T327262|
|Product One Pager|Yes|https://docs.google.com/document/d/1cDFc_cGlP9qDFWilBAuQtIMZr_EfASfo4AwFfm_3mXY/|