
Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster
Closed, Resolved · Public

Description

Update April 2024

We now need this functionality to support T362788: Migrate Airflow to the dse-k8s cluster, so it seems best to use this ticket to track the remaining work to get persistent volumes working in dse-k8s.

User Story

As a Wikimedia engineer, I want to be able to deploy a stateful application using the Persistent Volume Claim Kubernetes object so that I can ensure the application's data remains persistent even if the pod or container running the application is deleted or recreated.
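
For illustration only, such a deployment would include a claim along these lines (name, namespace and size are hypothetical; the storage class shown is the one created later in this task):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data            # hypothetical
  namespace: my-app            # hypothetical
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd-ssd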

Implementation Plan

We will be following the guidance outlined here: https://docs.ceph.com/en/reef/rbd/rbd-kubernetes

The required steps are:

Acceptance Criteria

  • The engineer should be able to deploy a stateful application using the PersistentVolumeClaim Kubernetes object.
  • The PersistentVolumeClaim should be serviced by the Ceph cluster
  • The application's data should remain persistent even if the pod or container running the application is deleted or recreated.

Details

Related Gerrit patches (repo · branch · lines +/-):

  • operations/deployment-charts · master · +1 -1
  • operations/puppet · production · +7 -3
  • operations/deployment-charts · master · +7 -1
  • operations/deployment-charts · master · +4 -3
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +5 -12
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +3 -1
  • operations/deployment-charts · master · +6 -2
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +4 -6
  • operations/deployment-charts · master · +2 -1
  • operations/deployment-charts · master · +3 -0
  • operations/deployment-charts · master · +5 -3
  • operations/deployment-charts · master · +3 -1
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +1 -2
  • operations/deployment-charts · master · +3 -2
  • operations/deployment-charts · master · +50 -0
  • operations/deployment-charts · master · +1K -0
  • operations/puppet · production · +16 -25
  • operations/puppet · production · +2 -1
  • operations/puppet · production · +1 -0
  • labs/private · master · +5 -0
  • operations/puppet · production · +3 -3
  • operations/puppet · production · +5 -0
Related GitLab merge requests (title · reference · author · source branch → dest branch):

  • Add the ceph-common package, which contains the rbd binary · repos/data-engineering/ceph-csi!8 · btullis · add_ceph_common → main
  • Revert the changes to add an entryoint.sh script · repos/data-engineering/ceph-csi!7 · btullis · revert_entrypoint_umask → main
  • Configure the umask of the cephcsi process · repos/data-engineering/ceph-csi!6 · btullis · umask_entrypoint → main
  • Add the kmod package to the ceph-csi container · repos/data-engineering/ceph-csi!5 · btullis · add_kmod_cephcsi → main
  • Add the required ceph libraries in the production variant · repos/data-engineering/ceph-csi!4 · btullis · add_libs_production → main
  • Add entrypoint and remove duplicate blubber file. · repos/data-engineering/ceph-csi!2 · btullis · enable_entrypoint_cephscsi → main
  • Add repos/data-engineering/kubernetes/csi to the trusted-runners · repos/releng/gitlab-trusted-runner!75 · btullis · add_kubernetes_csi → main

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1028773 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the cephosd dse-k8s-csi user caps

https://gerrit.wikimedia.org/r/1028773

Change #1028773 merged by Btullis:

[operations/puppet@production] Fix the cephosd dse-k8s-csi user caps

https://gerrit.wikimedia.org/r/1028773

Change #1028931 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Initial import of ceph-csi-rbd chart for inspection

https://gerrit.wikimedia.org/r/1028931

BTullis renamed this task from "Support PersistentVolumeClaim objects on dse-k8s cluster" to "Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster". May 15 2024, 11:29 AM

Change #1031589 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] [WIP] Add a values file for the ceph-csi plugin on dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1031589

Change #1046666 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add a Cephx user key for the cephcsi plugin to use

https://gerrit.wikimedia.org/r/1046666

Change #1046666 merged by Btullis:

[labs/private@master] Add a dummy Cephx user key for the cephcsi plugin to use

https://gerrit.wikimedia.org/r/1046666

I believe that I have finished my work on T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment, so it is now awaiting review from others.
I'll mark this ticket as blocked, pending the outcome of that review.

The upstream chart and my helmfile deployment spec have been approved, so I am now ready to deploy this.

Change #1050308 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch ceph server firewall to nftables and permit access from dse_kubepods

https://gerrit.wikimedia.org/r/1050308

Change #1050330 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch cephosd1001 to use the nftables based firewall

https://gerrit.wikimedia.org/r/1050330

Change #1050331 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Switch to use nftables instead of iptables

https://gerrit.wikimedia.org/r/1050331

Change #1050308 merged by Btullis:

[operations/puppet@production] Update ceph server firewall and permit access from dse_kubepods

https://gerrit.wikimedia.org/r/1050308

Change #1028931 merged by jenkins-bot:

[operations/deployment-charts@master] Initial import of ceph-csi-rbd chart for inspection

https://gerrit.wikimedia.org/r/1028931

Change #1031589 merged by jenkins-bot:

[operations/deployment-charts@master] Add a values file for the ceph-csi plugin on dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1031589

Change #1050362 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephosd: Do not set an encryption password if encryption is disabled

https://gerrit.wikimedia.org/r/1050362

Change #1050362 merged by jenkins-bot:

[operations/deployment-charts@master] cephosd: Do not set an encryption password if encryption is disabled

https://gerrit.wikimedia.org/r/1050362

Change #1050566 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the image used for the ceph-csi containers

https://gerrit.wikimedia.org/r/1050566

Change #1050566 merged by jenkins-bot:

[operations/deployment-charts@master] Update the image used for the ceph-csi containers

https://gerrit.wikimedia.org/r/1050566

Change #1050575 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the ceph-csi image to add missing libraries

https://gerrit.wikimedia.org/r/1050575

Change #1050575 merged by jenkins-bot:

[operations/deployment-charts@master] Update the ceph-csi image to add missing libraries

https://gerrit.wikimedia.org/r/1050575

Change #1050585 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Set the fsGroup to 900 for the ceph-csi provisioner

https://gerrit.wikimedia.org/r/1050585

Change #1050585 merged by jenkins-bot:

[operations/deployment-charts@master] Set the fsGroup to 900 for the ceph-csi provisioner

https://gerrit.wikimedia.org/r/1050585

Change #1050615 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] ceph-csi: revert fsGroup change and disable metrics container

https://gerrit.wikimedia.org/r/1050615

Change #1050615 merged by jenkins-bot:

[operations/deployment-charts@master] ceph-csi: revert fsGroup change and disable metrics container

https://gerrit.wikimedia.org/r/1050615

Change #1050622 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: disable the metrics container in the nodeplugin

https://gerrit.wikimedia.org/r/1050622

Change #1050622 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: disable the metrics container in the nodeplugin

https://gerrit.wikimedia.org/r/1050622

Change #1050625 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure fsgroup for the cephcsi nodeplugin pod to be 900

https://gerrit.wikimedia.org/r/1050625

Change #1050625 merged by jenkins-bot:

[operations/deployment-charts@master] Configure the user of the csi-rbdplugin container to be 0

https://gerrit.wikimedia.org/r/1050625

Change #1050633 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the values.yaml file for the cephcsi deployment

https://gerrit.wikimedia.org/r/1050633

Change #1050633 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the values.yaml file for the cephcsi deployment

https://gerrit.wikimedia.org/r/1050633

Change #1050644 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1050644

Change #1050644 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1050644

Change #1050645 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: correct image tag

https://gerrit.wikimedia.org/r/1050645

Change #1050645 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: correct image tag

https://gerrit.wikimedia.org/r/1050645

Change #1050648 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared

https://gerrit.wikimedia.org/r/1050648

Change #1050648 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared

https://gerrit.wikimedia.org/r/1050648

Change #1051083 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Run the nodeplugin-registrar with elevated privileges

https://gerrit.wikimedia.org/r/1051083

Change #1051083 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Run the csi-rbdplugin container as gid 900

https://gerrit.wikimedia.org/r/1051083

Change #1051447 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: bump the image version

https://gerrit.wikimedia.org/r/1051447

Change #1051447 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: bump the image version

https://gerrit.wikimedia.org/r/1051447

@elukey - If you don't mind, I'd like to carry on a discussion that we started on this patch, about whether or not the node-driver-registrar container requires elevated permissions.

When we were assessing the ceph-csi-rbd chart in T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment and the initial import, I said that I believed we would not have to run the node-driver-registrar container as root.
That is what this comment seems to suggest, as well.

This is necessary only for systems with SELinux, where non-privileged sidecar containers cannot access unix domain socket created by privileged CSI driver container.

However, when we tried to deploy the ceph-csi-rbd chart, we noticed that the driver-registrar container attempts to create its own unix socket at /registration/rbd.csi.ceph.com-reg.sock, which is a hostPath mount to /var/lib/kubelet/plugins_registry on each worker. It can't do this, as that directory is owned by root:root, which is pretty understandable. It suggests that the comment in the chart is probably misleading as well.
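
Schematically, the registrar container in the nodeplugin daemonset is wired up along these lines (simplified; the exact names rendered by the chart may differ):

containers:
  - name: driver-registrar
    volumeMounts:
      - name: registration-dir
        mountPath: /registration
volumes:
  - name: registration-dir
    hostPath:
      path: /var/lib/kubelet/plugins_registry
      type: Directory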

Checking out the docs for the node-driver-registrar component, we can see that this behaviour is as expected:

*Registration socket*:

  • Registers the driver with kubelet.
  • Created by the node-driver-registrar.
  • Exposed on a Kubernetes node via hostpath in the Kubelet plugin registry. (typically /var/lib/kubelet/plugins_registry/<drivername.example.com>-reg.sock). The hostpath volume must be mounted at /registration.

It also explains here what permissions are required for the node-driver-registrar process.

The node-driver-registrar does not interact with the Kubernetes API, so no RBAC rules are needed.

It does, however, need to be able to mount hostPath volumes and have the file permissions to:

  • Access the CSI driver socket (typically in /var/lib/kubelet/plugins/<drivername.example.com>/).
    • Used by the node-driver-registrar to fetch the driver name from the driver container (via the CSI GetPluginInfo() call).
  • Access the registration socket (typically in /var/lib/kubelet/plugins_registry/).
    • Used by the node-driver-registrar to register the driver with kubelet.

So after reviewing this, I think that we probably will need to run the driver-registrar container as root, after all.

The only other workable alternative I can think of is to return to the idea of distributing both the ceph-rbdplugin and driver-registrar binaries via (separate?) packages and configuring systemd services for both of these.

I'm still hopeful that we can run the liveness-metrics part of the plugin without root privileges, but I have currently disabled it in the daemonset. I'll come back to this afterwards.

The provisioner part of the chart is all non-privileged and currently seems to be working as expected. It's just these file system permissions on hostPath volumes and the associated unix sockets that are problematic.

Can you think of any other practical way to get this working, or would you agree that elevating the privileges of this container is a reasonable approach? Thanks.
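
For the record, the elevation being discussed amounts to something like the following on the driver-registrar container (this is the rendered effect; the chart values keys that produce it may differ):

securityContext:
  privileged: true
  runAsUser: 0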

Change #1051732 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Grant elevated privileges to the driver-registrar container

https://gerrit.wikimedia.org/r/1051732

Change #1051732 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Grant elevated privileges to the driver-registrar container

https://gerrit.wikimedia.org/r/1051732

Significant progress now: with cephcsi: Grant elevated privileges to the driver-registrar container merged, we have both the nodeplugin daemonset and the provisioner deployment stable.

root@deploy1002:~# kubectl -n kube-system -l release=ceph-csi-rbd get pods
NAME                                        READY   STATUS    RESTARTS   AGE
ceph-csi-rbd-nodeplugin-6vq2c               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-8jqql               2/2     Running   0          5m40s
ceph-csi-rbd-nodeplugin-ffh5h               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-g8dtw               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-hlr6v               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-pnwpg               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-tjr45               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-xkqdx               2/2     Running   0          5m42s
ceph-csi-rbd-provisioner-6f9fc45549-4vczx   5/5     Running   0          5m43s
ceph-csi-rbd-provisioner-6f9fc45549-cxm7l   5/5     Running   0          5m39s
ceph-csi-rbd-provisioner-6f9fc45549-t4t4m   5/5     Running   0          5m43s

Our storageClass called ceph-rbd-ssd is now available.

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get storageclass
NAME           PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd-ssd   rbd.csi.ceph.com   Delete          Immediate           true                   5d23h

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl describe storageclass ceph-rbd-ssd
Name:                  ceph-rbd-ssd
IsDefaultClass:        No
Annotations:           meta.helm.sh/release-name=ceph-csi-rbd,meta.helm.sh/release-namespace=kube-system
Provisioner:           rbd.csi.ceph.com
Parameters:            clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813,csi.storage.k8s.io/controller-expand-secret-name=csi-rbd-secret,csi.storage.k8s.io/controller-expand-secret-namespace=kube-system,csi.storage.k8s.io/fstype=ext4,csi.storage.k8s.io/node-stage-secret-name=csi-rbd-secret,csi.storage.k8s.io/node-stage-secret-namespace=kube-system,csi.storage.k8s.io/provisioner-secret-name=csi-rbd-secret,csi.storage.k8s.io/provisioner-secret-namespace=kube-system,imageFeatures=layering,pool=dse-k8s-csi-ssd
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
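
For reference, that storageClass is roughly equivalent to the following manifest (reconstructed from the describe output above; the real object is rendered by the chart):

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-ssd
provisioner: rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  clusterID: 6d4278e1-ea45-4d29-86fe-85b44c150813
  pool: dse-k8s-csi-ssd
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system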

There are still a few tweaks that I would like to do to the configuration, then I will begin testing the creation and handling of persistentVolume and persistentVolumeClaim objects.

Change #1052275 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: bump_image_version

https://gerrit.wikimedia.org/r/1052275

Change #1052275 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: bump_image_version

https://gerrit.wikimedia.org/r/1052275

Change #1052292 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Enable the prometheus-liveness container

https://gerrit.wikimedia.org/r/1052292

Change #1052292 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Enable the prometheus-liveness container

https://gerrit.wikimedia.org/r/1052292

I am beginning tests with some very basic resources.

I created a temporary namespace for testing.

root@deploy1002:~# kubectl create namespace btullis-pvc-tests
namespace/btullis-pvc-tests created

I have a simple PVC definition as a raw block device, using this namespace and our named storageClass:

root@deploy1002:~# cat raw-block-pvc.yaml 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd

root@deploy1002:~# kubectl apply -f raw-block-pvc.yaml 
persistentvolumeclaim/raw-block-pvc created

The block device shows as pending.

root@deploy1002:~# kubectl get pvc --all-namespaces
NAMESPACE           NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
btullis-pvc-tests   raw-block-pvc   Pending                                      ceph-rbd-ssd   26s

I then create a very simple pod which attempts to bind this pvc.

root@deploy1002:~# cat raw-block-pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc

root@deploy1002:~# kubectl apply -f raw-block-pod.yaml 
pod/pod-with-raw-block-volume created

The block device still shows as pending:

root@deploy1002:~# kubectl get pvc --all-namespaces
NAMESPACE           NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
btullis-pvc-tests   raw-block-pvc   Pending                                      ceph-rbd-ssd   4m48s

Looking at the events for this namespace, we can see that there is an RBAC error:

root@deploy1002:~# kubectl -n btullis-pvc-tests get events -w
LAST SEEN   TYPE      REASON                 OBJECT                                MESSAGE
28s         Warning   FailedScheduling       pod/pod-with-raw-block-volume         0/10 nodes are available: 10 pod has unbound immediate PersistentVolumeClaims.
111s        Normal    Provisioning           persistentvolumeclaim/raw-block-pvc   External provisioner is provisioning volume for claim "btullis-pvc-tests/raw-block-pvc"
2s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator
111s        Warning   ProvisioningFailed     persistentvolumeclaim/raw-block-pvc   failed to provision volume with StorageClass "ceph-rbd-ssd": error getting secret csi-rbd-secret in namespace kube-system: secrets "csi-rbd-secret" is forbidden: User "system:serviceaccount:kube-system:ceph-csi-rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
0s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator
0s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator

The key part of that message is:

failed to provision volume with StorageClass "ceph-rbd-ssd": error getting secret csi-rbd-secret in namespace kube-system: secrets "csi-rbd-secret" is forbidden: User "system:serviceaccount:kube-system:ceph-csi-rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"

I had modified the RBAC here to remove the provisioner's ability to get secrets across the whole cluster. Given that we know we only need to get the Cephx user key from the csi-rbd-secret secret in the kube-system namespace, I can probably add that permission to the provisioner role instead, as sketched below.
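
A minimal sketch of that narrower grant would be a namespaced Role and RoleBinding along these lines (object names here are hypothetical; in practice this would be folded into the chart's provisioner RBAC templates):

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ceph-csi-rbd-provisioner-secret     # hypothetical name
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["csi-rbd-secret"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ceph-csi-rbd-provisioner-secret     # hypothetical name
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ceph-csi-rbd-provisioner-secret
subjects:
  - kind: ServiceAccount
    name: ceph-csi-rbd-provisioner
    namespace: kube-system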

I have cleaned up these unmanaged resources so that they don't cause errors.

root@deploy1002:~# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted
root@deploy1002:~# kubectl delete -f raw-block-pvc.yaml 
persistentvolumeclaim "raw-block-pvc" deleted

I will leave the btullis-pvc-tests namespace in place, but it is empty again.

root@deploy1002:~# kubectl -n btullis-pvc-tests get all
No resources found in btullis-pvc-tests namespace.

Change #1052341 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Grant the provisioner access to the ceph userID secret

https://gerrit.wikimedia.org/r/1052341

Change #1052341 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Grant the provisioner access to the ceph userID secret

https://gerrit.wikimedia.org/r/1052341

This is now working.

The test raw block device was created and bound.

root@deploy1002:/home/btullis# kubectl apply -f raw-block-pvc.yaml 
persistentvolumeclaim/raw-block-pvc created

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
raw-block-pvc   Bound    pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78   1Gi        RWO            ceph-rbd-ssd   2s

The test pod that uses this persistent volume is also working:

root@deploy1002:/home/btullis# kubectl apply -f raw-block-pod.yaml 
pod/pod-with-raw-block-volume created
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod pod-with-raw-block-volume 
Name:         pod-with-raw-block-volume
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 12:53:58 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  do-nothing:
    Container ID:  
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4fc9r (ro)
    Devices:
      /dev/xvda from data
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  raw-block-pvc
    ReadOnly:   false
  kube-api-access-4fc9r:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age   From                     Message
  ----    ------                  ----  ----                     -------
  Normal  Scheduled               15s   default-scheduler        Successfully assigned btullis-pvc-tests/pod-with-raw-block-volume to dse-k8s-worker1006.eqiad.wmnet
  Normal  SuccessfulAttachVolume  15s   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78"

We can see the block device that has been provisioned from the ceph server side:

root@cephosd1001:/etc/ceph# rbd ls dse-k8s-csi-ssd
csi-vol-018de35a-3d29-11ef-a792-be78068bb159
root@cephosd1001:/etc/ceph# rbd info dse-k8s-csi-ssd/csi-vol-018de35a-3d29-11ef-a792-be78068bb159
rbd image 'csi-vol-018de35a-3d29-11ef-a792-be78068bb159':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 8489d39cd1bb88
	block_name_prefix: rbd_data.8489d39cd1bb88
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Mon Jul  8 12:53:00 2024
	access_timestamp: Mon Jul  8 12:53:00 2024
	modify_timestamp: Mon Jul  8 12:53:00 2024

I'd like to try a few more tests, such as:

  • resizing the block device (a sketch of such a test follows this list)
  • provisioning a file system instead of a raw block device
  • deleting the pvc
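
On the resizing point, since the storageClass has allowVolumeExpansion set to true, the test could be as simple as patching the claim and checking the reported capacity (sketch only; not yet run):

kubectl -n btullis-pvc-tests patch pvc raw-block-pvc \
  -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
kubectl -n btullis-pvc-tests get pvc raw-block-pvc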

Deleting the pod and pvc worked as expected.

root@deploy1002:/home/btullis# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
raw-block-pvc   Bound    pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78   1Gi        RWO            ceph-rbd-ssd   22m

root@deploy1002:/home/btullis# kubectl delete -f raw-block-pvc.yaml 
persistentvolumeclaim "raw-block-pvc" deleted

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
No resources found in btullis-pvc-tests namespace.

The image was also deleted from ceph, as expected.

root@cephosd1001:/etc/ceph# rbd ls dse-k8s-csi-ssd
root@cephosd1001:/etc/ceph#

This is expected because of the Delete reclaimPolicy on the storageClass that we have created: https://kubernetes.io/docs/concepts/storage/storage-classes/#reclaim-policy
We could choose to be more conservative about retaining volumes, if we wish to have additional safeguards.
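
If we decide to do that, it would just mean setting the reclaim policy on the storageClass (or the chart value that renders it) to Retain, roughly:

# StorageClass (or equivalent chart value)
reclaimPolicy: Retain   # keep the backing RBD image when the PVC is deleted

Retained volumes would then need manual cleanup on the Ceph side once they are no longer wanted.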

The file system based tests haven't worked yet.
I tried the following resources.

root@deploy1002:/home/btullis# cat pvc.yaml 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
root@deploy1002:/home/btullis# cat pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false

The pod is stuck in a ContainerCreating state, with the following mount warnings.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod csi-rbd-demo-pod 
Name:         csi-rbd-demo-pod
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 13:25:52 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  do-nothing:
    Container ID:  
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2tjs7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rbd-pvc
    ReadOnly:   false
  kube-api-access-2tjs7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               7m50s                  default-scheduler        Successfully assigned btullis-pvc-tests/csi-rbd-demo-pod to dse-k8s-worker1006.eqiad.wmnet
  Normal   SuccessfulAttachVolume  7m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27"
  Warning  FailedMount             5m50s                  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m41s (x8 over 5m49s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159 already exists
  Warning  FailedMount             76s (x3 over 5m48s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-2tjs7]: timed out waiting for the condition

The pv looks good, so maybe it's something as simple as not having mkfs.ext4 available in the plugin container.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pv pvc-575c9e33-4305-4da2-8e2e-ec669a637e27 
Name:            pvc-575c9e33-4305-4da2-8e2e-ec669a637e27
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: csi-rbd-secret
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: kube-system
Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection]
StorageClass:    ceph-rbd-ssd
Status:          Bound
Claim:           btullis-pvc-tests/rbd-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159
    ReadOnly:          false
    VolumeAttributes:      clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813
                           imageFeatures=layering
                           imageName=csi-vol-82644734-3d2d-11ef-a792-be78068bb159
                           journalPool=dse-k8s-csi-ssd
                           pool=dse-k8s-csi-ssd
                           storage.kubernetes.io/csiProvisionerIdentity=1720432837686-9999-rbd.csi.ceph.com
Events:                <none>

I'll check some more logs.
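
One quick way to check that theory would be to look for the binary inside the running nodeplugin container, something like (assuming the container name used in the daemonset):

kubectl -n kube-system exec ds/ceph-csi-rbd-nodeplugin -c csi-rbdplugin -- which mkfs.ext4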

Change #1052812 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow dse-k8s-worker hosts to access ceph ports

https://gerrit.wikimedia.org/r/1052812

Change #1052812 merged by Btullis:

[operations/puppet@production] Allow dse-k8s-worker hosts to access ceph ports

https://gerrit.wikimedia.org/r/1052812

Change #1052818 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1052818

Change #1052818 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1052818

It's working! Here is my test pod with a 1 GB ext4 file system mounted at /var/lib/www/html.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pods
NAME               READY   STATUS    RESTARTS   AGE
csi-rbd-demo-pod   1/1     Running   0          15s

Entering the pod:

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests exec -it csi-rbd-demo-pod -- bash

Showing the free space in the file system:

root@csi-rbd-demo-pod:/# df -h /var/lib/www/html/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       974M   24K  958M   1% /var/lib/www/html
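
A natural follow-up check against the acceptance criteria would be to write a file, recreate the pod, and confirm the data survives; something along these lines (hypothetical commands, not captured in this ticket):

kubectl -n btullis-pvc-tests exec csi-rbd-demo-pod -- sh -c 'echo persistence-test > /var/lib/www/html/test.txt'
kubectl -n btullis-pvc-tests delete -f pod.yaml
kubectl -n btullis-pvc-tests apply -f pod.yaml
# once the pod is Running again:
kubectl -n btullis-pvc-tests exec csi-rbd-demo-pod -- cat /var/lib/www/html/test.txt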

I had two more issues to fix before this worked.

  1. The dse-k8s-workers required the firewall to be opened to the ceph server ports. Although we had previously updated the ceph server firewall and permitted access from dse_kubepods, that didn't allow the nodeplugin component to access the ceph ports, because this container uses host networking, which makes perfect sense (see the note below). I fixed it with: Allow dse-k8s-worker hosts to access ceph ports.
  2. Once this was fixed, the next error stated that the rbd executable wasn't found in the path. I fixed that with: Add the ceph-common package, which contains the rbd binary.
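
In other words, the nodeplugin pods run in the host network namespace, so their Ceph traffic originates from the worker hosts' own addresses rather than from the pod IP range covered by the dse_kubepods rule. Simplified:

# nodeplugin daemonset pod spec (simplified)
spec:
  hostNetwork: true   # RBD map/mount traffic leaves from the worker host IP, not a pod IP
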
BTullis updated the task description.

Marking this as resolved.