
Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster
Closed, Resolved · Public

Description

Update April 2024

We now need this functionality to support T362788: Migrate Airflow to the dse-k8s cluster, so it seems best to use this ticket to track the remaining work to get persistent volumes working in dse-k8s.

User Story

As a Wikimedia engineer, I want to be able to deploy a stateful application using the Persistent Volume Claim Kubernetes object so that I can ensure the application's data remains persistent even if the pod or container running the application is deleted or recreated.
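
For illustration only, such a deployment would include a claim along these lines (name, namespace and size are hypothetical; the storage class shown is the one created later in this task):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data            # hypothetical
  namespace: my-app            # hypothetical
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd-ssd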

Implementation Plan

We will be following the guidance outlined here: https://docs.ceph.com/en/reef/rbd/rbd-kubernetes

The required steps are:

Acceptance Criteria

  • The engineer should be able to deploy a stateful application using the PersistentVolumeClaim Kubernetes object.
  • The PersistentVolumeClaim should be serviced by the Ceph cluster
  • The application's data should remain persistent even if the pod or container running the application is deleted or recreated.

Details

Related Gerrit patches (repo · branch · lines +/-):

  • operations/deployment-charts · master · +1 -1
  • operations/puppet · production · +7 -3
  • operations/deployment-charts · master · +7 -1
  • operations/deployment-charts · master · +4 -3
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +5 -12
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +3 -1
  • operations/deployment-charts · master · +6 -2
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +4 -6
  • operations/deployment-charts · master · +2 -1
  • operations/deployment-charts · master · +3 -0
  • operations/deployment-charts · master · +5 -3
  • operations/deployment-charts · master · +3 -1
  • operations/deployment-charts · master · +1 -1
  • operations/deployment-charts · master · +1 -2
  • operations/deployment-charts · master · +3 -2
  • operations/deployment-charts · master · +50 -0
  • operations/deployment-charts · master · +1K -0
  • operations/puppet · production · +16 -25
  • operations/puppet · production · +2 -1
  • operations/puppet · production · +1 -0
  • labs/private · master · +5 -0
  • operations/puppet · production · +3 -3
  • operations/puppet · production · +5 -0
Related GitLab merge requests (title · reference · author · source branch → dest branch):

  • Add the ceph-common package, which contains the rbd binary · repos/data-engineering/ceph-csi!8 · btullis · add_ceph_common → main
  • Revert the changes to add an entryoint.sh script · repos/data-engineering/ceph-csi!7 · btullis · revert_entrypoint_umask → main
  • Configure the umask of the cephcsi process · repos/data-engineering/ceph-csi!6 · btullis · umask_entrypoint → main
  • Add the kmod package to the ceph-csi container · repos/data-engineering/ceph-csi!5 · btullis · add_kmod_cephcsi → main
  • Add the required ceph libraries in the production variant · repos/data-engineering/ceph-csi!4 · btullis · add_libs_production → main
  • Add entrypoint and remove duplicate blubber file. · repos/data-engineering/ceph-csi!2 · btullis · enable_entrypoint_cephscsi → main
  • Add repos/data-engineering/kubernetes/csi to the trusted-runners · repos/releng/gitlab-trusted-runner!75 · btullis · add_kubernetes_csi → main

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1028773 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the cephosd dse-k8s-csi user caps

https://gerrit.wikimedia.org/r/1028773

Change #1028773 merged by Btullis:

[operations/puppet@production] Fix the cephosd dse-k8s-csi user caps

https://gerrit.wikimedia.org/r/1028773

Change #1028931 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Initial import of ceph-csi-rbd chart for inspection

https://gerrit.wikimedia.org/r/1028931

BTullis renamed this task from "Support PersistentVolumeClaim objects on dse-k8s cluster" to "Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster". May 15 2024, 11:29 AM

Change #1031589 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] [WIP] Add a values file for the ceph-csi plugin on dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1031589

Change #1046666 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add a Cephx user key for the cephcsi plugin to use

https://gerrit.wikimedia.org/r/1046666

Change #1046666 merged by Btullis:

[labs/private@master] Add a dummy Cephx user key for the cephcsi plugin to use

https://gerrit.wikimedia.org/r/1046666

I believe that I have finished my work on T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment, so it is now awaiting review from others.
I'll mark this ticket as blocked, pending the outcome of that review.

The upstream chart and my helmfile deployment spec have been approved, so I am now ready to deploy this.

Change #1050308 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch ceph server firewall to nftables and permit access from dse_kubepods

https://gerrit.wikimedia.org/r/1050308

Change #1050330 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch cephosd1001 to use the nftables based firewall

https://gerrit.wikimedia.org/r/1050330

Change #1050331 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Switch to use nftables instead of iptables

https://gerrit.wikimedia.org/r/1050331

Change #1050308 merged by Btullis:

[operations/puppet@production] Update ceph server firewall and permit access from dse_kubepods

https://gerrit.wikimedia.org/r/1050308

Change #1028931 merged by jenkins-bot:

[operations/deployment-charts@master] Initial import of ceph-csi-rbd chart for inspection

https://gerrit.wikimedia.org/r/1028931

Change #1031589 merged by jenkins-bot:

[operations/deployment-charts@master] Add a values file for the ceph-csi plugin on dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1031589

Change #1050362 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephosd: Do not set an encryption password if encryption is disabled

https://gerrit.wikimedia.org/r/1050362

Change #1050362 merged by jenkins-bot:

[operations/deployment-charts@master] cephosd: Do not set an encryption password if encryption is disabled

https://gerrit.wikimedia.org/r/1050362

Change #1050566 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the image used for the ceph-csi containers

https://gerrit.wikimedia.org/r/1050566

Change #1050566 merged by jenkins-bot:

[operations/deployment-charts@master] Update the image used for the ceph-csi containers

https://gerrit.wikimedia.org/r/1050566

Change #1050575 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the ceph-csi image to add missing libraries

https://gerrit.wikimedia.org/r/1050575

Change #1050575 merged by jenkins-bot:

[operations/deployment-charts@master] Update the ceph-csi image to add missing libraries

https://gerrit.wikimedia.org/r/1050575

Change #1050585 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Set the fsGroup to 900 for the ceph-csi provisioner

https://gerrit.wikimedia.org/r/1050585

Change #1050585 merged by jenkins-bot:

[operations/deployment-charts@master] Set the fsGroup to 900 for the ceph-csi provisioner

https://gerrit.wikimedia.org/r/1050585

Change #1050615 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] ceph-csi: revert fsGroup change and disable metrics container

https://gerrit.wikimedia.org/r/1050615

Change #1050615 merged by jenkins-bot:

[operations/deployment-charts@master] ceph-csi: revert fsGroup change and disable metrics container

https://gerrit.wikimedia.org/r/1050615

Change #1050622 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: disable the metrics container in the nodeplugin

https://gerrit.wikimedia.org/r/1050622

Change #1050622 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: disable the metrics container in the nodeplugin

https://gerrit.wikimedia.org/r/1050622

Change #1050625 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure fsgroup for the cephcsi nodeplugin pod to be 900

https://gerrit.wikimedia.org/r/1050625

Change #1050625 merged by jenkins-bot:

[operations/deployment-charts@master] Configure the user of the csi-rbdplugin container to be 0

https://gerrit.wikimedia.org/r/1050625

Change #1050633 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the values.yaml file for the cephcsi deployment

https://gerrit.wikimedia.org/r/1050633

Change #1050633 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the values.yaml file for the cephcsi deployment

https://gerrit.wikimedia.org/r/1050633

Change #1050644 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1050644

Change #1050644 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1050644

Change #1050645 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: correct image tag

https://gerrit.wikimedia.org/r/1050645

Change #1050645 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: correct image tag

https://gerrit.wikimedia.org/r/1050645

Change #1050648 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared

https://gerrit.wikimedia.org/r/1050648

Change #1050648 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared

https://gerrit.wikimedia.org/r/1050648

Change #1051083 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Run the nodeplugin-registrar with elevated privileges

https://gerrit.wikimedia.org/r/1051083

Change #1051083 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Run the csi-rbdplugin container as gid 900

https://gerrit.wikimedia.org/r/1051083

Change #1051447 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: bump the image version

https://gerrit.wikimedia.org/r/1051447

Change #1051447 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: bump the image version

https://gerrit.wikimedia.org/r/1051447

@elukey - If you don't mind, I'd like to carry on a discussion that we started on this patch, about whether or not the node-driver-registrar container requires elevated permissions.

When we were assessing the ceph-csi-rbd chart in T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment and the initial import, I said that I believed we would not have to run the node-driver-registrar container as root.
That is what this comment seems to suggest, as well.

This is necessary only for systems with SELinux, where non-privileged sidecar containers cannot access unix domain socket created by privileged CSI driver container.

However, when we tried to deploy the ceph-csi-rbd chart, we noticed that the driver-registrar container attempts to create its own unix socket at /registration/rbd.csi.ceph.com-reg.sock, which is a hostPath mount to /var/lib/kubelet/plugins_registry on each worker. It can't do this, as that directory is owned by root:root, which is pretty understandable. It suggests that the comment in the chart is probably misleading as well.
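
Schematically, the registrar container in the nodeplugin daemonset is wired up along these lines (simplified; the exact names rendered by the chart may differ):

containers:
  - name: driver-registrar
    volumeMounts:
      - name: registration-dir
        mountPath: /registration
volumes:
  - name: registration-dir
    hostPath:
      path: /var/lib/kubelet/plugins_registry
      type: Directory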

Checking out the docs for the node-driver-registrar component, we can see that this behaviour is as expected:

*Registration socket*:

  • Registers the driver with kubelet.
  • Created by the node-driver-registrar.
  • Exposed on a Kubernetes node via hostpath in the Kubelet plugin registry. (typically /var/lib/kubelet/plugins_registry/<drivername.example.com>-reg.sock). The hostpath volume must be mounted at /registration.

It also explains here what permissions are required for the node-driver-registrar process.

The node-driver-registrar does not interact with the Kubernetes API, so no RBAC rules are needed.

It does, however, need to be able to mount hostPath volumes and have the file permissions to:

  • Access the CSI driver socket (typically in /var/lib/kubelet/plugins/<drivername.example.com>/).
    • Used by the node-driver-registrar to fetch the driver name from the driver container (via the CSI GetPluginInfo() call).
  • Access the registration socket (typically in /var/lib/kubelet/plugins_registry/).
    • Used by the node-driver-registrar to register the driver with kubelet.

So after reviewing this, I think that we probably will need to run the driver-registrar container as root, after all.

The only other workable alternative I can think of is to return to the idea of distributing both the ceph-rbdplugin and driver-registrar binaries via (separate?) packages and configuring systemd services for both of these.

I'm still hopeful that we can run the liveness-metrics part of the plugin without root privileges, but I have currently disabled it in the daemonset. I'll come back to this afterwards.

The provisioner part of the chart is all non-privileged and currently seems to be working as expected. It's just these file system permissions on hostPath volumes and the associated unix sockets that are problematic.

Can you think of any other practical way to get this working, or would you agree that elevating the privileges of this container is a reasonable approach? Thanks.
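
For the record, the elevation being discussed amounts to something like the following on the driver-registrar container (this is the rendered effect; the chart values keys that produce it may differ):

securityContext:
  privileged: true
  runAsUser: 0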

Change #1051732 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Grant elevated privileges to the driver-registrar container

https://gerrit.wikimedia.org/r/1051732

Change #1051732 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Grant elevated privileges to the driver-registrar container

https://gerrit.wikimedia.org/r/1051732

Significant progress now: with cephcsi: Grant elevated privileges to the driver-registrar container merged, we have both the nodeplugin daemonset and the provisioner deployment stable.

root@deploy1002:~# kubectl -n kube-system -l release=ceph-csi-rbd get pods
NAME                                        READY   STATUS    RESTARTS   AGE
ceph-csi-rbd-nodeplugin-6vq2c               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-8jqql               2/2     Running   0          5m40s
ceph-csi-rbd-nodeplugin-ffh5h               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-g8dtw               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-hlr6v               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-pnwpg               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-tjr45               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-xkqdx               2/2     Running   0          5m42s
ceph-csi-rbd-provisioner-6f9fc45549-4vczx   5/5     Running   0          5m43s
ceph-csi-rbd-provisioner-6f9fc45549-cxm7l   5/5     Running   0          5m39s
ceph-csi-rbd-provisioner-6f9fc45549-t4t4m   5/5     Running   0          5m43s

Our storageClass called ceph-rbd-ssd is now available.

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get storageclass
NAME           PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd-ssd   rbd.csi.ceph.com   Delete          Immediate           true                   5d23h

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl describe storageclass ceph-rbd-ssd
Name:                  ceph-rbd-ssd
IsDefaultClass:        No
Annotations:           meta.helm.sh/release-name=ceph-csi-rbd,meta.helm.sh/release-namespace=kube-system
Provisioner:           rbd.csi.ceph.com
Parameters:            clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813,csi.storage.k8s.io/controller-expand-secret-name=csi-rbd-secret,csi.storage.k8s.io/controller-expand-secret-namespace=kube-system,csi.storage.k8s.io/fstype=ext4,csi.storage.k8s.io/node-stage-secret-name=csi-rbd-secret,csi.storage.k8s.io/node-stage-secret-namespace=kube-system,csi.storage.k8s.io/provisioner-secret-name=csi-rbd-secret,csi.storage.k8s.io/provisioner-secret-namespace=kube-system,imageFeatures=layering,pool=dse-k8s-csi-ssd
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
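
For reference, that storageClass is roughly equivalent to the following manifest (reconstructed from the describe output above; the real object is rendered by the chart):

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-ssd
provisioner: rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  clusterID: 6d4278e1-ea45-4d29-86fe-85b44c150813
  pool: dse-k8s-csi-ssd
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system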

There are still a few tweaks that I would like to do to the configuration, then I will begin testing the creation and handling of persistentVolume and persistentVolumeClaim objects.

Change #1052275 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: bump_image_version

https://gerrit.wikimedia.org/r/1052275

Change #1052275 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: bump_image_version

https://gerrit.wikimedia.org/r/1052275

Change #1052292 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Enable the prometheus-liveness container

https://gerrit.wikimedia.org/r/1052292

Change #1052292 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Enable the prometheus-liveness container

https://gerrit.wikimedia.org/r/1052292

I am beginning tests with some very basic resources.

I created a temporary namespace for testing.

root@deploy1002:~# kubectl create namespace btullis-pvc-tests
namespace/btullis-pvc-tests created

I have a simple PVC definition as a raw block device, using this namespace and our named storageClass:

root@deploy1002:~# cat raw-block-pvc.yaml 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd

root@deploy1002:~# kubectl apply -f raw-block-pvc.yaml 
persistentvolumeclaim/raw-block-pvc created

The block device shows as pending.

root@deploy1002:~# kubectl get pvc --all-namespaces
NAMESPACE           NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
btullis-pvc-tests   raw-block-pvc   Pending                                      ceph-rbd-ssd   26s

I then create a very simple pod which attempts to bind this pvc.

root@deploy1002:~# cat raw-block-pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc

root@deploy1002:~# kubectl apply -f raw-block-pod.yaml 
pod/pod-with-raw-block-volume created

The block device still shows as pending:

root@deploy1002:~# kubectl get pvc --all-namespaces
NAMESPACE           NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
btullis-pvc-tests   raw-block-pvc   Pending                                      ceph-rbd-ssd   4m48s

Looking at the events for this namespace, we can see that there is an RBAC error:

root@deploy1002:~# kubectl -n btullis-pvc-tests get events -w
LAST SEEN   TYPE      REASON                 OBJECT                                MESSAGE
28s         Warning   FailedScheduling       pod/pod-with-raw-block-volume         0/10 nodes are available: 10 pod has unbound immediate PersistentVolumeClaims.
111s        Normal    Provisioning           persistentvolumeclaim/raw-block-pvc   External provisioner is provisioning volume for claim "btullis-pvc-tests/raw-block-pvc"
2s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator
111s        Warning   ProvisioningFailed     persistentvolumeclaim/raw-block-pvc   failed to provision volume with StorageClass "ceph-rbd-ssd": error getting secret csi-rbd-secret in namespace kube-system: secrets "csi-rbd-secret" is forbidden: User "system:serviceaccount:kube-system:ceph-csi-rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
0s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator
0s          Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator

The key part of that message is:

failed to provision volume with StorageClass "ceph-rbd-ssd": error getting secret csi-rbd-secret in namespace kube-system: secrets "csi-rbd-secret" is forbidden: User "system:serviceaccount:kube-system:ceph-csi-rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"

I had modified the RBAC here to remove the provisioner's ability to get secrets across the whole cluster. Given that we know we only need to get the Cephx user key from the csi-rbd-secret secret in the kube-system namespace, I can probably add that permission to the provisioner role instead, as sketched below.
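
A minimal sketch of that narrower grant would be a namespaced Role and RoleBinding along these lines (object names here are hypothetical; in practice this would be folded into the chart's provisioner RBAC templates):

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ceph-csi-rbd-provisioner-secret     # hypothetical name
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["csi-rbd-secret"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ceph-csi-rbd-provisioner-secret     # hypothetical name
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ceph-csi-rbd-provisioner-secret
subjects:
  - kind: ServiceAccount
    name: ceph-csi-rbd-provisioner
    namespace: kube-system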

I have cleaned up these unmanaged resources so that they don't cause errors.

root@deploy1002:~# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted
root@deploy1002:~# kubectl delete -f raw-block-pvc.yaml 
persistentvolumeclaim "raw-block-pvc" deleted

I will leave the btullis-pvc-tests namespace in place, but it is empty again.

root@deploy1002:~# kubectl -n btullis-pvc-tests get all
No resources found in btullis-pvc-tests namespace.

Change #1052341 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Grant the provisioner access to the ceph userID secret

https://gerrit.wikimedia.org/r/1052341

Change #1052341 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Grant the provisioner access to the ceph userID secret

https://gerrit.wikimedia.org/r/1052341

This is now working.

The test raw block device was created and bound.

root@deploy1002:/home/btullis# kubectl apply -f raw-block-pvc.yaml 
persistentvolumeclaim/raw-block-pvc created

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
raw-block-pvc   Bound    pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78   1Gi        RWO            ceph-rbd-ssd   2s

The test pod that uses this persistent volume is also working:

root@deploy1002:/home/btullis# kubectl apply -f raw-block-pod.yaml 
pod/pod-with-raw-block-volume created
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod pod-with-raw-block-volume 
Name:         pod-with-raw-block-volume
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 12:53:58 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  do-nothing:
    Container ID:  
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4fc9r (ro)
    Devices:
      /dev/xvda from data
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  raw-block-pvc
    ReadOnly:   false
  kube-api-access-4fc9r:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age   From                     Message
  ----    ------                  ----  ----                     -------
  Normal  Scheduled               15s   default-scheduler        Successfully assigned btullis-pvc-tests/pod-with-raw-block-volume to dse-k8s-worker1006.eqiad.wmnet
  Normal  SuccessfulAttachVolume  15s   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78"

We can see the block device that has been provisioned from the ceph server side:

root@cephosd1001:/etc/ceph# rbd ls dse-k8s-csi-ssd
csi-vol-018de35a-3d29-11ef-a792-be78068bb159
root@cephosd1001:/etc/ceph# rbd info dse-k8s-csi-ssd/csi-vol-018de35a-3d29-11ef-a792-be78068bb159
rbd image 'csi-vol-018de35a-3d29-11ef-a792-be78068bb159':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 8489d39cd1bb88
	block_name_prefix: rbd_data.8489d39cd1bb88
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Mon Jul  8 12:53:00 2024
	access_timestamp: Mon Jul  8 12:53:00 2024
	modify_timestamp: Mon Jul  8 12:53:00 2024

I'd like to try a few more tests, such as:

  • resizing the block device (a sketch of such a test follows this list)
  • provisioning a file system instead of a raw block device
  • deleting the pvc
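
On the resizing point, since the storageClass has allowVolumeExpansion set to true, the test could be as simple as patching the claim and checking the reported capacity (sketch only; not yet run):

kubectl -n btullis-pvc-tests patch pvc raw-block-pvc \
  -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
kubectl -n btullis-pvc-tests get pvc raw-block-pvc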

Deleting the pod and pvc worked as expected.

root@deploy1002:/home/btullis# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
raw-block-pvc   Bound    pvc-e0bf68f7-10e5-472e-a915-6e4679fc4c78   1Gi        RWO            ceph-rbd-ssd   22m

root@deploy1002:/home/btullis# kubectl delete -f raw-block-pvc.yaml 
persistentvolumeclaim "raw-block-pvc" deleted

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pvc
No resources found in btullis-pvc-tests namespace.

The image was also deleted from ceph, as expected.

root@cephosd1001:/etc/ceph# rbd ls dse-k8s-csi-ssd
root@cephosd1001:/etc/ceph#

This is expected because of the Delete reclaimPolicy on the storageClass that we have created: https://kubernetes.io/docs/concepts/storage/storage-classes/#reclaim-policy
We could choose to be more conservative about retaining volumes, if we wish to have additional safeguards.
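
If we decide to do that, it would just mean setting the reclaim policy on the storageClass (or the chart value that renders it) to Retain, roughly:

# StorageClass (or equivalent chart value)
reclaimPolicy: Retain   # keep the backing RBD image when the PVC is deleted

Retained volumes would then need manual cleanup on the Ceph side once they are no longer wanted.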

The file system based tests haven't worked yet.
I tried the following resources.

root@deploy1002:/home/btullis# cat pvc.yaml 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
root@deploy1002:/home/btullis# cat pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false

The pod is stuck in a ContainerCreating state, with the following mount warnings.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod csi-rbd-demo-pod 
Name:         csi-rbd-demo-pod
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 13:25:52 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  do-nothing:
    Container ID:  
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2tjs7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rbd-pvc
    ReadOnly:   false
  kube-api-access-2tjs7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               7m50s                  default-scheduler        Successfully assigned btullis-pvc-tests/csi-rbd-demo-pod to dse-k8s-worker1006.eqiad.wmnet
  Normal   SuccessfulAttachVolume  7m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27"
  Warning  FailedMount             5m50s                  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m41s (x8 over 5m49s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159 already exists
  Warning  FailedMount             76s (x3 over 5m48s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-2tjs7]: timed out waiting for the condition

The pv looks good, so maybe it's something as simple as not having mkfs.ext4 available in the plugin container.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pv pvc-575c9e33-4305-4da2-8e2e-ec669a637e27 
Name:            pvc-575c9e33-4305-4da2-8e2e-ec669a637e27
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: csi-rbd-secret
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: kube-system
Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection]
StorageClass:    ceph-rbd-ssd
Status:          Bound
Claim:           btullis-pvc-tests/rbd-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159
    ReadOnly:          false
    VolumeAttributes:      clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813
                           imageFeatures=layering
                           imageName=csi-vol-82644734-3d2d-11ef-a792-be78068bb159
                           journalPool=dse-k8s-csi-ssd
                           pool=dse-k8s-csi-ssd
                           storage.kubernetes.io/csiProvisionerIdentity=1720432837686-9999-rbd.csi.ceph.com
Events:                <none>

I'll check some more logs.
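
One quick way to check that theory would be to look for the binary inside the running nodeplugin container, something like (assuming the container name used in the daemonset):

kubectl -n kube-system exec ds/ceph-csi-rbd-nodeplugin -c csi-rbdplugin -- which mkfs.ext4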

Change #1052812 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow dse-k8s-worker hosts to access ceph ports

https://gerrit.wikimedia.org/r/1052812

Change #1052812 merged by Btullis:

[operations/puppet@production] Allow dse-k8s-worker hosts to access ceph ports

https://gerrit.wikimedia.org/r/1052812

Change #1052818 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1052818

Change #1052818 merged by jenkins-bot:

[operations/deployment-charts@master] cephcsi: Bump the image version

https://gerrit.wikimedia.org/r/1052818

It's working! Here is my test pod with a 1 GB ext4 file system mounted at /var/lib/www/html.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pods
NAME               READY   STATUS    RESTARTS   AGE
csi-rbd-demo-pod   1/1     Running   0          15s

Entering the pod:

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests exec -it csi-rbd-demo-pod -- bash

Showing the free space in the file system:

root@csi-rbd-demo-pod:/# df -h /var/lib/www/html/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       974M   24K  958M   1% /var/lib/www/html
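
A natural follow-up check against the acceptance criteria would be to write a file, recreate the pod, and confirm the data survives; something along these lines (hypothetical commands, not captured in this ticket):

kubectl -n btullis-pvc-tests exec csi-rbd-demo-pod -- sh -c 'echo persistence-test > /var/lib/www/html/test.txt'
kubectl -n btullis-pvc-tests delete -f pod.yaml
kubectl -n btullis-pvc-tests apply -f pod.yaml
# once the pod is Running again:
kubectl -n btullis-pvc-tests exec csi-rbd-demo-pod -- cat /var/lib/www/html/test.txt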

I had two more issues to fix before this worked.

  1. The dse-k8s-workers required the firewall to be opened to the ceph server ports. Although we had previously updated the ceph server firewall and permitted access from dse_kubepods, that didn't allow the nodeplugin component to access the ceph ports, because this container uses host networking, which makes perfect sense (see the note below). I fixed it with: Allow dse-k8s-worker hosts to access ceph ports.
  2. Once this was fixed, the next error stated that the rbd executable wasn't found in the path. I fixed that with: Add the ceph-common package, which contains the rbd binary.
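
In other words, the nodeplugin pods run in the host network namespace, so their Ceph traffic originates from the worker hosts' own addresses rather than from the pod IP range covered by the dse_kubepods rule. Simplified:

# nodeplugin daemonset pod spec (simplified)
spec:
  hostNetwork: true   # RBD map/mount traffic leaves from the worker host IP, not a pod IP
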
BTullis updated the task description.

Marking this as resolved.