We will want the metadata to use the ssd CRUSH rule.
However, we may want to be able to provision cephfs volumes with data on both the hdd and ssd device classes.
| Status | Assigned | Task |
|---|---|---|
| Resolved | Gehel | T327267 Create a DSE Kubernetes cluster with support for persistent storage from Ceph |
| Resolved | Gehel | T324660 Install Ceph Cluster for Data Platform Engineering |
| Resolved | None | T376401 Enable the cephfs services on the DPE Ceph cluster |
| Resolved | BTullis | T376405 Create the pools required for cephfs |
We have been investigating this and reviewing the documentation at:
https://docs.ceph.com/en/reef/cephfs/#getting-started-with-cephfs
https://docs.ceph.com/en/reef/cephfs/fs-volumes/
This refers to the volume plugin for the ceph-mgr component, which seems to be the more recent way of working with file systems, compared with the lower-level commands described in https://docs.ceph.com/en/reef/cephfs/createfs/
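For reference, the lower-level route would be to create the pools and the file system explicitly, roughly as follows (a sketch only, with placeholder names; we used the volumes interface instead):

sudo ceph osd pool create <name>.meta
sudo ceph osd pool create <name>.data
sudo ceph fs new <name> <name>.meta <name>.data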
We have already deployed five MDS daemons, each of which is named ceph-mds@$hostname.
So we can create a file system using the command sudo ceph fs volume create $name.
In this case we started with:
btullis@cephosd1001:~$ sudo ceph fs volume create cephfs
Volume created successfully (no MDS daemons created)
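At this point we could also confirm that the file system exists (a quick check; output not captured here):

sudo ceph fs ls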
This changed the output of the ceph -s command to show the active and standby MDS daemons.
btullis@cephosd1001:~$ sudo ceph -s
cluster:
id: 6d4278e1-ea45-4d29-86fe-85b44c150813
health: HEALTH_OK
services:
mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 6w)
mgr: cephosd1005(active, since 2w), standbys: cephosd1003, cephosd1004, cephosd1001, cephosd1002
mds: 1/1 daemons up, 4 standby
osd: 100 osds: 100 up (since 6w), 100 in (since 5M)
rgw: 5 daemons active (5 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 11 pools, 3313 pgs
objects: 5.37k objects, 5.6 GiB
usage: 28 TiB used, 1.1 PiB / 1.1 PiB avail
pgs: 3313 active+clean
io:
client:   19 KiB/s wr, 0 op/s rd, 2 op/s wr

We can see that it also created two new pools.
btullis@cephosd1001:~$ sudo ceph osd pool ls
.mgr
dse-k8s-csi-ssd
.rgw.root
eqiad.rgw.log
eqiad.rgw.control
eqiad.rgw.meta
eqiad.rgw.buckets.index
eqiad.rgw.buckets.data
eqiad.rgw.buckets.non-ec
cephfs.cephfs.meta
cephfs.cephfs.data
After this, we decided to rename the file system from cephfs to dpe to reduce confusion.
btullis@cephosd1001:~$ sudo ceph fs volume rename cephfs dpe --yes-i-really-mean-it
FS volume 'cephfs' renamed to 'dpe'
btullis@cephosd1001:~$ sudo ceph osd pool ls
.mgr
dse-k8s-csi-ssd
.rgw.root
eqiad.rgw.log
eqiad.rgw.control
eqiad.rgw.meta
eqiad.rgw.buckets.index
eqiad.rgw.buckets.data
eqiad.rgw.buckets.non-ec
cephfs.dpe.meta
cephfs.dpe.data
The next step will be for us to ensure that we have two pools to use for the file system data storage. One will be associated with the hdd CRUSH rule, and the other will be associated with the ssd CRUSH rule.
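We can list the available CRUSH rules to confirm that the hdd and ssd rules both exist (a quick check; output not captured here):

sudo ceph osd crush rule ls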
Referring to the documentation here:
https://docs.ceph.com/en/reef/cephfs/createfs/#creating-pools
We recommend the fastest feasible low-latency storage devices (NVMe, Optane, or at the very least SAS/SATA SSD) for the metadata pool, as this will directly affect the latency of client file system operations.
We can achieve this with:
btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dpe.meta crush_rule ssd
set pool 15 crush_rule to ssd
The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery. For this reason, all CephFS inodes have at least one object in the default data pool. If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces.
Since it can directly affect the overall performance of the file system, we will ensure that the default data pool is also on the SSDs.
btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dpe.data crush_rule ssd
set pool 16 crush_rule to ssd
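We can double-check both assignments in the same way (the equivalent check for the hdd pool is shown further below):

sudo ceph osd pool get cephfs.dpe.meta crush_rule
sudo ceph osd pool get cephfs.dpe.data crush_rule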
Then we will add a new pool that we can use to ensure that data is written to the hdd devices, by means of a file layout and/or a subvolume group (sketched below).
btullis@cephosd1001:~$ sudo ceph osd pool create cephfs.dpe.data-hdd
pool 'cephfs.dpe.data-hdd' created
We need to associate this pool with the cephfs application, since that doesn't happen automatically when we create it manually.
btullis@cephosd1001:~$ sudo ceph osd pool application enable cephfs.dpe.data-hdd cephfs
enabled application 'cephfs' on pool 'cephfs.dpe.data-hdd'
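We can confirm the application tag with (output not captured here):

sudo ceph osd pool application get cephfs.dpe.data-hdd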
This pool already uses the hdd CRUSH rule.
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dpe.data-hdd crush_rule
crush_rule: hdd
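Before this pool can actually receive file system data, we expect to have to attach it to the file system as a secondary data pool, then target it with a subvolume group or a file layout. A sketch of what that might look like (the group name hdd and the mount path are hypothetical):

sudo ceph fs add_data_pool dpe cephfs.dpe.data-hdd
# Either pin a subvolume group to the pool:
sudo ceph fs subvolumegroup create dpe hdd --pool_layout cephfs.dpe.data-hdd
# Or set a file layout on a directory of a mounted file system:
setfattr -n ceph.dir.layout.pool -v cephfs.dpe.data-hdd /mnt/dpe/some-directory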
For clarity, we will rename the cephfs.dpe.data pool to cephfs.dpe.data-ssd.
btullis@cephosd1001:~$ sudo ceph osd pool rename cephfs.dpe.data cephfs.dpe.data-ssd
pool 'cephfs.dpe.data' renamed to 'cephfs.dpe.data-ssd'
The pools listing is now as follows:
btullis@cephosd1001:~$ sudo ceph osd lspools
2 .mgr
7 dse-k8s-csi-ssd
8 .rgw.root
9 eqiad.rgw.log
10 eqiad.rgw.control
11 eqiad.rgw.meta
12 eqiad.rgw.buckets.index
13 eqiad.rgw.buckets.data
14 eqiad.rgw.buckets.non-ec
15 cephfs.dpe.meta
16 cephfs.dpe.data-ssd
17 cephfs.dpe.data-hdd
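Client access to the file system is governed by cephx capabilities. Granting them for a CephFS client looks roughly like this (a sketch; client.dpe-test is a hypothetical client name):

sudo ceph fs authorize dpe client.dpe-test / rw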
After a couple of goes at getting the cephx authentication keys correct, we can now see the file system info.
btullis@cephosd1001:~$ sudo ceph fs volume info dpe
{
"mon_addrs": [
"10.64.130.13:6789",
"10.64.131.21:6789",
"10.64.132.23:6789",
"10.64.134.12:6789",
"10.64.135.21:6789"
],
"pools": {
"data": [
{
"avail": 48615049396224,
"name": "cephfs.dpe.data-ssd",
"used": 0
}
],
"metadata": [
{
"avail": 48615049396224,
"name": "cephfs.dpe.meta",
"used": 98304
}
]
}
}

We have zero subvolumes in this file system, which is what we are expecting.
btullis@cephosd1001:~$ sudo ceph fs subvolume ls dpe
[]
We believe that the ceph-csi-cephfs plugin will create subvolumes that are then made available as persistent volumes to the application pods.
We may have different storage classes that use different data placement pools.
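If so, the provisioning step should be roughly equivalent to creating a subvolume by hand, along these lines (a sketch; the subvolume name test-subvol and group name hdd are hypothetical, and the group would need to exist first):

sudo ceph fs subvolume create dpe test-subvol --group_name hdd
sudo ceph fs subvolume getpath dpe test-subvol --group_name hdd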