
Create the pools required for cephfs
Closed, Resolved · Public

Description

We will want the metadata to use the ssd CRUSH rule.

However, we may want to be able to provision cephfs volumes with data on both the hdd and ssd device classes.
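
For reference, the device classes and CRUSH rules available on the cluster can be listed with the commands below. This is only a sanity check; the hdd and ssd rules mentioned here are assumed to exist on our cluster already.

sudo ceph osd crush class ls        # device classes known to the cluster (e.g. hdd, ssd)
sudo ceph osd crush rule ls         # CRUSH rules that pools can be assigned to
sudo ceph osd crush rule dump ssd   # inspect a rule to confirm which device class it targets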

Event Timeline

BTullis triaged this task as High priority. Oct 3 2024, 5:09 PM

We have been investigating this and reviewing the documentation at:

https://docs.ceph.com/en/reef/cephfs/#getting-started-with-cephfs
https://docs.ceph.com/en/reef/cephfs/fs-volumes/

This refers to the volume plugin for the ceph mgr component, which seems to be the more recent way of working with file systems, compared with the lower-level commands in https://docs.ceph.com/en/reef/cephfs/createfs/
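
For comparison, a rough sketch of what the lower-level approach from the createfs documentation would look like (we did not run these; the pool names are illustrative):

# Create the metadata and data pools explicitly, then assemble the file system from them.
sudo ceph osd pool create cephfs_metadata
sudo ceph osd pool create cephfs_data
sudo ceph fs new <fs_name> cephfs_metadata cephfs_data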

We have already deployed five MDS daemons, each of which is named ceph-mds@$hostname.

So we can create a file system using the command sudo ceph fs volume create $name

In this case we started with: sudo ceph fs volume create cephfs

btullis@cephosd1001:~$ sudo ceph fs volume create cephfs
Volume created successfully (no MDS daemons created)

This modified the output of the ceph -s command to show the active MDS daemons.

btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 6w)
    mgr: cephosd1005(active, since 2w), standbys: cephosd1003, cephosd1004, cephosd1001, cephosd1002
    mds: 1/1 daemons up, 4 standby
    osd: 100 osds: 100 up (since 6w), 100 in (since 5M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 3313 pgs
    objects: 5.37k objects, 5.6 GiB
    usage:   28 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     3313 active+clean
 
  io:
    client:   19 KiB/s wr, 0 op/s rd, 2 op/s wr

We can see that it also created two new pools.

btullis@cephosd1001:~$ sudo ceph osd pool ls
.mgr
dse-k8s-csi-ssd
.rgw.root
eqiad.rgw.log
eqiad.rgw.control
eqiad.rgw.meta
eqiad.rgw.buckets.index
eqiad.rgw.buckets.data
eqiad.rgw.buckets.non-ec
cephfs.cephfs.meta
cephfs.cephfs.data

After this, we decided to rename the file system from cephfs to dpe to reduce confusion.

btullis@cephosd1001:~$ sudo ceph fs volume rename cephfs dpe --yes-i-really-mean-it
FS volume 'cephfs' renamed to 'dpe'
btullis@cephosd1001:~$ sudo ceph osd pool ls
.mgr
dse-k8s-csi-ssd
.rgw.root
eqiad.rgw.log
eqiad.rgw.control
eqiad.rgw.meta
eqiad.rgw.buckets.index
eqiad.rgw.buckets.data
eqiad.rgw.buckets.non-ec
cephfs.dpe.meta
cephfs.dpe.data

The next step will be for us to ensure that we have two pools to use for the file system data storage. One will be associated with the hdd CRUSH rule, and the other will be associated with the ssd CRUSH rule.

Referring to the documentation here:
https://docs.ceph.com/en/reef/cephfs/createfs/#creating-pools

We recommend the fastest feasible low-latency storage devices (NVMe, Optane, or at the very least SAS/SATA SSD) for the metadata pool, as this will directly affect the latency of client file system operations.

We can achieve this with:

btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dpe.meta crush_rule ssd
set pool 15 crush_rule to ssd
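
Although we did not capture it in the session above, the change can be verified with the same get subcommand that we use later for the hdd pool:

sudo ceph osd pool get cephfs.dpe.meta crush_rule   # should now report: crush_rule: ssd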

The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery. For this reason, all CephFS inodes have at least one object in the default data pool. If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces.

Since it can directly affect the overall performance of the file system, we will ensure that the default data pool is also on the SSDs.

btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dpe.data crush_rule ssd
set pool 16 crush_rule to ssd

Then we will add a new pool that we can use to ensure that data is written to the hdd devices, by means of a file layout and/or a subvolume group (sketched below).

btullis@cephosd1001:~$ sudo ceph osd pool create cephfs.dpe.data-hdd
pool 'cephfs.dpe.data-hdd' created

We need to associate this pool with the cephfs application, since that doesn't happen automatically when we create it manually.

btullis@cephosd1001:~$ sudo ceph osd pool application enable cephfs.dpe.data-hdd cephfs
enabled application 'cephfs' on pool 'cephfs.dpe.data-hdd'

This pool already uses the hdd crush rule.

btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dpe.data-hdd crush_rule
crush_rule: hdd
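
Note that enabling the cephfs application does not, by itself, attach the pool to the file system. Before a file layout or subvolume group can place data in it, the pool also needs to be added as a data pool of the dpe file system. A sketch of the two placement options follows; the group name and directory path are illustrative only:

# Attach the new pool to the file system as an additional data pool.
sudo ceph fs add_data_pool dpe cephfs.dpe.data-hdd

# Option 1: a subvolume group whose subvolumes are laid out on the hdd pool.
sudo ceph fs subvolumegroup create dpe hdd-group --pool_layout cephfs.dpe.data-hdd

# Option 2: a per-directory file layout, set from a client that has the file system mounted.
setfattr -n ceph.dir.layout.pool -v cephfs.dpe.data-hdd /mnt/dpe/some-directory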

For clarity, we will rename the cephfs.dpe.data pool to cephfs.dpe.data-ssd.

btullis@cephosd1001:~$ sudo ceph osd pool rename cephfs.dpe.data cephfs.dpe.data-ssd
pool 'cephfs.dpe.data' renamed to 'cephfs.dpe.data-ssd'

The pools listing is now as follows:

btullis@cephosd1001:~$ sudo ceph osd lspools
2 .mgr
7 dse-k8s-csi-ssd
8 .rgw.root
9 eqiad.rgw.log
10 eqiad.rgw.control
11 eqiad.rgw.meta
12 eqiad.rgw.buckets.index
13 eqiad.rgw.buckets.data
14 eqiad.rgw.buckets.non-ec
15 cephfs.dpe.meta
16 cephfs.dpe.data-ssd
17 cephfs.dpe.data-hdd

After a couple of goes at getting the cephx authentication keys correct (a sketch of the authorization command follows the output below), we can now see the file system info.

btullis@cephosd1001:~$ sudo ceph fs volume info dpe
{
    "mon_addrs": [
        "10.64.130.13:6789",
        "10.64.131.21:6789",
        "10.64.132.23:6789",
        "10.64.134.12:6789",
        "10.64.135.21:6789"
    ],
    "pools": {
        "data": [
            {
                "avail": 48615049396224,
                "name": "cephfs.dpe.data-ssd",
                "used": 0
            }
        ],
        "metadata": [
            {
                "avail": 48615049396224,
                "name": "cephfs.dpe.meta",
                "used": 98304
            }
        ]
    }
}
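
For reference, since the exact capabilities we settled on are not shown above, the simplest way to mint a cephx key scoped to a single file system is the fs authorize helper (the client name and path here are placeholders):

sudo ceph fs authorize dpe client.dpe-example / rw   # grants the mon/mds/osd caps needed for read/write access to the dpe file system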

We have zero subvolumes in this file system, which is what we expect.

btullis@cephosd1001:~$ sudo ceph fs subvolume ls dpe
[]

We believe that the ceph-csi-cephfs plugin will create subvolumes that are then made available as persistent volumes to the application pods.
We may have different storage classes that use different data placement pools.
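
If we want to confirm that behaviour by hand before wiring up the CSI plugin, a subvolume can be created, inspected, and removed manually (the subvolume and group names below are only examples):

sudo ceph fs subvolume create dpe test-subvol --group_name hdd-group   # group from the earlier sketch; --group_name is optional
sudo ceph fs subvolume getpath dpe test-subvol --group_name hdd-group  # the path a client (or the CSI plugin) would mount
sudo ceph fs subvolume rm dpe test-subvol --group_name hdd-group       # clean up the test subvolume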

BTullis claimed this task.