
Decide on installation details for new ceph cluster
Closed, Resolved (Public)

Description

We need to make several decisions about how to install the new ceph cluster.

These decisions include:

  • Which version of ceph should we target? - Quincy - That's 17.2.5 at the time of writing.
  • Should we use packages or containers? Packages
  • Where exactly do we get our Ceph builds? download.ceph.com
  • What installation and bootstrapping method will we use?
  • What will the pool names be? 4 initial pools configured for RBD
  • What will the replication settings and/or erasure coding settings be for the pools? Currently evaluating erasure coding for RBD, with replicated pools for metadata
  • How many placement groups should be configured for each pool? Initial settings: 1200 PGs for the HDD pools, 800 for the SSD pools. Autotuning enabled
  • Should we add buckets for row and rack to the CRUSH maps now? Yes

Event Timeline

BTullis triaged this task as Medium priority.Jan 13 2023, 11:25 AM
BTullis created this task.
BTullis moved this task from Backlog to Shared Data Infra on the Data-Engineering-Planning board.

OK, so first things first, we have to understand the release cycle.

Since the release of version 12 (Luminous), Ceph has changed its release cycle so that there is now a new stable release every year. Previously, every other release was a long-term support (LTS) release.

image.png (525×845 px, 47 KB)

https://docs.ceph.com/en/latest/releases/#ceph-releases-index

They target a new stable release for around March of each year and provide support for it for a little over two years. So currently there are two major versions of Ceph supported:

image.png (234×848 px, 34 KB)

https://docs.ceph.com/en/latest/releases/#active-releases

The minor version of 2 (i.e. 16.2.z or 17.2.z) indicates that this is a stable release, intended for production, as per the docs:

Version numbers have three components, x.y.z. x identifies the release cycle (e.g., 13 for Mimic). y identifies the release type:

  • x.0.z - development versions
  • x.1.z - release candidates (for test clusters, brave users)
  • x.2.z - stable/bugfix releases (for users)
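So, for example, 17.2.5 is point release z=5 of the stable (y=2) series in the Quincy (x=17) release cycle.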

Whilst we could choose Pacific, it will be EOL by June 2023 and I can't see any benefit to selecting the older version.

Therefore, I'm happy to target Quincy now, which will last us until June 2024 with only point releases.

However, there should also be another version released around March 2023, so we should bear that in mind and consider upgrading to it at an early stage.
I think that ensuring we have robust processes in place to help us keep in sync with this annual release cycle is probably a good idea.

Looking at the existing WMCS Ceph cluster, I can see that this uses packages built by croit.io

btullis@cloudcephmon1001:~$ apt-cache policy ceph
ceph:
  Installed: 15.2.16-1~bpo10+1
  Candidate: 15.2.16-1~bpo10+1
  Version table:
 *** 15.2.16-1~bpo10+1 1003
       1003 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/ceph-octopus amd64 Packages
        100 /var/lib/dpkg/status
     14.2.21-1~bpo10+1 100
        100 http://mirrors.wikimedia.org/debian buster-backports/main amd64 Packages
     12.2.11+dfsg1-2.1 500
        500 http://mirrors.wikimedia.org/debian buster/main amd64 Packages

These repositories are defined here: https://github.com/wikimedia/puppet/blob/production/modules/aptrepo/files/updates#L270-L288
and come from https://mirror.croit.io/debian-octopus

They do have packages for quincy, so one option is to use these: https://mirror.croit.io/debian-quincy/

However, ceph.com themselves also build Debian packages for recent versions of Ceph: https://docs.ceph.com/en/latest/install/get-packages/#debian-packages so this is another option, if we decide to use packages.
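For reference, the upstream repository definition for that second option would look something like this (a sketch only, assuming Debian bullseye; in practice we would mirror it via our own apt infrastructure, and the release key is published at https://download.ceph.com/keys/release.asc):

# /etc/apt/sources.list.d/ceph.list
deb https://download.ceph.com/debian-quincy/ bullseye main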

I have been reading about progress with the way in which Ceph is containerized lately.

My initial research has been based on these resources:

The official containers are only served by https://quay.io/repository/ceph/ceph
Those at https://hub.docker.com/u/ceph are no longer receiving updates and are considered legacy.

I also discovered a fascinating thread started by our very own @MatthewVernon on the ceph mailing list entitled: Why you might want packages not containers for Ceph deployments.

Ultimately, I think I'm coming down on the side of using packages from download.ceph.com for this cluster.

Whilst the combination of cephadm and containers is compelling in many ways, the official images wouldn't be compatible with our security policies.

I looked at the build process for the official container images, but they specifically removed Debian and Ubuntu support from this mechanism back in 2018.

See https://github.com/ceph/ceph-container#stable-images and https://github.com/ceph/ceph-container/commit/c346f86409ab86688f4dbec923f14c3db95a9940

I did try some minor modifications in order to try to get the remnants of the Debian-based build commands to work, such as:

echo python3-sklearn > ceph-releases/ALL/debian/daemon-base/__SCIKIT_LEARN__
make FLAVORS="pacific,debian,9" build

...but they fail with various apt package verification errors while trying to pull in ceph-ansible packages from http://obs.linaro.org/ERP and Ceph packages from http://download.ceph.com.

All in all, my impression is that it would be a lot of work to re-engineer a Debian-based build process for Ceph containers and that any effort we put in to re-add Debian support to the upstream scripts would probably not be terribly welcome.

I welcome any other points of view, but at the moment I would be more inclined to start out by using packages from http://download.ceph.com for this cluster and to try to re-use our existing puppet manifests as much as possible.

Change 880461 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a third-party apt repo for ceph-quincy packages

https://gerrit.wikimedia.org/r/880461

I'll tentatively update the description to say that the decision has been made to proceed with packages, but I'm still open to other opinions if anyone feels strongly about it.

I do think upstream haven't covered themselves in glory here - only cephadm fully supports the orchestrator API and dashboard features, and cephadm only supports containerised deployments. And, as you say above, it seems that there is limited effort being expended on making containers buildable on a number of the platforms that Ceph is supported on (AFAICT only CentOS and openSUSE are supported).

This is a bit sad - it seems like if we want to be able to build our own Ceph (local fixes, never mind our container policy) then we can't go with containers, and we can't get the benefits of recent upstream improvements to cluster management if we don't go with containers. [the philosophical question of whether this means Ceph is only properly Free Software if you run centos/SUSE doesn't fit in the margin]

AFAICT there are only ansible-, salt- and juju-based management solutions with some sort of upstream/distro blessing? So Puppet & packages puts us rather further from the beaten path than is ideal.

One further thought - might we want to (be able to) build our own packages? Upstream haven't always got their release process Just Right, and sometimes Canonical's Ubuntu cloud archive packages have had useful backport fixes.

Thanks for those insights @MatthewVernon - I think I'll go ahead and try the packages from download.ceph.com for now. We can certainly switch to croit.io easily if we find that there is any issue with the upstream packages from ceph.com.

As for building our own packages and/or containers, I certainly wouldn't rule it out. It is definitely a shame to miss out on those cluster orchestration and dashboard features, but we can always bear in mind the possibility of switching to containers in future, if we identify a clear requirement.

Change 880939 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Rename ceph roles and profiles to cloudceph

https://gerrit.wikimedia.org/r/880939

Change 880949 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Duplicate existing secrets for profile::ceph to profile::cloudceph

https://gerrit.wikimedia.org/r/880949

Change 880949 merged by Btullis:

[labs/private@master] Duplicate existing secrets for profile::ceph to profile::cloudceph

https://gerrit.wikimedia.org/r/880949

Having examined the puppet manifests that we have for ceph, I believe that we can re-use the ceph module relatively easily as it is.

However, the profile::ceph::* classes are unfortunately quite specific to the use case of the WMCS cluster, so I think that we will have to make new profiles for the new cluster.

I have suggested renaming profile::ceph and everything beneath to profile::cloudceph so that it more closely matches the hostnames of the servers to which they are applied.
https://gerrit.wikimedia.org/r/880939

Ideally we would be able to share the same profiles between clusters, so whatever we write for the new one we will try to craft in such a way that it is applicable generically.

There is another option in terms of the configuration mechanism that I hadn't previously considered, which is to use a third-party puppet module for ceph.

The potential benefit of doing so would be to take advantage of the additional development effort that has gone into adding functionality to the module.

The potential risk is introducing additional complexity into our puppet codebase, for functionality that we might not need.

I'm going to carry out a brief review of the functionality offered by this puppet-ceph module, which is maintained by OpenStack here: https://opendev.org/openstack/puppet-ceph and published with an Apache 2.0 license.

I'll compare it with the functionality offered by our existing ceph module. Then we'll be in a better position to compare the benefits of using either solution.

Here's the initial comparison of functionality. I added the WMCS cookbooks for ceph as well.

Functionality compared across the WMF ceph module, the WMF ceph cookbooks, and the OpenStack ceph module:

  • Create ceph user/group (the OpenStack version uses package scripts to create the user/group)
  • Install ceph packages
  • Manage /etc/ceph/ceph.conf (more parameters in the OpenStack module vs. a file template in the WMF module)
  • Cephx key and keyring management
  • Install and configure mgr daemons
  • Install and configure mon daemons
  • Install and configure osd daemons
  • Install and configure mds daemons
  • Install and configure radosgw daemons
  • Create ceph filesystem
  • Create rbd pool
  • Create radosgw user
  • Configure rbd client (mount point)
  • Configure cephfs client (mount point)
  • Configure rbd mirror
  • Roll-restart mons
  • Roll-reboot mons
  • Roll-restart osds
  • Roll-reboot osds
  • Upgrade osds
  • Upgrade mons
  • Set/unset cluster maintenance

I agree that taking advantage of existing Free Software (and then contributing to it) seems like a good idea.
I note no mention of the Ceph dashboard; or settings done by CLI rather than ceph.conf (e.g. ceph config set mon auth_allow_insecure_global_id_reclaim false which I've used in the past), though the latter is probably a niche need.

...settings done by CLI rather than ceph.conf (e.g. ceph config set mon auth_allow_insecure_global_id_reclaim false which I've used in the past), though the latter is probably a niche need.

Yes, I agree that this feature of running arbitrary ceph commands is missing from that module, but we would probably want to write this into a cookbook anyway, wouldn't we? For example, the way that the cluster maintenance mode is set: https://github.com/wikimedia/cloud-wmcs-cookbooks/blob/main/wmcs_libs/ceph.py#L476
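For context, that maintenance mode essentially boils down to setting and clearing OSD map flags, so a rough shell equivalent might look like this (a sketch, assuming noout and norebalance are the flags we care about):

sudo ceph osd set noout
sudo ceph osd set norebalance
# ...carry out the maintenance work...
sudo ceph osd unset norebalance
sudo ceph osd unset noout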

I think that the ceph_config provider is very powerful though: https://wiki.openstack.org/wiki/Puppet/ceph-blueprint#conf
It effectively allows us to put any parameter into the right section of /etc/ceph/ceph.conf
For equivalent functionality on our own module we would need to rewrite the template, which is not too difficult, but definitely not as clean.

I think that having the OSD definitions in hiera, along with their bluestore_wal and bluestore_db paths, would be particularly useful.
I'm not super-keen on the way that this is done in the cookbook so this feature seems like it would be beneficial.

With regard to the dashboard, perhaps we could just enable this with the ceph_config parameter injection, like this?
https://docs.ceph.com/en/quincy/mgr/dashboard/#host-name-and-port
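Based on that page, enabling it would be something like the following (a sketch only; the bind address and port are placeholders):

sudo ceph mgr module enable dashboard
sudo ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
sudo ceph config set mgr mgr/dashboard/server_port 8080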

Change 881670 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Correct the ceph mgr and mon keys in codfw

https://gerrit.wikimedia.org/r/881670

Change 881670 merged by Btullis:

[labs/private@master] Correct the ceph mgr and mon keys in codfw

https://gerrit.wikimedia.org/r/881670

BTullis updated the task description. (Show Details)

Having carried out a review of the puppet-ceph module, it looks like it would be a little too difficult to vendor it into our puppet codebase.

The primary reason is that it has several dependencies that would also need to be vendored in, some of which would clash with our own modules.
https://opendev.org/openstack/puppet-ceph/src/branch/master/metadata.json#L50-L77

Problematic dependencies due to namespace clashes include:

  • puppetlabs/apache
  • puppetlabs/apt

The concat and stdlib libraries are already included, but openstack/openstacklib and openstack/keystone each bring in their own set of dependencies, and so the problem becomes increasingly intractable.

I will therefore look to use and build upon our existing puppet module, albeit that I will look to the third-party module for solutions to specific problems.

Change 880939 merged by Btullis:

[operations/puppet@production] Rename ceph profiles to cloudceph

https://gerrit.wikimedia.org/r/880939

Moving to paused while we work on installing the packages/services. We can continue to make decisions as we go, but this isn't a blocker for making progress.

Change 880461 merged by Btullis:

[operations/puppet@production] Add a third-party apt repo for ceph-quincy packages

https://gerrit.wikimedia.org/r/880461

Mentioned in SAL (#wikimedia-analytics) [2023-02-02T09:52:39Z] <btullis> beginning a rolling reboot of all aqs servers for T326945

Change 886842 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] aptrepo: Add missing Suite for ceph-quincy

https://gerrit.wikimedia.org/r/886842

Change 886842 merged by Vgutierrez:

[operations/puppet@production] aptrepo: Add missing Suite for ceph-quincy

https://gerrit.wikimedia.org/r/886842

Moving this task to in-progress, so that I can use it to record the pool creation and the related crush rules.

We now have raw storage available in this ceph cluster:

btullis@cephosd1001:~$ sudo ceph df
--- RAW STORAGE ---
CLASS      SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1010 TiB  982 TiB   28 TiB    28 TiB       2.77
ssd     140 TiB  140 TiB  843 MiB   843 MiB          0
TOTAL   1.1 PiB  1.1 PiB   28 TiB    28 TiB       2.43
 
--- POOLS ---
POOL  ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr   2    1  1.9 MiB        2  1.9 MiB      0    354 TiB

We need at least one pool for each of the two storage classes (hdd and ssd).

Currently, the only pool we have is called .mgr and this is an internal pool used by the ceph command itself. The . prefix indicates that this is not for user data.

btullis@cephosd1001:~$ sudo ceph osd pool ls detail
pool 2 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 805 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
Key questions
  • How should we name the two pools based on the two storage classes?
  • What should the crush rules be?
    • Should these pools be replicated or erasure coded?
    • If they are to be replicated, should there be 2 replicas or 3?
    • If erasure coded, what profile should they use?
  • How many placement groups should be used in each pool?
    • Should the placement groups use Autoscaling or not?
  • Which applications (Principally: rbd and radosgw) should be associated with which pools?
    • We have not currently catered for cephfs, but it is still a possibility.
    • Should we associate both storage classes with both applications, requiring four pools?

I'd like to bring in @JAllemandou at this point, to make sure that he's aware of the decisions and potential pros/cons of any decisions that are made now, and so that he can offer advice.

Current situation

Currently we only have one crush rule, called replicated_rule.

btullis@cephosd1001:~$ sudo ceph osd crush rule ls
replicated_rule
btullis@cephosd1001:~$ sudo ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

We already have two device classes, named hdd and ssd, and the relevant devices are already allocated to these storage classes.
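If we want to double-check those assignments, something like the following should list the device classes and the OSDs assigned to each (a quick sketch, not output from this cluster):

sudo ceph osd crush class ls
sudo ceph osd crush class ls-osd ssd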

Ah, it occurs to me that radosgw also makes use of multiple pools anyway, so my previous comment about requiring a single pool for radosgw is complete bunk.

However, with radosgw there is the concept of placement targets, which control which pools are associated with a particular bucket.

Each placement target defines the following:

  1. An index_pool - which contains the bucket indices.
  2. A data_extra_pool - which contains metadata about incomplete multipart uploads.
  3. A data_pool for each storage class

Then we can have pools for each of the storage classes (hdd and ssd), which are set as available storage classes within the default_placement target.

I suggest that we select a multi-zone configuration, which is the simplest of the multi-site configurations.

We could then have:

  • One zonegroup - named: wmnet
  • Two zones - named: eqiad and codfw

If we went with this approach, we would need the following metadata pools to support radosgw on this cluster:

  • .rgw.root
  • eqiad.rgw.control
  • eqiad.rgw.meta
  • eqiad.rgw.log

Then we could have a placement target that selects the SSDs

  • eqiad.rgw.ssd-buckets.index
  • eqiad.rgw.ssd-buckets.data
  • eqiad.rgw.ssd-buckets.data_extra

...and a placement target that selects the HDDs

  • eqiad.rgw.hdd-buckets.index
  • eqiad.rgw.hdd-buckets.data
  • eqiad.rgw.hdd-buckets.data_extra

Our default configuration would be to select the HDDs (for capacity) but it would be possible to select the SSDs when creating a bucket.

It's also possible to use S3 Bucket Lifecycle rules to automate the transition of data between buckets.
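For illustration, bootstrapping that layout with radosgw-admin would look roughly like the following (a sketch only, not yet run against this cluster; the realm name wmf is a placeholder, and the pool names are those proposed above):

sudo radosgw-admin realm create --rgw-realm=wmf --default
sudo radosgw-admin zonegroup create --rgw-zonegroup=wmnet --master --default
sudo radosgw-admin zone create --rgw-zonegroup=wmnet --rgw-zone=eqiad --master --default
sudo radosgw-admin zonegroup placement add --rgw-zonegroup=wmnet --placement-id=hdd-buckets
sudo radosgw-admin zone placement add --rgw-zone=eqiad --placement-id=hdd-buckets \
    --index-pool=eqiad.rgw.hdd-buckets.index \
    --data-pool=eqiad.rgw.hdd-buckets.data \
    --data-extra-pool=eqiad.rgw.hdd-buckets.data_extra
sudo radosgw-admin period update --commit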

So in summary, I don't think we need to think about the radosgw pool configuration much more just at the moment.
We can get on with installing radosgw and adding the pools at that time as part of T330152: Deploy ceph radosgw processes to data-engineering cluster

I think I'll start by going with erasure coding for RBD on both device classes, using the values k=3 and m=2
This gives a 60% efficiency in storage usage, which is greater than either of the replicated (2 or 3 replicas) options.
There is a likelihood of lower performance compared with a replicated approach, but greater storage efficiency.
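To spell out the arithmetic: usable capacity for an erasure coded pool is k / (k + m) of raw, so 3 / (3 + 2) = 60%, compared with 1/2 = 50% for two replicas and 1/3 ≈ 33% for three.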

I created the two erasure code profiles with:

btullis@cephosd1001:~$ sudo ceph osd erasure-code-profile set ec32-ssd k=3 m=2 crush-failure-domain=host plugin=jerasure technique=reed_sol_van crush-device-class=ssd
btullis@cephosd1001:~$ sudo ceph osd erasure-code-profile get ec32-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8
btullis@cephosd1001:~$ sudo ceph osd erasure-code-profile set ec32-hdd k=3 m=2 crush-failure-domain=host plugin=jerasure technique=reed_sol_van crush-device-class=hdd
btullis@cephosd1001:~$ sudo ceph osd erasure-code-profile get ec32-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8

We can now see that there are three erasure code profiles in total.

btullis@cephosd1001:~$ sudo ceph osd erasure-code-profile ls
default
ec32-hdd
ec32-ssd

FWIW, I've tended to view disk as cheap and complexity as expensive, so I have used replicated (n=3) in the past.

[are you intending the RGW service to be general-purpose?]

...but I'd be inclined to always put bucket indexes and suchlike on fast storage even if the objects themselves are on spinning disks.

[are you intending the RGW service to be general-purpose?]

Thanks Matthew, I appreciate your viewpoint.

FWIW, I've tended to view disk as cheap and complexity as expensive, so I have used replicated (n=3) in the past.

I too have tended to stick to replicated pools with n=3 in the past.
That's realistically what we have with hadoop at the moment (dfs.replication: 3) too.

I suppose my thinking is that while disks might be cheap, lead-times for more servers are relatively long.
So for both the HDD and SSD storage classes it just strikes me as a shame to commission them with 33% storage efficiency, when 60% might serve us just as well.

However, at the moment we're still in more of an experimental phase anyway, so I think it would be fairly cheap for us to delete all of the pools and start again with new settings.

...but I'd be inclined to always put bucket indexes and suchlike on fast storage even if the objects themselves are on spinning disks.

Yes, I think you're right. So what I said above (T326945#9041188) about making a eqiad.rgw.hdd-buckets.index pool doesn't really hold up unless the indices themselves ever become very large. Better to keep them completely on the SSDs.
That was definitely what I was thinking about for the smaller pools like eqiad.rgw.control, eqiad.rgw.meta and such like.

[are you intending the RGW service to be general-purpose?]

Good question. Yes, I think so. One of the original drivers for the project was about evaluating S3 as an alternative to HDFS for analytics data sets.
Our HDFS cluster is around 3.3 PB in capacity at the moment (2.75 PB used) but this ceph cluster only has around 1.1 PB of raw capacity that is backed by HDDs at the moment, so there's no way we could do a full migration as it is.
I'm aiming to make the available capacity large enough to provide utility and flexibility, so if I initially commission the HDDs at 60% storage efficiency, that's around 650 TB of space.

With the SSDs, I'm working on the premise that these will largely be used with Kubernetes for PersistentVolumeClaim support, rather than large analytics data sets.

I created two new crush rules to use with any replicated pools.

btullis@cephosd1001:~$ sudo ceph osd crush rule create-replicated hdd default host hdd
btullis@cephosd1001:~$ sudo ceph osd crush rule create-replicated ssd default host ssd
btullis@cephosd1001:~$ sudo ceph osd crush rule ls
replicated_rule
hdd
ssd
btullis@cephosd1001:~$ sudo ceph osd crush rule dump hdd
{
    "rule_id": 1,
    "rule_name": "hdd",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -4,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

btullis@cephosd1001:~$ sudo ceph osd crush rule dump ssd
{
    "rule_id": 2,
    "rule_name": "ssd",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -6,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

[are you intending the RGW service to be general-purpose?]

Good question. Yes, I think so. One of the original drivers for the project was about evaluating S3 as an alternative to HDFS for analytics data sets.
Our HDFS cluster is around 3.3 PB in capacity at the moment (2.75 PB used) but this ceph cluster only has around 1.1 PB of raw capacity that is backed by HDDs at the moment, so there's no way we could do a full migration as it is.
I'm aiming to make the available capacity large enough to provide utility and flexibility, so if I initially commission the HDDs at 60% storage efficiency, that's around 650 TB of space.

That's interesting; we are expecting to have enough hardware for pilot-MOSS this quarter, so we'll be looking at deployment/POC for that in Q2 with a bit of luck; I guess it might be worth considering whether it should still be a separate system at some point.

Apropos your CRUSH rules, it might be worth adding rack/row as well? We have the equivalent in ms-swift and most of our cassandra setups, and if we've managed to put our kit into different racks/rows then I think it makes sense to tell CRUSH about that - that way we can be confident that if (WLOG) we have replication with n=3 and lose 1 rack out of 3 that we still have two copies of all pgs available and the cluster will continue to operate.

[sorry, I think I'm ending up providing suggestions just after they'd have been most useful!]

As per the instructions here: https://docs.ceph.com/en/quincy/rados/operations/pools/#creating-a-pool
...I have now created four pools with the following commands:

btullis@cephosd1001:~$ sudo ceph osd pool create rbd-metadata-ssd 800 800 replicated ssd --autoscale-mode=on
pool 'rbd-metadata-ssd' created
btullis@cephosd1001:~$ sudo ceph osd pool create rbd-metadata-hdd 1200 1200 replicated hdd --autoscale-mode=on
pool 'rbd-metadata-hdd' created
btullis@cephosd1001:~$ sudo ceph osd pool create rbd-data-ssd 800 800 erasure ec32-ssd --autoscale-mode=on
pool 'rbd-data-ssd' created
btullis@cephosd1001:~$ sudo ceph osd pool create rbd-data-hdd 1200 1200 erasure ec32-hdd --autoscale-mode=on
pool 'rbd-data-hdd' created

The reasoning is that even if we use an erasure coded pool for RBD data, the metadata needs to be kept in a replicated pool, since erasure coded pools do not support OMAP operations.
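On the PG counts, figures of this order follow from the common rule of thumb of roughly 100 PGs per OSD divided by the number of chunks or replicas per object: with 60 HDD OSDs and 40 SSD OSDs, and k+m = 5 for the erasure coded pools, that gives 60 × 100 / 5 = 1200 and 40 × 100 / 5 = 800. With the autoscaler enabled they can be adjusted later in any case.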

As per the instructions here: https://docs.ceph.com/en/quincy/rados/operations/erasure-code/#erasure-coding-with-overwrites
...I have allowed these erasure coded pools to be used for overwrites with:

btullis@cephosd1001:~$ sudo ceph osd pool set rbd-data-ssd allow_ec_overwrites true
set pool 5 allow_ec_overwrites to true
btullis@cephosd1001:~$ sudo ceph osd pool set rbd-data-hdd allow_ec_overwrites true
set pool 6 allow_ec_overwrites to true

As per the instructions here: https://docs.ceph.com/en/quincy/rados/operations/pools/#associating-a-pool-with-an-application
...I have associated all four of these pools with the rbd application with:

btullis@cephosd1001:~$ sudo ceph osd pool application enable rbd-metadata-ssd rbd
enabled application 'rbd' on pool 'rbd-metadata-ssd'
btullis@cephosd1001:~$ sudo ceph osd pool application enable rbd-metadata-hdd rbd
enabled application 'rbd' on pool 'rbd-metadata-hdd'
btullis@cephosd1001:~$ sudo ceph osd pool application enable rbd-data-ssd rbd
enabled application 'rbd' on pool 'rbd-data-ssd'
btullis@cephosd1001:~$ sudo ceph osd pool application enable rbd-data-hdd rbd
enabled application 'rbd' on pool 'rbd-data-hdd'

Next I will try creating some test images and validating that they work. As per these instructions I will need to specify the --data-pool parameter and give it the erasure coded pool, whilst storing the metadata on the replicated pool.

That's interesting; we are expecting to have enough hardware for pilot-MOSS this quarter, so we'll be looking at deployment/POC for that in Q2 with a bit of luck; I guess it might be worth considering whether it should still be a separate system at some point.

Yes, I'd welcome that conversation. Given that I've still got quite a bit of work to do getting radosgw up and running, any of our experiments that depend on S3 will still be a way away, but I think it makes sense to review all of the potential use cases together.

Apropos your CRUSH rules, it might be worth adding rack/row as well? We have the equivalent in ms-swift and most of our cassandra setups, and if we've managed to put our kit into different racks/rows then I think it makes sense to tell CRUSH about that - that way we can be confident that if (WLOG) we have replication with n=3 and lose 1 rack out of 3 that we still have two copies of all pgs available and the cluster will continue to operate.

Yes, I think that's a good suggestion as well. The main reason that I haven't done it yet is because the five servers are all in different racks already: five racks across eqiad rows E and F, so we implicitly get rack failure tolerance, albeit without explicit rack awareness in the CRUSH map. I'd certainly be happy to add rack and row awareness as well, particularly if we get more OSD hosts in this cluster.

[sorry, I think I'm ending up providing suggestions just after they'd have been most useful!]

Not at all :-) All suggestions welcome and we're at the point where we could still shut down the cluster and re-bootstrap it. I'm just trying to get to a point where I can get some tests done. We can modify crush rules, pool definitions, config etc as we go.

I created two test volumes, one on the HDDs and the other on the SSDs.
Data is stored on the erasure coded pool; metadata for both is on a replicated pool on the same medium.

btullis@cephosd1001:~$ sudo rbd create --size 10G --data-pool rbd-data-hdd rbd-metadata-hdd/test-hdd-volume
btullis@cephosd1001:~$ sudo rbd create --size 10G --data-pool rbd-data-ssd rbd-metadata-ssd/test-ssd-volume

Checked the info:

btullis@cephosd1001:~$ sudo rbd info rbd-metadata-hdd/test-hdd-volume
rbd image 'test-hdd-volume':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 25b9964a80408a
	data_pool: rbd-data-hdd
	block_name_prefix: rbd_data.4.25b9964a80408a
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
	op_features: 
	flags: 
	create_timestamp: Wed Jul 26 16:29:22 2023
	access_timestamp: Wed Jul 26 16:29:22 2023
	modify_timestamp: Wed Jul 26 16:29:22 2023

btullis@cephosd1001:~$ sudo rbd info rbd-metadata-ssd/test-ssd-volume
rbd image 'test-ssd-volume':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 25b9aaf6aa3bd
	data_pool: rbd-data-ssd
	block_name_prefix: rbd_data.3.25b9aaf6aa3bd
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
	op_features: 
	flags: 
	create_timestamp: Wed Jul 26 16:30:55 2023
	access_timestamp: Wed Jul 26 16:30:55 2023
	modify_timestamp: Wed Jul 26 16:30:55 2023

Apropos your CRUSH rules, it might be worth adding rack/row as well? We have the equivalent in ms-swift and most of our cassandra setups, and if we've managed to put our kit into different racks/rows then I think it makes sense to tell CRUSH about that - that way we can be confident that if (WLOG) we have replication with n=3 and lose 1 rack out of 3 that we still have two copies of all pgs available and the cluster will continue to operate.

Yes, I think that's a good suggestion as well. The main reason that I haven't done it yet is because the five servers are all in different racks already: five racks across eqiad rows E and F, so we implicitly get rack failure tolerance, albeit without explicit rack awareness in the CRUSH map. I'd certainly be happy to add rack and row awareness as well, particularly if we get more OSD hosts in this cluster.

In which case, the time to add those to the CRUSH rules is now - adjusting the CRUSH rule later often ends up involving a lot of data reshuffling (which takes ages and/or causes a lot of load).

In which case, the time to add those to the CRUSH rules is now - adjusting the CRUSH rule later often ends up involving a lot of data reshuffling (which takes ages and/or causes a lot of load).

OK, good call. I'll do that now.

I did some standard benchmarks with rados bench as per the guidance here.
The numbers look good in general, although it's not the most thorough of tests.

First we drop the caches.

btullis@cephosd1001:~$ sudo echo 3 | sudo tee /proc/sys/vm/drop_caches && sudo sync
3

Then we run a command like this on the newly created pool.

btullis@cephosd1001:~$ sudo rados bench -p rbd-metadata-ssd 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephosd1001_826673
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       229       213   851.946       852    0.104653   0.0714868
    2      16       462       446     891.9       932    0.164442   0.0707659
    3      16       691       675   899.903       916   0.0684295   0.0699727
    4      16       932       916   915.903       964   0.0532415   0.0691734
    5      16      1168      1152   921.498       944   0.0436329   0.0690155
    6      16      1399      1383   921.898       924   0.0798538    0.068921
    7      16      1637      1621   926.185       952   0.0552721   0.0688312
    8      16      1865      1849   924.398       912   0.0922172   0.0689293
    9      16      2100      2084   926.121       940   0.0587682   0.0687279
   10      16      2324      2308   923.099       896   0.0534711   0.0690442
Total time run:         10.0578
Total writes made:      2324
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     924.255
Stddev Bandwidth:       32.0998
Max bandwidth (MB/sec): 964
Min bandwidth (MB/sec): 852
Average IOPS:           231
Stddev IOPS:            8.02496
Max IOPS:               241
Min IOPS:               213
Average Latency(s):     0.0691069
Stddev Latency(s):      0.0240677
Max latency(s):         0.165458
Min latency(s):         0.0263169

This is testing the rbd-metadata-ssd pool, which is the replicated n=3 ssd pool. 16 threads with 4 MB objects for 10 seconds shows around 924 MB/s of throughput in total.

I did several more of these tests on different pools, including the erasure coded pools and with different block sizes and numbers of threads.
Another example is this one:

btullis@cephosd1001:~$ sudo rados bench -t 64 -p rbd-data-hdd 10 write --no-cleanup
hints = 1
Maintaining 64 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephosd1001_832596
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      63       266       203   811.953       812    0.234105    0.246373
    2      63       532       469   937.924      1064    0.232409    0.245503
    3      63       790       727   969.245      1032    0.232147    0.245328
    4      63      1051       988   987.902      1044    0.219935    0.245281
    5      63      1309      1246     996.7      1032    0.224981    0.245203
    6      63      1573      1510   1006.56      1056    0.253328    0.245078
    7      63      1840      1777   1015.32      1068    0.218455    0.244535
    8      63      2099      2036   1017.89      1036    0.231165    0.244696
    9      63      2359      2296   1020.34      1040    0.222975    0.244995
   10      63      2621      2558   1023.09      1048    0.219574    0.244684
Total time run:         10.0756
Total writes made:      2622
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1040.94
Stddev Bandwidth:       75.2784
Max bandwidth (MB/sec): 1068
Min bandwidth (MB/sec): 812
Average IOPS:           260
Stddev IOPS:            18.8196
Max IOPS:               267
Min IOPS:               203
Average Latency(s):     0.242551
Stddev Latency(s):      0.0606215
Max latency(s):         0.914458
Min latency(s):         0.0377207

This is 64 threads of 4MB objects to the erasure-coded pool based on hard drives. I was able to verify using dstat -f (on a very wide terminal) that writes to the hard drives also generate write load on the /dev/nvme0n1 device, which contains the bluestore WAL devices for all of the hard drives.
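A way to cross-check which block devices each OSD is using for its data and WAL/DB, without needing dstat, is to query the OSD metadata (a sketch; the exact field names vary a little between releases):

sudo ceph osd metadata 0 | grep -i device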

I then did a little bit of testing with rbd bench
First writing 10 GB in 4 KiB chunks (the default io_size of 4096 bytes) using 16 threads to the SSDs.

 btullis@cephosd1001:~$ sudo rbd bench --io-type write --io-total 10G test-ssd-volume --pool=rbd-metadata-ssd
bench  type write io_size 4096 io_threads 16 bytes 10737418240 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     89408   89782.5   351 MiB/s
    2    184944   92664.7   362 MiB/s
    3    277936   92773.7   362 MiB/s
    4    370480     92716   362 MiB/s
    5    467904   93658.3   366 MiB/s
    6    566544   95426.5   373 MiB/s
    7    657696   94549.7   369 MiB/s
    8    759280   96268.1   376 MiB/s
    9    862544   98412.1   384 MiB/s
   10    950128   96444.1   377 MiB/s
   11   1048064   96303.3   376 MiB/s
   12   1152032   98866.5   386 MiB/s
   13   1253360   98815.3   386 MiB/s
   14   1352176   97925.7   383 MiB/s
   15   1450400    100054   391 MiB/s
   16   1552928    100972   394 MiB/s
   17   1650528   99698.5   389 MiB/s
   18   1739904   97308.1   380 MiB/s
   19   1836528   96869.7   378 MiB/s
   20   1937664   97452.1   381 MiB/s
   21   2036944   96802.5   378 MiB/s
   22   2132784   96450.5   377 MiB/s
   23   2234720   98962.5   387 MiB/s
   24   2335168   99727.3   390 MiB/s
   25   2440064    100479   392 MiB/s
   26   2533424   99295.3   388 MiB/s
elapsed: 26   ops: 2621440   ops/sec: 97291.5   bytes/sec: 380 MiB/s

Then writing 10 GB with the same 16 threads and 4 KiB chunk size to the HDDs.

btullis@cephosd1001:~$ sudo rbd bench --io-type write --io-total 10G test-hdd-volume --pool=rbd-metadata-hdd
bench  type write io_size 4096 io_threads 16 bytes 10737418240 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     11648   11710.8    46 MiB/s
    2     81184   40681.1   159 MiB/s
    3    177008   59086.4   231 MiB/s
    4    271984   68067.6   266 MiB/s
    5    359856   72031.5   281 MiB/s
    6    448160   87301.8   341 MiB/s
    7    532976   90357.7   353 MiB/s
    8    625840   89765.8   351 MiB/s
    9    723440   90290.5   353 MiB/s
   10    817472   91522.5   358 MiB/s
   11    913808   93128.9   364 MiB/s
   12   1018416   97087.3   379 MiB/s
   13   1113056   97442.5   381 MiB/s
   14   1214080   98127.3   383 MiB/s
   15   1303856   97276.1   380 MiB/s
   16   1403216   97880.9   382 MiB/s
   17   1492304   94776.9   370 MiB/s
   18   1590544   95496.9   373 MiB/s
   19   1688720   94927.3   371 MiB/s
   20   1781664   95560.9   373 MiB/s
   21   1885008   96357.7   376 MiB/s
   22   1985744   98687.3   385 MiB/s
   23   2081776   98245.7   384 MiB/s
   24   2183424   98940.1   386 MiB/s
   25   2287552    101177   395 MiB/s
   26   2393664    101730   397 MiB/s
   27   2489616    100774   394 MiB/s
   28   2579136   99471.3   389 MiB/s
elapsed: 28   ops: 2621440   ops/sec: 92316.6   bytes/sec: 361 MiB/s

I've had fun with rados bench in the past; it's at least arguably useful to see how much performance you can squeeze out of it, since it'll give us an idea later of how close we're getting to that performance via the other layers (rbd, s3, ...).

Fishing out my old notes (which I can share if you're particularly interested), it's worth making sure the pool(s) you use for benchmarking have enough pgs to hit all of your storage devices; parameter-sweeping around thread numbers and how many client systems you run rados bench on at once can be useful (I found about 300s runtime long enough to get reasonable confidence, and ended up using 8 client machines and 32 threads each); --run-name $(hostname); more like an hour's runtime helped reduce variation between runs; worth doing all of rand-read, seq-read, and write benchmarks (separately); can be worth pausing {deep,}scrub while running benchmarks; ceph -s during a run can give you an indication of how things are going.
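As a rough sketch of what such a run might look like on this cluster (pool name, duration and thread count are placeholders, and scrubbing is paused for the duration):

sudo ceph osd set noscrub
sudo ceph osd set nodeep-scrub
sudo rados bench -p rbd-data-hdd 300 write -t 32 --run-name $(hostname) --no-cleanup
sudo rados bench -p rbd-data-hdd 300 seq -t 32 --run-name $(hostname)
sudo rados bench -p rbd-data-hdd 300 rand -t 32 --run-name $(hostname)
sudo rados -p rbd-data-hdd cleanup --run-name $(hostname)
sudo ceph osd unset noscrub
sudo ceph osd unset nodeep-scrub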

Added the two buckets for the rows in use.

btullis@cephosd1001:~$ sudo ceph osd crush add-bucket eqiad-e row
added bucket eqiad-e type row to crush map
btullis@cephosd1001:~$ sudo ceph osd crush add-bucket eqiad-f row
added bucket eqiad-f type row to crush map

Added the five buckets for the racks in use.

btullis@cephosd1001:~$ sudo ceph osd crush add-bucket e1 rack
added bucket e1 type rack to crush map
btullis@cephosd1001:~$ sudo ceph osd crush add-bucket e2 rack
added bucket e2 type rack to crush map
btullis@cephosd1001:~$ sudo ceph osd crush add-bucket e3 rack
added bucket e3 type rack to crush map
btullis@cephosd1001:~$ sudo ceph osd crush add-bucket f1 rack
added bucket f1 type rack to crush map
btullis@cephosd1001:~$ sudo ceph osd crush add-bucket f2 rack
added bucket f2 type rack to crush map

Moving the rack buckets into their corresponding rows.

btullis@cephosd1001:~$ sudo ceph osd crush move e1 rack=eqiad-e
moved item id -21 name 'e1' to location {rack=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move e2 rack=eqiad-e
moved item id -22 name 'e2' to location {rack=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move e3 rack=eqiad-e
moved item id -23 name 'e3' to location {rack=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move f1 rack=eqiad-f
moved item id -24 name 'f1' to location {rack=eqiad-f} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move f2 rack=eqiad-f
moved item id -25 name 'f2' to location {rack=eqiad-f} in crush map

Moving the hosts to their corresponding rows and racks:

btullis@cephosd1001:~$ sudo ceph osd crush move cephosd1001 row=eqiad-e rack=e1
moved item id -3 name 'cephosd1001' to location {rack=e1,row=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move cephosd1002 row=eqiad-e rack=e2
moved item id -7 name 'cephosd1002' to location {rack=e2,row=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move cephosd1003 row=eqiad-e rack=e3
moved item id -10 name 'cephosd1003' to location {rack=e3,row=eqiad-e} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move cephosd1004 row=eqiad-f rack=f1
moved item id -13 name 'cephosd1004' to location {rack=f1,row=eqiad-f} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move cephosd1005 row=eqiad-f rack=f2
moved item id -16 name 'cephosd1005' to location {rack=f2,row=eqiad-f} in crush map

The tree now looks like this:

btullis@cephosd1001:~$ sudo ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-20         459.97241  row eqiad-f                                        
-24         229.98621      rack f1                                        
-13         229.98621          host cephosd1004                           
 60    hdd   16.83679              osd.60            up   1.00000  1.00000
 61    hdd   16.83679              osd.61            up   1.00000  1.00000
 62    hdd   16.83679              osd.62            up   1.00000  1.00000
 63    hdd   16.83679              osd.63            up   1.00000  1.00000
 64    hdd   16.83679              osd.64            up   1.00000  1.00000
 65    hdd   16.83679              osd.65            up   1.00000  1.00000
 66    hdd   16.83679              osd.66            up   1.00000  1.00000
 67    hdd   16.83679              osd.67            up   1.00000  1.00000
 68    hdd   16.83679              osd.68            up   1.00000  1.00000
 69    hdd   16.83679              osd.69            up   1.00000  1.00000
 70    hdd   16.83679              osd.70            up   1.00000  1.00000
 71    hdd   16.83679              osd.71            up   1.00000  1.00000
 72    ssd    3.49309              osd.72            up   1.00000  1.00000
 73    ssd    3.49309              osd.73            up   1.00000  1.00000
 74    ssd    3.49309              osd.74            up   1.00000  1.00000
 75    ssd    3.49309              osd.75            up   1.00000  1.00000
 76    ssd    3.49309              osd.76            up   1.00000  1.00000
 77    ssd    3.49309              osd.77            up   1.00000  1.00000
 78    ssd    3.49309              osd.78            up   1.00000  1.00000
 79    ssd    3.49309              osd.79            up   1.00000  1.00000
-25         229.98621      rack f2                                        
-16         229.98621          host cephosd1005                           
 80    hdd   16.83679              osd.80            up   1.00000  1.00000
 81    hdd   16.83679              osd.81            up   1.00000  1.00000
 82    hdd   16.83679              osd.82            up   1.00000  1.00000
 83    hdd   16.83679              osd.83            up   1.00000  1.00000
 84    hdd   16.83679              osd.84            up   1.00000  1.00000
 85    hdd   16.83679              osd.85            up   1.00000  1.00000
 86    hdd   16.83679              osd.86            up   1.00000  1.00000
 87    hdd   16.83679              osd.87            up   1.00000  1.00000
 88    hdd   16.83679              osd.88            up   1.00000  1.00000
 89    hdd   16.83679              osd.89            up   1.00000  1.00000
 90    hdd   16.83679              osd.90            up   1.00000  1.00000
 91    hdd   16.83679              osd.91            up   1.00000  1.00000
 92    ssd    3.49309              osd.92            up   1.00000  1.00000
 93    ssd    3.49309              osd.93            up   1.00000  1.00000
 94    ssd    3.49309              osd.94            up   1.00000  1.00000
 95    ssd    3.49309              osd.95            up   1.00000  1.00000
 96    ssd    3.49309              osd.96            up   1.00000  1.00000
 97    ssd    3.49309              osd.97            up   1.00000  1.00000
 98    ssd    3.49309              osd.98            up   1.00000  1.00000
 99    ssd    3.49309              osd.99            up   1.00000  1.00000
-19         689.95862  row eqiad-e                                        
-21         229.98621      rack e1                                        
 -3         229.98621          host cephosd1001                           
  0    hdd   16.83679              osd.0             up   1.00000  1.00000
  1    hdd   16.83679              osd.1             up   1.00000  1.00000
  2    hdd   16.83679              osd.2             up   1.00000  1.00000
  3    hdd   16.83679              osd.3             up   1.00000  1.00000
  4    hdd   16.83679              osd.4             up   1.00000  1.00000
  5    hdd   16.83679              osd.5             up   1.00000  1.00000
  6    hdd   16.83679              osd.6             up   1.00000  1.00000
  7    hdd   16.83679              osd.7             up   1.00000  1.00000
  8    hdd   16.83679              osd.8             up   1.00000  1.00000
  9    hdd   16.83679              osd.9             up   1.00000  1.00000
 10    hdd   16.83679              osd.10            up   1.00000  1.00000
 11    hdd   16.83679              osd.11            up   1.00000  1.00000
 12    ssd    3.49309              osd.12            up   1.00000  1.00000
 13    ssd    3.49309              osd.13            up   1.00000  1.00000
 14    ssd    3.49309              osd.14            up   1.00000  1.00000
 15    ssd    3.49309              osd.15            up   1.00000  1.00000
 16    ssd    3.49309              osd.16            up   1.00000  1.00000
 17    ssd    3.49309              osd.17            up   1.00000  1.00000
 18    ssd    3.49309              osd.18            up   1.00000  1.00000
 19    ssd    3.49309              osd.19            up   1.00000  1.00000
-22         229.98621      rack e2                                        
 -7         229.98621          host cephosd1002                           
 20    hdd   16.83679              osd.20            up   1.00000  1.00000
 21    hdd   16.83679              osd.21            up   1.00000  1.00000
 22    hdd   16.83679              osd.22            up   1.00000  1.00000
 23    hdd   16.83679              osd.23            up   1.00000  1.00000
 24    hdd   16.83679              osd.24            up   1.00000  1.00000
 25    hdd   16.83679              osd.25            up   1.00000  1.00000
 26    hdd   16.83679              osd.26            up   1.00000  1.00000
 27    hdd   16.83679              osd.27            up   1.00000  1.00000
 28    hdd   16.83679              osd.28            up   1.00000  1.00000
 29    hdd   16.83679              osd.29            up   1.00000  1.00000
 30    hdd   16.83679              osd.30            up   1.00000  1.00000
 31    hdd   16.83679              osd.31            up   1.00000  1.00000
 32    ssd    3.49309              osd.32            up   1.00000  1.00000
 33    ssd    3.49309              osd.33            up   1.00000  1.00000
 34    ssd    3.49309              osd.34            up   1.00000  1.00000
 35    ssd    3.49309              osd.35            up   1.00000  1.00000
 36    ssd    3.49309              osd.36            up   1.00000  1.00000
 37    ssd    3.49309              osd.37            up   1.00000  1.00000
 38    ssd    3.49309              osd.38            up   1.00000  1.00000
 39    ssd    3.49309              osd.39            up   1.00000  1.00000
-23         229.98621      rack e3                                        
-10         229.98621          host cephosd1003                           
 40    hdd   16.83679              osd.40            up   1.00000  1.00000
 41    hdd   16.83679              osd.41            up   1.00000  1.00000
 42    hdd   16.83679              osd.42            up   1.00000  1.00000
 43    hdd   16.83679              osd.43            up   1.00000  1.00000
 44    hdd   16.83679              osd.44            up   1.00000  1.00000
 45    hdd   16.83679              osd.45            up   1.00000  1.00000
 46    hdd   16.83679              osd.46            up   1.00000  1.00000
 47    hdd   16.83679              osd.47            up   1.00000  1.00000
 48    hdd   16.83679              osd.48            up   1.00000  1.00000
 49    hdd   16.83679              osd.49            up   1.00000  1.00000
 50    hdd   16.83679              osd.50            up   1.00000  1.00000
 51    hdd   16.83679              osd.51            up   1.00000  1.00000
 52    ssd    3.49309              osd.52            up   1.00000  1.00000
 53    ssd    3.49309              osd.53            up   1.00000  1.00000
 54    ssd    3.49309              osd.54            up   1.00000  1.00000
 55    ssd    3.49309              osd.55            up   1.00000  1.00000
 56    ssd    3.49309              osd.56            up   1.00000  1.00000
 57    ssd    3.49309              osd.57            up   1.00000  1.00000
 58    ssd    3.49309              osd.58            up   1.00000  1.00000
 59    ssd    3.49309              osd.59            up   1.00000  1.00000
 -1                 0  root default

I'll check to see if I have to modify any of the existing rules, or whether this is enough for now.

Looking at it, I think it's going to be better to continue to use the root=default bucket at the top of the hierarchy.
So now we have root=default, row=eqiad-e, rack=e1, host=cephosd1001 and similar.

All of the rules that I have created should continue to work, but I can tune them if needed and we still don't have very much data to move around, even if we decide to change them significantly later.

btullis@cephosd1001:~$ sudo ceph osd crush move eqiad-e root=default
moved item id -19 name 'eqiad-e' to location {root=default} in crush map
btullis@cephosd1001:~$ sudo ceph osd crush move eqiad-f root=default
moved item id -20 name 'eqiad-f' to location {root=default} in crush map
btullis@cephosd1001:~$ sudo ceph osd tree
ID   CLASS  WEIGHT      TYPE NAME                     STATUS  REWEIGHT  PRI-AFF
 -1         1149.93103  root default                                           
-19          689.95862      row eqiad-e                                        
-21          229.98621          rack e1                                        
 -3          229.98621              host cephosd1001                           
  0    hdd    16.83679                  osd.0             up   1.00000  1.00000
  1    hdd    16.83679                  osd.1             up   1.00000  1.00000
  2    hdd    16.83679                  osd.2             up   1.00000  1.00000
  3    hdd    16.83679                  osd.3             up   1.00000  1.00000
  4    hdd    16.83679                  osd.4             up   1.00000  1.00000
  5    hdd    16.83679                  osd.5             up   1.00000  1.00000
  6    hdd    16.83679                  osd.6             up   1.00000  1.00000
  7    hdd    16.83679                  osd.7             up   1.00000  1.00000
  8    hdd    16.83679                  osd.8             up   1.00000  1.00000
  9    hdd    16.83679                  osd.9             up   1.00000  1.00000
 10    hdd    16.83679                  osd.10            up   1.00000  1.00000
 11    hdd    16.83679                  osd.11            up   1.00000  1.00000
 12    ssd     3.49309                  osd.12            up   1.00000  1.00000
 13    ssd     3.49309                  osd.13            up   1.00000  1.00000
 14    ssd     3.49309                  osd.14            up   1.00000  1.00000
 15    ssd     3.49309                  osd.15            up   1.00000  1.00000
 16    ssd     3.49309                  osd.16            up   1.00000  1.00000
 17    ssd     3.49309                  osd.17            up   1.00000  1.00000
 18    ssd     3.49309                  osd.18            up   1.00000  1.00000
 19    ssd     3.49309                  osd.19            up   1.00000  1.00000
-22          229.98621          rack e2                                        
 -7          229.98621              host cephosd1002                           
 20    hdd    16.83679                  osd.20            up   1.00000  1.00000
 21    hdd    16.83679                  osd.21            up   1.00000  1.00000
 22    hdd    16.83679                  osd.22            up   1.00000  1.00000
 23    hdd    16.83679                  osd.23            up   1.00000  1.00000
 24    hdd    16.83679                  osd.24            up   1.00000  1.00000
 25    hdd    16.83679                  osd.25            up   1.00000  1.00000
 26    hdd    16.83679                  osd.26            up   1.00000  1.00000
 27    hdd    16.83679                  osd.27            up   1.00000  1.00000
 28    hdd    16.83679                  osd.28            up   1.00000  1.00000
 29    hdd    16.83679                  osd.29            up   1.00000  1.00000
 30    hdd    16.83679                  osd.30            up   1.00000  1.00000
 31    hdd    16.83679                  osd.31            up   1.00000  1.00000
 32    ssd     3.49309                  osd.32            up   1.00000  1.00000
 33    ssd     3.49309                  osd.33            up   1.00000  1.00000
 34    ssd     3.49309                  osd.34            up   1.00000  1.00000
 35    ssd     3.49309                  osd.35            up   1.00000  1.00000
 36    ssd     3.49309                  osd.36            up   1.00000  1.00000
 37    ssd     3.49309                  osd.37            up   1.00000  1.00000
 38    ssd     3.49309                  osd.38            up   1.00000  1.00000
 39    ssd     3.49309                  osd.39            up   1.00000  1.00000
-23          229.98621          rack e3                                        
-10          229.98621              host cephosd1003                           
 40    hdd    16.83679                  osd.40            up   1.00000  1.00000
 41    hdd    16.83679                  osd.41            up   1.00000  1.00000
 42    hdd    16.83679                  osd.42            up   1.00000  1.00000
 43    hdd    16.83679                  osd.43            up   1.00000  1.00000
 44    hdd    16.83679                  osd.44            up   1.00000  1.00000
 45    hdd    16.83679                  osd.45            up   1.00000  1.00000
 46    hdd    16.83679                  osd.46            up   1.00000  1.00000
 47    hdd    16.83679                  osd.47            up   1.00000  1.00000
 48    hdd    16.83679                  osd.48            up   1.00000  1.00000
 49    hdd    16.83679                  osd.49            up   1.00000  1.00000
 50    hdd    16.83679                  osd.50            up   1.00000  1.00000
 51    hdd    16.83679                  osd.51            up   1.00000  1.00000
 52    ssd     3.49309                  osd.52            up   1.00000  1.00000
 53    ssd     3.49309                  osd.53            up   1.00000  1.00000
 54    ssd     3.49309                  osd.54            up   1.00000  1.00000
 55    ssd     3.49309                  osd.55            up   1.00000  1.00000
 56    ssd     3.49309                  osd.56            up   1.00000  1.00000
 57    ssd     3.49309                  osd.57            up   1.00000  1.00000
 58    ssd     3.49309                  osd.58            up   1.00000  1.00000
 59    ssd     3.49309                  osd.59            up   1.00000  1.00000
-20          459.97241      row eqiad-f                                        
-24          229.98621          rack f1                                        
-13          229.98621              host cephosd1004                           
 60    hdd    16.83679                  osd.60            up   1.00000  1.00000
 61    hdd    16.83679                  osd.61            up   1.00000  1.00000
 62    hdd    16.83679                  osd.62            up   1.00000  1.00000
 63    hdd    16.83679                  osd.63            up   1.00000  1.00000
 64    hdd    16.83679                  osd.64            up   1.00000  1.00000
 65    hdd    16.83679                  osd.65            up   1.00000  1.00000
 66    hdd    16.83679                  osd.66            up   1.00000  1.00000
 67    hdd    16.83679                  osd.67            up   1.00000  1.00000
 68    hdd    16.83679                  osd.68            up   1.00000  1.00000
 69    hdd    16.83679                  osd.69            up   1.00000  1.00000
 70    hdd    16.83679                  osd.70            up   1.00000  1.00000
 71    hdd    16.83679                  osd.71            up   1.00000  1.00000
 72    ssd     3.49309                  osd.72            up   1.00000  1.00000
 73    ssd     3.49309                  osd.73            up   1.00000  1.00000
 74    ssd     3.49309                  osd.74            up   1.00000  1.00000
 75    ssd     3.49309                  osd.75            up   1.00000  1.00000
 76    ssd     3.49309                  osd.76            up   1.00000  1.00000
 77    ssd     3.49309                  osd.77            up   1.00000  1.00000
 78    ssd     3.49309                  osd.78            up   1.00000  1.00000
 79    ssd     3.49309                  osd.79            up   1.00000  1.00000
-25          229.98621          rack f2                                        
-16          229.98621              host cephosd1005                           
 80    hdd    16.83679                  osd.80            up   1.00000  1.00000
 81    hdd    16.83679                  osd.81            up   1.00000  1.00000
 82    hdd    16.83679                  osd.82            up   1.00000  1.00000
 83    hdd    16.83679                  osd.83            up   1.00000  1.00000
 84    hdd    16.83679                  osd.84            up   1.00000  1.00000
 85    hdd    16.83679                  osd.85            up   1.00000  1.00000
 86    hdd    16.83679                  osd.86            up   1.00000  1.00000
 87    hdd    16.83679                  osd.87            up   1.00000  1.00000
 88    hdd    16.83679                  osd.88            up   1.00000  1.00000
 89    hdd    16.83679                  osd.89            up   1.00000  1.00000
 90    hdd    16.83679                  osd.90            up   1.00000  1.00000
 91    hdd    16.83679                  osd.91            up   1.00000  1.00000
 92    ssd     3.49309                  osd.92            up   1.00000  1.00000
 93    ssd     3.49309                  osd.93            up   1.00000  1.00000
 94    ssd     3.49309                  osd.94            up   1.00000  1.00000
 95    ssd     3.49309                  osd.95            up   1.00000  1.00000
 96    ssd     3.49309                  osd.96            up   1.00000  1.00000
 97    ssd     3.49309                  osd.97            up   1.00000  1.00000
 98    ssd     3.49309                  osd.98            up   1.00000  1.00000
 99    ssd     3.49309                  osd.99            up   1.00000  1.00000
BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.