
[ceph] Getting rack level HA
Open, High, Public

Description

These are the things we need to get high availability at the rack/rack switch level for our current cloud ceph cluster:

  • We need to spread the 3 mons in different racks under different switches:
    • Current: all the mons are under the B* switch (B2/B4/B7)
    • Example of HA: 1 mon under B*, 1 mon under D5 cloudswitch, 1 mon under C8 cloudswitch
  • We need to spread the osds in 3 equal-sized groups (if the group sizes differ, the extra space cannot be used until the other racks match it, or until we bring another rack with 2x the difference)
    • Current:
      • 1 B2
      • 1 B4
      • 1 B7
      • 11 C8
      • 10 D5
    • Example of HA:
      • 7 B*
      • 7 D5
      • 7 C8

Only when the above is done can we proceed to configure the cluster (if we do it before, the cluster will halt due to lack of high availability).

  • We have to configure the ceph crush map to account for the rack location:
    • Current:

We only have host-level spreading:

root@cloudcephosd1001:~# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
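
For reference, getting rack-level placement means adding rack buckets to the crush map, re-parenting the hosts under them, and switching the pools to a rule whose failure domain is "rack". A minimal sketch with stock ceph commands (the bucket name c8, the host cloudcephosd1004 and the rule name replicated_rack are illustrative here, not the final names):

# declare a rack bucket and attach it under the default root
ceph osd crush add-bucket c8 rack
ceph osd crush move c8 root=default
# re-parent each host under its rack (repeat per host)
ceph osd crush move cloudcephosd1004 rack=c8
# create a replicated rule that places one copy per rack...
ceph osd crush rule create-replicated replicated_rack default rack
# ...and point each pool at it
ceph osd pool set <pool> crush_rule replicated_rack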


Event Timeline

dcaro triaged this task as High priority. Dec 6 2021, 9:42 AM
dcaro created this task.

Hi @dcaro - given the requirement these servers have of utilizing dual switchports, I don't think we have enough space to move any of them to row B. However, we plan on dedicating two additional racks to WMCS in the new eqiad expansion cage when it's ready for use in January: one in row E and another in row F. If we were to install everything WMCS going forward across these 4x racks (only C8, D5, E*, F*), would that meet all your future needs (including equal diversification across racks, etc.)?

Thanks,
Willy

Yes, that would be awesome, making sure of course that the cloud switches on
those racks are interconnected in a highly available way (mesh, ring...).

Perfect. We'll update you in January when things are set up, and you can create a move task for these servers then. For future install tasks though, just let us know in the Phabricator task description if there are any constraints that we should follow when racking them up. Thanks, Willy

@dcaro, given we will soon have row E and row F as WMCS racks, can we update this plan with a suggested layout? We want to move all the monitors out of row B, for instance, and then split the OSDs. Depending on how many new OSDs we are looking at buying over the next year, it may be possible for us to avoid moving as many machines (instead we could rack new ones).

I assume we can / should move a single server or two at a time, and if done one by one, it can happen without WMCS's direct involvement? Once we have a plan, let's open specific tickets for each move under https://phabricator.wikimedia.org/maniphest/task/edit/form/55/.

Hey, so yes, we can already start moving one monitor to rack C and one to D if possible; otherwise we have to first sort out the ceph setup across different L3 domains (in the works: @cmooney is working on the new setup, and we wanted to try the configs with the hosts in T294972: Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] if possible).

They have to be moved one by one, and our intervention is needed to make sure that after moving one, the cluster comes back to healthy before moving the next. For the cloudcephosd ones, we can move two at a time, though those are less critical (there are many more xd).

Thanks for the update! Ok, so let's start with the mons, and start with one request to move to rack C. If that works, we'll move one more to rack D. That will be a nice improvement already over what we have. Filed T303058.
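
For the record, the health check around each mon move is plain ceph CLI; a minimal sketch (standard tooling, nothing WMCS-specific assumed):

# before taking a mon down, confirm all 3 are in quorum
ceph quorum_status --format json-pretty | grep -A5 quorum_names
# ...move the host, then wait for it to rejoin...
ceph -s    # should report 3 mons in quorum and HEALTH_OK before the next move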

nskaggs claimed this task.
dcaro reopened this task as Open. Edited Feb 6 2023, 1:52 PM

An update on this: it would help us greatly to be able to take a rack down without having downtime (see T316544).

Currently we have 4 different failure domains:

c8 -> cloudsw-c8-eqiad (row c, column 8) https://netbox.wikimedia.org/dcim/devices/2686/
d5 -> cloudswitch-d5 https://netbox.wikimedia.org/dcim/devices/2687/
e4 -> cloudsw1-e4 https://netbox.wikimedia.org/dcim/devices/3931/
f4 -> cloudsw1-f4 https://netbox.wikimedia.org/dcim/devices/3935/

  • Switch on c8 (that includes b*, as they connect through this one) -> 14 hosts
  • Switch on d5 -> 10 hosts
  • Switch on e4 -> 5 hosts
  • Switch on f4 -> 5 hosts

And 34 hosts total (+6 coming soon, but those are going to go into a different pool).

Given that we want to be able to fail over with those 34 hosts, the balance should be something like:

  • c8 - 9 hosts
  • d5 - 9 hosts
  • e4 - 8 hosts
  • f4 - 8 hosts

So a possibility for that would be:

  • Move 3 hosts from c8 -> e4
  • Move 2 hosts from c8 -> f4
  • Move 1 host from d5 -> f4

That leaves c8 with 9 hosts (14-5), d5 with 9 (10-1), e4 with 8 (5+3) and f4 with 8 (5+3). Currently we have space on F4 and E4, so this move works well. The disadvantage is that when we buy more hosts, we not only have to buy them in 3s (as before), but they will also have to go into different racks/domains, so rack space becomes a limiting factor too.

@Cmjohnson How much effort would it take to move these hosts? The process would have to be two by two (move two, wait for the cluster to sync, which should be a matter of seconds if everything is ok, then move the next two...)

Unfortunately the IPs would have to change (E4 and F4 have different subnets), but everything else should stay the same, and those racks already have similar service hosts (so fw/router/switch config should be ok).

++ @Jclark-ctr, since Chris is out until the end of February

Have synced up with @Andrew on scheduling this move. Waiting for an update from the meeting he has tomorrow to discuss this.

@Jclark-ctr Hi! We decided to do the moves, yep.

We have to move them two-by-two, in batches; each batch can be done right after the previous one if there are no problems, but no more than 2 hosts should be down at a time (a sketch of the per-batch checks follows the list):

  • Move cloudcephosd1001 b7 -> e4 + cloudcephosd1002 b4 -> e4 (T329498)
  • Move cloudcephosd1003 b2 -> e4 + cloudcephosd1004 c8 -> f4 (T329502)
  • Move cloudcephosd1005 c8 -> f4 + cloudcephosd1010 d5 -> f4 (T329504)
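
A sketch of the checks around each batch (the noout flag is the standard ceph way to avoid rebalancing during a short outage; the exact runbook may differ):

# before powering the pair down: avoid rebalancing while they are offline
ceph osd set noout
# ...power down, move to the new rack, re-IP, power up...
# then wait until all PGs are active+clean again
ceph -s
ceph osd unset noout
# only then start the next batch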

Change 896372 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudceph: add the location info to the hosts

https://gerrit.wikimedia.org/r/896372

Change 896372 merged by David Caro:

[operations/puppet@production] cloudceph: add the location info to the hosts

https://gerrit.wikimedia.org/r/896372

Change 904787 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: Allow setting a crush location hook for the rack

https://gerrit.wikimedia.org/r/904787

Change 904788 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p:cloudceph::osd: enable location hook

https://gerrit.wikimedia.org/r/904788

Change 904787 merged by David Caro:

[operations/puppet@production] ceph: Allow setting a crush location hook for the rack

https://gerrit.wikimedia.org/r/904787

Change 904788 merged by David Caro:

[operations/puppet@production] p:cloudceph::osd: enable location hook

https://gerrit.wikimedia.org/r/904788
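
For context, a crush location hook is a script that ceph runs when an OSD starts to decide where the daemon sits in the crush tree; it must print the location on a single line on stdout, and is wired up via the crush_location_hook option in ceph.conf. The real hook is in the puppet patches above; a minimal sketch of the shape (the /etc/ceph-rack lookup is a placeholder, not the actual mechanism):

#!/bin/sh
# called by ceph with --cluster/--id/--type; prints this daemon's CRUSH location
RACK="$(cat /etc/ceph-rack 2>/dev/null || echo unknown)"  # placeholder rack lookup
echo "host=$(hostname -s) rack=${RACK} root=default"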

Change #1013524 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: fix location hook path

https://gerrit.wikimedia.org/r/1013524

Change #1013525 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph.eqiad: enable location hook

https://gerrit.wikimedia.org/r/1013525

Change #1013524 merged by David Caro:

[operations/puppet@production] ceph: fix location hook path

https://gerrit.wikimedia.org/r/1013524

Change #1013525 merged by David Caro:

[operations/puppet@production] ceph.eqiad: enable location hook

https://gerrit.wikimedia.org/r/1013525

Change #1013540 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: move the location hook to the osd top level

https://gerrit.wikimedia.org/r/1013540

Change #1013540 merged by David Caro:

[operations/puppet@production] ceph: move the location hook to the osd top level

https://gerrit.wikimedia.org/r/1013540