
[ceph] Getting rack level HA
Open, High, Public

Description

These are the things we need to get high availability at the rack/rack switch level for our current cloud ceph cluster:

  • We need to spread the 3 mons in different racks under different switches:
    • Current: all the mons are under the B* switch (B2/B4/B7)
    • Example of HA: 1 mon under B*, 1 mon under D5 cloudswitch, 1 mon under C8 cloudswitch
  • We need to spread the osds in 3 equal-sized groups (if the group sizes differ, the extra space cannot be used until the other racks match it, or until we bring another rack with 2x the difference)
    • Current:
      • 1 B2
      • 1 B4
      • 1 B7
      • 11 C8
      • 10 D5
    • Example of HA:
      • 7 B*
      • 7 D5
      • 7 C8

Only when the above is done can we proceed to configure the cluster (if we do it before, the cluster will halt due to lack of high availability).

  • We have to configure the ceph crush map to account for the rack location:
    • Current:

We only have host-level spreading:

root@cloudcephosd1001:~# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
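
For reference, getting rack-level placement means adding rack buckets to the crush map, re-parenting the hosts under them, and switching the pools to a rule whose failure domain is "rack". A minimal sketch with stock ceph commands (the bucket name c8, the host cloudcephosd1004 and the rule name replicated_rack are illustrative here, not the final names):

# declare a rack bucket and attach it under the default root
ceph osd crush add-bucket c8 rack
ceph osd crush move c8 root=default
# re-parent each host under its rack (repeat per host)
ceph osd crush move cloudcephosd1004 rack=c8
# create a replicated rule that places one copy per rack...
ceph osd crush rule create-replicated replicated_rack default rack
# ...and point each pool at it
ceph osd pool set <pool> crush_rule replicated_rack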


Event Timeline

dcaro triaged this task as High priority. Dec 6 2021, 9:42 AM
dcaro created this task.

Hi @dcaro - given the requirement these servers have of utilizing dual switchports, I don't think we have enough space to move any of them to row B. However, we plan on dedicating two additional racks to WMCS in the new eqiad expansion cage when it's ready for use in January: one in row E and another in row F. If we were to install everything WMCS going forward across these 4x racks (only C8, D5, E*, F*), would that meet all your future needs (including equal diversification across racks, etc.)?

Thanks,
Willy

Yes, that would be awesome, making sure of course that the cloud switches on
those racks are interconnected in a highly available way (mesh, ring...).

Perfect. We'll update you in January when things are set up, and you can create a move task for these servers then. For future install tasks though, just let us know in the Phabricator task description if there are any constraints that we should follow when racking them up. Thanks, Willy

@dcaro, given we will soon have row E and row F as WMCS racks, can we update this plan with a suggested layout? We want to move all the monitors out of row B, for instance, and then split the OSDs. Depending on how many new OSDs we are looking at buying over the next year, it may be possible for us to avoid moving as many machines (instead we could rack new ones).

I assume we can / should move a single server or two at a time, and if done one by one, it can happen without WMCS's direct involvement? Once we have a plan, let's open specific tickets for each move under https://phabricator.wikimedia.org/maniphest/task/edit/form/55/.

Hey, so yes, we can already start moving one monitor to rack C and one to D if possible; otherwise we have to first sort out the ceph setup across different L3 domains (in the works: @cmooney is working on the new setup, and we wanted to try the configs with the hosts in T294972: Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] if possible).

They have to be moved one by one, and our intervention is needed to make sure that after moving one, the cluster comes back to healthy before moving the next. For the cloudcephosd ones, we can move two at a time, though those are less critical (there are many more xd).

Thanks for the update! Ok, so let's start with the mons, and start with one request to move to rack C. If that works, we'll move one more to rack D. That will be a nice improvement already over what we have. Filed T303058.
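
For the record, the health check around each mon move is plain ceph CLI; a minimal sketch (standard tooling, nothing WMCS-specific assumed):

# before taking a mon down, confirm all 3 are in quorum
ceph quorum_status --format json-pretty | grep -A5 quorum_names
# ...move the host, then wait for it to rejoin...
ceph -s    # should report 3 mons in quorum and HEALTH_OK before the next move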

nskaggs claimed this task.
dcaro reopened this task as Open. Edited Feb 6 2023, 1:52 PM

An update on this: it would help us greatly to be able to take a rack down without having downtime (see T316544).

Currently we have 4 different failure domains:

c8 -> cloudsw-c8-eqiad (row c, column 8) https://netbox.wikimedia.org/dcim/devices/2686/
d5 -> cloudswitch-d5 https://netbox.wikimedia.org/dcim/devices/2687/
e4 -> cloudsw1-e4 https://netbox.wikimedia.org/dcim/devices/3931/
f4 -> cloudsw1-f4 https://netbox.wikimedia.org/dcim/devices/3935/

  • Switch on c8 (that includes b*, as they connect through this one) -> 14 hosts
  • Switch on d5 -> 10 hosts
  • Switch on e4 -> 5 hosts
  • Switch on f4 -> 5 hosts

And 34 hosts total (+6 coming soon, but those are going to go into a different pool).

Given that we want to be able to fail over with those 34 hosts, the balance should be something like:

  • c8 - 9 hosts
  • d5 - 9 hosts
  • e4 - 8 hosts
  • f4 - 8 hosts

So a possibility for that would be:

  • Move 3 hosts from c8 -> e4
  • Move 2 hosts from c8 -> f4
  • Move 1 host from d5 -> f4

That leaves c8 with 9 hosts (14-5), d5 with 9 (10-1), e4 with 8 (5+3) and f4 with 8 (5+3). Currently we have space on F4 and E4, so this move works well. The disadvantage is that when we buy more hosts, we not only have to buy them in 3s (as before), but they will also have to go into different racks/domains, so rack space becomes a limiting factor too.

@Cmjohnson How much effort would it take to move these hosts? The process would have to be two by two (move two, wait for the cluster to sync, which should be a matter of seconds if everything is ok, then move the next two...)

Unfortunately the IPs would have to change (E4 and F4 have different subnets), but everything else should stay the same, and those racks already have similar service hosts (so fw/router/switch config should be ok).

++ @Jclark-ctr, since Chris is out until the end of February

Have synced up with @Andrew on scheduling this move. Waiting for an update from the meeting he has tomorrow to discuss this.

@Jclark-ctr Hi! We decided to do the moves, yep.

We have to move them two-by-two, in batches; each batch can be done right after the previous one if there are no problems, but no more than 2 hosts should be down at a time (a sketch of the per-batch checks follows the list):

  • Move cloudcephosd1001 b7 -> e4 + cloudcephosd1002 b4 -> e4 (T329498)
  • Move cloudcephosd1003 b2 -> e4 + cloudcephosd1004 c8 -> f4 (T329502)
  • Move cloudcephosd1005 c8 -> f4 + cloudcephosd1010 d5 -> f4 (T329504)
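
A sketch of the checks around each batch (the noout flag is the standard ceph way to avoid rebalancing during a short outage; the exact runbook may differ):

# before powering the pair down: avoid rebalancing while they are offline
ceph osd set noout
# ...power down, move to the new rack, re-IP, power up...
# then wait until all PGs are active+clean again
ceph -s
ceph osd unset noout
# only then start the next batch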

Change 896372 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudceph: add the location info to the hosts

https://gerrit.wikimedia.org/r/896372

Change 896372 merged by David Caro:

[operations/puppet@production] cloudceph: add the location info to the hosts

https://gerrit.wikimedia.org/r/896372

Change 904787 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: Allow setting a crush location hook for the rack

https://gerrit.wikimedia.org/r/904787

Change 904788 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p:cloudceph::osd: enable location hook

https://gerrit.wikimedia.org/r/904788

Change 904787 merged by David Caro:

[operations/puppet@production] ceph: Allow setting a crush location hook for the rack

https://gerrit.wikimedia.org/r/904787

Change 904788 merged by David Caro:

[operations/puppet@production] p:cloudceph::osd: enable location hook

https://gerrit.wikimedia.org/r/904788
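
For context, a crush location hook is a script that ceph runs when an OSD starts to decide where the daemon sits in the crush tree; it must print the location on a single line on stdout, and is wired up via the crush_location_hook option in ceph.conf. The real hook is in the puppet patches above; a minimal sketch of the shape (the /etc/ceph-rack lookup is a placeholder, not the actual mechanism):

#!/bin/sh
# called by ceph with --cluster/--id/--type; prints this daemon's CRUSH location
RACK="$(cat /etc/ceph-rack 2>/dev/null || echo unknown)"  # placeholder rack lookup
echo "host=$(hostname -s) rack=${RACK} root=default"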

Change #1013524 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: fix location hook path

https://gerrit.wikimedia.org/r/1013524

Change #1013525 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph.eqiad: enable location hook

https://gerrit.wikimedia.org/r/1013525

Change #1013524 merged by David Caro:

[operations/puppet@production] ceph: fix location hook path

https://gerrit.wikimedia.org/r/1013524

Change #1013525 merged by David Caro:

[operations/puppet@production] ceph.eqiad: enable location hook

https://gerrit.wikimedia.org/r/1013525

Change #1013540 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: move the location hook to the osd top level

https://gerrit.wikimedia.org/r/1013540

Change #1013540 merged by David Caro:

[operations/puppet@production] ceph: move the location hook to the osd top level

https://gerrit.wikimedia.org/r/1013540