
Erroneous node placement (AQS Cassandra cluster)
Open, Low, Public

Description

When configured with the NetworkTopologyStrategy (as the AQS cluster is), Cassandra distributes replicas across distinct points of failure. Cassandra uses the nomenclature "rack", but at the WMF we treat the rows in our datacenters as the unit of failure for replica placement. The table below shows the current placement according to the configuration, alongside where the nodes are actually located in the datacenter (eqiad only, at the moment).

host    | cassandra "rack" | datacenter row
aqs1010 | rack1            | a
aqs1013 | rack1            | c
aqs1011 | rack2            | b
aqs1014 | rack2            | d
aqs1012 | rack3            | c
aqs1015 | rack3            | d

We make heavy use of a replication factor of 3, and QUORUM consistency for both reads and writes. With replicas properly distributed over 3 or more rows, we can survive an entire row outage without any disruption to the service(s). The placement above is incorrect, though, because there are scenarios where a single row failure will result in outages: in our configuration, a failure of either row C or row D would drop a significant number of replica sets below quorum, creating outages.
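
To make the failure scenario concrete (a quick sketch, not production tooling; the host/row data is simply the table above): with a replication factor of 3, QUORUM requires floor(3/2) + 1 = 2 live replicas, and NetworkTopologyStrategy places one replica in each cassandra "rack". In the worst case, a row failure therefore costs one replica for every distinct rack present in that row:

```python
# Sketch: worst-case effect of a single row failure on the current
# eqiad placement. For some token ranges, every replica belonging to a
# rack with a host in the failed row lives on exactly those hosts.

current = {
    # host:    (cassandra "rack", datacenter row)
    "aqs1010": ("rack1", "a"),
    "aqs1013": ("rack1", "c"),
    "aqs1011": ("rack2", "b"),
    "aqs1014": ("rack2", "d"),
    "aqs1012": ("rack3", "c"),
    "aqs1015": ("rack3", "d"),
}

RF = 3
QUORUM = RF // 2 + 1  # floor(3/2) + 1 = 2

for row in sorted({row for _, row in current.values()}):
    racks_down = {rack for rack, r in current.values() if r == row}
    worst_case_live = RF - len(racks_down)
    verdict = "ok" if worst_case_live >= QUORUM else "BELOW QUORUM"
    print(f"row {row} down -> racks hit: {sorted(racks_down)} "
          f"-> {worst_case_live}/{RF} replicas live: {verdict}")

# Rows a and b leave 2/3 replicas (fine), but rows c and d each hit two
# racks at once, leaving 1/3: below quorum, i.e. an outage.
```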

Fixing this will mean decommissioning, physically relocating, and bootstrapping servers back into the cluster. Since this situation isn't new, it probably makes sense to wait until we've deployed the servers for the new expansion (T304173). We should, however, take the moves needed here into account when determining row placement for the new servers (my understanding is that row space is constrained in places).


Proposed

eqiad
host    | cassandra "rack" | datacenter row | target row
aqs1010 | rack1            | a              | a
aqs1013 | rack1            | c              | d
aqs1016 | rack1            | N/A            | a
aqs1019 | rack1            | N/A            | d
aqs1011 | rack2            | b              | b
aqs1014 | rack2            | d              | e
aqs1017 | rack2            | N/A            | b
aqs1020 | rack2            | N/A            | e
aqs1012 | rack3            | c              | c
aqs1015 | rack3            | d              | f
aqs1018 | rack3            | N/A            | c
aqs1021 | rack3            | N/A            | f
codfw
host    | cassandra "rack" | datacenter row | target row
aqs2001 | rack1            | N/A            | a
aqs2002 | rack1            | N/A            | a
aqs2003 | rack1            | N/A            | a
aqs2004 | rack1            | N/A            | a
aqs2005 | rack2            | N/A            | b
aqs2006 | rack2            | N/A            | b
aqs2007 | rack2            | N/A            | b
aqs2008 | rack2            | N/A            | b
aqs2009 | rack3            | N/A            | d
aqs2010 | rack3            | N/A            | d
aqs2011 | rack3            | N/A            | d
aqs2012 | rack3            | N/A            | d
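
The target layouts can be checked mechanically (again a sketch, with the host/row data copied from the tables above): a single row failure can never break quorum as long as no physical row contains hosts from more than one cassandra "rack".

```python
from collections import defaultdict

# Sketch: verify that in the proposed layouts no physical row mixes
# cassandra "racks"; a single row failure then costs at most one of the
# three replicas, and 2/3 still satisfies QUORUM.

proposed = {
    "eqiad": {
        "aqs1010": ("rack1", "a"), "aqs1016": ("rack1", "a"),
        "aqs1013": ("rack1", "d"), "aqs1019": ("rack1", "d"),
        "aqs1011": ("rack2", "b"), "aqs1017": ("rack2", "b"),
        "aqs1014": ("rack2", "e"), "aqs1020": ("rack2", "e"),
        "aqs1012": ("rack3", "c"), "aqs1018": ("rack3", "c"),
        "aqs1015": ("rack3", "f"), "aqs1021": ("rack3", "f"),
    },
    "codfw": {  # four hosts per rack: rack1 -> a, rack2 -> b, rack3 -> d
        f"aqs20{n:02d}": (rack, row)
        for rack, row, first in (("rack1", "a", 1),
                                 ("rack2", "b", 5),
                                 ("rack3", "d", 9))
        for n in range(first, first + 4)
    },
}

for dc, layout in proposed.items():
    racks_per_row = defaultdict(set)
    for rack, row in layout.values():
        racks_per_row[row].add(rack)
    assert all(len(r) == 1 for r in racks_per_row.values()), dc
    print(f"{dc}: no row mixes racks; any single row failure leaves 2/3 replicas")
```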

Event Timeline

Eevans updated the task description.
LSobanski updated the task description.

As discussed in a meeting, we have decided to use the v2 proposal above for eqiad and the v1 proposal for codfw.

In terms of the timeline, we will:

  • rack the 6 new nodes in eqiad
  • bring them online in the cluster
  • move three hosts (aqs1013, aqs1014, aqs1015) from their current locations to their v2-specified locations (assuming DC Ops has no issue with this request)
  • rack the 12 new hosts in codfw with their v1 style allocations, as shown above.

I will update T304173 and T305568 with the newly proposed racking details.

Hi - John already started racking some of the new aqs1016-1021 servers in T305570. The racking details in that task didn't specify servers needing to go into specific racks (only general distribution across rows, using the same rows as aqs1010-1015), so just confirming whether these servers need to be in the exact racks of A1, D1, B2, E2, C3, F3 outlined in the task description to function properly? Or is there some wiggle room to use other racks in these same rows?

For example, would it be possible to leave aqs1013 in rack C1 and install aqs1018 someplace in row D instead? Similarly, would it still work if we left aqs1014 in D2, and installed aqs1019 in row E as an alternative? Basically, just clarifying what the criteria are, to see if it's possible to avoid any physical server moves. Thanks in advance. ~Willy

Hi Willy,

Apologies for any gaps in the information. I hadn't spotted T305570 but I'll try to update that one as well with more clarity.

In answer to this question:

The racking details in that task didn't specify servers needing to go into specific racks (only general distribution across rows, using the same rows as aqs1010-1015), so just confirming whether these servers need to be in the exact racks of A1, D1, B2, E2, C3, F3 outlined in the task description to function properly?

The answer is no, these servers don't need to be in those exact numbered racks. In fact, the cassandra "rack" column in the task description should be ignored as far as the placement of servers within a row is concerned.

Apologies for this confusion. The "rack" mentioned here is only an internal property of Cassandra, by which it tries to maintain data availability in the event of node outages. It wasn't meant as a reference to a physical rack for your team, so any physical rack within the row is fine for us.
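
To illustrate (a sketch; I'm assuming the usual GossipingPropertyFileSnitch setup here, not quoting our actual config): the "rack" is nothing more than a label each node reads from its conf/cassandra-rackdc.properties file at startup.

```python
# Illustrative only (assumed snitch setup, not our real config files):
# with GossipingPropertyFileSnitch, a node's datacenter and "rack" are
# plain key=value labels in conf/cassandra-rackdc.properties.
example_properties = """\
dc=eqiad
rack=rack1
"""

settings = dict(
    line.split("=", 1)
    for line in example_properties.splitlines()
    if line.strip() and not line.startswith("#")
)
print(settings)  # -> {'dc': 'eqiad', 'rack': 'rack1'}
```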

Or is there some wiggle room to use other racks in these same rows?

Yes, any rack within the selected row is good for us. Ideally two different racks within the row for the two hosts, please.

For example, would it be possible to leave aqs1013 in rack C1 and install aqs1018 someplace in row D instead?

Not really, I'm afraid. This is the crux of it. We're trying to make sure that even if a whole row goes down, it affects no more than one of the cassandra racks; that way we can be sure of still serving 100% of the available data from the remaining hosts.

Therefore, for the six new servers, we really would like these to go into:

aqs1016 -> row a
aqs1017 -> row b
aqs1018 -> row c
aqs1019 -> row d
aqs1020 -> row e
aqs1021 -> row f

Once this is done, then we would really appreciate it if you could move three of the existing servers:

aqs1013 - c to d
aqs1014 - d to e
aqs1015 - d to f

That will complete the work to make AQS resilient to the failure of a whole row. Unfortunately, we couldn't find a way of achieving the level of row-failure resilience we want with fewer than three physical server moves, so I hope this isn't too inconvenient.
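
To double-check that claim (a sketch reusing the host/row data from this task): applying just those three moves leaves every eqiad row holding hosts from a single cassandra "rack".

```python
# Sketch: apply the three proposed moves to the current layout of the six
# existing hosts, then confirm no row holds more than one cassandra "rack".
layout = {
    "aqs1010": ("rack1", "a"), "aqs1013": ("rack1", "c"),
    "aqs1011": ("rack2", "b"), "aqs1014": ("rack2", "d"),
    "aqs1012": ("rack3", "c"), "aqs1015": ("rack3", "d"),
}
moves = {"aqs1013": "d", "aqs1014": "e", "aqs1015": "f"}
for host, new_row in moves.items():
    layout[host] = (layout[host][0], new_row)

rows = {}
for rack, row in layout.values():
    rows.setdefault(row, set()).add(rack)
print({row: sorted(racks) for row, racks in sorted(rows.items())})
# -> {'a': ['rack1'], 'b': ['rack2'], 'c': ['rack3'],
#     'd': ['rack1'], 'e': ['rack2'], 'f': ['rack3']}
```

The new hosts aqs1016-1021 then slot into the matching rows per the list above.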

Please do let me know if you have any other thoughts, suggestions, or concerns about this.

Got it, that makes sense. Thanks for the details and the feedback @BTullis. It definitely gives us a bit more flexibility knowing we can use different racks in those same rows. We'll go ahead and re-rack some of the new aqs1016-1021 servers to follow the proposed plan. Feel free to submit a Dc-Ops task (with the "ops-eqiad" project tag), along with some proposed timeframes for the physical move, and we'll get aqs1013-1015 migrated as well.

Thanks,
Willy

Please confirm there are no issues with these host locations:

host    | rack
aqs1016 | A3
aqs1017 | B5
aqs1018 | C5
aqs1019 | D3
aqs1020 | E2
aqs1021 | F2

Thanks
John

I've updated the description for codfw based on https://phabricator.wikimedia.org/T305568#7881920

In summary:

We have rows A, B, C & D there. We're distributing the new machines across A, B & D. We can treat A & C as equivalent for now, and if we ever expand, then pair B/E & D/F. This should be documented somewhere; any suggestions for where?
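
Until it has a better home, here is that convention as a sketch (the mapping is just my reading of the paragraph above):

```python
# Sketch of the codfw row convention described above: each cassandra
# "rack" has a primary row today, plus a paired row if we ever expand
# (codfw currently has rows A-D; E and F would be future rows).
codfw_rows = {
    "rack1": {"now": "a", "on_expansion": "c"},  # treat A & C as equivalent
    "rack2": {"now": "b", "on_expansion": "e"},  # pair B/E
    "rack3": {"now": "d", "on_expansion": "f"},  # pair D/F
}
```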

Feel free to submit a Dc-Ops task (with the "ops-eqiad" project tag), along with some proposed timeframes for the physical move, and we'll get aqs1013-1015 migrated as well.

T307035: Relocate hosts: aqs101[3-5] created, but it would probably be best to wait until the other 6 machines are up.

Please confirm no issues with these host locations (aqs1016: A3, aqs1017: B5, aqs1018: C5, aqs1019: D3, aqs1020: E2, aqs1021: F2).

This will work; thanks!