
Erroneous node placement (AQS Cassandra cluster)
Open, Low, Public

Description

When configured with the NetworkTopologyStrategy (as the AQS cluster is), Cassandra distributes replicas across distinct points of failure. Cassandra uses the nomenclature "rack", but at the WMF we treat the rows in our datacenters as the unit of failure for replica placement. The table below shows the current placement according to the configuration, alongside where the nodes are actually located in the datacenter (eqiad only, at the moment).

host    | cassandra "rack" | datacenter row
aqs1010 | rack1            | a
aqs1013 | rack1            | c
aqs1011 | rack2            | b
aqs1014 | rack2            | d
aqs1012 | rack3            | c
aqs1015 | rack3            | d

We make heavy use of a replication factor of 3, and QUORUM consistency for both reads and writes. With replicas properly distributed over 3 or more rows, we can survive an entire row outage without any disruption to the service(s). The placement above is incorrect, though, because there are scenarios where a single row failure will result in outages: in our configuration, a failure of either row C or row D would drop a significant number of replica sets below quorum, creating outages.
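
To make the failure scenario concrete (a quick sketch, not production tooling; the host/row data is simply the table above): with a replication factor of 3, QUORUM requires floor(3/2) + 1 = 2 live replicas, and NetworkTopologyStrategy places one replica in each cassandra "rack". In the worst case, a row failure therefore costs one replica for every distinct rack present in that row:

```python
# Sketch: worst-case effect of a single row failure on the current
# eqiad placement. For some token ranges, every replica belonging to a
# rack with a host in the failed row lives on exactly those hosts.

current = {
    # host:    (cassandra "rack", datacenter row)
    "aqs1010": ("rack1", "a"),
    "aqs1013": ("rack1", "c"),
    "aqs1011": ("rack2", "b"),
    "aqs1014": ("rack2", "d"),
    "aqs1012": ("rack3", "c"),
    "aqs1015": ("rack3", "d"),
}

RF = 3
QUORUM = RF // 2 + 1  # floor(3/2) + 1 = 2

for row in sorted({row for _, row in current.values()}):
    racks_down = {rack for rack, r in current.values() if r == row}
    worst_case_live = RF - len(racks_down)
    verdict = "ok" if worst_case_live >= QUORUM else "BELOW QUORUM"
    print(f"row {row} down -> racks hit: {sorted(racks_down)} "
          f"-> {worst_case_live}/{RF} replicas live: {verdict}")

# Rows a and b leave 2/3 replicas (fine), but rows c and d each hit two
# racks at once, leaving 1/3: below quorum, i.e. an outage.
```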

Fixing this will mean decommissioning, physically relocating, and bootstrapping servers back into the cluster. Since this situation isn't new, it probably makes sense to wait until we've deployed the servers for the new expansion (T304173). We should, however, take the moves needed here into account when determining row placement for the new servers (my understanding is that row space is constrained in places).


Proposed

eqiad
host    | cassandra "rack" | datacenter row | target row
aqs1010 | rack1            | a              | a
aqs1013 | rack1            | c              | d
aqs1016 | rack1            | N/A            | a
aqs1019 | rack1            | N/A            | d
aqs1011 | rack2            | b              | b
aqs1014 | rack2            | d              | e
aqs1017 | rack2            | N/A            | b
aqs1020 | rack2            | N/A            | e
aqs1012 | rack3            | c              | c
aqs1015 | rack3            | d              | f
aqs1018 | rack3            | N/A            | c
aqs1021 | rack3            | N/A            | f
codfw
host    | cassandra "rack" | datacenter row | target row
aqs2001 | rack1            | N/A            | a
aqs2002 | rack1            | N/A            | a
aqs2003 | rack1            | N/A            | a
aqs2004 | rack1            | N/A            | a
aqs2005 | rack2            | N/A            | b
aqs2006 | rack2            | N/A            | b
aqs2007 | rack2            | N/A            | b
aqs2008 | rack2            | N/A            | b
aqs2009 | rack3            | N/A            | d
aqs2010 | rack3            | N/A            | d
aqs2011 | rack3            | N/A            | d
aqs2012 | rack3            | N/A            | d
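
The target layouts can be checked mechanically (again a sketch, with the host/row data copied from the tables above): a single row failure can never break quorum as long as no physical row contains hosts from more than one cassandra "rack".

```python
from collections import defaultdict

# Sketch: verify that in the proposed layouts no physical row mixes
# cassandra "racks"; a single row failure then costs at most one of the
# three replicas, and 2/3 still satisfies QUORUM.

proposed = {
    "eqiad": {
        "aqs1010": ("rack1", "a"), "aqs1016": ("rack1", "a"),
        "aqs1013": ("rack1", "d"), "aqs1019": ("rack1", "d"),
        "aqs1011": ("rack2", "b"), "aqs1017": ("rack2", "b"),
        "aqs1014": ("rack2", "e"), "aqs1020": ("rack2", "e"),
        "aqs1012": ("rack3", "c"), "aqs1018": ("rack3", "c"),
        "aqs1015": ("rack3", "f"), "aqs1021": ("rack3", "f"),
    },
    "codfw": {  # four hosts per rack: rack1 -> a, rack2 -> b, rack3 -> d
        f"aqs20{n:02d}": (rack, row)
        for rack, row, first in (("rack1", "a", 1),
                                 ("rack2", "b", 5),
                                 ("rack3", "d", 9))
        for n in range(first, first + 4)
    },
}

for dc, layout in proposed.items():
    racks_per_row = defaultdict(set)
    for rack, row in layout.values():
        racks_per_row[row].add(rack)
    assert all(len(r) == 1 for r in racks_per_row.values()), dc
    print(f"{dc}: no row mixes racks; any single row failure leaves 2/3 replicas")
```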

Event Timeline

Eevans updated the task description.
LSobanski updated the task description.

As discussed in a meeting, we have decided to use the v2 proposal above for eqiad and the v1 proposal for codfw.

In terms of the timeline, we will:

  • rack the 6 new nodes in eqiad
  • bring them online in the cluster
  • move three hosts (aqs1013, aqs1014, aqs1015) from their current locations to their v2-specified locations (assuming DC Ops has no issue with this request)
  • rack the 12 new hosts in codfw with their v1 style allocations, as shown above.

I will update T304173 and T305568 with the newly proposed racking details.

Hi - John already started racking some of the new aqs1016-1021 servers in T305570. The racking details in that task didn't specify servers needing to go into specific racks (only general distribution across rows, using the same rows as aqs1010-1015), so just confirming whether these servers need to be in the exact racks of A1, D1, B2, E2, C3, F3 outlined in the task description to function properly? Or is there some wiggle room to use other racks in these same rows?

For example, would it be possible to leave aqs1013 in rack C1 and install aqs1018 someplace in row D instead? Similarly, would it still work if we left aqs1014 in D2, and installed aqs1019 in row E as an alternative? Basically, just clarifying what the criteria are, to see if it's possible to avoid any physical server moves. Thanks in advance. ~Willy

Hi Willy,

Apologies for any gaps in the information. I hadn't spotted T305570 but I'll try to update that one as well with more clarity.

In answer to this question:

The racking details in that task didn't specify servers needing to go into specific racks (only general distribution across rows, using the same rows as aqs1010-1015), so just confirming whether these servers need to be in the exact racks of A1, D1, B2, E2, C3, F3 outlined in the task description to function properly?

The answer is no, these servers don't need to be in those exact numbered racks. In fact, the cassandra "rack" column in the task description should be ignored as far as the placement of servers within a row is concerned.

Apologies for this confusion. The "rack" mentioned here is only an internal property of Cassandra, by which it tries to maintain data availability in the event of node outages. It wasn't meant as a reference to a physical rack for your team, so any physical rack within the row is fine for us.
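
To illustrate (a sketch; I'm assuming the usual GossipingPropertyFileSnitch setup here, not quoting our actual config): the "rack" is nothing more than a label each node reads from its conf/cassandra-rackdc.properties file at startup.

```python
# Illustrative only (assumed snitch setup, not our real config files):
# with GossipingPropertyFileSnitch, a node's datacenter and "rack" are
# plain key=value labels in conf/cassandra-rackdc.properties.
example_properties = """\
dc=eqiad
rack=rack1
"""

settings = dict(
    line.split("=", 1)
    for line in example_properties.splitlines()
    if line.strip() and not line.startswith("#")
)
print(settings)  # -> {'dc': 'eqiad', 'rack': 'rack1'}
```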

Or is there some wiggle room to use other racks in these same rows?

Yes, any rack within the selected row is good for us. Ideally two different racks within the row for the two hosts, please.

For example, would it be possible to leave aqs1013 in rack C1 and install aqs1018 someplace in row D instead?

Not really, I'm afraid. This is the crux of it. We're trying to make sure that even if a whole row goes down, it affects no more than one of the cassandra racks; that way we can be sure of still serving 100% of the available data from the remaining hosts.

Therefore, for the six new servers, we really would like these to go into:

aqs1016 -> row a
aqs1017 -> row b
aqs1018 -> row c
aqs1019 -> row d
aqs1020 -> row e
aqs1021 -> row f

Once this is done, then we would really appreciate it if you could move three of the existing servers:

aqs1013 - c to d
aqs1014 - d to e
aqs1015 - d to f

That will complete the work to make AQS resilient to the failure of a whole row. Unfortunately, we couldn't find a way of achieving the level of row-failure resilience we want with fewer than three physical server moves, so I hope this isn't too inconvenient.
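
To double-check that claim (a sketch reusing the host/row data from this task): applying just those three moves leaves every eqiad row holding hosts from a single cassandra "rack".

```python
# Sketch: apply the three proposed moves to the current layout of the six
# existing hosts, then confirm no row holds more than one cassandra "rack".
layout = {
    "aqs1010": ("rack1", "a"), "aqs1013": ("rack1", "c"),
    "aqs1011": ("rack2", "b"), "aqs1014": ("rack2", "d"),
    "aqs1012": ("rack3", "c"), "aqs1015": ("rack3", "d"),
}
moves = {"aqs1013": "d", "aqs1014": "e", "aqs1015": "f"}
for host, new_row in moves.items():
    layout[host] = (layout[host][0], new_row)

rows = {}
for rack, row in layout.values():
    rows.setdefault(row, set()).add(rack)
print({row: sorted(racks) for row, racks in sorted(rows.items())})
# -> {'a': ['rack1'], 'b': ['rack2'], 'c': ['rack3'],
#     'd': ['rack1'], 'e': ['rack2'], 'f': ['rack3']}
```

The new hosts aqs1016-1021 then slot into the matching rows per the list above.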

Please do let me know if you have any other thoughts, suggestions, or concerns about this.

Got it, that makes sense. Thanks for the details and the feedback @BTullis. It definitely gives us a bit more flexibility knowing we can use different racks in those same rows. We'll go ahead and re-rack some of the new aqs1016-1021 servers to follow the proposed plan. Feel free to submit a Dc-Ops task (with the "ops-eqiad" project tag), along with some proposed timeframes for the physical move, and we'll get aqs1013-1015 migrated as well.

Thanks,
Willy

Please confirm there are no issues with these host locations:

host    | rack
aqs1016 | A3
aqs1017 | B5
aqs1018 | C5
aqs1019 | D3
aqs1020 | E2
aqs1021 | F2

Thanks
John

I've updated the description for codfw based on https://phabricator.wikimedia.org/T305568#7881920

In summary:

We have rows A, B, C & D there. We're distributing the new machines across A, B & D. We can treat A & C as equivalent for now, and if we ever expand, then pair B/E & D/F. This should be documented somewhere; any suggestions for where?
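
Until it has a better home, here is that convention as a sketch (the mapping is just my reading of the paragraph above):

```python
# Sketch of the codfw row convention described above: each cassandra
# "rack" has a primary row today, plus a paired row if we ever expand
# (codfw currently has rows A-D; E and F would be future rows).
codfw_rows = {
    "rack1": {"now": "a", "on_expansion": "c"},  # treat A & C as equivalent
    "rack2": {"now": "b", "on_expansion": "e"},  # pair B/E
    "rack3": {"now": "d", "on_expansion": "f"},  # pair D/F
}
```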

Feel free to submit a Dc-Ops task (with the "ops-eqiad" project tag), along with some proposed timeframes for the physical move, and we'll get aqs1013-1015 migrated as well.

T307035: Relocate hosts: aqs101[3-5] created, but it would probably be best to wait until the other 6 machines are up.

Please confirm no issues with these host locations (aqs1016: A3, aqs1017: B5, aqs1018: C5, aqs1019: D3, aqs1020: E2, aqs1021: F2).

This will work; thanks!