Re-IP Swift hosts to per-rack subnets in codfw row A and B.
Open, MediumPublic
Actions

Assigned To

None

Authored By

	cmooney
	Jan 11 2024, 2:47 PM

Description

As part of the move from per-row to per-rack redundancy model hosts in codfw rows A and B need to be configured / moved to new per-rack vlans/subnets. This work can be tackled once we have completed the physical move of all hosts in those rows from old 'asw' switch devices to new 'lsw' ones.

In discussion on irc we touched on some of the challenges for these hosts, which as I understand may use IP addresses as identifiers. We also need to consider how clusters function with hosts on different subnets that were previous layer-2 adjacent.

Creating this task so we can discuss the options, make plans and test the way forward.

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Open		None	T354869 Re-IP hosts on codfw row A and B to new per-rack vlans/subnets
		Open		None	T354872 Re-IP Swift hosts to per-rack subnets in codfw row A and B.

Event Timeline

cmooney triaged this task as Medium priority.Jan 11 2024, 2:47 PM

cmooney created this task.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 11 2024, 2:47 PM

cmooney added a parent task: T354869: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets.Jan 11 2024, 3:48 PM

Swift uses IP(v4) address (and then device name) as the identifier for entries in its rings.

Additionally, when adding nodes to the ring, we use IP address to tell where the node is located, and thus which "zone" it should be in (the zones are used to make sure each of the three replicas is in a different row) - see the find_ip_zone function.

The safest approach would be to drain a node & remove it from the rings, then renumber it and add it again. But a drain takes 2-3 weeks (we do it gradually to avoid overload), and a reload the same time again.

In theory swift-ring-builder has a set_info command with a --change-ip argument, so one could change every device on a node in the rings, renumber it and push out the new rings. We'd need to write some tooling to do this, and I've no idea how safe such an operation is.

In either approach, extra constraints are that we'd not want too many nodes "in flight" at once, because swift will try and backfill to make up for missing/down devices and we need to avoid overloading (in terms of load or capacity) the rest of the cluster; and that you have to wait 12 hours between changes to the rings.

Sorry, I think object stores are often not really written with renumbering in mind...

MatthewVernon added a project: SRE-swift-storage.Feb 9 2024, 3:44 PM

Re-IP Swift hosts to per-rack subnets in codfw row A and B.Open, MediumPublicActions

Description

Related ObjectsSearch...

Event Timeline

Re-IP Swift hosts to per-rack subnets in codfw row A and B.
Open, MediumPublic
Actions

Related Objects
Search...