
mw2420-mw2451 do have unnecessary raid controllers (configured)
Closed, ResolvedPublic

Description

While working on T357380: Degraded RAID on mw2442 I noticed a strange-looking RAID controller config (for each disk there is one "RAID-0" virtual drive created, which is then used to build an mdadm software RAID), which turns out to be the same on all 32 hosts of that batch.

According to the procurement task T325215, those systems (Config C-1G) should not have RAID controllers at all, so I assume something went wrong during procurement as well as during provisioning, since T326362 does not request a HW-RAID config.

This is not ideal: it makes those 32 hosts different from the others, and it requires extra care/extra steps in case of disk replacements (see T357380#9575876). We should probably re-provision those hosts with the RAID controllers configured in HBA mode, or have the RAID controllers removed (if that's even possible and it makes sense to keep spares).

Opening this task to figure out what to do, or at least to keep the information around: the following steps are probably needed after a disk replacement to make the new disk appear in the OS:

# List any preserved cache entries left behind by the failed disk
megacli -GetPreservedCacheList -a0
# Discard the preserved cache for the replaced virtual drive
megacli -DiscardPreservedCache -L'disk_number' -a0
# Re-create a one-disk RAID-0 virtual drive on each unconfigured disk
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
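The steps above can be wrapped in a small helper. This is only a sketch, not an existing script: the megacli invocations are taken verbatim from the commands above, `fix_replaced_disk` is a hypothetical name, and the `DRY_RUN` switch only prints the commands so the sketch can be exercised without the hardware.

```shell
# Sketch of the per-disk replacement steps. Assumes megacli is on PATH;
# with DRY_RUN=1 the commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

fix_replaced_disk() {
  ld="$1"  # number of the virtual drive backed by the replaced disk
  run megacli -GetPreservedCacheList -a0
  run megacli -DiscardPreservedCache -L"$ld" -a0
  run megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
}

# Dry run against virtual drive 3 (prints the three commands):
DRY_RUN=1 fix_replaced_disk 3
```

On a real host you would run it without `DRY_RUN` and with the virtual-drive number reported by the controller for the replaced disk.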

GUI procedure to switch HW-RAID controller to Enhanced HBA

  • Connect to the iDRAC HTTP interface through an SSH tunnel
  • Go to "Storage", "Virtual disks"
  • Click the dropdown for the first virtual disk, choose "Delete", then "Apply later"
  • Repeat for the second virtual disk, but choose "Apply now", then "Job queue"
  • Once you see the task as complete in the Job queue, go to "Storage", "Controllers"
  • Click the controller dropdown, choose "Edit"
  • Change "Controller Mode" to "Enhanced HBA"
  • Click "Add to pending", "At next reboot"
  • Perform a warm reboot from the UI
  • Run sudo cookbook sre.hosts.provision --no-dhcp --no-switch --no-users myhost
  • When asked if the RAID was modified, type modified and proceed
  • You can monitor the state of the RAID configuration through the job queue
  • The cookbook will probably error because the config change takes a while; just wait until the configuration step completes, then type retry
  • You can now rename and reimage the server
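The wait-then-retry step at the end of the procedure amounts to a bounded polling loop. The sketch below is generic, not part of any cookbook: `job_is_complete` is a hypothetical stand-in for checking the iDRAC job queue (here stubbed to succeed on the third poll so the loop can be exercised).

```shell
# Generic wait-then-retry sketch for "the config change takes a while".
# job_is_complete is a hypothetical stand-in for a real job-queue check;
# here it is stubbed to report completion on the third poll.
POLLS=0
job_is_complete() {
  POLLS=$((POLLS + 1))
  [ "$POLLS" -ge 3 ]
}

wait_for_job() {
  tries=0
  until job_is_complete; do
    tries=$((tries + 1))
    if [ "$tries" -ge "${MAX_TRIES:-30}" ]; then
      echo "job did not complete in time" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-1}"
  done
  echo "job complete after $POLLS poll(s)"
}

wait_for_job
```

With a real check in place of the stub, `MAX_TRIES` and `POLL_INTERVAL` would be tuned to how long the controller-mode change actually takes.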

Hosts with HW-RAID controller left to switch to Enhanced HBA

  • mw2422 (now wikikube-worker2074)
  • mw2423 (now wikikube-worker2075)
  • mw2420 (now wikikube-worker2091)
  • mw2421 (now wikikube-worker2092)
  • mw2424 (now wikikube-worker2124)
  • mw2425 (now wikikube-worker2125)
  • mw2426 (now wikikube-worker2126)
  • mw2427 (now wikikube-worker2127)
  • mw2430 (now wikikube-worker2103)
  • mw2431 (now wikikube-worker2104)
  • mw2428 (now wikikube-worker2105)
  • mw2429 (now wikikube-worker2106)
  • mw2432 (now wikikube-worker2035)
  • mw2433 (now wikikube-worker2036)
  • mw2434 (now wikikube-worker2089)
  • mw2435 (now wikikube-worker2090)
  • mw2436
  • mw2437
  • mw2438 (now wikikube-worker2037)
  • mw2439 (now wikikube-worker2038)
  • mw2440
  • mw2441 (now wikikube-worker2039)
  • mw2442
  • mw2443
  • mw2444
  • mw2445
  • mw2446
  • mw2447
  • mw2448
  • mw2449
  • mw2450
  • mw2451

Event Timeline

JMeybohm renamed this task from mw2420-mw2451 do have unncecesarry raid controllers (configured to mw2420-mw2451 do have unncecesarry raid controllers (configured).

If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).

Moritz asked me about this, and I have some background. Orders placed in January 2023 via the Dell portal for standard configs included a number of hosts with RAID controllers that should not have had them.

We did not 'pay' for the RAID controllers, as we had the set Config C per-unit price, but the config was mistakenly built with a RAID controller. This has since been fixed (it was discovered when the first batch of orders landed) but requires a workaround on those hosts.

The workaround: set up each disk as its own one-disk RAID-0, and the host then operates like a RAID-less system in terms of OS partitioning and the like. We've since modified our ordering process to ensure this doesn't happen again.

I'm told there is a question on 'can we pull these raid controllers to use elsewhere' and the answer is 'no, or the host you remove it from has no controller.'

These are wired with cables from the backplane to the RAID controller; those are not the custom-length cables that would be needed to route within the chassis to an onboard SATA controller, which likely isn't even present in these hosts. I would not recommend pulling hardware from one R440 and installing it into another host, as it would break the warranty for both.

It could be possible, but it would require someone to take the time to offline a host from this small batch of affected hosts and see if the cables can reach and if an onboard controller is even present. As this was a small one-off event of Config C having RAID, I'd recommend we just leave it as is.

Please note that my understanding could be wrong; we may want to create a task for on-site to pull one of these hosts and double-check the above.

We could verify that while working on T351074: Move servers from the appserver/api cluster to kubernetes (although most of the servers have unfortunately already been moved to k8s).

We'll first try with one of the servers not yet migrated to k8s, to see if it has any performance implications, and then phase out the RAID config when the servers need reimaging for the name change anyway.

JMeybohm renamed this task from mw2420-mw2451 do have unncecesarry raid controllers (configured) to mw2420-mw2451 do have unnecessary raid controllers (configured). Feb 28 2024, 5:49 PM

@JMeybohm hello is there anything DC-ops need to do on this task?

No, all good from your end. Cheers

Change #1054531 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.hosts.convert-disk: Generalize sre.swift.convert-disks

https://gerrit.wikimedia.org/r/1054531

Icinga downtime and Alertmanager silence (ID=db2972bf-cd24-4ee8-ba43-a5d1d6710956) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: RAID conversion testing

mw2432.codfw.wmnet

Change #1055237 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

I *tried* very hard to automate it with a cookbook, but the behavior is wildly inconsistent between runs: sometimes deleting vdisks requires a reboot, sometimes not; sometimes the vdisk deletion starts, only to fail without an explanation.

These PERC H745 controllers behave differently from the integrated controllers we usually have, which means Dell's own action URLs like ConvertToNonRAID (used by the sre.swift.convert-disks cookbook) don't work either.

It is *mostly* the virtual disk removal automation that poses problems. Using redfish.scp_dump to get the RAID configuration and switch it to Enhanced HBA seems to work relatively reliably, but it is slow and not worth automating if it's the only part we can script while still having to do the vdisk removal in the GUI.

At this point, I am modifying the cookbook every run to try and get it to work, and I probably could have configured all 30-something hosts in this task through the GUI in the time I've spent on this.

The GUI procedure is documented in the task description above.

Change #1055237 merged by Clément Goubert:

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

Change #1057829 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Change #1057829 merged by Clément Goubert:

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Clement_Goubert updated the task description. (Show Details)

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2092 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410111058_cgoubert_3576371_wikikube-worker2092.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Clement_Goubert changed the task status from Open to In Progress. Oct 11 2024, 11:28 AM
Clement_Goubert updated the task description. (Show Details)
JMeybohm claimed this task.
JMeybohm updated the task description. (Show Details)

Well, that was a pretty painful experience - thanks @Clement_Goubert for working out the procedure!