Degraded RAID on cloudvirt1018
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 8
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 8

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 0, Arm: 1
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
				Firmware state: =====> Rebuild <=====
				Media Type: Solid State Device
				Drive Temperature: 26C (78.80 F)

=== RaidStatus completed
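
Note: the affected drive (Enclosure 32, Slot 8) already shows "Rebuild" in the snapshot above. If you want to poll rebuild progress directly, MegaCLI's rebuild-progress query is the usual way; treat this as a sketch only, since the binary name and path differ between installs (megacli, MegaCli, MegaCli64):

$ sudo megacli -PDRbld -ShowProg -PhysDrv '[32:8]' -a0

The [32:8] enclosure:slot pair and adapter 0 are taken from the status output above.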

Event Timeline

taavi added a subscriber: taavi.

This is one of the localdisk hypervisors we use for Toolforge/Toolsbeta etcd, thankfully not a ToolsDB server

Mentioned in SAL (#wikimedia-cloud) [2021-11-28T17:48:48Z] <andrewbogott> moved cloudvirt1018 out of the 'localstorage' aggregate and into 'maintenance' for T296592. It will need to be moved back after the raid is rebuilt.
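
For reference, the aggregate shuffle in that SAL entry corresponds to standard Nova host-aggregate operations; assuming the plain openstack CLI and the aggregate names quoted above (the actual change may have gone through WMCS tooling instead), it amounts to roughly:

$ openstack aggregate remove host localstorage cloudvirt1018
$ openstack aggregate add host maintenance cloudvirt1018

Moving the host back after the rebuild is the reverse: remove it from 'maintenance' and add it back to 'localstorage'.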

Cmjohnson added subscribers: wiki_willy, RobH, Cmjohnson.

@wiki_willy @RobH this server is out of warranty; it has a 1.6TB SSD that has failed. I recommend buying a new SSD.

This server will be 5 years old in May 2022; do we want to put new hardware in a host that's going away in less than 6 months, or just move the refresh up to Q3?

++ @nskaggs & @Andrew - since this server is scheduled for a refresh in Q4 (line 130 on the procurement doc), are you guys ok with not fixing this, and just sticking with a Q4 (or Q3) refresh schedule? Thanks, Willy

@wiki_willy we can probably live without it for a few months; I'll do a bit of research to figure out what would be a good replacement (it was one in a set of three handling a weird use-case)

@wiki_willy Is there an advantage to getting the normal refresh order in now? Would you recommend it? I know prices and lead times have been an issue this year.

Hi @nskaggs - sure, we can totally order the "refresh of cloudvirt101[6-8]" now, and have it arrive in early Q3 if you want. The lead times with vendors have been fluctuating a bit, but I can definitely work with Finance to bump the forecast for this one a bit earlier in the budget. Does that work for you?

I just did some digging and some thinking and I don't see a reason why we need to rush this refresh. Most likely these hosts will be replaced with thinvirts which won't meet the 'localdisk' role that this host was playing anyway.

I've opened T296664 to discuss the future of fatvirts but we need to resolve that separately from the cloudvirt1018 refresh.

Thanks @Andrew, sounds good. I'll leave it on the schedule for a Q4 refresh for now, but let me know if you end up needing it earlier, and we can always adjust and push things up then. Thanks, Willy

@Andrew If we're going to leave this be, is it okay to close this task? Your server refresh can be tracked in a separate task.

Are we going to simply decommission this machine then and remove it from the rack? If so, then I would consider that the resolution of this ticket. But I think we should decide what the outcome of the existing server will be before closing.

Yep, let's just decom. I'll open a ticket for that and then close this one.