
Degraded RAID on an-worker1099
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1099. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 17 (Target Id: 17)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 17
			Drive's position: DiskGroup: 17, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 26C (78.80 F)

=== RaidStatus completed
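
For anyone picking this up on site: a quick way to identify the physical disk is to blink its locate LED using the enclosure/slot pair from the output above. This is a sketch, not a verified command line for this host (the megacli binary name varies by install, e.g. MegaCli64):

$ sudo megacli -PdLocate -start -physdrv[32:17] -a0   # blink the LED on enclosure 32, slot 17
$ sudo megacli -PdLocate -stop  -physdrv[32:17] -a0   # stop blinking once the disk is found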

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-01-27T15:42:13Z] <elukey> umount /var/hadoop/data/r on an-worker1099 and restart hadoop daemons - T273034

@Ottomata @razzi this is the first datanode disk failure since the change I made to use facter to populate the list of partitions that Yarn and HDFS can use on a given worker node. In the past I had to explicitly remove the failed partition from the Yarn NodeManager's list, since the NodeManager doesn't tolerate failures as well as the Datanode does. So what I did this time was:

  • Comment out the failed partition (after checking dmesg) in /etc/fstab
  • Unmount the partition
  • Run puppet (the Yarn and HDFS configs get updated)
  • Restart the daemons to pick up the changes

We should do the reverse once a new disk is back online; see the sketch below. Maybe with newer versions of Hadoop we won't need this anymore, but so far it seems quick and easy :)
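
For reference, a minimal sketch of the steps above, assuming the failed partition is the /var/hadoop/data/r mount from the SAL entry; the service names (hadoop-hdfs-datanode, hadoop-yarn-nodemanager) and the run-puppet-agent wrapper are assumptions to double-check on the host:

$ sudo dmesg -T | grep -i 'sd'        # confirm which device is throwing errors
$ sudo vim /etc/fstab                 # comment out the /var/hadoop/data/r line
$ sudo umount /var/hadoop/data/r
$ sudo run-puppet-agent               # regenerates the Yarn/HDFS partition lists via facter
$ sudo systemctl restart hadoop-hdfs-datanode hadoop-yarn-nodemanager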

A replacement hard drive has been requested from Dell. You have successfully submitted request SR1050179269.

Legoktm triaged this task as Medium priority. Jan 28 2021, 8:24 PM

@razzi today I remembered this task by chance: I had to follow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk to add the new disk back (it was not in use yet). Please read the docs and tell me what is missing/unclear (I have updated them with new info); a rough sketch of the re-add follows below.
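
For the record, a rough sketch of the re-add steps (the reverse of the earlier list). The enclosure/slot pair [32:17] comes from the RAID status output above; the device name /dev/sdX, the filesystem choice, and the service names are assumptions, so follow the wikitech page for the authoritative procedure:

$ sudo megacli -PDList -a0 | grep -B2 -A10 'Slot Number: 17'   # new disk should show Unconfigured(good)
$ sudo megacli -CfgLdAdd -r0 [32:17] -a0                       # recreate the single-disk RAID-0 virtual drive
$ sudo mkfs.ext4 /dev/sdX                                      # format whatever block device the new VD maps to
$ sudo vim /etc/fstab                                          # uncomment the /var/hadoop/data/r line
$ sudo mount /var/hadoop/data/r
$ sudo run-puppet-agent                                        # re-adds the partition to the Yarn/HDFS configs
$ sudo systemctl restart hadoop-hdfs-datanode hadoop-yarn-nodemanager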