
Degraded RAID on an-worker1100
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1100. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 10 (Target Id: 10)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 10
			Drive's position: DiskGroup: 23, Span: 0, Arm: 0
			Media Error Count: 404
			Other Error Count: 13
			Predictive Failure Count: =====> 7 <=====
			Last Predictive Failure Event Seq Number: 3875

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 25C (77.00 F)

=== RaidStatus completed
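For the hardware ticket, the piece of information the DC tech needs is the enclosure/slot of the failed drive. A minimal sketch of pulling that out of a saved copy of the output above (the sample lines are copied from the dump; in practice you would pipe the real plugin output in instead):

```shell
#!/bin/sh
# Sketch: extract the failing drive's enclosure/slot from megacli output.
# Sample lines copied from the dump above; /tmp path is illustrative.
cat > /tmp/raid-status.txt <<'EOF'
Enclosure Device ID: 32
Slot Number: 10
Firmware state: =====> Failed <=====
EOF
# Remember the last enclosure/slot seen; print them when a drive is Failed.
awk '/Enclosure Device ID:/ {enc=$NF}
     /Slot Number:/ {slot=$NF}
     /Firmware state:.*Failed/ {print "enclosure " enc ", slot " slot}' /tmp/raid-status.txt
# → enclosure 32, slot 10
```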

Event Timeline

ticket opened with Dell! You have successfully submitted request SR1057103007.

@razzi

elukey@an-worker1100:~$ cat /proc/mounts  | grep /var/lib/hadoop/data
/dev/sdx1 /var/lib/hadoop/data/w ext4 rw,relatime 0 0

/dev/sdl1 /var/lib/hadoop/data/k ext4 ro,relatime 0 0   <============= (Note the "ro" read-only flag for this)

/dev/sdq1 /var/lib/hadoop/data/n ext4 rw,relatime 0 0
/dev/sde1 /var/lib/hadoop/data/e ext4 rw,relatime 0 0
/dev/sdf1 /var/lib/hadoop/data/g ext4 rw,relatime 0 0
/dev/sds1 /var/lib/hadoop/data/p ext4 rw,relatime 0 0
/dev/sdu1 /var/lib/hadoop/data/s ext4 rw,relatime 0 0
/dev/sdm1 /var/lib/hadoop/data/q ext4 rw,relatime 0 0
/dev/sdo1 /var/lib/hadoop/data/o ext4 rw,relatime 0 0
/dev/sdg1 /var/lib/hadoop/data/f ext4 rw,relatime 0 0
/dev/sdw1 /var/lib/hadoop/data/x ext4 rw,relatime 0 0
/dev/sdd1 /var/lib/hadoop/data/d ext4 rw,relatime 0 0
/dev/sdp1 /var/lib/hadoop/data/t ext4 rw,relatime 0 0
/dev/sdi1 /var/lib/hadoop/data/j ext4 rw,relatime 0 0
/dev/sdh1 /var/lib/hadoop/data/i ext4 rw,relatime 0 0
/dev/sdt1 /var/lib/hadoop/data/u ext4 rw,relatime 0 0
/dev/sdn1 /var/lib/hadoop/data/l ext4 rw,relatime 0 0
/dev/sdc1 /var/lib/hadoop/data/c ext4 rw,relatime 0 0
/dev/sdv1 /var/lib/hadoop/data/v ext4 rw,relatime 0 0
/dev/sdj1 /var/lib/hadoop/data/h ext4 rw,relatime 0 0
/dev/sdk1 /var/lib/hadoop/data/m ext4 rw,relatime 0 0
/dev/sdr1 /var/lib/hadoop/data/r ext4 rw,relatime 0 0
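The read-only remount of /var/lib/hadoop/data/k can be spotted without eyeballing the whole list; a minimal sketch, assuming the ext4 datadir layout shown above:

```shell
#!/bin/sh
# Sketch: list any ext4 mount whose options start with (or contain) "ro".
# Prints device and mount point for each read-only filesystem found.
awk '$3 == "ext4" && $4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts
```

On this host that would have printed only `/dev/sdl1 /var/lib/hadoop/data/k`.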

I did the following:

  1. commented out the disk's entry in /etc/fstab
  2. unmounted it manually - sudo umount /var/lib/hadoop/data/k
  3. ran puppet to regenerate the list of data directories for Yarn and HDFS
  4. the Yarn NodeManager was down due to this problem, but puppet brought it back up after step 3
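Steps 1-2 above can be sketched as follows; the sed edit is shown against a throwaway copy of fstab so nothing real is touched, and the umount line is the actual command used:

```shell
#!/bin/sh
# Sketch of steps 1-2: comment out the failed disk in fstab, then unmount it.
# Demo file stands in for /etc/fstab; entries are illustrative.
cat > /tmp/fstab.demo <<'EOF'
/dev/sdx1 /var/lib/hadoop/data/w ext4 defaults 0 2
/dev/sdl1 /var/lib/hadoop/data/k ext4 defaults 0 2
EOF
# Prefix '#' to the (uncommented) line for the failed disk's mount point.
sed -i 's|^[^#].*/var/lib/hadoop/data/k[[:space:]]|#&|' /tmp/fstab.demo
# sudo umount /var/lib/hadoop/data/k   # then unmount the dead disk
```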

After Chris swaps the disk, we'll need to follow the procedure to re-add it (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk), including restoring its entry in /etc/fstab.
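The fstab half of the re-add is just the reverse edit; a sketch on a throwaway copy (device and entry are illustrative, and the partition/format/mount steps from the wiki runbook are left as comments):

```shell
#!/bin/sh
# Sketch of the re-add after the swap: uncomment the disk's fstab entry.
# Demo file stands in for /etc/fstab.
cat > /tmp/fstab.readd <<'EOF'
#/dev/sdl1 /var/lib/hadoop/data/k ext4 defaults 0 2
EOF
sed -i 's|^#\(/dev/sdl1 /var/lib/hadoop/data/k\)|\1|' /tmp/fstab.readd
# then, per the runbook: partition + mkfs the new disk,
# sudo mount /var/lib/hadoop/data/k, and run puppet to pick it up again
```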

The disk has been swapped; I am resolving this task since the on-site work has been completed.

razzi removed a project: Analytics-Kanban.

For simplicity, I'll create a new task, and this one can stay resolved. Thanks @Cmjohnson!