
Degraded RAID on an-worker1100
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1100. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 10 (Target Id: 10)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 10
			Drive's position: DiskGroup: 23, Span: 0, Arm: 0
			Media Error Count: 404
			Other Error Count: 13
			Predictive Failure Count: =====> 7 <=====
			Last Predictive Failure Event Seq Number: 3875

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 25C (77.00 F)

=== RaidStatus completed
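For the hardware ticket, the piece of information the DC tech needs is the enclosure/slot of the failed drive. A minimal sketch of pulling that out of a saved copy of the output above (the sample lines are copied from the dump; in practice you would pipe the real plugin output in instead):

```shell
#!/bin/sh
# Sketch: extract the failing drive's enclosure/slot from megacli output.
# Sample lines copied from the dump above; /tmp path is illustrative.
cat > /tmp/raid-status.txt <<'EOF'
Enclosure Device ID: 32
Slot Number: 10
Firmware state: =====> Failed <=====
EOF
# Remember the last enclosure/slot seen; print them when a drive is Failed.
awk '/Enclosure Device ID:/ {enc=$NF}
     /Slot Number:/ {slot=$NF}
     /Firmware state:.*Failed/ {print "enclosure " enc ", slot " slot}' /tmp/raid-status.txt
# → enclosure 32, slot 10
```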

Event Timeline

ticket opened with Dell! You have successfully submitted request SR1057103007.

@razzi

elukey@an-worker1100:~$ cat /proc/mounts  | grep /var/lib/hadoop/data
/dev/sdx1 /var/lib/hadoop/data/w ext4 rw,relatime 0 0

/dev/sdl1 /var/lib/hadoop/data/k ext4 ro,relatime 0 0   <============= (Note the "ro" read-only flag for this)

/dev/sdq1 /var/lib/hadoop/data/n ext4 rw,relatime 0 0
/dev/sde1 /var/lib/hadoop/data/e ext4 rw,relatime 0 0
/dev/sdf1 /var/lib/hadoop/data/g ext4 rw,relatime 0 0
/dev/sds1 /var/lib/hadoop/data/p ext4 rw,relatime 0 0
/dev/sdu1 /var/lib/hadoop/data/s ext4 rw,relatime 0 0
/dev/sdm1 /var/lib/hadoop/data/q ext4 rw,relatime 0 0
/dev/sdo1 /var/lib/hadoop/data/o ext4 rw,relatime 0 0
/dev/sdg1 /var/lib/hadoop/data/f ext4 rw,relatime 0 0
/dev/sdw1 /var/lib/hadoop/data/x ext4 rw,relatime 0 0
/dev/sdd1 /var/lib/hadoop/data/d ext4 rw,relatime 0 0
/dev/sdp1 /var/lib/hadoop/data/t ext4 rw,relatime 0 0
/dev/sdi1 /var/lib/hadoop/data/j ext4 rw,relatime 0 0
/dev/sdh1 /var/lib/hadoop/data/i ext4 rw,relatime 0 0
/dev/sdt1 /var/lib/hadoop/data/u ext4 rw,relatime 0 0
/dev/sdn1 /var/lib/hadoop/data/l ext4 rw,relatime 0 0
/dev/sdc1 /var/lib/hadoop/data/c ext4 rw,relatime 0 0
/dev/sdv1 /var/lib/hadoop/data/v ext4 rw,relatime 0 0
/dev/sdj1 /var/lib/hadoop/data/h ext4 rw,relatime 0 0
/dev/sdk1 /var/lib/hadoop/data/m ext4 rw,relatime 0 0
/dev/sdr1 /var/lib/hadoop/data/r ext4 rw,relatime 0 0
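The read-only remount of /var/lib/hadoop/data/k can be spotted without eyeballing the whole list; a minimal sketch, assuming the ext4 datadir layout shown above:

```shell
#!/bin/sh
# Sketch: list any ext4 mount whose options start with (or contain) "ro".
# Prints device and mount point for each read-only filesystem found.
awk '$3 == "ext4" && $4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts
```

On this host that would have printed only `/dev/sdl1 /var/lib/hadoop/data/k`.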

I did the following:

  1. commented out the disk's entry in /etc/fstab
  2. unmounted it manually - sudo umount /var/lib/hadoop/data/k
  3. ran puppet to regenerate the list of data directories for Yarn and HDFS
  4. the Yarn NodeManager was down due to this problem, but puppet brought it back up after step 3
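Steps 1-2 above can be sketched as follows; the sed edit is shown against a throwaway copy of fstab so nothing real is touched, and the umount line is the actual command used:

```shell
#!/bin/sh
# Sketch of steps 1-2: comment out the failed disk in fstab, then unmount it.
# Demo file stands in for /etc/fstab; entries are illustrative.
cat > /tmp/fstab.demo <<'EOF'
/dev/sdx1 /var/lib/hadoop/data/w ext4 defaults 0 2
/dev/sdl1 /var/lib/hadoop/data/k ext4 defaults 0 2
EOF
# Prefix '#' to the (uncommented) line for the failed disk's mount point.
sed -i 's|^[^#].*/var/lib/hadoop/data/k[[:space:]]|#&|' /tmp/fstab.demo
# sudo umount /var/lib/hadoop/data/k   # then unmount the dead disk
```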

After Chris swaps the disk, we'll need to follow the procedure to re-add it (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk), including restoring its entry in /etc/fstab.
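The fstab half of the re-add is just the reverse edit; a sketch on a throwaway copy (device and entry are illustrative, and the partition/format/mount steps from the wiki runbook are left as comments):

```shell
#!/bin/sh
# Sketch of the re-add after the swap: uncomment the disk's fstab entry.
# Demo file stands in for /etc/fstab.
cat > /tmp/fstab.readd <<'EOF'
#/dev/sdl1 /var/lib/hadoop/data/k ext4 defaults 0 2
EOF
sed -i 's|^#\(/dev/sdl1 /var/lib/hadoop/data/k\)|\1|' /tmp/fstab.readd
# then, per the runbook: partition + mkfs the new disk,
# sudo mount /var/lib/hadoop/data/k, and run puppet to pick it up again
```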

The disk has been swapped; I am resolving this task since the on-site work has been completed.

razzi removed a project: Analytics-Kanban.

For simplicity, I'll create a new task, and this one can stay resolved. Thanks @Cmjohnson!