
Degraded RAID on an-worker1096
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1096. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'an-worker1096', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
b'CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.\n'
STDERR:
None

Event Timeline

odimitrijevic triaged this task as Medium priority.
odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.
elukey@an-worker1096:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 6 (Target Id: 6)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 6, Span: 0, Arm: 0
			Media Error Count: 2
			Other Error Count: 12
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 1.819 TB [0xe8e088b0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

=== RaidStatus completed

@razzi a couple of things to remember:

  1. comment out the /etc/fstab entry for the partition on the failed disk (if we reboot, the broken disk may cause issues)
  2. umount the partition (if needed), run puppet (the yarn/hdfs config should then list one less partition to use) and restart the Yarn NodeManager (see the sketch below)
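
A rough sketch of those two steps, not taken from the task itself: it assumes the failed disk is the one mounted on /var/lib/hadoop/data/g and that the NodeManager unit is named hadoop-yarn-nodemanager (both inferred from the logs further down, so double-check on the host).

# 1. comment out the fstab entry so a reboot does not trip over the dead disk
sudo sed -i.bak '\|/var/lib/hadoop/data/g | s/^/#/' /etc/fstab

# 2. unmount it (may fail with I/O errors if the disk is already gone),
#    let puppet regenerate the hadoop configs with one datadir less,
#    then restart the NodeManager so it picks them up
sudo umount /var/lib/hadoop/data/g
sudo puppet agent -t
sudo systemctl restart hadoop-yarn-nodemanager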

A new disk has been ordered and will be here this week.

You have successfully submitted request SR1070175430.

Replaced the disk and added it back to the array.

cmjohnson@an-worker1096:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0

Adapter 0: Created VD 6
Configured physical device at Encl-32:Slot-6.

1 physical devices are Configured on adapter 0.

Exit Code: 0x00
cmjohnson@an-worker1096:~$ sudo megacli -PDList -aALL | grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
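
As a hedged follow-up check (not part of the original paste), the rebuilt virtual drive itself can be inspected too; this assumes the same adapter/VD/slot numbering shown above:

# State of the recreated virtual drive (adapter 0, VD 6); a healthy
# single-disk RAID-0 should report "State: Optimal"
sudo megacli -LDInfo -L6 -a0

# Error counters and firmware state of the new drive in enclosure 32, slot 6
sudo megacli -PDInfo -PhysDrv '[32:6]' -a0 | grep -iE 'error|firmware state'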

For the record, puppet was failing with:

Sep 14 12:01:09 an-worker1096 puppet-agent[35073]: (/Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/g]/File[/var/lib/hadoop/data/g/yarn]) Could not evaluate: Input/output error @ rb_file_s_lstat - /var/lib/hadoop/data/g/yarn
Sep 14 12:01:09 an-worker1096 puppet-agent[35073]: (/Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/g]/File[/var/lib/hadoop/data/g/yarn/local]) Dependency File[/var/lib/hadoop/data/g/yarn] has failures: true
Sep 14 12:01:09 an-worker1096 puppet-agent[35073]: (/Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/g]/File[/var/lib/hadoop/data/g/yarn/local]) Skipping because of failed dependencies
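
The Input/output error above is what a dead underlying disk typically looks like from the filesystem layer. A quick way to confirm it on the host (these checks are an illustration, not commands recorded in this task):

# Any access to the mount point on the failed disk should return I/O errors
sudo ls -l /var/lib/hadoop/data/g/yarn

# Is the partition still listed as mounted?
grep /var/lib/hadoop/data/g /proc/mounts

# The kernel log usually shows the underlying SCSI/ATA errors
sudo dmesg | grep -iE 'i/o error|sd[a-z]' | tail -n 20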

Then there was an update:

Sep 14 16:01:19 an-worker1096 puppet-agent[21841]: (/Stage[main]/Bigtop::Hadoop/File[/etc/hadoop/conf.analytics-hadoop/yarn-site.xml]/content) -    <value>/var/lib/hadoop/data/c/yarn/local,/var/lib/hadoop/data/f/yarn/local,/var/lib/hadoop/data/d/yarn/local,/var/lib/hadoop/data/e/yarn/local,/var/lib/hadoop/data/i/yarn/local,/var/lib/hadoop/data/g/yarn/local,/var/lib/hadoop/data/h/yarn/local,/var/lib/hadoop/data/j/yarn/local,/var/lib/hadoop/data/m/yarn/local,/var/lib/hadoop/data/l/yarn/local,/var/lib/hadoop/data/n/yarn/local,/var/lib/hadoop/data/k/yarn/local,/var/lib/hadoop/data/o/yarn/local,/var/lib/hadoop/data/p/yarn/local,/var/lib/hadoop/data/q/yarn/local,/var/lib/hadoop/data/u/yarn/local,/var/lib/hadoop/data/t/yarn/local,/var/lib/hadoop/data/r/yarn/local,/var/lib/hadoop/data/s/yarn/local,/var/lib/hadoop/data/w/yarn/local,/var/lib/hadoop/data/x/yarn/local,/var/lib/hadoop/data/v/yarn/local</value>
Sep 14 16:01:19 an-worker1096 puppet-agent[21841]: (/Stage[main]/Bigtop::Hadoop/File[/etc/hadoop/conf.analytics-hadoop/yarn-site.xml]/content) +    <value>/var/lib/hadoop/data/c/yarn/local,/var/lib/hadoop/data/f/yarn/local,/var/lib/hadoop/data/d/yarn/local,/var/lib/hadoop/data/e/yarn/local,/var/lib/hadoop/data/i/yarn/local,/var/lib/hadoop/data/h/yarn/local,/var/lib/hadoop/data/j/yarn/local,/var/lib/hadoop/data/m/yarn/local,/var/lib/hadoop/data/l/yarn/local,/var/lib/hadoop/data/n/yarn/local,/var/lib/hadoop/data/k/yarn/local,/var/lib/hadoop/data/o/yarn/local,/var/lib/hadoop/data/p/yarn/local,/var/lib/hadoop/data/q/yarn/local,/var/lib/hadoop/data/u/yarn/local,/var/lib/hadoop/data/t/yarn/local,/var/lib/hadoop/data/r/yarn/local,/var/lib/hadoop/data/s/yarn/local,/var/lib/hadoop/data/w/yarn/local,/var/lib/hadoop/data/x/yarn/local,/var/lib/hadoop/data/v/yarn/local</value>

/var/lib/hadoop/data/g/yarn/local is now missing from yarn's config, so for the moment we should be good.
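
A quick, hedged way to double-check that, using the config path from the puppet log above:

# Should print nothing now that puppet dropped the /g datadir
grep '/var/lib/hadoop/data/g/yarn/local' /etc/hadoop/conf.analytics-hadoop/yarn-site.xml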

Next steps: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk
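
For context, a rough sketch of what re-provisioning the new virtual drive usually involves; the wikitech page above is the authoritative procedure, and the device name (/dev/sdg) and ext4 filesystem here are assumptions, not values taken from the host:

# Identify the new VD's block device first (lsblk / megacli); /dev/sdg is assumed
DEV=/dev/sdg

# Partition and format it (ext4 assumed; use the options documented on wikitech)
sudo parted -s "$DEV" mklabel gpt mkpart primary ext4 0% 100%
sudo mkfs.ext4 "${DEV}1"

# Re-enable the fstab entry commented out earlier and mount it; if fstab mounts
# by UUID, update it first with the new filesystem's UUID (see blkid)
sudo sed -i.bak '\|/var/lib/hadoop/data/g | s/^#//' /etc/fstab
sudo mount /var/lib/hadoop/data/g

# Puppet should recreate the yarn/hdfs subdirectories and re-add the datadir;
# restart the NodeManager afterwards so it uses the restored partition
sudo puppet agent -t
sudo systemctl restart hadoop-yarn-nodemanager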

Mentioned in SAL (#wikimedia-operations) [2021-09-23T16:13:42Z] <elukey> reboot an-worker1096 to see if megacli status for a new disk changes - T290805

New disk up and running; I added some more info to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk (in this case there was no unconfigured disk, so there were fewer things to do).

@razzi let's not forget to follow up on these tasks; this one got closed while we were still not using the capacity available on the node :)