Degraded RAID on lvs3001
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host lvs3001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 1 (Target Id: 1)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 1
			Drive's position: DiskGroup: 1, Span: 0, Arm: 0
			Media Error Count: 2
			Other Error Count: 3
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 232.885 GB [0x1d1c5970 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 25C (77.00 F)

=== RaidStatus completed
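
For context, the event handler's underlying check amounts to asking the controller for any virtual or physical drive that is not in its optimal state. A rough manual equivalent is sketched below; the megacli binary name and flags are the standard MegaCli ones and may differ from the wrapper actually deployed on these hosts:

    # Virtual drive state on all adapters; anything other than "Optimal" is suspect
    megacli -LDInfo -Lall -aALL | grep -i '^State'
    # Physical drive state; look for "Failed" or "Unconfigured(bad)"
    megacli -PDList -aALL | grep -i 'Firmware state'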

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 4 2017, 1:11 AM
ema moved this task from Triage to LoadBalancer on the Traffic board. · Jun 6 2017, 1:30 PM
RobH added a subscriber: RobH. · Jun 7 2017, 4:41 PM

Warranty for lvs3001 ended on May 08, 2015.

Dzahn added a subscriber: Dzahn. · Jun 9 2017, 9:25 PM

Importing text from (almost) duplicate ticket T166964 (merging into this ticket):

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host lvs3001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0
Personalities : [raid1] 
md1 : active raid1 sda2[0] sdb2[1](F)
      194702336 blocks super 1.2 [2/1] [U_]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[0] sdb1[1](F)
      48794624 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
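
The md variant of the handler reads /proc/mdstat, where the member vector above ([U_]) shows one of the two mirror halves missing. A quick manual check, sketched with the device names from this snapshot:

    # An underscore in the member vector, e.g. [U_], marks a failed/missing disk
    grep -A 2 '^md' /proc/mdstat
    # Or query one array directly
    mdadm --detail /dev/md0 | grep -E 'State|Failed Devices'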
Marostegui triaged this task as Normal priority. · Jun 13 2017, 10:00 AM
Volans added a comment. · Jul 3 2017, 7:15 AM

I've commented out the MAILADDR line to avoid getting one email per day. Given that we also have the Icinga check, we could consider commenting it out broadly across the fleet. The file is currently not managed by Puppet.
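
For reference, the change boils down to commenting out the mail destination in /etc/mdadm/mdadm.conf (Debian's default location; the exact recipient below is assumed):

    # MAILADDR root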

Volans added a comment. · Jul 3 2017, 7:33 AM

And of course that was not enough: I also had to add an exit 0 to /etc/cron.daily/mdadm to prevent it from running, since without the MAILADDR setting the report check refuses to run and generates cronspam.
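
A sketch of that workaround, assuming the stock Debian /etc/cron.daily/mdadm layout: an early exit right after the shebang turns the script into a no-op without removing the conffile:

    #!/bin/sh
    # T166965: bail out early; with MAILADDR commented out, the checkarray
    # run below would refuse to start and generate cron mail instead
    exit 0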

Volans added a subscriber: mark. · Aug 28 2017, 9:21 PM
BBlack moved this task from LoadBalancer to Hardware on the Traffic board. · Oct 23 2017, 2:49 PM
Dzahn removed a subscriber: Dzahn. · Oct 23 2017, 5:17 PM
mark moved this task from Backlog to Break/Fix on the ops-esams board. · Jan 3 2018, 1:24 PM
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-01-09T14:59:34Z] <ema> lvs3001: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267, replace sdb T166965

Mentioned in SAL (#wikimedia-operations) [2018-01-09T15:18:10Z] <ema> lvs3001 disk swap: failover traffic to lvs3003 T166965

ema closed this task as Resolved. · Jan 9 2018, 5:20 PM
ema claimed this task.
ema added a subscriber: ema.

Disk replaced today, raid rebuilt.
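
For the record, a typical md rebuild after swapping a failed mirror member looks like the following; this is a sketch based on the array layout in the snapshot above, not a log of the exact commands run:

    # Replicate the partition layout from the surviving disk onto the new one
    sfdisk -d /dev/sda | sfdisk /dev/sdb
    # Re-add the new partitions to both mirrors
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md1 --add /dev/sdb2
    # Watch the resync progress
    cat /proc/mdstat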