Degraded RAID on restbase-dev1001
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host restbase-dev1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Solid State Device
				Drive Temperature: 29C (84.20 F)

=== RaidStatus completed
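For reference, a snapshot like the one above can usually be reproduced by hand with MegaCli; the exact invocation used by the Icinga event handler is not shown in this task, so treat these as assumed equivalents:

sudo megacli -AdpAllInfo -aALL | grep -A4 'Device Present'          # controller-wide counts, including failed disks
sudo megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|State'    # per-virtual-drive state (Optimal/Degraded/Offline)
sudo megacli -PDList -aALL | grep -E 'Slot Number|Firmware state'   # per-physical-drive firmware state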
Restricted Application added a subscriber: Aklapper. · Feb 7 2017, 9:03 AM
elukey triaged this task as Normal priority. · Feb 7 2017, 9:12 AM
elukey assigned this task to Cmjohnson.
elukey added subscribers: Eevans, mobrovac, fgiunchedi, elukey.

Mentioned in SAL (#wikimedia-operations) [2017-02-07T09:46:32Z] <elukey> stopped and masked cassandra-{a,b} - T157425
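The SAL entry above refers to the host's two Cassandra instances; the equivalent commands would be along these lines (the cassandra-a/cassandra-b unit names are assumed from the instance names in the log):

sudo systemctl stop cassandra-a cassandra-b     # stop both instances ahead of the hardware work
sudo systemctl mask cassandra-a cassandra-b     # keep them from being restarted until the disk is replaced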

Is there an ETA on this? We have some testing as part of T156199 that could benefit from this environment; having some idea of the timeline would help with planning these tasks.

To summarize from IRC today:

15:08 < urandom> cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425?
...
16:35 < cmjohnson1> urandom: I was waiting on the spare disks that did arrive. I will swap it out in the morning

I confused this server with something else. This server has 12 SSDs that we purchased and placed in the system in December 2016. It is strange that one failed after only two months of use. A new SSD will need to be ordered, which will require approval from @RobH and @mark and/or @faidon.

Disk type is Intel S3610 800GB

GWicke added a subscriber: GWicke. · Feb 22 2017, 5:43 PM

Are there any spares of this disk type at hand?

RobH added a comment. · Feb 22 2017, 6:48 PM

There are not; however, ordering some was discussed during today's ops meeting. Task T158795 tracks the ordering of replacement disks.

I removed the disk and will bring it with me while I am gone. @RobH will let me know if and where I need to send it for RMA.

RobH mentioned this in Unknown Object (Task). · Mar 10 2017, 8:08 PM

@GWicke: the SSD has been replaced. You may need to reboot the server or add it back to the config.
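For context, "adding it back to the config" on this controller roughly means re-creating the single-disk virtual drive for the new SSD. A minimal sketch, assuming the replacement sits in the same enclosure/slot [32:2] as the failed drive (slot taken from the original alert) and that any leftover foreign configuration needs clearing first:

sudo megacli -CfgForeign -Scan -a0       # check whether the new disk carries a foreign config
sudo megacli -CfgForeign -Clear -a0      # clear it so the disk becomes configurable
sudo megacli -CfgLdAdd -r0 [32:2] -a0    # re-create the single-disk RAID0 virtual drive
sudo megacli -LDInfo -Lall -aALL         # confirm the new virtual drive reports a healthy state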

Eevans added a comment. · Edited · Mar 16 2017, 4:10 PM

Is there a standard process for assigning this to someone in Operations to be completed? I do have root on these machines, but wouldn't want to intervene without first coordinating with someone (and of course I do not have access to the console).

Current status:

eevans@restbase-dev1001:~$ sudo mdadm --detail /dev/md0 /dev/md1 /dev/md2 
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid1
     Array Size : 29279232 (27.92 GiB 29.98 GB)
  Used Dev Size : 29279232 (27.92 GiB 29.98 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Mar 16 16:05:28 2017
          State : clean, degraded 
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

           Name : restbase-dev1001:0  (local to host restbase-dev1001)
           UUID : ca7a247c:52533443:2bf86059:1958c2eb
         Events : 1539652

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       4       0        0        4      removed
       3       8       49        3      active sync   /dev/sdd1

       2       8       33        -      faulty   /dev/sdc1
/dev/md1:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid1
     Array Size : 976320 (953.60 MiB 999.75 MB)
  Used Dev Size : 976320 (953.60 MiB 999.75 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Jan  4 23:04:50 2017
          State : clean, resyncing (PENDING) 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

           Name : restbase-dev1001:1  (local to host restbase-dev1001)
           UUID : 09ae75be:f1274a9e:69a8911d:fd58afa8
         Events : 3

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       3       8       50        3      active sync   /dev/sdd2
/dev/md2:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid0
     Array Size : 3001561088 (2862.51 GiB 3073.60 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Jan  4 22:52:52 2017
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : restbase-dev1001:2  (local to host restbase-dev1001)
           UUID : a2c676cb:607f8730:a1c16523:afe0023f
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
eevans@restbase-dev1001:~$
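With md0 showing /dev/sdc1 as faulty/removed, the usual recovery path would be to repartition the replacement disk to match its peers and re-add the new member to the degraded array. This is only a sketch of that process, not what was ultimately done here (the host was reimaged instead), and it assumes the replacement disk enumerates as /dev/sdc:

sudo sgdisk -R /dev/sdc /dev/sda    # copy the partition layout of sda (healthy member) onto sdc
sudo sgdisk -G /dev/sdc             # randomize the GUIDs on the copied table
sudo mdadm /dev/md0 --remove failed # clear the stale faulty slot (/dev/sdc1) from md0
sudo mdadm /dev/md0 --add /dev/sdc1 # re-add the member; the RAID1 resync starts automatically
cat /proc/mdstat                    # watch rebuild progress
# md2 is RAID0, so it cannot be rebuilt in place; it would have to be recreated and its
# data restored, which is part of why a full reimage can be the simpler option.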
elukey closed this task as Resolved. · Mar 17 2017, 4:48 PM

megacli -LDInfo -Lall -aALL showed the new disk's slot in state Offline, so I forced it online with megacli -PDOnline -PhysDrv [32:2] -a0. Instead of fixing the three md arrays (including a RAID0), I simply reimaged the host, and now everything seems to be up and running.
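After forcing the drive online (or after a reimage), a quick verification pass along these lines confirms that both the controller and the md layers are healthy; these are generic checks, not commands quoted from the task:

sudo megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|State'    # every virtual drive should report Optimal
sudo megacli -PDList -aALL | grep 'Firmware state'                  # every physical drive should be Online, Spun Up
cat /proc/mdstat                                                    # md0/md1/md2 should show all members present, e.g. [UUUU]
sudo mdadm --detail /dev/md0 /dev/md1 /dev/md2 | grep 'State :'    # arrays should be clean, not degraded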