Degraded RAID on restbase-dev1001
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host restbase-dev1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 2, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Solid State Device
				Drive Temperature: 29C (84.20 F)

=== RaidStatus completed
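
The snapshot above is produced automatically by the event handler; a roughly equivalent manual check on the host, assuming the MegaCli utility is installed as megacli, would be:

    # Virtual drive summary; anything not "Optimal" indicates a problem
    megacli -LDInfo -Lall -aALL
    # Physical drive list; look for a firmware state of "Failed" or similar
    megacli -PDList -aALL
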
Restricted Application added a subscriber: Aklapper. · Feb 7 2017, 9:03 AM
elukey triaged this task as Normal priority. · Feb 7 2017, 9:12 AM
elukey assigned this task to Cmjohnson.
elukey added subscribers: Eevans, mobrovac, fgiunchedi, elukey.

Mentioned in SAL (#wikimedia-operations) [2017-02-07T09:46:32Z] <elukey> stopped and masked cassandra-{a,b} - T157425
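
Stopping and masking the two Cassandra instances presumably amounted to something like the following; the per-instance unit names cassandra-a and cassandra-b are assumed from the SAL entry, not confirmed here:

    # Stop both Cassandra instances and prevent them from being (re)started
    sudo systemctl stop cassandra-a cassandra-b
    sudo systemctl mask cassandra-a cassandra-b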

Is there an ETA on this? We have some testing as part of T156199 that could benefit from this environment; having some idea would help with planning those tasks.

To summarize from IRC today:

15:08 < urandom> cmjohnson1: is there any ETA on https://phabricator.wikimedia.org/T157425?
...
16:35 < cmjohnson1> urandom: I was waiting on the spare disks that did arrive. I will swap it out in the morning

I confused this server with something else. This server has 12 SSDs that we purchased and placed in the system in December 2016. It is strange that one failed after only 2 months of use. A new SSD will need to be ordered, which will require approval from @RobH and @mark and/or @faidon.

Disk type is Intel S3610 800GB

Are there any spares of this disk type on hand?

There are not; however, ordering some was discussed during today's ops meeting. Task T158795 tracks the ordering of replacement disks.

I removed the disk and will bring it with me while I am gone. @RobH will let me know if and where I need to send it for RMA.

RobH mentioned this in Unknown Object (Task). · Mar 10 2017, 8:08 PM

@GWicke: the SSD has been replaced. You may need to reboot the server or add it back to the config.
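
A quick way to confirm the controller sees the replacement drive, assuming it sits in the same slot as the failed one (enclosure 32, slot 2), would be:

    # Physical drive details for the replaced slot
    megacli -PDInfo -PhysDrv [32:2] -a0
    # Virtual drive summary; the single-disk VD for this slot should no longer be Offline
    megacli -LDInfo -Lall -aALL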

Eevans added a comment. · Edited · Mar 16 2017, 4:10 PM

Is there a standard process for assigning this to someone in Operations to complete? I do have root on these machines, but I wouldn't want to intervene without first coordinating with someone (and of course I do not have access to the console).

Current status:

eevans@restbase-dev1001:~$ sudo mdadm --detail /dev/md0 /dev/md1 /dev/md2 
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid1
     Array Size : 29279232 (27.92 GiB 29.98 GB)
  Used Dev Size : 29279232 (27.92 GiB 29.98 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Mar 16 16:05:28 2017
          State : clean, degraded 
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

           Name : restbase-dev1001:0  (local to host restbase-dev1001)
           UUID : ca7a247c:52533443:2bf86059:1958c2eb
         Events : 1539652

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       4       0        0        4      removed
       3       8       49        3      active sync   /dev/sdd1

       2       8       33        -      faulty   /dev/sdc1
/dev/md1:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid1
     Array Size : 976320 (953.60 MiB 999.75 MB)
  Used Dev Size : 976320 (953.60 MiB 999.75 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Jan  4 23:04:50 2017
          State : clean, resyncing (PENDING) 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

           Name : restbase-dev1001:1  (local to host restbase-dev1001)
           UUID : 09ae75be:f1274a9e:69a8911d:fd58afa8
         Events : 3

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       3       8       50        3      active sync   /dev/sdd2
/dev/md2:
        Version : 1.2
  Creation Time : Wed Jan  4 22:52:52 2017
     Raid Level : raid0
     Array Size : 3001561088 (2862.51 GiB 3073.60 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Jan  4 22:52:52 2017
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : restbase-dev1001:2  (local to host restbase-dev1001)
           UUID : a2c676cb:607f8730:a1c16523:afe0023f
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
eevans@restbase-dev1001:~$
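
Repairing the arrays in place would have looked roughly like the sketch below; this is only illustrative (it was not the path ultimately taken) and assumes the replacement disk appears as /dev/sdc, with the partition table copied from a healthy member such as /dev/sda:

    # Copy the partition table from a healthy member, then randomize GUIDs on the new disk
    sudo sgdisk -R /dev/sdc /dev/sda
    sudo sgdisk -G /dev/sdc
    # Drop the stale faulty member from md0 and add the new partition back in
    sudo mdadm --manage /dev/md0 --remove failed
    sudo mdadm --manage /dev/md0 --add /dev/sdc1
    # Watch the rebuild progress
    cat /proc/mdstat

Note that md2 is a RAID 0 spanning all four disks, so a failed member cannot simply be re-added there; that array would have to be recreated, which is part of why reimaging the host was the simpler option.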
elukey closed this task as Resolved. · Mar 17 2017, 4:48 PM

megacli -LDInfo -Lall -aALL showed state Offline for the new disk's slot, so I forced it online with megacli -PDOnline -PhysDrv [32:2] -a0. Instead of fixing the three RAID arrays (including a RAID 0), I simply reimaged the host, and now everything seems to be up and running.
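
For reference, the recovery sequence described above as shell commands (the subsequent reimage goes through the standard install workflow and is not shown):

    # Virtual drive summary showed the new disk's slot as Offline
    megacli -LDInfo -Lall -aALL
    # Force the physical drive in enclosure 32, slot 2 back online
    megacli -PDOnline -PhysDrv [32:2] -a0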