Degraded RAID on restbase-dev1006
Closed, Duplicate (Public)
Assigned To

Authored By

	ops-monitoring-bot
	May 19 2019, 1:42 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase-dev1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] 
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks
      
md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
      29279232 blocks super 1.2 [4/3] [UUU_]
      
unused devices: <none>
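
As an illustration of how the snapshot above can be read programmatically, here is a minimal Python sketch that parses mdstat-style text and reports which arrays are degraded and which members carry the `(F)` flag. It is not the actual `get-raid-status-md` plugin; the function name `parse_mdstat` and the regular expressions are assumptions made for this example, and the sample text is copied from the output above.

```python
import re

# Sample copied from the snapshot in this task.
MDSTAT = """\
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
      29279232 blocks super 1.2 [4/3] [UUU_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")               # e.g. "md0 : active raid1 ..."
MEMBER_RE = re.compile(r"(\w+)\[\d+\]((?:\([A-Z]\))*)")  # e.g. "sdd1[3](F)"
HEALTH_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")   # e.g. "[4/3] [UUU_]"


def parse_mdstat(text):
    """Hypothetical helper: map array name -> failed members and degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            members = MEMBER_RE.findall(header.group(2))
            arrays[current] = {
                "failed": [name for name, flags in members if "(F)" in flags],
                "degraded": False,
            }
            continue
        health = HEALTH_RE.search(line)
        if health and current:
            expected, active = int(health.group(1)), int(health.group(2))
            # An array is degraded when fewer members are active than expected.
            arrays[current]["degraded"] = active < expected or "_" in health.group(3)
    return arrays


for name, info in parse_mdstat(MDSTAT).items():
    print(name, info)
```

On the sample above, only md0 is reported as degraded, with sdd1 as the failed member. The raid0 array md2 has no `[n/m]` status field, so this sketch cannot flag it from this output alone.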

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. May 19 2019, 1:42 PM

ops-monitoring-bot subscribed.

Volans triaged this task as Medium priority. May 20 2019, 8:37 AM

Volans edited subscribers, added: fgiunchedi; removed: ops-monitoring-bot.

syslog and dmesg are full of:

May 20 06:32:38 restbase-dev1006 kernel: [16121413.008778] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 20 06:32:38 restbase-dev1006 kernel: [16121413.039012] ata4.00: irq_stat 0x40000001
May 20 06:32:38 restbase-dev1006 kernel: [16121413.058082] ata4.00: failed command: READ DMA EXT
May 20 06:32:38 restbase-dev1006 kernel: [16121413.080715] ata4.00: cmd 25/00:08:80:c3:26/00:00:5d:00:00/e0 tag 4 dma 4096 in
May 20 06:32:38 restbase-dev1006 kernel: [16121413.080715]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
May 20 06:32:39 restbase-dev1006 kernel: [16121413.154538] ata4.00: status: { DRDY ERR }
May 20 06:32:39 restbase-dev1006 kernel: [16121413.174813] ata4.00: error: { IDNF }
May 20 06:32:39 restbase-dev1006 kernel: [16121413.193564] ata4.00: configured for UDMA/133
May 20 06:32:39 restbase-dev1006 kernel: [16121413.193572] ata4: EH complete
May 20 06:32:39 restbase-dev1006 kernel: [16121413.788655] ata4.00: Enabling discard_zeroes_data

And they end with:

May 20 08:43:33 restbase-dev1006 kernel: [16129267.255695] sd 3:0:0:0: [sdd] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 20 08:43:33 restbase-dev1006 kernel: [16129267.255699] sd 3:0:0:0: [sdd] tag#25 Sense Key : Illegal Request [current]
May 20 08:43:33 restbase-dev1006 kernel: [16129267.255703] sd 3:0:0:0: [sdd] tag#25 Add. Sense: Logical block address out of range
May 20 08:43:33 restbase-dev1006 kernel: [16129267.255707] sd 3:0:0:0: [sdd] tag#25 CDB: Read(10) 28 00 03 9b e0 00 00 00 08 00
May 20 08:43:33 restbase-dev1006 kernel: [16129267.255710] blk_update_request: I/O error, dev sdd, sector 60547072
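
The CDB bytes in the log above can be cross-checked against the sector reported by `blk_update_request`. The following is a minimal Python sketch, assuming the standard SCSI READ(10) layout (opcode 0x28 in byte 0, a big-endian logical block address in bytes 2-5, and the transfer length in bytes 7-8); the helper name `decode_read10` is made up for this example.

```python
def decode_read10(cdb_hex):
    """Hypothetical helper: return (lba, block_count) from a READ(10) CDB hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


# CDB copied from the kernel log line above.
lba, blocks = decode_read10("28 00 03 9b e0 00 00 00 08 00")
print(lba, blocks)  # 60547072 8
```

The decoded address 60547072 matches the sector in the I/O error line, and the request covers 8 blocks (4096 bytes if the logical sector size is 512 bytes).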

Not production; this host can be taken down at any time, without coordination.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 28 2019, 2:53 PM

A ticket has been created with HP for a replacement: 5338974144

This server's SSDs are not part of the original build and are not under HP warranty. They are Intel SSDs that I believe came from restbase1001-1003. Assigning to @RobH to order new SSDs.

description: ATA Disk
product: INTEL SSDSC2BB80
physical id: 0.0.0
bus info: scsi@2:0.0.0
logical name: /dev/sdc
version: 0121
serial: PHDV731402WR800CGN
size: 745GiB (800GB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=131ebc29

This is a duplicate task; declining.

jijiki closed this task as a duplicate of T224260: restbase-dev1006 has a broken disk. Jun 27 2019, 3:09 PM
