backup2001 RAID controller failure, unable to post 2020-08-19
Closed, Resolved, Public

Description

dmesg
[Wed Aug 19 03:46:13 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:13 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:16 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:18 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:18 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:21 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:23 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:23 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:26 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:28 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:28 2020] megaraid_sas 0000:af:00.0: megasas_wait_for_adapter_operational HBA failed to become operational, adp_state 1
[Wed Aug 19 03:46:30 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:31 2020] megaraid_sas 0000:af:00.0: megasas_wait_for_adapter_operational HBA failed to become operational, adp_state 1
[Wed Aug 19 03:46:33 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:35 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:38 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:40 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:43 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:45 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:49 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:50 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:54 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
[Wed Aug 19 03:46:55 2020] megaraid_sas 0000:af:00.0: waiting for controller reset to finish
...
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] megaraid_sas 0000:af:00.0: Controller in crit error
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
[Wed Aug 19 03:50:18 2020] sd 0:2:1:0: Device offlined - not ready after error recovery
...
[Wed Aug 19 03:54:26 2020] scsi_io_completion_action: 231 callbacks suppressed
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: 254 callbacks suppressed
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 0
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 2048
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 CDB: Read(16) 88 00 00 00 00 00 00 00 10 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 4096
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1134 CDB: Read(16) 88 00 00 00 00 00 80 00 10 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 2147487744
[Wed Aug 19 03:54:26 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sdb, sector 2048
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 0
[Wed Aug 19 03:54:26 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sdb, sector 4096
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 0
[Wed Aug 19 03:54:26 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sdb, sector 4096
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 2048
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 10 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 4096
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 80 00 10 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] print_req_error: I/O error, dev sda, sector 2147487744
[Wed Aug 19 03:54:26 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 03:54:26 2020] sd 0:2:0:0: [sda] tag#1792 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 03:54:26 2020] sd 0:2:1:0: rejecting I/O to offline device
...
[Wed Aug 19 08:24:42 2020] sd 0:2:0:0: [sda] tag#2705 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 08:24:42 2020] sd 0:2:0:0: [sda] tag#2705 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 08:24:42 2020] sd 0:2:0:0: [sda] tag#2705 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Aug 19 08:24:42 2020] sd 0:2:0:0: [sda] tag#2705 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[Wed Aug 19 08:24:42 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 08:24:42 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 08:24:42 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 08:24:43 2020] sd 0:2:1:0: rejecting I/O to offline device
[Wed Aug 19 08:24:43 2020] sd 0:2:1:0: rejecting I/O to offline device
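For reading the failed requests above: in a SCSI READ(16) command block (opcode 0x88), bytes 2-9 carry the big-endian logical block address and bytes 10-13 the transfer length in blocks. The following is a minimal sketch (the helper name `decode_read16` is made up for illustration) that decodes two of the CDBs from the dmesg output, so the result can be compared with the sector numbers on the neighbouring `print_req_error` lines.

```python
def decode_read16(cdb_hex):
    """Return (lba, transfer_length) for a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    length = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, length


# CDBs copied verbatim from the dmesg lines above.
for cdb in (
    "88 00 00 00 00 00 00 00 08 00 00 00 01 00 00 00",
    "88 00 00 00 00 00 80 00 10 00 00 00 01 00 00 00",
):
    print(decode_read16(cdb))
```

The decoded addresses (2048 and 2147487744) agree with the `I/O error, dev sda, sector ...` lines that follow each CDB, and every request asks for 256 blocks.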
management lifecycle log
 		2020-08-19 08:31:41 	USR0030 	Successfully logged in using root, from 10.192.48.16 and GUI.	
		2020-08-19 04:05:42 	CTL137 	The storage controller RAID Controller in Slot 3 is unable to communicate to the BMC because either the storage controller or BMC is not responding to the commands either because of an internal error or the bus is in an error state.	
		2020-08-19 04:05:42 	LOG007 	The previous log entry was repeated 23 times.	
		2020-08-19 03:25:58 	PDR87 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 was reset.	
		2020-08-18 15:21:20 	PDR8 	Disk 0 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:20 	PDR8 	Disk 9 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 8 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 1 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 5 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 4 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 3 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 6 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 2 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 11 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 7 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 5 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 8 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 9 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 4 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 6 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 1 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 3 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 0 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 2 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 10 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 11 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	PDR8 	Disk 7 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 15:21:19 	LOG007 	The previous log entry was repeated 1 times.	
		2020-08-18 15:21:19 	ENC40 	A new enclosure was detected on RAID Controller in Slot 3.	
		2020-08-18 15:21:19 	LOG007 	The previous log entry was repeated 30 times.	
		2020-08-18 15:12:40 	PDR87 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 was reset.	
		2020-08-18 14:55:40 	PDR8 	Disk 0 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 9 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 8 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 1 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 5 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 4 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 3 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 6 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 2 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 11 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 7 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 5 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 8 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 9 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 4 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 6 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 1 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 3 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 0 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:40 	PDR8 	Disk 2 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:39 	PDR8 	Disk 10 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:39 	PDR8 	Disk 11 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:39 	PDR8 	Disk 7 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 14:55:39 	LOG007 	The previous log entry was repeated 1 times.	
		2020-08-18 14:55:39 	ENC40 	A new enclosure was detected on RAID Controller in Slot 3.	
		2020-08-18 14:55:39 	LOG007 	The previous log entry was repeated 32 times.	
		2020-08-18 14:46:30 	PDR87 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 was reset.	
		2020-08-18 06:43:02 	PDR8 	Disk 0 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 9 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 8 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 1 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 5 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 4 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 3 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 6 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 2 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 11 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 7 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 5 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 8 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 9 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 4 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 6 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 1 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 3 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 0 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 2 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 10 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 11 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	PDR8 	Disk 7 in Enclosure 0 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 06:43:02 	LOG007 	The previous log entry was repeated 1 times.	
		2020-08-18 06:43:02 	ENC40 	A new enclosure was detected on RAID Controller in Slot 3.	
		2020-08-18 06:43:02 	LOG007 	The previous log entry was repeated 29 times.	
		2020-08-18 06:34:49 	PDR87 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 was reset.	
		2020-08-18 05:21:31 	PDR8 	Disk 0 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 9 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 8 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 1 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 5 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 4 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 3 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 6 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 2 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 10 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 11 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.	
		2020-08-18 05:21:31 	PDR8 	Disk 7 in Enclosure 1 on Connector 0 of RAID Controller in Slot 3 is inserted.
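One way to read the lifecycle log above is to look at what follows each PDR87 reset of Disk 10. The sketch below is only an illustration, not a diagnosis: the `EVENTS` list is a hand-copied subset of the timestamps and event IDs shown above, and `reset_to_followup` is a hypothetical helper name.

```python
from datetime import datetime

# Subset of (timestamp, event ID) pairs copied from the lifecycle log, newest first.
EVENTS = [
    ("2020-08-19 04:05:42", "CTL137"),
    ("2020-08-19 03:25:58", "PDR87"),
    ("2020-08-18 15:21:19", "ENC40"),
    ("2020-08-18 15:12:40", "PDR87"),
    ("2020-08-18 14:55:39", "ENC40"),
    ("2020-08-18 14:46:30", "PDR87"),
    ("2020-08-18 06:43:02", "ENC40"),
    ("2020-08-18 06:34:49", "PDR87"),
]


def reset_to_followup(events):
    """For each PDR87, return (reset time, next non-PDR87 event ID, delay in seconds)."""
    ordered = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), code) for ts, code in events
    )
    result = []
    for i, (when, code) in enumerate(ordered):
        if code != "PDR87":
            continue
        following = [(t, c) for t, c in ordered[i + 1:] if c != "PDR87"]
        if following:
            t, c = following[0]
            result.append((when.isoformat(sep=" "), c, int((t - when).total_seconds())))
    return result


for row in reset_to_followup(EVENTS):
    print(row)
```

On these excerpts, the three resets on 2020-08-18 are each followed by an ENC40 enclosure detection 8 to 9 minutes later, while the reset on 2020-08-19 is followed by CTL137 instead.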

Event Timeline

Restricted Application added a subscriber: Aklapper.
fgiunchedi triaged this task as Medium priority. Aug 19 2020, 9:07 AM
jcrespo added a subscriber: Papaul.

It doesn't POST after a restart. I tried twice, and it gets stuck after "initializing devices", even after a powerdown and a powerup:

F2  = System Setup
F10 = Lifecycle Controller
F11 = Boot Manager
F12 = PXE Boot
IPMI: Boot to  

Initializing Serial ATA devices...
 Port A: SSDSC2KG240G7R
 Port B: SSDSC2KG240G7R


Broadcom NetXtreme Ethernet Boot Agent
Copyright (C) 2000-2017 Broadcom Limited
All rights reserved.
Press Ctrl-S to enter Configuration Menu

PowerEdge Expandable RAID Controller BIOS
Copyright(c) 2017 AVAGO Technologies
HA -0 (Bus 175 Dev 0)

@Papaul you are our only hope!

jcrespo renamed this task from backup2001 RAID controller failure to backup2001 RAID controller failure, unable to post. Aug 19 2020, 10:45 AM

backup2001 has been unstable for a while, and now it got extra load from database backups from eqiad.

Previous crashes:

Yesterday I started to notice weird IO errors while generating backups.

jcrespo renamed this task from backup2001 RAID controller failure, unable to post to backup2001 RAID controller failure, unable to post 2020-08-19. Aug 19 2020, 10:59 AM

To counter the idea that the setup was not ideal or the hw configuration was chosen wrongly (e.g. too many disks/arrays for a single host): backup1001 had none of the crashes backup2001 has, which leads me to believe it is a hw issue (as it continues after the firmware upgrade).

@wiki_willy @Papaul It seems we've had an ongoing pattern of crashes with this (rather important) backup host, which means we are not yet able to trust it. Until we are able to resolve this we also cannot decommission the older hosts (that this replaces) either. At the moment the system doesn't even boot. Are there any steps we can take soon to debug this issue? Anything we can help with? Thanks!

@mark I am planning on opening a case with Dell to see what they can find on their end.

Change 621520 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Ignore backup freshness check for dbprov1* hosts

https://gerrit.wikimedia.org/r/621520

Change 621520 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Ignore backup freshness check for dbprov1* hosts

https://gerrit.wikimedia.org/r/621520

Hi Papul,

Case 77793042, ST: server Crashing, POWEREDGE R440

I’m your case owner and primary point of contact through resolution of this issue. Here are the best ways to contact me:

• Email: daren_manor@dell.com (Preferred)
• Direct Extension: 1 800-945-3355 ext: 513-5090
• My working hours: Tuesday - Saturday 6:00 hrs. to 3:00 hrs. CST

Can we please depool this server?

Server is down and unusable (aka depooled). The only ask is whether the data on the arrays could be kept, so as not to lose previous backups. It is also downtime'd until Monday.

According to Dell, the PERC controller is not being detected correctly.

Step 1- Upgrade the controller drivers. If the problem persists, go to step 2.
Step 2- A replacement will be sent out.

There is 1 bad disk on backup2001-array2002 slot 8

Can you please depool and shut down this server? I need to replace the RAID controller.

This can be considered "depooled" at all times until we close the ticket; no need to ask permission (unlike databases, which we need to pool back quite often or they get out of sync). I can shut it down if you want, or you can do it on your own, unattended. I will extend the downtime for an extra week.

1- RAID controller drivers updated
2- RAID controller replaced
3- iDRAC upgraded

@Papaul, for the record: did you get sent a new hw RAID controller from the vendor (I wasn't aware of that, if true), or do you mean controller as in driver (firmware/software)?

Change 622209 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts"

https://gerrit.wikimedia.org/r/622209

Change 622209 merged by Jcrespo:
[operations/puppet@production] Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts"

https://gerrit.wikimedia.org/r/622209