- - Provide FQDN of system.
labstore1005.eqiad.wmnet
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
@Bstorm, @Andrew can you help with this? I'm not sure how to depool this machine or what repercussions that will have.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc)
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
The server hits I/O issues that end with the megaraid controller resetting itself. This renders the server unavailable for a few minutes and forces the drbd sync to catch up every time it happens. The RAID is stable and none of the disks are failing.
The frequency varies from once an hour to about half a dozen times per hour, with a peak of 11 resets in a single hour.
Dell support assist report - F34630916
Events log - F34630921
The part of the journal log with the reset loop:
# First instance:
Sep 02 09:14:36 labstore1005 kernel: drbd tools: meta connection shut down by peer.
Sep 02 09:14:36 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection from 208.80.153.84 port 57818 on 10.64.37.20 port 22
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection closed by 208.80.153.84 port 57818 [preauth]
Sep 02 09:14:38 labstore1005 kernel: megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: meta connection shut down by peer.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:48 labstore1005 kernel: megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW now in Ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928 LDIO threshold: 0
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW supports sync cache : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Init cmd success
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: firmware type : Extended VD(240 VD)firmware
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: controller type : MR(1024MB)
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Online Controller Reset(OCR) : Enabled
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Secure JBOD support : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4980
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Reset successful for scsi0.
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset
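To put a number on how long each controller reset takes, the journal excerpt above can be scanned for the "Found FW in FAULT state" and "Reset successful" messages. The following is a minimal Python sketch, not something that was run on the host: the function names are made up for illustration, and the year has to be supplied because the journal's short timestamps omit it. It measures only the controller reset window, not the full period during which the server is unavailable.

```python
from datetime import datetime

# Marker strings copied from the journal excerpt in this task.
FAULT_MARKER = "Found FW in FAULT state"
RESET_MARKER = "Reset successful"
TS_FORMAT = "%Y %b %d %H:%M:%S"


def parse_timestamp(line, year):
    """Parse the leading 'Sep 02 09:14:39' stamp of a journal line.

    The year is an assumption passed in by the caller, since the
    short journal format does not carry it.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", TS_FORMAT)


def reset_durations(lines, year=2021):
    """Return seconds between each fault and the next successful reset."""
    durations = []
    fault_at = None
    for line in lines:
        if FAULT_MARKER in line:
            fault_at = parse_timestamp(line, year)
        elif RESET_MARKER in line and fault_at is not None:
            done_at = parse_timestamp(line, year)
            durations.append((done_at - fault_at).total_seconds())
            fault_at = None
    return durations


# Two lines taken from the first instance logged above.
sample = [
    "Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: "
    "Found FW in FAULT state, will reset adapter scsi0.",
    "Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: "
    "Reset successful for scsi0.",
]
print(reset_durations(sample))  # [20.0]
```

For the first instance this gives roughly 20 seconds from fault detection to a successful reset; the drbd resync that follows is not covered by this measurement.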
Frequency of incidents by hour:
root@labstore1005:~# journalctl -S "2021-09-01" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
      2 Sep 02 09
      1 Sep 02 10
     11 Sep 02 11
      1 Sep 02 12
      2 Sep 02 13
      1 Sep 02 14
      1 Sep 02 15
      1 Sep 02 16
      1 Sep 02 17
      1 Sep 02 18
      2 Sep 02 19
      6 Sep 02 20
      3 Sep 02 21
      2 Sep 02 22
      5 Sep 02 23
      5 Sep 03 00
      2 Sep 03 01
      5 Sep 03 02
      1 Sep 03 03
      3 Sep 03 04
      6 Sep 03 05
      2 Sep 03 06
      2 Sep 03 07
      4 Sep 03 08
      2 Sep 03 09
      1 Sep 03 10
      1 Sep 03 11
      2 Sep 03 12
      2 Sep 03 13
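For anyone who wants to repeat the hourly count on a saved copy of the journal rather than on the host, the same grep/cut/uniq logic can be sketched in Python. This is an illustrative equivalent only: `resets_per_hour` is a hypothetical helper name, and it assumes the short journal format shown in this task, where everything before the first colon is "Mon DD HH".

```python
from collections import Counter

# Message text copied from the journal excerpt in this task.
RESET_MESSAGE = "Controller encountered a fatal error and was reset"


def resets_per_hour(lines):
    """Count reset messages per hour, keyed like `cut -d: -f 1` ('Sep 02 09')."""
    counts = Counter()
    for line in lines:
        if RESET_MESSAGE in line:
            # Everything before the first ':' is month, day and hour.
            counts[line.split(":", 1)[0]] += 1
    return counts


# Two lines from the logged first instance; only the second one matches.
sample = [
    "Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: "
    "FW now in Ready state",
    "Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: "
    "67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal "
    "error and was reset",
]
for hour, count in sorted(resets_per_hour(sample).items()):
    print(count, hour)
```

Sorting the keys as strings matches what `sort` does in the pipeline above, which is sufficient while all entries fall within the same month.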
Disk statuses (all online):
root@labstore1005:~# sudo megacli -PDList -aALL | grep -i 'firmware state' | uniq -c
     26 Firmware state: Online, Spun Up
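If the disk check needs to be repeated while monitoring, the "Firmware state" lines can be tallied from captured megacli output. The sketch below is hedged accordingly: the helper names are hypothetical, the only line format it relies on is the one shown in this task, and treating any state that starts with "Online" as healthy is an assumption rather than a documented rule.

```python
from collections import Counter

# Line prefix as it appears in the megacli output quoted in this task.
PREFIX = "Firmware state:"


def firmware_states(output):
    """Tally the value of every 'Firmware state:' line in the given text."""
    counts = Counter()
    for line in output.splitlines():
        line = line.strip()
        if line.lower().startswith(PREFIX.lower()):
            counts[line[len(PREFIX):].strip()] += 1
    return counts


def not_online(counts):
    """Return states that do not start with 'Online' (assumed unhealthy)."""
    return {state: n for state, n in counts.items()
            if not state.startswith("Online")}


# Stand-in for captured output: 26 identical lines, as reported above.
sample = "Firmware state: Online, Spun Up\n" * 26
states = firmware_states(sample)
print(dict(states))        # {'Online, Spun Up': 26}
print(not_online(states))  # {}
```

An empty second result corresponds to the situation reported here, where all 26 disks are online.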
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
Summary of work:
A backup controller was ordered via T290602.
Firmware was flashed to the iDRAC and the RAID controller via T290318#7357112.
System returned to service, monitoring for errors via T290318#7359652.
If no errors are generated by 2021-09-18, we can likely resolve this task. (If errors occur after that date, a new task can be created that references this one, and the controller will simply be swapped.)