[x] - Provide FQDN of system.
labstore1005.eqiad.wmnet
[] - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
@Bstorm, @Andrew can you help with this? I'm not sure how to depool this machine or what repercussions that would have.
[] - Put system into a failed state in Netbox.
[] - Provide urgency of request, along with justification (redundancy, dependencies, etc)
[x] - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
The server hits I/O issues that end with the MegaRAID controller resetting itself. This renders the server unavailable for a few minutes and forces the DRBD sync to catch up every time it happens. The RAID is stable and none of the disks is failing.
The frequency usually varies from once an hour to half a dozen times per hour, with a peak of 11 resets in a single hour (Sep 02, 11:00).
There was an issue uploading files to Phabricator; I will retry attaching the logs later.
Dell SupportAssist report - F34630916
{F34630916}
Events log - F34630921
{F34630921}
The part of the journal log with the reset loop:
```
# First instance:
Sep 02 09:14:36 labstore1005 kernel: drbd tools: meta connection shut down by peer.
Sep 02 09:14:36 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection from 208.80.153.84 port 57818 on 10.64.37.20 port 22
Sep 02 09:14:37 labstore1005 sshd[56107]: Connection closed by 208.80.153.84 port 57818 [preauth]
Sep 02 09:14:38 labstore1005 kernel: megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: meta connection shut down by peer.
Sep 02 09:14:39 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 02 09:14:48 labstore1005 kernel: megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW now in Ready state
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928 LDIO threshold: 0
Sep 02 09:14:58 labstore1005 kernel: megaraid_sas 0000:03:00.0: FW supports sync cache : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Init cmd success
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: firmware type : Extended VD(240 VD)firmware
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: controller type : MR(1024MB)
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Online Controller Reset(OCR) : Enabled
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Secure JBOD support : No
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4980
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Reset successful for scsi0.
Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset
```
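To put a number on how long the controller itself is out during one of these cycles, the timestamps in the excerpt above can be diffed. This is only a rough sketch: the function names and the `year` default are my own assumptions (the short journal format does not carry a year), and it measures just the window between the FAULT detection and the "Reset successful" line, not how long NFS or DRBD take to recover afterwards.

```python
from datetime import datetime

# Substrings taken from the megaraid_sas kernel messages in the excerpt above.
FAULT_MARKER = "Found FW in FAULT state"
DONE_MARKER = "Reset successful"


def parse_timestamp(line, year=2021):
    """Parse the leading 'Sep 02 09:14:39' part of a journal line.

    The year is an assumption passed in by the caller; the log line omits it.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def reset_durations(lines):
    """Return the seconds between each FAULT detection and the next successful reset."""
    durations = []
    started = None
    for line in lines:
        if "megaraid_sas" not in line:
            continue
        if FAULT_MARKER in line:
            started = parse_timestamp(line)
        elif DONE_MARKER in line and started is not None:
            durations.append((parse_timestamp(line) - started).total_seconds())
            started = None
    return durations


# Two lines copied from the journal excerpt above.
sample = [
    "Sep 02 09:14:39 labstore1005 kernel: megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.",
    "Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Reset successful for scsi0.",
]
print(reset_durations(sample))
```

For the first instance this gives a controller reset window of 20 seconds, so most of the "few minutes" of unavailability presumably comes from the services recovering rather than from the reset itself.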
Number of resets per hour:
```
root@labstore1005:~# journalctl -S "2021-09-01" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq -c
2 Sep 02 09
1 Sep 02 10
11 Sep 02 11
1 Sep 02 12
2 Sep 02 13
1 Sep 02 14
1 Sep 02 15
1 Sep 02 16
1 Sep 02 17
1 Sep 02 18
2 Sep 02 19
6 Sep 02 20
3 Sep 02 21
2 Sep 02 22
5 Sep 02 23
5 Sep 03 00
2 Sep 03 01
5 Sep 03 02
1 Sep 03 03
3 Sep 03 04
6 Sep 03 05
2 Sep 03 06
2 Sep 03 07
4 Sep 03 08
2 Sep 03 09
1 Sep 03 10
1 Sep 03 11
2 Sep 03 12
2 Sep 03 13
```
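If someone wants to track this over time without the shell pipeline, the same per-hour count can be sketched in Python. The function name is hypothetical; the grouping key mirrors `cut -d: -f 1`, that is, everything before the first colon of each journal line.

```python
from collections import Counter

# Message logged by megaraid_sas after each controller reset (see the journal excerpt).
RESET_MESSAGE = "Controller encountered a fatal error and was reset"


def resets_per_hour(lines):
    """Count reset messages per 'Mon DD HH' bucket, like `cut -d: -f 1 | sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        if RESET_MESSAGE in line:
            # Everything before the first ':' is the date plus the hour, e.g. 'Sep 02 09'.
            hour = line.split(":", 1)[0]
            counts[hour] += 1
    return counts


# Two lines copied from the journal excerpt; only the second one is a reset message.
sample = [
    "Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: Reset successful for scsi0.",
    "Sep 02 09:14:59 labstore1005 kernel: megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset",
]
for hour, count in sorted(resets_per_hour(sample).items()):
    print(count, hour)
```

Feeding it the real journal would mean passing `sys.stdin` (with `journalctl -S "2021-09-01"` piped in) instead of `sample`.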
Disk statuses (all online):
```
root@labstore1005:~# sudo megacli -PDList -aALL | grep -i 'firmware state' | uniq -c
26 Firmware state: Online, Spun Up
```
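One caveat with the check above is that `uniq -c` only collapses adjacent identical lines, so a single odd disk in the middle of the list would split the output into several groups. A small sketch that tallies the states across the whole `megacli -PDList -aALL` output is below. The function names are hypothetical, and treating only "Online, Spun Up" as healthy is my assumption based on the output above; other states may be perfectly fine on other setups.

```python
from collections import Counter


def firmware_states(megacli_output):
    """Tally the 'Firmware state:' values found in `megacli -PDList -aALL` output."""
    states = Counter()
    for line in megacli_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "firmware state":
            states[value.strip()] += 1
    return states


def unhealthy_states(states, healthy=("Online, Spun Up",)):
    """Return the states (with counts) that are not in the assumed-healthy list."""
    return {state: count for state, count in states.items() if state not in healthy}


# Stand-in for the real command output: 26 identical lines, matching the count above.
sample = "Firmware state: Online, Spun Up\n" * 26
states = firmware_states(sample)
print(dict(states))
print(unhealthy_states(states))
```

An empty result from `unhealthy_states` matches what is reported here: all 26 disks are online.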
[x] - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.