- FQDN: wdqs2002.mgmt.codfw.wmnet
- Machine de-pooled, begin work whenever
- Put system into a failed state in Netbox: https://netbox.wikimedia.org/dcim/devices/152/
- Urgency: High-Medium (the underlying host is working fine, but we lack access to the mgmt port in case anything goes wrong)
- Issue: SSH alert flapping for the mgmt console specifically: wdqs2002.mgmt/SSH is CRITICAL
- Assign the correct project tag and an appropriate owner (based on the above). Also, please ensure the service owners of the host(s) are added as subscribers so they can provide any additional input.
ryankemper@wdqs2002:~$ sudo ipmi-sel
ID | Date | Time | Name | Type | Event
1 | Aug-15-2016 | 20:37:05 | SEL | Event Logging Disabled | Log Area Reset/Cleared
2 | Aug-22-2016 | 10:14:24 | PS Redundancy | Power Supply | Redundancy Lost
3 | Aug-22-2016 | 10:14:29 | Status | Power Supply | Power Supply input lost (AC/DC)
4 | Aug-22-2016 | 10:26:48 | Status | Power Supply | Power Supply input lost (AC/DC)
5 | Aug-22-2016 | 10:26:58 | PS Redundancy | Power Supply | Fully Redundant
6 | Mar-09-2019 | 05:22:35 | Mem ECC Warning | Memory | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
7 | Mar-10-2019 | 05:07:14 | Mem ECC Warning | Memory | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
8 | May-28-2019 | 09:15:20 | Additional Info | OEM Reserved | OEM Event Offset = 02h ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
9 | May-28-2019 | 09:15:20 | ECC Uncorr Err | Memory | Uncorrectable memory error ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
10 | May-28-2019 | 09:15:20 | Additional Info | OEM Reserved | OEM Event Offset = 02h ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
11 | May-28-2019 | 09:15:20 | ECC Uncorr Err | Memory | Uncorrectable memory error ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
12 | Jun-05-2019 | 22:31:01 | Mem ECC Warning | Memory | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
13 | Jun-16-2019 | 18:07:45 | Mem ECC Warning | Memory | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
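
For anyone triaging the SEL output above, here is a minimal sketch of pulling out just the memory-related entries. It assumes the pipe-separated, six-column layout shown in the paste (ID, Date, Time, Name, Type, Event). The function names `parse_sel` and `memory_events` are hypothetical helpers for illustration, not part of any existing tooling, and `SAMPLE` is three rows copied from the log above.

```python
# Hypothetical helper: filter captured `ipmi-sel` text by the Type column.
# Assumes the six pipe-separated columns shown in the pasted output.

SAMPLE = """\
ID | Date | Time | Name | Type | Event
5 | Aug-22-2016 | 10:26:58 | PS Redundancy | Power Supply | Fully Redundant
6 | Mar-09-2019 | 05:22:35 | Mem ECC Warning | Memory | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
9 | May-28-2019 | 09:15:20 | ECC Uncorr Err | Memory | Uncorrectable memory error ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 02h
"""

FIELDS = ("id", "date", "time", "name", "type", "event")


def parse_sel(text):
    """Turn captured SEL text into a list of dicts, one per event row."""
    rows = []
    for line in text.splitlines():
        # Split into at most six columns so the event text stays in one piece.
        parts = [p.strip() for p in line.split("|", 5)]
        if len(parts) != 6 or not parts[0].isdigit():
            continue  # skip the header row and anything malformed
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def memory_events(rows):
    """Keep only rows whose Type column is exactly 'Memory'."""
    return [r for r in rows if r["type"] == "Memory"]


for event in memory_events(parse_sel(SAMPLE)):
    print(event["id"], event["date"], event["name"], "-", event["event"])
```

The same functions should work on the full log if the captured text is passed in place of `SAMPLE`.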