Monitor the BMC's event log for hardware errors
Open, Medium, Public

Description

Similar to, but not the same as, T125205 (which is only concerned with the BMC's sensors): we should monitor the BMC's event log (whether that is the IPMI SEL, HP's IML, or something similar) for certain critical events. Consider this output from the T130702 investigation:

root@es2017:~# ipmitool sel list
   1 | 02/08/2016 | 16:06:18 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 05/26/2016 | 12:22:06 | Processor #0x61 | IERR | Asserted
   3 | 05/26/2016 | 12:24:04 | Unknown #0x2e |  | Asserted
   4 | 05/26/2016 | 12:24:04 | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA2) | Asserted
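
As a rough illustration of what such a check could look like, here is a minimal sketch that parses `ipmitool sel list` output in the format shown above and picks out entries that look critical. The pattern list (`IERR`, `Uncorrectable ECC`) and the function names are assumptions made up for this example, based only on the entries above; the task itself does not define which events count as critical, and other BMCs or ipmitool versions may format lines differently.

```python
import re

# Hypothetical list of patterns treated as critical. This is an assumption
# drawn from the es2017 output above, not an agreed-upon list.
CRITICAL_PATTERNS = [
    re.compile(r"\bIERR\b"),
    re.compile(r"uncorrectable ecc", re.IGNORECASE),
]


def parse_sel_line(line):
    """Split one `ipmitool sel list` line into fields; return None if it does not fit."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6 or not fields[0]:
        return None
    return {
        "id": fields[0],
        "date": fields[1],
        "time": fields[2],
        "sensor": fields[3],
        # The description can itself contain '|' (see the DIMMA2 entry),
        # so everything between the sensor and the last field is rejoined.
        "description": " | ".join(fields[4:-1]),
        "state": fields[-1],
    }


def critical_events(sel_output):
    """Return the parsed SEL entries whose sensor or description matches a pattern."""
    events = []
    for line in sel_output.splitlines():
        entry = parse_sel_line(line)
        if entry is None:
            continue
        text = f"{entry['sensor']} {entry['description']}"
        if any(p.search(text) for p in CRITICAL_PATTERNS):
            events.append(entry)
    return events


if __name__ == "__main__":
    sample = """\
   1 | 02/08/2016 | 16:06:18 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 05/26/2016 | 12:22:06 | Processor #0x61 | IERR | Asserted
   3 | 05/26/2016 | 12:24:04 | Unknown #0x2e |  | Asserted
   4 | 05/26/2016 | 12:24:04 | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA2) | Asserted
"""
    for event in critical_events(sample):
        print(event["id"], event["sensor"], event["description"])
```

With the es2017 log above, this would flag entries 2 (Processor IERR) and 4 (Uncorrectable ECC on DIMMA2) and ignore the log-cleared and unknown entries. A real check would also need to remember which entries it has already reported, and would need an equivalent parser for HP's IML.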