KernelErrors Server cloudvirt1043 logged kernel errors
Closed, Declined | Public

Description

Common information

  • alertname: KernelErrors
  • category: keyword_error
  • cluster: wmcs
  • instance: cloudvirt1043:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Event Timeline

Andrew subscribed.
[1549405.992360] perf: interrupt took too long (6430 > 6150), lowering kernel.perf_event_max_sample_rate to 31000
[1549406.049708] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[1549406.050022] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[1549406.050337] {3}[Hardware Error]: event severity: corrected
[1549406.050338] {3}[Hardware Error]:  Error 0, type: corrected
[1549406.050651] {3}[Hardware Error]:  fru_text: B7
[1549406.050652] {3}[Hardware Error]:   section_type: memory error
[1549406.050653] {3}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[1549406.050967] {3}[Hardware Error]:   physical_address: 0x00000041e2a1d840
[1549406.051283] {3}[Hardware Error]:   node:2 card:0 module:1 rank:0 bank:1 device:6 row:1956 column:400
[1549406.051596] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[1549406.051912] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[1549406.060103] mce: [Hardware Error]: Machine check events logged
[1549406.061680] EDAC skx MC2: HANDLING MCE MEMORY ERROR
[1549406.061993] EDAC skx MC2: CPU 0: Machine Check Event: 0x0 Bank 255: 0x9c0000000000009f
[1549406.061995] EDAC skx MC2: TSC 0x0 
[1549406.061997] EDAC skx MC2: ADDR 0x41e2a1d840 
[1549406.062309] EDAC skx MC2: MISC 0x8c 
[1549406.062309] EDAC skx MC2: PROCESSOR 0:0x50657 TIME 1740424152 SOCKET 0 APIC 0x0

We are trying to ignore these 'it has been corrected' memory errors, and hope that the systems are replaced before error correction is no longer up to the job. See also T386083.
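
For illustration only, here is a minimal sketch of how a log excerpt like the one above could be checked for "corrected only" hardware errors before deciding to ignore it. It is not the alerting rule actually in use. The function names `ghes_severities` and `only_corrected` are hypothetical, and the regular expressions assume the `{N}[Hardware Error]: event severity: ...` line format seen in this paste.

```python
import re

# Matches APEI GHES lines in the format seen in this task, for example:
# [1549406.050337] {3}[Hardware Error]: event severity: corrected
GHES_LINE = re.compile(r"\{(?P<record>\d+)\}\[Hardware Error\]:\s*(?P<body>.*)$")
SEVERITY = re.compile(r"^event severity:\s*(?P<severity>\S+)")


def ghes_severities(lines):
    """Map each GHES record number to its reported event severity.

    Hypothetical helper: lines without a '{N}[Hardware Error]:' prefix
    (for example the 'mce:' and 'EDAC' lines) are skipped.
    """
    severities = {}
    for line in lines:
        match = GHES_LINE.search(line)
        if not match:
            continue
        sev = SEVERITY.match(match.group("body"))
        if sev:
            severities[match.group("record")] = sev.group("severity")
    return severities


def only_corrected(lines):
    """Return True if at least one GHES record exists and all are 'corrected'."""
    severities = ghes_severities(lines)
    return bool(severities) and all(s == "corrected" for s in severities.values())


# Three lines copied from the log in this task.
sample = [
    "[1549406.049708] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4",
    "[1549406.050337] {3}[Hardware Error]: event severity: corrected",
    "[1549406.051596] {3}[Hardware Error]:   error_type: 2, single-bit ECC",
]
print(only_corrected(sample))
```

A check like this only reads the severity the firmware reports. It does not track how often a given DIMM logs corrected errors, which is what would indicate that error correction is starting to struggle.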