
cloudcephosd1006: down, possible memory error
Closed, Resolved (Public)

Description

On 2020-10-29 at 07:49 UTC the cloudcephosd1006 server stopped responding. Monitoring alerted.

To bring the server back online, @aborrero hard-rebooted it using the IPMI interface at 08:15 UTC.

Nothing interesting in syslog: the log jumps straight from the last cron entry at 07:49 to the post-reboot messages at 08:15:

Oct 29 07:48:01 cloudcephosd1006 CRON[49089]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Oct 29 07:49:01 cloudcephosd1006 CRON[322]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Oct 29 08:15:21 cloudcephosd1006 systemd-modules-load[743]: Inserted module 'nf_conntrack'
Oct 29 08:15:21 cloudcephosd1006 systemd-modules-load[743]: Inserted module 'ipmi_devintf'
Oct 29 08:15:21 cloudcephosd1006 lvm[746]:   2 logical volume(s) in volume group "vg0" monitored
Oct 29 08:15:21 cloudcephosd1006 systemd[1]: Started Load Kernel Modules.

The racadm system event log (SEL) contains records that point to a hardware memory error:

racadm>>racadm getsel
[..]
Record:      8
Date/Time:   10/29/2020 07:49:43
Source:      system
Severity:    Non-Critical
Description: Correctable Machine Check Exception detected on CPU 2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   10/29/2020 07:49:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   10/29/2020 07:49:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   10/29/2020 07:49:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   10/29/2020 07:51:42
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   10/29/2020 07:51:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------

Specifically:

-------------------------------------------------------------------------------
Record:      34
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   10/29/2020 07:51:42
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   10/29/2020 07:51:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
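
The SEL dumps above were scanned by eye; as a sketch, the Critical records can be pulled out of a saved dump with standard tools (the `critical_records` helper name is made up here, and the here-document stands in for a real `racadm getsel` dump):

```shell
# Hypothetical helper: given `racadm getsel` output on stdin, print only
# the record numbers whose severity is Critical. -B3 pulls in the
# Record/Date/Source lines that precede each matching Severity line.
critical_records() {
    grep -B3 'Severity: *Critical$' | grep '^Record:'
}

# Demo on a trimmed copy of the log above; against the live DRAC this
# would be `racadm getsel | critical_records` instead.
critical_records <<'EOF'
Record:      21
Date/Time:   10/29/2020 07:49:44
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   10/29/2020 07:49:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
EOF
```

On the sample input only record 21 survives the filter, since record 23 has severity Ok.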

Event Timeline

For the record, the Ceph cluster health seems OK after the forced reboot:

aborrero@cloudcephmon1001:~ $ sudo ceph health detail
HEALTH_OK
aborrero@cloudcephmon1001:~ $ sudo ceph -s
  cluster:
    id:     5917e6d9-06a0-4928-827a-f489384975b1
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum cloudcephmon1003,cloudcephmon1002,cloudcephmon1001 (age 7w)
    mgr: cloudcephmon1003(active, since 7w), standbys: cloudcephmon1001, cloudcephmon1002
    osd: 120 osds: 120 up (since 40m), 120 in (since 40m)
 
  data:
    pools:   2 pools, 2052 pgs
    objects: 7.18M objects, 27 TiB
    usage:   81 TiB used, 129 TiB / 209 TiB avail
    pgs:     2052 active+clean
 
  io:
    client:   363 MiB/s rd, 324 MiB/s wr, 1.33k op/s rd, 4.46k op/s wr
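
The health check above was run by hand; a minimal sketch of scripting the same check (the `ceph_is_healthy` helper is hypothetical, and the `printf` stands in for live `sudo ceph health detail` output):

```shell
# Hypothetical helper: succeed only when the first line of
# `ceph health detail` output is exactly HEALTH_OK.
ceph_is_healthy() {
    head -n1 | grep -qx 'HEALTH_OK'
}

# Demo on captured output; on the cluster this would be
# `sudo ceph health detail | ceph_is_healthy`.
if printf 'HEALTH_OK\n' | ceph_is_healthy; then
    echo "cluster healthy"
fi
```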

This doesn't seem to have happened again in a long time; should we close it?

Andrew assigned this task to aborrero.

> This doesn't seem to have happened again in a long time; should we close it?

Yep, we can always reopen it if it shows up again.