
[ceph] cloudcephosd1007 - osd.6 - Icinga/Ceph Cluster Health - 2021-07-14 - OSD daemon crash
Closed, Resolved (Public)

Description


From the alerts dashboard:
https://alerts.wikimedia.org/?q=alertname%3D~.%2A%28cloud%7Cceph%7Clabs%29.%2A&q=alertname%3DIcinga%2FCeph%20Cluster%20Health

alertname: Icinga/Ceph Cluster Health
summary: cluster=misc instance=cloudcephmon1002 job=ceph_eqiad site=eqiad

Event Timeline

dcaro triaged this task as High priority. (Jul 14 2021, 2:17 PM)
dcaro created this task.

This is a crash of an OSD daemon:

dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z


dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *
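
If this check needs to be scripted, the table above can be parsed roughly as follows. This is a minimal sketch that assumes the plain-text layout shown in this task (an `ID ENTITY NEW` header, whitespace-separated columns); `parse_crash_ls` is a hypothetical helper name, not part of Ceph.

```python
def parse_crash_ls(output):
    """Parse the plain-text table printed by `ceph crash ls-new`.

    Assumes whitespace-separated columns in the order ID, ENTITY, NEW,
    as seen in this task; other Ceph releases may format it differently.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines and the "ID ENTITY NEW" header row.
        if len(parts) < 2 or parts[0] == "ID":
            continue
        rows.append({
            "id": parts[0],
            "entity": parts[1],
            # The NEW column holds "*" for crashes not yet archived.
            "new": len(parts) > 2 and parts[2] == "*",
        })
    return rows


# Sample copied from the output in this task.
sample = """\
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *
"""
crashes = parse_crash_ls(sample)
for crash in crashes:
    print(crash["entity"], crash["id"])
```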


dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
    "backtrace": [
        "(()+0x12730) [0x7f2c99ba3730]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
        "(AsyncConnection::_stop()+0xa7) [0x5561dced25a7]",
        "(ProtocolV2::stop()+0x8b) [0x5561dcef9fbb]",
        "(ProtocolV2::_fault()+0x6b) [0x5561dcefa13b]",
        "(AsyncConnection::tick(unsigned long)+0x1ec) [0x5561dced73ec]",
        "(EventCenter::process_time_events()+0x650) [0x5561dcd37940]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xb20) [0x5561dcd38720]",
        "(()+0x11feedb) [0x5561dcd3dedb]",
        "(()+0xbbb2f) [0x7f2c99a68b2f]",
        "(()+0x7fa3) [0x7f2c99b98fa3]",
        "(clone()+0x3f) [0x7f2c997464cf]"
    ],
    "ceph_version": "15.2.11",
    "crash_id": "2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f",
    "entity_name": "osd.6",
    "os_id": "10",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "os_version": "10 (buster)",
    "os_version_id": "10",
    "process_name": "ceph-osd",
    "stack_sig": "aa0107431ae7c6312d0d536b352d27869baebde33d63df769ed90cc7044c7468",
    "timestamp": "2021-07-14T13:27:32.881517Z",
    "utsname_hostname": "cloudcephosd1007",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-16-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.181-1 (2021-03-19)"
}
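
To compare this crash with any future one, it can help to reduce the backtrace to its resolved symbol names. The sketch below assumes each frame has the `(symbol+0xoffset) [0xaddress]` shape seen above, and that unresolved frames show up as `()`; `resolved_frames` and `FRAME_RE` are hypothetical names, and the sample is a trimmed copy of the JSON from this task.

```python
import json
import re

# Assumed frame shape, taken from the backtrace in this task:
#   "(symbol+0xoffset) [0xaddress]"
FRAME_RE = re.compile(r"^\((?P<symbol>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$")


def resolved_frames(crash):
    """Return the symbol names of the frames that were resolved."""
    symbols = []
    for frame in crash.get("backtrace", []):
        match = FRAME_RE.match(frame)
        # Frames printed as "(()+0x...)" carry no symbol name, so skip them.
        if match and match.group("symbol") != "()":
            symbols.append(match.group("symbol"))
    return symbols


# Trimmed copy of the `ceph crash info` output above.
crash = json.loads("""
{
    "backtrace": [
        "(()+0x12730) [0x7f2c99ba3730]",
        "(AsyncConnection::_stop()+0xa7) [0x5561dced25a7]",
        "(ProtocolV2::stop()+0x8b) [0x5561dcef9fbb]",
        "(clone()+0x3f) [0x7f2c997464cf]"
    ],
    "ceph_version": "15.2.11",
    "entity_name": "osd.6",
    "utsname_hostname": "cloudcephosd1007"
}
""")

print(crash["entity_name"], "on", crash["utsname_hostname"], "running", crash["ceph_version"])
for symbol in resolved_frames(crash):
    print("  ", symbol)
```

The `stack_sig` field in the full output is probably the simpler thing to compare if the same crash shows up again; the symbol list is mostly useful for reading.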
dcaro renamed this task from [ceph] Icinga/Ceph Cluster Health - 2021-07-14 to [ceph] Icinga/Ceph Cluster Health - 2021-07-14 - OSD daemon crash. (Jul 14 2021, 2:29 PM)
dcaro renamed this task from [ceph] Icinga/Ceph Cluster Health - 2021-07-14 - OSD daemon crash to [ceph] cloudcephosd1007 - osd.6 - Icinga/Ceph Cluster Health - 2021-07-14 - OSD daemon crash. (Jul 16 2021, 10:00 AM)

Added the relevant journal logs here: F34553052

Closing the task, as this is the first time this has happened and it caused no outage. If it happens again, we will look deeper into it.

Cleared crash list and restored the health status with 'ceph crash archive-all'.
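
For next time, a narrower option than 'archive-all' is `ceph crash archive <crash_id>`, which archives only the crash that was reviewed. A minimal sketch follows, assuming `sudo` is needed as in the commands above; `archive_crash` and `dry_run` are hypothetical helpers, and the runner is injectable so the sketch can be tried on a host without ceph.

```python
import subprocess


def archive_crash(crash_id, run=subprocess.run):
    """Archive a single crash report with `ceph crash archive <crash_id>`.

    `run` defaults to subprocess.run; pass another callable to dry-run.
    """
    cmd = ["sudo", "ceph", "crash", "archive", crash_id]
    # check=True makes subprocess.run raise if the command exits non-zero.
    run(cmd, check=True)
    return cmd


def dry_run(cmd, check=True):
    """Stand-in runner that only prints the command it would execute."""
    print("would run:", " ".join(cmd))


archive_crash(
    "2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f",
    run=dry_run,
)
```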