
[ceph] investigate osd crash in eqiad
Closed, Resolved · Public

Description

Just noticed that Ceph had a health warning when showing the status:

dcaro@cloudcephmon1001:~$ sudo ceph status
  cluster:
    id:     5917e6d9-06a0-4928-827a-f489384975b1
    health: HEALTH_WARN
            1 daemons have recently crashed
...

Going into the details, it seems one OSD crashed:

dcaro@cloudcephmon1001:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
RECENT_CRASH 1 daemons have recently crashed
    osd.20 crashed on host cloudcephosd1009 at 2021-01-21 19:01:18.756490Z
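For the record, the crash report itself can be inspected through Ceph's crash module (available since Nautilus); a rough sketch, where the crash id is a placeholder to be taken from the listing:

# list recent crash reports and their ids
sudo ceph crash ls
# show the metadata and backtrace of a single report
sudo ceph crash info <crash-id>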

Event Timeline

dcaro triaged this task as Medium priority.

It crashed again today: osd.20 crashed on host cloudcephosd1009 at 2021-02-01 09:56:00.017099Z.

Two more crashes happened this weekend; they seem related to this bug:
https://tracker.ceph.com/issues/48276

A fix for it has already been merged:
https://github.com/ceph/ceph/pull/38637

It will probably be included in the next point release (14.2.17).

In the meantime there's a workaround: changing the allocator from hybrid to avl or
bitmap:

dcaro@cloudcephosd1001:~$ sudo ceph daemon osd.48 config get bluefs_allocator
{
    "bluefs_allocator": "hybrid"
}

Changing that requires restarting the OSD (we have a full reboot pending: T272458),
so we can do it then (see the sketch below).
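For reference, applying the workaround would look roughly like this; a sketch only, assuming the centralized config database is in use and taking osd.20 on cloudcephosd1009 as the example daemon:

# set the allocator for all OSDs (or target a single daemon, e.g. osd.20)
sudo ceph config set osd bluefs_allocator bitmap
# restart the daemon on its host so the new allocator takes effect
sudo systemctl restart ceph-osd@20
# confirm the running value through the admin socket on that host
sudo ceph daemon osd.20 config get bluefs_allocator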

Luckily for us the issue is not taking the OSDs down, as they run on SSDs, so it
does not seem critical.

This has been fixed by upgrading to Octopus (v15).
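Side note: once the crashes are understood, the HEALTH_WARN about recently crashed daemons can be acknowledged via the crash module, for example:

# acknowledge all recent crash reports so the health warning clears
sudo ceph crash archive-all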