
[ceph] Investigate osds going down for very short periods of time
Closed, Resolved · Public

Description

While investigating the performance of the cluster I noticed that some osds go
down for a couple of seconds at a time and then come back up again.

For example:

root@cloudcephosd1001:~# while true; do ceph osd tree | grep down; sleep 1; done
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 
 47   ssd   1.74609         osd.47              down  1.00000 1.00000 


 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000 
 12   ssd   1.74609         osd.12              down  1.00000 1.00000
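
For reference, a sketch of a timestamped variant of the loop above (not the exact command used here; shown only to make the flaps easier to correlate with systemd/kernel logs later):

while true; do
    # print only the 'down' rows, prefixed with a UTC timestamp so they can be
    # matched against journald entries afterwards
    ceph osd tree | awk -v ts="$(date -u '+%F %T')" '/ down / {print ts, $0}'
    sleep 1
done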

Event Timeline

dcaro triaged this task as High priority. Feb 5 2021, 8:29 AM
dcaro created this task.

It seems they are being restarted by systemd for some reason... looking:

Feb 05 08:26:33 cloudcephosd1006 systemd[1]: ceph-osd@47.service: Main process exited, code=killed, status=9/KILL
Feb 05 08:26:33 cloudcephosd1006 systemd[1]: ceph-osd@47.service: Failed with result 'signal'.
Feb 05 08:26:34 cloudcephosd1006 systemd[1]: ceph-osd@47.service: Service RestartSec=100ms expired, scheduling restart.
Feb 05 08:26:34 cloudcephosd1006 systemd[1]: ceph-osd@47.service: Scheduled restart job, restart counter is at 6.
Feb 05 08:26:34 cloudcephosd1006 systemd[1]: Stopped Ceph object storage daemon osd.47.
Feb 05 08:26:34 cloudcephosd1006 systemd[1]: Starting Ceph object storage daemon osd.47...
Feb 05 08:26:34 cloudcephosd1006 systemd[1]: Started Ceph object storage daemon osd.47.
Feb 05 08:26:50 cloudcephosd1006 ceph-osd[19063]: 2021-02-05 08:26:50.387 7f21875b8c80 -1 osd.47 324531 log_to_monitors {default=true}
Feb 05 08:31:05 cloudcephosd1006 ceph-osd[19063]:  -7682> 2021-02-05 08:26:50.387 7f21875b8c80 -1 osd.47 324531 log_to_monitors {default=true}
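
To confirm what is killing the daemon, the unit log and the kernel log can be checked around the same window (a sketch; the time range is taken from the entries above):

# per-unit log for the flapping osd
journalctl -u ceph-osd@47 --since '2021-02-05 08:20' --until '2021-02-05 08:35'
# kernel log from the same window, looking for OOM kills
journalctl -k --since '2021-02-05 08:20' --until '2021-02-05 08:35' | grep -i 'out of memory'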

Ok... found it.

It turns out that the machines we have for osds come with two different amounts of
RAM, so when I tweaked the osd memory settings to fit one of the two sizes, some of
the osds on the smaller machines started getting OOM-killed:

Feb 05 08:26:33 cloudcephosd1006 kernel: Out of memory: Kill process 27120 (ceph-osd) score 155 or sacrifice child
Feb 05 08:26:33 cloudcephosd1006 kernel: Killed process 27120 (ceph-osd) total-vm:11202468kB, anon-rss:10249720kB, file-rss:0kB, shmem-rss:0kB

I've preemptively reduced the amount of memory allowed per osd to 6G :/
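
A sketch of how that cap can be applied, assuming the tuning is done through osd_memory_target (the task doesn't show the exact knob used):

# cap the per-osd memory target at 6 GiB (6 * 1024^3 bytes)
ceph config set osd osd_memory_target 6442450944
# verify what a given daemon sees
ceph config get osd.47 osd_memory_target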

This is fixed now, it was the RAM. There's some other issue going on though,
being tracked in T272993.