
[cloudceph] Slow operations - tracking task
Open, High, Public

Description

We sometimes get slow operations, and that correlates with VMs getting "stuck" while they happen.

This task is to track such events for long-term debugging and to record any ideas or improvements made during the investigation.

2023-04-04

From IRC:

<taavi> Taavi Väänänen tools-puppetdb-1 seems down
12:29:23 ceph is in HEALTH_WARN
12:29:25 dcaro:
12:29:40 <dcaro> David Caro oh, let me check
12:30:44 getting slow ops
12:30:47 https://www.irccloud.com/pastebin/J3vS2U5e/
---
root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
[WRN] SLOW_OPS: 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
---
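
(For reference, a minimal sketch of how a SLOW_OPS warning like the one above can be dug into; it assumes the standard Ceph CLI and the OSD admin socket on the OSD's host, and uses osd.218 only because that is the OSD flagged above:)
---
# On a monitor: list the slow-op warnings and which OSDs they point at.
ceph health detail | grep -i slow

# Locate the flagged OSD (host and CRUSH position).
ceph osd find 218

# On the host carrying osd.218: dump the ops currently in flight, with how long
# each has been blocked and at which stage it is stuck ("flag_point").
ceph daemon osd.218 dump_ops_in_flight | \
    jq '.ops[] | {description, age, flag_point: .type_data.flag_point}'
---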

12:32:36 the host is heavily writing, though it does not seem excessive
12:37:57 it went away
12:38:15 the slow ops were all "waiting for readable"
hm
12:39:59 this is not the first time we see slow ops randomly appearing, and they are taking down some VMs, what can we do to prevent this happening in the future?
12:40:45 also the alert needs adjusting as it says 'things should still be working as expected'
12:42:16 <dcaro> David Caro and they should according to ceph, writes just take longer, but they happen, maybe that's not acceptable at the libvirt level though
12:43:28 this is the list of clients with stuck operations at that point:
12:43:39 https://www.irccloud.com/pastebin/L6vusXGQ/
---
root@cloudcephosd1028:~# … | jq -r .ops[].type_data.client_info.client_addr | cut -d: -f1 | sort | uniq -c | sort -n
      1 10.64.148.7
      1 10.64.20.24
      1 10.64.20.50
      1 10.64.20.6
      1 10.64.20.9
      2 10.64.149.7
      2 10.64.149.9
      2 10.64.20.51
      2 10.64.20.7
      2 10.64.20.80
      3 10.64.148.8
      3 10.64.20.73
      4 10.64.20.12
      4 10.64.20.78
      4 10.64.20.8
      5 10.64.20.53
      5 10.64.20.75
      7 10.64.20.81
     12 10.64.148.9
    106 10.64.20.54
---
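
(To see which stage the stuck ops are blocked at, which is where "waiting for readable" showed up, and to map the top client address back to a hypervisor, something like the following should work; ops.json is a hypothetical file holding the same dump as the paste above, and reverse DNS is assumed to resolve the cloudvirt addresses:)
---
# Count how many stuck ops are blocked at each stage.
jq -r '.ops[].type_data.flag_point' ops.json | sort | uniq -c | sort -n

# Reverse-resolve the busiest client; 10.64.20.54 was the top entry above.
host 10.64.20.54
---
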
12:43:49 cloudvirt1030 being the one with the most stuck operations there
12:47:45 answering your question, I think that to prevent this we have to find the cause/causes/bottleneck there
13:03:15 taavi: looking at ceph status changes, there aren't too many warning status flaps, though there are some
13:03:17 https://usercontent.irccloud-cdn.com/file/MxnCGwAM/image.png

{F36941730}
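
(A rough way to count these warning flaps outside of the dashboard would be to grep the cluster log on a monitor; this is only a sketch and assumes the default cluster-log location:)
---
# On a cloudcephmon host: list recent SLOW_OPS health transitions from the
# cluster log (path assumes the default cluster-log location).
grep -hE 'Health check (failed|cleared).*SLOW_OPS' /var/log/ceph/ceph.log | tail -n 20
---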

13:03:53 some are also when I was tweaking the rack HA, but some are not
13:04:53 can you create a task to keep track of these issues?
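
(As a possible aid for the long-term tracking this task asks for, a small polling loop like the one below could be left running on a monitor; this is only a sketch, and the log path and interval are arbitrary illustrative choices, not anything currently deployed:)
---
#!/bin/bash
# Rough tracker: append a timestamped line whenever SLOW_OPS appears in the
# health detail, so flaps can later be correlated with VM hangs.
while true; do
    detail=$(ceph health detail | grep SLOW_OPS | head -n1)
    if [ -n "$detail" ]; then
        echo "$(date -Is) $detail" >> /var/log/ceph-slow-ops-tracker.log
    fi
    sleep 60
done
---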