
[cloudceph] Slow operations - tracking task
Open, High, Public

Description

We sometimes get slow operations, and those events correlate with VMs getting "stuck" while they happen.

This task is for tracking such events for long-term debugging, and for collecting any ideas or improvements made over the course of the investigation.
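
When one of these events happens, capturing the cluster state before it clears makes the later debugging much easier. A minimal capture sketch (assumes it runs on a cloudcephmon host with an admin keyring; output paths and filenames are illustrative):
---
# Snapshot the cluster state into a timestamped directory.
ts=$(date +%Y%m%dT%H%M%S)
out=/var/tmp/slow-ops-$ts
mkdir -p "$out"
ceph health detail > "$out/health-detail.txt"
ceph status        > "$out/status.txt"
ceph osd perf      > "$out/osd-perf.txt"
# The per-OSD op dump has to be taken on the host that owns the flagged OSD:
#   ceph daemon osd.<id> dump_ops_in_flight > ops-in-flight.json
---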

2024-02-07

Slow operations brought NFS down again

TO BE FILLED UP

2023-04-04

From IRC:

<taavi> Taavi Väänänen tools-puppetdb-1 seems down
12:29:23 ceph is in HEALTH_WARN
12:29:25 dcaro:
12:29:40 <dcaro> David Caro oh, let me check
12:30:44 getting slow ops
12:30:47 https://www.irccloud.com/pastebin/J3vS2U5e/
---
root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
[WRN] SLOW_OPS: 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
---

12:32:36 the host is heavily writing, though it does not seem excessive
12:37:57 it went away
12:38:15 the slow ops were all "waiting for readable"
hm
12:39:59 this is not the first time we see slow ops randomly appearing, and they are taking down some VMs, what can we do to prevent this happening in the future?
12:40:45 also the alert needs adjusting as it says 'things should still be working as expected'
12:42:16 <dcaro> David Caro and they should according to ceph, writes just take longer, but they happen, maybe that's not acceptable at the libvirt level though
12:43:28 this is the list of clients with stuck operations at that point:
12:43:39 https://www.irccloud.com/pastebin/L6vusXGQ/
---
root@cloudcephosd1028:~# ... | jq -r .ops[].type_data.client_info.client_addr | cut -d: -f1 | sort | uniq -c | sort -n
      1 10.64.148.7
      1 10.64.20.24
      1 10.64.20.50
      1 10.64.20.6
      1 10.64.20.9
      2 10.64.149.7
      2 10.64.149.9
      2 10.64.20.51
      2 10.64.20.7
      2 10.64.20.80
      3 10.64.148.8
      3 10.64.20.73
      4 10.64.20.12
      4 10.64.20.78
      4 10.64.20.8
      5 10.64.20.53
      5 10.64.20.75
      7 10.64.20.81
     12 10.64.148.9
    106 10.64.20.54
---
12:43:49 cloudvirt1030 being the one with more there
12:47:45 answering your question, I think that to prevent this we have to find the cause/causes/bottleneck there
13:03:15 taavi: looking at ceph status changes there's not too many warning status flaps, though some
13:03:17 https://usercontent.irccloud-cdn.com/file/MxnCGwAM/image.png

{F36941730}

13:03:53 some are also when I was tweaking the rack HA, but some are not
13:04:53 can you create a task to keep track of these issues?
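
The command at the start of the pastebin above is cut off; a plausible reconstruction, assuming the JSON came from the admin socket of the OSD reporting the slow ops (osd.218 in this incident), would be:
---
# Count stuck client ops per client IP (reconstruction of the pasted
# pipeline, not the exact command that was run).
ceph daemon osd.218 dump_ops_in_flight \
  | jq -r '.ops[].type_data.client_info.client_addr' \
  | cut -d: -f1 | sort | uniq -c | sort -n

# The same dump also shows where each op is stuck, e.g. the
# "waiting for readable" flag point mentioned above:
ceph daemon osd.218 dump_ops_in_flight \
  | jq -r '.ops[].type_data.flag_point' | sort | uniq -c
---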

Event Timeline

dcaro triaged this task as High priority. Apr 6 2023, 4:52 PM
dcaro created this task.
dcaro added a subscriber: taavi.
dcaro updated the task description.

Change 999068 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] ceph/rbd: slow down snap trimming on ssds, corrected

https://gerrit.wikimedia.org/r/999068

Change 999068 merged by Andrew Bogott:

[operations/puppet@production] ceph/rbd: slow down snap trimming on ssds, corrected

https://gerrit.wikimedia.org/r/999068
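
The patch above throttles snapshot trimming on the SSD OSDs through puppet-managed configuration; the exact value chosen is in the Gerrit change. For illustration only, the runtime equivalent of that knob can be inspected and adjusted with ceph config (the 2.0 below is an example value, not the one from the patch):
---
# Check and, if needed, raise the sleep inserted between snap trim
# operations on SSD-backed OSDs. Example value only.
ceph config get osd osd_snap_trim_sleep_ssd
ceph config set osd osd_snap_trim_sleep_ssd 2.0
---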