We sometimes get slow operations in the Ceph cluster, and that correlates with VMs getting "stuck" while they happen.
This task is to track such events for long-term debugging, and to collect any ideas or improvements made during the investigation.
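For reference, a minimal sketch of how to capture the cluster state when one of these events is noticed. It assumes it is run on a mon host (e.g. cloudcephmon1001) with the admin keyring; the commands are suggestions, not an established runbook:

```
# Overall cluster health; slow ops show up as HEALTH_WARN + SLOW_OPS.
ceph status
# Names the OSD(s) with slow ops and how long the oldest one has been blocked.
ceph health detail
# Per-OSD commit/apply latency, useful to spot a single misbehaving OSD/host.
ceph osd perf
```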
2024-02-07
Slow operations brought NFS down again
- possibly related to T356904: [cinder] [toolsdb] Deleting snapshot does not work (to be verified)
TO BE FILLED UP
2023-04-04
From IRC:
<taavi> Taavi Väänänen tools-puppetdb-1 seems down
12:29:23 ceph is in HEALTH_WARN
12:29:25 dcaro:
12:29:40 <dcaro> David Caro oh, let me check
12:30:44 getting slow ops
12:30:47 https://www.irccloud.com/pastebin/J3vS2U5e/
---
root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
[WRN] SLOW_OPS: 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
---
12:32:36 the host is heavily writing, though it does not seem excessive
12:37:57 it went away
12:38:15 the slow ops were all "waiting for readable" hm
12:39:59 this is not the first time we see slow ops randomly appearing, and they are taking down some VMs, what can we do to prevent this happening in the future?
12:40:45 also the alert needs adjusting as it says 'things should still be working as expected'
12:42:16 <dcaro> David Caro and they should according to ceph, writes just take longer, but they happen, maybe that's not acceptable at the libvirt level though
12:43:28 this is the list of clients with stuck operations at that point:
12:43:39 https://www.irccloud.com/pastebin/L6vusXGQ/
---
root@cloudcephosd1028:~# <lolo jq -r .ops[].type_data.client_info.client_addr | cut -d: -f1 | sort | uniq -c | sort -n
      1 10.64.148.7
      1 10.64.20.24
      1 10.64.20.50
      1 10.64.20.6
      1 10.64.20.9
      2 10.64.149.7
      2 10.64.149.9
      2 10.64.20.51
      2 10.64.20.7
      2 10.64.20.80
      3 10.64.148.8
      3 10.64.20.73
      4 10.64.20.12
      4 10.64.20.78
      4 10.64.20.8
      5 10.64.20.53
      5 10.64.20.75
      7 10.64.20.81
     12 10.64.148.9
    106 10.64.20.54
---
12:43:49 cloudvirt1030 being the one with more there
12:47:45 answering your question, I think that to prevent this we have to find the cause/causes/bottleneck there
13:03:15 taavi: looking at ceph status changes there's not too many warning status flaps, though some
13:03:17 https://usercontent.irccloud-cdn.com/file/MxnCGwAM/image.png
{F36941730}
13:03:53 some are also when I was tweaking the rack HA, but some are not
13:04:53 can you create a task to keep track of this issues?
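For future occurrences, this is roughly how the per-client breakdown in the paste above can be reproduced. It is a sketch only: the command in the paste was cut off before the `jq` filter, so the assumption here is that the JSON came from the OSD admin socket (`dump_ops_in_flight`, or `dump_blocked_ops` for ops already flagged as slow) on the host that owns the OSD named by `ceph health detail` (osd.218 on cloudcephosd1028 in this case):

```
# Dump the ops currently in flight on the slow OSD (JSON with an .ops[] array).
# Client ops carry type_data.client_info.client_addr ("IP:port/nonce"), so
# cutting at the first ':' keeps the client IP, and uniq -c counts how many
# stuck ops each client (i.e. each cloudvirt hypervisor) has.
ceph daemon osd.218 dump_ops_in_flight |
  jq -r '.ops[].type_data.client_info.client_addr' |
  cut -d: -f1 | sort | uniq -c | sort -n
```

In the excerpt above, the address with 106 stuck ops (10.64.20.54) is presumably cloudvirt1030, matching the comment that it was the client with the most blocked operations.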