Earlier today the prometheus (eqiad, codfw), centrallog and vrts hosts locked up at more or less the same time. The hosts were for the most part unresponsive over ssh, and sometimes on the console as well.
The one correlation I have been able to find so far is that fstrim.service started a few moments before the kernel began reporting problems. The full paste from e.g. centrallog1002 is at https://phabricator.wikimedia.org/P75746
2025-05-05T01:39:02.626566+00:00 centrallog1002 systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
2025-05-05T01:39:03.328414+00:00 centrallog1002 kernel: [472531.143541] BUG: kernel NULL pointer dereference, address: 0000000000000000
2025-05-05T01:39:03.328452+00:00 centrallog1002 kernel: [472531.150586] #PF: supervisor instruction fetch in kernel mode
2025-05-05T01:39:03.328455+00:00 centrallog1002 kernel: [472531.156331] #PF: error_code(0x0010) - not-present page
2025-05-05T01:39:03.328456+00:00 centrallog1002 kernel: [472531.161558] PGD 0 P4D 0
2025-05-05T01:39:03.328458+00:00 centrallog1002 kernel: [472531.164185] Oops: 0010 [#1] PREEMPT SMP NOPTI
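
To check whether the same ordering (fstrim.service start, then the kernel BUG) shows up on the other affected hosts, something like the following rough sketch could be run over their syslog. The function name `find_fstrim_oops` and the 60 second window are assumptions made for illustration, not existing tooling; the matched strings are taken from the centrallog1002 lines above.

```python
import re
from datetime import datetime, timedelta

# Splits a syslog line into: ISO 8601 timestamp, hostname, rest of the message.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")


def find_fstrim_oops(lines, window_seconds=60):
    """Return (host, fstrim_start, bug_time) tuples where a kernel NULL
    pointer BUG was logged within window_seconds after fstrim.service
    started on the same host. The window size is an arbitrary assumption."""
    window = timedelta(seconds=window_seconds)
    last_fstrim = {}  # host -> timestamp of the most recent fstrim start
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts_raw, host, rest = m.groups()
        try:
            ts = datetime.fromisoformat(ts_raw)
        except ValueError:
            continue  # not a timestamped syslog line, skip it
        if "Starting fstrim.service" in rest:
            last_fstrim[host] = ts
        elif "BUG: kernel NULL pointer dereference" in rest:
            started = last_fstrim.get(host)
            if started is not None and timedelta(0) <= ts - started <= window:
                hits.append((host, started, ts))
    return hits


# The first two lines of the centrallog1002 excerpt.
SAMPLE = [
    "2025-05-05T01:39:02.626566+00:00 centrallog1002 systemd[1]: "
    "Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...",
    "2025-05-05T01:39:03.328414+00:00 centrallog1002 kernel: [472531.143541] "
    "BUG: kernel NULL pointer dereference, address: 0000000000000000",
]

for host, started, bug in find_fstrim_oops(SAMPLE):
    print(host, "fstrim start:", started, "kernel BUG:", bug)
```

A match only shows that the two events were close in time on a host; it does not by itself establish that fstrim triggered the crash.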