
soft lockup on prometheus, centrallog, vrts hosts with the new kernel
Closed, Invalid (Public)

Description

Earlier today the prometheus (eqiad, codfw), centrallog, and vrts hosts locked up at more or less the same time. The hosts were for the most part unresponsive over ssh, and sometimes on the console as well.

The one correlation I have found so far is that fstrim.service started a few moments before the kernel began reporting problems. A full paste from one affected host, centrallog1002, is at https://phabricator.wikimedia.org/P75746

2025-05-05T01:39:02.626566+00:00 centrallog1002 systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
2025-05-05T01:39:03.328414+00:00 centrallog1002 kernel: [472531.143541] BUG: kernel NULL pointer dereference, address: 0000000000000000
2025-05-05T01:39:03.328452+00:00 centrallog1002 kernel: [472531.150586] #PF: supervisor instruction fetch in kernel mode
2025-05-05T01:39:03.328455+00:00 centrallog1002 kernel: [472531.156331] #PF: error_code(0x0010) - not-present page
2025-05-05T01:39:03.328456+00:00 centrallog1002 kernel: [472531.161558] PGD 0 P4D 0 
2025-05-05T01:39:03.328458+00:00 centrallog1002 kernel: [472531.164185] Oops: 0010 [#1] PREEMPT SMP NOPTI
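
To check whether other hosts show the same ordering, a rough sketch like the one below could scan a syslog file for kernel `BUG:` lines that follow an fstrim.service start. It assumes the ISO 8601 timestamp format seen in the excerpt above; the function name and the 60-second window are illustrative choices, not anything used in the actual investigation.

```python
from datetime import datetime, timedelta

# Two lines taken from the centrallog1002 excerpt above, used as sample input.
SAMPLE = [
    "2025-05-05T01:39:02.626566+00:00 centrallog1002 systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...",
    "2025-05-05T01:39:03.328414+00:00 centrallog1002 kernel: [472531.143541] BUG: kernel NULL pointer dereference, address: 0000000000000000",
]


def kernel_bugs_after_fstrim(lines, window_seconds=60):
    """Return kernel 'BUG:' lines logged within window_seconds after fstrim.service starts.

    Hypothetical helper: assumes each line is '<ISO timestamp> <host> <message>'.
    """
    window = timedelta(seconds=window_seconds)
    fstrim_start = None
    hits = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue  # not a syslog line in the expected shape
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # timestamp in some other format; skip it
        message = parts[2]
        if "Starting fstrim.service" in message:
            fstrim_start = ts
        elif fstrim_start is not None and message.startswith("kernel:") and "BUG:" in message:
            if timedelta(0) <= ts - fstrim_start <= window:
                hits.append(line)
    return hits


if __name__ == "__main__":
    for hit in kernel_bugs_after_fstrim(SAMPLE):
        print(hit)
```

A match only shows that the two events were close in time on that host; it does not by itself establish that fstrim triggered the oops.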

Event Timeline

fgiunchedi renamed this task from soft lockup on prometheus and centrallog hosts with the new kernel to soft lockup on prometheus, centrallog, vrts hosts with the new kernel. May 5 2025, 9:02 AM
fgiunchedi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-05-05T09:03:10Z] <godog> powercycle vrts1003 + vrts2002 - soft lockup T393357

Another correlation (and possibly the cause) is that every host that has locked up so far uses mdadm RAID 10.
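
One way to confirm which hosts fall into this group is to look for raid10 arrays in `/proc/mdstat`. The snippet below is a minimal sketch of that check; the function name is made up for illustration, and the sample text is a hypothetical mdstat listing, not output captured from any of the affected hosts.

```python
import re

# Hypothetical /proc/mdstat content for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid10]
md0 : active raid10 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      937435136 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
md1 : active raid1 sdb1[1] sda1[0]
      1046528 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

# Matches lines such as 'md0 : active raid10 ...', allowing an optional
# '(auto-read-only)' style flag between the state and the RAID level.
ARRAY_LINE = re.compile(r"^(md\S+)\s*:\s*(?:active|inactive)\s+(?:\([^)]*\)\s+)?(\S+)")


def raid10_arrays(mdstat_text):
    """Return the names of md arrays whose RAID level is raid10."""
    arrays = []
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match and match.group(2) == "raid10":
            arrays.append(match.group(1))
    return arrays


if __name__ == "__main__":
    # On a real host, read the file instead of using the sample:
    #     with open("/proc/mdstat") as f:
    #         print(raid10_arrays(f.read()))
    print(raid10_arrays(SAMPLE_MDSTAT))
```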

RAID 10 is a good lead! It seems the same was already reported in Debian a few days ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460

I'm resolving this in favor of T393366.