
Host lockups for frban2002 and frmon2002 after fstrim operation
Closed, Resolved, Public

Description

Was woken up by alerts and found frban2002 and frmon2002 pingable, but was unable to log in via ssh or the console. Restarted both hosts from the console using serveraction powercycle. In the logs, both hosts appear to have hit a kernel panic in conjunction with an fstrim operation:

frmon2002

May  5 00:27:07 frmon2002 systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
May  5 00:27:08 frmon2002 kernel: [537444.719622] BUG: kernel NULL pointer dereference, address: 0000000000000000
May  5 00:27:08 frmon2002 kernel: [537444.726674] #PF: supervisor instruction fetch in kernel mode
May  5 00:27:08 frmon2002 kernel: [537444.732420] #PF: error_code(0x0010) - not-present page
May  5 00:27:08 frmon2002 kernel: [537444.737647] PGD c2a784067 P4D 0-
May  5 00:27:08 frmon2002 kernel: [537444.740965] Oops: 0010 [#1] PREEMPT SMP NOPTI
May  5 00:27:08 frmon2002 kernel: [537444.745412] CPU: 19 PID: 442801 Comm: fstrim Not tainted 6.1.0-34-amd64 #1  Debian 6.1.135-1
May  5 00:27:08 frmon2002 kernel: [537444.753933] Hardware name: Dell Inc. PowerEdge R450/073H50, BIOS 1.9.2 11/17/2022
May  5 00:27:08 frmon2002 kernel: [537444.761498] RIP: 0010:0x0
May  5 00:27:08 frmon2002 kernel: [537444.764212] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
May  5 00:27:08 frmon2002 kernel: [537444.770822] RSP: 0018:ff244d8da6fa7718 EFLAGS: 00010206
May  5 00:27:08 frmon2002 kernel: [537444.776136] RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000c00
May  5 00:27:08 frmon2002 kernel: [537444.783356] RDX: 0000000000000803 RSI: 0000000000000000 RDI: 0000000000092800
May  5 00:27:08 frmon2002 kernel: [537444.790575] RBP: ff1a917ae0996718 R08: ff1a917ae0996700 R09: ff1a9179cd8fbf50
May  5 00:27:08 frmon2002 kernel: [537444.797792] R10: 0000000000000001 R11: 000000000005554a R12: 0000000000092c00
May  5 00:27:08 frmon2002 kernel: [537444.805014] R13: 0000000000000400 R14: 0000000000000803 R15: 0000000000000000
May  5 00:27:08 frmon2002 kernel: [537444.812233] FS:  00007f662b746840(0000) GS:ff1a918a1f840000(0000) knlGS:0000000000000000
May  5 00:27:08 frmon2002 kernel: [537444.820404] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  5 00:27:08 frmon2002 kernel: [537444.826238] CR2: ffffffffffffffd6 CR3: 00000009e7428005 CR4: 0000000000771ee0
May  5 00:27:08 frmon2002 kernel: [537444.833458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  5 00:27:08 frmon2002 kernel: [537444.840677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May  5 00:27:08 frmon2002 kernel: [537444.847895] PKRU: 55555554
May  5 00:27:08 frmon2002 kernel: [537444.850695] Call Trace:

frban2002

May  5 00:42:52 frban2002 systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
May  5 00:42:53 frban2002 kernel: [372148.019323] BUG: kernel NULL pointer dereference, address: 0000000000000000 
May  5 00:42:53 frban2002 kernel: [372148.026380] #PF: supervisor instruction fetch in kernel mode
May  5 00:42:53 frban2002 kernel: [372148.032126] #PF: error_code(0x0010) - not-present page
May  5 00:42:53 frban2002 kernel: [372148.037354] PGD 8ab89d067 P4D 0-
May  5 00:42:53 frban2002 kernel: [372148.040683] Oops: 0010 [#1] PREEMPT SMP NOPTI
May  5 00:42:53 frban2002 kernel: [372148.045134] CPU: 13 PID: 447162 Comm: fstrim Not tainted 6.1.0-34-amd64 #1  Debian 6.1.135-1
May  5 00:42:53 frban2002 kernel: [372148.053655] Hardware name: Dell Inc. PowerEdge R450/0VT18Y, BIOS 1.14.1 03/11/2024
May  5 00:42:53 frban2002 kernel: [372148.061306] RIP: 0010:0x0
May  5 00:42:53 frban2002 kernel: [372148.064024] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
May  5 00:42:53 frban2002 kernel: [372148.070633] RSP: 0018:ff218493645b78d8 EFLAGS: 00010206
May  5 00:42:53 frban2002 kernel: [372148.075945] RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000c00
May  5 00:42:53 frban2002 kernel: [372148.083164] RDX: 0000000000000803 RSI: 0000000000000000 RDI: 0000000000092800
May  5 00:42:53 frban2002 kernel: [372148.090385] RBP: ff11c1add95c4718 R08: ff11c1add95c4700 R09: ff11c1b54961f850
May  5 00:42:53 frban2002 kernel: [372148.097604] R10: 0000000000000001 R11: 000000000005554e R12: 0000000000092c00
May  5 00:42:53 frban2002 kernel: [372148.104822] R13: 0000000000000400 R14: 0000000000000803 R15: 0000000000000000
May  5 00:42:53 frban2002 kernel: [372148.112043] FS:  00007f3ba72f3840(0000) GS:ff11c1bd1f780000(0000) knlGS:0000000000000000
May  5 00:42:53 frban2002 kernel: [372148.120215] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
May  5 00:42:53 frban2002 kernel: [372148.126049] CR2: ffffffffffffffd6 CR3: 0000000893646002 CR4: 0000000000771ee0
May  5 00:42:53 frban2002 kernel: [372148.133267] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  5 00:42:53 frban2002 kernel: [372148.140487] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May  5 00:42:53 frban2002 kernel: [372148.147708] PKRU: 55555554
May  5 00:42:53 frban2002 kernel: [372148.150505] Call Trace:

Full logs available on host and on frlog2002.
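
To check the full logs for the same pattern, a minimal sketch along these lines could flag kernel BUG lines that follow an fstrim.service start on the same host. The function name and the line layout are assumptions based only on the excerpts above, not an existing tool.

```python
import re

# Assumed syslog layout, matching the excerpts above:
#   "<Mon> <day> <HH:MM:SS> <host> <source>: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+([^:]+):\s(.*)$")


def find_fstrim_oops(lines):
    """Return (host, fstrim_start_time, bug_message) tuples.

    A tuple is produced for the first kernel "BUG:" line seen on a host
    after that host logged the start of fstrim.service.
    (Hypothetical helper for illustration.)
    """
    started = {}  # host -> timestamp of the most recent fstrim start
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        timestamp, host, source, message = match.groups()
        if source.startswith("systemd") and "Starting fstrim.service" in message:
            started[host] = timestamp
        elif source == "kernel" and "BUG:" in message and host in started:
            hits.append((host, started.pop(host), message))
    return hits
```

Fed the frmon2002 excerpt above, this would pair the 00:27:07 fstrim start with the NULL pointer dereference logged one second later. It does not check how much time passed between the two events, so a real check would likely want a time window.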

Event Timeline

Dwisehaupt closed this task as Resolved. May 5 2025, 8:19 PM
Dwisehaupt claimed this task.
Dwisehaupt moved this task from Triage to Done on the fundraising-tech-ops board.

As stated in T393366 (Regression in RAID10 software RAID with 6.1.135), linked from T393357, we want to downgrade the kernel on the hosts running software RAID10. We only had 3 hosts (frmon2002, frban1001, frban1002) in that config. They have all been downgraded to 6.1.0-33-amd64 and rebooted. We can reassess when the next kernel revision comes out.
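
For reassessing later, the "software RAID10 plus affected kernel" condition could be checked with a small sketch like the one below. The affected release string is the one from the oops output in the description; the helper names and the reliance on the usual /proc/mdstat line layout ("md0 : active raid10 ...") are assumptions for illustration.

```python
# Assumption: the only affected release known from this task is the one
# shown in the oops output (6.1.0-34-amd64, Debian 6.1.135-1).
AFFECTED_RELEASES = {"6.1.0-34-amd64"}


def raid10_arrays(mdstat_text):
    """Return the names of md arrays listed as raid10 in /proc/mdstat text."""
    arrays = []
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(":")
        name = name.strip()
        # Array lines look like "md0 : active raid10 sda1[0] sdb1[1] ...".
        # The "Personalities" header line does not start with "md".
        if sep and name.startswith("md") and "raid10" in rest.split():
            arrays.append(name)
    return arrays


def needs_downgrade(mdstat_text, kernel_release):
    """True if the host has a raid10 array and runs an affected release."""
    return kernel_release in AFFECTED_RELEASES and bool(raid10_arrays(mdstat_text))
```

On a host, this could be called with the contents of /proc/mdstat and the value of platform.release(). The set of affected releases would need updating by hand once the state of the fix in later kernel revisions is known.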