
wikikube-worker2280 unreachable
Closed, Resolved (Public)

Description

Unreachable since 2026-04-14T21:18:02.84Z:

> start /system1/sol1
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)
[2495758.980765] watchdog: BUG: soft lockup - CPU#30 stuck for 40285s! [migration/30:167]
[2495758.980765] watchdog: BUG: soft lockup - CPU#29 stuck for 41276s! [migration/29:162]
[2495759.013359] watchdog: BUG: soft lockup - CPU#42 stuck for 40520s! [python3:2086447]
[2495762.365047] watchdog: BUG: soft lockup - CPU#11 stuck for 41615s! [cadvisor:2065]
[2495762.512702] watchdog: BUG: soft lockup - CPU#12 stuck for 41588s! [migration/12:76]
[2495762.672849] watchdog: BUG: soft lockup - CPU#16 stuck for 41589s! [migration/16:97]
[2495762.717367] watchdog: BUG: soft lockup - CPU#17 stuck for 41615s! [migration/17:102]
[2495762.797539] watchdog: BUG: soft lockup - CPU#19 stuck for 40598s! [prometheus-ipmi:1546852]
[2495762.957004] watchdog: BUG: soft lockup - CPU#23 stuck for 40598s! [prometheus-ipmi:1546855]
[2495762.961198] watchdog: BUG: soft lockup - CPU#24 stuck for 41406s! [kworker/24:0:1514115]
[2495762.968654] watchdog: BUG: soft lockup - CPU#25 stuck for 40520s! [prometheus-ipmi:1546799]
[2495762.984668] watchdog: BUG: soft lockup - CPU#31 stuck for 41619s! [cadvisor:2079]
[2495763.021300] watchdog: BUG: soft lockup - CPU#46 stuck for 40598s! [prometheus-ipmi:1546798]
...
...
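
A minimal sketch for condensing console output like the above into one entry per CPU, in case it helps when comparing incidents. The function name `summarize_soft_lockups` and the regular expression are illustrative assumptions; the pattern is derived only from the `watchdog: BUG: soft lockup` lines quoted in this task.

```python
import re

# Matches lines such as:
# [2495758.980765] watchdog: BUG: soft lockup - CPU#30 stuck for 40285s! [migration/30:167]
# search() is used below so a prefix such as a "Password: " prompt does not prevent a match.
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Return {cpu: (longest_stuck_seconds, task_name)} for all soft-lockup lines."""
    worst = {}
    for line in lines:
        match = SOFT_LOCKUP_RE.search(line)
        if match is None:
            continue  # not a soft-lockup line (register dump, call trace, ...)
        cpu = int(match["cpu"])
        secs = int(match["secs"])
        # Keep only the longest reported stall per CPU.
        if cpu not in worst or secs > worst[cpu][0]:
            worst[cpu] = (secs, match["comm"])
    return worst
```

Feeding it the saved console log, for example `summarize_soft_lockups(open("console.log"))` with a hypothetical file name, gives a quick view of how many CPUs were stuck and for how long.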

The health log in the web interface does not show anything from the recent past.

After a power-cycle it shows:

2026-04-15 09:43:24	ProcessorConfiguration	[PC-0001] Configuration error - CPU 1 DCU BUS Fatal error(Last Boot error) - Assertion

And immediate kernel lockups again:

[   59.253565] watchdog: BUG: soft lockup - CPU#47 stuck for 45s! [swapper/47:0]
[   59.261577] Modules linked in: raid1 md_mod rndis_host(+) cdc_ether usbnet mii hid_generic(+) sd_mod usbhid t10_pi hid crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci libahci xhci_pci xhci_hcd crct10dif_pclmul libata igb(+) crct10dif_common crc32_pclmul bnxt_en usbcore crc32c_intel i2c_algo_bit scsi_mod i2c_i801 dca i2c_smbus usb_common scsi_common
[   59.296905] CPU: 47 PID: 0 Comm: swapper/47 Not tainted 6.1.0-44-amd64 #1  Debian 6.1.164-1
[   59.306281] Hardware name: Supermicro SYS-120C-TR/X12DDW-A6, BIOS 2.1 07/04/2024
[   59.314583] RIP: 0010:file_free_rcu+0x8/0x50
[   59.319377] Code: c9 48 85 c0 49 0f 48 c1 48 89 05 23 4d 80 01 e9 6e 72 d4 ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 48 89 fe <48> 8b bf 90 00 00 00 48 85 ff 74 06 f0 48 ff 0f 74 0c 48 8b 3d 87
[   59.340444] RSP: 0018:ff56ee3c87048ed0 EFLAGS: 00010286
[   59.346308] RAX: ff1b7d0f84404500 RBX: 0000000000000006 RCX: 00000000802a0025
[   59.354316] RDX: ffffffff81d65f30 RSI: ff1b7d0f84404500 RDI: ff1b7d0f84404500
[   59.362324] RBP: ff1b7d0f83dea000 R08: 0000000000000000 R09: 00000000802a0025
[   59.370333] R10: ffffffff83406100 R11: 0000000000000000 R12: ff1b7d1f3fbf2880
[   59.378342] R13: 0000000000000005 R14: 000000000000000a R15: 0000000000000000
[   59.386352] FS:  0000000000000000(0000) GS:ff1b7d1f3fbc0000(0000) knlGS:0000000000000000
[   59.395435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.401884] CR2: 00007ff5c76e15c3 CR3: 0000001a4a010001 CR4: 0000000000771ee0
[   59.409893] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   59.417900] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   59.425909] PKRU: 55555554
[   59.428948] Call Trace:
[   59.431684]  <IRQ>
[   59.433941]  rcu_do_batch+0x197/0x540
[   59.438055]  rcu_core+0x1b5/0x4d0
[   59.441774]  handle_softirqs+0xd4/0x280
[   59.446080]  __irq_exit_rcu+0xac/0xe0
[   59.450190]  sysvec_apic_timer_interrupt+0x6e/0x90
[   59.455560]  </IRQ>
[   59.457913]  <TASK>
[   59.460268]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[   59.466040] RIP: 0010:cpuidle_enter_state+0xde/0x420
[   59.471618] Code: 00 00 31 ff e8 43 bc 96 ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 25 03 00 00 31 ff e8 18 69 9d ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[   59.492682] RSP: 0018:ff56ee3c84727e90 EFLAGS: 00000246
[   59.498547] RAX: ff1b7d1f3fbf1ac0 RBX: ff88ee3c7d3ff400 RCX: 0000000000000000
[   59.506555] RDX: 000000000000002f RSI: ff89b4ba6b913a22 RDI: 0000000000000000
[   59.514564] RBP: 0000000000000003 R08: 0000000000000004 R09: 000000003cf3cf3d
[   59.522573] R10: 0000000000000018 R11: 000000000000081b R12: ffffffff8359f440
[   59.530583] R13: 00000003282318bb R14: 0000000000000003 R15: 0000000000000000
[   59.538587]  cpuidle_enter+0x29/0x40
[   59.542600]  do_idle+0x202/0x2a0
[   59.546225]  cpu_startup_entry+0x26/0x30
[   59.550628]  start_secondary+0x12a/0x150
[   59.555034]  secondary_startup_64_no_verify+0xe5/0xeb
[   59.560706]  </TASK>

Powering down completely and then powering on again made the system boot. https://www.supermicro.com/en/support/faqs/faq.php?faq=40335 recommends disabling the C6 state:

Please disable the C6 State in the BIOS.

Advanced -> CPU Configuration -> Advanced Power Management Configuration -> Package C State Control

Package C State: [C2 State]

I did not check the current BIOS setting since the host is back up.

Event Timeline

JMeybohm updated the task description.

Icinga downtime and Alertmanager silence (ID=27539ca9-7c38-4cc9-b608-8e3021f0b499) set by jayme@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware issues

wikikube-worker2280.codfw.wmnet

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 pool for host wikikube-worker2280.codfw.wmnet completed:

  • wikikube-worker2280.codfw.wmnet (PASS)
    • Host wikikube-worker2280.codfw.wmnet pooled in wikikube-codfw
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6

C6 seems to be enabled
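
The same check could be run across every CPU rather than only `cpu0`. The following is a rough sketch, assuming the standard Linux cpuidle sysfs layout (`stateN/name` and `stateN/disable` under each CPU's `cpuidle` directory); the function names are made up for illustration. Note that this reflects the idle states the kernel exposes, which is not necessarily the same thing as the BIOS "Package C State" setting.

```python
from pathlib import Path


def _index(path, prefix):
    """Numeric suffix of a directory name such as 'cpu12' or 'state3'."""
    return int(path.name[len(prefix):])


def cpuidle_states(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU name to a list of (state_name, disabled) tuples."""
    result = {}
    cpu_dirs = [p for p in Path(sysfs_root).glob("cpu[0-9]*") if p.name[3:].isdigit()]
    for cpu_dir in sorted(cpu_dirs, key=lambda p: _index(p, "cpu")):
        state_dirs = [
            p for p in (cpu_dir / "cpuidle").glob("state[0-9]*") if p.name[5:].isdigit()
        ]
        states = []
        for state_dir in sorted(state_dirs, key=lambda p: _index(p, "state")):
            name = (state_dir / "name").read_text().strip()
            disable_file = state_dir / "disable"
            # A non-zero value in 'disable' means the kernel will not use this state.
            disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
            states.append((name, disabled))
        if states:
            result[cpu_dir.name] = states
    return result


def cpus_with_state_enabled(states_by_cpu, state_name="C6"):
    """List the CPUs on which the named idle state is present and not disabled."""
    return [
        cpu
        for cpu, states in states_by_cpu.items()
        if any(name == state_name and not disabled for name, disabled in states)
    ]
```

Calling `cpus_with_state_enabled(cpuidle_states())` on the host would return an empty list if no CPU still exposes an enabled C6 state.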

JMeybohm claimed this task.

I'd still leave it like that. This is the first occurrence and we already have a couple of Supermicros. We can try disabling C6 if we see this again, I'd say.

Supermicro or not, all wikikube-workers have C6 enabled, so if there is an issue with it, it will probably happen again. Agreed on letting it be for now because a BIOS setting update campaign would be painful.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker2280.codfw.wmnet completed:

  • wikikube-worker2280.codfw.wmnet (PASS)
    • Host wikikube-worker2280.codfw.wmnet depooled from wikikube-codfw

It happened again right away:

press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)

Password: [17342.978428] watchdog: BUG: soft lockup - CPU#23 stuck for 5133s! [migration/23:132]
[17346.990613] watchdog: BUG: soft lockup - CPU#25 stuck for 5141s! [php-fpm8.3:340534]
[17370.978471] watchdog: BUG: soft lockup - CPU#23 stuck for 5159s! [migration/23:132]

Because the host is in a weird state where connections hang, the Kubernetes scheduler still thinks it's OK to schedule workloads there, and they never start or never terminate, which breaks deployments.
I had to manually force delete all non-daemonset workloads on it to unblock.
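
How the force deletion was actually done is not recorded here. As one possible approach, the sketch below takes the JSON from `kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> -o json` and builds a `kubectl delete` command for every pod on the node that is not owned by a DaemonSet. The function name is hypothetical, and the commands are only built, not executed, so they can be reviewed first.

```python
def force_delete_commands(pod_list, node_name):
    """Build kubectl argument lists that force-delete non-DaemonSet pods on a node.

    pod_list is the parsed JSON of a Kubernetes PodList.
    """
    commands = []
    for pod in pod_list.get("items", []):
        # Guard against input that was not pre-filtered by node.
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        meta = pod.get("metadata", {})
        owners = meta.get("ownerReferences", [])
        # DaemonSet pods are left alone; they belong on the node by design.
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue
        commands.append([
            "kubectl", "delete", "pod", meta["name"],
            "--namespace", meta["namespace"],
            "--force", "--grace-period=0",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run` once it has been checked. `--force --grace-period=0` removes the pod object without waiting for the kubelet to confirm termination, which matters on a node that no longer responds.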

The system was still somewhat responsive this time (maybe because it got caught fast). I've now disabled C6 states in the BIOS. We can maybe repool tomorrow EU morning so we can keep an eye on it.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 pool for host wikikube-worker2280.codfw.wmnet completed:

  • wikikube-worker2280.codfw.wmnet (PASS)
    • Host wikikube-worker2280.codfw.wmnet pooled in wikikube-codfw

No issues since the repool; I'll close this again to jinx it.