restbase2037 is crashy
Closed, ResolvedPublic

Description

There have been three recent crashes/reboots of restbase2037:

Sun Jan 05 09:55:30 UTC 2025
Mon Jan 13 16:09:45 UTC 2025
Wed Jan 15 10:11:47 UTC 2025

The cause seems to be an (uncorrectable) memory fault:

1[ ... ]
2Jan 13 16:00:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
3Jan 13 16:00:09 restbase2037 systemd[1]: clean-confd-rundir.service: Succeeded.
4Jan 13 16:00:09 restbase2037 systemd[1]: Finished Clean old stale files in /var/run/confd-template.
5Jan 13 16:00:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
6Jan 13 16:00:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
7Jan 13 16:01:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
8Jan 13 16:01:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
9Jan 13 16:01:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
10Jan 13 16:02:09 restbase2037 systemd[1]: Starting Daily apt download activities...
11Jan 13 16:02:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
12Jan 13 16:02:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
13Jan 13 16:02:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
14Jan 13 16:02:10 restbase2037 systemd[1]: apt-daily.service: Succeeded.
15Jan 13 16:02:10 restbase2037 systemd[1]: Finished Daily apt download activities.
16Jan 13 16:02:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter...
17Jan 13 16:02:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
18Jan 13 16:02:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter.
19Jan 13 16:03:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
20Jan 13 16:03:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
21Jan 13 16:03:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
22Jan 13 16:04:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
23Jan 13 16:04:10 restbase2037 systemd[1]: Starting Update Debian version stat exported by node_exporter...
24Jan 13 16:04:10 restbase2037 systemd[1]: prometheus-debian-version-textfile.service: Succeeded.
25Jan 13 16:04:10 restbase2037 systemd[1]: Finished Update Debian version stat exported by node_exporter.
26Jan 13 16:04:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
27Jan 13 16:04:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
28Jan 13 16:05:01 restbase2037 CRON[1811836]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
29Jan 13 16:05:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
30Jan 13 16:05:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
31Jan 13 16:05:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
32Jan 13 16:06:05 restbase2037 kernel: [713286.500985] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
33Jan 13 16:06:05 restbase2037 kernel: [713286.509340] {1}[Hardware Error]: event severity: recoverable
34Jan 13 16:06:05 restbase2037 kernel: [713286.515093] {1}[Hardware Error]: Error 0, type: recoverable
35Jan 13 16:06:05 restbase2037 kernel: [713286.520848] {1}[Hardware Error]: fru_text: B1
36Jan 13 16:06:05 restbase2037 kernel: [713286.525377] {1}[Hardware Error]: section_type: memory error
37Jan 13 16:06:05 restbase2037 kernel: [713286.531213] {1}[Hardware Error]: error_status: 0x0000000000000400
38Jan 13 16:06:05 restbase2037 kernel: [713286.537565] {1}[Hardware Error]: physical_address: 0x0000001fe0a6a8c0
39Jan 13 16:06:05 restbase2037 kernel: [713286.544264] {1}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
40Jan 13 16:06:05 restbase2037 kernel: [713286.551398] {1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121878 column: 536
41Jan 13 16:06:05 restbase2037 kernel: [713286.561650] {1}[Hardware Error]: error_type: 3, multi-bit ECC
42Jan 13 16:06:05 restbase2037 kernel: [713286.567656] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
43Jan 13 16:06:05 restbase2037 kernel: [713286.577561] mce: [Hardware Error]: Machine check events logged
44Jan 13 16:06:05 restbase2037 kernel: [713286.579041] EDAC skx MC4: HANDLING MCE MEMORY ERROR
45Jan 13 16:06:05 restbase2037 kernel: [713286.579042] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
46Jan 13 16:06:05 restbase2037 kernel: [713286.579043] EDAC skx MC4: TSC 0x0
47Jan 13 16:06:05 restbase2037 kernel: [713286.579044] EDAC skx MC4: ADDR 0x1fe0a6a8c0
48Jan 13 16:06:05 restbase2037 kernel: [713286.579046] EDAC skx MC4: MISC 0x0
49Jan 13 16:06:05 restbase2037 kernel: [713286.579047] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784365 SOCKET 0 APIC 0x0
50Jan 13 16:06:05 restbase2037 kernel: [713286.579058] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe0a6a offset:0x8c0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe0a6a8c0 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b05350c0 ChannelId:0x0 RankAddress:0x3d829b0c0 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc16 Column:0x218 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
51Jan 13 16:06:05 restbase2037 kernel: [713286.579928] Memory failure: 0x1fe0a6a: recovery action for dirty LRU page: Recovered
52Jan 13 16:06:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
53Jan 13 16:06:06 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
54Jan 13 16:06:06 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
55Jan 13 16:06:21 restbase2037 kernel: [713302.548292] Disabling lock debugging due to kernel taint
56Jan 13 16:06:21 restbase2037 kernel: [713302.548450] mce: Uncorrected hardware memory error in user-access at 1ff97e3e80
57Jan 13 16:06:21 restbase2037 kernel: [713302.548475] mce: [Hardware Error]: Machine check events logged
58Jan 13 16:06:21 restbase2037 kernel: [713302.548874] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
59Jan 13 16:06:21 restbase2037 kernel: [713302.556250] mce: [Hardware Error]: CPU 31: Machine Check Exception: 7 Bank 1: bd80000000100134
60Jan 13 16:06:21 restbase2037 kernel: [713302.564230] {2}[Hardware Error]: event severity: recoverable
61Jan 13 16:06:21 restbase2037 kernel: [713302.564232] {2}[Hardware Error]: Error 0, type: recoverable
62Jan 13 16:06:21 restbase2037 kernel: [713302.564234] {2}[Hardware Error]: fru_text: B1
63Jan 13 16:06:21 restbase2037 kernel: [713302.564236] {2}[Hardware Error]: section_type: memory error
64Jan 13 16:06:21 restbase2037 kernel: [713302.564237] {2}[Hardware Error]: error_status: 0x0000000000000400
65Jan 13 16:06:21 restbase2037 kernel: [713302.564238] {2}[Hardware Error]: physical_address: 0x0000001ff97e3e80
66Jan 13 16:06:21 restbase2037 kernel: [713302.564239] {2}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
67Jan 13 16:06:21 restbase2037 kernel: [713302.564241] {2}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122670 column: 976
68Jan 13 16:06:21 restbase2037 kernel: [713302.564242] {2}[Hardware Error]: error_type: 3, multi-bit ECC
69Jan 13 16:06:21 restbase2037 kernel: [713302.564244] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
70Jan 13 16:06:21 restbase2037 kernel: [713302.565467] Memory failure: 0x1ff97e3: already hardware poisoned
71Jan 13 16:06:21 restbase2037 kernel: [713302.572999] mce: [Hardware Error]: RIP 33:<00007f083163504d>
72Jan 13 16:06:21 restbase2037 kernel: [713302.573002] mce: [Hardware Error]: TSC 290d1bb8f78bc7 ADDR 1ff97e3e80 MISC 86 PPIN 63b889bd12cd0082
73Jan 13 16:06:21 restbase2037 kernel: [713302.573005] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784381 SOCKET 1 APIC 5e microcode d0003e7
74Jan 13 16:06:21 restbase2037 kernel: [713302.574369] EDAC skx MC4: HANDLING MCE MEMORY ERROR
75Jan 13 16:06:21 restbase2037 kernel: [713302.574371] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
76Jan 13 16:06:21 restbase2037 kernel: [713302.574372] EDAC skx MC4: TSC 0x0
77Jan 13 16:06:21 restbase2037 kernel: [713302.574373] EDAC skx MC4: ADDR 0x1ff97e3e80
78Jan 13 16:06:21 restbase2037 kernel: [713302.574374] EDAC skx MC4: MISC 0x0
79Jan 13 16:06:21 restbase2037 kernel: [713302.574375] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784381 SOCKET 0 APIC 0x0
80Jan 13 16:06:21 restbase2037 kernel: [713302.574390] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff97e3 offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1ff97e3e80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7bcbf1e80 ChannelId:0x0 RankAddress:0x3de5f9e80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1df2e Column:0x3d0 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
81Jan 13 16:06:21 restbase2037 kernel: [713302.574474] Memory failure: 0x1ff97e3: Sending SIGBUS to node:1283309 due to hardware memory corruption
82Jan 13 16:06:22 restbase2037 kernel: [713302.680570] Memory failure: 0x1ff97e3: recovery action for dirty LRU page: Recovered
83Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h
84Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h
85Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h
86Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
87Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h
88Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 12h
89Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h
90Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h
91Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h
92Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h
93Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
94Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h
95Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 0Eh
96Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h
97Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 42h
98Jan 13 16:06:26 restbase2037 kernel: [713307.464119] MCE: Killing node:1912 due to hardware memory corruption fault at 55d4f706a708
99Jan 13 16:06:32 restbase2037 kernel: [713313.124667] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
100Jan 13 16:06:32 restbase2037 kernel: [713313.133021] {3}[Hardware Error]: event severity: recoverable
101Jan 13 16:06:32 restbase2037 kernel: [713313.138777] {3}[Hardware Error]: Error 0, type: recoverable
102Jan 13 16:06:32 restbase2037 kernel: [713313.144530] {3}[Hardware Error]: fru_text: B1
103Jan 13 16:06:32 restbase2037 kernel: [713313.149071] {3}[Hardware Error]: section_type: memory error
104Jan 13 16:06:32 restbase2037 kernel: [713313.154911] {3}[Hardware Error]: error_status: 0x0000000000000400
105Jan 13 16:06:32 restbase2037 kernel: [713313.161263] {3}[Hardware Error]: physical_address: 0x0000001fe31e8d40
106Jan 13 16:06:32 restbase2037 kernel: [713313.167963] {3}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
107Jan 13 16:06:32 restbase2037 kernel: [713313.175097] {3}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121958 column: 168
108Jan 13 16:06:32 restbase2037 kernel: [713313.185346] {3}[Hardware Error]: error_type: 3, multi-bit ECC
109Jan 13 16:06:32 restbase2037 kernel: [713313.191354] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
110Jan 13 16:06:32 restbase2037 kernel: [713313.203549] EDAC skx MC4: HANDLING MCE MEMORY ERROR
111Jan 13 16:06:32 restbase2037 kernel: [713313.203553] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
112Jan 13 16:06:32 restbase2037 kernel: [713313.203555] EDAC skx MC4: TSC 0x0
113Jan 13 16:06:32 restbase2037 kernel: [713313.203556] EDAC skx MC4: ADDR 0x1fe31e8d40
114Jan 13 16:06:32 restbase2037 kernel: [713313.203558] EDAC skx MC4: MISC 0x0
115Jan 13 16:06:32 restbase2037 kernel: [713313.203560] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784392 SOCKET 0 APIC 0x0
116Jan 13 16:06:32 restbase2037 kernel: [713313.203575] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe31e8 offset:0xd40 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe31e8d40 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b18f4540 ChannelId:0x0 RankAddress:0x3d8c7a540 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc66 Column:0xa8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
117Jan 13 16:06:32 restbase2037 kernel: [713313.208172] Memory failure: 0x1fe31e8: corrupted page was clean: dropped without side effects
118Jan 13 16:06:32 restbase2037 kernel: [713313.208197] Memory failure: 0x1fe31e8: recovery action for clean LRU page: Recovered
119Jan 13 16:06:52 restbase2037 kernel: [713333.164087] mce: Uncorrected hardware memory error in user-access at 1fee4e3000
120Jan 13 16:06:52 restbase2037 kernel: [713333.164116] mce: [Hardware Error]: CPU 3: Machine Check Exception: 7 Bank 1: bd80000000100134
121Jan 13 16:06:52 restbase2037 kernel: [713333.165003] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
122Jan 13 16:06:52 restbase2037 kernel: [713333.167304] Memory failure: 0x1fee4e3: Sending SIGBUS to node:214326 due to hardware memory corruption
123Jan 13 16:06:52 restbase2037 kernel: [713333.167315] Memory failure: 0x1fee4e3: recovery action for dirty LRU page: Recovered
124Jan 13 16:06:52 restbase2037 kernel: [713333.171546] mce: [Hardware Error]: RIP 33:<00007f77f8da2249>
125Jan 13 16:06:52 restbase2037 kernel: [713333.180128] {4}[Hardware Error]: event severity: recoverable
126Jan 13 16:06:52 restbase2037 kernel: [713333.180132] {4}[Hardware Error]: Error 0, type: recoverable
127Jan 13 16:06:52 restbase2037 kernel: [713333.180133] {4}[Hardware Error]: fru_text: B1
128Jan 13 16:06:52 restbase2037 kernel: [713333.180134] {4}[Hardware Error]: section_type: memory error
129Jan 13 16:06:52 restbase2037 kernel: [713333.180137] {4}[Hardware Error]: error_status: 0x0000000000000400
130Jan 13 16:06:52 restbase2037 kernel: [713333.180138] {4}[Hardware Error]: physical_address: 0x0000001fee4e3000
131Jan 13 16:06:52 restbase2037 kernel: [713333.180139] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
132Jan 13 16:06:52 restbase2037 kernel: [713333.180144] {4}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122318 column: 768
133Jan 13 16:06:52 restbase2037 kernel: [713333.180146] {4}[Hardware Error]: error_type: 3, multi-bit ECC
134Jan 13 16:06:52 restbase2037 kernel: [713333.180152] {4}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
135Jan 13 16:06:52 restbase2037 kernel: [713333.188500]
136Jan 13 16:06:52 restbase2037 kernel: [713333.199514] Memory failure: 0x1fee4e3: already hardware poisoned
137Jan 13 16:06:52 restbase2037 kernel: [713333.205745] mce: [Hardware Error]: TSC 290d2cca4df4bb ADDR 1fee4e3000 MISC 86 PPIN 63b889bd12cd0082
138Jan 13 16:06:52 restbase2037 kernel: [713333.205751] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784412 SOCKET 1 APIC 50 microcode d0003e7
139Jan 13 16:06:52 restbase2037 kernel: [713333.207429] EDAC skx MC4: HANDLING MCE MEMORY ERROR
140Jan 13 16:06:52 restbase2037 kernel: [713333.207431] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
141Jan 13 16:06:52 restbase2037 kernel: [713333.207432] EDAC skx MC4: TSC 0x0
142Jan 13 16:06:52 restbase2037 kernel: [713333.207433] EDAC skx MC4: ADDR 0x1fee4e3000
143Jan 13 16:06:52 restbase2037 kernel: [713333.207434] EDAC skx MC4: MISC 0x0
144Jan 13 16:06:52 restbase2037 kernel: [713333.207435] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784412 SOCKET 0 APIC 0x0
145Jan 13 16:06:52 restbase2037 kernel: [713333.207450] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fee4e3 offset:0x0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fee4e3000 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b7271800 ChannelId:0x0 RankAddress:0x3db939800 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1ddce Column:0x300 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
146Jan 13 16:06:55 restbase2037 kernel: [713336.493109] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
147Jan 13 16:06:55 restbase2037 kernel: [713336.503599] mce: Uncorrected hardware memory error in user-access at 1fece6be80
148Jan 13 16:06:55 restbase2037 kernel: [713336.504989] {5}[Hardware Error]: event severity: recoverable
149Jan 13 16:06:55 restbase2037 kernel: [713336.504996] {5}[Hardware Error]: Error 0, type: recoverable
150Jan 13 16:06:55 restbase2037 kernel: [713336.512417] mce: [Hardware Error]: CPU 1: Machine Check Exception: 7 Bank 1: bd80000000100134
151Jan 13 16:06:55 restbase2037 kernel: [713336.518144] {5}[Hardware Error]: fru_text: B1
152Jan 13 16:06:55 restbase2037 kernel: [713336.518147] {5}[Hardware Error]: section_type: memory error
153Jan 13 16:06:55 restbase2037 kernel: [713336.518149] {5}[Hardware Error]: error_status: 0x0000000000000400
154Jan 13 16:06:55 restbase2037 kernel: [713336.518150] {5}[Hardware Error]: physical_address: 0x0000001fece6be80
155Jan 13 16:06:55 restbase2037 kernel: [713336.518151] {5}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
156Jan 13 16:06:55 restbase2037 kernel: [713336.518155] {5}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122270 column: 984
157Jan 13 16:06:55 restbase2037 kernel: [713336.518157] {5}[Hardware Error]: error_type: 3, multi-bit ECC
158Jan 13 16:06:55 restbase2037 kernel: [713336.518164] {5}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
159Jan 13 16:06:55 restbase2037 kernel: [713336.519765] Memory failure: 0x1fece6b: already hardware poisoned
160Jan 13 16:06:55 restbase2037 kernel: [713336.523923] mce: [Hardware Error]: RIP 33:<00007f2d0305028e>
161Jan 13 16:06:55 restbase2037 kernel: [713336.523925] mce: [Hardware Error]: TSC 290d2ea6e84967 ADDR 1fece6be80 MISC 86 PPIN 63b889bd12cd0082
162Jan 13 16:06:55 restbase2037 kernel: [713336.523929] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784415 SOCKET 1 APIC 40 microcode d0003e7
163Jan 13 16:06:55 restbase2037 kernel: [713336.525355] EDAC skx MC4: HANDLING MCE MEMORY ERROR
164Jan 13 16:06:55 restbase2037 kernel: [713336.525452] Memory failure: 0x1fece6b: Sending SIGBUS to node:1811945 due to hardware memory corruption
165Jan 13 16:06:55 restbase2037 kernel: [713336.525463] Memory failure: 0x1fece6b: recovery action for dirty LRU page: Recovered
166Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
167Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: TSC 0x0
168Jan 13 16:06:55 restbase2037 kernel: [713336.636244] EDAC skx MC4: ADDR 0x1fece6be80
169Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: MISC 0x0
170Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784415 SOCKET 0 APIC 0x0
171Jan 13 16:06:55 restbase2037 kernel: [713336.636262] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fece6b offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fece6be80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b6735e80 ChannelId:0x0 RankAddress:0x3db39be80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dd9e Column:0x3d8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
172Jan 13 16:07:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics...
173Jan 13 16:07:05 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded.
174Jan 13 16:07:05 restbase2037 systemd[1]: Finished Export confd Prometheus metrics.
175Jan 13 16:07:15 restbase2037 kernel: [713356.161065] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
176Jan 13 16:07:15 restbase2037 kernel: [713356.169420] {6}[Hardware Error]: event severity: recoverable
177Jan 13 16:07:15 restbase2037 kernel: [713356.175176] {6}[Hardware Error]: Error 0, type: recoverable
178Jan 13 16:07:15 restbase2037 kernel: [713356.180928] {6}[Hardware Error]: fru_text: B1
179Jan 13 16:07:15 restbase2037 kernel: [713356.185460] {6}[Hardware Error]: section_type: memory error
180Jan 13 16:07:15 restbase2037 kernel: [713356.191292] {6}[Hardware Error]: error_status: 0x0000000000000400
181Jan 13 16:07:15 restbase2037 kernel: [713356.197645] {6}[Hardware Error]: physical_address: 0x0000001fe6c63900
182Jan 13 16:07:15 restbase2037 kernel: [713356.204343] {6}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
183Jan 13 16:07:15 restbase2037 kernel: [713356.211477] {6}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 800
184Jan 13 16:07:15 restbase2037 kernel: [713356.221729] {6}[Hardware Error]: error_type: 3, multi-bit ECC
185Jan 13 16:07:15 restbase2037 kernel: [713356.227735] {6}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
186Jan 13 16:07:15 restbase2037 kernel: [713356.238489] mce_notify_irq: 6 callbacks suppressed
187Jan 13 16:07:15 restbase2037 kernel: [713356.238491] mce: [Hardware Error]: Machine check events logged
188Jan 13 16:07:15 restbase2037 kernel: [713356.240034] EDAC skx MC4: HANDLING MCE MEMORY ERROR
189Jan 13 16:07:15 restbase2037 kernel: [713356.240038] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
190Jan 13 16:07:15 restbase2037 kernel: [713356.240039] EDAC skx MC4: TSC 0x0
191Jan 13 16:07:15 restbase2037 kernel: [713356.240051] EDAC skx MC4: ADDR 0x1fe6c63900
192Jan 13 16:07:15 restbase2037 kernel: [713356.240052] EDAC skx MC4: MISC 0x0
193Jan 13 16:07:15 restbase2037 kernel: [713356.240053] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784435 SOCKET 0 APIC 0x0
194Jan 13 16:07:15 restbase2037 kernel: [713356.240077] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c63 offset:0x900 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c63900 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3631900 ChannelId:0x0 RankAddress:0x3d9b19900 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x320 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
195Jan 13 16:07:15 restbase2037 kernel: [713356.242347] Memory failure: 0x1fe6c63: corrupted page was clean: dropped without side effects
196Jan 13 16:07:15 restbase2037 kernel: [713356.242374] Memory failure: 0x1fe6c63: recovery action for clean LRU page: Recovered
197Jan 13 16:07:20 restbase2037 kernel: [713361.163823] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
198Jan 13 16:07:20 restbase2037 kernel: [713361.172181] {7}[Hardware Error]: event severity: recoverable
199Jan 13 16:07:20 restbase2037 kernel: [713361.177934] {7}[Hardware Error]: Error 0, type: recoverable
200Jan 13 16:07:20 restbase2037 kernel: [713361.183680] {7}[Hardware Error]: fru_text: B1
201Jan 13 16:07:20 restbase2037 kernel: [713361.188213] {7}[Hardware Error]: section_type: memory error
202Jan 13 16:07:20 restbase2037 kernel: [713361.194045] {7}[Hardware Error]: error_status: 0x0000000000000400
203Jan 13 16:07:20 restbase2037 kernel: [713361.200395] {7}[Hardware Error]: physical_address: 0x0000001fe6c69080
204Jan 13 16:07:20 restbase2037 kernel: [713361.207098] {7}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0
205Jan 13 16:07:20 restbase2037 kernel: [713361.214231] {7}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 280
206Jan 13 16:07:20 restbase2037 kernel: [713361.224481] {7}[Hardware Error]: error_type: 3, multi-bit ECC
207Jan 13 16:07:20 restbase2037 kernel: [713361.230487] {7}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
208Jan 13 16:07:20 restbase2037 kernel: [713361.239756] mce: [Hardware Error]: Machine check events logged
209Jan 13 16:07:20 restbase2037 kernel: [713361.241172] EDAC skx MC4: HANDLING MCE MEMORY ERROR
210Jan 13 16:07:20 restbase2037 kernel: [713361.241177] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
211Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: TSC 0x0
212Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: ADDR 0x1fe6c69080
213Jan 13 16:07:20 restbase2037 kernel: [713361.241183] EDAC skx MC4: MISC 0x0
214Jan 13 16:07:20 restbase2037 kernel: [713361.241185] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784440 SOCKET 0 APIC 0x0
215Jan 13 16:07:20 restbase2037 kernel: [713361.241194] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c69 offset:0x80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c69080 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3634880 ChannelId:0x0 RankAddress:0x3d9b1a880 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x118 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
216Jan 13 16:07:20 restbase2037 kernel: [713361.242037] Memory failure: 0x1fe6c69: corrupted page was clean: dropped without side effects
217Jan 13 16:07:20 restbase2037 kernel: [713361.242056] Memory failure: 0x1fe6c69: recovery action for clean LRU page: Recovered
218Jan 13 16:07:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter...
219Jan 13 16:07:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
220Jan 13 16:07:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter.
221[ Unpaste-able binary data removed... ]
222Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'nf_conntrack'
223Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'ipmi_devintf'
224Jan 13 16:13:00 restbase2037 systemd[1]: Mounted POSIX Message Queue File System.
225Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Debug File System.
226Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Trace File System.
227Jan 13 16:13:00 restbase2037 systemd[1]: Finished Create list of static device nodes for the current kernel.
228Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@configfs.service: Succeeded.
229Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module configfs.
230Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@drm.service: Succeeded.
231Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module drm.
232Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@fuse.service: Succeeded.
233Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module fuse.
234Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Modules.
235[ ... ]
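
To make the pattern in a paste like the one above easier to see, the EDAC summary lines can be tallied per DIMM label. This is a minimal sketch, not a tool used in this task: the function names `summarize_edac` and `split_address` are hypothetical, the regex only targets the `EDAC MCn: <count> UE|CE ... on <label> (... page:... offset:...` line shape shown in this excerpt, and the page/offset split assumes 4 KiB pages.

```python
import re
from collections import defaultdict

# Matches only the EDAC summary line format seen in this excerpt, e.g.
# "EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x... offset:0x..."
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>UE|CE) .*? on (?P<label>\S+) "
    r"\(.*?page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def summarize_edac(lines):
    """Group EDAC events by (controller, DIMM label, UE/CE).

    Returns {(mc, label, kind): {"events": int, "pages": set of page numbers}}.
    Lines that do not match the summary format are ignored.
    """
    summary = defaultdict(lambda: {"events": 0, "pages": set()})
    for line in lines:
        m = EDAC_RE.search(line)
        if not m:
            continue
        key = (int(m["mc"]), m["label"], m["kind"])
        summary[key]["events"] += int(m["count"])
        summary[key]["pages"].add(int(m["page"], 16))
    return dict(summary)


def split_address(system_address, page_shift=12):
    """Split a physical address into (page number, offset), assuming 4 KiB pages."""
    return system_address >> page_shift, system_address & ((1 << page_shift) - 1)
```

Fed the excerpt above, every EDAC summary line maps to the same label, `CPU_SrcID#1_MC#0_Chan#0_DIMM#0`, each on a different page. As a cross-check of the address arithmetic, `split_address(0x1fe0a6a8c0)` gives page `0x1fe0a6a` and offset `0x8c0`, matching the first UE line.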

Event Timeline

The DRAC (if correct) says this is the DIMM in B1:

image.png (895×1 px, 168 KB)

Eevans triaged this task as Medium priority. Jan 15 2025, 9:15 PM
Eevans added a project: ops-codfw.
Eevans renamed this task from restbase2037 periodically rebooting(?) to restbase2037 is crashy. Jan 15 2025, 9:16 PM
Eevans added a project: Cassandra.

This host went down again this morning, same DIMM errors. I've depooled it for the time being.

11:30 <+icinga-wm> PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
11:32 <+jinxer-wm> FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - 
                   https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
11:34 <+icinga-wm> RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
11:39 <+jinxer-wm> FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - 
                   https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
11:42 <+jinxer-wm> RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - 
                   https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
11:45 <+icinga-wm> PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%

Mentioned in SAL (#wikimedia-operations) [2025-01-21T12:08:57Z] <hnowlan@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2037.codfw.wmnet with reason: Memory issues, rebooting frequently. Depooled. T383820

Eevans raised the priority of this task from Medium to High. Jan 21 2025, 4:00 PM

I've restarted the host (which was still down). If the host is down for any significant period of time, any hope of a graceful recovery will be lost. I would in fact just decommission it now, except I'm afraid it might crash during the decommission, which could also complicate matters.

I hope we can escalate this.


FYI, I first tried a hardreset, which failed and caused the following errors to be logged. Afterward I powered off/on and it seems to have come back up OK:

image.png (633×1 px, 115 KB)

swapped B1 to A1. gotta let it run and see if it crashes again. might not. sometimes that's all it needs. (Thanks for your patience, i was unexpectedly out the last half of last week)

Thanks; Fingers-crossed!


P.S. I've removed the downtime.

Jhancock.wm claimed this task.

not seeing any new errors on this machine. gonna close this ticket for now, but if it errors again, feel free to reopen or start a new one.
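
One way to spot-check for new memory errors after a swap like this is to read the kernel's EDAC counters from sysfs. The sketch below is an assumption about how such a check could be done, not a record of how it was done here (the DRAC event log is another obvious source). It relies on the standard `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`; the function name `read_edac_counts` is hypothetical. These counters live in the kernel, so they start again from zero after a reboot.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mcN": {"ce": int, "ue": int}} for each memory controller.

    "ce" is the corrected-error count and "ue" the uncorrected-error count
    reported by the EDAC driver. Controllers whose counters cannot be read
    are skipped; a missing root directory yields an empty dict.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text().strip()),
                "ue": int((mc / "ue_count").read_text().strip()),
            }
        except (OSError, ValueError):
            # Unreadable or malformed counter file: ignore this controller.
            continue
    return counts
```

Calling `read_edac_counts()` on the host and seeing `ue` stay at 0 for every controller over a few days would be consistent with "no new errors"; a non-zero `ue` on any controller would be a reason to reopen.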