There have been three recent crashes/reboots of restbase2037:
Sun Jan 05 09:55:30 UTC 2025
Mon Jan 13 16:09:45 UTC 2025
Wed Jan 15 10:11:47 UTC 2025
The cause seems to be an uncorrectable (multi-bit ECC) memory fault on the DIMM reported as B1, as shown in the syslog around the Jan 13 reboot:
1 | [ ... ] |
---|---|
2 | Jan 13 16:00:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
3 | Jan 13 16:00:09 restbase2037 systemd[1]: clean-confd-rundir.service: Succeeded. |
4 | Jan 13 16:00:09 restbase2037 systemd[1]: Finished Clean old stale files in /var/run/confd-template. |
5 | Jan 13 16:00:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
6 | Jan 13 16:00:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
7 | Jan 13 16:01:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
8 | Jan 13 16:01:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
9 | Jan 13 16:01:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
10 | Jan 13 16:02:09 restbase2037 systemd[1]: Starting Daily apt download activities... |
11 | Jan 13 16:02:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
12 | Jan 13 16:02:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
13 | Jan 13 16:02:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
14 | Jan 13 16:02:10 restbase2037 systemd[1]: apt-daily.service: Succeeded. |
15 | Jan 13 16:02:10 restbase2037 systemd[1]: Finished Daily apt download activities. |
16 | Jan 13 16:02:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter... |
17 | Jan 13 16:02:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded. |
18 | Jan 13 16:02:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter. |
19 | Jan 13 16:03:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
20 | Jan 13 16:03:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
21 | Jan 13 16:03:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
22 | Jan 13 16:04:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
23 | Jan 13 16:04:10 restbase2037 systemd[1]: Starting Update Debian version stat exported by node_exporter... |
24 | Jan 13 16:04:10 restbase2037 systemd[1]: prometheus-debian-version-textfile.service: Succeeded. |
25 | Jan 13 16:04:10 restbase2037 systemd[1]: Finished Update Debian version stat exported by node_exporter. |
26 | Jan 13 16:04:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
27 | Jan 13 16:04:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
28 | Jan 13 16:05:01 restbase2037 CRON[1811836]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) |
29 | Jan 13 16:05:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
30 | Jan 13 16:05:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
31 | Jan 13 16:05:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
32 | Jan 13 16:06:05 restbase2037 kernel: [713286.500985] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
33 | Jan 13 16:06:05 restbase2037 kernel: [713286.509340] {1}[Hardware Error]: event severity: recoverable |
34 | Jan 13 16:06:05 restbase2037 kernel: [713286.515093] {1}[Hardware Error]: Error 0, type: recoverable |
35 | Jan 13 16:06:05 restbase2037 kernel: [713286.520848] {1}[Hardware Error]: fru_text: B1 |
36 | Jan 13 16:06:05 restbase2037 kernel: [713286.525377] {1}[Hardware Error]: section_type: memory error |
37 | Jan 13 16:06:05 restbase2037 kernel: [713286.531213] {1}[Hardware Error]: error_status: 0x0000000000000400 |
38 | Jan 13 16:06:05 restbase2037 kernel: [713286.537565] {1}[Hardware Error]: physical_address: 0x0000001fe0a6a8c0 |
39 | Jan 13 16:06:05 restbase2037 kernel: [713286.544264] {1}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
40 | Jan 13 16:06:05 restbase2037 kernel: [713286.551398] {1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121878 column: 536 |
41 | Jan 13 16:06:05 restbase2037 kernel: [713286.561650] {1}[Hardware Error]: error_type: 3, multi-bit ECC |
42 | Jan 13 16:06:05 restbase2037 kernel: [713286.567656] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
43 | Jan 13 16:06:05 restbase2037 kernel: [713286.577561] mce: [Hardware Error]: Machine check events logged |
44 | Jan 13 16:06:05 restbase2037 kernel: [713286.579041] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
45 | Jan 13 16:06:05 restbase2037 kernel: [713286.579042] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
46 | Jan 13 16:06:05 restbase2037 kernel: [713286.579043] EDAC skx MC4: TSC 0x0 |
47 | Jan 13 16:06:05 restbase2037 kernel: [713286.579044] EDAC skx MC4: ADDR 0x1fe0a6a8c0 |
48 | Jan 13 16:06:05 restbase2037 kernel: [713286.579046] EDAC skx MC4: MISC 0x0 |
49 | Jan 13 16:06:05 restbase2037 kernel: [713286.579047] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784365 SOCKET 0 APIC 0x0 |
50 | Jan 13 16:06:05 restbase2037 kernel: [713286.579058] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe0a6a offset:0x8c0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe0a6a8c0 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b05350c0 ChannelId:0x0 RankAddress:0x3d829b0c0 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc16 Column:0x218 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
51 | Jan 13 16:06:05 restbase2037 kernel: [713286.579928] Memory failure: 0x1fe0a6a: recovery action for dirty LRU page: Recovered |
52 | Jan 13 16:06:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
53 | Jan 13 16:06:06 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
54 | Jan 13 16:06:06 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
55 | Jan 13 16:06:21 restbase2037 kernel: [713302.548292] Disabling lock debugging due to kernel taint |
56 | Jan 13 16:06:21 restbase2037 kernel: [713302.548450] mce: Uncorrected hardware memory error in user-access at 1ff97e3e80 |
57 | Jan 13 16:06:21 restbase2037 kernel: [713302.548475] mce: [Hardware Error]: Machine check events logged |
58 | Jan 13 16:06:21 restbase2037 kernel: [713302.548874] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
59 | Jan 13 16:06:21 restbase2037 kernel: [713302.556250] mce: [Hardware Error]: CPU 31: Machine Check Exception: 7 Bank 1: bd80000000100134 |
60 | Jan 13 16:06:21 restbase2037 kernel: [713302.564230] {2}[Hardware Error]: event severity: recoverable |
61 | Jan 13 16:06:21 restbase2037 kernel: [713302.564232] {2}[Hardware Error]: Error 0, type: recoverable |
62 | Jan 13 16:06:21 restbase2037 kernel: [713302.564234] {2}[Hardware Error]: fru_text: B1 |
63 | Jan 13 16:06:21 restbase2037 kernel: [713302.564236] {2}[Hardware Error]: section_type: memory error |
64 | Jan 13 16:06:21 restbase2037 kernel: [713302.564237] {2}[Hardware Error]: error_status: 0x0000000000000400 |
65 | Jan 13 16:06:21 restbase2037 kernel: [713302.564238] {2}[Hardware Error]: physical_address: 0x0000001ff97e3e80 |
66 | Jan 13 16:06:21 restbase2037 kernel: [713302.564239] {2}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
67 | Jan 13 16:06:21 restbase2037 kernel: [713302.564241] {2}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122670 column: 976 |
68 | Jan 13 16:06:21 restbase2037 kernel: [713302.564242] {2}[Hardware Error]: error_type: 3, multi-bit ECC |
69 | Jan 13 16:06:21 restbase2037 kernel: [713302.564244] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
70 | Jan 13 16:06:21 restbase2037 kernel: [713302.565467] Memory failure: 0x1ff97e3: already hardware poisoned |
71 | Jan 13 16:06:21 restbase2037 kernel: [713302.572999] mce: [Hardware Error]: RIP 33:<00007f083163504d> |
72 | Jan 13 16:06:21 restbase2037 kernel: [713302.573002] mce: [Hardware Error]: TSC 290d1bb8f78bc7 ADDR 1ff97e3e80 MISC 86 PPIN 63b889bd12cd0082 |
73 | Jan 13 16:06:21 restbase2037 kernel: [713302.573005] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784381 SOCKET 1 APIC 5e microcode d0003e7 |
74 | Jan 13 16:06:21 restbase2037 kernel: [713302.574369] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
75 | Jan 13 16:06:21 restbase2037 kernel: [713302.574371] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
76 | Jan 13 16:06:21 restbase2037 kernel: [713302.574372] EDAC skx MC4: TSC 0x0 |
77 | Jan 13 16:06:21 restbase2037 kernel: [713302.574373] EDAC skx MC4: ADDR 0x1ff97e3e80 |
78 | Jan 13 16:06:21 restbase2037 kernel: [713302.574374] EDAC skx MC4: MISC 0x0 |
79 | Jan 13 16:06:21 restbase2037 kernel: [713302.574375] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784381 SOCKET 0 APIC 0x0 |
80 | Jan 13 16:06:21 restbase2037 kernel: [713302.574390] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff97e3 offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1ff97e3e80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7bcbf1e80 ChannelId:0x0 RankAddress:0x3de5f9e80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1df2e Column:0x3d0 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
81 | Jan 13 16:06:21 restbase2037 kernel: [713302.574474] Memory failure: 0x1ff97e3: Sending SIGBUS to node:1283309 due to hardware memory corruption |
82 | Jan 13 16:06:22 restbase2037 kernel: [713302.680570] Memory failure: 0x1ff97e3: recovery action for dirty LRU page: Recovered |
83 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h |
84 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h |
85 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h |
86 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h |
87 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h |
88 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 12h |
89 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h |
90 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h |
91 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h |
92 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h |
93 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h |
94 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h |
95 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 0Eh |
96 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h |
97 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 42h |
98 | Jan 13 16:06:26 restbase2037 kernel: [713307.464119] MCE: Killing node:1912 due to hardware memory corruption fault at 55d4f706a708 |
99 | Jan 13 16:06:32 restbase2037 kernel: [713313.124667] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
100 | Jan 13 16:06:32 restbase2037 kernel: [713313.133021] {3}[Hardware Error]: event severity: recoverable |
101 | Jan 13 16:06:32 restbase2037 kernel: [713313.138777] {3}[Hardware Error]: Error 0, type: recoverable |
102 | Jan 13 16:06:32 restbase2037 kernel: [713313.144530] {3}[Hardware Error]: fru_text: B1 |
103 | Jan 13 16:06:32 restbase2037 kernel: [713313.149071] {3}[Hardware Error]: section_type: memory error |
104 | Jan 13 16:06:32 restbase2037 kernel: [713313.154911] {3}[Hardware Error]: error_status: 0x0000000000000400 |
105 | Jan 13 16:06:32 restbase2037 kernel: [713313.161263] {3}[Hardware Error]: physical_address: 0x0000001fe31e8d40 |
106 | Jan 13 16:06:32 restbase2037 kernel: [713313.167963] {3}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
107 | Jan 13 16:06:32 restbase2037 kernel: [713313.175097] {3}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121958 column: 168 |
108 | Jan 13 16:06:32 restbase2037 kernel: [713313.185346] {3}[Hardware Error]: error_type: 3, multi-bit ECC |
109 | Jan 13 16:06:32 restbase2037 kernel: [713313.191354] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
110 | Jan 13 16:06:32 restbase2037 kernel: [713313.203549] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
111 | Jan 13 16:06:32 restbase2037 kernel: [713313.203553] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
112 | Jan 13 16:06:32 restbase2037 kernel: [713313.203555] EDAC skx MC4: TSC 0x0 |
113 | Jan 13 16:06:32 restbase2037 kernel: [713313.203556] EDAC skx MC4: ADDR 0x1fe31e8d40 |
114 | Jan 13 16:06:32 restbase2037 kernel: [713313.203558] EDAC skx MC4: MISC 0x0 |
115 | Jan 13 16:06:32 restbase2037 kernel: [713313.203560] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784392 SOCKET 0 APIC 0x0 |
116 | Jan 13 16:06:32 restbase2037 kernel: [713313.203575] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe31e8 offset:0xd40 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe31e8d40 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b18f4540 ChannelId:0x0 RankAddress:0x3d8c7a540 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc66 Column:0xa8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
117 | Jan 13 16:06:32 restbase2037 kernel: [713313.208172] Memory failure: 0x1fe31e8: corrupted page was clean: dropped without side effects |
118 | Jan 13 16:06:32 restbase2037 kernel: [713313.208197] Memory failure: 0x1fe31e8: recovery action for clean LRU page: Recovered |
119 | Jan 13 16:06:52 restbase2037 kernel: [713333.164087] mce: Uncorrected hardware memory error in user-access at 1fee4e3000 |
120 | Jan 13 16:06:52 restbase2037 kernel: [713333.164116] mce: [Hardware Error]: CPU 3: Machine Check Exception: 7 Bank 1: bd80000000100134 |
121 | Jan 13 16:06:52 restbase2037 kernel: [713333.165003] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
122 | Jan 13 16:06:52 restbase2037 kernel: [713333.167304] Memory failure: 0x1fee4e3: Sending SIGBUS to node:214326 due to hardware memory corruption |
123 | Jan 13 16:06:52 restbase2037 kernel: [713333.167315] Memory failure: 0x1fee4e3: recovery action for dirty LRU page: Recovered |
124 | Jan 13 16:06:52 restbase2037 kernel: [713333.171546] mce: [Hardware Error]: RIP 33:<00007f77f8da2249> |
125 | Jan 13 16:06:52 restbase2037 kernel: [713333.180128] {4}[Hardware Error]: event severity: recoverable |
126 | Jan 13 16:06:52 restbase2037 kernel: [713333.180132] {4}[Hardware Error]: Error 0, type: recoverable |
127 | Jan 13 16:06:52 restbase2037 kernel: [713333.180133] {4}[Hardware Error]: fru_text: B1 |
128 | Jan 13 16:06:52 restbase2037 kernel: [713333.180134] {4}[Hardware Error]: section_type: memory error |
129 | Jan 13 16:06:52 restbase2037 kernel: [713333.180137] {4}[Hardware Error]: error_status: 0x0000000000000400 |
130 | Jan 13 16:06:52 restbase2037 kernel: [713333.180138] {4}[Hardware Error]: physical_address: 0x0000001fee4e3000 |
131 | Jan 13 16:06:52 restbase2037 kernel: [713333.180139] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
132 | Jan 13 16:06:52 restbase2037 kernel: [713333.180144] {4}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122318 column: 768 |
133 | Jan 13 16:06:52 restbase2037 kernel: [713333.180146] {4}[Hardware Error]: error_type: 3, multi-bit ECC |
134 | Jan 13 16:06:52 restbase2037 kernel: [713333.180152] {4}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
135 | Jan 13 16:06:52 restbase2037 kernel: [713333.188500] |
136 | Jan 13 16:06:52 restbase2037 kernel: [713333.199514] Memory failure: 0x1fee4e3: already hardware poisoned |
137 | Jan 13 16:06:52 restbase2037 kernel: [713333.205745] mce: [Hardware Error]: TSC 290d2cca4df4bb ADDR 1fee4e3000 MISC 86 PPIN 63b889bd12cd0082 |
138 | Jan 13 16:06:52 restbase2037 kernel: [713333.205751] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784412 SOCKET 1 APIC 50 microcode d0003e7 |
139 | Jan 13 16:06:52 restbase2037 kernel: [713333.207429] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
140 | Jan 13 16:06:52 restbase2037 kernel: [713333.207431] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
141 | Jan 13 16:06:52 restbase2037 kernel: [713333.207432] EDAC skx MC4: TSC 0x0 |
142 | Jan 13 16:06:52 restbase2037 kernel: [713333.207433] EDAC skx MC4: ADDR 0x1fee4e3000 |
143 | Jan 13 16:06:52 restbase2037 kernel: [713333.207434] EDAC skx MC4: MISC 0x0 |
144 | Jan 13 16:06:52 restbase2037 kernel: [713333.207435] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784412 SOCKET 0 APIC 0x0 |
145 | Jan 13 16:06:52 restbase2037 kernel: [713333.207450] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fee4e3 offset:0x0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fee4e3000 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b7271800 ChannelId:0x0 RankAddress:0x3db939800 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1ddce Column:0x300 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
146 | Jan 13 16:06:55 restbase2037 kernel: [713336.493109] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
147 | Jan 13 16:06:55 restbase2037 kernel: [713336.503599] mce: Uncorrected hardware memory error in user-access at 1fece6be80 |
148 | Jan 13 16:06:55 restbase2037 kernel: [713336.504989] {5}[Hardware Error]: event severity: recoverable |
149 | Jan 13 16:06:55 restbase2037 kernel: [713336.504996] {5}[Hardware Error]: Error 0, type: recoverable |
150 | Jan 13 16:06:55 restbase2037 kernel: [713336.512417] mce: [Hardware Error]: CPU 1: Machine Check Exception: 7 Bank 1: bd80000000100134 |
151 | Jan 13 16:06:55 restbase2037 kernel: [713336.518144] {5}[Hardware Error]: fru_text: B1 |
152 | Jan 13 16:06:55 restbase2037 kernel: [713336.518147] {5}[Hardware Error]: section_type: memory error |
153 | Jan 13 16:06:55 restbase2037 kernel: [713336.518149] {5}[Hardware Error]: error_status: 0x0000000000000400 |
154 | Jan 13 16:06:55 restbase2037 kernel: [713336.518150] {5}[Hardware Error]: physical_address: 0x0000001fece6be80 |
155 | Jan 13 16:06:55 restbase2037 kernel: [713336.518151] {5}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
156 | Jan 13 16:06:55 restbase2037 kernel: [713336.518155] {5}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122270 column: 984 |
157 | Jan 13 16:06:55 restbase2037 kernel: [713336.518157] {5}[Hardware Error]: error_type: 3, multi-bit ECC |
158 | Jan 13 16:06:55 restbase2037 kernel: [713336.518164] {5}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
159 | Jan 13 16:06:55 restbase2037 kernel: [713336.519765] Memory failure: 0x1fece6b: already hardware poisoned |
160 | Jan 13 16:06:55 restbase2037 kernel: [713336.523923] mce: [Hardware Error]: RIP 33:<00007f2d0305028e> |
161 | Jan 13 16:06:55 restbase2037 kernel: [713336.523925] mce: [Hardware Error]: TSC 290d2ea6e84967 ADDR 1fece6be80 MISC 86 PPIN 63b889bd12cd0082 |
162 | Jan 13 16:06:55 restbase2037 kernel: [713336.523929] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784415 SOCKET 1 APIC 40 microcode d0003e7 |
163 | Jan 13 16:06:55 restbase2037 kernel: [713336.525355] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
164 | Jan 13 16:06:55 restbase2037 kernel: [713336.525452] Memory failure: 0x1fece6b: Sending SIGBUS to node:1811945 due to hardware memory corruption |
165 | Jan 13 16:06:55 restbase2037 kernel: [713336.525463] Memory failure: 0x1fece6b: recovery action for dirty LRU page: Recovered |
166 | Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
167 | Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: TSC 0x0 |
168 | Jan 13 16:06:55 restbase2037 kernel: [713336.636244] EDAC skx MC4: ADDR 0x1fece6be80 |
169 | Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: MISC 0x0 |
170 | Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784415 SOCKET 0 APIC 0x0 |
171 | Jan 13 16:06:55 restbase2037 kernel: [713336.636262] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fece6b offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fece6be80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b6735e80 ChannelId:0x0 RankAddress:0x3db39be80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dd9e Column:0x3d8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
172 | Jan 13 16:07:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
173 | Jan 13 16:07:05 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
174 | Jan 13 16:07:05 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
175 | Jan 13 16:07:15 restbase2037 kernel: [713356.161065] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
176 | Jan 13 16:07:15 restbase2037 kernel: [713356.169420] {6}[Hardware Error]: event severity: recoverable |
177 | Jan 13 16:07:15 restbase2037 kernel: [713356.175176] {6}[Hardware Error]: Error 0, type: recoverable |
178 | Jan 13 16:07:15 restbase2037 kernel: [713356.180928] {6}[Hardware Error]: fru_text: B1 |
179 | Jan 13 16:07:15 restbase2037 kernel: [713356.185460] {6}[Hardware Error]: section_type: memory error |
180 | Jan 13 16:07:15 restbase2037 kernel: [713356.191292] {6}[Hardware Error]: error_status: 0x0000000000000400 |
181 | Jan 13 16:07:15 restbase2037 kernel: [713356.197645] {6}[Hardware Error]: physical_address: 0x0000001fe6c63900 |
182 | Jan 13 16:07:15 restbase2037 kernel: [713356.204343] {6}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
183 | Jan 13 16:07:15 restbase2037 kernel: [713356.211477] {6}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 800 |
184 | Jan 13 16:07:15 restbase2037 kernel: [713356.221729] {6}[Hardware Error]: error_type: 3, multi-bit ECC |
185 | Jan 13 16:07:15 restbase2037 kernel: [713356.227735] {6}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
186 | Jan 13 16:07:15 restbase2037 kernel: [713356.238489] mce_notify_irq: 6 callbacks suppressed |
187 | Jan 13 16:07:15 restbase2037 kernel: [713356.238491] mce: [Hardware Error]: Machine check events logged |
188 | Jan 13 16:07:15 restbase2037 kernel: [713356.240034] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
189 | Jan 13 16:07:15 restbase2037 kernel: [713356.240038] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
190 | Jan 13 16:07:15 restbase2037 kernel: [713356.240039] EDAC skx MC4: TSC 0x0 |
191 | Jan 13 16:07:15 restbase2037 kernel: [713356.240051] EDAC skx MC4: ADDR 0x1fe6c63900 |
192 | Jan 13 16:07:15 restbase2037 kernel: [713356.240052] EDAC skx MC4: MISC 0x0 |
193 | Jan 13 16:07:15 restbase2037 kernel: [713356.240053] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784435 SOCKET 0 APIC 0x0 |
194 | Jan 13 16:07:15 restbase2037 kernel: [713356.240077] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c63 offset:0x900 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c63900 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3631900 ChannelId:0x0 RankAddress:0x3d9b19900 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x320 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
195 | Jan 13 16:07:15 restbase2037 kernel: [713356.242347] Memory failure: 0x1fe6c63: corrupted page was clean: dropped without side effects |
196 | Jan 13 16:07:15 restbase2037 kernel: [713356.242374] Memory failure: 0x1fe6c63: recovery action for clean LRU page: Recovered |
197 | Jan 13 16:07:20 restbase2037 kernel: [713361.163823] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
198 | Jan 13 16:07:20 restbase2037 kernel: [713361.172181] {7}[Hardware Error]: event severity: recoverable |
199 | Jan 13 16:07:20 restbase2037 kernel: [713361.177934] {7}[Hardware Error]: Error 0, type: recoverable |
200 | Jan 13 16:07:20 restbase2037 kernel: [713361.183680] {7}[Hardware Error]: fru_text: B1 |
201 | Jan 13 16:07:20 restbase2037 kernel: [713361.188213] {7}[Hardware Error]: section_type: memory error |
202 | Jan 13 16:07:20 restbase2037 kernel: [713361.194045] {7}[Hardware Error]: error_status: 0x0000000000000400 |
203 | Jan 13 16:07:20 restbase2037 kernel: [713361.200395] {7}[Hardware Error]: physical_address: 0x0000001fe6c69080 |
204 | Jan 13 16:07:20 restbase2037 kernel: [713361.207098] {7}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
205 | Jan 13 16:07:20 restbase2037 kernel: [713361.214231] {7}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 280 |
206 | Jan 13 16:07:20 restbase2037 kernel: [713361.224481] {7}[Hardware Error]: error_type: 3, multi-bit ECC |
207 | Jan 13 16:07:20 restbase2037 kernel: [713361.230487] {7}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
208 | Jan 13 16:07:20 restbase2037 kernel: [713361.239756] mce: [Hardware Error]: Machine check events logged |
209 | Jan 13 16:07:20 restbase2037 kernel: [713361.241172] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
210 | Jan 13 16:07:20 restbase2037 kernel: [713361.241177] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
211 | Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: TSC 0x0 |
212 | Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: ADDR 0x1fe6c69080 |
213 | Jan 13 16:07:20 restbase2037 kernel: [713361.241183] EDAC skx MC4: MISC 0x0 |
214 | Jan 13 16:07:20 restbase2037 kernel: [713361.241185] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784440 SOCKET 0 APIC 0x0 |
215 | Jan 13 16:07:20 restbase2037 kernel: [713361.241194] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c69 offset:0x80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c69080 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3634880 ChannelId:0x0 RankAddress:0x3d9b1a880 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x118 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
216 | Jan 13 16:07:20 restbase2037 kernel: [713361.242037] Memory failure: 0x1fe6c69: corrupted page was clean: dropped without side effects |
217 | Jan 13 16:07:20 restbase2037 kernel: [713361.242056] Memory failure: 0x1fe6c69: recovery action for clean LRU page: Recovered |
218 | Jan 13 16:07:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter... |
219 | Jan 13 16:07:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded. |
220 | Jan 13 16:07:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter. |
221 | [ Unpaste-able binary data removed... ] |
222 | Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'nf_conntrack' |
223 | Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'ipmi_devintf' |
224 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted POSIX Message Queue File System. |
225 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Debug File System. |
226 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Trace File System. |
227 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Create list of static device nodes for the current kernel. |
228 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@configfs.service: Succeeded. |
229 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module configfs. |
230 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@drm.service: Succeeded. |
231 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module drm. |
232 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@fuse.service: Succeeded. |
233 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module fuse. |
234 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Modules. |
235 | [ ... ] |
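
To check whether every uncorrectable error in an excerpt like the one above points at the same DIMM, the `EDAC MC4: 1 UE ...` summary lines can be tallied with a small script. This is only a rough sketch: the function names (`parse_ue_events`, `summarize`, `mce_time_utc`) are made up for illustration, the regular expressions assume the exact line layout seen in this paste, and the page/offset cross-check assumes 4 KiB pages. The sample input is three lines copied from the log above.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Three lines copied verbatim from the syslog excerpt (paste numbering removed).
SAMPLE = """\
Jan 13 16:06:05 restbase2037 kernel: [713286.579047] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784365 SOCKET 0 APIC 0x0
Jan 13 16:06:05 restbase2037 kernel: [713286.579058] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe0a6a offset:0x8c0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe0a6a8c0 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b05350c0 ChannelId:0x0 RankAddress:0x3d829b0c0 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc16 Column:0x218 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
Jan 13 16:06:21 restbase2037 kernel: [713302.574390] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff97e3 offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1ff97e3e80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7bcbf1e80 ChannelId:0x0 RankAddress:0x3de5f9e80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1df2e Column:0x3d0 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0)
"""

# Assumed layout: "EDAC MC<n>: <count> UE <text> on <dimm label> (<key:value ...>)"
UE_RE = re.compile(r"EDAC MC\d+: \d+ UE .*? on (?P<label>\S+) \((?P<fields>.*)\)")
# Only hex-valued "key:0x..." pairs are collected; decimal ones are skipped.
FIELD_RE = re.compile(r"(\w+):(0x[0-9a-fA-F]+)")
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def parse_ue_events(lines):
    """Yield (dimm_label, fields) for each uncorrectable-error summary line."""
    for line in lines:
        m = UE_RE.search(line)
        if m:
            fields = {k: int(v, 16) for k, v in FIELD_RE.findall(m["fields"])}
            yield m["label"], fields


def summarize(lines):
    """Count UE events per DIMM label and collect the affected page numbers."""
    per_dimm = Counter()
    pages = set()
    for label, f in parse_ue_events(lines):
        addr = f["SystemAddress"]
        # Cross-check (assumes 4 KiB pages): page/offset are the split address.
        if f["page"] != addr >> 12 or f["offset"] != addr & 0xFFF:
            raise ValueError(f"inconsistent page/offset for {addr:#x}")
        per_dimm[label] += 1
        pages.add(f["page"])
    return per_dimm, pages


def mce_time_utc(line):
    """Convert the epoch in a 'TIME <seconds>' field to an aware UTC datetime."""
    m = TIME_RE.search(line)
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc) if m else None


if __name__ == "__main__":
    lines = SAMPLE.splitlines()
    per_dimm, pages = summarize(lines)
    for label, count in per_dimm.items():
        print(f"{label}: {count} UE event(s)")
    print("pages:", ", ".join(hex(p) for p in sorted(pages)))
    print("first MCE time:", mce_time_utc(lines[0]))
```

Feeding the full syslog through `summarize` instead of `SAMPLE` would show whether any UE event is attributed to a DIMM other than `CPU_SrcID#1_MC#0_Chan#0_DIMM#0`, which is the main thing to confirm before asking for a DIMM swap.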