restbase2037 has crashed and rebooted three times recently:
| Sun Jan 05 09:55:30 UTC 2025 |
| Mon Jan 13 16:09:45 UTC 2025 |
| Wed Jan 15 10:11:47 UTC 2025 |
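The cause appears to be uncorrectable (multi-bit ECC) memory errors. In the syslog excerpt below, from the minutes before the Jan 13 reboot, every error is reported against the same DIMM (`fru_text: B1`, `CPU_SrcID#1_MC#0_Chan#0_DIMM#0`):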
| 1 | [ ... ] |
|---|---|
| 2 | Jan 13 16:00:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 3 | Jan 13 16:00:09 restbase2037 systemd[1]: clean-confd-rundir.service: Succeeded. |
| 4 | Jan 13 16:00:09 restbase2037 systemd[1]: Finished Clean old stale files in /var/run/confd-template. |
| 5 | Jan 13 16:00:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 6 | Jan 13 16:00:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 7 | Jan 13 16:01:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 8 | Jan 13 16:01:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 9 | Jan 13 16:01:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 10 | Jan 13 16:02:09 restbase2037 systemd[1]: Starting Daily apt download activities... |
| 11 | Jan 13 16:02:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 12 | Jan 13 16:02:09 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 13 | Jan 13 16:02:09 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 14 | Jan 13 16:02:10 restbase2037 systemd[1]: apt-daily.service: Succeeded. |
| 15 | Jan 13 16:02:10 restbase2037 systemd[1]: Finished Daily apt download activities. |
| 16 | Jan 13 16:02:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter... |
| 17 | Jan 13 16:02:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded. |
| 18 | Jan 13 16:02:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter. |
| 19 | Jan 13 16:03:09 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 20 | Jan 13 16:03:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 21 | Jan 13 16:03:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 22 | Jan 13 16:04:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 23 | Jan 13 16:04:10 restbase2037 systemd[1]: Starting Update Debian version stat exported by node_exporter... |
| 24 | Jan 13 16:04:10 restbase2037 systemd[1]: prometheus-debian-version-textfile.service: Succeeded. |
| 25 | Jan 13 16:04:10 restbase2037 systemd[1]: Finished Update Debian version stat exported by node_exporter. |
| 26 | Jan 13 16:04:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 27 | Jan 13 16:04:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 28 | Jan 13 16:05:01 restbase2037 CRON[1811836]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) |
| 29 | Jan 13 16:05:10 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 30 | Jan 13 16:05:10 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 31 | Jan 13 16:05:10 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 32 | Jan 13 16:06:05 restbase2037 kernel: [713286.500985] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 33 | Jan 13 16:06:05 restbase2037 kernel: [713286.509340] {1}[Hardware Error]: event severity: recoverable |
| 34 | Jan 13 16:06:05 restbase2037 kernel: [713286.515093] {1}[Hardware Error]: Error 0, type: recoverable |
| 35 | Jan 13 16:06:05 restbase2037 kernel: [713286.520848] {1}[Hardware Error]: fru_text: B1 |
| 36 | Jan 13 16:06:05 restbase2037 kernel: [713286.525377] {1}[Hardware Error]: section_type: memory error |
| 37 | Jan 13 16:06:05 restbase2037 kernel: [713286.531213] {1}[Hardware Error]: error_status: 0x0000000000000400 |
| 38 | Jan 13 16:06:05 restbase2037 kernel: [713286.537565] {1}[Hardware Error]: physical_address: 0x0000001fe0a6a8c0 |
| 39 | Jan 13 16:06:05 restbase2037 kernel: [713286.544264] {1}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 40 | Jan 13 16:06:05 restbase2037 kernel: [713286.551398] {1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121878 column: 536 |
| 41 | Jan 13 16:06:05 restbase2037 kernel: [713286.561650] {1}[Hardware Error]: error_type: 3, multi-bit ECC |
| 42 | Jan 13 16:06:05 restbase2037 kernel: [713286.567656] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 43 | Jan 13 16:06:05 restbase2037 kernel: [713286.577561] mce: [Hardware Error]: Machine check events logged |
| 44 | Jan 13 16:06:05 restbase2037 kernel: [713286.579041] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 45 | Jan 13 16:06:05 restbase2037 kernel: [713286.579042] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 46 | Jan 13 16:06:05 restbase2037 kernel: [713286.579043] EDAC skx MC4: TSC 0x0 |
| 47 | Jan 13 16:06:05 restbase2037 kernel: [713286.579044] EDAC skx MC4: ADDR 0x1fe0a6a8c0 |
| 48 | Jan 13 16:06:05 restbase2037 kernel: [713286.579046] EDAC skx MC4: MISC 0x0 |
| 49 | Jan 13 16:06:05 restbase2037 kernel: [713286.579047] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784365 SOCKET 0 APIC 0x0 |
| 50 | Jan 13 16:06:05 restbase2037 kernel: [713286.579058] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe0a6a offset:0x8c0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe0a6a8c0 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b05350c0 ChannelId:0x0 RankAddress:0x3d829b0c0 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc16 Column:0x218 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 51 | Jan 13 16:06:05 restbase2037 kernel: [713286.579928] Memory failure: 0x1fe0a6a: recovery action for dirty LRU page: Recovered |
| 52 | Jan 13 16:06:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 53 | Jan 13 16:06:06 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 54 | Jan 13 16:06:06 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 55 | Jan 13 16:06:21 restbase2037 kernel: [713302.548292] Disabling lock debugging due to kernel taint |
| 56 | Jan 13 16:06:21 restbase2037 kernel: [713302.548450] mce: Uncorrected hardware memory error in user-access at 1ff97e3e80 |
| 57 | Jan 13 16:06:21 restbase2037 kernel: [713302.548475] mce: [Hardware Error]: Machine check events logged |
| 58 | Jan 13 16:06:21 restbase2037 kernel: [713302.548874] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 59 | Jan 13 16:06:21 restbase2037 kernel: [713302.556250] mce: [Hardware Error]: CPU 31: Machine Check Exception: 7 Bank 1: bd80000000100134 |
| 60 | Jan 13 16:06:21 restbase2037 kernel: [713302.564230] {2}[Hardware Error]: event severity: recoverable |
| 61 | Jan 13 16:06:21 restbase2037 kernel: [713302.564232] {2}[Hardware Error]: Error 0, type: recoverable |
| 62 | Jan 13 16:06:21 restbase2037 kernel: [713302.564234] {2}[Hardware Error]: fru_text: B1 |
| 63 | Jan 13 16:06:21 restbase2037 kernel: [713302.564236] {2}[Hardware Error]: section_type: memory error |
| 64 | Jan 13 16:06:21 restbase2037 kernel: [713302.564237] {2}[Hardware Error]: error_status: 0x0000000000000400 |
| 65 | Jan 13 16:06:21 restbase2037 kernel: [713302.564238] {2}[Hardware Error]: physical_address: 0x0000001ff97e3e80 |
| 66 | Jan 13 16:06:21 restbase2037 kernel: [713302.564239] {2}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 67 | Jan 13 16:06:21 restbase2037 kernel: [713302.564241] {2}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122670 column: 976 |
| 68 | Jan 13 16:06:21 restbase2037 kernel: [713302.564242] {2}[Hardware Error]: error_type: 3, multi-bit ECC |
| 69 | Jan 13 16:06:21 restbase2037 kernel: [713302.564244] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 70 | Jan 13 16:06:21 restbase2037 kernel: [713302.565467] Memory failure: 0x1ff97e3: already hardware poisoned |
| 71 | Jan 13 16:06:21 restbase2037 kernel: [713302.572999] mce: [Hardware Error]: RIP 33:<00007f083163504d> |
| 72 | Jan 13 16:06:21 restbase2037 kernel: [713302.573002] mce: [Hardware Error]: TSC 290d1bb8f78bc7 ADDR 1ff97e3e80 MISC 86 PPIN 63b889bd12cd0082 |
| 73 | Jan 13 16:06:21 restbase2037 kernel: [713302.573005] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784381 SOCKET 1 APIC 5e microcode d0003e7 |
| 74 | Jan 13 16:06:21 restbase2037 kernel: [713302.574369] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 75 | Jan 13 16:06:21 restbase2037 kernel: [713302.574371] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 76 | Jan 13 16:06:21 restbase2037 kernel: [713302.574372] EDAC skx MC4: TSC 0x0 |
| 77 | Jan 13 16:06:21 restbase2037 kernel: [713302.574373] EDAC skx MC4: ADDR 0x1ff97e3e80 |
| 78 | Jan 13 16:06:21 restbase2037 kernel: [713302.574374] EDAC skx MC4: MISC 0x0 |
| 79 | Jan 13 16:06:21 restbase2037 kernel: [713302.574375] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784381 SOCKET 0 APIC 0x0 |
| 80 | Jan 13 16:06:21 restbase2037 kernel: [713302.574390] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff97e3 offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1ff97e3e80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7bcbf1e80 ChannelId:0x0 RankAddress:0x3de5f9e80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1df2e Column:0x3d0 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 81 | Jan 13 16:06:21 restbase2037 kernel: [713302.574474] Memory failure: 0x1ff97e3: Sending SIGBUS to node:1283309 due to hardware memory corruption |
| 82 | Jan 13 16:06:22 restbase2037 kernel: [713302.680570] Memory failure: 0x1ff97e3: recovery action for dirty LRU page: Recovered |
| 83 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h |
| 84 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h |
| 85 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h |
| 86 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:55, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h |
| 87 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h |
| 88 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 12h |
| 89 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:05:56, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h |
| 90 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 41h |
| 91 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 34h ; Register Value = 92h |
| 92 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B90h ; Register Value = 00h |
| 93 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:11, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h |
| 94 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = 434h ; Register Value = 92h |
| 95 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = B02h ; Register Value = 0Eh |
| 96 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, Sensor #73, N/A, OEM Diagnostic Data Event ; Register Offset = ACh ; Register Value = 00h |
| 97 | Jan 13 16:06:24 restbase2037 ipmiseld[1295]: SEL System Event: Jan-13-2025, 16:06:12, System Firmware DIMM Critical, N/A, OEM Event Offset = 0Eh ; OEM Event Data2 code = 42h |
| 98 | Jan 13 16:06:26 restbase2037 kernel: [713307.464119] MCE: Killing node:1912 due to hardware memory corruption fault at 55d4f706a708 |
| 99 | Jan 13 16:06:32 restbase2037 kernel: [713313.124667] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 100 | Jan 13 16:06:32 restbase2037 kernel: [713313.133021] {3}[Hardware Error]: event severity: recoverable |
| 101 | Jan 13 16:06:32 restbase2037 kernel: [713313.138777] {3}[Hardware Error]: Error 0, type: recoverable |
| 102 | Jan 13 16:06:32 restbase2037 kernel: [713313.144530] {3}[Hardware Error]: fru_text: B1 |
| 103 | Jan 13 16:06:32 restbase2037 kernel: [713313.149071] {3}[Hardware Error]: section_type: memory error |
| 104 | Jan 13 16:06:32 restbase2037 kernel: [713313.154911] {3}[Hardware Error]: error_status: 0x0000000000000400 |
| 105 | Jan 13 16:06:32 restbase2037 kernel: [713313.161263] {3}[Hardware Error]: physical_address: 0x0000001fe31e8d40 |
| 106 | Jan 13 16:06:32 restbase2037 kernel: [713313.167963] {3}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 107 | Jan 13 16:06:32 restbase2037 kernel: [713313.175097] {3}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 121958 column: 168 |
| 108 | Jan 13 16:06:32 restbase2037 kernel: [713313.185346] {3}[Hardware Error]: error_type: 3, multi-bit ECC |
| 109 | Jan 13 16:06:32 restbase2037 kernel: [713313.191354] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 110 | Jan 13 16:06:32 restbase2037 kernel: [713313.203549] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 111 | Jan 13 16:06:32 restbase2037 kernel: [713313.203553] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 112 | Jan 13 16:06:32 restbase2037 kernel: [713313.203555] EDAC skx MC4: TSC 0x0 |
| 113 | Jan 13 16:06:32 restbase2037 kernel: [713313.203556] EDAC skx MC4: ADDR 0x1fe31e8d40 |
| 114 | Jan 13 16:06:32 restbase2037 kernel: [713313.203558] EDAC skx MC4: MISC 0x0 |
| 115 | Jan 13 16:06:32 restbase2037 kernel: [713313.203560] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784392 SOCKET 0 APIC 0x0 |
| 116 | Jan 13 16:06:32 restbase2037 kernel: [713313.203575] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe31e8 offset:0xd40 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe31e8d40 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b18f4540 ChannelId:0x0 RankAddress:0x3d8c7a540 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dc66 Column:0xa8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 117 | Jan 13 16:06:32 restbase2037 kernel: [713313.208172] Memory failure: 0x1fe31e8: corrupted page was clean: dropped without side effects |
| 118 | Jan 13 16:06:32 restbase2037 kernel: [713313.208197] Memory failure: 0x1fe31e8: recovery action for clean LRU page: Recovered |
| 119 | Jan 13 16:06:52 restbase2037 kernel: [713333.164087] mce: Uncorrected hardware memory error in user-access at 1fee4e3000 |
| 120 | Jan 13 16:06:52 restbase2037 kernel: [713333.164116] mce: [Hardware Error]: CPU 3: Machine Check Exception: 7 Bank 1: bd80000000100134 |
| 121 | Jan 13 16:06:52 restbase2037 kernel: [713333.165003] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 122 | Jan 13 16:06:52 restbase2037 kernel: [713333.167304] Memory failure: 0x1fee4e3: Sending SIGBUS to node:214326 due to hardware memory corruption |
| 123 | Jan 13 16:06:52 restbase2037 kernel: [713333.167315] Memory failure: 0x1fee4e3: recovery action for dirty LRU page: Recovered |
| 124 | Jan 13 16:06:52 restbase2037 kernel: [713333.171546] mce: [Hardware Error]: RIP 33:<00007f77f8da2249> |
| 125 | Jan 13 16:06:52 restbase2037 kernel: [713333.180128] {4}[Hardware Error]: event severity: recoverable |
| 126 | Jan 13 16:06:52 restbase2037 kernel: [713333.180132] {4}[Hardware Error]: Error 0, type: recoverable |
| 127 | Jan 13 16:06:52 restbase2037 kernel: [713333.180133] {4}[Hardware Error]: fru_text: B1 |
| 128 | Jan 13 16:06:52 restbase2037 kernel: [713333.180134] {4}[Hardware Error]: section_type: memory error |
| 129 | Jan 13 16:06:52 restbase2037 kernel: [713333.180137] {4}[Hardware Error]: error_status: 0x0000000000000400 |
| 130 | Jan 13 16:06:52 restbase2037 kernel: [713333.180138] {4}[Hardware Error]: physical_address: 0x0000001fee4e3000 |
| 131 | Jan 13 16:06:52 restbase2037 kernel: [713333.180139] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 132 | Jan 13 16:06:52 restbase2037 kernel: [713333.180144] {4}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122318 column: 768 |
| 133 | Jan 13 16:06:52 restbase2037 kernel: [713333.180146] {4}[Hardware Error]: error_type: 3, multi-bit ECC |
| 134 | Jan 13 16:06:52 restbase2037 kernel: [713333.180152] {4}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 135 | Jan 13 16:06:52 restbase2037 kernel: [713333.188500] |
| 136 | Jan 13 16:06:52 restbase2037 kernel: [713333.199514] Memory failure: 0x1fee4e3: already hardware poisoned |
| 137 | Jan 13 16:06:52 restbase2037 kernel: [713333.205745] mce: [Hardware Error]: TSC 290d2cca4df4bb ADDR 1fee4e3000 MISC 86 PPIN 63b889bd12cd0082 |
| 138 | Jan 13 16:06:52 restbase2037 kernel: [713333.205751] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784412 SOCKET 1 APIC 50 microcode d0003e7 |
| 139 | Jan 13 16:06:52 restbase2037 kernel: [713333.207429] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 140 | Jan 13 16:06:52 restbase2037 kernel: [713333.207431] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 141 | Jan 13 16:06:52 restbase2037 kernel: [713333.207432] EDAC skx MC4: TSC 0x0 |
| 142 | Jan 13 16:06:52 restbase2037 kernel: [713333.207433] EDAC skx MC4: ADDR 0x1fee4e3000 |
| 143 | Jan 13 16:06:52 restbase2037 kernel: [713333.207434] EDAC skx MC4: MISC 0x0 |
| 144 | Jan 13 16:06:52 restbase2037 kernel: [713333.207435] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784412 SOCKET 0 APIC 0x0 |
| 145 | Jan 13 16:06:52 restbase2037 kernel: [713333.207450] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fee4e3 offset:0x0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fee4e3000 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b7271800 ChannelId:0x0 RankAddress:0x3db939800 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1ddce Column:0x300 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 146 | Jan 13 16:06:55 restbase2037 kernel: [713336.493109] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 147 | Jan 13 16:06:55 restbase2037 kernel: [713336.503599] mce: Uncorrected hardware memory error in user-access at 1fece6be80 |
| 148 | Jan 13 16:06:55 restbase2037 kernel: [713336.504989] {5}[Hardware Error]: event severity: recoverable |
| 149 | Jan 13 16:06:55 restbase2037 kernel: [713336.504996] {5}[Hardware Error]: Error 0, type: recoverable |
| 150 | Jan 13 16:06:55 restbase2037 kernel: [713336.512417] mce: [Hardware Error]: CPU 1: Machine Check Exception: 7 Bank 1: bd80000000100134 |
| 151 | Jan 13 16:06:55 restbase2037 kernel: [713336.518144] {5}[Hardware Error]: fru_text: B1 |
| 152 | Jan 13 16:06:55 restbase2037 kernel: [713336.518147] {5}[Hardware Error]: section_type: memory error |
| 153 | Jan 13 16:06:55 restbase2037 kernel: [713336.518149] {5}[Hardware Error]: error_status: 0x0000000000000400 |
| 154 | Jan 13 16:06:55 restbase2037 kernel: [713336.518150] {5}[Hardware Error]: physical_address: 0x0000001fece6be80 |
| 155 | Jan 13 16:06:55 restbase2037 kernel: [713336.518151] {5}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 156 | Jan 13 16:06:55 restbase2037 kernel: [713336.518155] {5}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122270 column: 984 |
| 157 | Jan 13 16:06:55 restbase2037 kernel: [713336.518157] {5}[Hardware Error]: error_type: 3, multi-bit ECC |
| 158 | Jan 13 16:06:55 restbase2037 kernel: [713336.518164] {5}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 159 | Jan 13 16:06:55 restbase2037 kernel: [713336.519765] Memory failure: 0x1fece6b: already hardware poisoned |
| 160 | Jan 13 16:06:55 restbase2037 kernel: [713336.523923] mce: [Hardware Error]: RIP 33:<00007f2d0305028e> |
| 161 | Jan 13 16:06:55 restbase2037 kernel: [713336.523925] mce: [Hardware Error]: TSC 290d2ea6e84967 ADDR 1fece6be80 MISC 86 PPIN 63b889bd12cd0082 |
| 162 | Jan 13 16:06:55 restbase2037 kernel: [713336.523929] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1736784415 SOCKET 1 APIC 40 microcode d0003e7 |
| 163 | Jan 13 16:06:55 restbase2037 kernel: [713336.525355] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 164 | Jan 13 16:06:55 restbase2037 kernel: [713336.525452] Memory failure: 0x1fece6b: Sending SIGBUS to node:1811945 due to hardware memory corruption |
| 165 | Jan 13 16:06:55 restbase2037 kernel: [713336.525463] Memory failure: 0x1fece6b: recovery action for dirty LRU page: Recovered |
| 166 | Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 167 | Jan 13 16:06:55 restbase2037 kernel: [713336.636243] EDAC skx MC4: TSC 0x0 |
| 168 | Jan 13 16:06:55 restbase2037 kernel: [713336.636244] EDAC skx MC4: ADDR 0x1fece6be80 |
| 169 | Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: MISC 0x0 |
| 170 | Jan 13 16:06:55 restbase2037 kernel: [713336.636245] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784415 SOCKET 0 APIC 0x0 |
| 171 | Jan 13 16:06:55 restbase2037 kernel: [713336.636262] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fece6b offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fece6be80 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b6735e80 ChannelId:0x0 RankAddress:0x3db39be80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dd9e Column:0x3d8 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 172 | Jan 13 16:07:05 restbase2037 systemd[1]: Starting Export confd Prometheus metrics... |
| 173 | Jan 13 16:07:05 restbase2037 systemd[1]: confd_prometheus_metrics.service: Succeeded. |
| 174 | Jan 13 16:07:05 restbase2037 systemd[1]: Finished Export confd Prometheus metrics. |
| 175 | Jan 13 16:07:15 restbase2037 kernel: [713356.161065] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 176 | Jan 13 16:07:15 restbase2037 kernel: [713356.169420] {6}[Hardware Error]: event severity: recoverable |
| 177 | Jan 13 16:07:15 restbase2037 kernel: [713356.175176] {6}[Hardware Error]: Error 0, type: recoverable |
| 178 | Jan 13 16:07:15 restbase2037 kernel: [713356.180928] {6}[Hardware Error]: fru_text: B1 |
| 179 | Jan 13 16:07:15 restbase2037 kernel: [713356.185460] {6}[Hardware Error]: section_type: memory error |
| 180 | Jan 13 16:07:15 restbase2037 kernel: [713356.191292] {6}[Hardware Error]: error_status: 0x0000000000000400 |
| 181 | Jan 13 16:07:15 restbase2037 kernel: [713356.197645] {6}[Hardware Error]: physical_address: 0x0000001fe6c63900 |
| 182 | Jan 13 16:07:15 restbase2037 kernel: [713356.204343] {6}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 183 | Jan 13 16:07:15 restbase2037 kernel: [713356.211477] {6}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 800 |
| 184 | Jan 13 16:07:15 restbase2037 kernel: [713356.221729] {6}[Hardware Error]: error_type: 3, multi-bit ECC |
| 185 | Jan 13 16:07:15 restbase2037 kernel: [713356.227735] {6}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 186 | Jan 13 16:07:15 restbase2037 kernel: [713356.238489] mce_notify_irq: 6 callbacks suppressed |
| 187 | Jan 13 16:07:15 restbase2037 kernel: [713356.238491] mce: [Hardware Error]: Machine check events logged |
| 188 | Jan 13 16:07:15 restbase2037 kernel: [713356.240034] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 189 | Jan 13 16:07:15 restbase2037 kernel: [713356.240038] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 190 | Jan 13 16:07:15 restbase2037 kernel: [713356.240039] EDAC skx MC4: TSC 0x0 |
| 191 | Jan 13 16:07:15 restbase2037 kernel: [713356.240051] EDAC skx MC4: ADDR 0x1fe6c63900 |
| 192 | Jan 13 16:07:15 restbase2037 kernel: [713356.240052] EDAC skx MC4: MISC 0x0 |
| 193 | Jan 13 16:07:15 restbase2037 kernel: [713356.240053] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784435 SOCKET 0 APIC 0x0 |
| 194 | Jan 13 16:07:15 restbase2037 kernel: [713356.240077] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c63 offset:0x900 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c63900 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3631900 ChannelId:0x0 RankAddress:0x3d9b19900 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x320 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 195 | Jan 13 16:07:15 restbase2037 kernel: [713356.242347] Memory failure: 0x1fe6c63: corrupted page was clean: dropped without side effects |
| 196 | Jan 13 16:07:15 restbase2037 kernel: [713356.242374] Memory failure: 0x1fe6c63: recovery action for clean LRU page: Recovered |
| 197 | Jan 13 16:07:20 restbase2037 kernel: [713361.163823] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 |
| 198 | Jan 13 16:07:20 restbase2037 kernel: [713361.172181] {7}[Hardware Error]: event severity: recoverable |
| 199 | Jan 13 16:07:20 restbase2037 kernel: [713361.177934] {7}[Hardware Error]: Error 0, type: recoverable |
| 200 | Jan 13 16:07:20 restbase2037 kernel: [713361.183680] {7}[Hardware Error]: fru_text: B1 |
| 201 | Jan 13 16:07:20 restbase2037 kernel: [713361.188213] {7}[Hardware Error]: section_type: memory error |
| 202 | Jan 13 16:07:20 restbase2037 kernel: [713361.194045] {7}[Hardware Error]: error_status: 0x0000000000000400 |
| 203 | Jan 13 16:07:20 restbase2037 kernel: [713361.200395] {7}[Hardware Error]: physical_address: 0x0000001fe6c69080 |
| 204 | Jan 13 16:07:20 restbase2037 kernel: [713361.207098] {7}[Hardware Error]: physical_address_mask: 0xffffffffffffffc0 |
| 205 | Jan 13 16:07:20 restbase2037 kernel: [713361.214231] {7}[Hardware Error]: node: 1 card: 0 module: 0 rank: 0 bank: 14 device: 0 row: 122078 column: 280 |
| 206 | Jan 13 16:07:20 restbase2037 kernel: [713361.224481] {7}[Hardware Error]: error_type: 3, multi-bit ECC |
| 207 | Jan 13 16:07:20 restbase2037 kernel: [713361.230487] {7}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 |
| 208 | Jan 13 16:07:20 restbase2037 kernel: [713361.239756] mce: [Hardware Error]: Machine check events logged |
| 209 | Jan 13 16:07:20 restbase2037 kernel: [713361.241172] EDAC skx MC4: HANDLING MCE MEMORY ERROR |
| 210 | Jan 13 16:07:20 restbase2037 kernel: [713361.241177] EDAC skx MC4: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f |
| 211 | Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: TSC 0x0 |
| 212 | Jan 13 16:07:20 restbase2037 kernel: [713361.241182] EDAC skx MC4: ADDR 0x1fe6c69080 |
| 213 | Jan 13 16:07:20 restbase2037 kernel: [713361.241183] EDAC skx MC4: MISC 0x0 |
| 214 | Jan 13 16:07:20 restbase2037 kernel: [713361.241185] EDAC skx MC4: PROCESSOR 0:0x606a6 TIME 1736784440 SOCKET 0 APIC 0x0 |
| 215 | Jan 13 16:07:20 restbase2037 kernel: [713361.241194] EDAC MC4: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe6c69 offset:0x80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe6c69080 ProcessorSocketId:0x1 MemoryControllerId:0x0 ChannelAddress:0x7b3634880 ChannelId:0x0 RankAddress:0x3d9b1a880 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x1dcde Column:0x118 Bank:0x2 BankGroup:0x3 ChipSelect:0x0 ChipId:0x0) |
| 216 | Jan 13 16:07:20 restbase2037 kernel: [713361.242037] Memory failure: 0x1fe6c69: corrupted page was clean: dropped without side effects |
| 217 | Jan 13 16:07:20 restbase2037 kernel: [713361.242056] Memory failure: 0x1fe6c69: recovery action for clean LRU page: Recovered |
| 218 | Jan 13 16:07:20 restbase2037 systemd[1]: Starting Update NIC firmware stats exported by node_exporter... |
| 219 | Jan 13 16:07:20 restbase2037 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded. |
| 220 | Jan 13 16:07:20 restbase2037 systemd[1]: Finished Update NIC firmware stats exported by node_exporter. |
| 221 | [ Unpaste-able binary data removed... ] |
| 222 | Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'nf_conntrack' |
| 223 | Jan 13 16:13:00 restbase2037 systemd-modules-load[886]: Inserted module 'ipmi_devintf' |
| 224 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted POSIX Message Queue File System. |
| 225 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Debug File System. |
| 226 | Jan 13 16:13:00 restbase2037 systemd[1]: Mounted Kernel Trace File System. |
| 227 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Create list of static device nodes for the current kernel. |
| 228 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@configfs.service: Succeeded. |
| 229 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module configfs. |
| 230 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@drm.service: Succeeded. |
| 231 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module drm. |
| 232 | Jan 13 16:13:00 restbase2037 systemd[1]: modprobe@fuse.service: Succeeded. |
| 233 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Module fuse. |
| 234 | Jan 13 16:13:00 restbase2037 systemd[1]: Finished Load Kernel Modules. |
| 235 | [ ... ] |
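
To check that all of the uncorrected errors point at the same DIMM, the `EDAC MC4: 1 UE ...` summary lines can be pulled out of the log and counted per DIMM label. The following is only a rough sketch: the function name `parse_ue_events` and the regular expression are my own, the two matching sample lines are shortened copies of lines from the excerpt above, and the page/offset check assumes 4 KiB pages.

```python
import re
from collections import Counter

# Matches the EDAC summary line for an uncorrected error (UE).
# The pattern is an assumption based on the lines in this paste only.
UE_RE = re.compile(
    r"EDAC MC\d+: \d+ UE .*? on (?P<dimm>\S+) "
    r"\(.*?page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
    r".*?SystemAddress:(?P<addr>0x[0-9a-f]+)"
)


def parse_ue_events(lines):
    """Return a (dimm_label, system_address) tuple for each EDAC UE line."""
    events = []
    for line in lines:
        m = UE_RE.search(line)
        if m is None:
            continue
        addr = int(m["addr"], 16)
        # Assuming 4 KiB pages: page = addr >> 12, offset = addr & 0xfff.
        assert addr >> 12 == int(m["page"], 16)
        assert addr & 0xFFF == int(m["offset"], 16)
        events.append((m["dimm"], addr))
    return events


# Shortened copies of lines from the excerpt (trailing fields dropped).
SAMPLE = [
    "kernel: [713286.579058] EDAC MC4: 1 UE memory read error on "
    "CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1fe0a6a "
    "offset:0x8c0 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1fe0a6a8c0",
    "kernel: [713286.579928] Memory failure: 0x1fe0a6a: recovery action "
    "for dirty LRU page: Recovered",
    "kernel: [713302.574390] EDAC MC4: 1 UE memory read error on "
    "CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff97e3 "
    "offset:0xe80 grain:32 - err_code:0x0000:0x009f SystemAddress:0x1ff97e3e80",
]

events = parse_ue_events(SAMPLE)
per_dimm = Counter(dimm for dimm, _ in events)
print(per_dimm)
```

Run against the full syslog instead of `SAMPLE`, a single key in `per_dimm` would support replacing just that one DIMM; more than one key would suggest looking at the memory controller or board instead.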

