A number of MCE errors have been logged, e.g. the one below. There are also a lot of temperature warnings in mcelog (with the CPUs throttled as a result), so I'm wondering whether the memory error is a result of overheating.
Sep 21 11:08:19 mw2181 kernel: [1563977.997564] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 21 11:08:19 mw2181 kernel: [1563977.997569] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004d000800c1
Sep 21 11:08:19 mw2181 kernel: [1563977.997570] EDAC sbridge MC0: TSC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997571] EDAC sbridge MC0: ADDR 4a38ec000
Sep 21 11:08:19 mw2181 kernel: [1563977.997572] EDAC sbridge MC0: MISC 908500010001a8c
Sep 21 11:08:19 mw2181 kernel: [1563977.997574] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1537528099 SOCKET 0 APIC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997592] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4a38ec offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
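For anyone wanting to cross-check the raw values against the EDAC summary line, here is a rough Python sketch that decodes the Bank 10 status word and the ADDR field from the log above. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM, that the corrected-error-count field (bits 52:38) is implemented on this CPU, and 4 KiB pages. The function name `decode_mci_status` and the `MEM_OPS` table are made up for illustration and are not part of mcelog or EDAC.

```python
# Sketch only: decodes an IA32_MCi_STATUS value using the architectural
# bit layout (assumed from the Intel SDM). Names here are hypothetical.

# MMM field of a memory-controller MCA error code (0000 0000 1MMM CCCC)
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    decoded = {
        "valid": bool(status >> 63 & 1),        # VAL
        "overflow": bool(status >> 62 & 1),     # OVER
        "uncorrected": bool(status >> 61 & 1),  # UC (0 means corrected)
        "misc_valid": bool(status >> 59 & 1),   # MISCV
        "addr_valid": bool(status >> 58 & 1),   # ADDRV
        # Assumption: bits 52:38 hold the corrected error count on this CPU.
        "corrected_count": status >> 38 & 0x7FFF,
        "model_code": status >> 16 & 0xFFFF,    # model-specific error code
        "mca_code": mca_code,
        "operation": None,
        "channel": None,
    }
    # Memory-controller errors match the pattern 0000 0000 1MMM CCCC.
    if mca_code & 0xFF80 == 0x0080:
        decoded["operation"] = MEM_OPS.get(mca_code >> 4 & 0x7, "reserved")
        decoded["channel"] = mca_code & 0xF
    return decoded


if __name__ == "__main__":
    status = 0x8C00004D000800C1  # "Bank 10: 8c00004d000800c1" from the log
    addr = 0x4A38EC000           # "ADDR 4a38ec000" from the log
    for field, value in decode_mci_status(status).items():
        print(f"{field}: {value}")
    # Assumption: 4 KiB pages, so the page frame is addr >> 12.
    print(f"page: {addr >> 12:#x} offset: {addr & 0xFFF:#x}")
```

With the values from the log this gives a valid, corrected (UC clear) memory scrubbing error on channel 1 with a corrected count of 1 and page 0x4a38ec, which agrees with the "1 CE memory scrubbing error" line. It doesn't say anything about the cause, so it neither confirms nor rules out the overheating theory.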