Around 16:12 UTC on 2023-07-29, db1130 crashed.
Uncorrectable memory error (I think, from a paste into IRC).
It was the master, so it was brought back into service, but I think this needs looking at (and I think the plan is for an emergency switchover).
| Status | Assigned | Task |
|---|---|---|
| | | Unknown Object (Task) |
| Resolved | Marostegui | T343076 db1130 crash memory errors |
| Resolved | Marostegui | T343077 Switchover s5 master (db1130 -> db1183) |
The initial issue was triaged. I'll be home in 10 minutes and will replace the master.
cgoubert@db1130:~$ sudo ipmi-sel | grep Jul-29; date; uptime
123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err | Memory | Uncorrectable memory error
Sat 29 Jul 2023 04:27:01 PM UTC
16:27:01 up 14 min, 4 users, load average: 3.10, 1.92, 0.99
Mentioned in SAL (#wikimedia-operations) [2023-07-29T16:57:34Z] <marostegui> Starting emergency s5 eqiad failover from db1130 to db1183 - T343077 T343076
@Jclark-ctr any chance you've got an old DIMM somewhere to replace this one?
/admin1/system1/logs1/log1-> show record123
properties
CreationTimestamp = 20230707040843.000000-300
ElementName = System Event Log Entry
RecordData = Correctable memory error rate exceeded for DIMM_B2.
This host is scheduled for replacement this quarter, but if you have a spare DIMM from another decommissioned host, I'd like to get it replaced.
@Marostegui I do have a few decom hosts I can pull from. Is this server down? I would like to do it today.
Thank you John! I will put the host back in production after a few days, once I am sure it is stable.