db1130 crash memory errors
Closed, ResolvedPublic

Description

Around 16:12 UTC on 2023-07-29, db1130 crashed.

Uncorrectable memory error (I think, from a paste into IRC).

It was the master, so it was brought back into service, but I think this needs looking at (and I think the plan is to do an emergency switchover).

Event Timeline

Marostegui subscribed.

The initial issue was triaged. I'll be home in 10 minutes and will replace the master.

cgoubert@db1130:~$ sudo ipmi-sel | grep Jul-29; date; uptime
123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
Sat 29 Jul 2023 04:27:01 PM UTC
16:27:01 up 14 min,  4 users,  load average: 3.10, 1.92, 0.99
Marostegui renamed this task from db1130 crash to db1130 crash memory errors. Jul 29 2023, 4:30 PM

Mentioned in SAL (#wikimedia-operations) [2023-07-29T16:57:34Z] <marostegui> Starting emergency s5 eqiad failover from db1130 to db1183 - T343077 T343076

db1130 is scheduled for refresh with the HW that is arriving this quarter at T341269

@Jclark-ctr any chance you've got an old DIMM somewhere to replace this one?

/admin1/system1/logs1/log1-> show record123

	properties
		CreationTimestamp = 20230707040843.000000-300
		ElementName = System Event Log Entry
		RecordData = Correctable memory error rate exceeded for DIMM_B2.

This host is scheduled for replacement this quarter, but if you have a spare DIMM from another decommissioned host, I'd like to get the faulty one replaced.
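
For anyone cross-checking the two log excerpts quoted in this task, here is a minimal Python sketch that parses them. It assumes the `ipmi-sel` row is always six pipe-delimited fields, as in the paste above, and that `CreationTimestamp` follows the CIM datetime layout (`yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes). The function names `parse_sel_line` and `parse_cim_datetime` are hypothetical helpers, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone


def parse_sel_line(line):
    """Split one pipe-delimited ipmi-sel row into named fields.

    Assumes the six-column layout seen in this task:
    ID | Date | Time | Name | Type | Event
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"unexpected SEL row: {line!r}")
    record_id, date, time, name, sensor_type, event = fields
    # %b expects English month abbreviations (C/English locale).
    # The SEL row carries no timezone, so the result is left naive.
    when = datetime.strptime(f"{date} {time}", "%b-%d-%Y %H:%M:%S")
    return {
        "id": int(record_id),
        "when": when,
        "name": name,
        "type": sensor_type,
        "event": event,
    }


def parse_cim_datetime(stamp):
    """Parse a CIM-style timestamp such as 20230707040843.000000-300.

    Assumption: the trailing signed number is the UTC offset in minutes.
    """
    base, offset = stamp[:21], stamp[21:]
    local = datetime.strptime(base, "%Y%m%d%H%M%S.%f")
    tz = timezone(timedelta(minutes=int(offset)))
    return local.replace(tzinfo=tz)


if __name__ == "__main__":
    row = parse_sel_line(
        "123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err   | Memory"
        "                      | Uncorrectable memory error"
    )
    print(row["id"], row["when"], row["event"])

    created = parse_cim_datetime("20230707040843.000000-300")
    print(created.astimezone(timezone.utc))
```

Under that assumption, the `CreationTimestamp` quoted above decodes to 2023-07-07 04:08:43 at UTC-5, which is 09:08:43 UTC.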

Marostegui added a parent task: Unknown Object (Task). Jul 29 2023, 5:07 PM

@Marostegui I do have a few decom hosts I can pull from. Is this server down? I would like to do it today.

Let me depool it for you, give me 5 minutes

@Jclark-ctr host down, you can proceed whenever you want during the day.
Thank you

Thank you John! I will put the host back into production in a few days, once I have made sure it is stable.