
Memory errors on clouddb1019
Closed, Resolved, Public

Description

When trying to reboot clouddb1019, the host got stuck on boot. I am not sure if it is related, but the following error was logged at the same time as the reboot (the times match):

-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/15/2021 08:56:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.
-------------------------------------------------------------------------------

I am trying to issue a hardreset from the idrac, but it doesn't seem to be doing anything. Can we get this host looked at on-site? Maybe move the DIMM to another slot and check if the host boots and/or if the error moves to a different place?

Event Timeline

Marostegui added a parent task: Restricted Task.

Setting to high as we are trying to finish up the new wiki replicas infra

There's something going on with this host:

racadm>>serveraction powerstatus
Server power status: OFF
racadm>>serveraction powerup
Server power operation initiated successfully
racadm>>serveraction powerstatus
Server power status: ON
racadm>>serveraction powerstatus
Server power status: ON
racadm>>serveraction powerstatus
Server power status: OFF
racadm>>

And without doing anything again:

racadm>>serveraction powerstatus
Server power status: ON

Interestingly, I cannot see anything on the console, so I have no idea what it is doing and if it is rebooting or doing something else.
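The ON/OFF flapping in the first racadm transcript above can also be spotted mechanically. The following is a minimal sketch, assuming the transcript has been saved as plain text; the helper names `power_states` and `count_transitions` are hypothetical and not part of racadm. It only parses the `Server power status:` lines shown above and does not talk to the iDRAC itself.

```python
import re

# Matches the status lines printed by "serveraction powerstatus" in the transcript above.
STATUS_RE = re.compile(r"Server power status:\s*(ON|OFF)")


def power_states(transcript):
    """Return the reported power states in the order they appear."""
    return STATUS_RE.findall(transcript)


def count_transitions(states):
    """Count how many times the reported state changed between consecutive checks."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Transcript copied from the comment above.
SAMPLE = """\
racadm>>serveraction powerstatus
Server power status: OFF
racadm>>serveraction powerup
Server power operation initiated successfully
racadm>>serveraction powerstatus
Server power status: ON
racadm>>serveraction powerstatus
Server power status: ON
racadm>>serveraction powerstatus
Server power status: OFF
racadm>>
"""

if __name__ == "__main__":
    states = power_states(SAMPLE)
    print(states, "transitions:", count_transitions(states))
```

A single powerup should produce one OFF to ON transition; any further change back to OFF without operator action suggests the host is power-cycling by itself.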

Mentioned in SAL (#wikimedia-cloud) [2021-01-15T09:47:36Z] <arturo> labstore1004 maintain-dbusers affected by T272127 and T272125

Mentioned in SAL (#wikimedia-cloud) [2021-01-15T13:41:57Z] <arturo> icinga downtime labstore1004 maintain-dbuser alert until 2021-01-19 (T272125)

Record: 1
Date/Time: 08/31/2020 17:37:02
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 01/15/2021 08:56:59
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.

Record: 3
Date/Time: 01/20/2021 18:44:27
Source: system
Severity: Critical

Description: The chassis is open while the power is off.
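For anyone who wants to track which DIMM slots show up in these logs over time, here is a minimal sketch that parses records in the layout pasted above and lists the slots named in Critical entries. It assumes the log has been copied out as plain text; the function names `parse_sel` and `critical_dimms` are hypothetical, and the field layout is taken only from the three records in this comment.

```python
import re
from datetime import datetime

# One record, in the field order used in the log above. \s+ also spans blank lines.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>[^\n]+)"
)


def parse_sel(text):
    """Parse the pasted log text into a list of dicts, one per record."""
    entries = []
    for m in RECORD_RE.finditer(text):
        entries.append({
            "record": int(m["record"]),
            "when": datetime.strptime(m["when"], "%m/%d/%Y %H:%M:%S"),
            "source": m["source"],
            "severity": m["severity"],
            "description": m["description"].strip(),
        })
    return entries


def critical_dimms(entries):
    """Return the sorted DIMM slot names mentioned in Critical records."""
    slots = set()
    for entry in entries:
        if entry["severity"] == "Critical":
            slots.update(re.findall(r"DIMM_[A-Z]\d+", entry["description"]))
    return sorted(slots)


# Records copied from the comment above.
SAMPLE = """\
Record: 1
Date/Time: 08/31/2020 17:37:02
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 01/15/2021 08:56:59
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.

Record: 3
Date/Time: 01/20/2021 18:44:27
Source: system
Severity: Critical

Description: The chassis is open while the power is off.
"""

if __name__ == "__main__":
    print(critical_dimms(parse_sel(SAMPLE)))
```

Running the same parser on the log after a DIMM swap makes it easy to see whether the error followed the module or stayed with the slot.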

Swapped DIMM A4 with DIMM B4, cleared the system log and powered on. Let's see if the error returns, stays the same or changes.

Quick response from the server: after swapping the DIMMs, it got stuck in a continuous reboot. I connected the console and saw that the server was failing during POST at the memory check. I am not sure whether both the A4 and B4 DIMMs are bad, but I removed them both and used the DIMMs from A9 and A10 to populate A4/B4, to keep the correct DIMM population sequence. The server booted with no issues. A ticket will need to be created with Dell to get replacement memory sent.

@Cmjohnson unfortunately the server isn't accessible yet - I cannot even reach its idrac :-(

root@cumin1001:~# ping clouddb1019.eqiad.wmnet -c5
PING clouddb1019.eqiad.wmnet (10.64.48.9) 56(84) bytes of data.

--- clouddb1019.eqiad.wmnet ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 87ms

root@cumin1001:~# ping clouddb1019.mgmt.eqiad.wmnet -c5
PING clouddb1019.mgmt.eqiad.wmnet (10.65.0.131) 56(84) bytes of data.

--- clouddb1019.mgmt.eqiad.wmnet ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 95ms
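A reachability check like the two pings above can be scripted by reading the summary line. This is a minimal sketch, assuming the Linux iputils `ping` summary format shown in the output above; the helper names `packet_loss` and `is_unreachable` are hypothetical.

```python
import re

# Summary line as printed in the ping output above.
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received, (\d+(?:\.\d+)?)% packet loss"
)


def packet_loss(output):
    """Return (transmitted, received, loss_percent) from ping output text."""
    m = SUMMARY_RE.search(output)
    if m is None:
        raise ValueError("no ping summary line found")
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


def is_unreachable(output):
    """True when packets were sent and none came back."""
    sent, received, _ = packet_loss(output)
    return sent > 0 and received == 0


# Output copied from the mgmt interface ping above.
SAMPLE = """\
PING clouddb1019.mgmt.eqiad.wmnet (10.65.0.131) 56(84) bytes of data.

--- clouddb1019.mgmt.eqiad.wmnet ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 95ms
"""

if __name__ == "__main__":
    print(packet_loss(SAMPLE), is_unreachable(SAMPLE))
```

When both the host address and the mgmt address report no replies, as here, the problem is unlikely to be limited to the operating system, since the iDRAC answers independently of the host.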

Could we have an ETA for when this server can be worked on? This server is part of a new infrastructure that we were aiming to open to users on the 1st of Feb.

I am working on it, but I am dependent on Dell. I do need to update all the f/w and idrac today.

I failed to re-connect the mgmt cable after getting it to power on, so I was not able to remotely access the server to get the logs for the Dell tech. I have now connected everything and updated the BIOS, and I am submitting a task to Dell.

Thanks Chris, any chance that we can at least get the host to boot up so MySQL replication can catch up a bit?
Thank you!

There is just less memory at the moment.

Dell ticket number SR1049824647

Thanks Chris - I can now access the server and will start mysql so it can catch up on replication! Let's coordinate to install the new memory once it arrives.
Thanks again

New DIMM has been dispatched for the server. I will coordinate a time with you to power down and restore the original configuration.

Sounds good @Cmjohnson, let me know when it arrives and when you plan to change it, so I can stop mysql.
Thank you

Mentioned in SAL (#wikimedia-operations) [2021-01-28T15:49:08Z] <marostegui> Power off clouddb1019 for memory replacement T272125

This has been fixed.

DELL Return tracking #
USPS 9202394653012447257126