
ganeti5002 was down / powered off, machine check entries in SEL
Closed, Resolved, Public

Description

Today ganeti5002 was down with no output on the console. This is what ipmi-sel reports; cc @wiki_willy for visibility in case we need to take action on the hardware.

10  | Aug-24-2020 | 11:53:14 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 18h
11  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
12  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
13  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
14  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
15  | Aug-24-2020 | 11:55:42 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
16  | Aug-24-2020 | 11:55:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 18h
17  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
18  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
19  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
20  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
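
For anyone triaging a similar dump, here is a minimal sketch of pulling the machine check entries out of ipmi-sel output. It assumes the pipe-delimited six-column layout shown above; the names `SelEntry`, `parse_ipmi_sel` and `machine_checks` are made up for illustration and are not part of any WMF tooling.

```python
from typing import List, NamedTuple


class SelEntry(NamedTuple):
    record_id: int
    date: str
    time: str
    sensor_name: str
    sensor_type: str
    event: str


def parse_ipmi_sel(text: str) -> List[SelEntry]:
    """Parse pipe-delimited ipmi-sel lines, skipping anything that does not fit."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|", 5)]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # header row, blank line, or unrelated text
        entries.append(SelEntry(int(fields[0]), *fields[1:]))
    return entries


def machine_checks(entries: List[SelEntry]) -> List[SelEntry]:
    """Keep only the entries logged by the "CPU Machine Chk" sensor."""
    return [e for e in entries if e.sensor_name == "CPU Machine Chk"]


# Three lines copied from the SEL output in this task.
SAMPLE = """\
10  | Aug-24-2020 | 11:53:14 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 18h
11  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
16  | Aug-24-2020 | 11:55:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 18h
"""

for entry in machine_checks(parse_ipmi_sel(SAMPLE)):
    print(entry.record_id, entry.date, entry.time, entry.event)
```

Filtering on the sensor name rather than the event text keeps the MSR Info Log entries out of the result.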

Event Timeline

jijiki triaged this task as High priority. Aug 24 2020, 10:20 PM
jijiki added a project: serviceops.

So we likely need to run a CPU test via the Dell testing suite, and that will require downtime of the node. AFAICT the directions for this are on: https://wikitech.wikimedia.org/wiki/Ganeti#Shutdown_a_node_for_a_prolonged_period_of_time

So, I'll follow those later this week to migrate instances off ganeti5002 and then run the software tests.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T17:29:22Z] <robh> migrating ganeti instances off ganeti5002 for troubleshooting per T261130

OK, here is an export of the SEL. (It has to be cleared before running the hw diagnostic, otherwise the diagnostic throws an error for the existing entries in the SEL.)

/admin1-> racadm getsel
Record:      1
Date/Time:   06/04/2019 13:10:38
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/07/2020 02:27:25
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/07/2020 02:27:30
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/07/2020 02:37:55
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/07/2020 02:37:55
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/07/2020 02:43:23
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/07/2020 02:43:25
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/07/2020 03:01:08
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   02/07/2020 03:01:10
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
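
Since the SEL gets cleared before the diagnostic run, it can be handy to keep the export in a structured form. The following is a rough sketch, assuming the `Key: value` record layout and dashed separators shown in the racadm output above; `parse_racadm_sel` is a hypothetical helper name, not an existing tool.

```python
from typing import Dict, List

SEPARATOR = "-" * 10  # records are divided by a long run of dashes


def parse_racadm_sel(text: str) -> List[Dict[str, str]]:
    """Turn `racadm getsel` output into one dict per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(SEPARATOR):
            if current:
                records.append(current)
            current = {}
        elif ":" in line and not line.startswith("/"):
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# Two records copied from the export in this task.
SAMPLE = """\
Record:      9
Date/Time:   02/07/2020 03:01:10
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
"""

critical = [r for r in parse_racadm_sel(SAMPLE) if r["Severity"] == "Critical"]
for record in critical:
    print(record["Record"], record["Date/Time"], record["Description"])
```

Lines starting with `/` are skipped so that the `/admin1->` prompt lines do not end up in a record.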

Removed the system from the Ganeti cluster following the directions on wikitech for extended downtime. Will do hw testing on it next.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T19:42:40Z] <robh> ganeti5002 firmware update before hw testing via T261130

Technical Support will need this information to diagnose the problem.
Please record the information below.

Service Tag : FLX09X2
Error Code : 2000-0620
Validation Code : 74812
Network 2 - Failed with Device Error
Continue testing?

I continued the testing since the CPU was what threw an error, so I suspect a bad mainboard is causing these errors. I'll keep updating this task with the rest of the error codes.

I've created SR1037478758 to dispatch a replacement mainboard. I'll open an inbound shipment ticket with SG3 once I get notification of the shipment, and arrange with Jin for the work on-site. (We also had the option of having a Dell tech go out, but then we would need to pay SG3 staff 321 per hour to monitor them, so it is easier to go with Jin.)

RobH mentioned this in Unknown Object (Task). Sep 23 2020, 9:51 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-25T06:50:41Z] <elukey> shutdown ganeti5002 (mistakenly powercycled it without seeing T261130)

Added a week of downtime, sorry for the powercycle :(

RobH mentioned this in Unknown Object (Task). Sep 30 2020, 3:11 PM
RobH added a subtask: Unknown Object (Task).

I've gone ahead and fixed the self dispatch issue (had to add a new SG group to get this to work, not sure how it worked last time I sent parts but whatever!).

SR1038298224

Once it dispatches I'll pass the info to our local engineer (Jin) to quote and schedule on-site work to swap the mainboard.

Any news on this one? (just found out today about it while working on T265607)

So we got some movement on this Friday, and replies today. Dell Singapore is being very difficult and requires a local contact number. I've gone ahead and cleared Jin's info with him, and handed it to Dell. Jin 2/ DreamIIC will coordinate the part replacement and update us.

Hopefully we see movement on this during this week.

For some reason (we found this out a few months ago), Dell Singapore part replacements don't go out with return tags. They require you to call and schedule a pickup of the part with Dell after you swap things out.

Jin has swapped out the defective mainboard and is taking the bad part offsite and calling Dell to schedule the pickup. The new mainboard has imported the settings of the old, and is currently awaiting reimage.

I'll do the reimage tomorrow during normal working hours.

RobH closed subtask Unknown Object (Task) as Resolved. Nov 2 2020, 4:48 PM

The reimage of this host is giving me trouble. I've verified in the iDRAC BIOS settings that IPMI over LAN is enabled, but the script errors out with the following:

robh@cumin2001:~$ sudo -i wmf-auto-reimage-host -p T261130 ganeti5002.eqsin.wmnet
18:11:02 | ganeti5002.eqsin.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/202011051811_robh_7532_ganeti5002_eqsin_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/202011051811_robh_7532_ganeti5002_eqsin_wmnet_cumin.out
IPMI Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
18:11:09 | ganeti5002.eqsin.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'ganeti5002.mgmt.eqsin.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'ganeti5002.mgmt.eqsin.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1.
18:11:09 | ganeti5002.eqsin.wmnet | REIMAGE END | retcode=2

I was reimaging hosts just fine yesterday with this script, so I suspect it's a setting on the server I'm missing somewhere.

@RobH, what's the status here? Was the IPMI error reproducible on a second attempt?

IPMI Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
18:15:55 | ganeti5002.eqsin.wmnet | Unable to run wmf-auto-reimage-host: Remote IPMI failed for mgmt 'ganeti5002.mgmt.eqsin.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'ganeti5002.mgmt.eqsin.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1.
18:15:55 | ganeti5002.eqsin.wmnet | REIMAGE END | retcode=2

I've double-checked the settings and IPMI is enabled on the iDRAC, so I'm not sure why this host is doing this. I'd like to ask someone else in DC ops to give this host a once-over remotely to see if they notice anything I'm missing.

Currently the host fails IPMI, and if I manually try to PXE boot it, it doesn't actually PXE boot and instead falls through to the OS on the HDD (which is outdated and was installed for the old mainboard).
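
The IPMI check that fails in the reimage logs above can be reproduced on its own, which helps separate a credentials problem from a problem in the reimage script. A minimal sketch follows, assuming `ipmitool` is installed on the machine running it; the command line is copied from the error message, while the function names and the timeout are illustrative assumptions.

```python
import os
import subprocess
from typing import List


def ipmi_power_status_cmd(mgmt_host: str, user: str = "root") -> List[str]:
    """Build the same ipmitool invocation that appears in the reimage error."""
    # -E tells ipmitool to read the password from the IPMI_PASSWORD env variable.
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host,
            "-U", user, "-E", "chassis", "power", "status"]


def ipmi_reachable(mgmt_host: str, password: str, timeout: float = 30.0) -> bool:
    """Return True if the BMC answers the power status query with exit code 0."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = subprocess.run(
            ipmi_power_status_cmd(mgmt_host),
            env=env, capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ipmitool missing, or the BMC never answered within the timeout.
        return False
    return result.returncode == 0
```

Calling `ipmi_reachable("ganeti5002.mgmt.eqsin.wmnet", password)` would return False for the state described in this task, where the session could not be established even though the same credentials were accepted elsewhere.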

RobH added subscribers: Papaul, Cmjohnson, RobH.

@wiki_willy,

Can we ask either @Papaul or @Cmjohnson to double check me here and see what I am missing? Reassigning to you rather than directly to one of them, since you know their schedules better than I do! I'd ping in IRC, but I've been told not to do that on Silent Fridays so I'm avoiding doing so now.

When I manually PXE boot it, it boots into the installer, so there is communication between the host and install5001:

Jan 15 20:19:49 install5001 dhcpd[32758]: DHCPDISCOVER from .........:0d via 10.132.0.3
Jan 15 20:19:49 install5001 dhcpd[32758]: DHCPOFFER on 10.132.0.22..........:0d via 10.132.0.3
Jan 15 20:19:49 install5001 dhcpd[32758]: DHCPDISCOVER from ..............:0d via 10.132.0.2
Jan 15 20:19:49 install5001 dhcpd[32758]: DHCPOFFER on 10.132.0.22 to .............:0d via 10.132.0.2

The install started with no problem.
Now I am checking to see what the issue with IPMI is.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

ganeti5002.eqsin.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101152053_pt1979_19778_ganeti5002_eqsin_wmnet.log.

The issue was that, after the system motherboard was replaced, I am guessing the credentials were restored onto the new iDRAC board from the chassis flash backup. Those credentials still worked for the iDRAC web interface and for remote login, but with the same credentials we could no longer interact with the iDRAC board over IPMIv2/lanplus, which gave the error: "Unable to establish IPMI v2 / RMCP+ session".

What needs to be done in this case is to reset the iDRAC password to the same value.

The install is in progress; I will update the task once the install has finished.

Completed auto-reimage of hosts:

['ganeti5002.eqsin.wmnet']

and were ALL successful.

RobH claimed this task.

So this is now ready to be pushed back into service, resolving this hw repair task.

Thanks Papaul and Rob, I'll take care of re-adding ganeti5002 to the eqsin Ganeti cluster.

Mentioned in SAL (#wikimedia-operations) [2021-02-09T15:10:53Z] <moritzm> readding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130

Mentioned in SAL (#wikimedia-operations) [2021-02-11T13:27:33Z] <moritzm> re-adding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130