
ganeti5002 was down / powered off, machine check entries in SEL
Open, High · Public

Description

Today ganeti5002 was down with no output on the console. This is what ipmi-sel reports. cc @wiki_willy for visibility in case we need to take action on the hardware.

10  | Aug-24-2020 | 11:53:14 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 18h
11  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
12  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
13  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
14  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
15  | Aug-24-2020 | 11:55:42 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
16  | Aug-24-2020 | 11:55:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 18h
17  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
18  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
19  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
20  | Aug-24-2020 | 11:55:42 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
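
For anyone who wants to pull the machine check entries out of a longer SEL dump, here is a minimal Python sketch. It assumes the pipe-delimited six-column layout shown above (ID, date, time, name, type, event). The names `parse_ipmi_sel` and `machine_checks` are made up for illustration and are not part of ipmi-sel or any other tool.

```python
def parse_ipmi_sel(text):
    """Parse pipe-delimited SEL lines into dicts; skip lines that do not fit."""
    entries = []
    for line in text.splitlines():
        # Split into at most six columns so the event text stays in one piece.
        fields = [f.strip() for f in line.split("|", 5)]
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        rec_id, date, time, name, sensor_type, event = fields
        entries.append({
            "id": int(rec_id),
            "date": date,
            "time": time,
            "name": name,
            "type": sensor_type,
            # The event column holds several " ; "-separated parts.
            "event": [part.strip() for part in event.split(";")],
        })
    return entries


def machine_checks(entries):
    """Keep only the entries whose name column is 'CPU Machine Chk'."""
    return [e for e in entries if e["name"] == "CPU Machine Chk"]


# Three lines copied from the SEL output in this task.
SAMPLE = """\
10  | Aug-24-2020 | 11:53:14 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 18h
11  | Aug-24-2020 | 11:53:14 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h
16  | Aug-24-2020 | 11:55:42 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 18h
"""

for entry in machine_checks(parse_ipmi_sel(SAMPLE)):
    print(entry["id"], entry["date"], entry["time"], entry["event"][0])
```

Lines that do not match the six-column layout are skipped rather than raising an error, so the full command output can be pasted in unchanged.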

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 24 2020, 2:54 PM
jijiki triaged this task as High priority. · Aug 24 2020, 10:20 PM
jijiki added a project: serviceops.
wiki_willy assigned this task to RobH. · Aug 24 2020, 10:46 PM
RobH added a comment. · Aug 24 2020, 11:47 PM

So we likely need to run a CPU test via the Dell testing suite, and that will require downtime of the node. AFAICT the directions for this are on: https://wikitech.wikimedia.org/wiki/Ganeti#Shutdown_a_node_for_a_prolonged_period_of_time

So, I'll follow those later this week to migrate instances from ganeti5002 to run the software tests.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T17:29:22Z] <robh> migrating ganeti instances off ganeti5002 for troubleshooting per T261130

RobH added a comment. · Edited · Wed, Sep 23, 7:38 PM

OK, here is an export of the SEL. (I have to clear it before running the hw diagnostic, or the diagnostic throws an error because of the existing SEL entries.)

/admin1-> racadm getsel
Record:      1
Date/Time:   06/04/2019 13:10:38
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/07/2020 02:27:25
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/07/2020 02:27:30
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/07/2020 02:37:55
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/07/2020 02:37:55
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/07/2020 02:43:23
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/07/2020 02:43:25
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/07/2020 03:01:08
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   02/07/2020 03:01:10
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   08/24/2020 11:55:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
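
To line this export up against the ipmi-sel view in the description, the critical records can be filtered out with a short script. This is a minimal Python sketch, assuming the `Key: value` blocks separated by dashed lines shown above. `parse_racadm_sel` is a hypothetical helper name, not a racadm feature.

```python
import re

FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_racadm_sel(text):
    """Split 'racadm getsel' style output into one dict per record."""
    records = []
    current = {}
    for line in text.splitlines():
        # A line made up only of dashes closes the current record.
        if re.fullmatch(r"-{5,}", line.strip()):
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# Two records copied from the export in this task, prompt line included.
SAMPLE = """\
/admin1-> racadm getsel
Record:      9
Date/Time:   02/07/2020 03:01:10
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   08/24/2020 11:53:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
/admin1->
"""

for rec in parse_racadm_sel(SAMPLE):
    if rec.get("Severity") == "Critical":
        print(rec["Record"], rec["Date/Time"], rec["Description"])
```

Prompt lines such as `/admin1->` contain no known field name and are ignored, so the console capture can be fed in as-is.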
RobH added a comment. · Wed, Sep 23, 7:39 PM

Removed the system from the ganeti cluster via the directions on wikitech for extended downtime. Will do hw testing on it next.

Mentioned in SAL (#wikimedia-operations) [2020-09-23T19:42:40Z] <robh> ganeti5002 firmware update before hw testing via T261130

RobH added a comment. · Wed, Sep 23, 8:03 PM

Technical Support will need this information to diagnose the problem.
Please record the information below.

Service Tag : FLX09X2
Error Code : 2000-0620
Validation Code : 74812
Network 2 - Failed with Device Error
Continue testing?

I continued the testing, since the CPU was what originally threw an error. I suspect a bad mainboard is causing these errors. I'll keep updating this task with the rest of the error codes.

RobH added a comment. · Wed, Sep 23, 9:44 PM

I've created SR1037478758 to dispatch a replacement mainboard. I'll open an inbound shipment ticket with SG3 once I get notification of the shipment, and arrange with Jin for the work on-site. (We also had the option of having a Dell tech go out, but then we would need to pay SG3 staff 321 per hour to monitor them, so it is easier to go with Jin.)

RobH mentioned this in Unknown Object (Task). · Wed, Sep 23, 9:51 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-25T06:50:41Z] <elukey> shutdown ganeti5002 (mistakenly powercycled it without seeing T261130)

elukey added a subscriber: elukey. · Fri, Sep 25, 6:53 AM

Added a week of downtime, sorry for the powercycle :(