
mc2036 mainboard fuse failure
Closed, Resolved, Public

Description

mc2036 was reported offline by icinga on 2018-01-23 @ 17:10 GMT.

Logging into the mgmt interface showed no serial output, and that the system was powered off. Further investigation of the mgmt log showed the following:

System Error
01/23/2018 17:08
Server Critical Fault (Service Information: Bad Fuse, System Board, P12V Main/AUX Regulator 1 (10h))

This system is under warranty until 2020-01-13.

@Papaul: Please open a support case with HP and address the system board issue above.

codfw memcached is currently off, so this being offline is not an emergency. If the system mainboard is replaced, it is typically best to re-image the host post hardware replacement.

Event Timeline

RobH triaged this task as Medium priority.Jan 23 2018, 5:34 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper.Jan 23 2018, 5:34 PM
Papaul added a comment.Feb 2 2018, 5:04 PM

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5326746523
Status: Case is generated and in Progress

Product description: HPE ProLiant DL360 Gen9 8SFF Configure-to-order Server
Product number: 755258-B21
Serial number: MXQ70202VK
Subject: DL360 Gen9 - Server critical fault

Yours sincerely,
Hewlett Packard Enterprise

Papaul reassigned this task from Papaul to RobH.Feb 2 2018, 5:11 PM

@RobH please see below for instructions on how to fix this problem. We need to run the file within the OS, so download the file, copy it somewhere on the server, and run it from there.
Hi Papaul,

Update the CPLD Smart Component to version 14:

  1. Download the CPLD Smart Component Version 14 available at the following ftp sites:

· Windows:

https://downloads.hpe.com/pub/softlib2/software1/sc-windows-fw/p702371824/v113449/cp028274.exe

· Linux:

https://downloads.hpe.com/pub/softlib2/software1/sc-linux-fw/p414089357/v113451/CP028275.scexe

  2. ACCESS DENIED ERRORS: Click Here if You Encounter Access Denied Errors with the FTP links above.
  3. Run the CPLD Smart Component.
  4. Click OK at the end of the process to reboot the system.
  5. Upon first reboot after updating the CPLD Smart Component, press F9 to enter the ROM-Based Setup Utility (RBSU) and verify that the CPLD level ("Hardware PAL/CPLD") is at "0x14" (see Figure 1 below).

https://support.hpe.com/hpsc/doc/public/imageServlet?DOCID=emr_na-c04927201-3/c04927202.png

RobH reassigned this task from RobH to Papaul.Feb 2 2018, 5:17 PM

Papaul,

I'm not entirely certain what has happened with this system. Can you please clarify the troubleshooting that has taken place? Has the mainboard been replaced, or is support requiring this before they actually send us a replacement part?

The system powered off at the board failure, so we cannot really run it to flash it, correct? No OS or boot due to failed mainboard, so we cannot load into the OS to run this flash.

Please advise support we need a new mainboard. Please update this task on what has been done so far by onsite work to repair this.

Papaul added a comment.Feb 5 2018, 4:40 PM

HP will send a replacement main board for the system.

Hello Papaul,

If that's the case, I would be setting up an onsite service and recommending the system board to be replaced.

Kindly confirm the Point of Contact, Address for service and preferred time for service.

Regards,
Rahul B Singh
Technical Solutions Consultant, Industry Standard Server
Global Solution Centre, Bangalore, Hewlett Packard Enterprise

Papaul added a comment.Feb 6 2018, 2:31 AM

Dear Mr Papaul

Hewlett Packard Enterprise Reference Number: 5326746523

STATUS: Customer Self Repair Part has been shipped

Part/s shipped: 843307-001
Part description: SPS-PCA DL380/DL360 Gen9 SYS I/O BRDWL
Carrier Name: UPSN
Tracking Number: 1Z4217AR0134495374

Product description: HPE ProLiant DL360 Gen9 Server
Product number: 755258-B21
Serial number: MXQ70202VK
Problem description:

Server critical fault error
Papaul added a comment.Feb 6 2018, 6:34 PM

The tracking status on the main board says "Delay" as of today, Feb. 6th, at 12:33pm CT.

Papaul reassigned this task from Papaul to RobH.Feb 8 2018, 4:51 PM

Main board replacement complete.
New eth0 MAC address : e0:07:1b:f8:87:c8

@Papaul : When you're back in the data centre, can you please check the serial console? I tried to reimage the host, but it failed:

13:51:07 | mc2036.codfw.wmnet | Unable to run wmf-auto-reimage-host: Command '['ipmitool', '-I', 'lanplus', '-H', 'mc2036.mgmt.codfw.wmnet', '-U', 'root', '-E', 'chassis', 'bootdev', 'pxe']' returned non-zero exit status 1

When I open the serial console over the ILO it's also stuck.
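
For anyone debugging a similar reimage failure: the error above only reports the exit status of the ipmitool call, not what ipmitool printed. Below is a minimal sketch, not part of wmf-auto-reimage-host, that rebuilds the same command and captures its output so the underlying message is visible. The helper names `build_bootdev_command` and `run_with_diagnostics` are hypothetical; the `-E` flag makes ipmitool read the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess


def build_bootdev_command(mgmt_host, user="root", device="pxe"):
    """Return the ipmitool argv that sets the next boot device.

    Mirrors the command from the failed reimage log; -E tells ipmitool to
    take the password from the IPMI_PASSWORD environment variable.
    """
    return [
        "ipmitool", "-I", "lanplus", "-H", mgmt_host,
        "-U", user, "-E", "chassis", "bootdev", device,
    ]


def run_with_diagnostics(argv, timeout=30):
    """Run argv and return (exit_code, stdout, stderr) instead of raising.

    Hypothetical helper: 127 and 124 are conventional shell-style codes used
    here for "command not found" and "timed out".
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return 127, "", f"{argv[0]}: command not found"
    except subprocess.TimeoutExpired:
        return 124, "", f"timed out after {timeout}s"
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    cmd = build_bootdev_command("mc2036.mgmt.codfw.wmnet")
    code, out, err = run_with_diagnostics(cmd)
    print(f"exit={code}\nstdout={out}\nstderr={err}")
```

Printing stderr alongside the exit status would help tell an unreachable or hung management controller apart from an authentication problem.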

MoritzMuehlenhoff reassigned this task from RobH to Papaul.Mar 7 2018, 2:40 PM

Server would not power on

  • Draining power

It looks like another dead main board. I will contact HP and see what they say.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5327787437
Status: Case is generated and in Progress

Product description: HPE ProLiant DL360 Gen9 8SFF Configure-to-order Server
Product number: 755258-B21
Serial number: MXQ70202VK
Subject: DL360 Gen9 - Server down

Yours sincerely,
Hewlett Packard Enterprise

@MoritzMuehlenhoff

  • main board replacement
  • ILO configuration
  • new nic 1 MAC address e0:07:1b:f7:63:68

Mentioned in SAL (#wikimedia-operations) [2018-03-15T07:21:39Z] <moritzm> reimaging mc2036 after hardware replacement T185587

I tried to reimage, but the BIOS prints "PXE-E61: Media test failure, check cable" about 20 times until it falls back to attempting to boot from disk. Can you check whether the cabling is correct?

@MoritzMuehlenhoff the cable is connected. Please check on the switch side whether the port is active; I have no light on that port.

RobH added a comment.Mar 15 2018, 4:23 PM

Looks like it's actually set to admin state down:

ge-8/0/0 down down mc2036

I've gone ahead and re-enabled it, and ensured it's in the proper (internal) vlan.
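
As a rough illustration of how to read the status line above, here is a small sketch that splits it into fields and separates "administratively disabled" from "enabled but no link". It assumes the columns are interface, admin state, link state, and description, in that order; the names `InterfaceStatus`, `parse_interface_line`, and `diagnose` are hypothetical.

```python
from typing import NamedTuple


class InterfaceStatus(NamedTuple):
    name: str
    admin: str
    link: str
    description: str


def parse_interface_line(line):
    """Parse one status line, assuming the column order
    interface / admin state / link state / description."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError(f"unexpected interface line: {line!r}")
    name, admin, link = parts[:3]
    description = parts[3].strip() if len(parts) > 3 else ""
    return InterfaceStatus(name, admin, link, description)


def diagnose(status):
    """Return a short hint about why the port shows no link."""
    if status.admin == "down":
        return "administratively disabled: enable the port on the switch"
    if status.link == "down":
        return "enabled but no link: check cable and NIC"
    return "up"


if __name__ == "__main__":
    status = parse_interface_line("ge-8/0/0 down down mc2036")
    print(status.name, "->", diagnose(status))
```

With the admin state down, the link state is down regardless of cabling, which matches the missing port light and the PXE-E61 errors reported earlier.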

Mentioned in SAL (#wikimedia-operations) [2018-03-16T07:53:09Z] <moritzm> reimage mc2036 after mainboard replacement (T185587)

MoritzMuehlenhoff closed this task as Resolved.Mar 16 2018, 8:50 AM

mc2036 has been reimaged, closing.