
hw troubleshooting: failure to power up for elastic2043.codfw.wmnet
Closed, Resolved · Public · Request

Description

FQDN => elastic2043.codfw.wmnet

  • Banned from the Elasticsearch cluster.
  • Put system into a failed state in Netbox.
  • Medium urgency: we can handle 1 or 2 nodes offline indefinitely, but it increases the chance that we drop into yellow or red cluster status (loss of redundancy or actual data loss, respectively).
  • Host fails to power up (already tried to powercycle). See https://phabricator.wikimedia.org/T281215 for more context.
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
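The yellow/red risk mentioned above can be checked from the host itself. The following is a minimal sketch, assuming the standard `_cluster/health` endpoint is reachable on the same local ports used elsewhere in this task; `health_status` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Sketch: pull the "status" value (green / yellow / red) out of a
# _cluster/health JSON response supplied on stdin.
# Assumes the response contains a single "status" field.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Example use against one cluster (port assumed from the commands in this task):
#   curl -s http://localhost:9200/_cluster/health | health_status
```

Parsing with `sed` keeps the sketch dependency-free; where `jq` is installed, `jq -r .status` would be the sturdier choice.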

Event Timeline

  • Drained power
  • Updated the CPLD firmware
  • Relocated the server from U3 to U32
  • Changed the switch port from xe-7/0/1 to xe-7/0/31

The server is back up. Please change the status in Netbox once the server is back in service.

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'
curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'
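The commands above differ only in the port, so the same reset can be written as a loop. This is a sketch only: the port list (9200, 9400, 9600) is taken from the commands that appear in this task, and `unban_payload` / `unban_all` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: clear the allocation exclusions on each local cluster endpoint.
# Ports are assumed from the curl commands in this task.
CLUSTER_PORTS="9200 9400 9600"

# Print the settings body that resets both exclusion filters.
unban_payload() {
  printf '%s' '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'
}

# Send the reset to every port; stop at the first failure.
unban_all() {
  local port
  for port in $CLUSTER_PORTS; do
    curl -sS --fail -H 'Content-Type: application/json' -XPUT \
      "http://localhost:${port}/_cluster/settings" -d "$(unban_payload)" || return 1
    echo
  done
}
```

Note that setting `_name` to `null` clears the whole exclusion list, so any other host that is banned at the time would be unbanned as well.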

Unbanned from the Elasticsearch cluster. Will circle back to update the status in netbox once I've confirmed that the host is behaving properly.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:55:00Z] <ryankemper> T281327 [Elastic] Unbanned elastic2043 from cluster

elastic2043 seems to have PSU problems, which caused it to randomly reboot:

racadm>>racadm getsel
Record:      1
Date/Time:   04/28/2021 22:19:04
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/02/2021 01:08:16
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/05/2021 17:19:59
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
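To pick the failures out of a longer event log, the Critical records can be filtered mechanically. A minimal sketch, assuming the `Date/Time:` / `Severity:` / `Description:` layout shown in the output above; `sel_critical` is a hypothetical helper that reads saved `racadm getsel` output from stdin.

```shell
#!/usr/bin/env bash
# Sketch: print "timestamp<TAB>description" for every Critical record.
# Assumes each record lists Date/Time, then Severity, then Description.
sel_critical() {
  awk '
    /^Date\/Time:/  { sub(/^Date\/Time:[ \t]*/, "");  ts  = $0; next }
    /^Severity:/    { sub(/^Severity:[ \t]*/, "");    sev = $0; next }
    /^Description:/ { sub(/^Description:[ \t]*/, "");
                      if (sev == "Critical") print ts "\t" $0 }
  '
}

# Example use with output saved to a file (file name is illustrative):
#   sel_critical < getsel.txt
```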

I've taken the host back out of service:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": ["elastic2033-production-search-codfw", "elastic2043-production-search-codfw"]}}}'
curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": ["elastic2033-production-search-omega-codfw", "elastic2043-production-search-omega-codfw"]}}}'
curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": ["elastic2033-production-search-psi-codfw", "elastic2043-production-search-psi-codfw"]}}}'
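The three ban commands follow one pattern: each cluster has its own port and its own node-name suffix. The sketch below builds the same request body from a suffix and a list of hosts. The port-to-suffix pairs are read off the commands above; `ban_payload` and `ban_hosts` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: build the exclude body for one cluster.
# Usage: ban_payload <name-suffix> <host> [<host> ...]
ban_payload() {
  local suffix="$1"; shift
  local names="" host
  for host in "$@"; do
    # Append "<host>-<suffix>" as a quoted, comma-separated JSON string.
    names="${names:+$names, }\"${host}-${suffix}\""
  done
  printf '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": [%s]}}}' "$names"
}

# Apply the ban on all three clusters (port:suffix pairs assumed from this task).
ban_hosts() {
  local pair
  for pair in 9200:production-search-codfw \
              9400:production-search-omega-codfw \
              9600:production-search-psi-codfw; do
    curl -sS --fail -H 'Content-Type: application/json' -XPUT \
      "http://localhost:${pair%%:*}/_cluster/settings" \
      -d "$(ban_payload "${pair#*:}" "$@")" || return 1
    echo
  done
}
```

For example, `ban_hosts elastic2033 elastic2043` would send the three requests shown above. The `_name` list replaces any previous value, so it has to include every host that should stay banned.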

Mentioned in SAL (#wikimedia-operations) [2021-05-05T23:35:18Z] <ryankemper> T281621 T281327 [Elastic] Banned elastic2033 and elastic2043 from the Cirrussearch Elasticsearch clusters

This host is unreachable over SSH again. There is definitely some underlying hardware failure.

I tested both power supplies by running the server on one PSU at a time; the server works fine on either.
I also upgraded the BIOS and iDRAC. The server is back up and no longer shows any errors in the iDRAC GUI.

@RKemper You can put the server back in service, or leave it running for a week to see whether the problem recurs. If it does, I will have to open a case with Dell.
Let me know if you have any questions. Resolving the task for now.

elukey added a subscriber: elukey.

Still reported down :(

racadm>>racadm getsel
Record:      1
Date/Time:   05/07/2021 00:43:42
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/09/2021 17:27:32
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/09/2021 17:28:10
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   05/09/2021 17:28:35
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   05/09/2021 17:31:58
Source:      system
Severity:    Critical
Description: The storage BP1 Power A cable is not connected, or is improperly connected.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   05/09/2021 17:33:53
Source:      system
Severity:    Ok
Description: The storage BP1 Power A cable or interconnect is connected.
-------------------------------------------------------------------------------

@Papaul Per the above, let's go ahead and open the case with Dell for the failure

Hello Papaul,

Here is your case information:

Case#: 113228160
Tag#: GBJ8CS2
Model#: POWEREDGE R440

Dell asked us to upgrade the iDRAC and BIOS, since the TSR report doesn't show which component is failing; the new firmware may provide more information on the faulty component. According to the Dell technician, the logs indicate that the power supplies are not the issue. If we still have the same issue after upgrading the BIOS and iDRAC, the next step will be to disconnect all the components from the mainboard and reconnect them one by one to see which one is faulty.

After the firmware upgrade, the server no longer powers up when the power button is pressed.

Dell called me today about this server and recommended that we do a minimum-to-POST test to find out which part is causing the issue. I will be in touch with them next Tuesday.

I performed the minimum-to-POST test on the server last week, as requested by Dell; the server still wouldn't power on. I requested that a new mainboard be sent to me.

I know work is still ongoing but just wanted to say - thanks for all your work on this Papaul! I know the server's in good hands with you :)

Last update from Dell

The Dell replacement part(s) for your POWEREDGE R440,ICE PE has been shipped by FEDX on tracking number 977965240257.

I have the mainboard on site and will be replacing it on Monday.

@RKemper The mainboard has been replaced and the firmware upgrade is done as well. The server is back online.

@Papaul the host is down again :(

racadm>>racadm getsel
Record:      1
Date/Time:   <System Boot>
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/07/2020 02:12:05
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/07/2020 02:12:10
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   06/08/2021 13:56:16
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   06/08/2021 16:55:23
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   06/08/2021 16:55:23
Source:      system
Severity:    Critical
Description: The storage BP1 Power A cable is not connected, or is improperly connected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   06/08/2021 16:55:43
Source:      system
Severity:    Ok
Description: The storage BP1 Power A cable or interconnect is connected.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   06/08/2021 16:55:45
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.

I emailed Dell with the latest update.

Email from Dell below

After having others look at the logs we have a backplane cable that is having issues.  Can you disconnect and reconnect any cables going to the backplane and call us if you have more issues so we can look at things with you at the unit.  Also can you look at the cables for damage when you are reseating them please

I will ask them to send someone on site to do the troubleshooting.

I sent an email to Dell asking them to dispatch one of their technicians to troubleshoot this server, since it is taking a while to get it fixed.

The Dell technician replaced the backplane and cables. The server is back up and online. Closing this task for now; please re-open if there are any issues.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T16:55:09Z] <ryankemper> T281327 [Cirrus -> codfw] Current banned nodes are elastic2043 and elastic2045; elastic2043 can be unbanned after a re-image, and elastic2045 can be unbanned in ~30 minutes after shards rebalance (had heavy shards scheduled)

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2043.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211825_ryankemper_3197_elastic2043_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-07-21T18:27:11Z] <ryankemper> T281327 [Elastic] sudo -i wmf-auto-reimage-host -p T281327 elastic2043.codfw.wmnet on ryankemper@cumin2001 tmux session reimage_elastic2043

Completed auto-reimage of hosts:

['elastic2043.codfw.wmnet']

Of which those FAILED:

['elastic2043.codfw.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2043.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107212231_ryankemper_810_elastic2043_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2043.codfw.wmnet']

Of which those FAILED:

['elastic2043.codfw.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2043.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107212232_ryankemper_885_elastic2043_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2043.codfw.wmnet']

Of which those FAILED:

['elastic2043.codfw.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2043.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107220015_ryankemper_14754_elastic2043_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2043.codfw.wmnet']

Of which those FAILED:

['elastic2043.codfw.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2043.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107220031_ryankemper_15765_elastic2043_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2043.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-07-22T05:31:20Z] <ryankemper> T281327 [Elastic] Unbanned elastic2043.codfw.wmnet from all 3 cirrus/elasticsearch clusters; node is back in the fleet