
elastic2043 doesn't power up
Closed, Declined | Public

Description

-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/26/2021 17:30:10
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   04/26/2021 17:30:34
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   04/26/2021 17:30:47
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/26/2021 17:32:38
Source:      system
Severity:    Critical
Description: The storage BP1 Power A cable is not connected, or is improperly connected.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/26/2021 17:32:58
Source:      system
Severity:    Ok
Description: The storage BP1 Power A cable or interconnect is connected.

Hi Papaul!

I tried to power-cycle and hard-reset the server, but it does not seem able to power up :(

Event Timeline

elukey renamed this task from elastic2043 powercycle needed to elastic2043 doesn't power up. Apr 27 2021, 6:26 AM
elukey added a project: ops-codfw.
elukey updated the task description.
elukey added a subscriber: Papaul.

Mentioned in SAL (#wikimedia-operations) [2021-04-27T17:19:34Z] <ryankemper> T281215 Banned elastic2043 from codfw cirrussearch cluster

ryankemper@elastic2044:~$ curl -s localhost:9600/_cluster/health
{"cluster_name":"production-search-psi-codfw","status":"green","timed_out":false,"number_of_nodes":17,"number_of_data_nodes":17,"active_primary_shards":1518,"active_shards":4553,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards
curl -s localhost:9200/_cat/shards | grep elastic2043
ryankemper@elastic2044:~$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2043-production-search-codfw"}}}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_host":"","_name":"elastic2043-production-search-codfw"}}}}}}
ryankemper@elastic2044:~$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2043-production-search-codfw^C}}'
ryankemper@elastic2044:~$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2043-production-search-codfw"}}}'
curl: (7) Failed to connect to localhost port 9400: Connection refused
ryankemper@elastic2044:~$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2043-production-search-codfw"}}}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_host":"","_name":"elastic2043-production-search-codfw"}}}}}}
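
For reference, the exclusion calls above could be scripted roughly as follows. This is a sketch, not what was actually run: `build_exclude_payload`, `ban_node` and the `DRY_RUN` switch are hypothetical names made up for illustration, and the port list (9200 and 9600) only reflects the two ports that answered in this session. The session sends the same `_name` value to both ports; whether each cluster registers the node under exactly that name is not confirmed by the log, so treat the name as an assumption too.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical, not an existing tool.

# Print the transient cluster-settings body that excludes one node by name.
# The body mirrors the one used in the session above.
build_exclude_payload() {
    printf '{"transient":{"cluster.routing.allocation.exclude":{"_host":"","_name":"%s"}}}' "$1"
}

# Send the exclusion to the cluster listening on the given local port.
# With DRY_RUN=1 (the default here) the request is printed instead of sent.
ban_node() {
    port=$1
    node=$2
    payload=$(build_exclude_payload "$node")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'PUT http://localhost:%s/_cluster/settings %s\n' "$port" "$payload"
    else
        curl -s -H 'Content-Type: application/json' -XPUT \
            "http://localhost:${port}/_cluster/settings" -d "$payload"
    fi
}

# Ports 9200 and 9600 are the two that accepted connections in the session.
for port in 9200 9600; do
    ban_node "$port" "elastic2043-production-search-codfw"
done
```

Running it as-is only prints the two requests; setting `DRY_RUN=0` would send them. Afterwards, `curl -s localhost:9200/_cat/shards | grep elastic2043` (as in the session) can be used to check whether any shards are still listed on the node.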

Made a ticket using the hardware failure template from the dc-ops group. In retrospect I probably should have just copied the template over to this task, but I wasn't sure whether the template does anything special (I don't think it does).

https://phabricator.wikimedia.org/T281327