
elastic2054 unresponsive
Closed, Resolved (Public)

Description

The host is unresponsive and reported as down by Icinga:

PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100%

Unable to ping, ssh or get any output on console.

Event Timeline

Volans created this task. Jul 4 2019, 9:26 PM
Restricted Application added a subscriber: Aklapper. Jul 4 2019, 9:26 PM
Volans triaged this task as High priority. Jul 4 2019, 9:31 PM

The host is part of the main and psi clusters:

$ confctl --quiet select name="elastic2054.codfw.wmnet" get
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch"}
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-psi-ssl"}
{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-ssl"}
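
The output above is one JSON object per line, so the services a host is pooled in can be extracted mechanically. A minimal sketch, assuming the output keeps exactly the shape pasted here; `pooled_services` is a hypothetical helper name and is not part of confctl:

```python
import json


def pooled_services(confctl_output, host):
    """Return the service tags for which `host` is pooled.

    Assumes one JSON object per line, shaped like the paste above.
    Hypothetical helper, not part of confctl itself.
    """
    services = []
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        state = record.get(host)
        if state is None or state.get("pooled") != "yes":
            continue
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(pair.split("=", 1) for pair in record["tags"].split(","))
        services.append(tags.get("service"))
    return services


# The three lines returned by confctl for elastic2054, copied from above.
SAMPLE = "\n".join([
    '{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch"}',
    '{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-psi-ssl"}',
    '{"elastic2054.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=elasticsearch,service=elasticsearch-ssl"}',
])

print(pooled_services(SAMPLE, "elastic2054.codfw.wmnet"))
```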

Current status of the clusters (taken from another host in the same clusters):

 elastic2053  0 ~$ curl -s localhost:9600/_cluster/health?pretty
{
  "cluster_name" : "production-search-psi-codfw",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 14,
  "number_of_data_nodes" : 14,
  "active_primary_shards" : 1450,
  "active_shards" : 4349,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
 elastic2053  0 ~$ curl -s localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "production-search-codfw",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 29,
  "number_of_data_nodes" : 29,
  "active_primary_shards" : 1254,
  "active_shards" : 3731,
  "relocating_shards" : 0,
  "initializing_shards" : 10,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.73269179363807
}
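
To compare the two responses at a glance, the fields that matter here can be condensed into one line per cluster. A rough sketch, assuming the response has already been parsed into a dict; `summarize_health` is a hypothetical helper and the sample dicts hold only the fields it reads, with values copied from the output above:

```python
def summarize_health(health):
    """Return a one-line summary of a parsed _cluster/health response."""
    # Shards that are not serving yet: still initializing or not assigned.
    pending = health["initializing_shards"] + health["unassigned_shards"]
    return (
        f"{health['cluster_name']}: {health['status']} "
        f"({health['number_of_nodes']} nodes, "
        f"{health['active_shards_percent_as_number']:.2f}% of shards active, "
        f"{pending} pending)"
    )


# Values copied from the two responses above (only the fields used here).
PSI = {
    "cluster_name": "production-search-psi-codfw",
    "status": "green",
    "number_of_nodes": 14,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "active_shards_percent_as_number": 100.0,
}
MAIN = {
    "cluster_name": "production-search-codfw",
    "status": "yellow",
    "number_of_nodes": 29,
    "initializing_shards": 10,
    "unassigned_shards": 0,
    "active_shards_percent_as_number": 99.73269179363807,
}

for cluster in (PSI, MAIN):
    print(summarize_health(cluster))
```

On the main cluster, 3731 active shards plus 10 initializing ones make 3741 in total, which is where the 99.73% figure comes from.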

Mentioned in SAL (#wikimedia-operations) [2019-07-04T21:35:00Z] <volans> forcing reboot of elastic2054 from console, host unresponsive - T227298

Volans added a comment (edited). Jul 4 2019, 9:40 PM

Nothing in syslog.
The hardware first detected a CPU error and then a memory error. Here are the hardware logs:

-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/04/2019 21:12:40
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/04/2019 21:12:40
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   07/04/2019 21:12:41
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
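
Most of the records above are informational, so it can help to filter the dump down to the Critical entries. A minimal sketch, assuming the dump keeps the "Key: value" layout with dashed separator lines shown above; `parse_sel` is a hypothetical helper, not a DRAC tool, and the sample is abridged to three of the records:

```python
def parse_sel(text):
    """Parse a dashed-line-separated record dump into a list of dicts."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # A separator line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        if ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# Records 4, 9 and 10 from the log above.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/04/2019 21:12:40
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   07/04/2019 21:15:21
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
"""

critical = [r for r in parse_sel(SAMPLE) if r.get("Severity") == "Critical"]
for record in critical:
    print(record["Record"], record["Date/Time"], record["Description"])
```
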
Volans added a comment. Jul 4 2019, 9:41 PM

Both clusters back to green:

 elastic2054  0 ~$ curl -s localhost:9600/_cluster/health?pretty
{
  "cluster_name" : "production-search-psi-codfw",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 15,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 1450,
  "active_shards" : 4349,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
 elastic2054  0 ~$ curl -s localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "production-search-codfw",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 30,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 1254,
  "active_shards" : 3741,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
Volans lowered the priority of this task from High to Normal. Jul 4 2019, 9:43 PM

@Gehel I'll leave the task open in case you want to investigate further tomorrow and identify hardware parts to replace (see the hardware logs above).

Gehel added a subscriber: Papaul. Jul 5 2019, 9:26 AM

elastic2054 is down again.

It is set to pooled=inactive, and marked as failed in netbox.

@Papaul: it looks like this is going to need your help. You can do whatever you need with this server and reboot it when done.

Papaul claimed this task. Jul 8 2019, 2:18 PM
Papaul added a comment. Jul 8 2019, 2:33 PM

Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

Papaul added a comment. Jul 8 2019, 2:52 PM

I swapped B2 with A2 and the error is gone. I am leaving this task open for a week. If we see the same problem on A2, I will request a replacement.

Mentioned in SAL (#wikimedia-operations) [2019-07-09T14:53:10Z] <gehel> repooled elastic2054 - T227298

Papaul closed this task as Resolved. Jul 17 2019, 3:01 PM

@Gehel I checked this server again today and all looks good. Resolving this task for now. We can reopen it anytime.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2019-08-07T23:22:38Z] <mutante> elastic2054 - powercycling after it went down unexpectedly and Icinga alerted, this happened before in T227298

Dzahn reopened this task as Open. Wed, Aug 7, 11:24 PM
Dzahn closed this task as Resolved. Wed, Aug 7, 11:37 PM
Dzahn added a subscriber: Dzahn.

There were no memory errors in syslog this time either. The log just stops and then continues.

But in DRAC:

-------------------------------------------------------------------------------
Record:      21
Date/Time:   08/07/2019 23:14:09
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   08/07/2019 23:14:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   08/07/2019 23:14:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   08/07/2019 23:14:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   08/07/2019 23:14:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   08/07/2019 23:16:42
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   08/07/2019 23:16:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
/admin1->

My last comment on July 8 was: "I swapped B2 with A2, no more error. leaving this task open for a week. If we do have the same problem on A2, I will request a replacement." It looks like we do have the error on A2 now. I will request a DIMM replacement.

Papaul reopened this task as Open. Wed, Aug 7, 11:51 PM
Papaul added a comment. Fri, Aug 9, 9:39 PM

You have successfully submitted request SR995910217.

Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your Dell EMC TechDirect dashboard.

Papaul closed this task as Resolved. Tue, Aug 13, 3:20 PM

DIMM A2 replaced and log cleared. Closing this task for now.
