
elastic2038 DOWN (CPU/memory errors)
Closed, ResolvedPublic

Description

elastic2038 went down at 09:23 today; the system event log shows CPU and memory errors (Elasticsearch is fine with one node down):

-------------------------------------------------------------------------------
Record:      2
Date/Time:   03/01/2019 03:21:28
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   03/01/2019 03:21:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   03/01/2019 03:21:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   03/01/2019 03:21:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   03/01/2019 03:21:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   03/01/2019 09:23:41
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/01/2019 09:23:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

@Papaul while this is not an emergency, we're already missing a bunch of servers, so ping me if there is anything I can do to help move this forward.

@Gehel Please power the server off if it is not off. Thanks.

I've just powered down elastic2038 via the mgmt interface.

The server is under warranty until Nov 2021, but the firmware on the server is old. If I call Dell, they will first ask me to upgrade the firmware.
Old version
BIOS Version 1.5.6
Lifecycle Controller Firmware 3.21.21.21

Clear log before upgrade.
New version
BIOS Version 1.7.0
Lifecycle Controller Firmware 3.21.26.22

Firmware upgrade.
The next step will be to monitor the system and see if we get the same error. If we do, I will swap DIMM A2 with DIMM B2. If the error follows the module (shows on DIMM B2), I will call Dell for a memory replacement; if the error still shows on DIMM A2, it is a bad slot on the mainboard.
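The swap test above is a standard fault-isolation step. As a sketch, the decision logic can be written out as a small (hypothetical, illustrative) function:

```python
def diagnose_after_swap(error_location: str) -> str:
    """Interpret the DIMM A2 <-> B2 swap test.

    The suspect module originally sat in slot A2 and is moved to B2.
    If the error follows the module to B2, the DIMM itself is bad;
    if the error stays on A2, the slot (mainboard) is bad.
    """
    if error_location == "DIMM_B2":
        return "bad DIMM - request replacement from Dell"
    if error_location == "DIMM_A2":
        return "bad slot - mainboard problem"
    return "no error - keep monitoring"
```

As it turned out later in this task, the error showed up on DIMM_B2 after the swap, pointing at a bad DIMM rather than a bad slot.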

I powered the system back up.

@Gehel Since yesterday, no errors have been reported in the log. Can you repool the server so we can monitor it under load?

Thanks.

@Papaul the server has been repooled since it restarted. Let's blame cosmic rays until we prove otherwise?

Papaul lowered the priority of this task from High to Medium. Mar 5 2019, 5:08 PM

@Gehel I checked the system log again today; no errors. Closing this task for now.

@Gehel It crashed today: it just went down with no console output. I powercycled it and it came back.

Note the service came back before the host did, so it did fail over successfully:

20:30 <+icinga-wm> PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100%
20:31 <+icinga-wm> PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9443/_cluster/health error while 
                   fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9443): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
20:32 < mutante> !log powercycling elastic2038

20:33 <+icinga-wm> RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, 
                   number_of_nodes: 14, active_shards: 3158, unassigned_shards: 225, active_primary_shards: 1128, initializing_shards: 0, delayed_unassigned_shards: 225, status: yellow, number_of_data_nodes: 
                   14, number_of_pending_tasks: 0, relocating_shards: 0

20:34 <+icinga-wm> RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
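The icinga check above is reading Elasticsearch's `_cluster/health` endpoint. A minimal sketch of how such a check can interpret the response (the field names are from the real API; the exact thresholds and message format here are illustrative, not the production check's):

```python
def shard_health_summary(health: dict) -> str:
    """Summarize a _cluster/health response, roughly like the icinga check."""
    status = health["status"]  # green / yellow / red
    unassigned = health["unassigned_shards"]
    if status == "red":
        return "CRITICAL"
    if status == "yellow":
        # yellow = replicas still unassigned/reallocating, as in the log above
        return f"OK - recovering, {unassigned} unassigned shards"
    return "OK"

# Values taken from the recovery message above
codfw_omega = {
    "status": "yellow",
    "number_of_nodes": 14,
    "active_shards": 3158,
    "unassigned_shards": 225,
}
print(shard_health_summary(codfw_omega))
```

This is why the check recovered while the cluster was still yellow: with one data node down, replicas reallocate and the cluster stays available.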

Mentioned in SAL (#wikimedia-operations) [2019-05-03T09:20:59Z] <gehel> ban elastic2038 from elastic clusters pending memory issue investigation - T217398

It looks like we need to investigate this a bit more.

The next step will be to monitor the system and see if we get the same error. If we do, I will swap DIMM A2 with DIMM B2. If the error follows the module (shows on DIMM B2), I will call Dell for a memory replacement; if the error still shows on DIMM A2, it is a bad slot on the mainboard.

@Papaul: it looks like it is time to switch those DIMMs.

elastic2038 is banned from the cluster, and downtimed in icinga. Feel free to do whatever you need with it.
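Banning a node like this is typically done through Elasticsearch's shard allocation filtering. A sketch, assuming the standard `cluster.routing.allocation.exclude._name` cluster setting (the endpoint URL below is the one from the logs above; the exact tooling WMF uses may differ):

```python
import json

def ban_node_settings(node_name: str) -> str:
    """Build the transient cluster-settings body that excludes a node from
    shard allocation, so its shards drain to the rest of the cluster."""
    body = {
        "transient": {
            "cluster.routing.allocation.exclude._name": node_name
        }
    }
    return json.dumps(body)

# PUT this body to e.g. https://search.svc.codfw.wmnet:9443/_cluster/settings
print(ban_node_settings("elastic2038"))
```

Unbanning (as done at the end of this task) is the same call with the exclude setting cleared.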

Dzahn added a subscriber: Papaul.
Dzahn renamed this task from elastic2038 CPU/memory errors to elastic2038 DOWN (CPU/memory errors). May 11 2019, 2:06 AM

It looks like the error is showing now on DIMM B2, so we have a bad DIMM. I will go ahead and request a replacement.

Description Date and Time
Correctable memory error rate exceeded for DIMM_B2. Fri 10 May 2019 16:49:06
A problem was detected related to the previous server boot. Fri 10 May 2019 09:40:39
Multi-bit memory errors detected on a memory device at location(s) DIMM_B2. Fri 10 May 2019 09:40:39
An OEM diagnostic event occurred. Fri 10 May 2019 09:38:16
An OEM diagnostic event occurred. Fri 10 May 2019 09:38:16
An OEM diagnostic event occurred. Fri 10 May 2019 09:38:16
An OEM diagnostic event occurred. Fri 10 May 2019 09:38:16

Create Dispatch: Success
You have successfully submitted request SR990577292.

  • Memory replaced on DIMM B2
  • Cleared log

Part return tracking information

Closing this task for now.

Mentioned in SAL (#wikimedia-operations) [2019-05-14T19:25:43Z] <gehel> ban elastic2038 from elasticsearch cluster for memory replacement - T217398

Mentioned in SAL (#wikimedia-operations) [2019-05-14T19:28:11Z] <gehel> shutting down elastic2038 for memory replacement - T217398

Started up just now, forgot to add the task in the SAL :)

Mentioned in SAL (#wikimedia-operations) [2019-05-15T17:19:32Z] <onimisionipe> unban elastic2038 from shard allocation - T217398

Mentioned in SAL (#wikimedia-operations) [2019-05-16T02:27:16Z] <onimisionipe> pooling elastic2038 after unbanning - T217398