
Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC
Closed, Resolved · Public

Description

Around 21:43 elastic1026 had a very long GC:

[elastic1026] [gc][young][28976][19906] duration [49.8s], collections [3]/[1.1m], total [49.8s]/[12.9m], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]
[elastic1026] [gc][old][28976][8] duration [18.1s], collections [1]/[1.1m], total [18.1s]/[24.8s], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]/[28.1]
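
For digging these events out after the fact, a minimal grep sketch like the following works; note that the log path is an assumption derived from the cluster name (production-search-eqiad):

    # GC pauses of 10 seconds or more, or anything measured in minutes
    # (the log path is an assumption based on the cluster name)
    grep -E 'duration \[([0-9]{2,}[0-9.]*s|[0-9.]+m)\]' \
        /var/log/elasticsearch/production-search-eqiad.log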

Immediately after this long GC, elastic1026's search thread pool filled up and it started rejecting queries.

Comparing this node to a healthy node in our node comparison dashboard (https://grafana.wikimedia.org/dashboard/db/elasticsearch-node-comparison?from=1462824162813&to=1462834962813&var-nodeA=elastic1024&var-nodeB=elastic1026) the following stand out to me:

  • CPU usage spiked; user space went from 10% to 70%+
  • Disk throughput spiked, but not more than I would expect from standard segment merges
  • Young GC latency went through the roof: typically 1-2s, it spiked to >10s
  • JVM heap usage was incredibly erratic. I don't know enough about the JVM to comment, but it's not our typical sawtooth
  • Query and fetch latency spiked
  • QPS dropped to ~30% of the pre-problem QPS
  • There was a 7.5G segment merge that started right when the server went haywire (see the merge-stats sketch after this list)
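
Since the merge stands out, merge activity can be pulled from the same node stats endpoint used further down. This is just a sketch; the .indices.merges field names are from the standard node stats response, so double-check them against the version we run:

    # current and cumulative merge activity per node; field names taken
    # from the standard node stats response, so treat this as a sketch
    curl -s localhost:9200/_nodes/stats | jq '.nodes[] | {
        name: .name,
        current_merges: .indices.merges.current,
        current_merge_bytes: .indices.merges.current_size_in_bytes,
        total_merge_bytes: .indices.merges.total_size_in_bytes
    }'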

Also of interest is the dmesg log: P3023

  • This machine was last rebooted (dmesg starts on) Mar 18
  • Mon May 9 22:58:15 2016 - CPU starts throttling due to overheating, total events = 1
  • Multiple overheating events, the last one (as of 23:15 UTC) reports total events = 339725 (see the dmesg check after this list)
  • elastic1026 is in rack D4, along with 1005-06 and 1023-29
    • No temperature problems on 1005, 1006, 1023, 1024, 1028, 1029.
    • 1025 throttled for ~1s twice on Apr 29, and three more times on May 1
    • 1027 throttled for ~1s once on Apr 29 and four times on May 9. The timing does not coincide with the 1026 meltdown
  • Temperature problems may be a symptom of pushing user space to >70%, rather than the cause of the issue
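
A quick way to check the other hosts in the rack for throttling; the message text is assumed from the kernel's usual thermal output, which matches what P3023 shows:

    # count thermal throttling events and show the most recent ones;
    # message text assumed from the usual kernel thermal output
    dmesg | grep -c 'cpu clock throttled'
    dmesg | grep 'Core temperature above threshold' | tail -n 5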

@Gehel restarted elasticsearch on 1026 and the problems appear to have gone away.

Event Timeline

This incident coincided with a deployment of wmf.23 (according to @Dzahn, but the change I found seems to have been deployed after the incident).

Some more notes (@EBernhardson has already written most, if not all, of the facts; let's start the speculation):

  • The heap graphs look like the GC is not able to free memory at the rate it is being allocated. The 1-minute sampling is a bit too coarse for clear visibility, but it still seems clear that we have a GC issue. Looking at the logs, the GC did manage to free memory, so the issue is allocation rate vs. GC rate, not overall heap size (in fact, reducing the heap size can help in some cases).
  • No Icinga alert was raised, probably because the graphite metric on which it is based has been renamed.

Things we should do (I'll create subtasks):

  • check / fix the Icinga alert on CirrusSearch response time.
  • enable GC logs for the elasticsearch JVMs (see the sketch of GC-logging flags below)
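
For the GC logs item, a minimal sketch of the HotSpot flags (Java 7/8 era) that would need to end up in the elasticsearch JVM options; the log path is only illustrative:

    # hypothetical additions to the elasticsearch JVM options
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/elasticsearch/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=20M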

The GC issue seems to coincide perfectly with a merge on elastic1026. We could also start monitoring all segment-related stats; most of them are bytes stored in RAM:

curl -s localhost:9200/_nodes/elastic1026/stats | jq '.nodes["_Z3TqKpDS-qBDyGky2r0Kg"].indices.segments':

        "segments": {
          "fixed_bit_set_memory_in_bytes": 4513824,
          "version_map_memory_in_bytes": 105604,
          "index_writer_max_memory_in_bytes": 3182454991,
          "index_writer_memory_in_bytes": 52100548,
          "memory_in_bytes": 2608021926,
          "count": 2268
        },

But maybe we should also watch all the stats that report sizes stored in RAM: completion.size_in_bytes, field_data, filter_cache.
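
A sketch for pulling those per node with the same curl | jq approach; the field names assume the ES 1.x node stats layout (on 2.x, filter_cache becomes query_cache), so adjust for the version we actually run:

    # one-line-per-node summary of RAM-resident stats; field names assume
    # the ES 1.x node stats layout
    curl -s localhost:9200/_nodes/stats | jq '.nodes[] | {
        name: .name,
        segments_memory: .indices.segments.memory_in_bytes,
        fielddata: .indices.fielddata.memory_size_in_bytes,
        filter_cache: .indices.filter_cache.memory_size_in_bytes,
        completion: .indices.completion.size_in_bytes
    }'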

Another small fact: when the GC issue started, most of the queries reported as WARN (> 1s) in /var/log/elasticsearch/production-search-eqiad_index_search_slowlog.log.1 were for dewiki_content.
I copied this log file to ~dcausse/T134829/production-search-eqiad_index_search_slowlog.log.1
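
To quantify that a bit, a small sketch that counts the WARN slowlog lines per index; it assumes each slowlog line carries the index name in brackets and that index names end in _content or _general (e.g. [dewiki_content]):

    # count WARN slow-query lines per index; assumes index names appear in
    # brackets on each slowlog line and end in _content or _general
    grep WARN ~dcausse/T134829/production-search-eqiad_index_search_slowlog.log.1 \
        | grep -oE '\[[a-z0-9_]+_(content|general)\]' | sort | uniq -c | sort -rn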

Are we going to do anything else with this ticket? Should we move it to Done?

Dzahn claimed this task.

If it's in Done on a board, the status should also be Resolved, right?

The practice in Discovery is to move tasks to Done on the board and let our product owner do a final review and close them.

Dzahn removed Dzahn as the assignee of this task.
debt claimed this task.