
Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC
Closed, Resolved · Public

Description

Around 21:43 elastic1026 had a very long GC:

[elastic1026] [gc][young][28976][19906] duration [49.8s], collections [3]/[1.1m], total [49.8s]/[12.9m], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]
[elastic1026] [gc][old][28976][8] duration [18.1s], collections [1]/[1.1m], total [18.1s]/[24.8s], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]/[28.1]
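
For digging these events out after the fact, a minimal grep sketch like the following works; note that the log path is an assumption derived from the cluster name (production-search-eqiad):

    # GC pauses of 10 seconds or more, or anything measured in minutes
    # (the log path is an assumption based on the cluster name)
    grep -E 'duration \[([0-9]{2,}[0-9.]*s|[0-9.]+m)\]' \
        /var/log/elasticsearch/production-search-eqiad.log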

Immediately after this long GC, elastic1026's search thread pool filled up and it started rejecting queries.

Comparing this node to a healthy node in our node comparison dashboard (https://grafana.wikimedia.org/dashboard/db/elasticsearch-node-comparison?from=1462824162813&to=1462834962813&var-nodeA=elastic1024&var-nodeB=elastic1026) the following stand out to me:

  • CPU usage spiked; user space went from 10% to 70%+
  • Disk throughput spiked, but not more than I would expect from standard segment merges
  • Young GC latency went through the roof: typically 1-2s, it spiked to >10s
  • JVM heap usage was incredibly erratic. I don't know enough about the JVM to comment, but it's not our typical sawtooth
  • Query and fetch latency spiked
  • QPS dropped to ~30% of the pre-problem QPS
  • There was a 7.5G segment merge that started right when the server went haywire (see the merge-stats sketch after this list)
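
Since the merge stands out, merge activity can be pulled from the same node stats endpoint used further down. This is just a sketch; the .indices.merges field names are from the standard node stats response, so double-check them against the version we run:

    # current and cumulative merge activity per node; field names taken
    # from the standard node stats response, so treat this as a sketch
    curl -s localhost:9200/_nodes/stats | jq '.nodes[] | {
        name: .name,
        current_merges: .indices.merges.current,
        current_merge_bytes: .indices.merges.current_size_in_bytes,
        total_merge_bytes: .indices.merges.total_size_in_bytes
    }'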

Also of interest is the dmesg log: P3023

  • This machine was last rebooted (dmesg starts on) Mar 18
  • Mon May 9 22:58:15 2016 - CPU starts throttling due to overheating, total events = 1
  • Multiple overheating events, the last one (as of 23:15 UTC) reports total events = 339725 (see the dmesg check after this list)
  • elastic1026 is in rack D4, along with 1005-06 and 1023-29
    • No temperature problems on 1005, 1006, 1023, 1024, 1028, 1029.
    • 1025 throttled for ~1s twice on Apr 29, and three more times on May 1
    • 1027 throttled for ~1s once on Apr 29 and four times on May 9. The timing does not coincide with the 1026 meltdown
  • Temperature problems may be a symptom of pushing user space to >70%, rather than the cause of the issue
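
A quick way to check the other hosts in the rack for throttling; the message text is assumed from the kernel's usual thermal output, which matches what P3023 shows:

    # count thermal throttling events and show the most recent ones;
    # message text assumed from the usual kernel thermal output
    dmesg | grep -c 'cpu clock throttled'
    dmesg | grep 'Core temperature above threshold' | tail -n 5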

@Gehel restarted elasticsearch on 1026 and the problems appear to have gone away.

Event Timeline

This incident coincided with a deployment of wmf.23 (according to @Dzahn, but the change I found seems to have been deployed after the incident).

Some more notes (@EBernhardson has already written most, if not all, of the facts; let's start the speculation):

  • The heap graphs look like the GC is not able to free memory at the rate it is being allocated. The 1-minute sampling is a bit too coarse for clear visibility, but it still seems clear that we have a GC issue. Looking at the logs, the GC did manage to free memory, so the issue is allocation rate vs. GC rate, not overall heap size (in fact, reducing the heap size can help in some cases).
  • No Icinga alert was raised, probably because the graphite metric on which it is based has been renamed.

Things we should do (I'll create subtasks):

  • check / fix the Icinga alert on CirrusSearch response time.
  • enable GC logs for the elasticsearch JVMs (see the sketch of GC-logging flags below)
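
For the GC logs item, a minimal sketch of the HotSpot flags (Java 7/8 era) that would need to end up in the elasticsearch JVM options; the log path is only illustrative:

    # hypothetical additions to the elasticsearch JVM options
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/elasticsearch/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=20M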

The GC issue seems to coincide perfectly with a merge on elastic1026. We could also start monitoring all segment-related stats; most of them are bytes stored in RAM:

curl -s localhost:9200/_nodes/elastic1026/stats | jq '.nodes["_Z3TqKpDS-qBDyGky2r0Kg"].indices.segments':

        "segments": {
          "fixed_bit_set_memory_in_bytes": 4513824,
          "version_map_memory_in_bytes": 105604,
          "index_writer_max_memory_in_bytes": 3182454991,
          "index_writer_memory_in_bytes": 52100548,
          "memory_in_bytes": 2608021926,
          "count": 2268
        },

But maybe we should also watch all the stats that report sizes stored in RAM: completion.size_in_bytes, field_data, filter_cache.
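
A sketch for pulling those per node with the same curl | jq approach; the field names assume the ES 1.x node stats layout (on 2.x, filter_cache becomes query_cache), so adjust for the version we actually run:

    # one-line-per-node summary of RAM-resident stats; field names assume
    # the ES 1.x node stats layout
    curl -s localhost:9200/_nodes/stats | jq '.nodes[] | {
        name: .name,
        segments_memory: .indices.segments.memory_in_bytes,
        fielddata: .indices.fielddata.memory_size_in_bytes,
        filter_cache: .indices.filter_cache.memory_size_in_bytes,
        completion: .indices.completion.size_in_bytes
    }'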

Another small fact: when the GC issue started, most of the queries reported as WARN (> 1s) in /var/log/elasticsearch/production-search-eqiad_index_search_slowlog.log.1 were for dewiki_content.
I copied this log file to ~dcausse/T134829/production-search-eqiad_index_search_slowlog.log.1
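
To quantify that a bit, a small sketch that counts the WARN slowlog lines per index; it assumes each slowlog line carries the index name in brackets and that index names end in _content or _general (e.g. [dewiki_content]):

    # count WARN slow-query lines per index; assumes index names appear in
    # brackets on each slowlog line and end in _content or _general
    grep WARN ~dcausse/T134829/production-search-eqiad_index_search_slowlog.log.1 \
        | grep -oE '\[[a-z0-9_]+_(content|general)\]' | sort | uniq -c | sort -rn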

Are we going to do anything else with this ticket? Should we move it to Done?

Dzahn claimed this task.

If it's in Done on a board, the status should also be Resolved, right?

The practice in Discovery is to move tasks to Done on the board and let our product owner do a final review and close them.

Dzahn removed Dzahn as the assignee of this task.
debt claimed this task.