Around 21:43 elastic1026 had a very long GC:
[elastic1026] [gc][young][28976][19906] duration [49.8s], collections [3]/[1.1m], total [49.8s]/[12.9m], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]/[28.1gb]}
[elastic1026] [gc][old][28976][8] duration [18.1s], collections [1]/[1.1m], total [18.1s]/[24.8s], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]/[28.1gb]}
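For anyone wanting to pull these numbers out programmatically, the relevant fields can be extracted with a simple regex. A minimal sketch (the sample line is the abbreviated young-GC line from above; the regex is my own, not from any ES tooling):

```python
import re

# Abbreviated GC log line from elastic1026; layout is roughly:
# [node] [gc][collector][uptime][seq] duration [...] memory [before]->[after]/[max]
line = ("[elastic1026] [gc][young][28976][19906] duration [49.8s], "
        "collections [3]/[1.1m], total [49.8s]/[12.9m], "
        "memory [27.4gb]->[7.5gb]/[29.8gb]")

m = re.search(r"\[gc\]\[(\w+)\].*duration \[([\d.]+)s\].*"
              r"memory \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]", line)
collector, duration, before, after, heap_max = m.groups()
freed = float(before) - float(after)  # heap reclaimed by this collection
print(f"{collector} GC paused {duration}s, freed {freed:.1f}gb "
      f"of {heap_max}gb heap")
```

So this single young collection paused the node for ~50s and reclaimed ~19.9gb, which lines up with the heap graph going from nearly full to mostly empty.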
Immediately after this long GC, elastic1026's search thread pool filled up and it started rejecting queries.
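The rejections show up directly in the `_cat/thread_pool` API (e.g. `GET _cat/thread_pool/search?h=node_name,name,active,queue,rejected`). A small sketch of flagging affected nodes from that output; the sample rows below are illustrative, not captured from this incident:

```python
# Hypothetical _cat/thread_pool/search output (node, pool, active, queue,
# rejected) -- invented values for illustration, not from the incident.
sample = """\
elastic1024 search 12   0    0
elastic1026 search 48 987 3521
"""

for row in sample.splitlines():
    node, pool, active, queue, rejected = row.split()
    # A backed-up queue or nonzero rejected count means searches are
    # being dropped on that node.
    if int(queue) > 0 or int(rejected) > 0:
        print(f"{node}: {pool} pool active={active} "
              f"queue={queue} rejected={rejected}")
```

Polling that endpoint per node would have flagged elastic1026 as soon as the queue backed up.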
Comparing this node to a healthy node in our node comparison dashboard (https://grafana.wikimedia.org/dashboard/db/elasticsearch-node-comparison?from=1462824162813&to=1462834962813&var-nodeA=elastic1024&var-nodeB=elastic1026) the following stand out to me:
- CPU usage spiked, user space went from 10% to 70%+
- Disk throughput spiked, but not more than I would expect from standard segment merges
- Young GC latency through the roof. Typically 1-2s, it spiked to >10s
- JVM heap usage was wildly erratic. I don't know enough about the JVM to comment, but it's not our typical sawtooth pattern
- Query and fetch latency spiked
- QPS dropped to ~30% of the pre-incident rate
- A ~7.5GB segment merge started right when the server melted down
Also of interest is the dmesg log: P3023
- This machine was last rebooted (dmesg starts on) Mar 18
- Mon May 9 22:58:15 2016 - CPU starts throttling due to overheating, total events = 1
- Multiple overheating events, last one (as of 23:15 UTC) reports total events = 339725
- elastic1026 is in rack D4, along with 1005-06 and 1023-29
- No temp problems in 1005, 1006, 1023, 1024, 1028, 1029.
- 1025 throttled for ~1s twice on Apr 29, and three more times on May 1
- 1027 throttled for ~1s once on Apr 29 and four times on May 9. The timing does not coincide with the 1026 meltdown
- Temperature problems may be a symptom of pushing user space to >70%, rather than the cause of the issue
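The per-node throttle comparison above can be reproduced by counting the kernel's thermal-throttle messages in each dmesg. A sketch, assuming the standard x86 "cpu clock throttled" message format; the excerpt below is invented, not copied from P3023:

```python
import re

# Hypothetical dmesg excerpt in the style of the elastic1026 log (P3023);
# the kernel's thermal warning embeds a running "total events" counter.
dmesg = """\
[4509123.1] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[4510999.7] CPU0: Core temperature/speed normal
[4512345.2] CPU4: Core temperature above threshold, cpu clock throttled (total events = 339725)
"""

# Each match is the cumulative event count at the time of that message,
# so the last value is the node's running total.
counts = re.findall(r"cpu clock throttled \(total events = (\d+)\)", dmesg)
print(f"throttle messages: {len(counts)}, last total events = {counts[-1]}")
```

Running the same grep across the other rack D4 nodes is how the 1025/1027 numbers above were compared.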
@Gehel restarted elasticsearch on 1026 and the problems appear to have gone away.