Around 21:43 elastic1026 had a very long GC:
```
[elastic1026] [gc][young][28976][19906] duration [49.8s], collections [3]/[1.1m], total [49.8s]/[12.9m], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]
[elastic1026] [gc][old][28976][8] duration [18.1s], collections [1]/[1.1m], total [18.1s]/[24.8s], memory [27.4gb]->[7.5gb]/[29.8gb], all_pools {[young] [18.6mb]->[185.5mb]/[1.4gb]}{[survivor] [154.3mb]->[0b]/[191.3mb]}{[old] [27.2gb]->[7.3gb]/[28.1]
```
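For comparing pauses like these across nodes, here is a rough sketch of pulling the numbers out of a GC log line. The regex only covers the fields visible in the lines above and assumes the overall memory figures are reported in gb; `parse_gc_line` and `SAMPLE` are names made up for this example, not anything from Elasticsearch.

```python
import re

# Matches only the fields visible in the log lines above. Assumes the
# overall memory triple is reported in gb, as it is in this incident.
GC_LINE = re.compile(
    r"\[(?P<node>[^\]]+)\] \[gc\]\[(?P<collector>\w+)\]\[\d+\]\[\d+\] "
    r"duration \[(?P<duration>[\d.]+)(?P<unit>ms|s|m)\].*?"
    r"memory \[(?P<before>[\d.]+)gb\]->\[(?P<after>[\d.]+)gb\]/\[(?P<max>[\d.]+)gb\]"
)

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_gc_line(line):
    """Return node, collector, pause length and heap change, or None if no match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    before, after, heap_max = (float(m[k]) for k in ("before", "after", "max"))
    return {
        "node": m["node"],
        "collector": m["collector"],
        "seconds": float(m["duration"]) * UNIT_SECONDS[m["unit"]],
        "freed_gb": round(before - after, 1),
        "heap_used_before_pct": round(100 * before / heap_max, 1),
    }


# Start of the old-generation line from above, cut after the memory triple.
SAMPLE = (
    "[elastic1026] [gc][old][28976][8] duration [18.1s], collections [1]/[1.1m], "
    "total [18.1s]/[24.8s], memory [27.4gb]->[7.5gb]/[29.8gb]"
)

if __name__ == "__main__":
    print(parse_gc_line(SAMPLE))
```

On the old-generation line this gives an 18.1s pause that freed roughly 19.9gb, with the heap about 92% full going in.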
Immediately after this long GC, elastic1026's search thread pool filled up and it started rejecting queries.
Comparing this node to a healthy node in our node comparison dashboard (https://grafana.wikimedia.org/dashboard/db/elasticsearch-node-comparison?from=1462824162813&to=1462834962813&var-nodeA=elastic1024&var-nodeB=elastic1026) the following stand out to me:
* CPU usage spiked, user space went from 10% to 70%+
* Disk throughput spiked, but not more than I would expect from standard segment merges
* Young GC latency went through the roof. Typically 1-2s, it spiked to >10s
* JVM heap usage was extremely erratic. I don't know enough about the JVM to comment, but it's not our typical sawtooth
* Query and fetch latency spiked
* QPS dropped to 30% of pre-problem QPS
* A 7.5G segment merge started right when the server began misbehaving
Also of interest is the dmesg log: P3023
* This machine was last rebooted (dmesg starts on) Mar 18
* Mon May 9 22:58:15 2016 - CPU starts throttling due to overheating, total events = 1
* Multiple overheating events, last one (as of 23:15 UTC) reports total events = 339725
* elastic1026 is in rack D4, along with 1005-06 and 1023-29
** No temp problems in 1005, 1006, 1023, 1024, 1028, 1029.
** 1025 throttled for ~1s twice on Apr 29, and three more times on May 1
** 1027 throttled for ~1s once on Apr 29, and four times on May 9. Timing does not coincide with the 1026 meltdown
* Temperature problems may be a symptom of pushing user space to >70%, rather than the cause of the issue
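
To line throttling up against the GC timeline on other hosts, something like the following could extract throttle events from `dmesg -T` output. This is a sketch: the message format is assumed to be the usual kernel thermal-throttle line and may not match what is in P3023 exactly. `throttle_events` and the sample lines are made up for illustration; only the first sample timestamp and event count come from the notes above.

```python
import re
from datetime import datetime

# Assumed message format (dmesg -T style timestamp, kernel thermal throttle text).
THROTTLE = re.compile(
    r"\[(?P<ts>[^\]]+)\] CPU(?P<cpu>\d+): (?:Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def throttle_events(lines):
    """Return (timestamp, cpu, total_events) for every throttle message found."""
    events = []
    for line in lines:
        m = THROTTLE.search(line)
        if m is None:
            continue
        # dmesg -T pads single-digit days with an extra space; collapse it first.
        stamp = " ".join(m["ts"].split())
        ts = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        events.append((ts, int(m["cpu"]), int(m["total"])))
    return events


# Hypothetical sample input. The first line mirrors the first event noted
# above; the second is an unrelated message that should be skipped.
SAMPLE_DMESG = [
    "[Mon May  9 22:58:15 2016] CPU0: Core temperature above threshold, "
    "cpu clock throttled (total events = 1)",
    "[Mon May  9 22:58:16 2016] some unrelated kernel message",
]

if __name__ == "__main__":
    for ts, cpu, total in throttle_events(SAMPLE_DMESG):
        print(ts.isoformat(), f"cpu{cpu}", total)
```

Comparing the first throttle timestamp with the first long GC pause would help show whether the overheating came before or after the CPU spike.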