A recent issue on Elasticsearch points to GC (garbage collection) overload. Collecting GC logs would help diagnose this kind of issue if it ever happens again. Some care needs to be taken around log rotation: GC logging is overly optimized, which creates a few issues for log rotation.
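As a rough illustration of what enabling GC logs with rotation can look like, here is a minimal sketch that assembles the Java 8 style HotSpot flags. This is not the content of the puppet patch: the helper name `gc_logging_flags`, the log path, the number of files and the file size are all assumptions picked for the example. The flags themselves are standard HotSpot options on Java 8; Java 9 and later replaced them with `-Xlog:gc*`.

```python
def gc_logging_flags(log_path, num_files=10, file_size="20M"):
    """Return HotSpot (Java 8 style) flags that enable rotated GC logs.

    log_path, num_files and file_size are illustrative values only,
    not the settings used on the production clusters.
    """
    return [
        f"-Xloggc:{log_path}",                  # where the JVM writes GC logs
        "-XX:+PrintGCDetails",                  # per-collection detail
        "-XX:+PrintGCDateStamps",               # wall-clock timestamps
        "-XX:+UseGCLogFileRotation",            # let the JVM rotate its own files
        f"-XX:NumberOfGCLogFiles={num_files}",  # how many rotated files to keep
        f"-XX:GCLogFileSize={file_size}",       # rotate once a file reaches this size
    ]


if __name__ == "__main__":
    # Hypothetical path; prints a string suitable for a JVM options variable.
    print(" ".join(gc_logging_flags("/var/log/elasticsearch/gc.log")))
```

Letting the JVM rotate its own GC log files is one way to sidestep the rotation issues mentioned above, since an external tool does not then need to truncate or move a file the JVM is still writing to.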
|operations/puppet : production||elasticsearch - enable GC logs by default|
|operations/puppet : production||elasticsearch - enable garbage collection logs on relforge servers|
|Resolved||debt||T134829 Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC|
|Resolved||Gehel||T134853 Enable GC (garbage collection) logs on Elasticsearch JVM|
This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).
The patch *should* resolve the issue, but it is not yet deployed. So at this point GC logs are enabled on the relforge cluster, but not anywhere else. I'm reopening this and will close it for real fairly soon.