Reduce frequency of garbage collection alerts on cloudelastic
Closed, Resolved (Public)

Description

We've been getting a lot of alerts for excessive garbage collection on the cloudelastic hosts.

Creating this ticket to address the issue. Possible approaches:

  1. Stop the GC from happening
  2. Detune the alerts
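
To make the second option concrete, here is a minimal sketch of what "detuning" could mean, assuming the alert fires when the share of wall-clock time spent in GC stays above a threshold for several consecutive sampling intervals. The names (`GcSample`, `gc_time_fraction`, `should_alert`), the sample values, and the thresholds are all hypothetical; this is not the actual alert definition used for cloudelastic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GcSample:
    timestamp_s: float       # wall-clock time of the sample, in seconds
    gc_time_total_ms: float  # cumulative JVM GC time at that moment, in ms


def gc_time_fraction(earlier: GcSample, later: GcSample) -> float:
    """Fraction of the interval between two samples that was spent in GC."""
    elapsed_ms = (later.timestamp_s - earlier.timestamp_s) * 1000.0
    if elapsed_ms <= 0:
        raise ValueError("samples must be in increasing time order")
    return (later.gc_time_total_ms - earlier.gc_time_total_ms) / elapsed_ms


def should_alert(samples: Sequence[GcSample], threshold: float,
                 min_consecutive: int) -> bool:
    """True if the GC fraction exceeded `threshold` for at least
    `min_consecutive` sampling intervals in a row."""
    streak = 0
    for earlier, later in zip(samples, samples[1:]):
        if gc_time_fraction(earlier, later) > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a quiet interval resets the streak
    return False


# Hypothetical one-minute samples: two busy intervals, then a quiet one.
samples = [
    GcSample(0, 0),
    GcSample(60, 6000),
    GcSample(120, 12000),
    GcSample(180, 13000),
]

strict = should_alert(samples, threshold=0.05, min_consecutive=2)
detuned = should_alert(samples, threshold=0.05, min_consecutive=3)
```

In this sketch, detuning means either raising `threshold` or requiring a longer streak via `min_consecutive`. The first option, by contrast, aims to lower the measured GC fraction itself, for example by giving the JVM more heap.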

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-09-03T16:01:58Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply heap settings - bking@cumin2002 - T373895

I merged this Puppet patch to increase the heap size for the secondary clusters on cloudelastic.

Mentioned in SAL (#wikimedia-operations) [2024-09-03T16:26:30Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply heap settings - bking@cumin2002 - T373895

bking triaged this task as Low priority. Sep 3 2024, 7:53 PM

The alerts have cleared, but let's leave this open for a few days so we can get a better idea of whether the heap size increase helped.

It's been 12 days and I have not seen any new alerts for garbage collection in cloudelastic. As such, I'm moving to "needs review." If/when the Search Platform software engineers are happy, we can close this one completely.

Checked over the GC graphs for the last week; everything there looks reasonable.