
Investigate and fix GC issues on cloudelastic machines
Open, High, Public

Description

Since cloudelastic started receiving production updates, cloudelastic1001 has suffered from odd GC behavior that slows the cluster down and makes it hard for it to keep up with live updates.

Symptoms
  • the number of old GC operations increases to insane values (300 ops/hour)
  • the GC logs report Full GC (Ergonomics) with very high pause times (excerpt below, followed by a sketch of the logging flags that produce it)
2019-08-29T06:52:28.077+0000: 484409.002: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1802631K(14554624K)] [ParOldGen: 31457212K->31456931K(31457280K)] 44837820K->33259562K(46011904K), [Metaspace: 77274K->77274K(90112K)], 6.6373629 secs] [Times: user=92.26 sys=0.50, real=6.64 secs] 
2019-08-29T06:52:34.714+0000: 484415.639: Total time for which application threads were stopped: 6.6400531 seconds, Stopping threads took: 0.0002179 seconds
2019-08-29T06:52:38.751+0000: 484419.676: Total time for which application threads were stopped: 0.0024733 seconds, Stopping threads took: 0.0001990 seconds
2019-08-29T06:52:39.476+0000: 484420.401: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1782371K(14554624K)] [ParOldGen: 31456931K->31457151K(31457280K)] 44837539K->33239522K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.2366785 secs] [Times: user=100.64 sys=0.77, real=7.24 secs] 
2019-08-29T06:52:46.713+0000: 484427.638: Total time for which application threads were stopped: 7.2393857 seconds, Stopping threads took: 0.0001504 seconds
2019-08-29T06:52:49.902+0000: 484430.827: Total time for which application threads were stopped: 0.0024902 seconds, Stopping threads took: 0.0001788 seconds
2019-08-29T06:52:50.391+0000: 484431.316: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1844540K(14554624K)] [ParOldGen: 31457151K->31456756K(31457280K)] 44837759K->33301296K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.6572372 secs] [Times: user=106.90 sys=0.73, real=7.65 secs] 
2019-08-29T06:52:58.048+0000: 484438.973: Total time for which application threads were stopped: 7.6619386 seconds, Stopping threads took: 0.0019666 seconds
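
For reference, a minimal sketch of the jvm.options entries that produce log lines like the ones above, assuming JDK 8 (the format matches the pre-unified-logging flags); the log path is an assumption:

-Xloggc:/var/log/elasticsearch/gc.log    # path is an assumption
-XX:+PrintGCDetails                      # per-collection generation breakdown
-XX:+PrintGCDateStamps                   # the ISO timestamps seen above
-XX:+PrintGCApplicationStoppedTime       # the "Total time for which application threads were stopped" lines
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=32
-XX:GCLogFileSize=64m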
What has been tried so far
  • Set NewRatio to 2 to increase the old gen size and try to stop the GC from reshaping its memory layout (see the jvm.options sketch below)
  • Increase the heap up to 45G, since we thought we had saturated the heap
  • Reduce refresh_interval and remove custom index merging limits to reduce segment count. Reduced from ~65k to ~58k. No obvious effect on tracked memory buckets.
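
A minimal sketch of the first two changes as jvm.options entries (any value beyond what is stated above is an assumption):

-Xms45g
-Xmx45g
-XX:NewRatio=2    # old:young = 2:1, i.e. the old gen gets about 2/3 of the heap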

None of these attempts worked.
What we still do not know is why the GC behaves like this on these machines.

What we haven't tried:

  • Switch to G1GC
  • Deactivate UseAdaptiveSizePolicy (-XX:-UseAdaptiveSizePolicy); this and the G1GC switch are sketched below
  • Reduce replica count from 2 to 1 to reduce cluster data size
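
The first two options would be plain jvm.options changes along these lines (standard HotSpot flags; nothing here has been applied yet):

-XX:+UseG1GC                   # option 1: replaces the current ParallelGC/NewRatio tuning
-XX:-UseAdaptiveSizePolicy     # option 2: keep ParallelGC but stop it resizing the generations on its own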
Workaround

Restart the affected JVM when the GC starts misbehaving
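
On a systemd host this amounts to something like the following; the unit name is an assumption and may differ on the cloudelastic machines:

sudo systemctl restart elasticsearch    # unit name is an assumption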

Immediate actions:

  • T231516: Alert when a JVM hits more than 100 ops/hour (a rule sketch follows)
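
A hypothetical Prometheus-style rule for that alert, assuming the metric exposed by the common elasticsearch_exporter; the actual metric name and alerting stack may differ:

# hypothetical rule, metric name assumed
- alert: ElasticsearchOldGcStorm
  expr: rate(elasticsearch_jvm_gc_collection_seconds_count{gc="old"}[1h]) * 3600 > 100
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: '{{ $labels.name }}: more than 100 old GC collections per hour'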

Event Timeline

dcausse created this task. Aug 29 2019, 7:14 AM
Restricted Application added a subscriber: Aklapper. Aug 29 2019, 7:14 AM
dcausse triaged this task as High priority. Aug 29 2019, 7:14 AM
dcausse updated the task description.
dcausse moved this task from needs triage to Ops / SRE on the Discovery-Search board.
dcausse updated the task description. Aug 29 2019, 8:28 AM
dcausse updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-08-29T17:15:20Z] <dcausse> restarted elasticsearch on cloudelastic1004 (T231517)

Looking into the graphs, it seems to me that the underlying problem is that the min heap keeps growing over time. When the node gets to 2/3 (~30.8GB) heap used the old GC goes crazy. We can re-apply NewRatio with the 45G heap and the old gen will be able to grow by a few more GB, but unless we can figure out what the final steady state value is for the old gen we can only really keep trying larger values.

The terms data doesn't seem large enough to account for the heap growth, in particular because terms memory doesn't decrease when restarting a node but used heap drops 10-20GB. A bit puzzled where this memory is going.
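
For reference on the 2/3 figure above: -XX:NewRatio=2 sizes the old generation at roughly two thirds of the heap, and 2/3 of a 45G heap is 30G, which matches the ParOldGen capacity of 31457280K (exactly 30 GiB) in the log excerpt in the description. Once old-gen usage approaches that ~30 GB ceiling the collector falls back to repeated full GCs, and enlarging the heap only raises the ceiling by 2/3 of whatever is added.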

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:23:23Z] <ebernhardson> restart elasticsearch on cloudelastic1001 (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:41:48Z] <ebernhardson> set index.merge.scheduler.max_thread_count to null to accept default values on cloudelastic-chi (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:50:15Z] <ebernhardson> restart elasticsearch on cloudelastic1002 (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:59:00Z] <ebernhardson> restart elasticsearch on cloudelastic1003 (T231517)

Not sure whether it's helping or not, but I increased refresh_interval on all cloudelastic-chi indices to 5 minutes, and removed their index.merge.scheduler.max_thread_count settings (was 1, now takes the default of 3) to see if we could cut back on the number of tiny segments. Segment count reduced from 65k to 57k, about 10%. Might be a minor memory saving, but likely very little, as the tradeoff is an increase of IndexWriter buffers from ~200M/server to ~1GB/server.
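
A sketch of that settings change as a dynamic index settings call (host, port and index pattern are assumptions):

curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.refresh_interval": "5m",
  "index.merge.scheduler.max_thread_count": null
}'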

EBernhardson updated the task description. Aug 29 2019, 7:31 PM
EBernhardson updated the task description. Sep 3 2019, 4:01 PM

Reducing replica count from 2 to 1 had a dramatic effect on the cluster. Things are generally looking happy now.
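
For the record, the replica reduction is a one-line dynamic settings change along these lines (host and index pattern are assumptions):

curl -XPUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.number_of_replicas": 1
}'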