Since cloudelastic started receiving production updates, cloudelastic1001 has suffered from strange GC behavior, causing slowdowns that make it hard for the cluster to keep up with live updates.
====Symptoms====
- the number of old-gen GC operations climbs to extreme values (300 ops/hour)
- the GC logs report `Full GC (Ergonomics)` events with very high pause times
```
2019-08-29T06:52:28.077+0000: 484409.002: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1802631K(14554624K)] [ParOldGen: 31457212K->31456931K(31457280K)] 44837820K->33259562K(46011904K), [Metaspace: 77274K->77274K(90112K)], 6.6373629 secs] [Times: user=92.26 sys=0.50, real=6.64 secs]
2019-08-29T06:52:34.714+0000: 484415.639: Total time for which application threads were stopped: 6.6400531 seconds, Stopping threads took: 0.0002179 seconds
2019-08-29T06:52:38.751+0000: 484419.676: Total time for which application threads were stopped: 0.0024733 seconds, Stopping threads took: 0.0001990 seconds
2019-08-29T06:52:39.476+0000: 484420.401: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1782371K(14554624K)] [ParOldGen: 31456931K->31457151K(31457280K)] 44837539K->33239522K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.2366785 secs] [Times: user=100.64 sys=0.77, real=7.24 secs]
2019-08-29T06:52:46.713+0000: 484427.638: Total time for which application threads were stopped: 7.2393857 seconds, Stopping threads took: 0.0001504 seconds
2019-08-29T06:52:49.902+0000: 484430.827: Total time for which application threads were stopped: 0.0024902 seconds, Stopping threads took: 0.0001788 seconds
2019-08-29T06:52:50.391+0000: 484431.316: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1844540K(14554624K)] [ParOldGen: 31457151K->31456756K(31457280K)] 44837759K->33301296K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.6572372 secs] [Times: user=106.90 sys=0.73, real=7.65 secs]
2019-08-29T06:52:58.048+0000: 484438.973: Total time for which application threads were stopped: 7.6619386 seconds, Stopping threads took: 0.0019666 seconds
```
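The pause times can be pulled out of these logs mechanically. Below is a minimal sketch of a parser for HotSpot GC log lines in the format shown above; the regex and field layout are assumptions derived from this sample, not a general GC-log parser.

```python
import re

# Matches lines like the `Full GC (Ergonomics)` entries above.
# Captures the wall-clock timestamp and the total pause in seconds
# (the "N secs]" figure just before the "[Times: ...]" block).
FULL_GC = re.compile(
    r"^(?P<ts>\S+): \S+: \[Full GC \(Ergonomics\).*?"
    r"(?P<pause>\d+\.\d+) secs\] \[Times:"
)

def full_gc_pauses(lines):
    """Yield (timestamp, pause_seconds) for every Full GC (Ergonomics) line."""
    for line in lines:
        m = FULL_GC.search(line)
        if m:
            yield m.group("ts"), float(m.group("pause"))

# Example using the first log line from this task:
sample = (
    "2019-08-29T06:52:28.077+0000: 484409.002: [Full GC (Ergonomics) "
    "[PSYoungGen: 13380608K->1802631K(14554624K)] "
    "[ParOldGen: 31457212K->31456931K(31457280K)] "
    "44837820K->33259562K(46011904K), [Metaspace: 77274K->77274K(90112K)], "
    "6.6373629 secs] [Times: user=92.26 sys=0.50, real=6.64 secs]"
)
events = list(full_gc_pauses([sample]))
```

Note how in each entry ParOldGen stays essentially full (31457xxxK of 31457280K) across collections, which is consistent with the GC thrashing rather than reclaiming space.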
====What has been tried so far====
* Set NewRatio to 2 to increase the old gen size and try to prevent the GC from reshaping its memory layout
* Increased the heap to 45G, as we suspected we had saturated the heap
* Reduced refresh_interval and removed custom index merging limits to lower the segment count. Segments went from ~65k to ~58k, with no obvious effect on the tracked memory buckets.
None of these attempts worked.
We still do not know why the GC behaves like this on these machines.
What we haven't tried:
* [] Switch to G1GC
* [] Deactivate UseAdaptiveSizePolicy `-XX:-UseAdaptiveSizePolicy`
* [] Reduce replica count from 2 to 1 to reduce cluster data size
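For reference, the first two untried options would look roughly like the fragment below in the Elasticsearch JVM options file (the file path and the G1 pause target are assumptions; the exact change would go through our deployment tooling, not a hand edit).

```
## Candidate GC settings (not yet applied) — sketch for
## /etc/elasticsearch/jvm.options (path assumed)

## Option: switch to G1GC (would replace the current ParallelGC setup)
# -XX:+UseG1GC
# -XX:MaxGCPauseMillis=200    # pause target is an illustrative value

## Option: keep ParallelGC but stop it from resizing the generations,
## i.e. disable the "Ergonomics" behavior seen in the logs above
# -XX:-UseAdaptiveSizePolicy
```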
====Workaround====
Restart the affected JVM when the GC starts misbehaving.
Immediate actions:
* [] T231516: Alert when a JVM exceeds 100 old GC operations/hour