The 99th percentile of Elasticsearch response time is significantly higher than the 95th percentile. Each server has a full GC every hour, with a pause time of between 1 and 1.5 seconds. With 32 servers, this means roughly one full GC every 2 minutes somewhere in the cluster. So the GC pauses could explain the response times we see. It makes sense to try to optimize them.
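The arithmetic behind that estimate can be checked with a small sketch. It uses only the figures quoted above (32 servers, one full GC per server per hour, pauses of 1 to 1.5 seconds); the function names are made up for illustration. It assumes pauses on different servers never overlap, and it ignores that a single query may fan out to shards on several servers, so it only gives a rough order of magnitude.

```python
# Figures taken from the paragraph above; the pause is given as a range.
SERVERS = 32
FULL_GCS_PER_SERVER_PER_HOUR = 1
PAUSE_SECONDS_RANGE = (1.0, 1.5)


def seconds_between_cluster_full_gcs(servers: int, gcs_per_hour: float) -> float:
    """Average time between two full GCs anywhere in the cluster."""
    return 3600.0 / (servers * gcs_per_hour)


def paused_time_fraction(servers: int, gcs_per_hour: float, pause_seconds: float) -> float:
    """Fraction of wall-clock time during which some server is in a full GC.

    Assumes pauses on different servers never overlap, so this is an
    upper bound on the real fraction.
    """
    return servers * gcs_per_hour * pause_seconds / 3600.0


interval = seconds_between_cluster_full_gcs(SERVERS, FULL_GCS_PER_SERVER_PER_HOUR)
print(f"one full GC every {interval:.1f} s across the cluster")

for pause in PAUSE_SECONDS_RANGE:
    fraction = paused_time_fraction(SERVERS, FULL_GCS_PER_SERVER_PER_HOUR, pause)
    print(f"pause of {pause} s: some server is paused {fraction:.2%} of the time")
```

Under these assumptions some server is paused on the order of 1% of the time, which is at least consistent with the pauses showing up at the 99th percentile but not at the 95th.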
Ideas to investigate:
- moving to G1: strongly discouraged by Elastic, as it might lead to data corruption, see Garbage First
- reducing heap size: we seem to have less than 10 GB of long-lived objects, but a configured heap size of 30 GB; reducing the heap size might give us shorter but more frequent full GCs
- tuning the new / old ratio: increasing the size of the young generation might help keep objects out of the old generation and reduce the number of full GCs (this is borderline black magic and very much needs to be validated in testing)
- aggressive timeouts on queries: Elasticsearch supports client timeouts and will return partial results if some shards do not answer fast enough. For the completion suggester, where the median response time is 20 ms and users are probably not waiting for more than 100 ms, this probably makes sense. Note that this might hide other issues, so we need some metrics on partial results to track potential problems.
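
The timeout idea, together with the metric on partial results, could be sketched as follows. This is a minimal illustration and not our actual client code: it only builds a search request body with the `timeout` parameter and inspects the `timed_out` and `_shards` fields of a response. The helper names, the suggestion name `completion-suggestion`, the field name `suggest`, and the in-memory counter are all assumptions; whether a 100 ms budget is right needs to be validated in testing.

```python
from collections import Counter

# Hypothetical in-memory metric; a real setup would report to the
# existing metrics system instead.
partial_result_counter = Counter()


def build_suggest_request(prefix: str, timeout_ms: int = 100,
                          field: str = "suggest", size: int = 10) -> dict:
    """Build a search body for a completion suggester with a timeout.

    The field name and suggestion name are placeholders.
    """
    return {
        "timeout": f"{timeout_ms}ms",
        "suggest": {
            "completion-suggestion": {
                "prefix": prefix,
                "completion": {"field": field, "size": size},
            }
        },
    }


def is_partial(response: dict) -> bool:
    """True if the response timed out or reports failed shards."""
    shards = response.get("_shards", {})
    return bool(response.get("timed_out")) or shards.get("failed", 0) > 0


def record_response(response: dict) -> None:
    """Count partial versus complete responses so partials stay visible."""
    partial_result_counter["partial" if is_partial(response) else "complete"] += 1


# Usage with hand-written sample responses (not real cluster output).
record_response({"timed_out": False, "_shards": {"total": 5, "successful": 5, "failed": 0}})
record_response({"timed_out": True, "_shards": {"total": 5, "successful": 5, "failed": 0}})
print(build_suggest_request("elast"))
print(dict(partial_result_counter))
```

Tracking the ratio of partial to complete responses over time would show whether the timeout is only trimming GC-induced outliers or is masking a wider slowdown.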