Investigate performance differences between elastic2037-2054 and 2055-2086
Closed, Resolved (Public)

Description

In IRC, @EBernhardson mentioned that:

  • elastic2037-2054 are showing elevated disk utilization (> 50%), whereas elastic2055-2086 are showing ~15%.
  • Per-node latency metrics show elastic2037-2054 averaging p95 latencies of ~200ms, whereas elastic2055+ are seeing ~150ms.

Since high load on elastic2044 was a contributing factor to a recent Elasticsearch outage, I'm creating this ticket to:

  • Investigate the performance differences between the two host ranges.
  • Take steps to normalize and/or optimize performance, if necessary.

Event Timeline

@RKemper noted that the difference here is the available memory: everything < 2055 has 128G of memory, while everything >= 2055 has 256G. The additional memory is doing exactly what we expected of it, reducing the need to go out to disk while serving typical query loads. It seems unlikely we will be able to do anything about this; it looks like a fundamental difference in hardware.
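
For anyone wanting to double-check the memory split per node, here is a minimal sketch, assuming the input is the parsed JSON from the Elasticsearch node stats API (`GET /_nodes/stats/os`), which reports `os.mem.total_in_bytes` for each node. The function name `group_nodes_by_memory` and the sample payload (node IDs, exact byte counts) are hypothetical and only for illustration; real hosts will report slightly less than their nominal RAM.

```python
from collections import defaultdict

GIB = 1024 ** 3


def group_nodes_by_memory(nodes_stats):
    """Group node names by total RAM, rounded to the nearest GiB.

    `nodes_stats` is assumed to have the shape of a parsed
    `GET /_nodes/stats/os` response: {"nodes": {<id>: {"name": ..., "os": {...}}}}.
    """
    groups = defaultdict(list)
    for node in nodes_stats["nodes"].values():
        total_bytes = node["os"]["mem"]["total_in_bytes"]
        groups[round(total_bytes / GIB)].append(node["name"])
    # Sort by memory size, and sort names within each group, for stable output.
    return {size: sorted(names) for size, names in sorted(groups.items())}


# Hypothetical sample payload, not real cluster output.
sample = {
    "nodes": {
        "id-a": {"name": "elastic2044", "os": {"mem": {"total_in_bytes": 128 * GIB}}},
        "id-b": {"name": "elastic2055", "os": {"mem": {"total_in_bytes": 256 * GIB}}},
        "id-c": {"name": "elastic2037", "os": {"mem": {"total_in_bytes": 128 * GIB}}},
    }
}

print(group_nodes_by_memory(sample))
```

If the grouping lines up with the host ranges above, that supports the explanation that the disk utilization gap comes from how much of the index fits in memory rather than from configuration drift.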

bking claimed this task.
bking moved this task from Incoming to Done on the Data-Platform-SRE board.

Upon further review, it looks like we have confirmed the reason for the performance differences. I don't think we need to change anything (such as tiering hosts), so I'm going to close this out. Thanks everyone for taking a look!