
[Epic] Longer term plan for increases in Elasticsearch cluster disk IO
Closed, Duplicate · Public


As a user of Search, I want my queries to complete successfully in a timely manner.

As investigated in T264053, IO load has increased to the point where queries are dropped. Some short-term mitigations have been put in place, but we need a longer-term strategy to deal with this.


  • The Search Elasticsearch cluster is robust in terms of IO for the next year

Event Timeline

To really have a plan, I would like to know the trajectory of our hot set. Unfortunately, I'm not sure how to measure the hot set directly. The only data points I can think of at the moment are the times when we ran out of IO, estimating from there. Also, these "total cache" numbers aren't exactly the cache size; rather, they estimate the referenced mmap pages (versus loaded but unreferenced ones) at the point when our servers started to have issues.

July 2017 cache estimate:

  • No direct history of memory.
  • We started running the extra clusters after this, so estimate as current + 8G to account for second elastic process
  • That gives 88G buff/cache
  • page-types reported 30G of pages loaded but not referenced
  • 58G/server of useful cache

Sept 2020 cache estimate:

  • top reports 50G used, 80G buff/cache
  • page-types reports 11G of pages loaded but not referenced
  • 70G/server of useful cache

Oct 2020 cache estimate (post-mitigation):

  • top reports 50G used, 80G buff/cache
  • page-types reports 3G of pages loaded but not referenced
  • 77G/server of useful cache
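The arithmetic behind these per-server and fleet-wide estimates is simple subtraction; a minimal sketch (the 35-server count comes from the table below; the Sept 2020 figure of 69G is rounded to 70G in the text):

```python
def useful_cache(buff_cache_g, unreferenced_g, n_servers=35):
    # useful cache = buff/cache reported by top, minus mmap pages that
    # are loaded but were never referenced (reported by the kernel's
    # page-types tool)
    per_server = buff_cache_g - unreferenced_g
    return per_server, per_server * n_servers

print(useful_cache(88, 30))  # July 2017 -> (58, 2030)
print(useful_cache(80, 11))  # Sept 2020 -> (69, 2415), ~70G in the text
print(useful_cache(80, 3))   # Oct 2020  -> (77, 2695)
```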
| Date | Servers | Est. cache per server | Total cache |
|------|---------|-----------------------|-------------|
| July 2017 | 35 | 58G | 2,030 G |
| Sept 2020 | 35 | 70G | 2,450 G (+21%) |
| Oct 2020* | 35 | 77G | 2,695 G (+10%) |

(* post-mitigation)

If these estimates are right, then simply getting to the point where we replace the current 128G instances with 256G instances will be sufficient. At that point our available cache balloons to 7,175 GB. I'm not sure we can make it to that point, though. The recent readahead mitigation should have increased available cache from 2,450G to 2,695G. A previous increase of 21% bought us 3 years of growth; on the same timeline we might get 12-18 months of runway from the current adjustments (but I still think we likely need hardware next FY; exactly what, TBD).
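A back-of-the-envelope version of that runway estimate, assuming cache demand grows at a constant annual rate (an assumption, since we only have two usable data points):

```python
import math

# If a +21% cache increase absorbed roughly 3 years of growth, the
# implied annual growth rate, and the runway bought by the +10%
# readahead mitigation, are:
annual_growth = 1.21 ** (1 / 3) - 1                          # ~6.6%/year
runway_years = math.log(1.10) / math.log(1 + annual_growth)  # ~1.5 years
print(f"{annual_growth:.1%}/yr -> ~{runway_years * 12:.0f} months of runway")
```

That lands at the upper end of the 12-18 month range quoted above.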

Perhaps a silly thought, but one problem we have is that it's hard to directly measure how much cache memory we need. Since swap is turned off on these machines, we can simulate having less memory by having some other application simply allocate memory (e.g. have Python create a 10GB string). We could start at a low value and increase a couple of GB at a time until IO starts climbing. That would give us a direct measure of how much memory was required (at that moment in time).
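A hypothetical sketch of that ballooning probe; `io_is_climbing` is a placeholder the caller would implement (e.g. by watching iostat or node exporter metrics), and the step size and settle time are guesses:

```python
import time

def probe_cache_headroom(io_is_climbing, step_gb=2, settle_s=300):
    balloons = []
    while not io_is_climbing():
        # multiplying a bytes literal writes every byte, so the pages
        # are actually committed; with swap off they stay resident and
        # shrink the page cache
        balloons.append(b"x" * int(step_gb * 1024**3))
        time.sleep(settle_s)  # let cache eviction and IO settle
    return len(balloons) * step_gb  # GB pinned before IO started climbing
```

Keeping the allocations in `balloons` stops them from being garbage-collected mid-probe; dropping the list returns the memory when the measurement is done.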

Or maybe there is a much simpler and more direct method?

Gehel renamed this task from Longer term plan for increases in Elasticsearch cluster disk IO to [Epic] Longer term plan for increases in Elasticsearch cluster disk IO. Oct 12 2020, 3:20 PM
Gehel added a project: Epic.
Gehel moved this task from needs triage to [epic] on the Discovery-Search board.