If daily traffic grows to the point that we are frequently hitting 90-95% disk utilization, I would suggest that the first tuning we do is selectively reducing the replica count from 2 to 1. Right now each index is present on every data host in the cluster. If we instead kept the data for a given index on only 2/3 of the cluster, we would still be fairly robust against individual node failure and would regain quite a bit of disk on each node.
I would suggest phasing in this decrease in redundancy slowly, as space is needed. For the first pass we could drop the replica count for indices covering days N-22 to N-30, and as we outgrew that, start dropping it sooner and sooner until we only kept full copies for days N and N-1. That would give us quite a bit of headroom on the current hardware and should easily carry us forward until we can budget for adding nodes/disk to the cluster in the next fiscal year.
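As a rough illustration of the phased approach, here is a minimal sketch that selects the daily indices old enough to have their replica count dropped. It assumes the standard `logstash-YYYY.MM.DD` index naming; the helper name, the 21-day default, and the selection logic are illustrative, not an existing tool. The actual change would be a one-line call to the Elasticsearch index settings API per selected index.

```python
from datetime import date, timedelta

def indices_to_reduce(index_names, today, keep_full_days=21):
    """Return the logstash-YYYY.MM.DD indices older than keep_full_days.

    Hypothetical helper: keep replicas=2 for days N..N-21, drop to
    replicas=1 for anything older. The real cluster would apply the
    change per index via the settings API, e.g.
    PUT /<index>/_settings  {"index": {"number_of_replicas": 1}}
    """
    cutoff = today - timedelta(days=keep_full_days)
    old = []
    for name in index_names:
        if not name.startswith("logstash-"):
            continue  # skip indices that do not follow the daily pattern
        try:
            y, m, d = (int(p) for p in name[len("logstash-"):].split("."))
            day = date(y, m, d)
        except ValueError:
            continue  # malformed date suffix; leave the index alone
        if day < cutoff:
            old.append(name)
    return old

if __name__ == "__main__":
    names = ["logstash-2015.10.01", "logstash-2015.10.25", "not-an-index"]
    print(indices_to_reduce(names, date(2015, 10, 30)))
```

In practice something like elasticsearch-curator could automate this kind of age-based replica adjustment on a cron schedule, rather than hand-rolling the selection.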
|Resolved||bd808||T113571 Logstash elasticsearch cluster filled up, dropping logstash events|
|Declined||bd808||T117438 Reduce replica count from 2 to 1 for indices that are >21 days old|