During cluster recovery (after a node failure, or during planned cluster restarts), Elasticsearch moves shards around, which increases I/O. Recoveries are currently throttled at 40 MB/s. Even with this relatively low additional I/O, we see a significant increase in disk utilization (reported as high as 20%) and an increase in Elasticsearch response time. The specs for our SSDs indicate that we should be able to get more IOPS than we currently see.
As an example, elastic2006 saw increased activity around 2016-12-12 16:15 UTC:
- IOPS climb to ~4k per disk, while disk utilization climbs to ~50%
- over the same period, disk writes climb to ~40 MB/s (the throttling limit) and average query latency climbs from ~20 ms to ~40-50 ms
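
For reference, here is a minimal sketch of how the recovery throttle could be adjusted while investigating. It assumes the 40 MB/s limit corresponds to the `indices.recovery.max_bytes_per_sec` cluster setting (this task does not name the setting), and the host URL and helper names are hypothetical. The script only builds the request and prints it; nothing is sent.

```python
import json
import urllib.request

# Assumption: the recovery throttle discussed above is controlled by this
# cluster setting. The task text does not name it explicitly.
RECOVERY_SETTING = "indices.recovery.max_bytes_per_sec"


def recovery_throttle_body(limit: str) -> dict:
    """Build a transient cluster-settings body for the recovery throttle."""
    return {"transient": {RECOVERY_SETTING: limit}}


def build_request(base_url: str, limit: str) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to the cluster settings endpoint."""
    body = json.dumps(recovery_throttle_body(limit)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host; replace with a real cluster endpoint before sending.
    req = build_request("http://localhost:9200", "40mb")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A transient setting is used so the change does not survive a full cluster restart, which seems appropriate for an experiment. To actually send the request, it could be passed to `urllib.request.urlopen`.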