If we look at the graph of high lag for the last 30 days,
it is clear that wdqs1003 has the most lag problems. This is odd: the three servers 1003, 1004 and 1005 should be sharing query load and updates equally, so any problem with either query load or update load should be evenly distributed across them.
However, 1003 has problems more often, and they are usually more severe (higher lag, longer duration). It could be:
- 1003 has some hardware/software issue that makes it slower
- our load balancing is not balanced between hosts
- some other reason?
- just a coincidence?
I think we need to investigate this and see whether there is an underlying issue we can fix.
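
As a possible starting point for telling a real imbalance from a coincidence, the per-host lag samples could be summarized side by side. The sketch below is only an illustration: the function name `summarize_lag`, the `(host, timestamp, lag)` tuple layout, and the 600-second threshold are all assumptions, not something taken from our monitoring setup, and the sample values are synthetic.

```python
from collections import defaultdict


def summarize_lag(samples, threshold=600):
    """Summarize lag per host.

    samples: iterable of (host, timestamp_seconds, lag_seconds) tuples
             (hypothetical layout; adapt to whatever the metrics export gives).
    threshold: lag in seconds above which a sample counts as "high lag"
               (600 is an arbitrary assumption).

    Returns {host: {"episodes": int, "samples_over": int, "max_lag": number}},
    where an episode is a run of consecutive samples above the threshold.
    """
    stats = defaultdict(lambda: {"episodes": 0, "samples_over": 0, "max_lag": 0})
    in_episode = defaultdict(bool)

    # Process samples in time order so consecutive high samples form one episode.
    for host, _ts, lag in sorted(samples, key=lambda s: s[1]):
        host_stats = stats[host]
        host_stats["max_lag"] = max(host_stats["max_lag"], lag)
        if lag > threshold:
            host_stats["samples_over"] += 1
            if not in_episode[host]:
                host_stats["episodes"] += 1
                in_episode[host] = True
        else:
            in_episode[host] = False

    return dict(stats)


if __name__ == "__main__":
    # Synthetic values for illustration only, not real measurements.
    demo = [
        ("wdqs1003", 0, 100), ("wdqs1003", 60, 900), ("wdqs1003", 120, 50),
        ("wdqs1004", 0, 100), ("wdqs1004", 60, 200), ("wdqs1004", 120, 50),
    ]
    for host, summary in sorted(summarize_lag(demo).items()):
        print(host, summary)
```

If 1003 shows consistently more episodes and higher maximum lag than 1004 and 1005 over several separate time windows, coincidence becomes less likely, and comparing per-host request counts over the same windows would help separate the load-balancing hypothesis from a host-specific problem.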