At the moment, cumin provides batching with sliding windows. That is great, but batching treats the targets as a single uniform set of nodes. Making batching cluster-aware would allow more parallelism wherever the targets are split into independent clusters.
For example: WDQS has 4 clusters (internal / public, eqiad / codfw). Each cluster is uniform, and the clusters are independent of each other. When doing rolling restarts, we don't want to restart more than one node per cluster at a time, but we can parallelize restarts across clusters. At the moment, we limit the batch size to 1 to ensure that no more than one node is restarted at a time, which is a stricter constraint than we actually need. We could probably reduce restart time by up to a factor of 4 by parallelizing across clusters.
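To make the idea concrete, here is a minimal sketch of what cluster-aware batching could look like. It is plain Python and does not use cumin's API: the function name `cluster_aware_batches`, the `per_cluster_limit` parameter, and the host names are all hypothetical. It also proceeds in lockstep (each batch must finish before the next starts) rather than with a sliding window, which a real implementation would probably want per cluster.

```python
def cluster_aware_batches(nodes_by_cluster, per_cluster_limit=1):
    """Yield batches with at most `per_cluster_limit` nodes from each cluster.

    `nodes_by_cluster` maps a cluster name to its list of nodes.
    Hypothetical helper for illustration only, not part of cumin.
    """
    # Copy the lists so the caller's data is not modified.
    queues = {name: list(nodes) for name, nodes in nodes_by_cluster.items()}
    while any(queues.values()):
        batch = []
        for queue in queues.values():
            # Take the next node(s) from this cluster, if any are left.
            batch.extend(queue[:per_cluster_limit])
            del queue[:per_cluster_limit]
        yield batch


# Placeholder host names, not the real WDQS hosts.
clusters = {
    "internal-eqiad": ["ie-1", "ie-2"],
    "internal-codfw": ["ic-1", "ic-2"],
    "public-eqiad": ["pe-1", "pe-2", "pe-3"],
    "public-codfw": ["pc-1"],
}

for batch in cluster_aware_batches(clusters):
    print(batch)
```

With these placeholder clusters, the 8 nodes are covered in 3 batches instead of the 8 steps needed with a global batch size of 1. The number of batches is driven by the largest cluster, so the factor of 4 is only reached when all clusters have the same size.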
Implementing this correctly needs some thought.