We are running into the limits of our Elasticsearch architecture: we are effectively "full" on indices and can't really create more. Our systems are already past the baselines; we had to raise the default master timeout from 5s to 30s so that the daily creation of completion suggesters doesn't fail. An evaluation of adding more indices to the cluster in T192972 showed that the cluster had problems placing indices, even when they were empty.
High-level solution:
* Run two JVMs per node in separate clusters
* One large JVM for wikis with shards > 100M
* One small JVM for the remaining wikis
* The small JVMs are to be split into two clusters of ~17 nodes each.
* We can almost certainly shrink the large JVMs from their current 30G to some smaller number.
* Estimating the small JVMs at 6G: if we can shave a couple of G from the large JVMs, there should be very little impact on disk cache availability.
Looking at our data sizes, roughly 600 primary shards would go to the large JVMs, and 3000 primary shards would be split between the two small clusters at 1500 shards each. This gets our cluster sizes back into manageable ranges and reopens the option of adding new indices when that is the right solution to a problem.
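As a back-of-the-envelope check of the numbers above, here is a minimal sketch. The 30G, 6G, 600 and 3000 figures come from this task; the 28G large heap is only an assumed reading of "a couple of G", and the cluster names are placeholders.

```python
# Figures from this task, except LARGE_HEAP_AFTER_GB, which is an assumption.
CURRENT_HEAP_GB = 30       # current heap of the single JVM per node
SMALL_HEAP_GB = 6          # estimated heap of the small JVM
LARGE_HEAP_AFTER_GB = 28   # ASSUMPTION: "a couple of G" shaved off the large JVM

LARGE_PRIMARIES = 600          # primary shards going to the large cluster
SMALL_PRIMARIES_TOTAL = 3000   # primary shards split across the small clusters
SMALL_CLUSTER_COUNT = 2


def heap_delta_gb(current: int, large_after: int, small: int) -> int:
    """Extra heap per node after the split; this memory is no longer
    available to the OS disk cache."""
    return (large_after + small) - current


def primaries_per_cluster() -> dict:
    """Primary shard count per cluster, with the small total split evenly."""
    per_small = SMALL_PRIMARIES_TOTAL // SMALL_CLUSTER_COUNT
    counts = {"large": LARGE_PRIMARIES}
    for i in range(SMALL_CLUSTER_COUNT):
        counts[f"small-{i}"] = per_small  # placeholder cluster names
    return counts


if __name__ == "__main__":
    delta = heap_delta_gb(CURRENT_HEAP_GB, LARGE_HEAP_AFTER_GB, SMALL_HEAP_GB)
    print(f"extra heap per node: {delta}G")
    print(primaries_per_cluster())
```

Under that assumption each node gives up 4G of disk cache to heap; shrinking the large JVMs further reduces the cost accordingly.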
* sister-wikis should be entirely within a single cluster
* commonswiki search will need some special considerations
* OtherIndex has to write to a different cluster at times
* We need configuration that assigns small wikis and sister wikis to the appropriate places without spelling out each and every wiki. Or maybe we do spell it out with a dblist?
* This certainly adds operational complexity
* Probably more
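One way the assignment question above could be approached without listing every wiki is to derive the cluster from per-wiki metadata. The sketch below is purely illustrative: `WikiInfo`, `family_key`, the cluster names, and the reading of "sister wikis in a single cluster" as "all small wikis sharing a family key land in the same small cluster" are assumptions, not existing configuration. The unit of the 100M threshold is left as stated in this task.

```python
import zlib
from typing import Dict, NamedTuple


class WikiInfo(NamedTuple):
    # Hypothetical metadata; neither field name comes from existing config.
    family_key: str       # key shared by sister wikis, e.g. a language code
    max_shard_size: int   # largest shard, in the same unit as the threshold


LARGE_THRESHOLD = 100_000_000            # "100M" from this task; unit unspecified
LARGE_CLUSTER = "large"                  # placeholder name
SMALL_CLUSTERS = ("small-a", "small-b")  # placeholder names


def small_cluster_for(family_key: str) -> str:
    """Pick a small cluster from a stable hash of the family key, so that
    every small wiki with the same key lands in the same cluster."""
    idx = zlib.crc32(family_key.encode("utf-8")) % len(SMALL_CLUSTERS)
    return SMALL_CLUSTERS[idx]


def assign_clusters(wikis: Dict[str, WikiInfo]) -> Dict[str, str]:
    """Map each wiki name to a cluster name."""
    assignment = {}
    for name, info in wikis.items():
        if info.max_shard_size > LARGE_THRESHOLD:
            assignment[name] = LARGE_CLUSTER
        else:
            assignment[name] = small_cluster_for(info.family_key)
    return assignment


if __name__ == "__main__":
    # Made-up sizes, for illustration only.
    example = {
        "enwiki": WikiInfo("en", 500_000_000),
        "enwiktionary": WikiInfo("en", 40_000_000),
        "enwikibooks": WikiInfo("en", 2_000_000),
        "frwikibooks": WikiInfo("fr", 1_000_000),
    }
    for wiki, cluster in sorted(assign_clusters(example).items()):
        print(wiki, "->", cluster)
```

A hash keeps the configuration short but does nothing to balance shard counts between the two small clusters; an explicit dblist per cluster trades brevity for that control.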