Currently it is just 'Number of instances of class > unique number of hosts on which the instances are hosted', which is terrible. Get a better metric.
I propose the use of entropy to measure uniformity. See http://stats.stackexchange.com/questions/66935/measure-for-the-uniformity-of-a-distribution for a discussion.
>>> from math import log
>>>
>>> def entropy(counts, epsilon=0.01):
...     total = sum(counts)
...     props = [c/total for c in counts]
...     return sum(p * log(1/max(p, epsilon))
...                for p in props)
...
>>> uniform = [3,3,3,3,3,3,3]
>>> non_uniform = [0,2,0,5,0,2,12]
>>> really_non_uniform = [0,0,0,0,0,1,20]
>>>
>>> entropy(uniform)
1.945910149055313
>>> entropy(non_uniform)
1.1093482433488377
>>> entropy(really_non_uniform)
0.19144408195771734
If you want to know how much damage a single host going down would cause, then I propose to measure that directly.
>>> def max_downage(counts):
...     return max(counts)/sum(counts)
...
>>> max_downage(uniform)
0.14285714285714285
>>> max_downage(non_uniform)
0.5714285714285714
>>> max_downage(really_non_uniform)
0.9523809523809523
It also depends on the kind of host, I think. For failover services, the question is 'how many virt hosts need to go down to make us unreachable', while for exec nodes, I'd ask 'how much of our computing power do we lose if a single virt host goes down'.
So, for failover services, I'd just check that
- N_hosts > a given number, and
- each host is on a different virt host
For other services, I'd measure
max(number of hosts per virt host) / total number of hosts
(...which is what @Halfak already suggested because I was too slow in typing)
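A minimal sketch of the failover check could look like this; the names are mine, not anything decided yet: counts is the number of instances per virt host, and min_hosts is the required minimum.

>>> def failover_ok(counts, min_hosts=2):
...     # N_hosts > a given number, and no two instances
...     # share a virt host (every per-virt-host count is 1)
...     return sum(counts) > min_hosts and max(counts) <= 1
...
>>> failover_ok([1, 1, 1])
True
>>> failover_ok([1, 1])
False
>>> failover_ok([1, 2])
False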
I iterated a bit more with @yuvipanda. See my notes here: https://gist.github.com/halfak/9c991183166fb8a06760
We came up with a simple approach:
if sum(counts) >= 2 and max(counts)/sum(counts) >= 0.5:
    alert()
This ensures that we don't alert when we have only one instance. It also makes sure that no single host going down can take out 50% or more of the instances without alerting first.
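As a sanity check, here is a self-contained version of that rule against the example distributions from above (returning a boolean instead of calling alert(), which is just for illustration):

>>> def should_alert(counts):
...     # at least two instances, and one virt host holds >= 50% of them
...     return sum(counts) >= 2 and max(counts)/sum(counts) >= 0.5
...
>>> should_alert([1])              # only one instance: never alert
False
>>> should_alert([3,3,3,3,3,3,3])  # uniform: max share is 1/7
False
>>> should_alert([0,2,0,5,0,2,12]) # max share is 12/21 ~ 0.57
True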