
Improve algorithm that detects 'spreadiness' of Tool Labs instances on Labs Hosts
Closed, Resolved, Public


Currently the check is just 'number of instances of a class > number of unique hosts on which those instances are hosted', which is a terrible metric. Find a better one.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Toolforge.
yuvipanda added subscribers: yuvipanda, Joe, BBlack, chasemp.

I propose the use of entropy to measure uniformity. See for a discussion.

>>> from math import log
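>>> def entropy(counts, epsilon=0.01):
...   total = float(sum(counts))  # float, so the division is not truncated on Python 2
...   props = [c/total for c in counts]
...   # clamp p at epsilon so that log(1/p) stays finite for empty hosts (p == 0)
...   return sum(p * log(1/max(p, epsilon))
...              for p in props)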
>>> uniform = [3,3,3,3,3,3,3]
>>> non_uniform = [0,2,0,5,0,2,12]
>>> really_non_uniform = [0,0,0,0,0,1,20]
>>> entropy(uniform)
>>> entropy(non_uniform)
>>> entropy(really_non_uniform)

If you want to know how much damage a single host going down would cause, then I propose to measure that directly.

>>> def max_downage(counts):
...   # fraction of all instances lost if the most heavily loaded host goes down
...   return max(counts)/float(sum(counts))
>>> max_downage(uniform)
>>> max_downage(non_uniform)
>>> max_downage(really_non_uniform)

It also depends on the kind of host, I think. For failover services, the question is 'how many virt hosts need to go down to make us unreachable', while for exec nodes, I'd ask 'how much of our computing power do we lose if a single virt host goes down'.

So, for failover services, I'd just check that

  • N_hosts > given number, and
  • each host is on a different virt host

For other services, I'd measure
max(number of hosts per virt host) / total number of hosts

(...which is what @Halfak already suggested, because I was too slow in typing.)
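The two checks described above can be sketched as follows. This is only an illustration under assumptions: the `placement` mapping (instance name to virt host name), the function names `failover_ok` and `max_loss_fraction`, and the `threshold` parameter are all hypothetical and not taken from any existing Labs tooling.

```python
from collections import Counter


def failover_ok(placement, threshold):
    """Failover check: more than `threshold` instances, each on its own virt host.

    `placement` is a hypothetical dict mapping instance name -> virt host name.
    """
    virt_hosts = list(placement.values())
    enough_instances = len(virt_hosts) > threshold
    all_distinct = len(set(virt_hosts)) == len(virt_hosts)
    return enough_instances and all_distinct


def max_loss_fraction(placement):
    """Fraction of instances lost if the most heavily loaded virt host goes down."""
    per_virt_host = Counter(placement.values())
    if not per_virt_host:
        return 0.0  # no instances, so nothing can be lost
    return max(per_virt_host.values()) / sum(per_virt_host.values())


# Made-up example placements (names are illustrative only).
proxies = {"proxy-01": "virt1", "proxy-02": "virt2"}
exec_nodes = {"exec-01": "virt1", "exec-02": "virt1",
              "exec-03": "virt2", "exec-04": "virt3"}

print(failover_ok(proxies, threshold=1))
print(max_loss_fraction(exec_nodes))
```

Keeping the two checks separate mirrors the distinction made above: failover services care about whether any virt host is shared at all, while exec nodes care about the share of capacity that sits on one virt host.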

I iterated a bit more with @yuvipanda. See my notes here:

We came up with a simple approach:

if sum(counts) >= 2 and max(counts)/float(sum(counts)) >= 0.5:

This ensures that we don't alert when we have only one instance. It also ensures that no single host going down can take out 50% or more of the instances without us having been alerted.
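To make the edge cases of this condition easier to see, here is a minimal sketch that wraps it in a function. The name `should_alert` is hypothetical, and `counts` is assumed to be a list of instance counts per host, as in the earlier snippets; Python 3 division is assumed.

```python
def should_alert(counts):
    """Return True when one host carries half or more of at least two instances."""
    total = sum(counts)
    # The `total >= 2` test short-circuits, so an empty or all-zero list
    # never reaches max() or the division.
    return total >= 2 and max(counts) / total >= 0.5


examples = {
    "single instance": [1],
    "two instances, two hosts": [1, 1],
    "three instances, three hosts": [1, 1, 1],
    "uniform": [3, 3, 3, 3, 3, 3, 3],
    "really_non_uniform": [0, 0, 0, 0, 0, 1, 20],
}
for label, counts in examples.items():
    print(label, should_alert(counts))
```

One consequence worth noting: because the comparison is `>= 0.5`, two instances on two different hosts still trigger the alert, since losing either host takes out exactly half of them. Three or more evenly spread instances do not.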

valhallasw triaged this task as Medium priority. Jul 2 2015, 7:44 PM