Improve algorithm that detects 'spreadiness' of Tool Labs instances on Labs Hosts
Closed, Resolved. Public

Authored By

	yuvipanda
	Jun 8 2015, 5:09 PM

Description

Currently the check is just 'number of instances of a class > number of unique hosts on which those instances are hosted', which is terrible. Get a better metric.
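
For illustration, a minimal sketch of the check as described above, assuming the input is a list holding one virt host name per instance of a class. The function name and the host names are hypothetical, not taken from the actual check:

```python
def naive_spread_alert(instance_hosts):
    """Alert when there are more instances than distinct hosts.

    instance_hosts: hypothetical list with one virt host name per instance.
    """
    return len(instance_hosts) > len(set(instance_hosts))


# One instance per host: no alert.
print(naive_spread_alert(["virt-a", "virt-b", "virt-c"]))

# Mildly uneven and badly uneven placements both alert,
# and the check cannot tell them apart.
print(naive_spread_alert(["virt-a"] * 2 + ["virt-b"] * 5))
print(naive_spread_alert(["virt-a"] * 20 + ["virt-b"]))
```

The check fires as soon as any two instances share a host, and it says nothing about how concentrated the instances are.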

Related Objects

Mentioned In: T101635: Write an icinga check to ensure that toollabs instances are appropriately distributed across labvirt** hosts
Mentioned Here: rOPUP5a956bf836f1: cloud: rewrite spreadcheck.py NPRE check

Event Timeline

yuvipanda created this task. Jun 8 2015, 5:09 PM

yuvipanda raised the priority of this task from to Needs Triage.

yuvipanda updated the task description.

yuvipanda added a project: Toolforge.

yuvipanda added subscribers: yuvipanda, Joe, BBlack, • chasemp.

Restricted Application added a subscriber: Aklapper. Jun 8 2015, 5:09 PM

yuvipanda mentioned this in T101635: Write an icinga check to ensure that toollabs instances are appropriately distributed across labvirt** hosts. Jun 8 2015, 5:10 PM

I propose the use of entropy to measure uniformity. See http://stats.stackexchange.com/questions/66935/measure-for-the-uniformity-of-a-distribution for a discussion.

>>> from math import log
>>> 
>>> def entropy(counts, epsilon=0.01):
...   total = sum(counts)
...   props = [c/total for c in counts]
...   return sum(p * log(1/max(p, epsilon)) 
...              for p in props)
... 
>>> uniform = [3,3,3,3,3,3,3]
>>> non_uniform = [0,2,0,5,0,2,12]
>>> really_non_uniform = [0,0,0,0,0,1,20]
>>> 
>>> entropy(uniform)
1.945910149055313
>>> entropy(non_uniform)
1.1093482433488377
>>> entropy(really_non_uniform)
0.19144408195771734

If you want to know how much damage a single host going down would cause, then I propose to measure that directly.

>>> def max_downage(counts):
...   return max(counts)/sum(counts)
... 
>>> max_downage(uniform)
0.14285714285714285
>>> max_downage(non_uniform)
0.5714285714285714
>>> max_downage(really_non_uniform)
0.9523809523809523

It also depends on the kind of host, I think. For failover services, the question is 'how many virt hosts need to go down to make us unreachable', while for exec nodes, I'd ask 'how much of our computing power do we lose if a single virt host goes down'.

So, for failover services, I'd just check that

N_hosts > given number, and
each host is on a different virt host

For other services, I'd measure
max(number of hosts per virt host) / total number of hosts

(..which is what @Halfak suggested already because I was too slow in typing)
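
The two checks described above could be sketched roughly as follows. This assumes a mapping from instance name to the virt host it runs on; the function names, instance names, and host names are all hypothetical:

```python
from collections import Counter


def failover_ok(instance_to_virt, min_instances):
    """Failover services: more than min_instances instances,
    and every instance on a different virt host."""
    virt_hosts = list(instance_to_virt.values())
    enough = len(virt_hosts) > min_instances
    all_distinct = len(set(virt_hosts)) == len(virt_hosts)
    return enough and all_distinct


def max_loss_fraction(instance_to_virt):
    """Other services: largest share of instances on a single virt host."""
    if not instance_to_virt:
        return 0.0  # nothing running, nothing to lose
    per_virt = Counter(instance_to_virt.values())
    return max(per_virt.values()) / sum(per_virt.values())


proxies = {"proxy-01": "virt-a", "proxy-02": "virt-b"}
exec_nodes = {
    "exec-01": "virt-a",
    "exec-02": "virt-a",
    "exec-03": "virt-b",
    "exec-04": "virt-c",
}

print(failover_ok(proxies, min_instances=1))  # True
print(max_loss_fraction(exec_nodes))          # 0.5
```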

I iterated a bit more with @yuvipanda. See my notes here: https://gist.github.com/halfak/9c991183166fb8a06760

We came up with a simple approach:

if sum(counts) >= 2 and max(counts)/sum(counts) >= 0.5:
  alert()

This ensures that we don't alert when there is only one instance. It also ensures that no single host going down can take out 50% or more of the instances without an alert having been raised.
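
As a sanity check, here is the rule wrapped in a function and run against the example counts from earlier in the thread. The name `should_alert` and the `threshold` parameter are assumptions for illustration, and Python 3 division is assumed:

```python
def should_alert(counts, threshold=0.5):
    """counts: number of instances on each virt host."""
    total = sum(counts)
    # Short-circuits for 0 or 1 instances, so max() is never
    # called on an empty list and the division never hits zero.
    return total >= 2 and max(counts) / total >= threshold


examples = {
    "uniform": [3, 3, 3, 3, 3, 3, 3],
    "non_uniform": [0, 2, 0, 5, 0, 2, 12],
    "really_non_uniform": [0, 0, 0, 0, 0, 1, 20],
    "single_instance": [1],
    "two_hosts_one_each": [1, 1],
}

for name, counts in examples.items():
    print(name, should_alert(counts))
```

Only `uniform` and `single_instance` stay quiet. Note that with the rule exactly as written, two instances on two separate hosts also alert, since each host holds exactly 50% of the instances.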

valhallasw triaged this task as Medium priority. Jul 2 2015, 7:44 PM

Restricted Application added a project: Cloud-Services. Jul 2 2015, 7:44 PM

valhallasw moved this task from Backlog to Ready to be worked on on the Toolforge board. Jul 2 2015, 7:46 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:50 PM

rOPUP5a956bf836f1: cloud: rewrite spreadcheck.py NPRE check
