
non-wdqs VMs sometimes getting scheduled on wdqs hardware
Closed, Resolved, Public

Description

We have three special-purpose hypervisors, cloudvirt-wdqs100[1-3].eqiad.wmnet, which are reserved for wdqs workloads.

In theory, host aggregates prevent regular VMs from being scheduled there. In practice, this does not seem to be working properly.

Event Timeline

Every example I've seen of this involves a flavor that doesn't have aggregate_instance_extra_specs set. It would be nice if we had some way of detecting that, preventing that, or getting the scheduler to do something reasonable in that case.
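As a rough illustration of the detection idea, here is a minimal sketch that flags flavors with no aggregate-scoped extra spec. It is not the script from the patches below. It assumes flavors have already been fetched and are represented as plain dicts with `name` and `extra_specs` keys, and that aggregate constraints use the `aggregate_instance_extra_specs:` key prefix. The function name and the sample flavors are hypothetical.

```python
# Key prefix assumed to mark aggregate-scoped flavor extra specs.
AGGREGATE_PREFIX = "aggregate_instance_extra_specs:"


def flavors_missing_aggregate_specs(flavors):
    """Return the sorted names of flavors with no aggregate-scoped extra spec.

    `flavors` is an iterable of dicts shaped like
    {"name": str, "extra_specs": dict}; this shape is an assumption
    made for the sketch, not the real Nova API object.
    """
    missing = []
    for flavor in flavors:
        extra_specs = flavor.get("extra_specs") or {}
        has_aggregate_spec = any(
            key.startswith(AGGREGATE_PREFIX) for key in extra_specs
        )
        if not has_aggregate_spec:
            missing.append(flavor["name"])
    return sorted(missing)


# Hypothetical sample data for illustration only.
sample_flavors = [
    {"name": "standard-small",
     "extra_specs": {"aggregate_instance_extra_specs:standard": "true"}},
    {"name": "wdqs-large",
     "extra_specs": {"aggregate_instance_extra_specs:wdqs": "true"}},
    {"name": "legacy-medium", "extra_specs": {}},
    {"name": "oddball", "extra_specs": {"hw:cpu_policy": "shared"}},
]

problem_flavors = flavors_missing_aggregate_specs(sample_flavors)
```

A flavor such as `oddball`, which has extra specs but none scoped to an aggregate, is reported alongside flavors that have no extra specs at all.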

Change 619575 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack Nova: warn if any flavors are not assigned aggregates

https://gerrit.wikimedia.org/r/619575

Change 619575 merged by Andrew Bogott:
[operations/puppet@production] Openstack Nova: warn if any flavors are not assigned aggregates

https://gerrit.wikimedia.org/r/619575

Change 619599 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova flavor aggregate monitoring: increase timeout

https://gerrit.wikimedia.org/r/619599

Change 619599 merged by Andrew Bogott:
[operations/puppet@production] nova flavor aggregate monitoring: increase timeout

https://gerrit.wikimedia.org/r/619599

Change 621760 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs: amend the check-flavor-aggregates test to send an email

https://gerrit.wikimedia.org/r/621760

Change 621760 merged by Andrew Bogott:
[operations/puppet@production] wmcs: amend the check-flavor-aggregates test to send an email

https://gerrit.wikimedia.org/r/621760

Change 621768 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs check_flavor_aggregates.py: return CRITICAL rather than WARNING

https://gerrit.wikimedia.org/r/621768

Change 621768 merged by Andrew Bogott:
[operations/puppet@production] wmcs check_flavor_aggregates.py: return CRITICAL rather than WARNING

https://gerrit.wikimedia.org/r/621768

I don't think I can fix this comprehensively without hacking on nova, but now we have an alert that detects problematic flavors. I think that responding to those alerts as needed will prevent this issue from happening.
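For readers unfamiliar with how such an alert reports its state, here is a small sketch of mapping a list of problematic flavors to a Nagios-style plugin result, where exit code 2 means CRITICAL. It is an assumption-laden illustration, not the contents of check_flavor_aggregates.py. The function name, message wording, and flavor name are hypothetical.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def flavor_check_result(missing_flavors):
    """Map a list of flavor names lacking aggregate specs to (exit_code, message).

    Any problematic flavor yields CRITICAL rather than WARNING,
    mirroring the severity change described in the timeline.
    """
    if missing_flavors:
        names = ", ".join(sorted(missing_flavors))
        return CRITICAL, f"CRITICAL: flavors not assigned to an aggregate: {names}"
    return OK, "OK: all flavors are assigned to an aggregate"


# Hypothetical flavor name, for illustration only.
exit_code, message = flavor_check_result(["legacy-medium"])
print(message)
```

A real check would pass `exit_code` to `sys.exit()` so that the monitoring system picks up the state.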