
Implement nova host-aggregates
Closed, Resolved · Public

Description

Host aggregates are used to logically partition hypervisors into groups or availability zones that can be leveraged for advanced scheduling capabilities. This feature was first introduced in OpenStack Grizzly and continues to be supported in the latest release, Train[0].

Example use cases for grouping hypervisors into host aggregates:

  • Different hardware platforms (e.g. CPU type, disk, etc.)
  • Dedicated project equipment/servers
  • Restrict scheduling for specific admin workloads (e.g. maintenance, dev, stress testing)

To enable these use cases, key/value metadata is required on each host aggregate and on a matching flavor. The nova scheduler filter AggregateInstanceExtraSpecsFilter is responsible for matching flavor metadata in the aggregate_instance_extra_specs namespace to the correct host aggregate. This functionality is only visible to administrators; other than the additional flavor options, there are no changes for end users.
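
As a rough sketch of that matching (the aggregate name "maintenance", hypervisor cloudvirt1001, flavor g2.cores4.ram8.disk80 and the "pool" key below are placeholders, not our actual names; it also assumes AggregateInstanceExtraSpecsFilter is listed in the scheduler's enabled filters):

    # Create an aggregate, tag it with key/value metadata, and add a hypervisor
    openstack aggregate create maintenance
    openstack aggregate set --property pool=maintenance maintenance
    openstack aggregate add host maintenance cloudvirt1001

    # Matching flavor extra spec, scoped with the aggregate_instance_extra_specs namespace;
    # instances built from this flavor will only land on hosts in an aggregate
    # whose metadata matches pool=maintenance
    openstack flavor set --property aggregate_instance_extra_specs:pool=maintenance g2.cores4.ram8.disk80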

Today the Toolforge scheduling pool is controlled by a custom Nova scheduling filter[1]. This filter currently supports enabling or disabling a hypervisor from the default scheduling pool. It's a very simple scheduler that has been successful, but has some limitations.

Benefits of using host aggregates in place of the custom filter:

  • Updating the scheduling pool is dynamic and does not require any configuration[2] changes or service restarts.
  • Ability to depool hypervisors from end users only, without blocking admins' ability to schedule VMs (see the example below)
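
For illustration, depooling with aggregates is a single API call (the aggregate and host names below are placeholders), and an admin-only aggregate plus matching flavor can still be used to schedule onto a host that end users can no longer reach:

    # Depool a hypervisor from the end-user pool: effective immediately,
    # with no nova.conf change or nova-scheduler restart
    openstack aggregate remove host toolforge cloudvirt1013

    # Repool it later by adding it back
    openstack aggregate add host toolforge cloudvirt1013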

[0] https://docs.openstack.org/nova/latest/user/aggregates.html
[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/mitaka/nova/scheduler/scheduler_pool_filter.py
[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml#63

Event Timeline

This would be very good for the database servers and some other things.

I'm certainly in favor of replacing custom code with upstream code! In particular it seems like we'll need this in order to make live-migration work sensibly between different CPU-typed cloudvirts, right? I do have a few concerns:

  • Right now when a host is pooled or depooled we have a record of who did it and why (via code comments and commit messages). We've had some issues with the grid and k8s where nodes sit around depooled and no one knows why. If we have a live system for this, we'll need to be compulsive about logging every pool and depool.
  • Users (and staff in particular) often request their own private hypervisors. In general I like to discourage this because bigger pools make better use of resources and because it's very rare that a particular use case consumes a whole hypervisor. Host aggregates would provide better support for project-specific hardware but I'd prefer that we avoid promoting that use if at all possible.

I think that when @JHedden was explaining host aggregates to me recently he said that a given cloudvirt could be in multiple groups too so that we would still be able to mix workloads when that was appropriate. I agree that if we open the floodgates there are many projects that would love to have a whole cloudvirt to themselves, but that we can't sustain that so it should not be an easy process to qualify for.

> I'm certainly in favor of replacing custom code with upstream code! In particular it seems like we'll need this in order to make live-migration work sensibly between different CPU-typed cloudvirts, right?

Yeah, this would let the scheduler pick a hypervisor in the same aggregate. (Side note: unless we have a specific need for newer CPU flags, it's best to configure the same CPU model on all libvirt hosts.)
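
For example (a sketch with a placeholder instance ID, assuming the nova CLI's optional destination host argument is simply omitted so the scheduler chooses):

    # Live-migrate without naming a destination; the enabled scheduler filters
    # (including AggregateInstanceExtraSpecsFilter) pick a target host whose
    # aggregate metadata matches the instance's flavor extra specs
    nova live-migration <server-uuid>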

> I think that when @JHedden was explaining host aggregates to me recently he said that a given cloudvirt could be in multiple groups too so that we would still be able to mix workloads when that was appropriate. I agree that if we open the floodgates there are many projects that would love to have a whole cloudvirt to themselves, but that we can't sustain that so it should not be an easy process to qualify for.

That's correct. A cloudvirt can be a member of multiple aggregates within the same availability zone.
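
For example (placeholder aggregate and host names), a single hypervisor can be pooled into both the general pool and a maintenance aggregate:

    openstack aggregate add host toolforge cloudvirt1020
    openstack aggregate add host maintenance cloudvirt1020

    # Inspect membership and metadata
    openstack aggregate show toolforge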

Change 575540 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nova scheduler: disable the scheduling pool filter

https://gerrit.wikimedia.org/r/575540

Change 575541 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: remove the custom scheduler pool filter

https://gerrit.wikimedia.org/r/575541

Change 575540 merged by Andrew Bogott:
[operations/puppet@production] Nova scheduler: disable the scheduling pool filter

https://gerrit.wikimedia.org/r/575540

Change 575541 merged by Andrew Bogott:
[operations/puppet@production] nova: remove the custom scheduler pool filter

https://gerrit.wikimedia.org/r/575541

Andrew claimed this task.

The first pass at using Host Aggregates is now in place, documented at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Host_aggregates