HA for openstack services
Closed, Resolved, Public

Description

Right now we have cold spares for most of our openstack hosts. It should be fairly straightforward to set up active/active support for most or all of these. The (incomplete) upstream docs for this are here:

https://docs.openstack.org/ha-guide/

  • rabbitmq
  • active/active backend services (e.g. conductor).
  • nova
  • neutron
  • keystone
  • [] glance (requires shared storage)
  • HAproxy layer7 load balancing
  • Nova API and Metadata
  • Neutron server
  • Glance API and Registry
  • Keystone API
  • Run some failover tests
  • Adjust monitoring/paging to reflect that these are now redundant
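
The "run some failover tests" item could be sketched roughly as below. This is only an illustration under stated assumptions: the function `service_survives_single_failure`, the `probe` callable, and the host names `controller-a` and `controller-b` are hypothetical, and `fake_probe` stands in for a real health check against a real endpoint.

```python
from typing import Callable, Iterable, Set


def service_survives_single_failure(
    backends: Iterable[str],
    probe: Callable[[str, Set[str]], bool],
) -> bool:
    """Return True if the service stays reachable when any one backend is down.

    `probe(backend, down)` is a hypothetical callable that reports whether
    `backend` answers while the hosts in `down` are stopped.
    """
    backends = list(backends)
    if len(backends) < 2:
        # A single backend (or none) cannot survive losing one backend.
        return False
    for victim in backends:
        down = {victim}
        survivors = [b for b in backends if b not in down]
        if not any(probe(b, down) for b in survivors):
            return False
    return True


def fake_probe(backend: str, down: Set[str]) -> bool:
    # Stand-in for a real HTTP or AMQP health check (assumption):
    # a backend answers unless it is in the stopped set.
    return backend not in down
```

With two hypothetical controllers, `service_survives_single_failure(["controller-a", "controller-b"], fake_probe)` passes, while a single cold-spare style deployment with one active backend does not.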

Event Timeline

The docs say:

"To make nova-conductor highly available and fault tolerant, just launch more instances of the nova-conductor process, either on the same server or across multiple servers."

so that's good!

I also found "If you running two nova-scheduler processes they race each other, they don’t find out about each others choices until the DB gets updated by the nova-compute resource tracker." That's a little bit bad, but as I understand it, it only really matters in terms of overrunning quotas.
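
The race described in that quote can be illustrated with a toy model. This is a sketch, not how nova-scheduler is implemented: the `Request` class, `pick_host`, `schedule_concurrently`, the host names, and the use of free RAM as the contended resource are all assumptions made for the example. Each scheduler decides from the same stale snapshot, and the placements are only written back afterwards.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    ram_mb: int


def pick_host(view: Dict[str, int], req: Request) -> str:
    """Pick the host with the most free RAM according to `view`."""
    # sorted() makes tie-breaking deterministic (first name alphabetically).
    host = max(sorted(view), key=lambda h: view[h])
    if view[host] < req.ram_mb:
        raise RuntimeError("no host with enough free RAM")
    return host


def schedule_concurrently(
    free_ram: Dict[str, int], requests: List[Request]
) -> Dict[str, int]:
    """Model schedulers that each place one request from the same stale view."""
    snapshot = dict(free_ram)  # every scheduler reads this same snapshot
    placements = [(pick_host(snapshot, r), r) for r in requests]
    result = dict(free_ram)
    for host, req in placements:  # the shared state is updated only afterwards
        result[host] -= req.ram_mb
    return result


free = {"host1": 4096, "host2": 2048}
after = schedule_concurrently(free, [Request("vm-a", 3072), Request("vm-b", 3072)])
print(after)  # host1 ends up negative: both schedulers chose it
```

In this model both requests land on `host1` because neither scheduler sees the other's choice, which is the kind of overrun the quote warns about.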

See https://docs.openstack.org/designate/latest/admin/ha.html for designate HA. It looks pretty trivial: we can just run multiple instances of each service, except for the API, which needs a load balancer.
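
To show what "needs a load balancer" means for an API with several identical backends, here is a minimal round-robin sketch that skips unhealthy backends. The class `RoundRobinBalancer`, the `is_healthy` callable, and the backend names are hypothetical; a real deployment would more likely use an existing proxy such as HAProxy than hand-written code.

```python
from itertools import cycle
from typing import Callable, Iterator, List


class RoundRobinBalancer:
    """Rotate through backends, skipping any that fail the health check."""

    def __init__(self, backends: List[str], is_healthy: Callable[[str], bool]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._ring: Iterator[str] = cycle(self._backends)

    def next_backend(self) -> str:
        # Try each backend at most once per call, in rotation order.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backend available")


down = set()  # hypothetical record of backends that are currently stopped
balancer = RoundRobinBalancer(["api-a", "api-b"], lambda b: b not in down)
```

While both backends are healthy, requests alternate between them; once one is added to `down`, every request goes to the remaining backend.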

Change 511901 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] clouddb2001-dev: adjust firewall rules to allow access to both cloudcontrols.

https://gerrit.wikimedia.org/r/511901

Change 511901 merged by Andrew Bogott:
[operations/puppet@production] clouddb2001-dev: adjust firewall rules to allow access to both cloudcontrols.

https://gerrit.wikimedia.org/r/511901

Change 511950 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: make nova-conductor and nova-scheduler active/active

https://gerrit.wikimedia.org/r/511950

Andrew updated the task description.

Change 512081 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] designate: make designate nodes active/active

https://gerrit.wikimedia.org/r/512081

Change 512192 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] neutron: make the neutron api server ('neutron-server') active/active

https://gerrit.wikimedia.org/r/512192

Change 511950 merged by Andrew Bogott:
[operations/puppet@production] nova: make all services active/active

https://gerrit.wikimedia.org/r/511950

Change 512081 merged by Andrew Bogott:
[operations/puppet@production] designate: make designate nodes active/active

https://gerrit.wikimedia.org/r/512081

Change 512192 merged by Andrew Bogott:
[operations/puppet@production] neutron: make the neutron api server ('neutron-server') active/active

https://gerrit.wikimedia.org/r/512192

The RabbitMQ-based services are now active/active and seem to be working OK. I'm setting the API issues aside because we're blocked on naming (and certain naming decisions may mean that I have to build my own load balancer).

Change 512788 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: add firewall rules to acess from the nova controller

https://gerrit.wikimedia.org/r/512788

Change 512788 merged by Andrew Bogott:
[operations/puppet@production] keystone: add firewall rules to acess from the nova controller

https://gerrit.wikimedia.org/r/512788

Change 512789 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: make the api service active on both controller nodes

https://gerrit.wikimedia.org/r/512789

Change 512789 merged by Andrew Bogott:
[operations/puppet@production] keystone: make the api service active on both controller nodes

https://gerrit.wikimedia.org/r/512789

Andrew updated the task description.
bd808 triaged this task as Medium priority.
bd808 added a project: Goal.
bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.