Create a detailed migration plan for implementing Neutron as our OpenStack SDN layer
Closed, Resolved (Public)

Description

Continue the multi-quarter Neutron project by creating a detailed migration plan for replacing the current nova-network service with a more modern Neutron stack. This plan should include reasonably detailed steps needed to plan the actual implementation including changes that will need to be made both inside and outside of the Cloud Services environment. It should also identify contributions needed from other Foundation teams so that we can start talking to those teams in preparation for actually implementing the plan.

hardware status

Some of this hardware is a refresh, so we need to rebuild and retire the existing hosts; the rest is additional hardware from the 16/17 budget for second-region testing.

These hosts can be used for the new Neutron deployment

  • labtestmetal2001
  • labtestservices2002
  • labtestservices2003

in use

labtestn
  • labtestcontrol2003
    • control plane elements
      • neutron-server
      • keystone?
      • nova-conductor
      • nova-scheduler
      • glance
      • database server for openstack components
  • labtestneutron2001 (ha)
    • l3 elements. data plane.
      • neutron-l3-agent
      • neutron-l2-agent
  • labtestneutron2002 (ha)
    • l3 elements. data plane.
      • neutron-l3-agent
      • neutron-l2-agent
  • labtestvirt2003
    • hypervisor
      • neutron-l2-agent
      • nova-compute
labtest
  • labtestpuppetmaster2001

(two clouds sharing a puppetmaster?)

  • puppet master
    • puppet server
    • enc and enc api
  • labtestvirt2002.codfw.wmnet && labtestvirt2001.codfw.wmnet
    • Compute node
      • Runs nova-compute
      • Has space for 4 or 5 smallish VMs
  • labtestnet2001.codfw.wmnet
    • Current network host and api server
      • Runs nova-api
      • Runs nova-network
  • labtestnet2002
    • redundant network
  • labtestcontrol2001.wikimedia.org

(why can't we have a shared keystone?)

  • Openstack controller node
    • database server for openstack components
    • Runs keystone API and all keystone services
    • Runs nova scheduler
    • Runs nova conductor
    • Hosts all OpenStack databases for labtest
    • Hosts puppetmaster for all labtest VMs
  • labtestservices2001.wikimedia.org

(is there a way to have designate see tokens from multiple keystones as valid?)

  • Ldap/DNS server
    • Runs pdns
    • Runs all designate services
  • labtestweb2001.wikimedia.org
    • Web UI frontend for labtest
      • Runs Horizon test instance
      • Runs Wikitech test instance

Current IP and ASN allocations

https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations

208.80.155.128/25 - labs virtualization - floating IPs
10.68.16.0/21 - labs-instances1-b-eqiad

working diagrams

https://drive.google.com/a/wikimedia.org/file/d/0B03SolcDY21YTmVMN2N5aGRlbk0/view?usp=sharing

Event Timeline

bd808 added a subscriber: chasemp.

Assigning to @chasemp as the tech lead for this initiative. He will be responsible for creating a plan for this work and helping me report on it as the quarter progresses.

chasemp updated the task description.
chasemp updated the task description.

Dual stacks and migration and options

As part of the larger T167293, we know we ultimately need to move instances to a different model that is not compatible with our current setup. Separating nova-compute and nova-api from nova-network involves settings within nova.conf that are mutually exclusive between the two models, such as network_api_class = nova.network.neutronv2.api.API. Neutron itself has a separate model where ports, subnets, networks, metadata-proxy, dhcpd, and tenant routers are all first-class, independent objects instead of being loosely attached to a tenant.
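To make the either/or nature of that setting concrete, here is a minimal sketch that reports which networking model a given nova.conf selects. Only the network_api_class value comes from the paragraph above; the function name networking_model, the assumption that the option lives in [DEFAULT], and the treatment of an unset option as nova-network are illustrative guesses rather than a description of our actual configuration.

```python
import configparser

# Value quoted in the task text above.
NEUTRON_API_CLASS = "nova.network.neutronv2.api.API"


def networking_model(nova_conf_text: str) -> str:
    """Return 'neutron' or 'nova-network' for the given nova.conf contents."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    # Assumption: the option is set in [DEFAULT]; an unset option is
    # treated here as the legacy nova-network model.
    value = parser.defaults().get("network_api_class", "")
    return "neutron" if value.strip() == NEUTRON_API_CLASS else "nova-network"


if __name__ == "__main__":
    sample = "[DEFAULT]\nnetwork_api_class = nova.network.neutronv2.api.API\n"
    print(networking_model(sample))
```

A check like this could be run against the rendered nova.conf on each labvirt to confirm which side of the split a host is on before instances are moved to it.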

The rough plan is to do some amount of side-by-side setup while we are in a refresh cycle, which allows us to stop and clean up an instance in one context and move it into the other.

Rough migration outline:

-1) If we choose not to split the current labs-instances subnet range in half (making it a /22 for DHCP) and move instances into the first half, we will most likely need to find another range to use for the new deployment. This means T122406 and finding a separate external range is probably our best option.
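For reference, the split option can be worked out with Python's standard ipaddress module. This is only an arithmetic sketch of what halving the labs-instances range listed under "Current IP and ASN allocations" would produce; the variable names are illustrative, and nothing here decides which half (if any) would be handed to Neutron.

```python
import ipaddress

# Range taken from the "Current IP and ASN allocations" section of this task.
labs_instances = ipaddress.ip_network("10.68.16.0/21")

# Splitting a /21 one bit deeper yields exactly two /22 halves.
first_half, second_half = labs_instances.subnets(prefixlen_diff=1)

print(first_half, first_half.num_addresses)
print(second_half, second_half.num_addresses)
```

Each half holds 1024 addresses, so the split only works if the existing instances can be consolidated into the first /22.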

0) Announce that we are doing the migration for project instances, along with the date on which a customer freeze on tenant management (via Horizon) will begin for instances not yet migrated. Once we pull the trigger, any instance not yet in the Neutron environment will require a request to an admin to manage it from the OpenStack layer, i.e. reboots must be done from within the instance, and self-serve deletes and creates will not be available. Instances that are stuck or otherwise experience issues may be diagnosed by an admin, but in general we are going to advise a period of holistic change freeze. We need to have a labvirt reimaged with T167356 addressed and ready to accept instances from the nova-api and neutron-server processes. This means running the OpenStack Networking Linux bridge layer-2 agent and nova-compute processes.

  1. Stop an instance and do limited cleanup. We think that, assuming enough control plane separation, all of the meta configuration (definitions within the nova db, etc.) can remain. The exception is designate. We are unsure at the moment how a cutover will interact with old authoritative information and with the powerdns backing store. One idea is to purge only this layer on initial instance shutdown. This purge also needs to take floating IP names into account. Puppet certificates should survive the move since they are tied to hostname and the Puppetmaster is external to this consideration.

Instances where cloud admin root keys are disabled will not be migrated.

In the case of maximum control plane separation, we need to create the project within the new keystone, port the user memberships and roles, and in general ensure everything with the instance is working.

  2. rsync the instance to a "neutron labvirt". Run a script to create the setup for the virt to coexist within the new ecosystem. This will involve a manual insert for DHCP, and potentially pseudo-scheduling the instance. I am assuming we will create the port via the normal Neutron API and attach the instance to it. One concern is that the instance will balk at starting up with old DHCP reservation information intact, and we may need to mount and purge that prior to starting.
  3. Start up the instance and ensure DHCP is working. Contact the gateway, route to the internet, etc. Run the script that side-loads floating IP reservations. Fix the authoritative DNS records for domain names.
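The per-instance portion of the outline above can be sketched as an ordered checklist with the one eligibility rule applied up front. This is purely a planning aid under stated assumptions: the step strings paraphrase the outline and are not real commands, and the Instance class, its admin_root_keys_enabled flag, and plan_migration are hypothetical names that do not correspond to any existing tooling.

```python
from dataclasses import dataclass
from typing import List

# Step names paraphrase the outline above; they are labels, not real commands.
MIGRATION_STEPS = [
    "stop instance and purge designate records (including floating IP names)",
    "rsync instance to a neutron labvirt",
    "create Neutron port and attach instance",
    "start instance and verify DHCP, gateway, and internet routing",
    "side-load floating IP reservations",
    "fix authoritative DNS records",
]


@dataclass
class Instance:
    name: str
    admin_root_keys_enabled: bool  # hypothetical flag for the eligibility rule


def plan_migration(instance: Instance) -> List[str]:
    """Return the ordered steps for one instance, or an empty list if skipped."""
    if not instance.admin_root_keys_enabled:
        # Instances where cloud admin root keys are disabled are not migrated.
        return []
    return [f"{instance.name}: {step}" for step in MIGRATION_STEPS]


if __name__ == "__main__":
    for line in plan_migration(Instance("example-vm", True)):
        print(line)
```

Keeping the steps as data rather than code makes it easy to reorder or extend the list as the open questions (designate purge, stale DHCP reservations) get settled in labtestn.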

Possible separations between the two side-by-side setups

(The old nova-network driven and newly minted neutron-server setups have been worked through a few times with different outcomes. I am persisting the notes I have to this point here for discussion.)

Reasons to go for maximum control plane split (new labcontrol, labnet, labvirts):

  • We can do actual side by side testing for entire setup before cutting over instances
  • We can preserve the existing stack in as known-good a state as possible for troubleshooting and a plan C revert
  • nova has been managing networking up to this point and I'm not sure how polluted the configuration is with nova-network related content. A clean break would allow us to not carry over that cruft and fight it for the next few years.
  • It will make a cleanup of the existing Puppet code more palatable. It is possible to refactor everything now in-place but the tangled mess of things makes that really difficult. I have refactored a bit over the last year and all the parts are so intertwined it's a serious headache.
  • In the canonical reference model of separation based on control, data, and compute, the nova-api process lives with the control farm. I think we should move nova-api and install neutron-server on our control* servers. Right now nova-api is run alongside nova-network on the labnet* servers. Three reasons for changing this up: (1) the labnet hosts are privately addressed and I would prefer not to have them be the endpoint for API requests from instances or nodepool; (2) labnet's primary work is dataplane forwarding, and coupling it with control plane logic is overhead we can afford elsewhere; and (3) all of the documentation assumes a split based on this topology (it would be a nicety along with the practical concerns).
  • Since we are not cutting over nova-api to a Neutron context (i.e. only aware of "new" things), we can keep CI running in the nova-network context (potentially with scheduling limited to a specific labvirt or labvirt pool to contain active changes).

Reasons to go for minimum control plane split (shared labcontrol -- new labnet, labvirts):

  • Shared keystone is simpler as components in old nova-network ecosystem and new neutron-server ecosystem can share trust via commonly issued tokens.

Change 365868 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] labtest: rabbitmq for openstack control node

https://gerrit.wikimedia.org/r/365868

Change 365868 merged by Rush:
[operations/puppet@production] labtest: rabbitmq for openstack control node

https://gerrit.wikimedia.org/r/365868

Change 366166 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] labtest: labcontrol2001 use rabbitmq role

https://gerrit.wikimedia.org/r/366166

groupings at the moment

main (nova-network)

labcontrol1001.wikimedia.org
labcontrol1002.wikimedia.org
labmon1001.eqiad.wmnet
labmon1002.eqiad.wmnet
labnet1001.eqiad.wmnet
labnet1002.eqiad.wmnet
labnodepool1001.eqiad.wmnet
labnodepool1002.eqiad.wmnet
labservices1001.wikimedia.org
labservices1002.wikimedia.org
labvirt1001.eqiad.wmnet
labvirt1002.eqiad.wmnet
labvirt1003.eqiad.wmnet
labvirt1004.eqiad.wmnet
labvirt1005.eqiad.wmnet
labvirt1006.eqiad.wmnet
labvirt1007.eqiad.wmnet
labvirt1008.eqiad.wmnet
labvirt1009.eqiad.wmnet
labvirt1010.eqiad.wmnet
labvirt1011.eqiad.wmnet
labvirt1012.eqiad.wmnet
labvirt1013.eqiad.wmnet
labvirt1014.eqiad.wmnet
labvirt1015.eqiad.wmnet
labvirt1016.eqiad.wmnet
labvirt1017.eqiad.wmnet
labvirt1018.eqiad.wmnet
labvirt1019.eqiad.wmnet
labvirt1020.eqiad.wmnet
labvirt1021.eqiad.wmnet
labvirt1022.eqiad.wmnet
labweb1001.wikimedia.org
labweb1002.wikimedia.org

main (neutron)

labcontrol1003.wikimedia.org
labcontrol1004.wikimedia.org
labnet1003.wikimedia.org
labnet1004.wikimedia.org

labtest

labtestcontrol2001.wikimedia.org
labtestnet2001.codfw.wmnet
labtestnet2002.codfw.wmnet
labtestpuppetmaster2001.wikimedia.org
labtestservices2001.wikimedia.org
labtestvirt2001.codfw.wmnet
labtestvirt2002.codfw.wmnet
labtestweb2001.wikimedia.org

labtestn

labtestcontrol2003.wikimedia.org
labtestneutron2001.codfw.wmnet
labtestneutron2002.codfw.wmnet
labtestservices2002.wikimedia.org
labtestservices2003.wikimedia.org
labtestvirt2003.codfw.wmnet
labtestmetal2001.codfw.wmnet (as virt)

Change 366166 abandoned by Rush:
labtest: labcontrol2001 use rabbitmq role

Reason:
obsolete

https://gerrit.wikimedia.org/r/366166

Change 402115 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: these servers should be an HA pair

https://gerrit.wikimedia.org/r/402115

Change 402115 merged by Rush:
[operations/puppet@production] openstack: these servers should be an HA pair

https://gerrit.wikimedia.org/r/402115


Change 433734 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: labtest use labtestcontrol2003 for keystone

https://gerrit.wikimedia.org/r/433734

Change 433734 merged by Rush:
[operations/puppet@production] openstack: labtest use labtestcontrol2003 for keystone

https://gerrit.wikimedia.org/r/433734

Change 436853 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: allow glance to call back for token validation

https://gerrit.wikimedia.org/r/436853

Change 436853 merged by Rush:
[operations/puppet@production] openstack: allow glance to call back for token validation

https://gerrit.wikimedia.org/r/436853

Change 437783 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: labtest: keystone: delete service (collapsed)

https://gerrit.wikimedia.org/r/437783

Change 437783 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: labtest: keystone: delete service (collapsed)

https://gerrit.wikimedia.org/r/437783

Change 437812 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: allow designate in labtest to contact labtestn keystone

https://gerrit.wikimedia.org/r/437812

Change 437812 merged by Rush:
[operations/puppet@production] openstack: allow designate in labtest to contact labtestn keystone

https://gerrit.wikimedia.org/r/437812

I'm going to call this one {{done}}. Work on testing the migration plan in the labtestn environment is actively in progress, and we are starting to plan how to implement the migration in our main cluster starting in July/August 2018.