
Try to fail over to labnet1002
Closed, Resolved (Public)

Description

Some items of configuration on labnet1001 are of uncertain origin. One of these is the internal VM gateway .1 IP: we don't know whether it is a floating IP managed by openstack, though we suspect it is. We now have a labnet1002 that should be able to take over for labnet1001 in an emergency, or while we upgrade 1001. We want to try moving the networking functionality over this week as a test.

These are the steps:

  1. puppet agent --disable on both labnet1001 and 1002
  2. change the db to point to labnet1002
  3. change the nova.conf on labnet1002 so that it identifies itself as the active network node (make the same change in puppet so it is reflected on the current lab* hosts)
  4. stop nova-network on labnet1001
  5. start nova-network on labnet1002
  6. verify that openstack moves the bridge .1 interface over (and if not, move it ourselves, first disabling the interface on 1001)

(We need the internal .1 to move, nova-network to start and understand its role, the other hosts to see the new nova-network node and try to communicate with it, and the new nova-network node to try to manipulate iptables as appropriate.)

  7. change the router to point the 208 and labs addresses to the labnet1002 ext ip

In theory, at this point labnet1002 has supplanted labnet1001, and labnet1001 has no running nova-network.
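A minimal sketch for checking which host actually holds the gateway .1 address after the switch, assuming we look at the one-line output of `ip -o -4 addr show` on each host. The address `10.0.0.1` and the bridge name `br-example` in the usage note are placeholders, not the real labs values, and the helper names are made up for illustration.

```python
import ipaddress
import subprocess


def addresses_in(ip_output):
    """Collect the IPv4 addresses listed in `ip -o -4 addr show` style output."""
    found = set()
    for line in ip_output.splitlines():
        tokens = line.split()
        # Each one-line record contains "... inet ADDR/PREFIX ...".
        for i, token in enumerate(tokens[:-1]):
            if token == "inet":
                # Drop the prefix length so we compare bare addresses.
                found.add(str(ipaddress.ip_interface(tokens[i + 1]).ip))
    return found


def holds_address(ip_output, address):
    """True if the exact address appears in the given `ip` output."""
    return address in addresses_in(ip_output)


def local_ip_output():
    """Run `ip -o -4 addr show` on the local host and return its stdout."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Run on each host, `holds_address(local_ip_output(), "10.0.0.1")` (with the real gateway address substituted) should be true on labnet1002 and false on labnet1001 once the failover is complete. The comparison is on the whole address, so `10.0.0.11` does not count as a match for `10.0.0.1`.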

Not clear:

  • Is the gateway .1 IP a relic of manual configuration, or does it come from openstack?
  • Are DHCP sessions stored in the DB?
  • Will stale ARP cache entries cause problems when the .1 IP moves?
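On the ARP question: if neighbours keep the old MAC for .1 after the move, one possible remedy is to send unsolicited (gratuitous) ARP from the new holder. A minimal sketch, assuming the iputils `arping` tool is installed on labnet1002; the interface and address in the usage note are placeholders, and whether nova-network already announces the address itself is not something we know yet.

```python
import subprocess


def gratuitous_arp_command(interface, address, count=3):
    """Build an iputils `arping` command that announces `address`.

    -U sends unsolicited ARP so neighbours refresh their caches,
    -I selects the outgoing interface, -c limits the number of packets.
    """
    return ["arping", "-U", "-I", interface, "-c", str(count), address]


def announce(interface, address, count=3):
    """Send the announcement; needs root and must run on the host holding the address."""
    subprocess.run(gratuitous_arp_command(interface, address, count), check=True)
```

For example, `announce("br-example", "10.0.0.1")` would be run on labnet1002 only after confirming that the .1 address is configured there and gone from labnet1001.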

To revert:

  1. stop nova-network on labnet1002
  2. make sure .1 is gone from labnet1002
  3. revert the puppet patch pointing hosts to 1002 and make sure all hosts know to look in the right place
  4. enable puppet on labnet1001
  5. make sure .1 is on labnet1001 and not on 1002 (and that all interfaces are up and reachable on labnet1001)
  6. make sure the db reflects labnet1001
  7. change static routes back to .13 on 1001
  8. make sure labnet1002 isn't still trying to do anything it shouldn't be

Event Timeline

Andrew set Security to None.

We switched things over to labnet1002 today, and it went OK. It should be faster next time! Etherpad of our experience in progress here: https://etherpad.wikimedia.org/p/labnet_failover

Andrew moved this task from Done to Doing on the labs-sprint-110 board.
Andrew added a project: Labs-Sprint-111.

This is done, but we need to do a bit more research and documentation so we don't forget what we learned in the switchover.