
Migrate eqiad1 cloudnets to Neutron OVS agent
Closed, ResolvedPublic

Description

This task is for migrating the eqiad1 Neutron L3 routers to the OVS agent. This will cause some brief service outages, so it will need to be performed in a pre-announced maintenance window.

These should be done far in advance:

  • Prepare Puppet patches for the planned operations below, get them reviewed

These should be done just before the announced maintenance window:

  • Take the inactive cloudnet out of service (disable Puppet, stop Neutron services)
  • Reimage that host as a spare (this will need to use the no-firewall spare class)
  • Delete the Linuxbridge agent for the now-spare host from Neutron (openstack network agent delete)
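The pre-window steps above could be sketched as a dry-run plan like the one below. This is only an illustration, not the procedure actually used: the `retire_plan` function and the example host name are hypothetical, the plain `puppet agent --disable` call stands in for whatever wrapper is used locally, and the host name is assumed to match the name the agent is registered under in Neutron. Nothing is executed; the function only prints commands for review.

```shell
#!/bin/bash
# Dry-run sketch: print (do not run) the commands for retiring the inactive
# Linuxbridge cloudnet. retire_plan is a hypothetical helper name.
retire_plan() {
    local host="$1"   # assumption: host name as registered in Neutron
    echo "# on ${host}:"
    echo "sudo puppet agent --disable 'OVS migration'"
    echo "sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service"
    echo "# after reimaging ${host} as a spare, with OpenStack admin credentials:"
    echo "openstack network agent list --host ${host} --agent-type linux-bridge -f value -c ID"
    echo "openstack network agent delete <ID printed by the previous command>"
}

# Placeholder host name; pass the real one as the first argument.
retire_plan "${1:-cloudnet-example}"
```

Printing the plan first keeps the agent ID lookup and the delete as two separate, reviewable steps, which seems safer than piping one into the other during a maintenance window.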

At this point there is one spare host, and one active Linuxbridge host. Rollback plan at this point is to apply the previous role to the now-spare host.

Wait for announced maintenance window to begin.

  • Make the now-spare host an OVS-based cloudnet in Puppet
  • Immediately disable Puppet and the Neutron services on the OVS host
    • sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-openvswitch-agent.service
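Since the same four units are stopped and started several times in this procedure, with only the L2 agent name changing, a small helper could build the unit list and confirm that nothing is still running. This is a sketch under assumptions: `neutron_units` and `check_stopped` are hypothetical names, and the unit names are taken from the commands in this task.

```shell
#!/bin/bash
# Print the Neutron units for a given L2 agent flavour, one per line.
neutron_units() {
    local l2="$1"   # "linuxbridge" or "openvswitch"
    case "$l2" in
        linuxbridge|openvswitch) ;;
        *) echo "unknown L2 agent: $l2" >&2; return 1 ;;
    esac
    printf '%s\n' \
        neutron-metadata-agent.service \
        neutron-dhcp-agent.service \
        neutron-l3-agent.service \
        "neutron-${l2}-agent.service"
}

# Return non-zero and name the offenders if any unit is still active.
check_stopped() {
    local units unit rc=0
    units="$(neutron_units "$1")" || return 2
    for unit in $units; do
        if systemctl is-active --quiet "$unit"; then
            echo "still active: $unit"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_stopped openvswitch` on the new OVS host should print nothing and return 0 once the services have been stopped.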

At this point there is one active Linuxbridge host, and one inactive OVS host. Rollback plan at this point is to delete the new OVS agent from the Neutron database, and to reimage the now-OVS host back to a Linuxbridge host.

The following steps will cause some minor downtime (~30 sec in codfw1dev):

  • Disable Puppet and stop Neutron services on the active Linuxbridge host
    • sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service
  • Update Neutron ports associated with any routers to use the OVS VIF type in the database (T358761#9780138)
    • sudo mariadb -u root neutron
    • SELECT * FROM ml2_port_bindings WHERE port_id IN (SELECT port_id FROM routerports WHERE router_id IN (SELECT id FROM routers))\G
    • UPDATE ml2_port_bindings SET vif_type = 'ovs' WHERE port_id IN (SELECT port_id FROM routerports WHERE router_id IN (SELECT id FROM routers));
    • to roll back: UPDATE ml2_port_bindings SET vif_type = 'bridge' WHERE port_id IN (SELECT port_id FROM routerports WHERE router_id IN (SELECT id FROM routers));
  • Start Neutron services on the new OVS host
    • sudo systemctl start neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-openvswitch-agent.service
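One way to confirm that the UPDATE above touched every router port is to count the router port bindings per vif_type afterwards. The sketch below reuses the subquery from the statements above; the helper names `router_vif_summary_sql` and `only_vif_type` are hypothetical, and the check assumes tab-separated output without a header row (the `-N -B` options of the `mariadb` client).

```shell
#!/bin/bash
# Emit SQL that counts router port bindings per vif_type.
router_vif_summary_sql() {
    cat <<'SQL'
SELECT vif_type, COUNT(*) AS ports
FROM ml2_port_bindings
WHERE port_id IN (SELECT port_id FROM routerports WHERE router_id IN (SELECT id FROM routers))
GROUP BY vif_type;
SQL
}

# Read "vif_type<TAB>count" rows on stdin; fail if any row has another type.
# Note: empty input also passes, so check the row count by eye as well.
only_vif_type() {
    local want="$1"
    awk -F'\t' -v want="$want" '
        $1 != want { bad = 1; print "unexpected vif_type: " $1 " (" $2 " ports)" }
        END { exit bad }
    '
}
```

A possible invocation on the database host is `router_vif_summary_sql | sudo mariadb -u root -N -B neutron | only_vif_type ovs`; after a rollback the same pipeline would be run with `bridge` instead.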

At this point there is one active OVS host and one inactive Linuxbridge host. Rollback plan at this point is to follow this section but in the opposite direction (replace OVS with Linuxbridge and the other way around in the instructions).

  • Reimage old Linuxbridge host as a spare
  • Delete the old Linuxbridge agent from the Neutron database
  • Apply the OVS cloudnet role to the spare host

At this point there is one active OVS host and one standby OVS host. Rollback plan at this point is to follow this entire procedure but in the opposite direction.

Event Timeline

taavi updated the task description.

I have some questions:

  • What tests will be run on each checkpoint? (linuxbridge active-ovs inactive, ovs active-linuxbridge inactive, ovs active-ovs inactive)
  • Will the outage affect all the running VMs, or only the creation of new ones?
    • If it affects all the running VMs, it will affect NFS servers too right? Will the announcement be for all services? (quarry/paws/toolforge/cloudvps/...)
      • Do you expect NFS in toolforge (specifically, though probably others) to come back up without issues?
  • Is the failover process the same with OVS? (we have some cookbooks that deal with it, do they need updating?)
  • Are there any changes needed in Netbox? (I'm guessing no, as they are not mentioned there xd, asking just in case)

I have some questions:

  • What tests will be run on each checkpoint? (linuxbridge active-ovs inactive, ovs active-linuxbridge inactive, ovs active-ovs inactive)

I need to check what exactly the network tests cookbook does, but possibly that. The main thing to look for is traffic getting in and out of the Neutron-managed VLAN and not getting duplicated.
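As a rough illustration of the "traffic gets through and is not duplicated" check (not what the network tests cookbook does, which is not described here), the summary line printed by `ping` could be inspected for loss and duplicate replies. The sketch assumes the iputils summary format, e.g. `5 packets transmitted, 5 received, +1 duplicates, 0% packet loss, time 4005ms`; `check_ping_summary` is a hypothetical helper.

```shell
#!/bin/bash
# Read ping output on stdin and fail on packet loss or duplicate replies.
# Assumes the iputils summary line format.
check_ping_summary() {
    local summary
    summary="$(grep 'packets transmitted')" || { echo "no ping summary found"; return 2; }
    if echo "$summary" | grep -q 'duplicates'; then
        echo "duplicate replies seen: $summary"
        return 1
    fi
    # The leading space keeps "100% packet loss" and "10% packet loss" from matching.
    if ! echo "$summary" | grep -q ' 0% packet loss'; then
        echo "packet loss seen: $summary"
        return 1
    fi
    echo "ok: $summary"
}
```

It could be run from a VM inside the Neutron-managed network at each checkpoint, for example `ping -c 5 <address outside the VLAN> | check_ping_summary`.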

  • Will the outage affect all the running VMs, or only the creation of new ones?

This will affect all traffic coming in and out of the Neutron-managed virtual network. Based on my tests in codfw1dev I expect the outage to last for about half a minute.

  • If it affects all the running VMs, it will affect NFS servers too right? Will the announcement be for all services? (quarry/paws/toolforge/cloudvps/...)

It will affect the dumps shares since those are hosted outside Cloud VPS, but not traffic from and to the virtualized NFS servers since that never leaves the Neutron VLAN. (I'm assuming the DNS outage for half a minute won't have an effect on existing NFS shares.)

  • Do you expect NFS in toolforge (specifically, though probably others) to come back up without issues?

See above, the Toolforge NFS server will not be affected. Toolforge K8s might see some issues with the dumps shares, but that is a) read-only and b) much less used, so while we should expect some things to get stuck, I don't expect issues on the scale that a general NFS outage would create.

  • Is the failover process the same with OVS? (we have some cookbooks that deal with it, do they need updating?)

The general process is the same, with possible s/linuxbridge/openvswitch/ somewhere. I'll check the cookbooks.

  • Are there any changes needed in Netbox? (I'm guessing no, as they are not mentioned there xd, asking just in case)

No.

Thanks for the task @taavi. Looks well put together. Let me know the exact time you're starting, and feel free to ping me if there is anything you need checked from the physical network side of things (where MAC addresses are in the forwarding tables etc.)

taavi changed the task status from Open to In Progress. May 14 2024, 12:19 PM

Change #1032388 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: neutron: add ovs config to eqiad1 profiles

https://gerrit.wikimedia.org/r/1032388

Change #1032389 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:wmcs::openstack: add eqiad1 net_ovs role

https://gerrit.wikimedia.org/r/1032389

Change #1032390 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Move cloudnet1005 to insetup_noferm to prep for OVS

https://gerrit.wikimedia.org/r/1032390

Change #1032391 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Move cloudnet1005 to insetup_noferm to OVS agent

https://gerrit.wikimedia.org/r/1032391

Change #1032388 merged by Majavah:

[operations/puppet@production] P:openstack: neutron: add ovs config to eqiad1 profiles

https://gerrit.wikimedia.org/r/1032388

Change #1032389 merged by Majavah:

[operations/puppet@production] O:wmcs::openstack: add eqiad1 net_ovs role

https://gerrit.wikimedia.org/r/1032389

Change #1034089 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] openstack: cloudnet: Add one-off cookbook for OVS migration

https://gerrit.wikimedia.org/r/1034089

Mentioned in SAL (#wikimedia-cloud) [2024-05-21T08:18:42Z] <taavi> stop neutron services on cloudnet1005 T364459

Change #1032390 merged by Majavah:

[operations/puppet@production] site: Move cloudnet1005 to insetup_noferm to prep for OVS

https://gerrit.wikimedia.org/r/1032390

Change #1034089 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] openstack: cloudnet: Add one-off cookbook for OVS migration

https://gerrit.wikimedia.org/r/1034089

Change #1032391 merged by Majavah:

[operations/puppet@production] site: Move cloudnet1005 to OVS agent

https://gerrit.wikimedia.org/r/1032391

Change #1034468 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Move cloudnet1006 to insetup_noferm

https://gerrit.wikimedia.org/r/1034468

Change #1034469 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Move cloudnet1006 to net_ovs

https://gerrit.wikimedia.org/r/1034469

Change #1034468 merged by Majavah:

[operations/puppet@production] site: Move cloudnet1006 to insetup_noferm

https://gerrit.wikimedia.org/r/1034468

Change #1034469 merged by Majavah:

[operations/puppet@production] site: Move cloudnet1006 to net_ovs

https://gerrit.wikimedia.org/r/1034469