
Deploy OVS test setup in codfw1dev
Open, Needs Triage, Public

Description

Try deploying OVS in codfw1dev in parallel with the current setup, to see whether a migration without a full OpenStack redeployment is even possible.

Event Timeline

Change 1007900 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Add new role for OVS cloudnet

https://gerrit.wikimedia.org/r/1007900

Change 1007901 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Add some new networks for WMCS OVS testing

https://gerrit.wikimedia.org/r/1007901

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudnet2007-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudnet2008-dev.codfw.wmnet with OS bookworm

Change 1007900 merged by Majavah:

[operations/puppet@production] Add new role for OVS cloudnet

https://gerrit.wikimedia.org/r/1007900

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudnet2008-dev.codfw.wmnet with OS bookworm completed:

  • cloudnet2008-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202403041117_taavi_146478_cloudnet2008-dev.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403041126_taavi_146478_cloudnet2008-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudnet2007-dev.codfw.wmnet with OS bookworm completed:

  • cloudnet2007-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202403041114_taavi_146402_cloudnet2007-dev.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202403041125_taavi_146402_cloudnet2007-dev.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403041125_taavi_146402_cloudnet2007-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 1008422 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:wmcs: codfw1dev: net_ovs: add base neutron config

https://gerrit.wikimedia.org/r/1008422

Change 1008422 merged by Majavah:

[operations/puppet@production] O:wmcs: codfw1dev: net_ovs: add base neutron config

https://gerrit.wikimedia.org/r/1008422

Change 1008462 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: neutron: add API support for OVS

https://gerrit.wikimedia.org/r/1008462

Change 1008463 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: neutron: first attempt of installing ovs-agent

https://gerrit.wikimedia.org/r/1008463

Change 1008462 merged by Majavah:

[operations/puppet@production] openstack: neutron: add API support for OVS

https://gerrit.wikimedia.org/r/1008462

Change 1008463 merged by Majavah:

[operations/puppet@production] openstack: neutron: first attempt of installing ovs-agent

https://gerrit.wikimedia.org/r/1008463

Change 1009496 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: neutron: fix VLAN names on OVS test hosts

https://gerrit.wikimedia.org/r/1009496

Change 1007901 merged by Majavah:

[operations/puppet@production] Add some new networks for WMCS OVS testing

https://gerrit.wikimedia.org/r/1007901

Change 1009496 merged by Majavah:

[operations/puppet@production] P:openstack: neutron: fix VLAN names on OVS test hosts

https://gerrit.wikimedia.org/r/1009496

Change 1009511 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: nova: convert cloudvirt2001-dev to OVS agent

https://gerrit.wikimedia.org/r/1009511

Change 1009511 merged by Majavah:

[operations/puppet@production] P:openstack: nova: convert cloudvirt2001-dev to OVS agent

https://gerrit.wikimedia.org/r/1009511

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudvirt2001-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudvirt2001-dev.codfw.wmnet with OS bookworm completed:

  • cloudvirt2001-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403080940_taavi_945517_cloudvirt2001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
$ sudo wmcs-openstack network create --project admin --share --provider-network-type vxlan lan-flat-cloudinstances3
$ sudo wmcs-openstack subnet create --network lan-flat-cloudinstances3 --subnet-range 172.16.129.0/24 --gateway 172.16.129.1 --dns-nameserver 172.20.254.1 cloud-instances-flat3-codfw-v4

# unset maintenance
$ sudo wmcs-openstack server create --os-compute-api-version 2.74 --os-project-id taavitestproject --flavor g3.cores1.ram2.disk20 --image debian-12.0-bookworm --security-group 4c29a64f-b883-4622-893c-eb3fd78b0b7f --nic net-id=e40a1c9f-cc09-4751-a6b8-0469a52318b7 --host cloudvirt2001-dev taavi-ovs-test
# set maintenance

Now the instance is failing to create with:

2024-03-08 10:38:24.213 1331 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent ; Stdout: ; Stderr: iptables-restore v1.8.9 (nf_tables): interface name `105c0477-6f00-4b3d-8749-795a34c5f9c4' must be shorter than IFNAMSIZ (15)

Note: the UUID in the iptables error is the Neutron port UUID, so presumably somewhere in the Neutron code it is not being mapped to the actual interface name.
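To make the length problem concrete, here is a minimal Python sketch of the check that iptables-restore is effectively doing, plus the kind of shortened device name one would expect in place of the raw UUID. The helper names are made up for illustration and this is not Neutron's code; the "tap" prefix with an 11-character slice of the port ID is the usual tap device naming as far as I know, so treat that part as an assumption.

```python
# Longest interface name Linux accepts: IFNAMSIZ is 16 bytes including the
# terminating NUL, which is why iptables reports a limit of 15.
MAX_IFNAME_LEN = 15


def fits_ifnamsiz(name: str) -> bool:
    """Return True if the name is short enough to be a Linux interface name."""
    return len(name.encode()) <= MAX_IFNAME_LEN


def tap_name_for_port(port_id: str) -> str:
    """Hypothetical helper: 'tap' plus the first 11 characters of the port UUID."""
    return "tap" + port_id[:11]


# Port UUID taken from the error message above.
port_id = "105c0477-6f00-4b3d-8749-795a34c5f9c4"

print(len(port_id), fits_ifnamsiz(port_id))
print(tap_name_for_port(port_id), fits_ifnamsiz(tap_name_for_port(port_id)))
```

The raw UUID is 36 characters and can never pass the check, while the shortened name is 14 characters and does.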

The issue above still persists on Bobcat. Here's a log of an instance creation where it happened: P60929

Change #1021484 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: neutron: Fix firewall driver with openvswitch

https://gerrit.wikimedia.org/r/1021484

Change #1021484 merged by Majavah:

[operations/puppet@production] openstack: neutron: Fix firewall driver with openvswitch

https://gerrit.wikimedia.org/r/1021484

The firewall issue was fixed by the above patch, which sets the firewall driver to the same value in all config files. Using the command at P60933, I can now create an instance on cloudvirt2001-dev that has an interface on an OVS-provided network.
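For anyone wanting to double check that kind of mismatch, a rough Python sketch of a consistency check follows. It assumes the setting in question is `firewall_driver` in the `[securitygroup]` section; the file names and driver values in the sample data are illustrative only and are not copied from the actual hosts or from the patch.

```python
import configparser
from typing import Dict, Optional


def firewall_drivers(config_texts: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each config file name to its [securitygroup] firewall_driver value."""
    found = {}
    for name, text in config_texts.items():
        # strict=False tolerates repeated options; interpolation is disabled
        # so that literal '%' characters in values do not raise errors.
        parser = configparser.ConfigParser(strict=False, interpolation=None)
        parser.read_string(text)
        found[name] = parser.get("securitygroup", "firewall_driver", fallback=None)
    return found


def is_consistent(found: Dict[str, Optional[str]]) -> bool:
    """True when every file that sets the option agrees on a single value."""
    return len({value for value in found.values() if value is not None}) <= 1


# Illustrative contents only, not the real configuration.
sample = {
    "neutron.conf": "[securitygroup]\nfirewall_driver = iptables_hybrid\n",
    "openvswitch_agent.ini": "[securitygroup]\nfirewall_driver = openvswitch\n",
    "ml2_conf.ini": "[ml2]\nmechanism_drivers = openvswitch\n",
}

drivers = firewall_drivers(sample)
for name, value in drivers.items():
    print(name, value)
print("consistent:", is_consistent(drivers))
```

On a real host the texts would be read from the actual config files rather than inlined; files that do not set the option at all are ignored by the check.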

Next up:

  • Add a DHCP agent to the OVS provider network
  • Move a second cloudvirt (2002-dev, most likely) to the new setup
  • Set up a second VM on that, and see if they can talk to each other

After that, start looking at outbound connectivity from an OVS-backed network, and also check whether the OVS agent can talk to the current VLAN-backed network or whether each cloudvirt will strictly have to use one or the other.

Change #1021867 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:wmcs: codfw1dev: net_ovs: install dhcp and metadata agents

https://gerrit.wikimedia.org/r/1021867

Change #1021867 merged by Majavah:

[operations/puppet@production] O:wmcs: codfw1dev: net_ovs: install dhcp and metadata agents

https://gerrit.wikimedia.org/r/1021867

Change #1021894 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: neutron: set dhcp interface driver correctly

https://gerrit.wikimedia.org/r/1021894

Change #1021894 merged by Majavah:

[operations/puppet@production] openstack: neutron: set dhcp interface driver correctly

https://gerrit.wikimedia.org/r/1021894

Change #1021968 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: neutron: Connect OVS agents to provider networks

https://gerrit.wikimedia.org/r/1021968

Change #1021968 merged by Majavah:

[operations/puppet@production] openstack: neutron: Connect OVS agents to provider networks

https://gerrit.wikimedia.org/r/1021968

Change #1023384 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: move cloudvirt2002-dev to OVS agent

https://gerrit.wikimedia.org/r/1023384

I've been looking at this error recently:

Apr 25 12:51:08 cloudvirt2001-dev nova-compute[2572868]: 2024-04-25 12:51:08.870 2572868 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: {"details":"cannot delete QoS row 4f1bdbb2-2063-4789-9603-b53982670743 because of 1 remaining reference(s)","error":"referential integrity violation"}
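When grepping for more of these, a small Python sketch like the one below can pull the table, row UUID and reference count out of such log lines. The function name and the regular expression are my own and only match the wording seen in the line above; they are not part of ovsdbapp. In the Open vSwitch schema the Port table has a `qos` column, so a Port row is a likely holder of the remaining reference, but that is an assumption here.

```python
import json
import re
from typing import Dict, Optional

OVSDB_MARKER = "OVSDB Error: "
DETAILS_RE = re.compile(
    r"cannot delete (?P<table>\S+) row (?P<row>[0-9a-f-]{36}) "
    r"because of (?P<refs>\d+) remaining reference"
)


def parse_ovsdb_error(line: str) -> Optional[Dict[str, object]]:
    """Extract table, row UUID and reference count from an OVSDB error log line."""
    _, marker, payload = line.partition(OVSDB_MARKER)
    if not marker:
        return None  # not an OVSDB error line
    try:
        error = json.loads(payload)
    except json.JSONDecodeError:
        return None
    match = DETAILS_RE.search(error.get("details", ""))
    if match is None:
        return None
    return {
        "error": error.get("error"),
        "table": match["table"],
        "row": match["row"],
        "references": int(match["refs"]),
    }


# The log line quoted above, without the syslog prefix.
LOG_LINE = (
    '2024-04-25 12:51:08.870 2572868 ERROR ovsdbapp.backend.ovs_idl.transaction [-] '
    'OVSDB Error: {"details":"cannot delete QoS row '
    '4f1bdbb2-2063-4789-9603-b53982670743 because of 1 remaining reference(s)",'
    '"error":"referential integrity violation"}'
)

print(parse_ovsdb_error(LOG_LINE))
```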

Change #1023384 merged by Majavah:

[operations/puppet@production] hieradata: move cloudvirt2002-dev to OVS agent

https://gerrit.wikimedia.org/r/1023384

To move an instance from linuxbridge to OVS, the following UPDATE needs to be manually run on the database:

mysql:root@localhost [neutron]> UPDATE ml2_port_bindings SET vif_type = 'ovs' WHERE port_id = '<port id>';
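If this ends up being scripted, a parameterized version is safer than pasting port IDs into the statement by hand. The Python sketch below shows the idea. It uses an in-memory sqlite3 table purely so that it is self-contained; the real Neutron database is MySQL/MariaDB, where the driver's placeholder is typically `%s` rather than `?`. The function name, the second port ID, and the starting `bridge` vif_type are assumptions for the demo.

```python
import sqlite3
import uuid


def set_vif_type_ovs(conn: sqlite3.Connection, port_id: str) -> int:
    """Set vif_type to 'ovs' for one port binding; return the number of rows changed."""
    uuid.UUID(port_id)  # raises ValueError for anything that is not a UUID
    cursor = conn.execute(
        "UPDATE ml2_port_bindings SET vif_type = ? WHERE port_id = ?",
        ("ovs", port_id),
    )
    conn.commit()
    return cursor.rowcount


# Stand-in table with only the two columns used by the UPDATE above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ml2_port_bindings (port_id TEXT PRIMARY KEY, vif_type TEXT)")
conn.executemany(
    "INSERT INTO ml2_port_bindings VALUES (?, ?)",
    [
        ("105c0477-6f00-4b3d-8749-795a34c5f9c4", "bridge"),
        ("00000000-0000-4000-8000-000000000001", "bridge"),  # made-up port ID
    ],
)

changed = set_vif_type_ovs(conn, "105c0477-6f00-4b3d-8749-795a34c5f9c4")
print("rows changed:", changed)
print(conn.execute("SELECT port_id, vif_type FROM ml2_port_bindings ORDER BY port_id").fetchall())
```

Checking the returned row count is a cheap way to notice a mistyped port ID, since an UPDATE that matches nothing succeeds silently.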