Debug and understand why bringing down cloud net/gw/lb resulted in cloud vps network down
Closed, Resolved, Public

Description

In parent task T417393 we brought down the cloudnet1005, cloudgw1003, and cloudlb1001 hosts, which took the whole Cloud VPS network down, both internally and externally. This task tracks understanding and fixing automatic failover for cloud networking.

@taavi reports that cloudgw1004 (in the other rack) was the active host at the time of testing, so cloudgw1003 was not in the active networking path. vrrpd reacted as expected, though (P89835).

Event Timeline

Focusing on the neutron part first, I noticed two agents on cloudnet1006 are marked DOWN

root@cloudcontrol1006:~# wmcs-openstack network agent list --host cloudnet1006
+--------------------------------------+--------------------+--------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host         | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------+-------------------+-------+-------+---------------------------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | L3 agent           | cloudnet1006 | nova              | :-)   | DOWN  | neutron-l3-agent          |
| 617c2d4f-3d67-4b55-a8b8-1ecec4cba608 | Metadata agent     | cloudnet1006 | None              | :-)   | UP    | neutron-metadata-agent    |
| 7f4b27b6-b3d0-479c-bbe1-f36e5108dc90 | Open vSwitch agent | cloudnet1006 | None              | :-)   | DOWN  | neutron-openvswitch-agent |
| e4f71e5d-e182-487d-8c5f-eb15f1ff2bf6 | DHCP agent         | cloudnet1006 | nova              | :-)   | UP    | neutron-dhcp-agent        |
+--------------------------------------+--------------------+--------------+-------------------+-------+-------+---------------------------+
root@cloudcontrol1006:~# wmcs-openstack network agent show 3f54b3c2-503f-4667-8263-859a259b3b21
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field             | Value                                                                                                                                                                                                                 |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up    | DOWN                                                                                                                                                                                                                  |
| agent_type        | L3 agent                                                                                                                                                                                                              |
| alive             | :-)                                                                                                                                                                                                                   |
| availability_zone | nova                                                                                                                                                                                                                  |
| binary            | neutron-l3-agent                                                                                                                                                                                                      |
| configuration     | {'agent_mode': 'legacy', 'ex_gw_ports': 0, 'extensions': [], 'floating_ips': 0, 'handle_internal_only_routers': True, 'interface_driver': 'openvswitch', 'interfaces': 0, 'log_agent_heartbeats': True, 'routers': 0} |
| created_at        | 2022-10-04 13:04:14                                                                                                                                                                                                   |
| description       | None                                                                                                                                                                                                                  |
| ha_state          | None                                                                                                                                                                                                                  |
| host              | cloudnet1006                                                                                                                                                                                                          |
| id                | 3f54b3c2-503f-4667-8263-859a259b3b21                                                                                                                                                                                  |
| last_heartbeat_at | 2026-03-10 10:02:11                                                                                                                                                                                                   |
| resources_synced  | None                                                                                                                                                                                                                  |
| started_at        | 2026-01-06 16:46:14                                                                                                                                                                                                   |
| topic             | l3_agent                                                                                                                                                                                                              |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root@cloudcontrol1006:~# wmcs-openstack network agent show 7f4b27b6-b3d0-479c-bbe1-f36e5108dc90
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field             | Value                                                                                                                                                                                                                   |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up    | DOWN                                                                                                                                                                                                                    |
| agent_type        | Open vSwitch agent                                                                                                                                                                                                      |
| alive             | :-)                                                                                                                                                                                                                     |
| availability_zone | None                                                                                                                                                                                                                    |
| binary            | neutron-openvswitch-agent                                                                                                                                                                                               |
| configuration     | {'arp_responder_enabled': False, 'baremetal_smartnic': False, 'bridge_mappings': {'cloudinstances2b': 'br-internal', 'br-external': 'br-external'}, 'datapath_type': 'system', 'devices': 0,                            |
|                   | 'enable_distributed_routing': False, 'extensions': [], 'in_distributed_mode': False, 'integration_bridge': 'br-int', 'l2_population': False, 'log_agent_heartbeats': True, 'ovs_capabilities': {'datapath_types':       |
|                   | ['netdev', 'system'], 'iface_types': ['afxdp', 'afxdp-nonpmd', 'bareudp', 'erspan', 'geneve', 'gre', 'gtpu', 'internal', 'ip6erspan', 'ip6gre', 'lisp', 'patch', 'srv6', 'stt', 'system', 'tap', 'vxlan']},             |
|                   | 'ovs_hybrid_plug': False, 'resource_provider_bandwidths': {}, 'resource_provider_hypervisors': {'br-external': 'cloudnet1006.eqiad.wmnet', 'rp_tunnelled': 'cloudnet1006.eqiad.wmnet', 'br-internal':                   |
|                   | 'cloudnet1006.eqiad.wmnet'}, 'resource_provider_inventory_defaults': {'allocation_ratio': 1.0, 'min_unit': 1, 'step_size': 1, 'reserved': 0}, 'resource_provider_packet_processing_inventory_defaults':                 |
|                   | {'allocation_ratio': 1.0, 'min_unit': 1, 'step_size': 1, 'reserved': 0}, 'resource_provider_packet_processing_with_direction': {}, 'resource_provider_packet_processing_without_direction': {}, 'tunnel_types':         |
|                   | ['vxlan'], 'tunneling_ip': '172.20.2.3', 'vhostuser_socket_dir': '/var/run/openvswitch'}                                                                                                                                |
| created_at        | 2024-05-21 12:12:23                                                                                                                                                                                                     |
| description       | None                                                                                                                                                                                                                    |
| ha_state          | None                                                                                                                                                                                                                    |
| host              | cloudnet1006                                                                                                                                                                                                            |
| id                | 7f4b27b6-b3d0-479c-bbe1-f36e5108dc90                                                                                                                                                                                    |
| last_heartbeat_at | 2026-03-10 10:01:59                                                                                                                                                                                                     |
| resources_synced  | None                                                                                                                                                                                                                    |
| started_at        | 2026-02-26 03:35:52                                                                                                                                                                                                     |
| topic             | N/A                                                                                                                                                                                                                     |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Thanks @fgiunchedi. The other aspects we should monitor are the keepalived operations on both the cloudnet and cloudgw nodes, to make sure they are failing over. I think if we test each element one at a time we can be set up and logged on to all hosts and monitoring events, so hopefully we can isolate exactly what parts aren't working as expected.

The Neutron side I'm less familiar with but the above does look a little concerning alright.

Indeed. For cloudgw I have updated the task description: vrrpd looks like it behaved as expected, and the active host was not in the affected rack. I was also too hasty in bringing down all the network hosts at once, muddying the waters! Next time we will indeed test hosts in isolation.

Next up is understanding whether and how we can test the neutron failover in codfw1dev, and what that looks like. So far my best lead is that neutron did not fail over automatically because the agents on cloudnet1006 had admin_state_up set to DOWN.

It seems that the network services on cloudnet1006 were set to down manually (or via a cookbook). That would certainly explain the failed failover.

The wmcs.openstack.cloudnet.reboot_node cookbook is the only cookbook I see that would do that. The SAL shows that cookbook as last run on 2025-05-26, which doesn't fit: we've surely had working failovers since then.

I have now re-enabled the services with 'openstack network agent set --enable'
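
For reference, the re-enable step could be scripted along these lines. This is a minimal sketch, not the procedure that was actually used here. It assumes the agent list was exported with `-f json`, and that the `State` field arrives either as a boolean or as the `UP`/`DOWN` text shown in the table output, depending on the client version. The helper names `is_admin_up` and `enable_commands` are hypothetical, and the sample data only mirrors two rows of the cloudnet1006 listing above.

```python
import json
import shlex


def is_admin_up(state):
    """Interpret the 'State' column: a bool in JSON output, or 'UP'/'DOWN' text."""
    if isinstance(state, bool):
        return state
    return str(state).strip().upper() == "UP"


def enable_commands(agent_list_json, cli="openstack"):
    """Return one 'network agent set --enable' command per admin-down agent."""
    return [
        f"{cli} network agent set --enable {shlex.quote(agent['ID'])}"
        for agent in json.loads(agent_list_json)
        if not is_admin_up(agent["State"])
    ]


# Sample input mirroring two rows of the cloudnet1006 listing (assumed JSON shape).
SAMPLE = json.dumps([
    {"ID": "3f54b3c2-503f-4667-8263-859a259b3b21", "Agent Type": "L3 agent",
     "Host": "cloudnet1006", "Alive": True, "State": False},
    {"ID": "617c2d4f-3d67-4b55-a8b8-1ecec4cba608", "Agent Type": "Metadata agent",
     "Host": "cloudnet1006", "Alive": True, "State": True},
])

# Print the commands rather than running them, so they can be reviewed first.
for command in enable_commands(SAMPLE):
    print(command)
```

Printing the commands instead of executing them keeps a human in the loop, which seems prudent when the reason for the disabled state is unknown.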

I suspect we will never know why they were disabled. As a follow-up, we could add an alert for any time network agents (or really any agents) show as 'down' for a certain amount of time, or we could have a pre-flight script that verifies that neutron is ready for a failover.
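
A pre-flight check of the kind suggested above might look like the following. It is only a sketch under stated assumptions: the input is the JSON form of the agent list, the `Alive` and `State` fields are either booleans or the `:-)` / `UP` text from the table output, and the function name `failover_blockers` is hypothetical. It is not meant to describe what the alert in the change below implements.

```python
import json


def _flag(value, text_for_true):
    """Normalise a column that may be a bool or the CLI's table text."""
    if isinstance(value, bool):
        return value
    return str(value).strip().upper() == text_for_true


def failover_blockers(agent_list_json, hosts):
    """List agents on the given hosts that are dead or administratively down."""
    blockers = []
    for agent in json.loads(agent_list_json):
        if agent["Host"] not in hosts:
            continue
        if not _flag(agent["Alive"], ":-)"):
            blockers.append(f"{agent['Host']}: {agent['Binary']} is not alive")
        elif not _flag(agent["State"], "UP"):
            blockers.append(f"{agent['Host']}: {agent['Binary']} is admin down")
    return blockers


# Sample data shaped like the cloudnet1006 listing earlier in this task.
AGENTS = json.dumps([
    {"Host": "cloudnet1006", "Binary": "neutron-l3-agent",
     "Alive": True, "State": False},
    {"Host": "cloudnet1006", "Binary": "neutron-metadata-agent",
     "Alive": True, "State": True},
    {"Host": "cloudnet1006", "Binary": "neutron-openvswitch-agent",
     "Alive": True, "State": False},
    {"Host": "cloudnet1006", "Binary": "neutron-dhcp-agent",
     "Alive": True, "State": True},
])

for problem in failover_blockers(AGENTS, hosts={"cloudnet1006"}):
    print(problem)
```

A wrapper could exit non-zero whenever the returned list is non-empty, so that a failover test refuses to start while the standby host still has disabled agents.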

Change #1249976 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/alerts@master] team-wmcs: neutron: Alert if Neutron agents are forgotten in adminDown

https://gerrit.wikimedia.org/r/1249976

Change #1249976 merged by jenkins-bot:

[operations/alerts@master] team-wmcs: neutron: Alert if Neutron agents are forgotten in adminDown

https://gerrit.wikimedia.org/r/1249976