
Rabbitmq, neutron-openvswitch-agent, and network outages
Closed, InvalidPublic

Description

Just now I did some routine-ish rabbitmq maintenance (rebooting hosts in turn) and that caused the neutron-openvswitch-agents to crash.

That's a neutron bug: those agents should fail over from a non-working rabbitmq backend to a working one, and they should keep retrying even when no backend is reachable rather than crashing.
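For reference, oslo.messaging (which the neutron agents use for rabbitmq) supports listing multiple brokers in transport_url plus reconnect tuning, which is what should make the failover work. A minimal sketch of what that might look like in neutron.conf; the hostnames, credentials, and vhost here are placeholders, not our actual config:

```ini
[DEFAULT]
# List every rabbitmq node so the agent can fail over between them
transport_url = rabbit://neutron:SECRET@rabbit01:5672,neutron:SECRET@rabbit02:5672,neutron:SECRET@rabbit03:5672/openstack

[oslo_messaging_rabbit]
# How long to wait before retrying a failed connection, and how much
# to back off on repeated failures
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
# Delay before kombu reconnects after a broker goes away
kombu_reconnect_delay = 1.0
# Detect dead connections via AMQP heartbeats
heartbeat_timeout_threshold = 60
```

If the agents crash during a rolling rabbitmq reboot despite a config along these lines, that points at a bug in the agent's reconnect handling rather than at missing configuration.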

But, anyway, when those workers crashed, the cloud-vps VMs on associated cloudvirts fell off the internet.

Up until today I believed that openstack services were only involved in changing state and not in maintaining active workloads. Is that untrue of neutron-openvswitch-agent? Is it an active router which touches every packet? If so, we will need to change several of our maintenance practices.

Details

Other Assignee
taavi

Event Timeline

Andrew triaged this task as High priority. Jun 24 2025, 9:23 PM

Something I noticed linked from https://wikitech.wikimedia.org/wiki/Incidents/2024-11-26_WMCS_network_problems when I searched Wikitech for notes on neutron-openvswitch-agent:

T380972: openstack: prevent puppet from restarting neutron-openvswitch-agent

Is there any theory about why restarting openvswitch-agent is more delicate than restarting the old linuxbridge agent?

I'm in favor of avoiding outages, but because the agent runs in many places (cloudvirts), decoupling it from puppet can leave agent state out of sync with its config, which also seems bad.

My current theory is that the linuxbridge agent was stateless, whereas the openvswitch agent is stateful.

I tried stopping neutron-openvswitch-agent on all cloudvirts in codfw1dev, and that did not interrupt my ssh connection to a codfw1dev VM.

Then I restarted it on the cloudvirts and stopped it on the cloudnet nodes; this also didn't interrupt the VM.

Whatever caused this outage wasn't just neutron-openvswitch-agent going down.