
2023-12-01 Cloud VPS network outage
Closed, Resolved
Public

Description

Timeline (UTC):

11:50 Neutron alerts start firing. Network tests don't show any errors. The cloudcontrol1005 logs show some database deadlocks.
13:51 Neutron alerts resolve on their own, then fire again
14:10 (probably unrelated) Francesco is paged about cloudvirt1046 missing the canary VM
14:43 Francesco runs systemctl restart neutron* on both cloudnet1005 and cloudnet1006
14:46 All Cloud VPS networking is failing
14:54 Taavi stops nova and neutron services on cloudcontrols to troubleshoot deadlocks
15:06 Taavi stops galera on cloudcontrol1005 after noticing it's not working correctly
~15:07 Taavi restarts nova-api.service, neutron-*.service on cloudcontrol1006, then starts neutron-l3-agent on cloudnet1005
15:09 Networking is up again
15:48 Andrew reimages cloudcontrol1005 because it's still not working correctly
16:44 cloudcontrol1005 is up again
16:46 Andrew runs cookbook wmcs.openstack.restart_openstack --neutron --cluster-name eqiad1
16:58 Andrew restarts nova services because openstack hypervisor list is not working

Event Timeline

taavi triaged this task as Unbreak Now! priority.
fnegri changed the task status from Open to In Progress. Dec 1 2023, 3:07 PM
fnegri updated the task description.