
2023-12-01 Cloud VPS network outage
Closed, Resolved (Public)

Description

Timeline (UTC):

11:50 Neutron alerts start firing. Network tests show no errors. The cloudcontrol1005 logs show some database deadlocks.
13:51 Neutron alerts resolve by themselves, then fire again
14:10 (probably unrelated) Francesco is paged about cloudvirt1046 missing the canary VM
14:43 Francesco runs systemctl restart neutron* on both cloudnet1005 and cloudnet1006
14:46 All Cloud VPS networking is failing
14:54 Taavi stops nova and neutron services on cloudcontrols to troubleshoot deadlocks
15:06 Taavi stops galera on cloudcontrol1005 after noticing it's not working correctly
~15:07 Taavi restarts nova-api.service, neutron-*.service on cloudcontrol1006, then starts neutron-l3-agent on cloudnet1005
15:09 Networking is up again
15:48 Andrew reimages cloudcontrol1005 because it's still not working correctly
16:44 cloudcontrol1005 is up again
16:46 Andrew runs cookbook wmcs.openstack.restart_openstack --neutron --cluster-name eqiad1
16:58 Andrew restarts nova services because openstack hypervisor list is not working
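
For reference, the durations implied by the timeline above can be worked out with a short script. This is only a minimal sketch: the timestamps are copied from the timeline, while the helper `at`, the variable names, and the choice of 14:46 to 15:09 as the "full outage" window are assumptions made for illustration, not an official impact figure.

```python
from datetime import datetime

# Date of the incident; all times in the timeline are UTC.
DAY = "2023-12-01"


def at(hhmm: str) -> datetime:
    """Hypothetical helper: build a datetime for a timeline entry (HH:MM)."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Timestamps copied from the timeline; the labels are illustrative.
first_alert = at("11:50")      # Neutron alerts start firing
networking_down = at("14:46")  # all Cloud VPS networking is failing
networking_up = at("15:09")    # networking is up again

# Durations in whole minutes.
full_outage_minutes = int((networking_up - networking_down).total_seconds() // 60)
alert_to_recovery_minutes = int((networking_up - first_alert).total_seconds() // 60)

print(f"Full outage: {full_outage_minutes} min")
print(f"First alert to recovery: {alert_to_recovery_minutes} min")
```

Under these assumptions the full networking outage lasted 23 minutes, and 3 hours 19 minutes passed between the first alert and recovery.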

Event Timeline

taavi triaged this task as Unbreak Now! priority. Dec 1 2023, 2:48 PM
taavi created this task.
fnegri changed the task status from Open to In Progress. Dec 1 2023, 3:07 PM
fnegri updated the task description.
taavi claimed this task.