
2022-07-20 CloudVPS instability after network outage
Closed, Resolved · Public

Description

We are seeing sustained instability in the system in many ways:

  • RabbitMQ dropping connections in general (affecting nova-api, neutron)
  • Neutron agents breaking due to rabbitmq timeouts/connection issues
  • VMs failing to be created because their network was not getting created
  • VMs failing to be deleted due to lost messages

Event Timeline

dcaro triaged this task as High priority. Jul 20 2022, 11:34 AM
dcaro created this task.
dcaro added a parent task: T313382: asw2-c5-eqiad crash.

Things I've done and tried so far:

At first the rabbit cluster was split in two, cloudcontrol1005 on one side and the rest on the other; restarting 1003 made
1005 aware that there are other nodes (they showed as unreachable, but at least they showed, whereas before it was showing
no other nodes at all).

Restarting 1005 brought the cluster up and running.
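
As a sanity check (a generic sketch, the host is just an example), rabbitmqctl can show whether all the nodes agree on the
cluster membership and whether any partitions remain:

root@cloudcontrol1005:~# rabbitmqctl cluster_status
# a healthy cluster lists all the cloudcontrols as running and reports no network partitions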

Restarted all the neutron agents (on the cloudnets, cloudvirts and cloudcontrols) using cumin, as well as nova-api/nova-api-metadata.
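
For the record, the restarts were roughly of this shape (a sketch, not the exact invocations used; the cumin host, host
globs and unit names are illustrative and would need adjusting to the real fleet):

root@cumin1001:~# cumin 'cloudnet1*.eqiad.wmnet' 'systemctl restart neutron-l3-agent neutron-dhcp-agent'
root@cumin1001:~# cumin 'cloudvirt1*.eqiad.wmnet' 'systemctl restart neutron-linuxbridge-agent'
root@cumin1001:~# cumin 'cloudcontrol1*.eqiad.wmnet' 'systemctl restart neutron-server nova-api nova-api-metadata'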

Created a dashboard for rabbit to try to see if it was healthy:
https://grafana-rw.wikimedia.org/d/Kn5xm-gZk/wmcs-openstack-eqiad-rabbitmq-overview?orgId=1

Then some things started working again (removing VMs), but novafullstack kept failing when trying to create a VM because
the network for it was not being created.

Looking at the logs:
https://logstash.wikimedia.org/app/dashboards#/view/8aa679f0-d52e-11eb-81e9-e1226573bad4?_g=h@41d3bb7&_a=h@251785e

I saw some neutron agents breaking due to being unable to connect to rabbit; restarting those manually got them connected
again, and they did some work until another agent broke.

Currently I'm playing whack-a-mole with the services, but something is making the cluster unstable.
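
A quick way to see which agents are down at any given moment (a sketch; it needs admin credentials sourced, and the exact
"Alive" marker depends on the client version):

root@cloudcontrol1003:~# openstack network agent list -c Host -c Binary -c Alive -c State
# agents whose Alive column is not ":-)" have stopped reporting in, typically because they lost their rabbit connection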

Rabbit seems to have enough file descriptors (running out of them would be one reason for it to drop connections), judging
by the graphs and the process:
https://grafana-rw.wikimedia.org/d/Kn5xm-gZk/wmcs-openstack-eqiad-rabbitmq-overview?orgId=1

root@cloudcontrol1007:~# grep -i nofile /lib/systemd/system/rabbitmq-server.service
LimitNOFILE=65536

root@cloudcontrol1007:~# systemctl status rabbitmq-server.service  | grep rabbit
...
             ├─1010283 /usr/lib/erlang/erts-11.1.8/bin/beam.smp ...

root@cloudcontrol1007:~# grep -i 'open files' /proc/1010283/limits
Max open files            65536                65536                files

root@cloudcontrol1007:~# lsof -p 1010283 | wc
     72     675    9158
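
rabbit's own view of its file descriptor usage can be cross-checked too (a sketch; the exact output format depends on the
RabbitMQ version, hence the loose grep pattern):

root@cloudcontrol1007:~# rabbitmqctl status | grep -i -A3 'file.descriptors'
# shows rabbit's used file descriptor count and the limit it thinks it has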

On cloudnet1003, neutron-dhcp-agent broke again; the error is:

2022-07-20 11:42:40.554 3170279 ERROR oslo_service.service oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on cloudcontrol1005.wikimedia.org:5671 after inf tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'dhcp_agent.cloudnet1003' in vhost '/' due to timeout
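
Whether that queue actually exists on the broker can be checked directly (a sketch, queue name taken from the error above;
rabbitmqctl list_queues defaults to the '/' vhost):

root@cloudcontrol1003:~# rabbitmqctl list_queues name | grep 'dhcp_agent.cloudnet1003'
# no output means the queue is not currently declared on the broker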

Looking further, I'm seeing log entries on rabbit like:

root@cloudcontrol1003:~# rabbitmq-diagnostics log_tail --number 1000 | grep -B1 'missed heartbeats'
...
2022-07-20 12:07:37.891 [error] <0.12446.6> closing AMQP connection <0.12446.6> (208.80.154.132:34386 -> 208.80.154.23:5671 - uwsgi:4088405:9f68ae4e-c5ee-47b6-a65c-8ae618b4cf88):
missed heartbeats from client, timeout: 60s

That might be one of the sources of broken connections; I will try to raise the heartbeat timeout value to see if that helps.
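
For context, the client-side heartbeat interval is an oslo.messaging setting; the puppet change below exposes it as a
parameter, and I believe the underlying option it ends up setting looks like this (a sketch of the relevant
nova.conf/neutron.conf section; the exact parameter name and value are assumptions):

[oslo_messaging_rabbit]
# seconds without a heartbeat before the client considers the connection dead; oslo's default is 60
heartbeat_timeout_threshold = 120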

Change 815705 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] rabbit: introduce the heartbeat_timeout param and double

https://gerrit.wikimedia.org/r/815705

Change 815705 merged by David Caro:

[operations/puppet@production] rabbit: introduce the heartbeat_timeout param and double

https://gerrit.wikimedia.org/r/815705

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T13:17:52Z] <dcaro> restarting the whole rabbit cluster (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T14:16:17Z] <dcaro> stopping rabbit on cloudcontrol1004, leaving only 1003 alive (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T15:51:38Z] <dcaro> things seem stable now with one rabbit node, trying to bring up a second (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T16:26:17Z] <dcaro> things seem stable, trying to bring up a third, cloudcontrol1005 (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T17:10:40Z] <dcaro> things seem stable, trying to bring up a fourth rabbit node, cloudcontrol1006 (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T18:02:40Z] <dcaro> things seem stable, trying to bring up the last rabbit node, cloudcontrol1007 (T313400)
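
For reference, bringing a rabbit node back into an existing cluster generally looks like this (a generic sketch with
illustrative hostnames, not the exact commands run here; the join_cluster step is only needed if the node had been reset
out of the cluster):

root@cloudcontrol1004:~# systemctl start rabbitmq-server
root@cloudcontrol1004:~# rabbitmqctl cluster_status
# only if the node no longer remembers the cluster:
root@cloudcontrol1004:~# rabbitmqctl stop_app
root@cloudcontrol1004:~# rabbitmqctl join_cluster rabbit@cloudcontrol1003
root@cloudcontrol1004:~# rabbitmqctl start_app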

dcaro moved this task from Today to Done on the User-dcaro board.