
2022-07-20 CloudVPS instability after network outage
Closed, Resolved · Public

Description

We are seeing sustained instability in the system in many ways:

  • RabbitMQ dropping connections in general (affecting nova-api, neutron)
  • Neutron agents breaking due to rabbitmq timeouts/connection issues
  • VMs failing to be created because their network was not getting created
  • VMs failing to be deleted due to lost messages

Event Timeline

dcaro triaged this task as High priority. Jul 20 2022, 11:34 AM
dcaro created this task.
dcaro added a parent task: T313382: asw2-c5-eqiad crash.

Things I've done and tried so far:

At first the rabbit cluster was split in two, cloudcontrol1005 on one side and the rest on the other; restarting 1003 made
1005 aware that there are other nodes (they showed as unreachable, but at least they showed, whereas before it was showing
no other nodes at all).

Restarting 1005 brought the cluster up and running.
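
As a sanity check (a generic sketch, the host is just an example), rabbitmqctl can show whether all the nodes agree on the
cluster membership and whether any partitions remain:

root@cloudcontrol1005:~# rabbitmqctl cluster_status
# a healthy cluster lists all the cloudcontrols as running and reports no network partitions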

Restarted all the neutron agents (on the cloudnets, cloudvirts and cloudcontrols) using cumin, as well as nova-api/nova-api-metadata.
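
For the record, the restarts were roughly of this shape (a sketch, not the exact invocations used; the cumin host, host
globs and unit names are illustrative and would need adjusting to the real fleet):

root@cumin1001:~# cumin 'cloudnet1*.eqiad.wmnet' 'systemctl restart neutron-l3-agent neutron-dhcp-agent'
root@cumin1001:~# cumin 'cloudvirt1*.eqiad.wmnet' 'systemctl restart neutron-linuxbridge-agent'
root@cumin1001:~# cumin 'cloudcontrol1*.eqiad.wmnet' 'systemctl restart neutron-server nova-api nova-api-metadata'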

Created a dashboard for rabbit to try to see if it was healthy:
https://grafana-rw.wikimedia.org/d/Kn5xm-gZk/wmcs-openstack-eqiad-rabbitmq-overview?orgId=1

Then some things started working again (removing VMs), but novafullstack kept failing when trying to create a VM because
the network for it was not being created.

Looking at the logs:
https://logstash.wikimedia.org/app/dashboards#/view/8aa679f0-d52e-11eb-81e9-e1226573bad4?_g=h@41d3bb7&_a=h@251785e

I saw some neutron agents breaking due to being unable to connect to rabbit; restarting those manually got them connected
again, and they did some work until another agent broke.

Currently I'm playing whack-a-mole with the services, but something is making the cluster unstable.
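
A quick way to see which agents are down at any given moment (a sketch; it needs admin credentials sourced, and the exact
"Alive" marker depends on the client version):

root@cloudcontrol1003:~# openstack network agent list -c Host -c Binary -c Alive -c State
# agents whose Alive column is not ":-)" have stopped reporting in, typically because they lost their rabbit connection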

Rabbit seems to have enough file descriptors (running out of them would be one reason for it to drop connections), judging
by the graphs and the process:
https://grafana-rw.wikimedia.org/d/Kn5xm-gZk/wmcs-openstack-eqiad-rabbitmq-overview?orgId=1

root@cloudcontrol1007:~# grep -i nofile /lib/systemd/system/rabbitmq-server.service
LimitNOFILE=65536

root@cloudcontrol1007:~# systemctl status rabbitmq-server.service  | grep rabbit
...
             ├─1010283 /usr/lib/erlang/erts-11.1.8/bin/beam.smp ...

root@cloudcontrol1007:~# grep -i 'open files' /proc/1010283/limits
Max open files            65536                65536                files

root@cloudcontrol1007:~# lsof -p 1010283 | wc
     72     675    9158
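
rabbit's own view of its file descriptor usage can be cross-checked too (a sketch; the exact output format depends on the
RabbitMQ version, hence the loose grep pattern):

root@cloudcontrol1007:~# rabbitmqctl status | grep -i -A3 'file.descriptors'
# shows rabbit's used file descriptor count and the limit it thinks it has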

On cloudnet1003, neutron-dhcp-agent broke again; the error is:

2022-07-20 11:42:40.554 3170279 ERROR oslo_service.service oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on cloudcontrol1005.wikimedia.org:5671 after inf tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'dhcp_agent.cloudnet1003' in vhost '/' due to timeout
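
Whether that queue actually exists on the broker can be checked directly (a sketch, queue name taken from the error above;
rabbitmqctl list_queues defaults to the '/' vhost):

root@cloudcontrol1003:~# rabbitmqctl list_queues name | grep 'dhcp_agent.cloudnet1003'
# no output means the queue is not currently declared on the broker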

Looking further, I'm seeing log entries on rabbit like:

root@cloudcontrol1003:~# rabbitmq-diagnostics log_tail --number 1000 | grep -B1 'missed heartbeats'
...
2022-07-20 12:07:37.891 [error] <0.12446.6> closing AMQP connection <0.12446.6> (208.80.154.132:34386 -> 208.80.154.23:5671 - uwsgi:4088405:9f68ae4e-c5ee-47b6-a65c-8ae618b4cf88):
missed heartbeats from client, timeout: 60s

That might be one of the sources of broken connections; I will try to raise the heartbeat timeout value to see if that helps.
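
For context, the client-side heartbeat interval is an oslo.messaging setting; the puppet change below exposes it as a
parameter, and I believe the underlying option it ends up setting looks like this (a sketch of the relevant
nova.conf/neutron.conf section; the exact parameter name and value are assumptions):

[oslo_messaging_rabbit]
# seconds without a heartbeat before the client considers the connection dead; oslo's default is 60
heartbeat_timeout_threshold = 120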

Change 815705 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] rabbit: introduce the heartbeat_timeout param and double

https://gerrit.wikimedia.org/r/815705

Change 815705 merged by David Caro:

[operations/puppet@production] rabbit: introduce the heartbeat_timeout param and double

https://gerrit.wikimedia.org/r/815705

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T13:17:52Z] <dcaro> restarting the whole rabbit cluster (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T14:16:17Z] <dcaro> stopping rabbit on cloudcontrol1004, leaving only 1003 alive (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T15:51:38Z] <dcaro> things seem stable now with one rabbit node, trying to bring up a second (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T16:26:17Z] <dcaro> things seem stable, trying to bring up a third, cloudcontrol1005 (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T17:10:40Z] <dcaro> things seem stable, trying to bring up a fourth rabbit node, cloudcontrol1006 (T313400)

Mentioned in SAL (#wikimedia-cloud) [2022-07-20T18:02:40Z] <dcaro> things seem stable, trying to bring up the last rabbit node, cloudcontrol1007 (T313400)
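
For reference, bringing a rabbit node back into an existing cluster generally looks like this (a generic sketch with
illustrative hostnames, not the exact commands run here; the join_cluster step is only needed if the node had been reset
out of the cluster):

root@cloudcontrol1004:~# systemctl start rabbitmq-server
root@cloudcontrol1004:~# rabbitmqctl cluster_status
# only if the node no longer remembers the cluster:
root@cloudcontrol1004:~# rabbitmqctl stop_app
root@cloudcontrol1004:~# rabbitmqctl join_cluster rabbit@cloudcontrol1003
root@cloudcontrol1004:~# rabbitmqctl start_app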

dcaro moved this task from Today to Done on the User-dcaro board.