Cloudcontrol split brain issues
Closed, Declined · Public

Assigned To

None

Authored By

	• Bstorm
	Sep 24 2020, 5:08 PM

Description

We were paged for many services, starting with the labs-ip-alias-dump process at Thu Sep 24 14:40:04 UTC 2020. The common symptom was timeouts and failures when contacting various OpenStack services, e.g.:

Sep 24 15:03:09 cloudcontrol1004 wmcs-dns-floating-ip-updater[28525]: requests.exceptions.HTTPError: 504 Server Error: GATEWAY TIMEOUT for url: http://openstack.eqiad1.wikimediacloud.org:9001/v2/zones

We initially suspected everything from haproxy to firewalls, but as of this ticket's creation we have not identified a root cause. The actual failures in OpenStack services were clearly caused by concurrent split brains in both Galera and RabbitMQ at around the same time (about 14:48 UTC). During that time, the services reported a network partition, and TCP retransmit errors appeared:

Screen Shot 2020-09-24 at 10.06.15 AM.png (1×2 px, 270 KB)

Naturally, network and other activity went down. Memory saturation also spiked:

Screen Shot 2020-09-24 at 10.07.31 AM.png (1×2 px, 296 KB)

Memory utilization remained stable.

That roughly describes what the event looked like. Hopefully we can refer back to this ticket, update it as we learn more, and keep this from happening again.
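As a rough illustration of what a partition check for the Galera side could look like, here is a minimal sketch. It assumes the per-node values of the standard Galera status variables (`wsrep_cluster_status`, `wsrep_cluster_size`, `wsrep_cluster_state_uuid`) have already been collected, for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'` on each node. The function name, the node names and the sample values are hypothetical and are not taken from this incident.

```python
from typing import Dict, List


def galera_partition_warnings(
    node_status: Dict[str, Dict[str, str]], expected_size: int
) -> List[str]:
    """Return warnings for nodes whose Galera status suggests a partition.

    node_status maps a node name to its wsrep_* status variables.
    expected_size is the number of nodes the cluster should have.
    """
    warnings = []
    for node, status in sorted(node_status.items()):
        # A healthy node belongs to the primary component.
        cluster_status = status.get("wsrep_cluster_status", "unknown")
        if cluster_status != "Primary":
            warnings.append(f"{node}: wsrep_cluster_status is {cluster_status}")
        # Each node should see every other member of the cluster.
        size = int(status.get("wsrep_cluster_size", 0))
        if size != expected_size:
            warnings.append(f"{node}: sees {size} of {expected_size} nodes")
    # Nodes reporting different state UUIDs are not in the same cluster.
    uuids = {s.get("wsrep_cluster_state_uuid") for s in node_status.values()}
    if len(uuids) > 1:
        warnings.append("nodes report different wsrep_cluster_state_uuid values")
    return warnings


# Hypothetical sample: node-c has lost contact with the other two nodes.
sample = {
    "node-a": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2",
               "wsrep_cluster_state_uuid": "uuid-1"},
    "node-b": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2",
               "wsrep_cluster_state_uuid": "uuid-1"},
    "node-c": {"wsrep_cluster_status": "non-Primary", "wsrep_cluster_size": "1",
               "wsrep_cluster_state_uuid": "uuid-1"},
}
for line in galera_partition_warnings(sample, expected_size=3):
    print(line)
```

RabbitMQ would need a separate check; `rabbitmqctl cluster_status` reports detected network partitions, but parsing that output is left out of this sketch.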

Event Timeline

• Bstorm triaged this task as High priority.
	Sep 24 2020, 5:08 PM

• Bstorm created this task.

Restricted Application added a subscriber: Aklapper.
	Sep 24 2020, 5:08 PM

aborrero subscribed.
	Sep 25 2020, 11:23 AM

Additional context: some operations were ongoing in the eqiad datacenter when this issue happened. I haven't had time yet to investigate exactly what happened, but a prod DNS server might have been down.

Noting this here for historical reference for now.

	F32362353: Screen Shot 2020-09-24 at 10.07.31 AM.png
	Sep 24 2020, 5:08 PM

	F32362347: Screen Shot 2020-09-24 at 10.06.15 AM.png
	Sep 24 2020, 5:08 PM
