Intermittent openstack API failures
Closed, Resolved
Public

Assigned To

Authored By

	Andrew
	Jul 14 2021, 8:52 PM

Description

I'm seeing flaky behavior in openstack -- some fullstack failures, some random horizon errors.

In logstash I see quite a few database issues: 'Deadlock: wsrep aborted transaction' and also some 'too many connections' complaints.

Details

Subject	Repo	Branch	Lines +/-
cloud galera: have haproxy shut down sessions when marked	operations/puppet	production	+1 -2
cloud galera: have haproxy shut down sessions when marked	operations/puppet	production	+1 -1
openstack galera: set monitor on failover	operations/puppet	production	+7 -0
Galera: increase number of allowed connections	operations/puppet	production	+1 -1


Event Timeline

Andrew created this task. Jul 14 2021, 8:52 PM

Restricted Application added a subscriber: Aklapper. Jul 14 2021, 8:52 PM

Change 704605 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Galera: increase number of allowed connections

https://gerrit.wikimedia.org/r/704605

gerritbot added a project: Patch-For-Review. Jul 14 2021, 8:53 PM

Here's my theory:

1. mysql on cloudcontrol1005 was flapping (for unknown reasons; it seems to have recovered on its own).
2. Most conductor threads went back to using cloudcontrol1005 after the flapping stopped, but some threads kept persistent connections to other mysql hosts.
3. Writes to the same table at the same time on different cloudcontrols = wsrep deadlocks.

There might be some HAProxy tuning we can do to prevent this.
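
To make the theory above concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not HAProxy's actual implementation: the `ToyProxy` class, the backup names `backup-a` and `backup-b`, and the `close_backup_sessions_on_recovery` flag are all hypothetical. HAProxy does offer the server options `on-marked-down shutdown-sessions` and `on-marked-up shutdown-backup-sessions`, but which setting the patches on this task use is not stated here.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    backup: bool = False
    up: bool = True


@dataclass
class ToyProxy:
    """Hypothetical stand-in for a proxy with one primary and backup servers."""
    servers: list
    close_backup_sessions_on_recovery: bool = False
    sessions: dict = field(default_factory=dict)  # client name -> server name

    def _get(self, name):
        return next(s for s in self.servers if s.name == name)

    def _pick(self):
        # Prefer a healthy primary; fall back to the first healthy backup.
        for want_backup in (False, True):
            for s in self.servers:
                if s.up and s.backup == want_backup:
                    return s.name
        raise RuntimeError("no server available")

    def connect(self, client):
        # Persistent connections: an existing session is reused as-is.
        if client not in self.sessions:
            self.sessions[client] = self._pick()
        return self.sessions[client]

    def mark_down(self, name):
        self._get(name).up = False
        # Sessions on the failed server are lost ("Connection reset by peer").
        self.sessions = {c: s for c, s in self.sessions.items() if s != name}

    def mark_up(self, name):
        server = self._get(name)
        server.up = True
        if self.close_backup_sessions_on_recovery and not server.backup:
            backups = {s.name for s in self.servers if s.backup}
            self.sessions = {c: s for c, s in self.sessions.items()
                             if s not in backups}

    def writers(self):
        # Set of database hosts currently receiving writes.
        return set(self.sessions.values())


def simulate_flap(close_on_recovery):
    proxy = ToyProxy(
        servers=[Server("cloudcontrol1005"),
                 Server("backup-a", backup=True),
                 Server("backup-b", backup=True)],
        close_backup_sessions_on_recovery=close_on_recovery,
    )
    threads = [f"conductor-{i}" for i in range(4)]
    for t in threads:
        proxy.connect(t)
    proxy.mark_down("cloudcontrol1005")
    for t in threads[:2]:          # two threads reconnect during the outage
        proxy.connect(t)
    proxy.mark_up("cloudcontrol1005")
    for t in threads:              # every thread issues its next query
        proxy.connect(t)
    return proxy.writers()


print(simulate_flap(False))  # two hosts take writes: conflicts are possible
print(simulate_flap(True))   # only cloudcontrol1005 takes writes
```

In this model, leaving old sessions alone after the primary recovers ends with writes going to two Galera nodes at once, which is the multi-writer condition the theory blames for the wsrep deadlocks. Closing the backup sessions on recovery forces every thread back onto the primary.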

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T21:08:44Z] <andrewbogott> restarting lots of openstack services while trying to resolve T286675

In haproxy I see:

Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 40ms. 0 active and 2 backup servers left. Running on backup. 800 sessions active, 0 requeued, 0 remaining in queue.

So I'm going to merge the attached patch in case that's the cause of the flap.
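
The health-check line quoted above packs several counters into one sentence. As a rough aid for pulling them out of logstash exports, here is a minimal Python sketch. The regular expression is written against this one message only and is an assumption about its shape, not a parser for the full HAProxy log format; the names `PATTERN` and `parse_health_line` are hypothetical.

```python
import re

# Matches the "Server <backend>/<server> is UP|DOWN ..." message quoted above.
PATTERN = re.compile(
    r"Server (?P<backend>[^/\s]+)/(?P<server>\S+) is (?P<state>UP|DOWN)"
    r".*?(?P<active>\d+) active and (?P<backup>\d+) backup servers left\."
    r".*?(?P<sessions>\d+) sessions active, (?P<requeued>\d+) requeued, "
    r"(?P<queued>\d+) remaining in queue\."
)

SAMPLE = (
    'Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, '
    'info: "Connection reset by peer", check duration: 40ms. '
    '0 active and 2 backup servers left. Running on backup. '
    '800 sessions active, 0 requeued, 0 remaining in queue.'
)


def parse_health_line(line):
    """Return the counters from a health-check message, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("active", "backup", "sessions", "requeued", "queued"):
        info[key] = int(info[key])
    return info


info = parse_health_line(SAMPLE)
print(info["server"], info["state"], info["sessions"])
```

For the quoted line this reports cloudcontrol1005.wikimedia.org as DOWN with 0 active servers, 2 backups, and 800 sessions active at the time of the check.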

Change 704605 merged by Andrew Bogott:

[operations/puppet@production] Galera: increase number of allowed connections

https://gerrit.wikimedia.org/r/704605

Andrew claimed this task. Jul 14 2021, 10:10 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 14 2021, 10:10 PM

Change 704638 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud galera: have haproxy shut down sessions when marked

https://gerrit.wikimedia.org/r/704638

gerritbot added a project: Patch-For-Review. Jul 14 2021, 10:52 PM

Change 704846 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] openstack galera: set monitor on failover