Intermittent openstack API failures
Closed, Resolved (Public)

Description

I'm seeing flaky behavior in OpenStack -- some fullstack failures, some random Horizon errors.

In logstash I see quite a few database issues: 'Deadlock: wsrep aborted transaction' and also some 'too many connections' complaints.

Event Timeline

Change 704605 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Galera: increase number of allowed connections

https://gerrit.wikimedia.org/r/704605

Here's my theory:

  • mysql on cloudcontrol1005 was flapping (for unknown reasons; it seems to have recovered on its own)
  • most conductor threads went back to using cloudcontrol1005 after the flapping stopped, but some threads kept persistent connections to other mysql hosts
  • simultaneous writes to the same table via different cloudcontrols = wsrep deadlocks

There might be some HAProxy tuning we can do to prevent this.
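
The theory above can be sketched as a toy model. This is only an illustration of the reasoning, not of the real setup: the `Proxy` class, the `kill_on_failback` switch, the `backup-a`/`backup-b` host names and the `conductor-N` client names are all made up for the example. It models an active/backup proxy where a client keeps its backend until its session is closed, and shows that after a flap the writers end up split across two database nodes unless sessions on the backup are dropped when the primary returns.

```python
from dataclasses import dataclass, field


@dataclass
class Proxy:
    """Toy active/backup proxy: new sessions go to the first healthy server."""
    servers: list            # ordered: primary first, then backups
    kill_on_failback: bool   # hypothetical switch: drop backup sessions on recovery
    up: dict = field(default_factory=dict)
    sessions: dict = field(default_factory=dict)  # client name -> server name

    def __post_init__(self):
        self.up = {s: True for s in self.servers}

    def pick(self):
        # First healthy server in priority order.
        return next(s for s in self.servers if self.up[s])

    def connect(self, client):
        # Persistent connection: reuse the existing session if there is one.
        if client not in self.sessions:
            self.sessions[client] = self.pick()
        return self.sessions[client]

    def mark_down(self, server):
        self.up[server] = False
        # Sessions to a dead server are lost.
        self.sessions = {c: s for c, s in self.sessions.items() if s != server}

    def mark_up(self, server):
        self.up[server] = True
        if self.kill_on_failback:
            # Keep only sessions that already point at the preferred server.
            preferred = self.pick()
            self.sessions = {c: s for c, s in self.sessions.items() if s == preferred}


def writers_after_flap(kill_on_failback):
    """Return the sorted set of servers receiving writes after one flap."""
    proxy = Proxy(["cloudcontrol1005", "backup-a", "backup-b"], kill_on_failback)
    clients = [f"conductor-{i}" for i in range(6)]
    for c in clients:
        proxy.connect(c)                 # everyone starts on the primary
    proxy.mark_down("cloudcontrol1005")
    for c in clients[:2]:
        proxy.connect(c)                 # two threads reconnect during the outage
    proxy.mark_up("cloudcontrol1005")
    for c in clients:
        proxy.connect(c)                 # the rest reconnect after recovery
    return sorted(set(proxy.sessions.values()))


print(writers_after_flap(False))  # ['backup-a', 'cloudcontrol1005']
print(writers_after_flap(True))   # ['cloudcontrol1005']
```

HAProxy does have server options in this spirit (`on-marked-down shutdown-sessions` and `on-marked-up shutdown-backup-sessions`); whether either one is the right tuning here is an open question at this point.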

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T21:08:44Z] <andrewbogott> restarting lots of openstack services while trying to resolve T286675

In haproxy I see:

Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 40ms. 0 active and 2 backup servers left. Running on backup. 800 sessions active, 0 requeued, 0 remaining in queue.

so I'm going to merge the attached patch, in case hitting the connection limit is what caused the flap.
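
For anyone wanting to watch for this condition, here is a rough sketch of pulling the interesting numbers out of a haproxy DOWN message like the one above. The regex is fitted to that single sample line only, so other haproxy versions or message variants may not match; `HAPROXY_DOWN` and `parse_down_event` are names invented for this example.

```python
import re

# Pattern fitted to the one sample line quoted in this task (an assumption,
# not a complete grammar for haproxy log messages).
HAPROXY_DOWN = re.compile(
    r"Server (?P<backend>[^/\s]+)/(?P<server>\S+) is DOWN, "
    r"reason: (?P<reason>[^,]+), .*?"
    r"(?P<active>\d+) active and (?P<backup>\d+) backup servers left\."
    r".*?(?P<sessions>\d+) sessions active"
)


def parse_down_event(line):
    """Return a dict describing a DOWN event, or None if the line does not match."""
    match = HAPROXY_DOWN.search(line)
    if match is None:
        return None
    event = match.groupdict()
    # Convert the numeric fields so they can be compared against thresholds.
    for key in ("active", "backup", "sessions"):
        event[key] = int(event[key])
    return event
```

A caller could then flag any event where `active` is 0, meaning all traffic is running on backup servers.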

Change 704605 merged by Andrew Bogott:

[operations/puppet@production] Galera: increase number of allowed connections

https://gerrit.wikimedia.org/r/704605

Change 704638 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud galera: have haproxy shut down sessions when marked

https://gerrit.wikimedia.org/r/704638

Change 704846 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] openstack galera: set monitor on failover

https://gerrit.wikimedia.org/r/704846

Change 704846 merged by Bstorm:

[operations/puppet@production] openstack galera: set monitor on failover

https://gerrit.wikimedia.org/r/704846

Change 704638 merged by Bstorm:

[operations/puppet@production] cloud galera: have haproxy shut down sessions when marked

https://gerrit.wikimedia.org/r/704638

Change 705507 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud galera: have haproxy shut down sessions when marked

https://gerrit.wikimedia.org/r/705507

Change 705507 merged by Bstorm:

[operations/puppet@production] cloud galera: have haproxy shut down sessions when marked

https://gerrit.wikimedia.org/r/705507

In theory the cause of this is now resolved.