Openstack API slowdowns
Closed, Resolved · Public

Description

In the last few weeks the eqiad1 OpenStack APIs have become very slow. Running cumin on all hosts (which enumerates projects) now takes >16 minutes, and Horizon panel loads are extremely tedious.

This might be due to increased traffic or some side effect of the RBAC work (T330759), although I doubt the latter.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 918475 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] haproxy: increase http check interval

https://gerrit.wikimedia.org/r/918475

Query times are quite variable, which suggests a load issue:

root@cloudcontrol1005:~/foo# time ./bar.py 

real	0m21.822s
user	0m1.836s
sys	0m0.160s
root@cloudcontrol1005:~/foo# 
root@cloudcontrol1005:~/foo# 
root@cloudcontrol1005:~/foo# time ./bar.py 

real	1m5.314s
user	0m1.905s
sys	0m0.117s
root@cloudcontrol1005:~/foo# cat bar.py 
#!/usr/bin/python3

import mwopenstackclients

for i in range(0,10):
    mwopenstackclients.clients(oscloud='novaadmin').allinstances(projectid='testlabs')

Change 918475 merged by Andrew Bogott:

[operations/puppet@production] haproxy: increase http check interval

https://gerrit.wikimedia.org/r/918475

Change 918488 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Keystone: double the number of worker procs.

https://gerrit.wikimedia.org/r/918488

Change 918488 merged by Andrew Bogott:

[operations/puppet@production] Keystone: double the number of worker procs.

https://gerrit.wikimedia.org/r/918488

Change 918515 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova-api: increase the number of nova-api workers 3x

https://gerrit.wikimedia.org/r/918515

Change 918515 merged by Andrew Bogott:

[operations/puppet@production] nova-api: increase the number of nova-api workers 3x

https://gerrit.wikimedia.org/r/918515

Change 918522 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack eqiad1 galera: make cloudcontrol1007 the database primary

https://gerrit.wikimedia.org/r/918522

Change 918522 merged by Andrew Bogott:

[operations/puppet@production] Openstack eqiad1 galera: make cloudcontrol1006 the database primary

https://gerrit.wikimedia.org/r/918522

Mentioned in SAL (#wikimedia-cloud) [2023-05-10T15:52:43Z] <andrewbogott> running "cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.restart_openstack --cluster-name eqiad1 --all" to pick up changes for testing T336379

Things are not yet as fast as I'd like them to be, but they are much faster. The cumin run has gone from 16 minutes to 1.5 minutes.

Change 918586 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack Neutron: double number of api workers

https://gerrit.wikimedia.org/r/918586

Change 918586 merged by Andrew Bogott:

[operations/puppet@production] Openstack Neutron: double number of api workers

https://gerrit.wikimedia.org/r/918586

root@cloudcontrol1005:~/foo# time ./bar.py 

real	0m7.205s
user	0m1.851s
sys	0m0.131s

root@cloudcontrol1005:~/foo# time ./bar.py

real 0m14.480s
user 0m1.966s
sys 0m0.176s
root@cloudcontrol1005:~/foo#
root@cloudcontrol1005:~/foo#
root@cloudcontrol1005:~/foo# time ./bar.py

real 0m11.460s
user 0m1.878s
sys 0m0.156s
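
For a rough before/after comparison, the wall-clock numbers pasted in this task can be averaged. This is only a back-of-the-envelope sketch: the sample sizes are tiny (two runs before, three after), and the variable names are made up for illustration.

```python
#!/usr/bin/python3
import statistics

# "real" times in seconds for ./bar.py, copied from the runs pasted in this task.
before = [21.822, 65.314]         # before the worker and database changes
after = [7.205, 14.480, 11.460]   # the three runs pasted at the end of the task

CALLS_PER_RUN = 10  # bar.py loops over range(0, 10)

before_mean = statistics.mean(before)
after_mean = statistics.mean(after)
speedup = before_mean / after_mean

# Cumin run times in minutes, as reported in the task.
cumin_speedup = 16 / 1.5

print(f"bar.py mean before: {before_mean:.1f}s ({before_mean / CALLS_PER_RUN:.2f}s per call)")
print(f"bar.py mean after:  {after_mean:.1f}s ({after_mean / CALLS_PER_RUN:.2f}s per call)")
print(f"bar.py speedup:     {speedup:.1f}x")
print(f"cumin speedup:      {cumin_speedup:.1f}x")
```

The after-change runs still range from about 7 to 14 seconds, so some run-to-run variability remains even though the totals are much lower.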