
DB operations in Labs hitting sqlalchemy limits
Closed, Resolved · Public

Description

I tried to create a bunch of instances today using the Debian Jessie testing image, and funky things happened:

  • Instances that are deleted and recreated with the same name never seem to come up right, even after a window of as much as an hour (tools-static-03 and 04).
  • Newly created instances go from the BUILD state to ERROR (tools-static-test, c1.m2.s80, now deleted).
  • Some new instances come up as ACTIVE, but I can't ssh into them or ping them (tools-static-test-01, m1.small, now deleted).
  • I deleted all my test instances on both the orch and tools projects; 2 of them were deleted, and 2 are stuck in an ERROR state.

Currently,

root@labcontrol1001:~# OS_TENANT_NAME=tools openstack server list | grep ERROR

625419a0-6098-45fc-bb05-7c482f80966c | tools-static-04 | ERROR | public=10.68.22.253

root@labcontrol1001:~# OS_TENANT_NAME=orch openstack server list | grep ERROR

c4f0336a-a9d2-4034-8edb-bf1eec80880a | test-puppet-fail | ERROR | public=10.68.22.46

Event Timeline

Restricted Application added a subscriber: Aklapper.
chasemp renamed this task from "Instance creation and deletion on Labs is funky" to "DB operations in Labs hitting sqlalchemy limits". Sep 1 2016, 2:49 PM

I looked into both instances and have an explanation for this round of issues :)

As part of T143938 I bumped up the nodepool load a small bit with https://gerrit.wikimedia.org/r/#/c/307526/. Previously we had some artificial deflation here due to quota management issues, addressed with https://gerrit.wikimedia.org/r/#/c/306683/. Together these patches mean nodepool is now actually doing close to what we assumed it was doing before them.

If I look at nova-conductor.log I see a lot of operations like:

2016-09-01 14:50:08.792 11200 INFO nova.db.sqlalchemy.api [req-377cce23-4e54-4e0f-b483-98f2288cb995 nodepoolmanager contintcloud - - -] quota_usages out of sync, updating. project_id: contintcloud, user_id: None, resource: fixed_ips, tracked usage: 26, actual usage: 30

I think with nodepool now using closer to its actual max-servers pool, plus the quota adjustments, we are using more of our db connection pool. The default limits are pretty low: a pool size of 5, a max overflow of 10, and a 30s timeout when waiting for a connection once the pool is saturated.

This results in errors like:

fault{"message": "Remote error: RemoteError Remote error: TimeoutError QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

That error is what I see in the case of tools-static-04.
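
To make the failure concrete, here is a minimal standalone sketch of how those pool defaults behave. It uses a throwaway SQLite file and an explicit QueuePool rather than the real nova database, and a shortened pool_timeout so it fails quickly; once pool_size + max_overflow connections are checked out, the next checkout waits pool_timeout seconds and then raises the same TimeoutError quoted in the fault above.

# Minimal sketch of QueuePool saturation; the SQLite URL and values below
# are stand-ins for illustration, not the nova database configuration.
from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite:///pool-demo.db",
    poolclass=QueuePool,
    pool_size=5,       # the default pool size
    max_overflow=10,   # the default overflow
    pool_timeout=2,    # the default is 30s; shortened so the demo fails fast
)

# Check out every connection the pool will hand out (5 + 10 = 15)...
held = [engine.connect() for _ in range(15)]

# ...then ask for one more: it waits pool_timeout seconds and raises
# "QueuePool limit of size 5 overflow 10 reached, connection timed out..."
try:
    engine.connect()
except PoolTimeoutError as exc:
    print(exc)
finally:
    for conn in held:
        conn.close()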

sqlalchemy pooling
http://docs.sqlalchemy.org/en/latest/core/pooling.html

nova.conf options for addressing this
http://docs.openstack.org/liberty/config-reference/content/list-of-compute-config-options.html#config_table_nova_conductor

Saturation of this pool can result in operations being dropped entirely. In this case no actual instance was ever created, so a reboot of the entry always resulted in an ERROR state; the compute node had no record of it, etc.

nova.conf does a few odd things here:

  • min_pool_size appears not to be a real option at all in practice.
  • max_pool_size is just pool_size for sqlalchemy.
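
To illustrate that mapping (with made-up numbers, not the values in the actual patch), nova.conf's max_pool_size, max_overflow, and pool_timeout correspond to sqlalchemy's create_engine() pool arguments roughly like this:

# Hypothetical values for illustration only; the real tuning is in the
# merged nova.conf change, and the DSN below is a placeholder.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://nova:secret@db-host/nova",   # placeholder DSN
    pool_size=15,      # nova.conf max_pool_size maps to sqlalchemy pool_size
    max_overflow=30,   # nova.conf max_overflow
    pool_timeout=60,   # nova.conf pool_timeout, in seconds
)

# There is no "min_pool_size" knob on sqlalchemy's QueuePool, which is
# presumably why nova.conf's min_pool_size has no real effect.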

I think we have been seeing this intermittently for a while and mistaking it for rabbitmq shenanigans.

Change 307956 had a related patch set uploaded (by Rush):
openstack: tune sqlalchemy pooling behavior

https://gerrit.wikimedia.org/r/307956

Change 307956 merged by Rush:
openstack: tune sqlalchemy pooling behavior

https://gerrit.wikimedia.org/r/307956

chasemp claimed this task.
chasemp triaged this task as High priority.

Fixes merged; let's see if we run into this again.