
DB operations in Labs hitting sqlalchemy limits
Closed, Resolved · Public

Description

I tried to create a bunch of instances today using the Debian Jessie testing image, and funky things happened:

  • Instances that are deleted and recreated with the same name never seem to come up right, even after a window of as much as an hour (tools-static-03 and 04).
  • Newly created instances go from the BUILD state to ERROR (tools-static-test, c1.m2.s80, now deleted).
  • Some new instances come up as ACTIVE, but I can't ssh into them or ping them (tools-static-test-01, m1.small, now deleted).
  • I deleted all my test instances on both the orch and tools projects; 2 of them were deleted, and 2 are stuck in an ERROR state.

Currently,

root@labcontrol1001:~# OS_TENANT_NAME=tools openstack server list | grep ERROR

625419a0-6098-45fc-bb05-7c482f80966c | tools-static-04 | ERROR | public=10.68.22.253

root@labcontrol1001:~# OS_TENANT_NAME=orch openstack server list | grep ERROR

c4f0336a-a9d2-4034-8edb-bf1eec80880a | test-puppet-fail | ERROR | public=10.68.22.46

Event Timeline

Restricted Application added a subscriber: Aklapper.
chasemp renamed this task from "Instance creation and deletion on Labs is funky" to "DB operations in Labs hitting sqlalchemy limits". Sep 1 2016, 2:49 PM

I looked into both instances and have an explanation for this round of issues :)

As part of T143938 I bumped up the nodepool load a small bit with https://gerrit.wikimedia.org/r/#/c/307526/. Previously we had some artificial deflation here due to quota management issues, addressed with https://gerrit.wikimedia.org/r/#/c/306683/. Together these patches mean nodepool is now actually doing close to what we assumed it was doing before them.

If I look at nova-conductor.log I see a lot of operations like:

2016-09-01 14:50:08.792 11200 INFO nova.db.sqlalchemy.api [req-377cce23-4e54-4e0f-b483-98f2288cb995 nodepoolmanager contintcloud - - -] quota_usages out of sync, updating. project_id: contintcloud, user_id: None, resource: fixed_ips, tracked usage: 26, actual usage: 30

I think with nodepool now using closer to its actual max-servers pool, plus the quota adjustments, we are using more of our db connection pool. The default limits are pretty low: a pool size of 5, a max overflow of 10, and a 30s timeout when waiting for a connection once the pool is saturated.

This results in errors like:

fault{"message": "Remote error: RemoteError Remote error: TimeoutError QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

That error is what I see in the case of tools-static-04.
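
To make the failure concrete, here is a minimal standalone sketch of how those pool defaults behave. It uses a throwaway SQLite file and an explicit QueuePool rather than the real nova database, and a shortened pool_timeout so it fails quickly; once pool_size + max_overflow connections are checked out, the next checkout waits pool_timeout seconds and then raises the same TimeoutError quoted in the fault above.

# Minimal sketch of QueuePool saturation; the SQLite URL and values below
# are stand-ins for illustration, not the nova database configuration.
from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite:///pool-demo.db",
    poolclass=QueuePool,
    pool_size=5,       # the default pool size
    max_overflow=10,   # the default overflow
    pool_timeout=2,    # the default is 30s; shortened so the demo fails fast
)

# Check out every connection the pool will hand out (5 + 10 = 15)...
held = [engine.connect() for _ in range(15)]

# ...then ask for one more: it waits pool_timeout seconds and raises
# "QueuePool limit of size 5 overflow 10 reached, connection timed out..."
try:
    engine.connect()
except PoolTimeoutError as exc:
    print(exc)
finally:
    for conn in held:
        conn.close()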

sqlalchemy pooling
http://docs.sqlalchemy.org/en/latest/core/pooling.html

nova.conf options for addressing this
http://docs.openstack.org/liberty/config-reference/content/list-of-compute-config-options.html#config_table_nova_conductor

Saturation of this pool can result in operations being dropped entirely. In this case no actual instance was ever created, so a reboot of the entry always resulted in an ERROR state; the compute node had no record of it, etc.

nova.conf does a few odd things here:

  • min_pool_size appears not to be a real option at all in practice.
  • max_pool_size is just pool_size for sqlalchemy.
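
To illustrate that mapping (with made-up numbers, not the values in the actual patch), nova.conf's max_pool_size, max_overflow, and pool_timeout correspond to sqlalchemy's create_engine() pool arguments roughly like this:

# Hypothetical values for illustration only; the real tuning is in the
# merged nova.conf change, and the DSN below is a placeholder.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://nova:secret@db-host/nova",   # placeholder DSN
    pool_size=15,      # nova.conf max_pool_size maps to sqlalchemy pool_size
    max_overflow=30,   # nova.conf max_overflow
    pool_timeout=60,   # nova.conf pool_timeout, in seconds
)

# There is no "min_pool_size" knob on sqlalchemy's QueuePool, which is
# presumably why nova.conf's min_pool_size has no real effect.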

I think we have been seeing this intermittently for a while and mistaking it for rabbitmq shenanigans.

Change 307956 had a related patch set uploaded (by Rush):
openstack: tune sqlalchemy pooling behavior

https://gerrit.wikimedia.org/r/307956

Change 307956 merged by Rush:
openstack: tune sqlalchemy pooling behavior

https://gerrit.wikimedia.org/r/307956

chasemp claimed this task.
chasemp triaged this task as High priority.

Fixes merged; let's see if we run into this again.