keystone admin api occasionally refuses connection
Closed, Resolved (Public)

Description

Ever since I moved keystone from the (deprecated) keystone eventlet handler to a (recommended) uwsgi frontend, the admin api has been refusing connections at unpredictable intervals. Adding more threads (I tried 4x as many) doesn't seem to resolve the issue.

So... something is broken or leaking.

Event Timeline

My test case is running the 'allprecise2.py' novastats script on labvirt1001. If I set up my env with novaobserver (and the public port, 5000), all is well. But with novaadmin creds (and the admin API port), it frequently errors out with 'connection refused'.

With the uwsgi deploy, what are the various choke points for parallel connections? Apache worker pool -> uwsgi worker pool -> uwsgi worker thread pool? Do we have any instrumentation that will show us when one of these pools is fully engaged?

I ran uwsgitop on this for a while, and there is definitely no lack of threads (I saw a failure when only 3 of 10 processes were busy.)
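For reference, uwsgitop reads its numbers from the uWSGI stats server, which emits a JSON document. Below is a minimal sketch of summarizing that document to check whether the worker pool or the listen queue is saturated. The field names (`workers`, `status`, `listen_queue`, `listen_queue_errors`) are assumed to match what the stats server emits, the function name is hypothetical, and the sample data is invented for illustration rather than captured from any labs host.

```python
def summarize_uwsgi_stats(stats):
    """Summarize a parsed uWSGI stats-server JSON document.

    Field names are assumptions about the stats output; missing
    fields are treated as zero/empty rather than raising.
    """
    workers = stats.get("workers", [])
    busy = sum(1 for w in workers if w.get("status") == "busy")
    return {
        "workers_total": len(workers),
        "workers_busy": busy,
        "listen_queue": stats.get("listen_queue", 0),
        "listen_queue_errors": stats.get("listen_queue_errors", 0),
    }


# Invented sample: 3 of 10 workers busy, nothing queued.
sample = {
    "listen_queue": 0,
    "listen_queue_errors": 0,
    "workers": [
        {"id": i + 1, "status": "busy" if i < 3 else "idle"}
        for i in range(10)
    ],
}
print(summarize_uwsgi_stats(sample))
```

If failures show up while the busy count is well below the total and the listen queue is empty, worker exhaustion is probably not the cause, which matches the observation above.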

So I'm not sure what's happening... going to try turning on some additional caching features to see if that helps.

Change 334680 had a related patch set uploaded (by Andrew Bogott):
Keystone: Turn on caching of tokens and catalog

https://gerrit.wikimedia.org/r/334680

Change 334680 merged by Andrew Bogott:
Keystone: Turn on caching of tokens and catalog

https://gerrit.wikimedia.org/r/334680

Yuvi's proposed fixes are: 1. http-socket -> http, 2. get rid of threads=

Also to use uwsgi::app and not service::uwsgi, since the latter is meant for use with public facing 'services' (like ores or striker) running behind varnish (and hence they use http-socket and not http).

The logging-related lines in the uwsgi config also seem to be no-ops, with no logs to be found there. There are some logs in /var/logs/upstart/uwsgi-keystone-admin.log, but not enough.

Change 334714 had a related patch set uploaded (by Andrew Bogott):
Keystone: use uwsgi::app instead of service::uwsgi

https://gerrit.wikimedia.org/r/334714

Change 334714 merged by Andrew Bogott:
Keystone: use uwsgi::app instead of service::uwsgi

https://gerrit.wikimedia.org/r/334714

Andrew renamed this task from keystone admin api easily overwhelmed to keystone admin api occasionally refuses connection. Jan 30 2017, 3:53 PM

So far I only ever see this in one situation, in a 'novastats' report I've been running. The python novaclient invokes keystoneclient to authorize a query, and keystoneclient requests a token, and the connection errors out.

The exact exception is mutilated by keystoneclient. The actual errors are mostly ('Connection aborted.', BadStatusLine("''",))

Of course outside of that context I can request tokens until I'm blue in the face with no errors.

It's possible that the uwsgi changes fixed most issues and I'm just hunting one weird corner case. It might be worth hanging out and seeing if this causes us trouble anywhere else.

Hm, interestingly I see the same symptom (periodic connection failures) with the public API, but it errors out differently, like this: ('Connection aborted.', error(104, 'Connection reset by peer'))

I've convinced myself that this is indeed a result of the uwsgi switch-over. We could always go back...

No complaints on that from me. Configuration convergence takes a back seat to actually working products. :)

I agree, but I'm discouraged that the damn eventlet engine (the one that works) yells at me about being deprecated, as per https://phabricator.wikimedia.org/T150774.

To demonstrate this issue on labtestcontrol2001:

$ source /usr/local/bin/observerenv.sh
$ while /tmp/hammer.py; do :; done

The failures happen MUCH more frequently on labcontrol1001 than on labtest, so it might be best to just start out testing there.
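The contents of /tmp/hammer.py are not shown in this task. As a rough stand-in, here is a minimal sketch of a script that makes repeated requests and tallies how each one fails. It issues plain HTTP GETs with the standard library rather than requesting tokens through keystoneclient, and the function names, labels, and defaults are all hypothetical.

```python
import http.client
from collections import Counter


def classify_attempt(host, port, path="/", timeout=5.0):
    """Make one HTTP GET and return a short label for the outcome."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        conn.getresponse().read()
        return "ok"
    except ConnectionRefusedError:
        return "refused"
    except http.client.RemoteDisconnected:
        # Server closed the socket without sending a status line
        # (Python 2 surfaced this as BadStatusLine("''")).
        # Must be caught before ConnectionResetError, its parent class.
        return "closed_without_response"
    except ConnectionResetError:
        return "reset_by_peer"
    except (OSError, http.client.HTTPException) as exc:
        return "other:" + type(exc).__name__
    finally:
        conn.close()


def hammer(host, port, attempts=100, path="/"):
    """Repeat classify_attempt and tally the outcomes."""
    return Counter(
        classify_attempt(host, port, path) for _ in range(attempts)
    )
```

Calling `hammer(host, port)` against the admin and public ports in turn would show whether the two APIs fail in the same way or, as reported above, with different errors.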

Change 339467 had a related patch set uploaded (by Andrew Bogott):
Keystone: Go back to using eventlet for Liberty

https://gerrit.wikimedia.org/r/339467

Change 339467 merged by Andrew Bogott:
Keystone: Go back to using eventlet for Liberty

https://gerrit.wikimedia.org/r/339467

I've rolled us back to eventlet, and the problem is resolved. I still don't know what the story was with uwsgi/keystone, but I suspect it was keystone bugs, and that they never really tested this despite the deprecation warnings.

The patches above will restore us to uwsgi in Mitaka; we'll see if these same problems return.