It has been reported by folks in #labs that Nodepool continually hammers the OpenStack API without regard for the maximum number of instances in use. Nodepool may not be respecting its configured limits under certain circumstances.
From the nodepool [[ http://docs.openstack.org/infra/system-config/nodepool.html | overview ]]:
> Nodepool spins up new instances and tears down old as tests are queued up and completed
> always maintaining a consistent number of available instances for tests up to the set limits of the CI infrastructure.
The confusing part is that, per both this description and nodepool's internals, nodepool appears to keep state on the VMs it has allocated: it periodically checks that VMs waiting in the wings are still in a good state, and it knows when a VM has been burned by testing and needs deletion. This kind of stateful, intelligent tracking seems to be the intention.
The outcome we are experiencing is that nodepool is constantly requesting new instances whether they can be granted or not.
Our web logs of nodepool's requests to nova-api (which cover many kinds of operations) show the following counts per HTTP status code:
```
3650  200
 313  202
 350  204
4998  403
```
From the first entry at `2016-08-16 06:53:01.732` to the last at `2016-08-16 13:43:11.391` we see 9,311 requests, of which 53% are 403s. I have not yet found a 403 that was caused by anything other than an exceeded quota, but I did find successful operations that are not instance creation. So for instance creation alone, the ratio of 403 to 200 is even more dramatic than these totals suggest.
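For reference, a tally like the one above can be produced with standard tools. This is a minimal sketch assuming the nova-api log format shown in the excerpts below; the sample lines here are abbreviated stand-ins, and on a real host you would grep the nova-api log file itself instead of a variable:

```shell
# Sample nova-api log lines (request ids shortened for brevity).
sample_log='2016-08-16 13:50:18.859 7186 INFO nova.osapi_compute.wsgi.server [req-901e nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.15
2016-08-16 13:50:20.010 7185 INFO nova.osapi_compute.wsgi.server [req-3940 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.14
2016-08-16 13:50:21.163 7191 INFO nova.osapi_compute.wsgi.server [req-b5c7 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 202 len: 780 time: 0.90'

# Restrict to server-creation POSTs, pull out the token after
# "status:", and count occurrences of each status code.
status_counts=$(printf '%s\n' "$sample_log" \
  | grep 'POST /v2/contintcloud/servers' \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "status:") print $(i + 1) }' \
  | sort | uniq -c | sort -rn)

printf '%s\n' "$status_counts"
```

Running the same pipeline against the full log, with and without the `POST /v2/contintcloud/servers` filter, is how to confirm that the 403s are concentrated in instance creation.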
```2016-08-16 13:50:18.859 7186 INFO nova.osapi_compute.wsgi.server [req-901eb118-947d-4f6c-a91c-882bd755905b nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1547401
2016-08-16 13:50:20.010 7185 INFO nova.osapi_compute.wsgi.server [req-394092df-3ee9-4480-8010-236f44756a62 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1461952
2016-08-16 13:50:21.163 7191 INFO nova.osapi_compute.wsgi.server [req-b5c7f22b-21a2-440e-b44a-e6a06a7410b5 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1480989
2016-08-16 13:50:22.320 7174 INFO nova.osapi_compute.wsgi.server [req-d77c0606-6ef2-4733-bca1-05eb5cfa0a5e nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1525040```
In the nodepool logs this shows up as:
```
2016-08-16 13:51:00,321 ERROR nodepool.NodeLauncher: Forbidden launching node id: 330718 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 392, in launchNode
    config_drive=self.image.config_drive)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 398, in createServer
    return self.submitTask(CreateServerTask(**create_args))
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 119, in submitTask
    return task.wait()
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 57, in run
    self.done(self.main(client))
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 116, in main
    server = client.servers.create(**self.args)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 900, in create
    **boot_kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 523, in _boot
    return_raw=return_raw, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 161, in _create
    _resp, body = self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 453, in post
    return self._cs_request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 428, in _cs_request
    resp, body = self._time_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 397, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 391, in request
    raise exceptions.from_response(resp, body, url, method)
Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
```
Which is happening //constantly//. So nodepool is raising exceptions in its own logs for an operation that it apparently considers part of normal instance catalogue management. I don't understand why this pattern is so constant, but I wonder whether nodepool's instance catalogue management is behaving irrationally. The really interesting thing is that we see only 7 instances in `ACTIVE` state right now, yet nova-api still believes the request cannot be fulfilled. My belief at the moment is that the request //rate// is not compatible with quota processing: before a VM can be released for reallocation, OpenStack is already being asked for new instances.
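If the rate hypothesis is right, the knobs that should bound this behaviour live in nodepool's per-provider configuration. A sketch with illustrative values (the key names follow nodepool's documented provider section, but the numbers here are assumptions, not our actual config):

```yaml
providers:
  - name: wmflabs-eqiad
    # Minimum delay, in seconds, between API tasks sent to this
    # provider. Raising it slows the request storm, but would not by
    # itself explain why nodepool keeps retrying a full quota.
    rate: 1.0
    # Hard cap on instances nodepool will try to hold in this
    # provider; it should not exceed the tenant's nova instance
    # quota (10 in our case).
    max-servers: 10
```

If `max-servers` here is set higher than the nova quota, nodepool would keep attempting launches that nova is guaranteed to reject with 403, which would match what we are seeing.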
From nodepool logs:
> grep -c 'Quota exceeded for instances' /var/log/nodepool/nodepool.log
9120
from `2016-08-16 00:00:00` to `2016-08-16 14:04:37`
Or a better view: errant (ERROR) operations from nodepool's point of view vs. INFO for this same period.
> grep -c ERROR /var/log/nodepool/nodepool.log
9207
> grep -c INFO /var/log/nodepool/nodepool.log
36388
total: 45595
So roughly 20% of all operations nodepool logged over the last 14 hours of communication with `nova-api` are ERROR (9,207 of 45,595).
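The arithmetic behind that ratio, spelled out from the grep counts above:

```shell
# Error ratio from the grep -c counts above.
errors=9207                      # grep -c ERROR
infos=36388                      # grep -c INFO
total=$((errors + infos))        # all counted log lines
pct=$((100 * errors / total))    # integer percentage
echo "${errors} of ${total} operations (${pct}%) are ERROR"
```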