Folks in Cloud-Services have reported that Nodepool continually hammers the OpenStack API without regard for the maximum number of instances in use; it appears Nodepool may not be respecting its configured limits under certain circumstances.
From the Nodepool overview:
Nodepool spins up new instances and tears down old as tests are queued up and completed
always maintaining a consistent number of available instances for tests up to the set limits of the CI infrastructure.
The confusing part is that, per that copy and nodepool's internals, nodepool appears to keep state on the VMs it has allocated: it periodically checks that VMs waiting in the wings are still in an OK state, and it knows when a VM has been burned by testing and needs deletion. This kind of stateful, intelligent tracking seems to be the intention.
The outcome we are experiencing is that nodepool is constantly requesting new instances whether they can be granted or not.
Our web logs of nova-api requests from nodepool (which include a lot of operations) break down as follows by HTTP status code:
 count  status
  3650  200
   313  202
   350  204
  4998  403
From the first entry at 2016-08-16 06:53:01.732 to the last at 2016-08-16 13:43:11.391 we see 9,311 requests, of which 53% are 403s. I have not yet found a 403 that was anything other than a quota rejection (a nova-side sample is below; a note on the nodepool logs follows further down), but I did find a lot of successful operations that are not instance creation, so the 200-vs-403 ratio for instance creation specifically is even more dramatic than the table suggests.
2016-08-16 13:50:18.859 7186 INFO nova.osapi_compute.wsgi.server [req-901eb118-947d-4f6c-a91c-882bd755905b nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1547401
2016-08-16 13:50:20.010 7185 INFO nova.osapi_compute.wsgi.server [req-394092df-3ee9-4480-8010-236f44756a62 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1461952
2016-08-16 13:50:21.163 7191 INFO nova.osapi_compute.wsgi.server [req-b5c7f22b-21a2-440e-b44a-e6a06a7410b5 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1480989
2016-08-16 13:50:22.320 7174 INFO nova.osapi_compute.wsgi.server [req-d77c0606-6ef2-4733-bca1-05eb5cfa0a5e nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1525040
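For the curious, the status-code tally above can be reproduced with a small script over lines in that format. A minimal sketch, assuming a nova-api log path and filtering on the 10.64.20.18 source address seen in the sample (both are assumptions, adjust as needed):

```
# Rough tally of HTTP status codes for requests from nodepool, based on
# nova-api log lines in the format shown above. The log path and the
# 10.64.20.18 source address are assumptions taken from the sample.
import re
from collections import Counter

status_re = re.compile(r'status: (\d{3})\b')

counts = Counter()
with open('/var/log/nova/nova-api.log') as log:
    for line in log:
        if '10.64.20.18' not in line:
            continue
        match = status_re.search(line)
        if match:
            counts[match.group(1)] += 1

for status, count in sorted(counts.items()):
    print('%5d  %s' % (count, status))
```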
The nodepool logs confirm these 403s are quota rejections:
2016-08-16 13:51:00,321 ERROR nodepool.NodeLauncher: Forbidden launching node id: 330718 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 392, in launchNode
    config_drive=self.image.config_drive)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 398, in createServer
    return self.submitTask(CreateServerTask(**create_args))
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 119, in submitTask
    return task.wait()
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 57, in run
    self.done(self.main(client))
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 116, in main
    server = client.servers.create(**self.args)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 900, in create
    **boot_kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 523, in _boot
    return_raw=return_raw, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 161, in _create
    _resp, body = self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 453, in post
    return self._cs_request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 428, in _cs_request
    resp, body = self._time_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 397, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 391, in request
    raise exceptions.from_response(resp, body, url, method)
Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
This is happening constantly. So nodepool is raising exceptions in its own logs for an operation that it seemingly considers part of normal instance catalogue management? I don't understand why this pattern is so consistent, but I'm wondering whether nodepool's instance catalogue management is behaving irrationally. The really interesting thing is that we see only 7 instances in ACTIVE state right now, yet nova-api clearly believes the request cannot be fulfilled. My belief at the moment is that the request /rate/ is not compatible with quota processing: before a VM can be released for reallocation, openstack is already being asked for new instances.
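For comparison, nova itself will tell a client how much headroom is left before a create is doomed to 403. A minimal sketch of that kind of pre-flight check using python-novaclient's limits API (credentials and URL are placeholders, and this is emphatically not what nodepool does today, just a way to see the accounting nova is enforcing):

```
# Sketch of a pre-flight quota check against nova's limits API, the same
# accounting that produces the 403s above. Credentials/URL are placeholders.
from novaclient import client as nova_client

def instance_headroom(nova):
    """How many more instances nova believes this tenant may create."""
    absolute = {limit.name: limit.value for limit in nova.limits.get().absolute}
    return absolute['maxTotalInstances'] - absolute['totalInstancesUsed']

nova = nova_client.Client('2', 'nodepoolmanager', 'PASSWORD',
                          'contintcloud', 'https://keystone.example.org:5000/v2.0')

headroom = instance_headroom(nova)
if headroom > 0:
    print('room for %d more instances' % headroom)
else:
    print('at quota; a servers.create() would 403 exactly as in the log above')
```

If nodepool consulted something like this before each create (or simply trusted its own accounting of allocated VMs), the thousands of doomed POSTs above would largely disappear.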
From nodepool logs:
grep -c 'Quota exceeded for instances' /var/log/nodepool/nodepool.log
9120
from 2016-08-16 00:00:00 to 2016-08-16 14:04:37
Or, for a better view, errant (ERROR) operations from nodepool's point of view vs. INFO over this same period:
grep -c ERROR /var/log/nodepool/nodepool.log
9207
grep -c INFO /var/log/nodepool/nodepool.log
36388
total: 45595
So roughly 20% of everything nodepool logged over the last 14 hours of communication with nova-api (9,207 of 45,595 lines) is an ERROR.
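The arithmetic on those grep counts, for the record:

```
errors, infos = 9207, 36388
total = errors + infos        # 45595
print(errors / float(total))  # ~0.20 -> about a fifth of all log lines are ERROR
print(errors / float(infos))  # ~0.25 -> ERROR lines are ~25% as numerous as INFO lines
```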
Of these 9,207 errors, 81 appear not to be quota rejections, and of those 81 the vast majority are some version of a failure to delete a node:
2016-08-16 14:10:06,359 ERROR nodepool.NodeDeleter: Exception deleting node 278848:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 297, in run
    self.nodepool._deleteNode(session, node)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 2159, in _deleteNode
    manager.waitForServerDeletion(node.external_id)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
    (server_id, self.provider.name)):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 42, in iterate_timeout
    raise Exception("Timeout waiting for %s" % purpose)
Exception: Timeout waiting for server 13d596d0-a3a7-473c-a967-d6dea9de8678 deletion in wmflabs-eqiad
As far as I understand, this is a nodepool client-side timeout error: it waited too long for the deletion and gave up. It is also caught in the nodepool debug log:
File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion raise Exception("Timeout waiting for %s" % purpose)
The other issue we discovered during the outage referenced in the incident report is nodepool giving instances only about one second to move out of BUILD state before deleting them, a pace which openstack cannot always honor.
2016-08-16 14:20:47,356 DEBUG nodepool.NodePool: Deleting node id: 331024 which has been in building state for 0.00037676135699 hours
2016-08-16 14:20:57,059 DEBUG nodepool.NodePool: Deleting node id: 331025 which has been in building state for 0.000294238328934 hours
2016-08-16 14:21:07,803 DEBUG nodepool.NodePool: Deleting node id: 331026 which has been in building state for 0.000223197208511 hours
2016-08-16 14:21:17,657 DEBUG nodepool.NodePool: Deleting node id: 331027 which has been in building state for 0.000182528363334 hours
2016-08-16 14:21:28,011 DEBUG nodepool.NodePool: Deleting node id: 331028 which has been in building state for 0.000280977752474 hours
2016-08-16 14:21:38,433 DEBUG nodepool.NodePool: Deleting node id: 331029 which has been in building state for 0.000120452245076 hours
2016-08-16 14:21:49,154 DEBUG nodepool.NodePool: Deleting node id: 331030 which has been in building state for 0.000320818887817 hours
2016-08-16 14:21:59,465 DEBUG nodepool.NodePool: Deleting node id: 331031 which has been in building state for 0.000129256116019 hours
2016-08-16 14:22:09,812 DEBUG nodepool.NodePool: Deleting node id: 331032 which has been in building state for 0.000225731399324 hours
2016-08-16 14:22:10,984 DEBUG nodepool.NodePool: Deleting node id: 331033 which has been in building state for 0.000551279717022 hours
The exact example we were seeing is:
DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0.000289699700144 hours
thcipriani> 1.04 seconds
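Those "hours" figures convert to roughly a second apiece, which is where the 1.04 comes from:

```
hours_in_building_state = 0.000289699700144  # from the debug line above
print(hours_in_building_state * 3600)        # ~1.043 seconds
```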
The issue during the outage was a failure on the openstack side to honor instance creation in time; AFAICT nodepool was so aggressive in its instance management that it was tripping over its own timeouts. The CI issues appear to have started with the merging of https://gerrit.wikimedia.org/r/#/c/304047/, which allows more time for operations since we were hitting the old limit with our large src group rulesets. CI was one of the projects affected, even though it has far fewer instances than any of the others. The few pieces of evidence we have indicate that nodepool was choking on the new, correct, slower processing of instances being created in contintcloud. The "fix" was to remove the src group rules from the project, after which instance spin-up was (seemingly) /barely/ able to complete within the provided timeouts.