
Nodepool instance instance creation quota management
Closed, Resolved · Public

Description

It has been reported by folks in Cloud-Services that Nodepool continually hammers the OpenStack api without regard for the max instances in use. Nodepool may not be respecting configuration limits under certain circumstances.

From the Nodepool overview:

Nodepool spins up new instances and tears down old as tests are queued up and completed
always maintaining a consistent number of available instances for tests up to the set limits of the CI infrastructure.

The confusing part is that, per that description and Nodepool's internals, it would seem that Nodepool keeps state on the VMs it has allocated. It periodically ensures VMs waiting in the wings are still in an OK state, and knows when a VM has been burned by testing and needs deletion. This kind of stateful, intelligent tracking seems to be the intention.

The outcome we are experiencing is that nodepool is constantly requesting new instances whether they can be granted or not.

Our web logs for nova-api requests from Nodepool (which include a lot of operations) show the following HTTP status code counts:

3650 200
 313 202
 350 204
4998 403

From the first entry at 2016-08-16 06:53:01.732 to the last at 2016-08-16 13:43:11.391 we see 9,311 requests, of which 53% are 403. So far I haven't found examples of operations generating a 403 that were not quota failures on the nova side (a note on the Nodepool logs is below), but I did find a lot of successful operations that are not instance creation. So the ratio of 200 vs 403 for instance creation specifically is even more dramatic than this.

2016-08-16 13:50:18.859 7186 INFO nova.osapi_compute.wsgi.server [req-901eb118-947d-4f6c-a91c-882bd755905b nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1547401
2016-08-16 13:50:20.010 7185 INFO nova.osapi_compute.wsgi.server [req-394092df-3ee9-4480-8010-236f44756a62 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1461952
2016-08-16 13:50:21.163 7191 INFO nova.osapi_compute.wsgi.server [req-b5c7f22b-21a2-440e-b44a-e6a06a7410b5 nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1480989
2016-08-16 13:50:22.320 7174 INFO nova.osapi_compute.wsgi.server [req-d77c0606-6ef2-4733-bca1-05eb5cfa0a5e nodepoolmanager contintcloud - - -] 10.64.20.18 "POST /v2/contintcloud/servers HTTP/1.1" status: 403 len: 292 time: 0.1525040
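For reference, a tally like the one above can be produced with a short script along these lines (the log path is an assumption, and the "status: NNN" pattern is taken from the nova-api lines quoted above):

# Hypothetical sketch: count HTTP status codes in the nova-api log.
# The path is assumed; the "status: NNN" pattern matches the lines above.
import re
from collections import Counter

STATUS_RE = re.compile(r'status: (\d{3})')
counts = Counter()

with open('/var/log/nova/nova-api.log') as log:   # assumed log location
    for line in log:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1

total = sum(counts.values())
for status, count in sorted(counts.items()):
    print('%5d %s' % (count, status))
print('total: %d, 403s: %.0f%%' % (total, 100.0 * counts['403'] / max(total, 1)))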

In the Nodepool logs this definitely shows up as:

2016-08-16 13:51:00,321 ERROR nodepool.NodeLauncher: Forbidden launching node id: 330718 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 392, in launchNode
    config_drive=self.image.config_drive)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 398, in createServer
    return self.submitTask(CreateServerTask(**create_args))
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 119, in submitTask
    return task.wait()
  File "/usr/lib/python2.7/dist-packages/nodepool/task_manager.py", line 57, in run
    self.done(self.main(client))
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 116, in main
    server = client.servers.create(**self.args)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 900, in create
    **boot_kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 523, in _boot
    return_raw=return_raw, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 161, in _create
    _resp, body = self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 453, in post
    return self._cs_request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 428, in _cs_request
    resp, body = self._time_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 397, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 391, in request
    raise exceptions.from_response(resp, body, url, method)
Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)

This is happening constantly. So Nodepool is raising exceptions in its own logs for an operation that it seems to consider part of normal instance catalogue management? I don't get why this pattern is so consistent, but I'm wondering if Nodepool's mechanism for instance catalogue management is behaving irrationally. The really interesting thing is that we see only 7 instances in ACTIVE state right now, yet nova-api clearly believes the request cannot be fulfilled. My belief at the moment is that the request /rate/ is not compatible with quota processing: before a VM can be released for reallocation, OpenStack is being asked for new instances.

From nodepool logs:

grep -c 'Quota exceeded for instances' /var/log/nodepool/nodepool.log

9120

from 2016-08-16 00:00:00 to 2016-08-16 14:04:37

Or a better view is errant (ERROR) operations from nodepools pov vs INFO for this same period.

grep -c ERROR /var/log/nodepool/nodepool.log

9207

grep -c INFO /var/log/nodepool/nodepool.log

36388

total: 45595

Roughly 20% of all logged operations (9,207 of 45,595) over the last 14 hours of communication with nova-api are ERRORs for Nodepool.

Of these, 81 (out of 9,207) appear not to be forbidden-quota errors, and of those 81 the vast majority are some version of a failure to delete a node:

2016-08-16 14:10:06,359 ERROR nodepool.NodeDeleter: Exception deleting node 278848:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 297, in run
    self.nodepool._deleteNode(session, node)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 2159, in _deleteNode
    manager.waitForServerDeletion(node.external_id)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
    (server_id, self.provider.name)):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 42, in iterate_timeout
    raise Exception("Timeout waiting for %s" % purpose)
Exception: Timeout waiting for server 13d596d0-a3a7-473c-a967-d6dea9de8678 deletion in wmflabs-eqiad

As far as I understand this is a Nodepool client-side timeout error ("I waited too long, now I'm aborting"). Also caught in the Nodepool debug log:

File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
  raise Exception("Timeout waiting for %s" % purpose)

The other issue we discovered during the outage referenced in the incident report is Nodepool waiting approximately 1s for an instance to move out of BUILD state, a pace which cannot always be honored by OpenStack.

2016-08-16 14:20:47,356 DEBUG nodepool.NodePool: Deleting node id: 331024 which has been in building state for 0.00037676135699 hours
2016-08-16 14:20:57,059 DEBUG nodepool.NodePool: Deleting node id: 331025 which has been in building state for 0.000294238328934 hours
2016-08-16 14:21:07,803 DEBUG nodepool.NodePool: Deleting node id: 331026 which has been in building state for 0.000223197208511 hours
2016-08-16 14:21:17,657 DEBUG nodepool.NodePool: Deleting node id: 331027 which has been in building state for 0.000182528363334 hours
2016-08-16 14:21:28,011 DEBUG nodepool.NodePool: Deleting node id: 331028 which has been in building state for 0.000280977752474 hours
2016-08-16 14:21:38,433 DEBUG nodepool.NodePool: Deleting node id: 331029 which has been in building state for 0.000120452245076 hours
2016-08-16 14:21:49,154 DEBUG nodepool.NodePool: Deleting node id: 331030 which has been in building state for 0.000320818887817 hours
2016-08-16 14:21:59,465 DEBUG nodepool.NodePool: Deleting node id: 331031 which has been in building state for 0.000129256116019 hours
2016-08-16 14:22:09,812 DEBUG nodepool.NodePool: Deleting node id: 331032 which has been in building state for 0.000225731399324 hours
2016-08-16 14:22:10,984 DEBUG nodepool.NodePool: Deleting node id: 331033 which has been in building state for 0.000551279717022 hours

The exact example we were seeing is:

DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0.000289699700144 hours

thcipriani> 1.04 seconds
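For reference, Nodepool reports the elapsed build time in hours, so the figure above converts to just over one second:

# Convert Nodepool's "building state for N hours" figure to seconds.
hours = 0.000289699700144
print(round(hours * 3600, 2))   # -> 1.04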

The issue during the outage was a failure on the OpenStack side to honor instance creation in time; AFAICT Nodepool was being so aggressive in instance management that it was tripping over its own timeouts. The CI issues appear to have started with the merging of https://gerrit.wikimedia.org/r/#/c/304047/, which allows more time for operations since we were hitting the old limit with our large src group rulesets. CI was part of this even though it has far fewer instances than any other project affected. The few pieces of evidence we have indicated that Nodepool was choking on the new, correct, slower processing for instances being created in contintcloud. The "fix" was to remove src group rules from the project, after which instance spin-up was (seemingly) /barely/ able to complete within the provided timeouts.

Event Timeline

Are we looking at https://github.com/openstack-infra/nodepool/blob/877265912853e0f7a6a21e3771fe9250f71e9c26/nodepool/nodepool.py#L57 and has it been configured to be lower? And is https://review.openstack.org/#/c/245220/, created by @hashar, what we're looking for? Plus, has Nodepool even been updated to the latest release with that patch backported?

We are using nodepool version 0.1.1, we should update to 0.3.0.

Nodepool has a 60-second timeout for API requests. If OpenStack replies quickly with a 4xx/5xx, it proceeds to the next iteration of the main event loop.

So I guess you can have it trying to spawn an instance, getting a 503, continuing the loop, attempting to spawn another instance, etc.

@hashar we should update nodepool to see if it fixes some of our problems, including patching in some of your changes that still need merging or reviewing :)

chasemp renamed this task from "Nodepool instance creation delay" to "Nodepool instance instance creation quota management". Aug 16 2016, 2:10 PM
chasemp updated the task description.

I dropped the max-servers in /etc/nodepool/nodepool.yaml to 6 as that seemed to be the max number of allocated instances that were available before nodepool started launching into fits of 403 exceptions.

After a restart of the service (at 2016-08-16 22:38:09 UTC), all the 403 exceptions in the logs went away.

Nodepool is seemingly designed to get all of its information about the world from its config, which is unfortunate. It is definitely a bad failure mode: flail around and jam up the works.

It also seems like the openstack limit for the contintcloud project may be 6 instances (unless there are instances that wouldn't be accounted for in openstack server list or in nodepool list).

Hopefully, this tweak will mitigate the error response rate for openstack and make CI (counterintuitively) a little faster.

hrm. Maybe I stopped seeing 403s since demand was lower. Still working with 6 instances, just got:

Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)

Still only seeing 6 instances in use at any given time :((

Messages like this one:

DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0.000289699700144 hours

I believe, are actually Nodepool deleting the instance from its database, and not from OpenStack. Here's more context from the debug log:

2016-08-16 23:12:28,416 ERROR nodepool.NodeLauncher: Forbidden launching node id: 339419 in provider: wmflabs-eqiad error:
...
Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
2016-08-16 23:12:28,421 DEBUG nodepool.NodePool: Deleting node id: 339419 which has been in building state for 0.00122828086217 hours

Some random digging: there is some mention of "quota skew" in the #openstack-infra IRC logs: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-06-16.log.html#t2016-06-16T20:24:56

Similar symptoms to what's mentioned on this ticket.

Change 305537 had a related patch set uploaded (by Rush):
nodepool: reduce ready count for jessie and set higher rate

https://gerrit.wikimedia.org/r/305537

Change 305537 merged by Rush:
nodepool: reduce ready count for jessie and set higher rate

https://gerrit.wikimedia.org/r/305537

Yesterday we had some issues with CI, and @thcipriani and I poked at it for a bit, making some changes.

A note is that nodepool integrates natively with statsd. Many internal operations have the native ability to fire off a stat, but it requires the env var STATSD_HOST to be set. I had played with this last week to get an idea of nodepool health and we have now deployed it officially. So I have numbers from a few days ago and today from this mechanism.

https://phabricator.wikimedia.org/rOPUP5bd5047fb841fa2250e73b223790319876d191c0

One thing we noticed historically is that our Nodepool used-instance count has been artificially deflated. We had a max of 20 instances at one point, and have been on 10 for a while. In OpenStack I usually saw 6 concurrently in use for the project during the latter period.

Screen Shot 2016-08-18 at 4.12.07 PM.png (487×958 px, 61 KB)

Along with that (and what partially prompted this task) we have seen a large number of errors against the openstack nova-api from nodepool.

Screen Shot 2016-08-18 at 4.16.44 PM.png (486×966 px, 127 KB)

https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471548962.29&target=cactiStyle(nodepool.launch.error.unknown.count)&from=-7d&until=-6d&lineMode=connected

Because of all of the failures against the API and the long wait-to-run times, we suspected that the rate of change Nodepool was expecting was not in line with the rate of change actually possible. When an instance is deleted and then a new one is requested within a few seconds, that quota is possibly not available yet. There has to be some allowance for churn on instances, as not all operations are performed in real time.
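A toy timeline (all numbers are assumptions for illustration, not measurements from this cluster) shows how a roughly one-second retry loop combined with a slower quota-reclamation cycle turns most create calls into 403s:

# Toy illustration only, not Nodepool code. Assumes a 30s reclamation time
# and a 1s retry interval; with those numbers almost every create attempt
# lands while the old slot is still counted as in use and gets a 403.
QUOTA = 10
RECLAIM_SECONDS = 30   # assumed: time for a deleted VM to free its quota slot
RETRY_SECONDS = 1      # assumed: Nodepool's effective retry interval

free_at = RECLAIM_SECONDS   # the first slot frees up once a deletion completes
accepted = forbidden = 0
for t in range(0, 60, RETRY_SECONDS):
    if t >= free_at:
        accepted += 1
        free_at = t + RECLAIM_SECONDS   # the replacement will itself be recycled later
    else:
        forbidden += 1

print('in 60s: %d creates accepted, %d rejected with 403' % (accepted, forbidden))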

We adjusted the rate and min-ready for jessie

https://phabricator.wikimedia.org/rOPUPa82ffc941aeae8da623199761d8731db8a5f7d2b

min-ready to even the playing field for instance pre-allocation, and rate because 1s seemed to result in an overly aggressive approach on Nodepool's side. We are now at 10s with seemingly no ill effects yet. It's possible even this is overkill, but so far it is at least benign, even if not valuable.

Also, because of the disconnect between Nodepool and OpenStack quota usage, I upped the quota here to 15 instances as noted in the SAL. We played around with 10-15 and various max values in Nodepool to see if the usage offset was fixed, but it does not seem that way. There is an interplay between the ability of the OpenStack provider to start/build/run/delete instances and Nodepool's ability to be graceful about it. I think this means we should offer some padding to nodepool in excess of its max-servers based in part on what reclamation cycle we expect to be honored. If we need to up Nodepool's usage again we should similarly up the quota, but it's not necessarily going to be a fixed ratio.
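As a back-of-the-envelope sketch of that padding idea (the reclamation window and churn rate below are assumed values, not measurements):

# Rough padding estimate: instances that can be double counted (old copy not
# yet reclaimed while its replacement is already requested) during one
# reclamation window. All inputs here are assumptions.
max_servers = 10        # nodepool max-servers
reclaim_seconds = 60    # assumed time for a deleted VM to stop counting against quota
churn_per_minute = 5    # assumed peak instance turnover

padding = int(churn_per_minute * reclaim_seconds / 60.0)
print('suggested quota >= %d for max-servers=%d' % (max_servers + padding, max_servers))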

Where before we had the same max concurrent server count in Nodepool, now we see more servers existing in the used state.

https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(nodepool.target.gallium.wikimedia.org.ci-jessie-wikimedia.wmflabs-eqiad.*)&from=-7d&lineMode=connected

Screen Shot 2016-08-18 at 4.29.22 PM.png (487×965 px, 128 KB)

Other metrics:

max launch wait per nodepool label
https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen

gerrit changes processed by zuul
https://grafana-admin.wikimedia.org/dashboard/db/releng-zuul?panelId=25&fullscreen&edit

zuul queued and running

https://graphite.wikimedia.org/render/?width=866&height=511&_salt=1471553962.747&target=cactiStyle(zuul.geard.queue.waiting)&lineWidth=3&from=-3d&target=cactiStyle(zuul.geard.queue.running)&areaMode=stacked

Screen Shot 2016-08-18 at 4.35.16 PM.png (510×861 px, 93 KB)

My assumption is that min-ready should be bumped back to where it was, or higher, for jessie, but we wanted to sit on this for at least a full day or a few days to gauge stability.

Nice debugging!

Messages like this one:

DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0.000289699700144 hours

I believe, are actually Nodepool deleting the instance from its database, and not from OpenStack. Here's more context from the debug log:

2016-08-16 23:12:28,416 ERROR nodepool.NodeLauncher: Forbidden launching node id: 339419 in provider: wmflabs-eqiad error:
...
Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
2016-08-16 23:12:28,421 DEBUG nodepool.NodePool: Deleting node id: 339419 which has been in building state for 0.00122828086217 hours

The 0.000289699700144 hours is Nodepool internal state. Roughly, it will (see the sketch after this list):

  • add a row in its database for node 261807, in state BUILDING, with a millisecond-precision timestamp
  • ask OpenStack to spawn an instance
  • get a 403 / Quota exceeded
  • mark the row in the database as DELETE and report the (very short) elapsed build time
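A simplified, self-contained sketch of that cycle (the names and structure are illustrative, not the actual nodepool.NodeLauncher code quoted earlier):

# Simplified sketch of the launch / 403 / delete cycle described above.
import time
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('nodepool.NodePool')


class Forbidden(Exception):
    """Stand-in for novaclient's 403 'Quota exceeded' exception."""


def create_server_stub():
    # Stand-in for client.servers.create(); pretend the quota is exhausted.
    raise Forbidden('Quota exceeded for instances: Requested 1, '
                    'but already used 10 of 10 instances (HTTP 403)')


def launch_node(node_id, nodes_db):
    # 1. add a row to the local database, state BUILDING, with a timestamp
    nodes_db[node_id] = {'state': 'BUILDING', 'created': time.time()}
    try:
        # 2. ask OpenStack to spawn an instance
        create_server_stub()
        nodes_db[node_id]['state'] = 'READY'
    except Forbidden as exc:
        # 3. get a 403 / Quota exceeded
        log.error('Forbidden launching node id: %s error: %s', node_id, exc)
        # 4. flip the row to DELETE and report the sub-second build time in hours
        hours = (time.time() - nodes_db[node_id]['created']) / 3600.0
        log.debug('Deleting node id: %s which has been in building state '
                  'for %s hours', node_id, hours)
        nodes_db[node_id]['state'] = 'DELETE'


launch_node(339419, {})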

Clearly, Nodepool does not gracefully handle issues with the remote cloud provider :-(

If the issue got triggered by Nova rpc_response_timeout being changed from 60 to 180, maybe we can align the Nodepool timeout with OpenStack's, which is set to 60:

providers:
  - name: wmflabs-eqiad
    service-type: 'compute'
    service-name: 'nova'
    ...
    api-timeout: 60  # seconds

As for the rate, Nodepool generates tasks which are piled up in a queue and processed at one task every rate seconds, sleeping otherwise. The logic is in nodepool/task_manager.py.

With a rate of 10s, Nodepool only processes six tasks per minute, be it deleting or spawning an instance, etc. Task counts / timings are reported to statsd under nodepool.task..
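A minimal sketch of that throttling, assuming the behaviour described above (an approximation, not the actual nodepool/task_manager.py code):

# Approximate model of a rate-limited task queue: one task per rate_seconds.
import time
from collections import deque


def run_task_loop(tasks, rate_seconds):
    # Process at most one queued task per rate_seconds, whatever the task is.
    while tasks:
        task = tasks.popleft()          # CreateServerTask, DeleteServerTask, ...
        print('processing %s' % task)
        time.sleep(rate_seconds)


work = deque(['CreateServerTask', 'DeleteServerTask', 'ListServersTask'])
run_task_loop(work, rate_seconds=1)     # rate: 10 in production caps this at ~6 tasks/minute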

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897012.215&drawNullAsZero=true&from=00%3A00_20160811&hideLegend=false&target=aliasByNode(nodepool.task.wmflabs-eqiad.*.count%2C3)&logBase=&until=23%3A59_20160815

render-16.png (516×884 px, 125 KB)

Stacked https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897218.326&drawNullAsZero=false&from=00%3A00_20160811&hideLegend=false&target=aliasByNode(nodepool.task.wmflabs-eqiad.*.count%2C3)&logBase=&until=23%3A59_20160815&lineMode=connected&areaMode=stacked

render-18.png (516×884 px, 118 KB)

After, with rate at 10 (I have stacked the metrics in the graph):

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=aliasByNode(nodepool.task.wmflabs-eqiad.*.count%2C3)&logBase=&until=23%3A59_20160822&lineMode=connected&areaMode=stacked

render-17.png (516×884 px, 72 KB)

The plateau in the first two graphs is on August 12th around 21:30 UTC.


We also have to upgrade the Python modules; the ones we imported are a bit old and might not be fully compatible with Liberty.

Our setup is using upstream 0.1.1, which was cut over a year ago, in May 2015. Newer releases most probably have a lot of enhancements that make it more resilient.

yeah we puzzled over this for a good long while.

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=aliasByNode(nodepool.task.wmflabs-eqiad.*.count%2C3)&logBase=&until=23%3A59_20160822&lineMode=connected&areaMode=stacked

makes sense in terms of rate-limited actions per cycle

The funny (not as in haha) part is

Screen Shot 2016-08-22 at 4.53.43 PM.png (511×880 px, 102 KB)

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=cactiStyle(sumSeries(nodepool.task.wmflabs-eqiad.*.count))&logBase=&lineMode=connected&target=cactiStyle(nodepool.target.gallium.wikimedia.org.ci-jessie-wikimedia.wmflabs-eqiad.*)&from=-11d&until=-8d

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=cactiStyle(integral(sumSeries(nodepool.task.wmflabs-eqiad.*.count)))&from=-11d&until=-9d

vs

Screen Shot 2016-08-22 at 4.54.00 PM.png (514×884 px, 218 KB)

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=cactiStyle(sumSeries(nodepool.task.wmflabs-eqiad.*.count))&logBase=&lineMode=connected&target=cactiStyle(nodepool.target.gallium.wikimedia.org.ci-jessie-wikimedia.wmflabs-eqiad.*)&from=-5d

https://graphite.wikimedia.org/render/?width=884&height=516&_salt=1471897116.638&drawNullAsZero=false&from=00%3A00_20160818&hideLegend=false&target=cactiStyle(integral(sumSeries(nodepool.task.wmflabs-eqiad.*.count)))&from=-5d

i.e. we are doing a lot less and getting a lot more done it seems like

we are doing a lot less and getting a lot more done it seems like

Most of the jobs have been moved back to the permanent slaves (integration tenant) so that explains the currently low traffic.

With statsd metrics enabled, I have crafted a basic dashboard in Grafana which should help in debugging / tracking what is going on: https://grafana.wikimedia.org/dashboard/db/nodepool

I would like to:

A) Upgrade the Python modules on labnodepool1001.eqiad.wmnet, namely use python-novaclient from jessie-backports (T137217). A puppet patch is ready; it should be straightforward and might help with the incorrect quota behavior (if at all).

B) The current instance quota for contintcloud is 15, which offers some overhead as proposed by chase:

I think this means we should offer some padding to nodepool in excess of its max-servers based in part on what reclamation cycle we expect to be honored.

Thus we can progressively raise max-servers to 12 and min-ready for jessie and observe whether the quota issue appears again. Then progressively lower the rate from 10 to make Nodepool process more than six tasks per minute.

C) If the issue reoccurs, figure out extra debug logging to look at the OpenStack query responses. Maybe try to reproduce with the openstack CLI or even via Horizon.

Gave it a try with:

  • max-server 12
  • jessie 8
  • trusty 4

And at 11 instances, an attempt to spawn a 12th yields the 15/15 quota-exceeded error.

Nodepool spawns and snapshots instances that are used as reference images. I am wondering whether the snapshot images count against the instance quota. I cleared a couple of them:

(openstack) image list --private
+--------------------------------------+----------------------------------+
| ID                                   | Name                             |
+--------------------------------------+----------------------------------+
| 133f4578-3a04-4bce-a954-27caa21b4b9f | ci-jessie-wikimedia-1470060840   |
| 9dd3617f-b6ea-425e-998d-7a1fa899b505 | ci-jessie-wikimedia-1469974440   |
| 947b1894-cc07-4600-8cdc-7f8ad7d30fa0 | ci-trusty-wikimedia-1469698821   |
| 76eb86dc-ebc1-41c0-a35a-f598ce0859e9 | ci-trusty-wikimedia              |
| e9f7acc0-ae52-44c1-9c32-c57520082423 | ci-trusty-wikimedia-1467989709   |
| c422ccea-0c45-4e31-b4df-158130cd8f47 | ci-trusty-wikimedia_old_20170728 |
| ad397708-8170-4c39-9a59-963c6b3d660c | ci-jessie-wikimedia              |
| 1ba5b7eb-7238-43b2-9c28-cfe6a59014b5 | ci-trusty-wikimedia_old_20170708 |
+--------------------------------------+----------------------------------+
(openstack) image delete ci-trusty-wikimedia_old_20170728
(openstack) image delete ci-trusty-wikimedia_old_20170708
(openstack) image list --private
+--------------------------------------+--------------------------------+
| ID                                   | Name                           |
+--------------------------------------+--------------------------------+
| 133f4578-3a04-4bce-a954-27caa21b4b9f | ci-jessie-wikimedia-1470060840 |
| 9dd3617f-b6ea-425e-998d-7a1fa899b505 | ci-jessie-wikimedia-1469974440 |
| 947b1894-cc07-4600-8cdc-7f8ad7d30fa0 | ci-trusty-wikimedia-1469698821 |
| 76eb86dc-ebc1-41c0-a35a-f598ce0859e9 | ci-trusty-wikimedia            |
| e9f7acc0-ae52-44c1-9c32-c57520082423 | ci-trusty-wikimedia-1467989709 |
| ad397708-8170-4c39-9a59-963c6b3d660c | ci-jessie-wikimedia            |
+--------------------------------------+--------------------------------+

But that did not magically fix it :} Quota 15/15 is still reached with 12 instances known and an attempt to spawn a 13th:

1 used
 7 ready
 4 building

Via the openstack command line tool I got 8 instances:

| ef6ece68-5d79-4ba0-b2f0-e6d0fe308201 | ci-trusty-wikimedia-358622 | ACTIVE | public=10.68.16.197 |
| 55acccbe-e05c-4927-a6f9-c742b7d0e0e2 | ci-trusty-wikimedia-358621 | ACTIVE | public=10.68.20.78  |
| 4c77100a-23c1-4039-b3cb-d43c508ac11e | ci-jessie-wikimedia-358617 | ACTIVE | public=10.68.17.151 |
| 1c9e19b5-07a7-49a0-9c08-58044d042178 | ci-jessie-wikimedia-358616 | ACTIVE | public=10.68.20.123 |
| 340f58f6-5af9-445f-93eb-26ada2b8e4fd | ci-jessie-wikimedia-358615 | ACTIVE | public=10.68.20.217 |
| bba911f4-29f9-4b1b-9334-ad2e721b81d6 | ci-jessie-wikimedia-358600 | ACTIVE | public=10.68.20.177 |
| 9a143413-71d1-4783-b3c3-dbbb742f478a | ci-jessie-wikimedia-358598 | ACTIVE | public=10.68.21.42  |
| 89e6ad44-a4ef-4bfe-9bb6-f8e375d343df | ci-jessie-wikimedia-358597 | ACTIVE | public=10.68.22.137 |
| 93e82c9b-7146-4b1c-8cbd-e092e15d392d | ci-jessie-wikimedia-358562 | ACTIVE | public=10.68.20.220 |

Via https://horizon.wikimedia.org/project/ I got "Active instances: 11", with a few extras which are "old":

| ci-jessie-wikimedia-358562 | 1 hour 50 minutes |                                        |
| ci-jessie-wikimedia-261342 | 1 week+           | "4faadde6-b329-41e6-820b-644c3168f99a" |
| ci-trusty-wikimedia-310458 | 1 week+           | "cfdfe1ca-3f99-4d8e-b9f2-256e09fa1b78" |

In other words, the three instances that caused my previous experiment to go over the quota of 15 when I had 12 instances.

The questions are:

A) Why don't they show up in openstack server list, nor in the API server list?
B) what is preventing the three instances from being cleared out?

@hashar I'm uncomfortable with you changing these values without a notice here beforehand, a SAL entry, or a heads up. We bumped up the quota to 15 not to try to push that limit but to try to honor 10. On the Cloud-Services side we have limited availability today and would have preferred to delay this to Wednesday.

Acked, sorry @chasemp :-(

No worries, thanks for understanding.

FWIW I have caught 11 servers via openstack server list when max-servers was set to 10. In tracking this down I spent some time reasoning about how Nova figures out the current quota state.

My instinct is this may be related; it fits the behavior and the theory that our setup is not able to honor the higher rate of turnaround:

Refresh quotas usage
https://review.openstack.org/#/c/161782/

https://review.openstack.org/#/c/161782/6/specs/liberty/approved/refresh-quotas-usage.rst


https://bugs.launchpad.net/nova/+bug/1284424

https://bugs.launchpad.net/nova/+bug/1296414

I would like to play with https://github.com/cernops/nova-quota-sync

@thcipriani pointed to some IRC logs from OpenStack infrastructure that reflected them having the same issue with their OVH cloud provider, e.g. 48 instances spawned but OpenStack reporting 100 out of 100 quota exhaustion.

OVH fixed the bad quota accounting and they restored that cloud provider max-server limit on July 29th: https://review.openstack.org/#/c/348934/

The original max-server limitation was done with https://review.openstack.org/#/c/347075/:

Reduce ovh-gra1 to 40 nodes

We are seeing Nova exceptions in ovh-gra1 currently. As a results, we are not able to launch more then 40 nodes. Lets reduce the max-server to 40 nodes until we hear back from OVH support:

Quota exceeded for instances: Requested 1, but already used 200 of
200 instances (HTTP 403) (Request-ID: req-aa8caa30-f350-4965-8dca-6f3b473c14bd)

We can ask them in #openstack-infra IRC channel, maybe they have more details.

As for the quota glitches, there are two ghost instances that need to be deleted. Also, the Nova code seems to indicate the quota usage is tracked in MySQL and not necessarily refreshed properly. So if we had a glitch of 4 instances that never lowered the usage, we would always have a baseline of 4 even with no instances.

https://ask.openstack.org/en/question/494/how-to-reset-incorrect-quota-count/ seems to indicate that setting the usage counter to -1 would eventually trigger a refresh, with something like:

update quota_usages set in_use='-1' where project_id='<my project id>';

And nova has:

nova/db/sqlalchemy/api.py
def _is_quota_refresh_needed(quota_usage, max_age):
    """Determines if a quota usage refresh is needed.

    :param quota_usage:   A QuotaUsage object for a given resource.
    :param max_age:       Number of seconds between subsequent usage refreshes.
    :return:              True if a refresh is needed, False otherwise.
    """
    refresh = False
    if quota_usage.in_use < 0:
        # Negative in_use count indicates a desync, so try to
        # heal from that...
        LOG.debug('in_use has dropped below 0; forcing refresh for '
                  'QuotaUsage: %s', dict(quota_usage))
        refresh = True
    elif quota_usage.until_refresh is not None:
        quota_usage.until_refresh -= 1
        if quota_usage.until_refresh <= 0:
            refresh = True
    elif max_age and (timeutils.utcnow() -
            quota_usage.updated_at).seconds >= max_age:
        refresh = True

    return refresh
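Read against the SQL trick above: a negative in_use forces a refresh on the next reservation, and a non-zero max_age forces one whenever the cached usage is older than max_age seconds. A rough standalone illustration of that decision logic (an approximation for clarity, not the nova module itself):

# Standalone approximation of the refresh decision above.
import datetime


class FakeQuotaUsage(object):
    def __init__(self, in_use, updated_at, until_refresh=None):
        self.in_use = in_use
        self.updated_at = updated_at
        self.until_refresh = until_refresh


def needs_refresh(quota_usage, max_age):
    if quota_usage.in_use < 0:
        return True     # the "update quota_usages set in_use='-1'" trick
    if quota_usage.until_refresh is not None:
        quota_usage.until_refresh -= 1
        return quota_usage.until_refresh <= 0
    age = (datetime.datetime.utcnow() - quota_usage.updated_at).seconds
    return bool(max_age) and age >= max_age


two_minutes_ago = datetime.datetime.utcnow() - datetime.timedelta(seconds=120)
stale = FakeQuotaUsage(in_use=15, updated_at=two_minutes_ago)
print(needs_refresh(stale, max_age=0))    # False: age never triggers a refresh
print(needs_refresh(stale, max_age=30))   # True: cached usage older than 30s
desynced = FakeQuotaUsage(in_use=-1, updated_at=datetime.datetime.utcnow())
print(needs_refresh(desynced, max_age=0)) # True: negative in_use heals itself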

From #openstack-infra:

[16:15:12Z] <pabelanger> hashar: There was a quota mismatch with our project in OVH. While I don't know the fix, they did "re-sync our tenant". Not that it helps with the fix.

So I am suspecting they manually updated the quota usage in whatever data backend.

@chasemp can't you check the value we have for the tenant in the database? Potentially we could stop Nodepool, delete all instances and look at the value; I would say it is stale at 4.

If you look at https://phabricator.wikimedia.org/T143016#2576881 I imagine https://review.openstack.org/#/c/161782/6/specs/liberty/approved/refresh-quotas-usage.rst is the mechanics of what you mean.

Can you note the places where you are viewing quota usage and what they currently report?

Yeah, the quota usage links you posted earlier have led me to figure out how to look at the actual usage for contintcloud. What I am doing is:

ssh labnodepool1001.eqiad.wmnet
$ become-nodepool
$ nova absolute-limits|grep Instances
| Instances          | 15    | 15     |
$ openstack server list|grep -c ci-
11

I constantly see the quota usage reporting 4 more servers than there really are. Some instances are apparently ghosts and would need cleanup; yesterday they were showing up in Horizon. Maybe there are more, or the counter is just off.

Right now we are at 2 jessie and 2 trusty instances (the min-ready values).

On the Horizon project page at https://horizon.wikimedia.org/project/ I get:

  • a pie chart stating 8/15 instances
  • Usage showing the two ghost instances ci-jessie-wikimedia-261342 and ci-trusty-wikimedia-310458

Then on the instance list https://horizon.wikimedia.org/project/instances/ I only get 4 instances listed.

nova absolute-limits has the same value as the Horizon pie chart:

| Instances          | 8     | 15     |

openstack server list|grep -c ci- gives the four instances.

So basically one can look at https://horizon.wikimedia.org/project/ for the quotas and ghost instances, and at https://horizon.wikimedia.org/project/instances/ for the actual instances.

I have looked at all the projects I have access to, and tools seems to be off by one, with the pie chart showing 124 / 150 while there are 123 instances.

@chasemp found a parameter that causes Nova to refresh the quota usage on a reservation if the last refresh is more than 30 seconds old! rOPUP1aae21a7afc5: nova: trigger usage calculation refresh if stale for 30s

The quota issue has been solved. With 4 instances floating around, nova absolute-limits properly shows 4 against the instances quota.

The other issue is Horizon at https://horizon.wikimedia.org/project/instances/ being off. With four instances, the pie chart showing the quota usage is at four, but the usage summary shows 2 active instances and two instances that no longer exist. But that might just be an oddity with the usage stats.

hashar claimed this task.

This got fixed up quite fast. The root cause is that the quota usage went out of sync and never got updated. Nova now refreshes the quota usage automatically if the value is older than 30 seconds.