
Nodepool delays instance deletions by one minute
Closed, Resolved (Public)

Description

Nodepool has a hardcoded 1 minute delay before deleting an instance. It is defined in the source as the global DELETE_DELAY.

The reason is that OpenStack has an async publisher that grabs the console log and is executed after the build is completed. Deleting the instance as soon as the job is completed would apparently make the console log unavailable.

Nodepool needs a configurable setting for that delay. To be proposed upstream.
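
As a rough illustration of what a configurable delay could look like, here is a minimal sketch. It is not Nodepool's actual code: the function names, the `delete-delay` configuration key and the flat dict config are assumptions made for the example; only the 60 second default comes from the description above.

```python
# Hypothetical default mirroring the hardcoded one minute delay.
DEFAULT_DELETE_DELAY = 60  # seconds


def load_delete_delay(config):
    """Read an assumed 'delete-delay' key, falling back to the default."""
    return int(config.get('delete-delay', DEFAULT_DELETE_DELAY))


def ready_for_deletion(completed_at, now, delete_delay=DEFAULT_DELETE_DELAY):
    """Return True once delete_delay seconds have passed since job completion.

    Both timestamps are plain numbers of seconds (e.g. from time.time()).
    """
    return (now - completed_at) >= delete_delay
```

With a delay of 0 an instance becomes eligible for deletion as soon as its job completes, which trades the console log grace period for a faster pool refill.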

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar subscribed.

Will get to it when it starts becoming a problem (i.e. Nodepool not refilling the pool fast enough). The actual patch is straightforward; after that it is all about cherry-picking it into the .deb package.

Patch proposed upstream to make the value configurable: https://review.openstack.org/#/c/245220/

On our setup, I am just going to change the hardcoded value.

Change 237700 had a related patch set uploaded (by Hashar):
nodepool 0.1.1-wmf4

https://gerrit.wikimedia.org/r/237700

Change 275612 had a related patch set uploaded (by Hashar):
nodepool: set delete-delay to 0 seconds

https://gerrit.wikimedia.org/r/275612

Change 275612 merged by Andrew Bogott:
nodepool: set delete-delay to 0 seconds

https://gerrit.wikimedia.org/r/275612

The delay is gone.

Note that deletion tasks are handled by an internal cron that runs every minute:

cron:
  # Deletes old images and servers
  cleanup: '*/1 * * * *'
hashar claimed this task.

Change 275791 had a related patch set uploaded (by Hashar):
nodepool: lower task ratelimiting from 10 to 1 sec

https://gerrit.wikimedia.org/r/275791

I have also found out that we were rate limiting delete requests to at most one every ten seconds, or six per minute. https://gerrit.wikimedia.org/r/275791 moves it down to 1 second, or up to 60 deletions per minute.
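
To make the arithmetic concrete, here is a minimal sketch of a minimum-interval throttle. It is only an illustration of the idea, not how Nodepool implements its task rate limiting; the class and function names are hypothetical.

```python
import time


class MinIntervalLimiter:
    """Allow at most one task per `interval` seconds (illustrative throttle)."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time at which the previous task was released

    def wait(self):
        """Block until at least `interval` seconds have passed since the last task."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def max_tasks_per_minute(interval):
    """Upper bound on tasks per minute for a given minimum interval in seconds."""
    return 60 / interval
```

With an interval of 10 seconds the bound is 6 tasks per minute; with 1 second it is 60.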

Change 275791 merged by Andrew Bogott:
nodepool: lower task ratelimiting from 10 to 1 sec

https://gerrit.wikimedia.org/r/275791

Definitely fixed. We had two issues:

  • a 1 minute delay before a deletion can happen (removed)
  • a 10 second throttle on any task Nodepool runs against the OpenStack API (down to 1 second)

So Nodepool can now delete up to one instance per second. I have monitored it a bit since Tuesday and confirmed that it replenishes the pool much faster.