Nodepool delays instance deletions by one minute
Closed, Resolved · Public

Description

Nodepool has a hardcoded 1 minute delay before deleting an instance. It can be found in the source as the global DELETE_DELAY.

The reason is that OpenStack has an async publisher that grabs the console log and runs after the build has completed. Deleting the instance as soon as the job completes would apparently make the console log unavailable.

Nodepool needs a configurable setting for that delay. To be proposed upstream.
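
To illustrate the idea, here is a minimal sketch of a deletion check with a configurable delay. It is not Nodepool's actual code: the function `ready_for_deletion` and its parameters are hypothetical, and only the name `DELETE_DELAY` and its 60 second value come from the description above.

```python
import time

# Stand-in for the hardcoded global described above, in seconds.
DELETE_DELAY = 60


def ready_for_deletion(job_completed_at, now=None, delete_delay=DELETE_DELAY):
    """Return True once delete_delay seconds have passed since the job completed.

    Hypothetical helper: job_completed_at and now are Unix timestamps.
    """
    if now is None:
        now = time.time()
    return (now - job_completed_at) >= delete_delay
```

With the default, an instance whose job has just finished is kept for a minute; passing `delete_delay=0` makes it eligible for deletion immediately.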

Event Timeline

hashar created this task. Sep 22 2015, 2:43 PM
hashar raised the priority of this task to Needs Triage.
hashar updated the task description.
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. Sep 22 2015, 2:43 PM
hashar triaged this task as Low priority. Sep 22 2015, 6:27 PM

Will get to it when it starts becoming a problem (i.e. Nodepool not refilling the pool fast enough). The actual patch is straightforward; after that it is all about cherry-picking it into the .deb package.

hashar set Security to None.

Patch proposed upstream to make the value configurable: https://review.openstack.org/#/c/245220/

On our setup, I am just going to change the hardcoded value.

Change 237700 had a related patch set uploaded (by Hashar):
nodepool 0.1.1-wmf4

https://gerrit.wikimedia.org/r/237700

hashar changed the status of subtask T118573: Upgrade Nodepool to 0.1.1-wmf4 from Stalled to Open. Feb 29 2016, 7:49 PM

Change 237700 merged by Hashar:
nodepool 0.1.1-wmf4

https://gerrit.wikimedia.org/r/237700

Change 275612 had a related patch set uploaded (by Hashar):
nodepool: set delete-delay to 0 seconds

https://gerrit.wikimedia.org/r/275612

Change 275612 merged by Andrew Bogott:
nodepool: set delete-delay to 0 seconds

https://gerrit.wikimedia.org/r/275612

The delay is gone.

Note that deletion tasks are handled by an internal cron that runs every minute:

cron:
  # Deletes old images and servers
  cleanup: '*/1 * * * *'
hashar closed this task as Resolved. Mar 7 2016, 8:59 PM
hashar claimed this task.

Change 275791 had a related patch set uploaded (by Hashar):
nodepool: lower task ratelimiting from 10 to 1 sec

https://gerrit.wikimedia.org/r/275791

I have also found out that we were rate limiting delete requests to at most one every ten seconds, i.e. six per minute. https://gerrit.wikimedia.org/r/275791 lowers the interval to 1 second, allowing up to 60 deletions per minute.
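
The arithmetic behind those figures can be sketched as follows. This is an illustration only, not Nodepool's implementation: the `Throttle` class and the `max_tasks_per_minute` helper are hypothetical names, and the throttle simply enforces a minimum interval between consecutive tasks.

```python
import time


class Throttle:
    """Allow at most one task per `interval` seconds (illustrative only)."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous task was allowed

    def wait(self):
        """Block until at least `interval` seconds have passed since the last task."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def max_tasks_per_minute(interval):
    """Upper bound on tasks per minute for a given interval in seconds."""
    return int(60 // interval)
```

Under these assumptions, `max_tasks_per_minute(10)` gives 6 and `max_tasks_per_minute(1)` gives 60, matching the numbers above.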

hashar reopened this task as Open. Mar 8 2016, 11:54 AM
hashar moved this task from Backlog to In-progress on the Continuous-Integration-Scaling board.

Change 275791 merged by Andrew Bogott:
nodepool: lower task ratelimiting from 10 to 1 sec

https://gerrit.wikimedia.org/r/275791

hashar closed this task as Resolved. Mar 9 2016, 8:25 PM

Definitely fixed. We had two issues:

  • a 1 minute delay before a deletion can happen (removed)
  • a 10 second throttle on any task Nodepool performs against the OpenStack API (down to 1 second)

So Nodepool can now delete up to one instance per second. I have monitored it a bit since Tuesday and confirmed that it replenishes the pool much faster.