Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes
Closed, Resolved (Public)

Description

On 03/03/17 nodepool stopped working. Most likely nodepool itself simply stopped, but it could also have been caused by it continually leaking instances.

All these issues may be bugs that have already been fixed in a newer release. We are using a very old version of nodepool.

Paladox created this task. Fri, Mar 3, 4:11 PM
Restricted Application added a subscriber: Aklapper. Fri, Mar 3, 4:11 PM
Paladox triaged this task as "High" priority. Fri, Mar 3, 4:11 PM
chasemp assigned this task to Andrew. Fri, Mar 3, 4:27 PM

We merged https://gerrit.wikimedia.org/r/#/c/340986/, which caused nova services to restart and a host of in-flight instances to go into error state. Some labvirts are coming back slowly. Hopefully it's all transient. @Andrew is babysitting this now to make sure.

Paladox raised the priority of this task from "High" to "Unbreak Now!". Fri, Mar 3, 5:23 PM

Guessing "Unbreak Now!", since CI is down?

Restricted Application added subscribers: Jay8g, TerraCodes. Fri, Mar 3, 5:23 PM

Mentioned in SAL (#wikimedia-operations) [2017-03-03T17:34:59Z] <hashar> CI is mostly recovered. It could not spawn instance anymore. The queue is being processed and will take a while to be completed. Check status on https://integration.wikimedia.org/zuul/ | T159543

Paladox lowered the priority of this task from "Unbreak Now!" to "High". Fri, Mar 3, 6:14 PM
hashar closed this task as "Resolved". Fri, Mar 3, 11:00 PM

Nova / OpenStack recovered, so the instances could finally be deleted and Nodepool was then able to refill the pool with fresh instances.