
Nodepool can not delete/spawn instances anymore
Closed, Resolved · Public

Description

$ nodepool list

ID     | Provider      | Label               | Hostname                   | Server ID                            | IP           | State  | Age (hours)
168237 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168237 | 60745ab8-3308-4830-881e-e0a5c2839859 | 10.68.18.110 | delete | 0.38
168238 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168238 | 06bc1a19-31c7-452c-a015-e0ff71388285 | 10.68.20.78  | delete | 0.27
168244 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168244 | 1bfa6bff-344f-4897-a259-30a7c531e9cb | 10.68.20.168 | delete | 0.21
168245 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168245 | 2000b0b4-44f8-4340-8e8f-becef4c58d90 | 10.68.19.54  | delete | 0.35
168246 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168246 | f69fc8c3-d62e-4989-8048-8232d367caa5 | 10.68.20.171 | delete | 0.23
168247 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168247 | c439d7fb-6633-4c59-820b-667ed825ca81 | 10.68.20.214 | delete | 0.23
168250 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168250 | f36a4830-300c-4efa-8858-4dfdb56feaaf | 10.68.20.169 | delete | 0.23
168251 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168251 | 80efbd52-7364-4101-9f4f-116c765ac289 | 10.68.20.10  | delete | 0.34
168254 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168254 | c96906ba-008e-4fd3-8df8-a6c289838e56 | 10.68.22.20  | delete | 0.23
168255 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168255 | c5fbdd53-2224-4c57-8dc4-965bc8abac23 | 10.68.20.113 | delete | 0.34
168256 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168256 | 004a1cff-ff7f-4d5d-af75-755e20a83164 | 10.68.20.130 | delete | 0.27
168257 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168257 | dae77117-a1d8-4dd4-a7e1-e5dccffa4b65 | 10.68.20.186 | delete | 0.19
168260 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168260 | 342ff9ee-a457-4a4b-9b2c-ccf5a97cc756 | None         | delete | 0.29
168261 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168261 | bf9a61e7-9382-4c6d-94ce-2785ff5b439e | None         | delete | 0.29
168262 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168262 | a0fb133f-c1a0-4ddd-99e4-cd5e59e1a71f | None         | delete | 0.24
168263 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-168263 | 6ff25c83-6601-4209-8ffb-8898298045ce | None         | delete | 0.24
168252 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-168252 | 6e0084f5-61a6-42f6-973c-d2f503792f66 | 10.68.20.36  | delete | 0.38
168253 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-168253 | 863391b0-f117-45a9-b8ff-f241fffbae52 | 10.68.20.194 | delete | 0.38
168258 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-168258 | aeee34dc-bac2-4fb5-9574-3c665e3f9cf1 | 10.68.20.129 | delete | 0.23
168259 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-168259 | 95c78e27-c58d-4901-a65d-3a50c30f7814 | 10.68.20.210 | delete | 0.23

2016-07-04 09:39:05,653 ERROR nodepool.NodeDeleter: Exception deleting node 168252:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 297, in run
    self.nodepool._deleteNode(session, node)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 2159, in _deleteNode
    manager.waitForServerDeletion(node.external_id)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
    (server_id, self.provider.name)):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 42, in iterate_timeout
    raise Exception("Timeout waiting for %s" % purpose)
Exception: Timeout waiting for server 6e0084f5-61a6-42f6-973c-d2f503792f66 deletion in wmflabs-eqiad
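For context, the timeout in that traceback comes from nodepool's polling helper: _deleteNode() issues the delete, then waitForServerDeletion() keeps polling the provider until OpenStack stops listing the server, and gives up after a deadline. A rough sketch (names simplified from nodepool/nodeutils.py and provider_manager.py; the poll interval, default timeout and list_servers() call are approximations, not the exact WMF code):

  import time

  def iterate_timeout(max_seconds, purpose):
      # Yield an attempt counter until the deadline passes, then raise --
      # this is what produces the "Timeout waiting for ..." exception above.
      start = time.time()
      count = 0
      while time.time() < start + max_seconds:
          count += 1
          yield count
          time.sleep(2)
      raise Exception("Timeout waiting for %s" % purpose)

  def wait_for_server_deletion(provider_client, server_id, provider_name, timeout=600):
      # Keep polling until the server no longer shows up in the provider's server list.
      purpose = "server %s deletion in %s" % (server_id, provider_name)
      for _ in iterate_timeout(timeout, purpose):
          servers = provider_client.list_servers()  # stand-in for the real provider API call
          if not any(s['id'] == server_id for s in servers):
              return

In this incident the OpenStack API kept listing the instances, so every poll loop eventually hit the deadline and the nodes stayed stuck in the delete state shown by nodepool list.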

Looks similar to T135631: Nodepool can not spawn instances anymore on wmflabs

Event Timeline

Mentioned in SAL [2016-07-04T09:44:05Z] <hashar> Labs infra cant delete instances anymore (impacts CI as well) T139285

Once labs is able to delete instances again, Nodepool will be able to delete them and thus spawn new ones. At worst we will have to manually delete the instances in the contintcloud project.

Status can be monitored on labnodepool1001.eqiad.wmnet with: nodepool list.

I've shut down nodepool just now since it was still trying to create and delete instances. We're *very* resource constrained in labs atm, so my first priority is to restore labs to a working condition (T139264 etc. are happening atm - random instances are shutting off, and if that reaches tools it'll cause a lot of issues) before re-evaluating turning Nodepool back on.

Just ran:

nova list --all-tenants | grep -i error | grep contintcloud | awk '{ print $2; }' | xargs -L1 nova delete

to delete all the contintcloud instances that were in ERROR state.
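For reference only, the same cleanup sketched with python-novaclient instead of the shell pipeline. Everything specific here is an assumption, not taken from the task: the auth URL, credentials and the contintcloud tenant ID are placeholders that would have to be filled in with real admin values.

  from keystoneauth1 import loading, session
  from novaclient import client

  # Placeholder admin credentials -- hypothetical values, not the real setup.
  auth = loading.get_plugin_loader('password').load_from_options(
      auth_url='https://openstack.example.org:5000/v3',
      username='admin', password='secret', project_name='admin',
      user_domain_name='Default', project_domain_name='Default')
  nova = client.Client('2', session=session.Session(auth=auth))

  CONTINTCLOUD_TENANT_ID = '...'  # placeholder: look the project ID up in keystone

  # List servers across all tenants that are in ERROR state, keep the ones
  # belonging to contintcloud, and delete them (same effect as the one-liner above).
  for server in nova.servers.list(search_opts={'all_tenants': 1, 'status': 'ERROR'}):
      if server.tenant_id == CONTINTCLOUD_TENANT_ID:
          print('Deleting %s (%s)' % (server.name, server.id))
          server.delete()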

A combination of restarting rabbitmq, moving more instances to labvirt1011, and deleting unused instances seems to have fixed this. Still, I'd advise against creating lots of new instances just now, because we're still in a resource crunch...

Paladox triaged this task as Unbreak Now! priority. Jul 4 2016, 12:25 PM

Mentioned in SAL [2016-07-04T12:33:21Z] <yuvipanda> reduced instances quota to 10 before starting it back up for T139285

Change 297256 had a related patch set uploaded (by Hashar):
nodepool: lower # of instances

https://gerrit.wikimedia.org/r/297256

Change 297256 merged by Yuvipanda:
nodepool: lower # of instances

https://gerrit.wikimedia.org/r/297256

Mentioned in SAL [2016-07-04T12:43:15Z] <hashar> Nodepool back up with 10 instances (instead of 20) to accomodate for labs capacity T139285

hashar assigned this task to yuvipanda.

The pool is degraded from 20 to 10 instances until labs has the capacity for more. That is not ideal, but at least the service is back up and the queue of pending jobs is draining properly.

Thanks to @yuvipanda for the quick sync up ;)

Change 297512 had a related patch set uploaded (by Hashar):
Revert "nodepool: lower # of instances"

https://gerrit.wikimedia.org/r/297512

Reopening due to CI having problems again.

Change 297512 abandoned by Hashar:
Revert "nodepool: lower # of instances"

Reason:
Andrew and/or Chase confirmed yesterday that it is going to hurt labs right now. No point in keeping this change open for now.

https://gerrit.wikimedia.org/r/297512

It is solved. What @Paladox noticed yesterday was the pool of instances being exhausted and CI changes being stuck in the queue, pending instances to boot / be made available.

The root cause is that we are down to a maximum of 10 instances. See T139285 and b1d015b50ed404497a1f1c3b7ea67606a0d8181f