
wmflabs OpenStack is deadlocked (can't boot or delete instances)
Closed, Resolved (Public)

Description

I had an issue with an instance (T133652) that could no longer reach /dev/vda. Looking at Nodepool, it is unable to delete or spawn instances over the OpenStack API.

Seems Keystone / Nova or whatever is deadlocked somehow :(

The first issue in the Nodepool logs is at 05:13 UTC.

Attempting to spawn an instance fails (the server ends up in ERROR status):

2016-04-26 05:13:17,416 ERROR nodepool.NodeLauncher: LaunchStatusException launching node id: 83522 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 403, in launchNode
    server['status']))
LaunchStatusException: Server 882f2ef7-ad9b-4e9f-9e01-86e788a39ed4 for node id: 83522 status: ERROR

Ditto for deletion:

2016-04-26 05:23:22,611 ERROR nodepool.NodeDeleter: Exception deleting node 83522:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 297, in run
    self.nodepool._deleteNode(session, node)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 2159, in _deleteNode
    manager.waitForServerDeletion(node.external_id)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
    (server_id, self.provider.name)):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 42, in iterate_timeout
    raise Exception("Timeout waiting for %s" % purpose)
Exception: Timeout waiting for server 882f2ef7-ad9b-4e9f-9e01-86e788a39ed4 deletion in wmflabs-eqiad
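The deletion traceback ends in a helper called iterate_timeout, which raises once a deadline passes. The following is only a minimal sketch of that polling pattern, not Nodepool's actual code: the signatures, the default interval, and the server_exists callable are assumptions made for illustration.

```python
import time


def iterate_timeout(max_seconds, purpose, interval=2.0):
    """Yield an increasing counter until max_seconds have elapsed, then raise.

    Sketch only: the real helper lives in nodepool/nodeutils.py and may differ.
    """
    start = time.monotonic()
    count = 0
    while time.monotonic() - start < max_seconds:
        count += 1
        yield count
        time.sleep(interval)
    raise Exception("Timeout waiting for %s" % purpose)


def wait_for_deletion(server_exists, server_id, provider, timeout=600, interval=2.0):
    """Poll until the server is gone; server_exists is a hypothetical callable."""
    purpose = "server %s deletion in %s" % (server_id, provider)
    for _ in iterate_timeout(timeout, purpose, interval):
        if not server_exists(server_id):
            return True
```

If the API never reports the server as gone, which is what a deadlocked control plane looks like from the client side, the loop runs out the clock and raises the same kind of "Timeout waiting for ..." exception seen above.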

I also tried to create the instance castor2.integration.eqiad.wmflabs, but it never spawned :(

Event Timeline

Spawning castor2 via wikitech yields an error in the Horizon dashboard:

Error: Failed to perform requested operation on instance "castor2", the instance has an error status: Please try again later [Error: Build of instance 691004a0-cc52-4b95-93e9-5be2eee35c5a aborted: Could not clean up failed build, not rescheduling].

From Icinga:

labvirt1008

Disk space
WARNING 2016-04-26 08:38:43 0d 6h 59m 14s 3/3
DISK WARNING - free space: /var/lib/nova/instances 159698 MB (6% inode=99%):

Apparently it still has about 160 GB free?
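A quick back-of-the-envelope calculation may explain why roughly 160 GB free still triggers a warning: the check reports the free space as 6% of the volume. The sketch below assumes the warning threshold is percentage-based (not confirmed here) and treats the reported 6% as exact, although it is rounded, so the results are only approximate.

```python
free_mb = 159698        # free space reported by the Icinga check
free_fraction = 0.06    # "6%" as reported; rounded, so treat results as rough

# Estimate the volume size and how much of it is already in use.
total_mb = free_mb / free_fraction
used_mb = total_mb - free_mb

print("approx total: %.2f TB" % (total_mb / 1e6))
print("approx used:  %.2f TB" % (used_mb / 1e6))
```

Allowing for the rounding of the percentage, /var/lib/nova/instances would be somewhere around 2.5 to 2.9 TB, so 160 GB is a small remainder in relative terms.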

Mentioned in SAL [2016-04-26T08:45:21Z] <hashar> Most of CI is down / deadlocked due to wmflabs being unresponsive T133654

Mentioned in SAL [2016-04-26T08:50:21Z] <YuviPanda> restarted nova-conductor & scheduler on labcontrol1001 for T133654

yuvipanda claimed this task.
yuvipanda subscribed.

That seems to have fixed it now!

I'm going to file a bug to have a paging check for this.
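As a rough idea of what such a check could look like, here is a minimal sketch following the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Everything else is an assumption: boot_and_delete is a hypothetical callable that would boot and then delete a throwaway instance, and the thresholds are made-up defaults, not values from this task.

```python
import time

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_instance_lifecycle(boot_and_delete, warn_seconds=120, crit_seconds=300):
    """Time a boot+delete cycle and map the outcome to a plugin status.

    boot_and_delete is a hypothetical callable supplied by the caller; it
    should raise on failure (for example when the instance goes to ERROR).
    """
    start = time.monotonic()
    try:
        boot_and_delete()
    except Exception as exc:
        return CRITICAL, "CRITICAL - instance lifecycle failed: %s" % exc
    elapsed = time.monotonic() - start
    if elapsed >= crit_seconds:
        return CRITICAL, "CRITICAL - lifecycle took %.1fs" % elapsed
    if elapsed >= warn_seconds:
        return WARNING, "WARNING - lifecycle took %.1fs" % elapsed
    return OK, "OK - lifecycle completed in %.1fs" % elapsed
```

A wrapper script would print the returned message and exit with the returned code, which is how the monitoring system decides whether to page.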