wmflabs OpenStack is deadlocked (can't boot or delete instances)
Closed, Resolved (Public)

Authored By

	hashar
	Apr 26 2016, 8:37 AM

Description

I had an issue with an instance (T133652) that could no longer reach /dev/vda. Looking at Nodepool, it is unable to delete or spawn instances over the OpenStack API.

It seems Keystone, Nova, or some other component is deadlocked somehow :(

The first issue in the Nodepool logs is at 05:13 UTC.

Attempting to spawn an instance fails with status ERROR:

2016-04-26 05:13:17,416 ERROR nodepool.NodeLauncher: LaunchStatusException launching node id: 83522 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 403, in launchNode
    server['status']))
LaunchStatusException: Server 882f2ef7-ad9b-4e9f-9e01-86e788a39ed4 for node id: 83522 status: ERROR

Ditto for deletion:

2016-04-26 05:23:22,611 ERROR nodepool.NodeDeleter: Exception deleting node 83522:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 297, in run
    self.nodepool._deleteNode(session, node)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 2159, in _deleteNode
    manager.waitForServerDeletion(node.external_id)
  File "/usr/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 450, in waitForServerDeletion
    (server_id, self.provider.name)):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 42, in iterate_timeout
    raise Exception("Timeout waiting for %s" % purpose)
Exception: Timeout waiting for server 882f2ef7-ad9b-4e9f-9e01-86e788a39ed4 deletion in wmflabs-eqiad
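
The deletion traceback ends in a polling helper that raises once a deadline has passed. Below is a minimal sketch of that pattern, written for illustration and not copied from Nodepool: the function names are modeled on the traceback, while `interval`, `server_gone`, and the short default timeouts are hypothetical. If the API never reports the server as gone, the loop simply runs out of time, which would produce a message like the one above.

```python
import time


def iterate_timeout(max_seconds, purpose, interval=0.01):
    """Yield attempt numbers until max_seconds elapse, then raise."""
    deadline = time.monotonic() + max_seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1
        yield count
        time.sleep(interval)
    raise Exception("Timeout waiting for %s" % purpose)


def wait_for_server_deletion(server_gone, server_id, provider, max_seconds=0.05):
    """Poll server_gone() until it returns True or the deadline passes.

    server_gone is a hypothetical callable standing in for an API query.
    Returns the number of attempts it took.
    """
    purpose = "server %s deletion in %s" % (server_id, provider)
    for attempt in iterate_timeout(max_seconds, purpose):
        if server_gone():
            return attempt
```

With a probe that never succeeds, for example `wait_for_server_deletion(lambda: False, "abc", "wmflabs-eqiad")`, the generator is exhausted and the exception propagates to the caller.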

I also tried to create the instance castor2.integration.eqiad.wmflabs, but it never spawned :(

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		hashar	T133652 castor.integration.eqiad.wmflabs unreacheable deadlocking the whole CI
		Resolved		yuvipanda	T133654 wmflabs OpenStack is deadlocked (can't boot or delete instances)

Event Timeline

castor2 spawned via wikitech yields an error in the Horizon dashboard:

Error: Failed to perform requested operation on instance "castor2", the instance has an error status: Please try again later [Error: Build of instance 691004a0-cc52-4b95-93e9-5be2eee35c5a aborted: Could not clean up failed build, not rescheduling].

From Icinga:

labvirt1008

Disk space
WARNING 2016-04-26 08:38:43 0d 6h 59m 14s 3/3
DISK WARNING - free space: /var/lib/nova/instances 159698 MB (6% inode=99%):

Apparently it still has about 160 GB free?
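
For context on the numbers in that alert: 159698 MB is reported as 6% of the volume, so the absolute amount free is large even though the percentage is low. A rough back-of-the-envelope sketch follows. The percentage in the alert is rounded, so the result is only an estimate, and the 5.5% to 6.5% bounds are an assumption about that rounding.

```python
def estimate_total_mb(free_mb, free_pct):
    """Estimate volume size from free space and the percentage it represents."""
    return free_mb / (free_pct / 100.0)


free_mb = 159698  # from the Icinga alert for /var/lib/nova/instances

total_mb = estimate_total_mb(free_mb, 6)
# Assumed rounding bounds: "6%" could be anywhere from 5.5% to 6.5%.
low_mb = estimate_total_mb(free_mb, 6.5)
high_mb = estimate_total_mb(free_mb, 5.5)

print("estimated volume size: about %.0f MB" % total_mb)
print("plausible range: %.0f MB to %.0f MB" % (low_mb, high_mb))
```

So the volume is on the order of a few million MB, and a percentage-based warning can fire while roughly 160 GB remain free.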

Mentioned in SAL [2016-04-26T08:45:21Z] <hashar> Most of CI is down / deadlocked due to wmflabs being unresponsive T133654

hashar mentioned this in T133655: npm-node-4.3 test fails on a core patch. Apr 26 2016, 8:46 AM

Mentioned in SAL [2016-04-26T08:50:21Z] <YuviPanda> restarted nova-conductor & scheduler on labcontrol1001 for T133654

That seems to have fixed it now!

I'm going to file a bug to have a paging check for this.
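
A minimal sketch of what such a check could look like, assuming an Icinga/Nagios-style plugin (exit code 0 for OK, 2 for CRITICAL, 3 for UNKNOWN) that probes an HTTP endpoint with a timeout. The endpoint URL, the 10-second timeout, and all function names are hypothetical. A plain HTTP probe would only catch an unresponsive API; since this incident was fixed by restarting nova-conductor and the scheduler, a real check may need to exercise an operation end to end.

```python
import urllib.error
import urllib.request

OK, CRITICAL, UNKNOWN = 0, 2, 3


def classify(status, error):
    """Map a probe result to an (exit_code, message) pair."""
    if error is not None:
        return CRITICAL, "CRITICAL - API probe failed: %s" % error
    if status >= 500:
        return CRITICAL, "CRITICAL - API returned HTTP %d" % status
    return OK, "OK - API answered with HTTP %d" % status


def probe(url, timeout):
    """Return (status, error); error is None whenever a response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success code.
        return exc.code, None
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, or timeout.
        return None, str(exc)


def main(argv):
    """Run the check against the URL given as the only argument."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_api URL")
        return UNKNOWN
    code, message = classify(*probe(argv[1], timeout=10))
    print(message)
    return code
```

To use it as a plugin, the script would end with `import sys; sys.exit(main(sys.argv))` so that the monitoring system sees the exit code.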

hashar mentioned this in T133656: Have a paging check for Nova API accessible. Apr 26 2016, 9:24 AM

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:43 PM

Restricted Application added a subscriber: Jay8g. Jun 7 2017, 6:43 PM
