
Nodepool cannot spawn instances anymore on wmflabs
Closed, ResolvedPublic

Description

Nodepool has run out of instances on wmflabs. It can neither delete existing instances nor spawn new ones.

nodepool@labnodepool1001:~$ openstack server list
+--------------------------------------+----------------------------+--------+---------------------+
| ID                                   | Name                       | Status | Networks            |
+--------------------------------------+----------------------------+--------+---------------------+
| 3655bc8f-0a1c-471d-a6cd-e0cc1e4ba590 | ci-trusty-wikimedia-106243 | BUILD  |                     |
| 33a95d6b-31da-40c8-b8c6-2f8b4e4e6e0b | ci-jessie-wikimedia-106242 | BUILD  |                     |
| 12a5b5b3-947f-4ada-ba4a-8c5c42535569 | ci-jessie-wikimedia-106241 | BUILD  |                     |
| 94221a0e-1711-47ca-9f48-07d5091e6a5e | ci-trusty-wikimedia-106240 | BUILD  |                     |
| 88879e15-3b60-465f-9dee-5dbfe245da0d | ci-trusty-wikimedia-106239 | ERROR  | public=10.68.16.145 |
| 9058c918-e99b-418a-8412-e4e6695e0049 | ci-trusty-wikimedia-106238 | BUILD  |                     |
| 39e1463e-4f53-4fc9-873a-5531b951b7cf | ci-trusty-wikimedia-106237 | BUILD  |                     |
| ab5b014b-56e5-4c8a-8cf7-44d26569f632 | ci-jessie-wikimedia-106236 | ERROR  | public=10.68.17.77  |
| ebe7384d-28fc-480e-98d1-286d8878ee79 | ci-jessie-wikimedia-106235 | ERROR  | public=10.68.16.15  |
| fc7cd374-4001-4213-9294-ba25c7fcb431 | ci-trusty-wikimedia-106234 | ERROR  | public=10.68.17.61  |
| 6d77faaf-dd0a-4bc8-9335-0c8c7bc0fa33 | ci-trusty-wikimedia-106233 | ERROR  | public=10.68.17.67  |
| 2f7b1539-cdf1-47a5-9a77-c53d9359cae4 | ci-trusty-wikimedia-106232 | ERROR  | public=10.68.17.60  |
| b81209f7-3a39-4a24-b037-6923982cec08 | ci-jessie-wikimedia-106230 | ERROR  | public=10.68.16.187 |
| 41a8c225-54d1-4a0a-af99-df236d75cccc | ci-jessie-wikimedia-106229 | ERROR  | public=10.68.17.50  |
| 8f380e6d-ea04-481d-ad8f-9d8ed880e6d4 | ci-jessie-wikimedia-106228 | ERROR  | public=10.68.16.97  |
| 6f07110f-4f2f-4f46-bddc-1ea30192ab02 | ci-trusty-wikimedia-106226 | ERROR  | public=10.68.16.252 |
+--------------------------------------+----------------------------+--------+---------------------+

The fault for one of the ERROR instances:

{u'message': u'No valid host was found. Exceeded max scheduling attempts 3 for instance 6f07110f-4f2f-4f46-bddc-1ea30192ab02. Last exception: [u\'Traceback (most recent call last):\\n\', u\'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2248, in _do', u'code': 500, u'created': u'2016-05-18T15:08:23Z'}

Event Timeline

Mentioned in SAL [2016-05-18T15:12:54Z] <hashar> Nodepool / labs instance issue filled as https://phabricator.wikimedia.org/T135631

openstack server delete 6f07110f-4f2f-4f46-bddc-1ea30192ab02 worked fine though :)
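Since deleting a single ERROR instance worked, the remaining stuck ones can be cleaned up in bulk. A minimal sketch, assuming the `openstack` CLI is configured with credentials for the project; `error_ids` is a hypothetical helper that extracts bare server IDs from the table output shown above:

```shell
# Extract IDs of servers in ERROR state from `openstack server list`
# table output (columns: ID, Name, Status, Networks).
error_ids() {
    awk -F'|' '$4 ~ /ERROR/ { gsub(/ /, "", $2); print $2 }'
}

# Real usage (deletes servers, so not run here):
# openstack server list | error_ids | xargs -r -n1 openstack server delete
```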

I have stopped nodepool on labnodepool1001.eqiad.wmnet in case it is adding load to the OpenStack labs.

To restart it:

$ ssh labnodepool1001.eqiad.wmnet
$ sudo /usr/sbin/service nodepool start

The first failure was apparently at 14:35 UTC:

2016-05-18 14:35:17,112 INFO nodepool.NodePool: Need to launch 1 ci-jessie-wikimedia nodes for gallium.wikimedia.org on wmflabs-eqiad
2016-05-18 14:35:17,131 INFO nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-106021 in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 106021
2016-05-18 14:35:33,530 INFO nodepool.NodePool: Deleted jenkins node id: 106010
2016-05-18 14:35:55,969 INFO nodepool.NodePool: Deleted node id: 106010
2016-05-18 14:37:35,344 ERROR nodepool.NodeLauncher: LaunchStatusException launching node id: 106021 in provider: wmflabs-eqiad error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 337, in _run
    dt = self.launchNode(session)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 403, in launchNode
    server['status']))
LaunchStatusException: Server 650b985e-ad74-4518-a4a6-6f059ba00f21 for node id: 106021 status: ERROR
2016-05-18 14:37:35,363 INFO nodepool.NodePool: Deleted jenkins node id: 106021
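To pin down when the failures started, the Nodepool log can be searched for the earliest launch exception. A sketch; the log path is an assumption:

```shell
# Print only the first LaunchStatusException seen in a Nodepool log stream.
first_failure() {
    grep -m1 'LaunchStatusException'
}

# Real usage (log path is an assumption):
# first_failure < /var/log/nodepool/nodepool.log
```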

Maybe the images were not built correctly; I have deleted them:

hashar@labnodepool1001:/var/log/nodepool$ nodepool image-list
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+
| ID  | Provider      | Image               | Hostname                       | Version    | Image ID                             | Server ID                            | State | Age (hours) |
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+
| 683 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-1463494440 | 1463494440 | 268258f9-1d6e-47f2-8bd4-73033bf0ef9a | 3c97f75b-3ed0-4999-b577-fa2d67a7a579 | ready | 25.13       |
| 685 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-1463580840 | 1463580840 | 75bb8c62-fa5e-498e-a426-e7bc9a8fa345 | 20601e50-7d70-49d1-8e00-33d4b99b3013 | ready | 1.13        |
| 684 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-1463494442 | 1463494442 | 0c8f2355-e90a-48a4-b653-f0a65a7a5354 | accd7f8e-6c5f-4bb9-85c5-f346a6a2c461 | ready | 25.11       |
| 686 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-1463580842 | 1463580842 | 6f3d7f7e-c749-492d-b378-d9e7191d508c | b678c2ab-8b85-499b-bc06-5d90781ce5c3 | ready | 1.07        |
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+
hashar@labnodepool1001:/var/log/nodepool$ 
hashar@labnodepool1001:/var/log/nodepool$ nodepool image-delete 685
2016-05-18 15:30:17,231 INFO urllib3.connectionpool: Starting new HTTP connection (1): labcontrol1001.wikimedia.org
2016-05-18 15:30:17,312 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:17,473 WARNING nodepool.NodePool: Image server id 20601e50-7d70-49d1-8e00-33d4b99b3013 not found
2016-05-18 15:30:18,475 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:19,857 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:21,056 INFO nodepool.NodePool: Deleted image id: 685
hashar@labnodepool1001:/var/log/nodepool$ nodepool image-delete 686
2016-05-18 15:30:26,126 INFO urllib3.connectionpool: Starting new HTTP connection (1): labcontrol1001.wikimedia.org
2016-05-18 15:30:26,237 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:26,399 WARNING nodepool.NodePool: Image server id b678c2ab-8b85-499b-bc06-5d90781ce5c3 not found
2016-05-18 15:30:27,402 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:28,797 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-05-18 15:30:30,333 INFO nodepool.NodePool: Deleted image id: 686
hashar@labnodepool1001:/var/log/nodepool$

Worrying warnings:

2016-05-18 15:30:17,473 WARNING nodepool.NodePool: Image server id 20601e50-7d70-49d1-8e00-33d4b99b3013 not found

2016-05-18 15:30:26,399 WARNING nodepool.NodePool: Image server id b678c2ab-8b85-499b-bc06-5d90781ce5c3 not found

Mentioned in SAL [2016-05-18T15:32:01Z] <hashar> Deleted Nodepool snapshot images created around 14:30 apparently they havent been provisioned properly. Started nodepool again. Poke T135631

I have restarted Nodepool; it should now spawn instances from yesterday's snapshots:

$ nodepool image-list
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+
| ID  | Provider      | Image               | Hostname                       | Version    | Image ID                             | Server ID                            | State | Age (hours) |
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+
| 683 | wmflabs-eqiad | ci-jessie-wikimedia | ci-jessie-wikimedia-1463494440 | 1463494440 | 268258f9-1d6e-47f2-8bd4-73033bf0ef9a | 3c97f75b-3ed0-4999-b577-fa2d67a7a579 | ready | 25.18       |
| 684 | wmflabs-eqiad | ci-trusty-wikimedia | ci-trusty-wikimedia-1463494442 | 1463494442 | 0c8f2355-e90a-48a4-b653-f0a65a7a5354 | accd7f8e-6c5f-4bb9-85c5-f346a6a2c461 | ready | 25.16       |
+-----+---------------+---------------------+--------------------------------+------------+--------------------------------------+--------------------------------------+-------+-------------+

I restarted Nodepool and eventually it failed to spawn instances with the same error. I think that rules out today's snapshots.

It might be an issue with Glance, which somehow lost the images or is not reachable by the scheduler.
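If Glance losing images is the suspicion, the Image IDs reported by `nodepool image-list` can be checked against Glance directly. A hedged sketch; `image_ids` is a hypothetical helper that parses the table format shown above (Image ID is the sixth column):

```shell
# Pull the Image ID column from `nodepool image-list` data rows,
# skipping the header and the +---+ separator lines.
image_ids() {
    awk -F'|' 'NF > 7 && $2 !~ /ID/ { gsub(/ /, "", $7); print $7 }'
}

# Real usage (not run here); a missing image would make `image show` fail:
# nodepool image-list | image_ids | xargs -r -n1 openstack image show
```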

Nodepool is stopped again.

Luke081515 triaged this task as Unbreak Now! priority. May 18 2016, 3:50 PM
Luke081515 subscribed.

Blocks Zuul.

Nodepool eventually restarted due to Puppet. The Horizon interface shows instances blocked on various tasks: Spawning, Scheduling, Networking, Block Device Mapping.

I have manually created an instance on the contintcloud project using ci-jessie-wikimedia-1463494440 as a base image:

Name: hashar-testspawn
ID: 6fc4c33f-171a-4044-935d-cc83a6ca53fe
Message: No valid host was found. There are not enough hosts available.
Code: 500
Created: May 18, 2016, 4:18 p.m.

Change 289451 had a related patch set uploaded (by Hashar):
zuul status: notice about ongoing outage

https://gerrit.wikimedia.org/r/289451

Change 289451 merged by jenkins-bot:
zuul status: notice about ongoing outage

https://gerrit.wikimedia.org/r/289451

2016-05-18 16:45:29.980 5375 ERROR nova.compute.manager [req-86afc675-0c57-44f4-a164-e1a8320c845b novaadmin testlabs - - -] [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] Instance failed to spawn
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] Traceback (most recent call last):
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2473, in _build_resources
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] yield resources
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2345, in _build_and_run_instance
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] block_device_info=block_device_info)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2385, in spawn
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] block_device_info=block_device_info)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4408, in _create_domain_and_network
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] power_on=power_on)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4339, in _create_domain
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] LOG.error(err)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in exit
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] six.reraise(self.type_, self.value, self.tb)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4329, in _create_domain
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] domain.createWithFlags(launch_flags)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] result = proxy_call(self._autowrap, f, *args, **kwargs)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] rv = execute(f, *args, **kwargs)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] six.reraise(c, e, tb)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] rv = meth(*args, **kwargs)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in createWithFlags
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c] libvirtError: Connection is closed
2016-05-18 16:45:29.980 5375 TRACE nova.compute.manager [instance: 4f2877eb-e67a-407b-a038-96019fb6732c]

Andrew claimed this task.

This is resolved now, and I don't know what went wrong :(

File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Connection is closed

That is from the libvirt Python binding (python-libvirt). Maybe the compute nodes somehow ran out of assignable hosts, or some error condition caused them to pile up in a non-functioning state. Hard to know; maybe the compute hosts have more logs in syslog or the kernel log.
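"libvirtError: Connection is closed" suggests nova-compute lost its socket to libvirtd on the hypervisor. A sketch of checks to run on a compute node; the service names, commands, and log path are assumptions about the hosts involved:

```shell
# Filter a syslog stream down to recent libvirt-related errors.
libvirt_errors() {
    grep -i 'libvirt' | grep -iE 'error|fail' | tail -n 20
}

# Real usage on a compute node (not run here):
#   sudo service libvirt-bin status     # is the daemon up?
#   sudo virsh list --all               # does libvirtd answer queries?
#   libvirt_errors < /var/log/syslog
```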

Anyway that seems fixed for now!