Nodepool has trouble taking snapshots on OpenStack labs
Closed, Resolved (Public)

Description

For a few days now, Nodepool has been unable to reliably get our OpenStack cluster to create a snapshot image. Instance provisioning finishes, but the snapshot itself never happens and the image is never found.

An example from June 17:

2016-06-17 18:54:44,620 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: ./setup_node.sh complete (hostname: ci-trusty-wikimedia-1466189524)

That instance has ID 5cf2baa5-233a-4f78-98dc-1079693c81e7. Nodepool then asks to create a snapshot:

POST /v2/contintcloud/servers/5cf2baa5-233a-4f78-98dc-1079693c81e7/action HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate 
Content-Length: 105
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: XXXXXXXX
Content-Type: application/json

{
   "createImage" : {
      "metadata" : {
         "properties" : {
            "show" : "true"
         }
      },
      "name" : "ci-trusty-wikimedia-1466189524"
   }
}

HTTP/1.1 202 Accepted
Content-Type: text/html; charset=UTF-8
Content-Length: 0
Location: http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
Date: Fri, 17 Jun 2016 18:54:45 GMT
Connection: keep-alive
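
For reference, the exchange above can be sketched in a few lines of Python. This is only an illustration of the request body and of how the image ID is read from the Location header; the helper names are hypothetical and are not taken from Nodepool or python-novaclient.

```python
import json
from urllib.parse import urlparse


def build_create_image_body(name, metadata=None):
    """Return the JSON body of a createImage server action (hypothetical helper)."""
    return json.dumps({"createImage": {"name": name, "metadata": metadata or {}}})


def image_id_from_location(location):
    """Return the image ID, i.e. the last path segment of the Location header."""
    path = urlparse(location).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    # Values copied from the trace above.
    print(build_create_image_body(
        "ci-trusty-wikimedia-1466189524",
        {"properties": {"show": "true"}},
    ))
    print(image_id_from_location(
        "http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/"
        "f9a0eb80-7e96-4ba2-82ca-8d19b56bc974"
    ))
```

The ID extracted this way is the one that is polled in the next request.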

Nodepool then polls the API to enquire about the status of the snapshot:

2016-06-17 18:54:46,857 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
GET /v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974 HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: 01eea503f0244410ba06a17ef51af329



HTTP/1.1 404 Not Found
Content-Length: 62
Content-Type: application/json; charset=UTF-8
Date: Fri, 17 Jun 2016 18:54:47 GMT
Connection: keep-alive

{
   "itemNotFound" : {
      "code" : 404,
      "message" : "Image not found."
   }
}

The OpenStack CLI cannot find the image either:

$ openstack image show f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
ERROR: openstack No image with a name or ID of 'f9a0eb80-7e96-4ba2-82ca-8d19b56bc974' exists.

From there it keeps looping over and over for up to six hours...

I hammered the command a few times earlier today, and after a few tries it eventually managed to take a snapshot.
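
The polling behaviour described above could be bounded so that a snapshot which never shows up fails quickly instead of looping for hours. The following is a minimal sketch under stated assumptions, not Nodepool's actual code: `get_status` is a hypothetical caller-supplied callable that returns the image status as a lowercase string, or None when the API answers 404, and the set of failure statuses is an assumption as well.

```python
import time


class SnapshotFailed(Exception):
    """Raised when the snapshot image does not become active."""


def wait_for_image(get_status, timeout=3600.0, interval=10.0,
                   not_found_grace=5, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "active".

    A few consecutive 404 answers (None) are tolerated right after the
    snapshot request; more than not_found_grace of them is treated as failure.
    """
    deadline = clock() + timeout
    misses = 0
    while True:
        status = get_status()
        if status == "active":
            return
        if status is None:
            misses += 1
            if misses > not_found_grace:
                raise SnapshotFailed("image still not found after %d polls" % misses)
        elif status in ("killed", "deleted", "error"):
            # Assumed terminal failure statuses.
            raise SnapshotFailed("image entered status %r" % status)
        else:
            # Any other status (for example "saving") means the image exists.
            misses = 0
        if clock() >= deadline:
            raise SnapshotFailed("timed out waiting for image")
        sleep(interval)
```

With a check like this, the 404 loop shown in the trace would be reported as a failed snapshot after a handful of polls, and the build could be retried.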

Event Timeline

chasemp added subscribers: Andrew, chasemp.

@Andrew, tossing this your way as I'm not familiar with the setup here. I will take a look and see if I can make sense of this issue, but it may have to wait for you.

Most likely this is failing due to lack of resources on the virt nodes (that's my go-to answer for everything this month.) There's probably a record of something breaking in the compute logs -- Antoine, if you're able to reproduce this with a particular instance on a particular host then we can dig in the logs.

I can't remember whether I managed to reproduce this manually using the openstack CLI. In case it is needed later, the process would be:


Create an instance:

ssh labnodepool1001.eqiad.wmnet
become-nodepool
openstack server create --image ci-jessie-wikimedia --flavor m1.medium T138106-instance

Poll until openstack server show T138106-instance -c status yields ACTIVE.

Snapshot with debug / waiting for task to complete:

openstack --debug server image create --name T138106-snapshot --wait T138106-instance

The HTTP output shows that the requests originate from python-glanceclient, which issues HEAD requests to http://labcontrol1001.wikimedia.org:9292/v1/images/e60159e1-54a7-4f69-8e63-5e7b217d4d80 while waiting for the following header change:

- x-image-meta-status: saving
+ x-image-meta-status: active

That currently works just fine.
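
The "poll until ACTIVE" step of the manual procedure above can be scripted. This is a rough sketch that shells out to the openstack CLI with the `-f value -c status` output options; the function names, the retry counts and the treatment of ERROR as a terminal status are assumptions made for illustration.

```python
import subprocess
import time


def server_status(name, run=subprocess.check_output):
    """Return the server status as printed by the openstack CLI."""
    out = run(["openstack", "server", "show", name, "-f", "value", "-c", "status"])
    return out.decode().strip()


def wait_until_active(name, attempts=60, interval=5.0,
                      run=subprocess.check_output, sleep=time.sleep):
    """Return True once the server is ACTIVE, False on ERROR or after too many polls."""
    for _ in range(attempts):
        status = server_status(name, run=run)
        if status == "ACTIVE":
            return True
        if status == "ERROR":
            return False
        sleep(interval)
    return False


if __name__ == "__main__":
    # Requires the openstack CLI and credentials, e.g. after become-nodepool.
    print(wait_until_active("T138106-instance"))
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a real cloud.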


If the labs infrastructure is overloaded/overprovisioned, I guess this task is a known issue that will eventually sort itself out down the road. So I have lowered the priority, and might even stall it until the labs infra has more capacity.

The real way to reproduce what Nodepool is doing would be:

ssh labnodepool1001.eqiad.wmnet
become-nodepool
nodepool image-update wmflabs-eqiad ci-jessie-wikimedia

That boots an instance, refreshes the repos, applies puppet, syncs, and then attempts to create an image out of that instance.

When I try those commands, it gets stuck on

2016-06-23 09:44:02,341 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet

Is that the image creation?

Yes sir! It is apparently looping forever trying to GET /v2/contintcloud/images/<some id>, which returns a 404 because the snapshotting apparently never occurs.

Nodepool's intent is to show the image metadata and watch the x-image-meta-status header, which should change from saving to active.

Mentioned in SAL [2016-06-23T10:13:37Z] <andrewbogott> restarting rabbitmq-server on labcontrol1001 (random debugging attempt for T138106)

After two weeks and labs recovering some free space, Nodepool has finally managed to regenerate a Jessie image (TS 1467728040) roughly 4 hours ago.

The Trusty one is failing for some other reason.

Keeping it open because it will keep failing until the contintcloud quota is large enough to accommodate an instance being booted and snapshotted.

Is this still failing, or are things resolved now that we have increased labs capacity?

On 27/07/2016 at 17:25, Andrew wrote:

Andrew added a comment.

Is this still failing, or are things resolved now that we have increased
labs capacity?

!assign andrew
!close

That is solved. The instance would spawn and be provisioned via puppet, but
the snapshotting failed for some reason.

I have confirmed it got solved, I just forgot to close this task :-} Thank
you Cloud-Services!