For a few days now, Nodepool has been unable to reliably get our OpenStack cluster to create a snapshot image. Instance provisioning finishes, but the snapshot itself never happens and the image is never found.
An example from June 17:
2016-06-17 18:54:44,620 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: ./setup_node.sh complete (hostname: ci-trusty-wikimedia-1466189524)
That instance has ID 5cf2baa5-233a-4f78-98dc-1079693c81e7. Nodepool then asks to create a snapshot:
POST /v2/contintcloud/servers/5cf2baa5-233a-4f78-98dc-1079693c81e7/action HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Content-Length: 105
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: XXXXXXXX
Content-Type: application/json
{
    "createImage" : {
        "metadata" : {
            "properties" : {
                "show" : "true"
            }
        },
        "name" : "ci-trusty-wikimedia-1466189524"
    }
}
HTTP/1.1 202 Accepted
Content-Type: text/html; charset=UTF-8
Content-Length: 0
Location: http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
Date: Fri, 17 Jun 2016 18:54:45 GMT
Connection: keep-alive
Nodepool then polls the API to enquire about the status of the snapshot:
2016-06-17 18:54:46,857 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
GET /v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974 HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: 01eea503f0244410ba06a17ef51af329
HTTP/1.1 404 Not Found
Content-Length: 62
Content-Type: application/json; charset=UTF-8
Date: Fri, 17 Jun 2016 18:54:47 GMT
Connection: keep-alive
{
    "itemNotFound" : {
        "code" : 404,
        "message" : "Image not found."
    }
}
$ openstack image show f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
ERROR: openstack No image with a name or ID of 'f9a0eb80-7e96-4ba2-82ca-8d19b56bc974' exists.
From there Nodepool keeps polling over and over, for up to six hours...
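To illustrate the polling behaviour described above, here is a minimal sketch of a status poll with a much shorter deadline than six hours. It is not Nodepool's actual code: `wait_for_image`, `fetch_status` and the default timings are hypothetical, and the `ACTIVE` / `ERROR` status strings are assumed to be what the compute API reports for images.

```python
import time


def wait_for_image(fetch_status, image_id, timeout=600.0, interval=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until the image is ACTIVE, or raise once `timeout` seconds pass.

    `fetch_status` is a hypothetical callable that returns the image status
    string, or None when the API answers 404 "Image not found.".
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(image_id)
        if status == "ACTIVE":
            return status
        if status == "ERROR":
            raise RuntimeError("image %s went to ERROR" % image_id)
        if clock() >= deadline:
            # Give up early instead of looping for hours on a missing image.
            raise TimeoutError(
                "image %s still %r after %.0fs" % (image_id, status, timeout))
        sleep(interval)
```

With something like this, a 404 that never clears would surface as a `TimeoutError` after ten minutes rather than tying up the image build for six hours.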
I hammered the command a few times earlier today, and after a few tries it eventually managed to take a snapshot.
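The manual retrying could be sketched roughly as follows. This is only an illustration of the workaround, not a fix for the underlying problem; `snapshot_with_retries` and both callables it takes are hypothetical names and not part of Nodepool or python-novaclient.

```python
def snapshot_with_retries(create_snapshot, image_exists, attempts=3):
    """Request a snapshot up to `attempts` times; return the first ID found.

    Both callables are hypothetical stand-ins: `create_snapshot()` issues the
    createImage action and returns the new image ID, and
    `image_exists(image_id)` reports whether the API can find that image.
    """
    tried = []
    for _ in range(attempts):
        image_id = create_snapshot()
        if image_exists(image_id):
            return image_id
        # The API accepted the request (202) but the image never showed up.
        tried.append(image_id)
    raise RuntimeError(
        "no snapshot found after %d attempts: %s" % (attempts, ", ".join(tried)))
```

Keeping the list of image IDs that were accepted but never appeared makes it easier to look for them in the server-side logs afterwards.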