For a few days now, Nodepool has been unable to reliably get our OpenStack cluster to create a snapshot image. Instance provisioning completes, but the snapshot itself never happens and the image is never found.
An example from June 17:
2016-06-17 18:54:44,620 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: ./setup_node.sh complete (hostname: ci-trusty-wikimedia-1466189524)
That instance has ID 5cf2baa5-233a-4f78-98dc-1079693c81e7. Nodepool then asks to create a snapshot:
POST /v2/contintcloud/servers/5cf2baa5-233a-4f78-98dc-1079693c81e7/action HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Content-Length: 105
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: XXXXXXXX
Content-Type: application/json

{ "createImage" : { "metadata" : { "properties" : { "show" : "true" } }, "name" : "ci-trusty-wikimedia-1466189524" } }

HTTP/1.1 202 Accepted
Content-Type: text/html; charset=UTF-8
Content-Length: 0
Location: http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
Date: Fri, 17 Jun 2016 18:54:45 GMT
Connection: keep-alive
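The image ID that is polled afterwards is the last path segment of the Location header in the 202 response. A minimal sketch of that extraction in Python, using only the standard library; the helper name image_id_from_location is hypothetical and is not taken from Nodepool or python-novaclient:

```python
from urllib.parse import urlparse


def image_id_from_location(location):
    """Return the last path segment of a Location URL (here, the image ID)."""
    path = urlparse(location).path
    # Drop any trailing slash, then keep what follows the final "/".
    return path.rstrip("/").rsplit("/", 1)[-1]


# Location header value copied from the 202 response above.
location = (
    "http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/"
    "f9a0eb80-7e96-4ba2-82ca-8d19b56bc974"
)
print(image_id_from_location(location))
```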
Nodepool then polls the API to enquire about the status of the snapshot:
2016-06-17 18:54:46,857 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet

GET /v2/contintcloud/images/f9a0eb80-7e96-4ba2-82ca-8d19b56bc974 HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: 01eea503f0244410ba06a17ef51af329

HTTP/1.1 404 Not Found
Content-Length: 62
Content-Type: application/json; charset=UTF-8
Date: Fri, 17 Jun 2016 18:54:47 GMT
Connection: keep-alive

{ "itemNotFound" : { "code" : 404, "message" : "Image not found." } }

$ openstack image show f9a0eb80-7e96-4ba2-82ca-8d19b56bc974
ERROR: openstack No image with a name or ID of 'f9a0eb80-7e96-4ba2-82ca-8d19b56bc974' exists.
From there it keeps looping over and over for up to six hours...
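The loop described above amounts to polling the image URL until it stops returning 404 or a deadline expires. A minimal sketch of that pattern in Python follows. It is not Nodepool's actual implementation: wait_for_image, fetch_status and the interval value are hypothetical, and only the six-hour ceiling comes from the behaviour observed here.

```python
import time


def wait_for_image(fetch_status, timeout=6 * 3600, interval=10,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until fetch_status() returns something other than 404.

    fetch_status is a caller-supplied callable returning the HTTP status
    code of the image GET request. Returns (status, attempts), or
    (None, attempts) if the timeout expires while still getting 404.
    """
    deadline = clock() + timeout
    attempts = 0
    while clock() < deadline:
        attempts += 1
        status = fetch_status()
        if status != 404:
            return status, attempts
        sleep(interval)  # image not visible yet, wait before retrying
    return None, attempts


# Simulated API: two 404 responses, then the image becomes visible.
responses = iter([404, 404, 200])
print(wait_for_image(lambda: next(responses), timeout=60, interval=0))
```

Counting attempts and returning None on timeout makes it possible to tell "the image appeared late" apart from "the image never appeared", which is the distinction that matters in this report.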
I hammered the command a few times earlier today, and after a few tries it eventually managed to take a snapshot.