Page MenuHomePhabricator

Nodepool can't refresh snapshot on labs since ~ Feb 15th
Closed, ResolvedPublic

Description

Every day around 14:30 UTC, Nodepool spawn an instance out of the reference image ci-jessie-wikimedia in order to create a snapshot. The image is really just refreshed by pulling material and rerunning the puppet provision script we use.

The last known good snapshots are:
nodepool image-list

IDProviderImageHostnameVersionImage IDServer IDStateAge (hours)
482wmflabs-eqiadci-jessie-wikimediaci-jessie-wikimedia-1455550768145555076873aab493-67bc-4adb-abec-045f5bc7b7d2760b0b44-b123-4fc6-b793-259ec78c6edbready173.45
483wmflabs-eqiadci-jessie-wikimediaci-jessie-wikimedia-1455552377145555237797ccc6a0-55f3-4731-a2ad-c3ac4a5a358f84efc5c9-28a9-463b-b358-e84d05e92007ready173.00

Which are Monday Feb 15th 16:10 and 16:37 UTC (they are most probably snapshot I have manually forced since they are not at 14:30).

After that date, when Nodepool schedule an image refresh it logs HTTP connections to labnet1002.eqiad.wmnet the Nova API. It successfully:

  • boot an instance with a name ci-jessie-wikimedia-<UNIX TIMESTAMP>
  • provision it
  • I can login on that instance just fine

But after that nodepool is caught looping over and over waiting for something. @Andrew pasted glance server side logs showing the image is somehow not found:

12016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 523, in _do_request
22016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client raise exception.NotFound(res.read())
32016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client NotFound: 404 Not Found
42016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client
52016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client The resource could not be found.
62016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client
72016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client
82016-02-22 20:56:54.063 15962 TRACE glance.registry.client.v1.client
92016-02-22 20:56:54.065 15962 INFO eventlet.wsgi.server [req-5b8f0923-17d8-4b6f-ad3e-b0af6717184a nodepoolmanager contintcloud - - -] 10.64.20.25 - - [22/Feb/2016 20:56:54] "HEAD /v1/images/54de67d5-621d-4666-9bb4-2b9c5fb62321 HTTP/1.1" 404 193 0.139726
102016-02-22 20:56:58.003 15962 INFO eventlet.wsgi.server [req-ccb360dd-1a51-4d48-bb84-b63466e671f2 nodepoolmanager contintcloud - - -] 10.64.20.25 - - [22/Feb/2016 20:56:58] "HEAD /v1/images/97ccc6a0-55f3-4731-a2ad-c3ac4a5a358f HTTP/1.1" 200 1272 0.250186
112016-02-22 20:56:59.006 15962 INFO eventlet.wsgi.server [req-b5824492-29c8-42e0-8cad-4341c5f9f4a7 nodepoolmanager contintcloud - - -] 10.64.20.24 - - [22/Feb/2016 20:56:59] "HEAD /v1/images/97ccc6a0-55f3-4731-a2ad-c3ac4a5a358f HTTP/1.1" 200 1272 0.144431
122016-02-22 20:57:04.207 15962 ERROR glance.registry.client.v1.client [req-e8604ea7-e15d-4bdd-be3c-4744751c3867 nodepoolmanager contintcloud - - -] Registry client request GET /images/54de67d5-621d-4666-9bb4-2b9c5fb62321 raised NotFound
132016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client Traceback (most recent call last):
142016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/registry/client/v1/client.py", line 117, in do_request
152016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client **kwargs)
162016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 71, in wrapped
172016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client return func(self, *args, **kwargs)
182016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 376, in do_request
192016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client headers=copy.deepcopy(headers))
202016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 88, in wrapped
212016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client return func(self, method, url, body, headers)
222016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 523, in _do_request
232016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client raise exception.NotFound(res.read())
242016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client NotFound: 404 Not Found
252016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client
262016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client The resource could not be found.
272016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client
282016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client
292016-02-22 20:57:04.207 15962 TRACE glance.registry.client.v1.client
302016-02-22 20:57:04.209 15962 INFO eventlet.wsgi.server [req-e8604ea7-e15d-4bdd-be3c-4744751c3867 nodepoolmanager contintcloud - - -] 10.64.20.25 - - [22/Feb/2016 20:57:04] "HEAD /v1/images/54de67d5-621d-4666-9bb4-2b9c5fb62321 HTTP/1.1" 404 193 0.121999
312016-02-22 20:57:09.112 15962 INFO eventlet.wsgi.server [req-93c7cba3-15b5-4f83-98cd-55665a39e997 nodepoolmanager contintcloud - - -] 10.64.20.24 - - [22/Feb/2016 20:57:09] "HEAD /v1/images/97ccc6a0-55f3-4731-a2ad-c3ac4a5a358f HTTP/1.1" 200 1272 0.127819
322016-02-22 20:57:14.373 15962 ERROR glance.registry.client.v1.client [req-39755c71-91e9-412a-9de9-463e02d46ac9 nodepoolmanager contintcloud - - -] Registry client request GET /images/54de67d5-621d-4666-9bb4-2b9c5fb62321 raised NotFound
332016-02-22 20:57:14.373 15962 TRACE glance.registry.client.v1.client Traceback (most recent call last):
342016-02-22 20:57:14.373 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/registry/client/v1/client.py", line 117, in do_request
352016-02-22 20:57:14.373 15962 TRACE glance.registry.client.v1.client **kwargs)
362016-02-22 20:57:14.373 15962 TRACE glance.registry.client.v1.client File "/usr/lib/python2.7/dist-packages/glance/common/client.py", line 71, in wrapped

Snippet:

2016-02-22 20:57:14.373 15962 ERROR glance.registry.client.v1.client
[req-39755c71-91e9-412a-9de9-463e02d46ac9 nodepoolmanager contintcloud - - -]
Registry client request GET /images/54de67d5-621d-4666-9bb4-2b9c5fb62321 raised NotFound

Deleting the instance works fine.

Related Objects

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript

Using the poor man debugger on labnodepool1001.eqiad.wmnet as nodepool user: strace -f -e recvfrom,sendto -s 1024 nodepool image-update wmflabs-eqiad ci-jessie-wikimedia

It attempts to create an image out of the server 326c80cf-0208-40ae-b3cf-0cb878478921:

2016-02-22 21:40:38,808 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
POST /v2/contintcloud/servers/326c80cf-0208-40ae-b3cf-0cb878478921/action HTTP/1.1
Host: labnet1002.eqiad.wmnet:8774
X-Auth-Project-Id: contintcloud
Accept-Encoding: gzip, deflate
Content-Length: 105
Accept: application/json
User-Agent: python-novaclient
Connection: keep-alive
X-Auth-Token: xxxxxxxxxx
Content-Type: application/json

{"createImage": {
 "name": "ci-jessie-wikimedia-1456177078",
 "metadata": {"properties": {"show": "true"}}
}}

HTTP/1.1 202 Accepted
Location: http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/9bc7a824-2f28-44e8-92ee-71ceefe4ad72

At this point Nodepool starts looping against labnodepool1002.eqiad.wmnet to get information about the snapshot creation

GET /v2/contintcloud/images/9bc7a824-2f28-44e8-92ee-71ceefe4ad72
HTTP/1.1 200 OK
{"image": {    "status": "SAVING", ....

There is a few such requests and then:

2016-02-22 21:41:19,663 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
GET /v2/contintcloud/images/9bc7a824-2f28-44e8-92ee-71ceefe4ad72
Host: labnet1002.eqiad.wmnet:8774

HTTP/1.1 404 Not Found
Content-Length: 62
Content-Type: application/json; charset=UTF-8
Date: Mon, 22 Feb 2016 21:41:19 GMT
Connection: keep-alive

{"itemNotFound": {
    "message": "Image not found.",
    "code": 404}
}}

From there it loops querying the status of the image and receiving a 404 image not found until the snapshot creation is cancelled.

The headers have User-Agent: python-novaclient maybe a regression in there.

python-novaclient has not been updated on labnodepool. Maybe it should... No idea really

labnodepool1001:~$ apt-cache policy python-novaclient
python-novaclient:
  Installed: 2:2.23.0-1~bpo8+1
  Candidate: 2:2.23.0-1~bpo8+1
  Package pin: (not found)
  Version table:
     2:2.30.1-4~bpo8+1 1001
        100 http://mirrors.wikimedia.org/debian/ jessie-backports/main amd64 Packages
 *** 2:2.23.0-1~bpo8+1 1001
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/backports amd64 Packages
        100 /var/lib/dpkg/status
     2:2.18.1-1 1001
        500 http://mirrors.wikimedia.org/debian/ jessie/main amd64 Packages

Change 272621 had a related patch set uploaded (by Hashar):
Glance policy: grant manage_image_cache permission

https://gerrit.wikimedia.org/r/272621

Change 272621 merged by Andrew Bogott:
Glance policy: grant manage_image_cache permission

https://gerrit.wikimedia.org/r/272621

Not yet ..

GET /v2/contintcloud/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 1234
Date: Mon, 22 Feb 2016 22:46:23 GMT
Connection: keep-alive

{
   "image" : {
      "server" : {
         "id" : "a4b0a323-4d34-426a-a075-a1fda47b914a",
         "links" : [
            {"href" : "http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/servers/a4b0a323-4d34-426a-a075-a1fda47b914a",
             "rel" : "self"},
            {"href" : "http://labnet1002.eqiad.wmnet:8774/contintcloud/servers/a4b0a323-4d34-426a-a075-a1fda47b914a",
             "rel" : "bookmark"}
         ]
      }, 
      "links" : [
         {"rel" : "self",
          "href" : "http://labnet1002.eqiad.wmnet:8774/v2/contintcloud/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde"
         }, 
         {"href" : "http://labnet1002.eqiad.wmnet:8774/contintcloud/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde",
          "rel" : "bookmark"
         }, 
         {"rel" : "alternate",
          "href" : "http://labcontrol1001.wikimedia.org:9292/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde",
          "type" : "application/vnd.openstack.image"
         }  
      ], 
      "updated" : "2016-02-22T22:45:52Z",
      "OS-EXT-IMG-SIZE:size" : 0,
      "progress" : 25,
      "status" : "SAVING",
      "minDisk" : 40,
      "name" : "ci-jessie-wikimedia-1456180990",
      "metadata" : {
         "instance_uuid" : "a4b0a323-4d34-426a-a075-a1fda47b914a",
         "base_image_ref" : "fea778f0-96bd-43c7-9f5c-0497bcae9876",
         "image_type" : "snapshot",
         "properties" : "{u'show': u'true'}",
         "show" : "true",
         "user_id" : "nodepoolmanager"
      }, 
      "id" : "2e45de58-b560-4d51-a4b3-3a20b7f47dde",
      "created" : "2016-02-22T22:45:52Z",
      "minRam" : 0
   }  
}

And then the next GET /v2/contintcloud/images/2e45de58-b560-4d51-a4b3-3a20b7f47dde yields: {\"itemNotFound\": {\"message\": \"Image not found.\", \"code\": 404}}

Nodepool tracks the instance used for snapshotting with:

Hostnameci-jessie-wikimedia-1456180990
Image ID (ie snapshot)2e45de58-b560-4d51-a4b3-3a20b7f47dde
Server ID a4b0a323-4d34-426a-a075-a1fda47b914a

And from openstack server show ci-jessie-wikimedia-1456180990:

FieldValue
OS-DCF:diskConfigMANUAL
OS-EXT-AZ:availability_zonenova
OS-EXT-STS:power_state1
OS-EXT-STS:task_stateNone
OS-EXT-STS:vm_stateactive
OS-SRV-USG:launched_at2016-02-22T22:44:02.000000
OS-SRV-USG:terminated_atNone
accessIPv4
accessIPv6
addressespublic=10.68.23.207
config_drive
created2016-02-22T22:43:52Z
flavorm1.medium (3)
hostId82fcfaec5fab89596ebf1acf85d785fdc4e96740a641e475ed1beff3
ida4b0a323-4d34-426a-a075-a1fda47b914a
imageci-jessie-wikimedia (fea778f0-96bd-43c7-9f5c-0497bcae9876)
key_nameci-jessie-wikimedia-1456180990
nameci-jessie-wikimedia-1456180990
os-extended-volumes:volumes_attached[]
progress0
project_idcontintcloud
propertiesproject-id='contintcloud'
security_groups[{u'name': u'default'}]
statusACTIVE
updated2016-02-22T22:46:27Z
user_idnodepoolmanager

If I try to create a server image from a running server it works just fine, i.e.:

openstack server image create --name 'hashar-server-image-create' ci-jessie-wikimedia-1456180990

Guess we are missing another permission.. To be continued.

openstack server image create does not work. The command returns immediately showing the images information and a 'queued' status:

openstack image show hashar-srv-image-create-2

FieldValue
checksumNone
container_formatbare
created_at2016-02-22T23:05:59.000000
deletedFalse
deleted_atNone
disk_formatqcow2
idaac37c37-350f-475c-a08b-56b85171c251
is_publicFalse
min_disk40
min_ram0
namehashar-srv-image-create-2
ownercontintcloud
properties{u'instance_uuid': u'b8a817db-dbf4-45f4-83b4-ed53c71fd18e', u'image_location': u'snapshot', u'image_state': u'available', u'user_id': u'nodepoolmanager', u'show': u'true', u'image_type': u'snapshot', u'properties': u"{u'show': u'true'}", u'base_image_ref': u'fea778f0-96bd-43c7-9f5c-0497bcae9876', u'owner_id': u'contintcloud'}
protectedFalse
size0
statusqueued
updated_at2016-02-22T23:05:59.000000
virtual_sizeNone

Then it is discarded:

ERROR: openstack No image with a name or ID of 'hashar-srv-image-create-2' exists.

And no more show in the list of images.

I have tried again to snapshot a running instance. On labnodepool1001.eqiad.wmnet as nodepool user I have first instructed nodepoold to never delete an instance, then ran the command to generate a snapshot but this time with debug and lot of verbosity.

$ nodepool hold 38741
$ openstack -vvv --debug server image create --wait --name 'hashar-image-create-test' ci-jessie-wikimedia-38741

The debug log emits the headers of poll requests with HEAD /v1/images/d2eb94d5-875a-4ca7-aac6-3b3bde3a2bd2

status queued:

HTTP/1.1 200 OK
content-length: 0
x-image-meta-property-user_id: nodepoolmanager
x-image-meta-status: queued
x-image-meta-property-image_state: available
x-image-meta-owner: contintcloud
x-image-meta-name: hashar-image-create-test
x-image-meta-container_format: bare
x-image-meta-property-show: true
x-image-meta-property-base_image_ref: fea778f0-96bd-43c7-9f5c-0497bcae9876
x-image-meta-property-image_type: snapshot
x-image-meta-created_at: 2016-02-24T16:19:40.000000
x-image-meta-property-image_location: snapshot
x-image-meta-min_ram: 0
x-image-meta-updated_at: 2016-02-24T16:19:40.000000
x-image-meta-id: d2eb94d5-875a-4ca7-aac6-3b3bde3a2bd2
x-image-meta-property-instance_uuid: 7052842b-384a-4cae-8d52-0ecd74560978
x-image-meta-property-properties: {u'show': u'true'}
date: Wed, 24 Feb 2016 16:21:23 GMT
x-openstack-request-id: req-req-ecab3349-5588-4a54-8243-ce73c41faff7
x-image-meta-deleted: False
content-type: text/html; charset=UTF-8
x-image-meta-protected: False
x-image-meta-min_disk: 40
x-image-meta-size: 0
connection: keep-alive
x-image-meta-is_public: False
x-image-meta-property-owner_id: contintcloud
x-image-meta-disk_format: qcow2

Then suddenly it is flagged as deleted in x-image-meta-status:

HTTP/1.1 200 OK
content-length: 0
x-image-meta-id: d2eb94d5-875a-4ca7-aac6-3b3bde3a2bd2
x-image-meta-deleted: False
x-image-meta-container_format: bare
x-image-meta-status: deleted
x-image-meta-min_ram: 0
x-image-meta-protected: False
x-image-meta-min_disk: 40
x-image-meta-created_at: 2016-02-24T16:19:40.000000
x-image-meta-size: 0
connection: keep-alive
x-image-meta-is_public: False
date: Wed, 24 Feb 2016 16:21:28 GMT
x-image-meta-owner: contintcloud
x-image-meta-updated_at: 2016-02-24T16:21:28.000000
content-type: text/html; charset=UTF-8
x-openstack-request-id: req-req-c6bd7d8b-ab83-40d8-825f-73b20aea9fa6
x-image-meta-disk_format: qcow2
x-image-meta-name: hashar-image-create-test

Then on the next poll that yields a 404 error since there is no such image.

@Andrew I guess our only hope is digging in the glance server side logs. Maybe the policy lacks a permission or there is some other weird error on the server side.

Change 273003 had a related patch set uploaded (by Andrew Bogott):
Glance policy.json: Give glanceadmin a bunch more rights

https://gerrit.wikimedia.org/r/273003

Change 273003 merged by Andrew Bogott:
Glance policy.json: Give glanceadmin a bunch more rights

https://gerrit.wikimedia.org/r/273003

That patch is scattershot, but I changed the defaults so if there were secret policies preventing this they should now permit it.

Gave it a try and it works fine now:

statusactive

And $ openstack image list --private

IDName
937dbe47-7e45-41c1-80fa-4e11578b2b19hashar-image-create-test

So the root cause was due to the glance policy missing some mysterious right.

I am going to ask nodepool to refresh a snapshot and see what happens.

Mentioned in SAL [2016-02-24T17:12:45Z] <hashar> Refreshing nodepool snapshot. Been stall since Feb 15th T127755

$ nodepool image-update wmflabs-eqiad ci-jessie-wikimedia
2016-02-24 17:12:59,940 INFO nodepool.SnapshotImageUpdater: Creating image id: 500 with hostname ci-jessie-wikimedia-1456333979 for ci-jessie-wikimedia in wmflabs-eqiad
... provisioning stuff ...
2016-02-24 17:15:39,365 INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
2016-02-24 17:16:20,355 INFO nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1456333979 in wmflabs-eqiad is ready

Which is well. A SUCCESS.

Thank you very much @Andrew :-}