
wmcs-image-create fails because of changes (breakage?) in VM snapshots
Closed, Resolved (Public)

Description

Currently I can't build new glance images with wmcs-image-create because the stage where we snapshot the new VM and download it is failing:

INFO:wmcs-image-create:Running command:
    ('mount', '/dev/nbd0p1', PosixPath('/usr/local/sbin/wmcs-image-createap1di13y/mnt'))
    options: {}
mount: /usr/local/sbin/wmcs-image-createap1di13y/mnt: special device /dev/nbd0p1 does not exist.

This isn't specific to the new Bookworm build I'm trying to make, nor to the wmcs-image-create script: when I make a snapshot by hand and download it, it has the same issue.

Are snapshots totally broken, or are they just in a different format? fdisk and gdisk can't find a partition table.

root@cloudcontrol1005:/home/andrew# fdisk -l clisnapshot.img 
Disk clisnapshot.img: 3.97 GiB, 4264165376 bytes, 8328448 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@cloudcontrol1005:/home/andrew# gdisk -l clisnapshot.img 
GPT fdisk (gdisk) version 1.0.6

Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning! Read error 25! Misbehavior now likely!
Caution! After loading partitions, the CRC doesn't check out!
Caution! After loading partitions, the CRC doesn't check out!
Warning! Error 25 reading partition table for CRC check!
Warning! One or more CRCs don't match. You should repair the disk!
Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: ERROR

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
 1 - Use current GPT
 2 - Create blank GPT

Event Timeline

I downloaded a variety of images ('openstack image save') and it's only the recent snapshot of a VM that seems broken:

root@cloudcontrol1005:/srv/andrewsnaptests# file 160313be-b1b1-4b62-8906-bee26f25f6dc.img    # aka 'debian-9.13-stretch (deprecated 2021-03-11)'
160313be-b1b1-4b62-8906-bee26f25f6dc.img: DOS/MBR boot sector, extended partition table (last)

root@cloudcontrol1005:/srv/andrewsnaptests# file e69cb6f7-e5c7-41de-b08d-8e5739c20de3.img # aka 'debian-11.0-bullseye'
e69cb6f7-e5c7-41de-b08d-8e5739c20de3.img: DOS/MBR boot sector, extended partition table (last)

root@cloudcontrol1005:/srv/andrewsnaptests# file upstreambookworm.img  #aka 'upstream image from https://cloud.debian.org/images/cloud/bookworm/daily/20230606-1403/debian-12-genericcloud-amd64-daily-20230606-1403.tar.xz'
upstreambookworm.img: DOS/MBR boot sector, extended partition table (last)

root@cloudcontrol1005:/srv/andrewsnaptests# file ccacf30d-94c4-4e1d-ae18-95bd3d7a18e0.img  # aka 'snap for 09ae4404-44f8-4560-b74f-a3069201a398'
ccacf30d-94c4-4e1d-ae18-95bd3d7a18e0.img: data

So the breakage isn't a result of the downloading process.
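
For anyone repeating this comparison without `file`, the check can be approximated by looking at the first two sectors of the downloaded image. This is only a rough sketch, assuming 512-byte sectors (as fdisk reported above); the function name `describe_image` is made up for illustration, and it recognizes far fewer formats than `file` does.

```python
import sys

SECTOR = 512  # assumption: 512-byte logical sectors, as in the fdisk output


def describe_image(path):
    """Return a rough label for a disk image based on its first two sectors."""
    with open(path, "rb") as f:
        head = f.read(2 * SECTOR)

    # qcow2 files start with the magic bytes 'Q', 'F', 'I', 0xfb.
    if head[:4] == b"QFI\xfb":
        return "qcow2"
    # A GPT header lives in the second sector and starts with "EFI PART".
    if head[SECTOR:SECTOR + 8] == b"EFI PART":
        return "gpt"
    # An MBR (or protective MBR) ends the first sector with 0x55 0xAA.
    if head[510:512] == b"\x55\xaa":
        return "mbr"
    # Nothing recognizable: comparable to `file` reporting plain "data".
    return "unknown"


if __name__ == "__main__":
    for image_path in sys.argv[1:]:
        print(f"{image_path}: {describe_image(image_path)}")
```

Run against the four files above, the working images would be expected to come back as "mbr" and the broken snapshot as "unknown", but that has not been verified here.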

While creating the snapshot, I see these errors:

Jun 7, 2023 @ 15:55:56.914  glance-wsgi-api  cloudcontrol1006  ERR  glance_store._drivers.rbd
[None req-eefcd463-190b-4e04-8016-6bfed610714e novaadmin admin - - default default] Failed to store image e0bed001-9bac-4ad5-ae4a-08978775bf89 Store Exception RBD incomplete write (Wrote only 8388608 out of 8404337 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8404337 bytes)
Jun 7, 2023 @ 15:55:54.932  glance-wsgi-api  cloudcontrol1006  ERR  glance_store._drivers.rbd
[None req-eefcd463-190b-4e04-8016-6bfed610714e novaadmin admin - - default default] Failed to store image e0bed001-9bac-4ad5-ae4a-08978775bf89 Store Exception RBD incomplete write (Wrote only 8388608 out of 8404523 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8404523 bytes)
Jun 7, 2023 @ 15:55:53.422  glance-wsgi-api  cloudcontrol1007  ERR  glance_store._drivers.rbd
[None req-eefcd463-190b-4e04-8016-6bfed610714e novaadmin admin - - default default] Failed to store image e0bed001-9bac-4ad5-ae4a-08978775bf89 Store Exception RBD incomplete write (Wrote only 8388608 out of 8392903 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8392903 bytes)
Jun 7, 2023 @ 15:55:51.834  glance-wsgi-api  cloudcontrol1005  ERR  glance_store._drivers.rbd
[None req-eefcd463-190b-4e04-8016-6bfed610714e novaadmin admin - - default default] Failed to store image e0bed001-9bac-4ad5-ae4a-08978775bf89 Store Exception RBD incomplete write (Wrote only 8388608 out of 8404082 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 8388608 out of 8404082 bytes)

In this case, after those errors the image vanished entirely, resulting in a crash with

glanceclient.exc.HTTPNotFound: HTTP 404 Not Found: No image found with ID e0bed001-9bac-4ad5-ae4a-08978775bf89

I reduced the rbd chunk size in glance-api.conf, but that didn't resolve the issue. Now I see:

[None req-19ef0053-3e95-421e-bf17-e1f3d9f05db8 novaadmin admin - - default default] Failed to store image 5232da6d-7324-48a6-afc5-9f7d99081606 Store Exception RBD incomplete write (Wrote only 4194304 out of 4210595 bytes): rbd.IncompleteWriteError: RBD incomplete write (Wrote only 4194304 out of 4210595 bytes)
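
The byte counts in these errors follow a pattern: every failure before the chunk-size change wrote exactly 8388608 bytes (8 MiB), and the failure after it wrote exactly 4194304 bytes (4 MiB). That is consistent with each write being cut off at one chunk, although the logs alone don't prove the cause. A small sketch for pulling the numbers out of a log line (the helper name `summarize_incomplete_write` is invented for illustration):

```python
import re

# Matches the byte counts in the rbd.IncompleteWriteError message text.
PATTERN = re.compile(r"Wrote only (\d+) out of (\d+) bytes")

MIB = 1024 * 1024


def summarize_incomplete_write(message):
    """Extract written/requested byte counts from an incomplete-write log line."""
    match = PATTERN.search(message)
    if match is None:
        return None
    written, requested = int(match.group(1)), int(match.group(2))
    return {
        "written": written,
        "requested": requested,
        "short_by": requested - written,
        "written_mib": written / MIB,
    }


# Message fragments copied from the log lines above.
samples = [
    "RBD incomplete write (Wrote only 8388608 out of 8404337 bytes)",
    "RBD incomplete write (Wrote only 4194304 out of 4210595 bytes)",
]

for sample in samples:
    print(summarize_incomplete_write(sample))
# {'written': 8388608, 'requested': 8404337, 'short_by': 15729, 'written_mib': 8.0}
# {'written': 4194304, 'requested': 4210595, 'short_by': 16291, 'written_mib': 4.0}
```

In both cases the request overshoots the written amount by only a few kilobytes, so shrinking the chunk size moves the cutoff without removing it.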

Next step is to try hacking in the fix proposed on the bug.

The proposed fix works! I've submitted it upstream:

https://review.opendev.org/c/openstack/glance_store/+/885581

and will puppetize the local fix next.

Change 928545 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] glance: hack the rbd driver in glance_store to fix snapshotting

https://gerrit.wikimedia.org/r/928545

Change 928545 merged by Andrew Bogott:

[operations/puppet@production] glance: hack the rbd driver in glance_store to fix snapshotting

https://gerrit.wikimedia.org/r/928545

Change 928627 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-image-create: add some longer naps

https://gerrit.wikimedia.org/r/928627

Change 928627 merged by Andrew Bogott:

[operations/puppet@production] wmcs-image-create: add some longer naps

https://gerrit.wikimedia.org/r/928627