
Rebuild tools-prometheus-04
Closed, Resolved · Public

Description

tools-prometheus-04 did not survive the migration to Ceph as planned. I noticed it was listed as down by alertmanager and checked it. This is the console output:

Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... Waiting 5s for disks to show up (T131961)
Usage: sleep seconds[.fraction]
done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
Begin: Will now check root file system ... fsck from util-linux 2.33.1
[/sbin/fsck.ext4 (1) -- /dev/vda2] fsck.ext4 -a -C0 /dev/vda2 
/dev/vda2 contains a file system with errors, check forced.
/dev/vda2: |                                                        |  0.5%   [    2.618215] usb 1-1: new full-speed USB device number 2 using uhci_hcd
/dev/vda2: |====                                                    /  7.8%   /dev/vda2: |=========                                               - 15.2%   [    2.815939] usb 1-1: New USB device found, idVendor=0627, idProduct=0001, bcdDevice= 0.00
[    2.822678] usb 1-1: New USB device strings: Mfr=1, Product=3, SerialNumber=5
[    2.828158] usb 1-1: Product: QEMU USB Tablet
[    2.830770] usb 1-1: Manufacturer: QEMU
[    2.832199] usb 1-1: SerialNumber: 42
[    2.847395] hidraw: raw HID events driver (C) Jiri Kosina
[    2.857145] usbcore: registered new interface driver usbhid
[    2.860841] usbhid: USB HID core driver
/dev/vda2:                                                                                Inode 786453, end of extent exceeds allowed value
	(logical block 72, physical block 34128, len 33)


/dev/vda2: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
	(i.e., without -a or -p options)
[    2.876239] input: QEMU QEMU USB Tablet as /devices/pci0000:00/0000:00:01.2/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input/input4
fsck exited with status code 4
done.
Failure: File system check of the root filesystem failed
The root filesystem on /dev/vda2 requires a manual fsck
(initramfs) [    2.890976] hid-generic 0003:0627:0001.0001: input,hidraw0: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:01.2-1/input0

Attempts to repair the disk did not work. This needs a rebuild.
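
For reference, the manual repair attempt from the (initramfs) prompt was along these lines (a sketch, not an exact capture of the session):

(initramfs) fsck.ext4 -f -y /dev/vda2
(initramfs) exit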

Event Timeline

Bstorm created this task.

I went hunting for other VMs that were damaged in this same way and didn't find any. Nevertheless I'll take extra care with tools-prometheus-03 tomorrow and confirm it survives the move before deleting the static version.

First attempt at connecting to the disk with guestfish didn't go as planned:

root@cloudvirt1014:~# guestfish --rw -i -a rbd:///eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536
libguestfs: trace: set_verbose true
libguestfs: trace: set_verbose = 0
libguestfs: create: flags = 0, handle = 0x565319f72cf0, program = guestfish
libguestfs: trace: set_pgroup true
libguestfs: trace: set_pgroup = 0
libguestfs: trace: add_drive "eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536" "protocol:rbd"
libguestfs: trace: add_drive = 0
libguestfs: trace: is_config
libguestfs: trace: is_config = 1
libguestfs: trace: launch
libguestfs: trace: max_disks
libguestfs: trace: max_disks = 255
libguestfs: trace: get_tmpdir
libguestfs: trace: get_tmpdir = "/tmp"
libguestfs: trace: version
libguestfs: trace: version = <struct guestfs_version = major: 1, minor: 40, release: 2, extra: , >
libguestfs: trace: get_backend
libguestfs: trace: get_backend = "direct"
libguestfs: launch: program=guestfish
libguestfs: launch: version=1.40.2
libguestfs: launch: backend registered: unix
libguestfs: launch: backend registered: uml
libguestfs: launch: backend registered: libvirt
libguestfs: launch: backend registered: direct
libguestfs: launch: backend=direct
libguestfs: launch: tmpdir=/tmp/libguestfsAQ626F
libguestfs: launch: umask=0022
libguestfs: launch: euid=0
libguestfs: trace: get_cachedir
libguestfs: trace: get_cachedir = "/var/tmp"
libguestfs: begin building supermin appliance
libguestfs: run supermin
libguestfs: command: run: /usr/bin/supermin
libguestfs: command: run: \ --build
libguestfs: command: run: \ --verbose
libguestfs: command: run: \ --if-newer
libguestfs: command: run: \ --lock /var/tmp/.guestfs-0/lock
libguestfs: command: run: \ --copy-kernel
libguestfs: command: run: \ -f ext2
libguestfs: command: run: \ --host-cpu x86_64
libguestfs: command: run: \ /usr/lib/x86_64-linux-gnu/guestfs/supermin.d
libguestfs: command: run: \ -o /var/tmp/.guestfs-0/appliance.d
supermin: version: 5.1.20
supermin: package handler: debian/dpkg
supermin: acquiring lock on /var/tmp/.guestfs-0/lock
supermin: if-newer: output does not need rebuilding
libguestfs: finished building supermin appliance
libguestfs: begin testing qemu features
libguestfs: trace: get_cachedir
libguestfs: trace: get_cachedir = "/var/tmp"
libguestfs: checking for previously cached test results of /usr/bin/qemu-system-x86_64, in /var/tmp/.guestfs-0
libguestfs: loading previously cached test results
libguestfs: qemu version: 3.1
libguestfs: qemu mandatory locking: yes
libguestfs: qemu KVM: enabled
libguestfs: trace: get_backend_setting "force_tcg"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_sockdir
libguestfs: trace: get_sockdir = "/tmp"
libguestfs: finished testing qemu features
libguestfs: trace: get_backend_setting "gdb"
libguestfs: trace: get_backend_setting = NULL (error)
/usr/bin/qemu-system-x86_64 \
    -global virtio-blk-pci.scsi=off \
    -no-user-config \
    -enable-fips \
    -nodefaults \
    -display none \
    -machine accel=kvm:tcg \
    -cpu host \
    -m 768 \
    -no-reboot \
    -rtc driftfix=slew \
    -no-hpet \
    -global kvm-pit.lost_tick_policy=discard \
    -kernel /var/tmp/.guestfs-0/appliance.d/kernel \
    -initrd /var/tmp/.guestfs-0/appliance.d/initrd \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci,rng=rng0 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=rbd:eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536:auth_supported=none,cache=writeback,id=hd0,if=none \
    -device scsi-hd,drive=hd0 \
    -drive file=/var/tmp/.guestfs-0/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw \
    -device scsi-hd,drive=appliance \
    -device virtio-serial-pci \
    -serial stdio \
    -device sga \
    -chardev socket,path=/tmp/libguestfsyR8kzY/guestfsd.sock,id=channel0 \
    -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \
    -append "panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=xterm-256color"
qemu-system-x86_64: -drive file=rbd:eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536:auth_supported=none,cache=writeback,id=hd0,if=none: error connecting: Operation not supported
libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages
libguestfs: child_cleanup: 0x565319f72cf0: child process died
libguestfs: sending SIGTERM to process 41895
libguestfs: error: /usr/bin/qemu-system-x86_64 exited with error status 1, see debug messages above
libguestfs: error: guestfs_launch failed, see earlier error messages
libguestfs: trace: launch = -1 (error)
libguestfs: trace: close
libguestfs: closing guestfs handle 0x565319f72cf0 (state 0)
libguestfs: command: run: rm
libguestfs: command: run: \ -rf /tmp/libguestfsAQ626F
libguestfs: command: run: rm
libguestfs: command: run: \ -rf /tmp/libguestfsyR8kz

The clue I am seeing right away is "auth_supported=none", but maybe Google/SO have ideas.

Yeah, it looks like I need to specify auth, so it may be possible.
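
One way to pass auth explicitly is via guestfish's add-drive optional arguments rather than the bare rbd:/// URI; a sketch only (the client name matches the id= used below, the key is a placeholder):

root@cloudvirt1014:~# guestfish --rw
><fs> add-drive eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536 protocol:rbd username:eqiad1-compute secret:<cephx key>
><fs> run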

Adding auth, I error out with:

qemu-system-x86_64: -drive file=rbd:eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536:id=eqiad1-compute:auth_supported=cephx\;none,cache=writeback,id=hd0,if=none: error reading header from 2f970556-8714-48c9-bd90-b1736e4c6536: No such file or directory

So we are at least attempting auth now and probably just need to specify a bit more.
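
One thing worth double-checking at this point is that the image name in the URI matches what is actually in the pool: on some OpenStack/Ceph setups Nova names the compute RBD images with a _disk suffix (as in the rbd du output further down), so "No such file or directory" can simply mean a name mismatch. An illustrative check:

root@cloudvirt1014:~# rbd -p eqiad1-compute ls | grep 2f970556
root@cloudvirt1014:~# rbd info eqiad1-compute/2f970556-8714-48c9-bd90-b1736e4c6536_disk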

Running fsck via guestfish still returns exit code 4, so I think the patient is dead. I'm documenting how to do this in the Ceph world and changing the task to a rebuild.
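
For the record, the repair attempt inside guestfish was along these lines (sketch only; the device name inside the appliance is assumed, and the option names are per the guestfish docs rather than a capture of the actual session):

><fs> run
><fs> list-filesystems
><fs> e2fsck /dev/sda2 forceall:true correct:true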

Bstorm renamed this task from "tools-prometheus-04 did not survive migration to ceph" to "Rebuild tools-prometheus-04". Oct 2 2020, 5:05 PM
Bstorm updated the task description.

As an experiment, I tried taking a snapshot of the working prometheus node (-03) and building a new VM off that snap. It worked, but it hogs a ton of space:

root@cloudvirt1024:~# rbd --pool eqiad1-glance-images du 68e61634-d599-4c6b-94da-10e6c7d36573
NAME                                      PROVISIONED USED    
68e61634-d599-4c6b-94da-10e6c7d36573@snap     300 GiB 300 GiB 
68e61634-d599-4c6b-94da-10e6c7d36573          300 GiB     0 B 
<TOTAL>                                       300 GiB 300 GiB 
root@cloudvirt1024:~# rbd --pool eqiad1-compute du 3d69a4b3-bc7f-4570-bea0-6b87f8bf7732_disk
NAME                                      PROVISIONED USED    
3d69a4b3-bc7f-4570-bea0-6b87f8bf7732_disk     300 GiB 283 GiB

If the new VM keeps working after the backing image is deleted, then the space cost is temporary and this is acceptable; if not, it probably isn't.
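
The experiment itself is roughly the standard snapshot-and-boot flow (names, flavor, and network below are illustrative; the procedure I ended up documenting is linked further down):

openstack server image create --name tools-prometheus-03-snap tools-prometheus-03
openstack server create --image tools-prometheus-03-snap --flavor <flavor> --network <network> tools-prometheus-05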

I just deleted the image backing tools-prometheus-05. So far it seems fine (other than not being able to report an image name in nova) -- it also survived a soft reboot, a hard reboot, and live migration to a different host.
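
(Roughly the shape of those checks, with illustrative commands; the live migration was done with the usual admin tooling:)

openstack server reboot tools-prometheus-05
openstack server reboot --hard tools-prometheus-05
nova live-migration tools-prometheus-05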

If it's still happy after a few days then we're probably good!

I documented the cloning process here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images#Building_an_image_from_an_existing_VM

Now I'm going to delete -04 and close this ticket.