
cloud instance rescue tools
Closed, Resolved · Public

Description

To address my anxiety about T207536, I'm thinking about ways we can step up our emergency options when a VM becomes unreachable (e.g. if puppet and cumin both misbehave at once and every ssh key is scrambled, etc. etc.)

I have one small idea and one big idea.

Small idea: Install guestfish and libguestfs on all cloudvirts. That will make reading and modifying the file system of busted VMs much easier. My only reservation about this is that it's a long dependency chain and when I installed it just now on cloudvirt1015 it prompted me about restarting the system disk array which seems... weird? I told it not to and all seems well but it put me on edge.

The big (but obvious) idea is: Have puppet install a single, shared root password on every VM, and store that password in pwstore. Then figure out how to launch a console from the appropriate cloudvirt. This is a much-scaled-down version of my previous support-remote-web-shells attempt; the advantage of scaling it down is it should be easier to understand the security implications. And, we use a global root password for prod already so how could this be worse?
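A minimal sketch of how the shared-password half of this could work, assuming OpenSSL 1.1.1 or later for the SHA-512 crypt hash (the pwstore entry and the Puppet wiring are not shown and are assumptions):

```shell
#!/bin/bash
# Sketch: generate a root password and a SHA-512 crypt hash suitable for
# /etc/shadow. The cleartext would go into pwstore; Puppet would deploy
# only the hash to every VM (e.g. via a user resource's password attribute).
set -euo pipefail

# Random cleartext password (this is what would be stored in pwstore).
cleartext=$(openssl rand -base64 18)

# SHA-512 crypt hash (this is what Puppet would manage on the VMs).
hash=$(openssl passwd -6 "$cleartext")

echo "cleartext: $cleartext"
echo "hash: $hash"
```

The point of hashing up front is that neither Puppet nor the VMs ever need to see the cleartext; only pwstore holds it.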

Event Timeline

@bd808 suggests that we use a combined approach -- write a guestfish script to inject a root password as needed but not have the password just sitting around in the meantime.
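For the inject-on-demand approach, virt-customize (from the libguestfs tool suite) can set a root password on an offline disk image directly, so no password needs to exist on the VM in the meantime. A sketch, with the disk path hypothetical and the command printed rather than executed (the VM must be stopped before touching its disk):

```shell
#!/bin/bash
# Sketch: inject a one-time root password into a stopped VM's disk with
# virt-customize. DISK is a hypothetical path; adjust per deployment.
set -euo pipefail

DISK=${1:-/var/lib/nova/instances/example/disk}

# One-time password, valid only for this rescue.
password=$(openssl rand -base64 12)

# Printed rather than executed, since this sketch has no real disk to act on.
echo "would run: virt-customize -a $DISK --root-password password:$password"
echo "rescue password: $password"
```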

See T130806#4897729 for a recent dive I did into part of this. The TL;DR is that we need to start a getty on ttyS1 in both our base images and via Puppet to make virsh console work.
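On a systemd-based VM (Stretch and later), starting that getty amounts to enabling the serial-getty template unit for ttyS1; a sketch, printed rather than executed here (the autologin override is an assumption about how a passwordless root console could be wired up):

```shell
#!/bin/bash
# Sketch: start a getty on the second serial port so that
# `virsh console <domain>` on the cloudvirt gets a login prompt.
# Commands are printed, not run, since this sketch has no VM to act on.
set -euo pipefail

unit="serial-getty@ttyS1.service"
echo "would run (inside the VM): systemctl enable --now $unit"

# Assumed approach for a passwordless *root* console: an agetty autologin
# override on the unit's ExecStart.
echo "would add override: ExecStart=-/sbin/agetty --autologin root ttyS1 linux"
```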

we use a global root password for prod already so how could this be worse?

Famous last words. :)

Per discussion on IRC, production hosts are trusted to some extent. If we had a single global root password in labs, even if the VMs only ever stored a hash of it, someone could make a software change within their instance to record all root password attempts (even via the console), then simply break the instance and ask for rescue to silently acquire the password. From there they could escalate privileges on any other labs host they have access to (e.g. from ordinary tools user to root), or potentially log straight into any host that allows root password SSH (so probably not hosts actively running our puppet).

At least per-instance or maybe per-rescue-attempt root passwords sound like a good idea.

In a previous job, I used the QEMU guest agent to enable more advanced VM management.

https://github.com/idi-ops/packer-kvm-centos/blob/b6266ff013cd072a3734eaf1c9767e3d6976abf4/centos7-kickstart.cfg#L72
https://github.com/idi-ops/packer-kvm-centos/blob/master/ansible/roles/common/files/grub#L5-L8
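With the agent installed in the guest and a virtio-serial channel in the domain XML, the hypervisor can talk to the guest directly; libvirt even exposes a password-reset command (`virsh set-user-password`) on top of the agent's guest-set-user-password. A sketch with a hypothetical domain name, commands printed rather than run:

```shell
#!/bin/bash
# Sketch: use the QEMU guest agent from the cloudvirt to check guest
# liveness and reset a password. DOMAIN is hypothetical; printed, not run.
set -euo pipefail

DOMAIN=i-00001234

ping_cmd="virsh qemu-agent-command $DOMAIN '{\"execute\":\"guest-ping\"}'"
passwd_cmd="virsh set-user-password $DOMAIN root SomeOneTimePassword"

echo "would run: $ping_cmd"
echo "would run: $passwd_cmd"
```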

We can always use guestmount to access the disk directly and change password, etc: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images#How_To_Inspect_Disk_Contents

# guestmount -a t3-disk.qcow2 -m /dev/sda3 -o allow_other --rw /mnt
# chroot /mnt
vm# passwd root
vm# exit
# guestunmount /mnt

I think there are two things intermixed here: access to instance storage (VM offline) and access to the instance shell/console (VM online). The two overlap a bit, but they are separate problems.
The most basic is access to instance storage, I think, but access to the instance console is also really important. Also, I believe offering console access via Horizon would be a really neat feature :-)

A couple of questions:

  • How do other clouds implement these?
  • Does OpenStack upstream recommend any specific implementation for these?

+1, we can try to script this in python in wmcs-rescue as @bd808 suggested in IRC.
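A shell sketch of what such a rescue helper might look like (the script would eventually be Python per the above; the disk path and partition here are assumptions, and the guestmount steps are printed instead of executed):

```shell
#!/bin/bash
# Sketch of a wmcs-rescue-style helper: set a one-time root password on a
# stopped VM's disk via guestmount. Paths and partition are assumptions.
set -euo pipefail

DISK=${1:-/var/lib/nova/instances/example/disk}
PART=${2:-/dev/sda3}
MNT=$(mktemp -d)

password=$(openssl rand -base64 12)

# Printed rather than executed: no real disk in this sketch.
echo "would run: guestmount -a $DISK -m $PART -o allow_other --rw $MNT"
echo "would run: chroot $MNT sh -c 'echo root:<password> | chpasswd'"
echo "would run: guestunmount $MNT"
echo "one-time root password: $password"

rmdir "$MNT"
```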

I'm working on a rescue script to get a root console. This turns out to be annoying because nova is overriding any local VM shutdowns/restarts that I do on the cloudvirt host. We could avoid that by setting handle_virt_lifecycle_events=False but I'm not clear on what the side-effects of that will be.
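For reference, that option lives in nova.conf; as far as I know it sits in the `[workarounds]` group (hedged, check the nova configuration reference for the deployed release):

```ini
# /etc/nova/nova.conf on the cloudvirt (assumed location of the option)
[workarounds]
# Stop nova from reacting to libvirt lifecycle events, so a manual
# shutdown/restart done on the hypervisor is not immediately reverted.
handle_virt_lifecycle_events = False
```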

Change 489001 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openstack: refactor 'envscript' bits into their own profile

https://gerrit.wikimedia.org/r/489001

Change 489001 merged by Andrew Bogott:
[operations/puppet@production] openstack: refactor 'envscript' bits into their own profile

https://gerrit.wikimedia.org/r/489001

Change 489005 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openstack: include 'envscripts' on compute nodes

https://gerrit.wikimedia.org/r/489005

Change 489005 merged by Andrew Bogott:
[operations/puppet@production] openstack: include 'envscripts' on compute nodes

https://gerrit.wikimedia.org/r/489005

Change 489230 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: add wmcs-rescue-console.sh to compute hosts

https://gerrit.wikimedia.org/r/489230

Change 489299 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Cloud vms: enable a default tty

https://gerrit.wikimedia.org/r/489299

Change 489947 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] bootstrap-vz: set up a root terminal on S1

https://gerrit.wikimedia.org/r/489947

Change 489299 merged by Andrew Bogott:
[operations/puppet@production] Cloud vms: enable a default tty

https://gerrit.wikimedia.org/r/489299

Change 489947 merged by Andrew Bogott:
[operations/puppet@production] bootstrap-vz: set up a root terminal on ttyS1

https://gerrit.wikimedia.org/r/489947

Change 490775 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] bootstrap-vz: tidy up root terminal settings

https://gerrit.wikimedia.org/r/490775

Change 490775 merged by Andrew Bogott:
[operations/puppet@production] bootstrap-vz: tidy up root terminal settings

https://gerrit.wikimedia.org/r/490775

New Debian VMs started after today will all have a root console running on Serial1.
Existing Stretch VMs should also have a root console running.
Any Jessie VM rebooted since 2019-02-08 will have a root console running.
Trusty VMs don't have a console and I'm going to leave them that way.

Note that going forward the console is a property of the base image but not enforced by puppet. That's because of an ordering issue: on new VMs, puppet runs before the standard TTY setup stage and doing it prematurely prevents the puppet run from ever finishing. I don't think it's worth trying to work around this since I don't know of a good reason why the console would go away once it's set up in the initial VM.

Change 489230 abandoned by Andrew Bogott:
nova: add wmcs-rescue-console.sh to compute hosts

Reason:
This is unneeded, the issue was solved elsewhere. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Root_console_access

https://gerrit.wikimedia.org/r/489230