
Ensure/confirm a way to shell into unpuppetized VMs
Closed, Resolved, Public

Description

I am pretty sure that if a VM fails its initial puppet run, it never gets around to enabling the root shell. We need a way to shell into VMs that have never run puppet properly, for example in order to create an initial puppetmaster.

Event Timeline

(It may be worth noting that although this is not directly necessary for the migration in the parent ticket, it is important to keep the ability to bootstrap a realm that has no puppetmasters, either for a brand-new realm or in the event of a disaster wiping out all the existing puppetmasters. Previously this was done by simply creating a new production host that the realm is allowed to access. As we move away from the model of puppetmasters for other realms sitting in production, this ability becomes important.)

The good news is that once the firstboot script exits, we should be able to get a local console. The bad news is that if the initial certificate signing doesn't work, the firstboot script may NEVER exit:

# puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=1 --certname=consoletest-01.testlabs.eqiad.wmflabs --server=thisisnotarealpuppetmaster
Info: Creating a new SSL key for consoletest-01.testlabs.eqiad.wmflabs
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
<etc. literally forever>

If I remove --waitforcert entirely, the command exits. So I could replace --waitforcert with an explicit, bounded loop wrapping the puppet agent call.
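A minimal sketch of that idea, assuming a bash firstboot script: the first puppet run is retried a bounded number of times without --waitforcert, and boot then continues regardless of the outcome. MAX_ATTEMPTS, RETRY_DELAY, CERTNAME, PUPPETMASTER and first_puppet_run are hypothetical names chosen for illustration; they are not taken from the real firstboot script or from the change below.

```shell
#!/bin/bash
# Hypothetical sketch: bounded retries for the first puppet run instead of
# --waitforcert, which can block forever when no puppetmaster answers.
# Defaults are assumptions; tune them for the real firstboot environment.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"
RETRY_DELAY="${RETRY_DELAY:-30}"   # seconds between attempts
# CERTNAME and PUPPETMASTER (hypothetical) must be set by the caller.

first_puppet_run() {
    local attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        # No --waitforcert: as observed above, the command then exits
        # instead of polling forever. Non-zero status = failed attempt.
        if puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff \
            --certname="$CERTNAME" --server="$PUPPETMASTER"; then
            return 0
        fi
        echo "initial puppet run failed (attempt $attempt/$MAX_ATTEMPTS)" >&2
        # Skip the pause after the final attempt.
        if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
            sleep "$RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up on initial puppet run; continuing boot" >&2
    return 1
}
```

A firstboot script could call it as `first_puppet_run || true`, so that a failed first run does not abort the rest of the boot and a console stays reachable. Whether to retry at all is a judgment call; the merged change simply drops --waitforcert.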

Change 512441 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud image firstboot: don't --waitforcert on first puppet run

https://gerrit.wikimedia.org/r/512441

Change 512441 merged by Andrew Bogott:
[operations/puppet@production] cloud image firstboot: don't --waitforcert on first puppet run

https://gerrit.wikimedia.org/r/512441

With the attached patch in place, a new VM with no valid puppetmaster will flounder for a bit but then boot up such that a local virsh console can be attached. That's enough to allow us to bootstrap an initial puppetmaster.
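Attaching that console happens on the hypervisor that hosts the VM. A minimal sketch, assuming shell access to that host and that the VM's libvirt domain name is already known; attach_console and the example domain name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, run on the hypervisor that hosts the VM.
attach_console() {
    local domain="$1"
    if [ -z "$domain" ]; then
        echo "usage: attach_console <libvirt-domain>" >&2
        return 2
    fi
    # Fail early if libvirt does not know this domain.
    if ! virsh domstate "$domain" >/dev/null 2>&1; then
        echo "libvirt domain '$domain' not found" >&2
        return 1
    fi
    # Attach to the VM's console; Ctrl+] detaches by default.
    virsh console "$domain"
}
```

For example, `attach_console i-0000abcd` (placeholder name) would connect to that domain's console, from which the bootstrap work (such as setting up an initial puppetmaster) can proceed.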