I am pretty sure that if a VM fails its initial puppet run, it never gets around to enabling the root shell. We need a way to shell into VMs that have never run puppet properly in order to, e.g., create an initial puppetmaster.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
cloud image firstboot: don't --waitforcert on first puppet run | operations/puppet | production | +1 -1
Status | Subtype | Assigned | Task
---|---|---|---
Restricted Task | | |
Resolved | | None | T207536 Move various support services for Cloud VPS currently in prod into their own instances
Resolved | | Krenair | T171188 Move the main WMCS puppetmaster into the Labs realm
Resolved | | Andrew | T223920 Ensure/confirm a way to shell into unpuppetized VMs
Event Timeline
(It may be worth noting that while this is not directly necessary for the migration in the parent ticket, it is important to maintain the ability to bootstrap a realm that has no puppetmasters, either for a brand new realm or in the event of a disaster wiping out all the existing puppetmasters. Previously this was done by just making a new production host that the realm is allowed access to. With the move away from the model of puppetmasters for other realms sitting in production, this becomes important.)
The good news is that once the firstboot script exits we should be able to get a local console. The bad news is that if the initial cert sign doesn't work the firstboot script may NEVER exit:
# puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff --waitforcert=1 --certname=consoletest-01.testlabs.eqiad.wmflabs --server=thisisnotarealpuppetmaster
Info: Creating a new SSL key for consoletest-01.testlabs.eqiad.wmflabs
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
Error: Could not request certificate: Failed to open TCP connection to thisisnotarealpuppetmaster:8140 (getaddrinfo: Name or service not known)
<etc. literally forever>
If I remove the --waitforcert entirely then it exits. So I could replace --waitforcert with some kind of explicit loop wrapping the puppet agent call.
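For illustration, an explicit loop of that kind might look roughly like the sketch below. This is not the merged patch. The helper name `retry_bounded`, the attempt count and the delay are all hypothetical. It also assumes that `puppet agent --onetime` without `--waitforcert` exits non-zero when it cannot obtain a certificate, which I have not verified beyond seeing that it exits.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or a fixed
# number of attempts is used up, so firstboot can never hang forever.
#
# usage: retry_bounded MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_bounded() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the wrapped command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        # Only sleep if another attempt is still to come.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    # Give up so the caller (e.g. the firstboot script) can carry on.
    return 1
}

# Example call (commented out; the certname and server values are placeholders):
# retry_bounded 10 30 puppet agent --onetime --verbose --no-daemonize \
#     --no-splay --show_diff --certname="$CERTNAME" --server="$PUPPETMASTER"
```

The point of bounding the loop is that the script always reaches its end, with or without a signed cert, so a local console becomes available either way.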
Change 512441 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud image firstboot: don't --waitforcert on first puppet run
Change 512441 merged by Andrew Bogott:
[operations/puppet@production] cloud image firstboot: don't --waitforcert on first puppet run
With the attached patch in place, a new VM with no valid puppetmaster will flounder for a bit but then boot up such that a local virsh console can be attached. That's enough to allow us to bootstrap an initial puppetmaster.