WMCS instance initial puppet run hasn't happened so can't sudo to trigger the initial puppet run: integration-agent-docker-1021
Closed, ResolvedPublic

Description

I was following the runbook to create a new CI agent instance (for T252071). Having created the instance, the first step is to ssh in and trigger the initial puppet run via sudo, but because the initial run hasn't happened yet, the instance doesn't know that I'm allowed to do that…

Either I'm doing something wrong, or the runbook doesn't work any more. Is there a way it can be fixed?

Event Timeline

The puppet agent probably failed because neither https://gerrit.wikimedia.org/r/c/operations/puppet/+/670524 nor https://gerrit.wikimedia.org/r/c/operations/puppet/+/717732/ has been merged. I can look at the instance and verify that.

Actually, it seems to have stopped on the cert.

[   41.287925] cloud-init[1597]: Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key. Did you forget to run as root?
[   41.288376] cloud-init[1597]: Certificate fingerprint: A1:54:D2:A5:76:27:38:32:CA:E6:F0blahblahblahE8:81:31:7C:E0:79:27:27:95
[   41.290096] cloud-init[1597]: To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certificate.
[   41.291411] cloud-init[1597]: On the master:
[   41.292475] cloud-init[1597]:   puppet cert clean integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud
[   41.294122] cloud-init[1597]: On the agent:
[   41.295263] cloud-init[1597]:   1a. On most platforms: find /var/lib/puppet/ssl -name integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud.pem -delete
[   41.296286] cloud-init[1597]:   1b. On Windows: del "\var\lib\puppet\ssl\certs\integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud.pem" /f
[   41.297384] cloud-init[1597]:   2. puppet agent -t
[   41.299303] cloud-init[1597]: Exiting; failed to retrieve certificate and waitforcert is disabled
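
When triaging a console log like the one above, it can help to pull out the certname that Puppet suggests cleaning. The following is a minimal sketch, not part of any WMCS tooling: the function name `find_cert_mismatch` and both regular expressions are hypothetical, and the log prefix format is assumed to match the lines pasted in this task.

```python
import re

# Assumed prefix format, taken from the log lines pasted above:
# "[   41.287925] cloud-init[1597]: "
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+cloud-init\[\d+\]:\s?")
# ANSI colour codes such as ESC[1;31m; the ESC byte is optional because
# it is often lost when a log is copied and pasted.
ANSI = re.compile(r"\x1b?\[[0-9;]*m")


def find_cert_mismatch(console_log: str):
    """Return the certname from the 'puppet cert clean' hint.

    Returns None unless the log contains the certificate/private-key
    mismatch error followed by a 'puppet cert clean <certname>' line.
    """
    saw_error = False
    for raw in console_log.splitlines():
        line = ANSI.sub("", PREFIX.sub("", raw)).strip()
        if "does not match the agent's private key" in line:
            saw_error = True
        match = re.match(r"puppet cert clean (\S+)$", line)
        if saw_error and match:
            return match.group(1)
    return None
```

Passing the console output of the instance as a single string would return the fully qualified certname, which is the value needed on both the master and the agent side of the cleanup.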

Is this re-using the name of an already-deleted VM?

No. I killed 1001 to make space for this one, but it's a new name. For whatever reason, our runbook says that the first thing to do is kill and re-load the puppet cert, so clearly something's been wrong in this area for a while?

Killing and reloading the puppet cert is "normal", but the first run shouldn't need it unless something has changed either with cloud-init (entirely possible) or with the puppet master for the rest of the cloud (less likely). @Andrew, I recall that something in cloud-init that wasn't working before has recently started working in recent images? I don't know if that's it. I'll take a look at the puppetmaster for cloudinfra.
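
For reference, the clean-and-regenerate sequence discussed in this thread could be sketched as below. This is only an illustration: the runbook's exact steps are not quoted in this task, so the commands are copied from the cloud-init error output above, and the helper name `cert_reset_commands` and its `ssl_dir` parameter are hypothetical. The sketch builds the commands without running them, since they need root on both hosts, which is exactly what is unavailable before the first puppet run.

```python
import shlex


def cert_reset_commands(certname: str, ssl_dir: str = "/var/lib/puppet/ssl"):
    """Build (but do not run) the cleanup commands from the error message.

    The commands mirror the hints printed by the puppet agent in the log;
    they are grouped by the host they would have to run on, as root.
    """
    if not certname or "/" in certname or any(c.isspace() for c in certname):
        raise ValueError(f"suspicious certname: {certname!r}")
    return {
        "master": [["puppet", "cert", "clean", certname]],
        "agent": [
            ["find", ssl_dir, "-name", f"{certname}.pem", "-delete"],
            ["puppet", "agent", "-t"],
        ],
    }


if __name__ == "__main__":
    name = "integration-agent-docker-1021.integration.eqiad1.wikimedia.cloud"
    for host, argvs in cert_reset_commands(name).items():
        for argv in argvs:
            # shlex.join quotes each argument for safe copy-and-paste.
            print(f"{host}: {shlex.join(argv)}")
```

Keeping each command as an argument list rather than a single string avoids shell-quoting mistakes if the commands are later passed to something like `subprocess.run`.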