Investigate
Description
Event Timeline
The error is:
dcaro@cloudinfra-acme-chief-01:~$ sudo run-puppet-agent
Exiting; no certificate found and waitforcert is disabled
Looking
Tried refreshing the host certificate; that allows puppet to run, but now it fails with:
root@cloudinfra-acme-chief-01:~# puppet agent -tv
Info: Caching certificate for cloudinfra-acme-chief-01.novalocal
Info: Caching certificate_revocation_list for ca
Info: Caching certificate for cloudinfra-acme-chief-01.novalocal
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find cloudinfra-acme-chief-01.novalocal via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-acme-chief-01.novalocal' returned 255: Invalid hostname (cloudinfra-acme-chief-01.novalocal) Unknown TLD.
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node cloudinfra-acme-chief-01.novalocal: Failed to find cloudinfra-acme-chief-01.novalocal via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-acme-chief-01.novalocal' returned 255: Invalid hostname (cloudinfra-acme-chief-01.novalocal) Unknown TLD.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
I think the hostname is not correct, as it should end in a real TLD. Looking...
It was rebooted 2 days ago, which might be what triggered the error.
root@cloudinfra-acme-chief-01:~# uptime
 10:09:48 up 2 days, 13:04, 2 users, load average: 0.02, 0.02, 0.00
That reboot is probably related to the security reboots this weekend.
Why is it using .novalocal and not .cloudinfra.eqiad1.wikimedia.cloud?
Not sure. It seems to come from the cloud-config template (which uses fqdn, so I'm still looking for where that value comes from). I changed it manually and puppet is now running correctly, but I'm trying to find out whether the hostname will change back again.
From cloud-init data:
root@cloudinfra-acme-chief-01:~# grep novalocal /run/cloud-init/*
/run/cloud-init/instance-data.json: "hostname": "cloudinfra-acme-chief-01.novalocal",
...
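For a more structured view than grep, something like the following sketch could list where in the JSON the wrong name shows up. The helper name find_values and the sample document are made up for illustration; the real instance-data.json layout may differ.

```python
import json


def find_values(node, needle, path=""):
    """Yield (path, value) for every string value that contains needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_values(value, needle, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_values(value, needle, f"{path}/{index}")
    elif isinstance(node, str) and needle in node:
        yield path, node


# Made-up sample, NOT the real file contents. On the host this would be
# something like: json.load(open("/run/cloud-init/instance-data.json"))
sample = json.loads(
    '{"ds": {"meta_data": {"hostname": "vm-01.novalocal", "name": "vm-01"}},'
    ' "v1": {"local_hostname": "vm-01.novalocal"}}'
)

for path, value in find_values(sample, "novalocal"):
    print(path, "=", value)
```

This only reports where the string appears; it does not say which component wrote it there.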
looking...
In the /var/log/cloud-init-output.log file, there's the entry:
+ sed -i s/novalocal/cloudinfra.eqiad1.wikimedia.cloud/g /etc/hosts
So it seems that it was intended to change that in the hosts file directly :/, I wonder what happened
Mentioned in SAL (#wikimedia-cloud) [2021-03-01T10:37:14Z] <dcaro> rebooting cloudinfra-acme-chief-01 to ensure hostname stability (T276041)
That did not work: the hostname got replaced again (cloud-init is probably resetting it from the template + data). Changing the data itself and rebooting again.
Openstack-browser still somehow has the correct hostname: https://openstack-browser.toolforge.org/server/cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
I think the data is being stored under /etc/nova/vendor_data.json on the virt nodes, looking (it's put there by puppet, modules/openstack/manifests/nova/api/service.pp)
Those files have the correct fqdn though:
fqdn: {{ds.meta_data.name}}.{{ds.meta_data.project_id}}.eqiad1.wikimedia.cloud
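To double-check what that template should expand to, here is a rough stand-in for the substitution. cloud-init does the real rendering with its own template engine; the render helper below is just a hypothetical re-implementation of the dotted lookup, and the context values are assumed from the expected hostname, not read from the metadata service.

```python
import re

TEMPLATE = "{{ds.meta_data.name}}.{{ds.meta_data.project_id}}.eqiad1.wikimedia.cloud"


def render(template, context):
    """Replace each {{a.b.c}} placeholder with context['a']['b']['c']."""

    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Assumed values, inferred from the fqdn the VM is supposed to have.
context = {
    "ds": {
        "meta_data": {
            "name": "cloudinfra-acme-chief-01",
            "project_id": "cloudinfra",
        }
    }
}

print(render(TEMPLATE, context))
# prints: cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
```

With those inputs the template gives the right fqdn, which matches the observation that the files on the nova side look correct.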
Interesting... from the file /var/lib/cloud/instance/vendor-data.txt on the cloudinfra-acme-chief-01 machine:
# There's a timing issue that causes us to sometimes have 127.0.0.1 associated
# with $hostname.novalocal instead of the proper fqdn
sed -i "s/novalocal/${project}.${domain}/g" /etc/hosts
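For reference, the sed line amounts to a plain string replacement over the hosts file. A small sketch of the same thing, where fix_hosts is a hypothetical name, the hosts content is made up, and the split into project "cloudinfra" and domain "eqiad1.wikimedia.cloud" is assumed from the cloud-init-output.log entry:

```python
def fix_hosts(text, project, domain):
    """Same effect as: sed "s/novalocal/${project}.${domain}/g"."""
    return text.replace("novalocal", f"{project}.{domain}")


# Made-up /etc/hosts content showing the situation the comment describes.
hosts = (
    "127.0.0.1 localhost\n"
    "127.0.0.1 cloudinfra-acme-chief-01.novalocal cloudinfra-acme-chief-01\n"
)

print(fix_hosts(hosts, "cloudinfra", "eqiad1.wikimedia.cloud"))
```

Note that this only rewrites the hosts text; the hostname recorded in the cloud-init instance data is a separate value and is not touched by it.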
Fixed. In the end I had to change the local cloud-init data, which it seems was downloaded at install time.
Created a local copy of the data file: /var/lib/cloud/instances/4fc636c2-6af8-4ce7-a9a9-6c23a73cbd73/obj.pkl
Then in python:
root@cloudinfra-acme-chief-01:~# python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> res = pickle.load(open('obj.pkl', 'rb'))
# Found these by grepping in the binary file, and looking around until there were no matches in the binary file
>>> res.metadata['hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.metadata['local-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.ec2_metadata['local-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.ec2_metadata['public-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.ec2_metadata['hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> pickle.dump(res, open('other.pkl', 'wb'))
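If this has to be repeated on another VM, the key-by-key edits could probably be replaced by a small helper that rewrites every string ending in the wrong suffix. This is only a sketch: rewrite_hostnames is a hypothetical name, and the dict below is a made-up stand-in for res.metadata, not the real content.

```python
def rewrite_hostnames(mapping, old_suffix, new_suffix):
    """Rewrite, in place, string values ending in old_suffix.

    Nested dicts are walked too. Returns the list of keys that changed.
    """
    changed = []
    for key, value in list(mapping.items()):
        if isinstance(value, dict):
            changed += rewrite_hostnames(value, old_suffix, new_suffix)
        elif isinstance(value, str) and value.endswith(old_suffix):
            mapping[key] = value[: -len(old_suffix)] + new_suffix
            changed.append(key)
    return changed


# Made-up stand-in for res.metadata from the session above.
metadata = {
    "hostname": "cloudinfra-acme-chief-01.novalocal",
    "local-hostname": "cloudinfra-acme-chief-01.novalocal",
    "name": "cloudinfra-acme-chief-01",
}

changed = rewrite_hostnames(
    metadata, ".novalocal", ".cloudinfra.eqiad1.wikimedia.cloud"
)
print(changed)
print(metadata["hostname"])
```

In the session above it would be called once on res.metadata and once on res.ec2_metadata before the pickle.dump. Whether those two dicts cover every occurrence on other images is not something I have checked, so grepping the resulting file for novalocal is still worth doing.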
Then replaced the old data file with this new one:
root@cloudinfra-acme-chief-01:~# cp other.pkl /var/lib/cloud/instances/4fc636c2-6af8-4ce7-a9a9-6c23a73cbd73/obj.pkl
And rebooted to see the hostname changed to the correct one:
dcaro@cloudinfra-acme-chief-01:~$ hostname -f
cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
The current templates on the nova side look ok, so I'm guessing this was an old install that cached the wrong data.
There's also an issue I've seen sometimes (again, with an older broken base image) where the flag that marks a first boot wasn't getting set correctly, so the VM re-ran the 'firstboot' logic on every boot. That could cause something similar to this.