
Puppet failure on cloudinfra-acme-chief-01.novalocal
Closed, ResolvedPublic

Description

Investigate

Event Timeline

dcaro triaged this task as High priority.Mar 1 2021, 9:35 AM
dcaro created this task.

The error is:

dcaro@cloudinfra-acme-chief-01:~$ sudo run-puppet-agent
Exiting; no certificate found and waitforcert is disabled

Looking

Tried refreshing the host certificate; that allowed puppet to run, but now it fails with:

root@cloudinfra-acme-chief-01:~# puppet agent -tv
Info: Caching certificate for cloudinfra-acme-chief-01.novalocal
Info: Caching certificate_revocation_list for ca
Info: Caching certificate for cloudinfra-acme-chief-01.novalocal
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find cloudinfra-acme-chief-01.novalocal via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-acme-chief-01.novalocal' returned 255: Invalid hostname (cloudinfra-acme-chief-01.novalocal) Unknown TLD.
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node cloudinfra-acme-chief-01.novalocal: Failed to find cloudinfra-acme-chief-01.novalocal via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-acme-chief-01.novalocal' returned 255: Invalid hostname (cloudinfra-acme-chief-01.novalocal) Unknown TLD.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

I think the hostname is not correct, as it should use a real TLD; looking.
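The "Unknown TLD" error suggests puppet-enc validates the hostname's top-level domain against an allowed list before looking up the node. A minimal sketch of that kind of check (the function name and the allowed-TLD set are assumptions for illustration, not the actual puppet-enc code):

```python
# Hypothetical sketch of the TLD validation puppet-enc appears to do;
# the real script and its allowed list may differ.
ALLOWED_TLDS = {"cloud", "org", "net"}  # assumed: .novalocal is not a real TLD

def validate_hostname(fqdn: str) -> bool:
    """Return True only when the hostname ends in a recognized TLD."""
    tld = fqdn.rsplit(".", 1)[-1]
    return tld in ALLOWED_TLDS

print(validate_hostname("cloudinfra-acme-chief-01.novalocal"))
print(validate_hostname(
    "cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud"))
```

With a list like that, any hostname ending in `.novalocal` is rejected with the "Invalid hostname ... Unknown TLD" error seen above.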

It was rebooted 2 days ago; that might be what triggered the error to show up.

root@cloudinfra-acme-chief-01:~# uptime
 10:09:48 up 2 days, 13:04,  2 users,  load average: 0.02, 0.02, 0.00

That reboot is probably related to the security reboots this weekend.

Why is it using .novalocal and not .cloudinfra.eqiad1.wikimedia.cloud?

Not sure; it seems to come from the cloud-config template (which uses fqdn, so I'm still looking into where that value comes from). I changed it manually and puppet is now running correctly, but I'm trying to make sure the hostname won't change back again.

From cloud-init data:

root@cloudinfra-acme-chief-01:~# grep novalocal /run/cloud-init/*
/run/cloud-init/instance-data.json:   "hostname": "cloudinfra-acme-chief-01.novalocal",
...
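The grep above can also be done programmatically, since the instance data is JSON. A small sketch of that check, using an inline sample instead of the real `/run/cloud-init/instance-data.json` (which has many more keys):

```python
import json

# Sample mirroring the relevant part of /run/cloud-init/instance-data.json.
sample = json.loads('{"hostname": "cloudinfra-acme-chief-01.novalocal"}')

def has_novalocal(data: dict) -> bool:
    """Detect the bad .novalocal suffix in cloud-init instance data."""
    return any(isinstance(v, str) and v.endswith(".novalocal")
               for v in data.values())

print(has_novalocal(sample))  # True for this host's cached data
```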

looking...

In the /var/log/cloud-init-output.log file, there's the entry:

+ sed -i s/novalocal/cloudinfra.eqiad1.wikimedia.cloud/g /etc/hosts

So it seems the intent was to change that directly in the hosts file :/; I wonder what happened.

Mentioned in SAL (#wikimedia-cloud) [2021-03-01T10:37:14Z] <dcaro> rebooting cloudinfra-acme-chief-01 to ensure hostname stability (T276041)

That did not work; the hostname got replaced again (cloud-init is probably resetting it from the template + data). Changing the data itself and rebooting again.

The data is, I think, stored under /etc/nova/vendor_data.json on the virt nodes; looking (it's put there by puppet, modules/openstack/manifests/nova/api/service.pp).

Those files have the correct fqdn though:

fqdn: {{ds.meta_data.name}}.{{ds.meta_data.project_id}}.eqiad1.wikimedia.cloud
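For illustration, rendering that template line with sample metadata shows the expected result. This is a plain-Python stand-in for the templating cloud-init actually uses, and the metadata values are assumptions matching this host:

```python
import re

template = ("{{ds.meta_data.name}}.{{ds.meta_data.project_id}}"
            ".eqiad1.wikimedia.cloud")

# Assumed sample metadata for this host; the real values come from
# the nova metadata service.
meta_data = {"name": "cloudinfra-acme-chief-01", "project_id": "cloudinfra"}

def render(tmpl: str, meta: dict) -> str:
    """Substitute {{ds.meta_data.<key>}} placeholders with metadata values."""
    return re.sub(r"\{\{ds\.meta_data\.(\w+)\}\}",
                  lambda m: meta[m.group(1)], tmpl)

print(render(template, meta_data))
# cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
```

So when the template and metadata are both correct, the fqdn comes out right; the bad hostname must come from stale cached data, not the template.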

Interesting... from the file /var/lib/cloud/instance/vendor-data.txt on the cloudinfra-acme-chief-01 machine:

# There's a timing issue that causes us to sometimes have 127.0.0.1 associated
# with $hostname.novalocal instead of the proper fqdn
sed -i "s/novalocal/${project}.${domain}/g" /etc/hosts

Fixed. In the end I had to change the local cloud-init data that the host had downloaded on install.

Created a local copy of the data file: /var/lib/cloud/instances/4fc636c2-6af8-4ce7-a9a9-6c23a73cbd73/obj.pkl
Then in python:

root@cloudinfra-acme-chief-01:~# python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> res = pickle.load(open('obj.pkl', 'rb'))

# Found these by grepping in the binary file, and looking around until there were no matches in the binary file
>>> res.metadata['hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.metadata['local-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'

>>> res.ec2_metadata['local-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.ec2_metadata['public-hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'
>>> res.ec2_metadata['hostname'] = 'cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud'

>>> pickle.dump(res, open('other.pkl', 'wb'))
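Unpickling the real obj.pkl needs cloud-init's classes importable; the edit-and-rewrite round trip itself can be sketched with a stand-in object (`SimpleNamespace` here instead of the real datasource class):

```python
import pickle
from types import SimpleNamespace

FQDN = "cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud"

# Stand-in for the cloud-init datasource object pickled in obj.pkl.
res = SimpleNamespace(
    metadata={"hostname": "cloudinfra-acme-chief-01.novalocal",
              "local-hostname": "cloudinfra-acme-chief-01.novalocal"},
    ec2_metadata={"hostname": "cloudinfra-acme-chief-01.novalocal"},
)

# Rewrite every cached hostname field, as in the session above.
for d in (res.metadata, res.ec2_metadata):
    for key in d:
        d[key] = FQDN

blob = pickle.dumps(res)       # what got written to other.pkl
restored = pickle.loads(blob)  # verify the edit survives a round trip
print(restored.metadata["hostname"])
```

The key point is that every hostname-bearing key in both dicts needs rewriting; a single leftover `.novalocal` value would let cloud-init reset the hostname again on the next boot.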

Then replaced the old data file with this new one:

root@cloudinfra-acme-chief-01:~# cp other.pkl /var/lib/cloud/instances/4fc636c2-6af8-4ce7-a9a9-6c23a73cbd73/obj.pkl

And rebooted to see the hostname changed to the correct one:

dcaro@cloudinfra-acme-chief-01:~$ hostname -f
cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud

The current templates on the nova side look ok, so I'm guessing this was an old install that cached the wrong data.

There's also an issue I've seen sometimes (again, with an older broken base image) where the flag that marks a first boot wasn't getting set correctly, so the VM re-ran the 'firstboot' logic on every boot. That could cause something similar to this.
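The first-boot detection described above can be sketched as a run-once marker file. The path and function name here are illustrative, not cloud-init's actual semaphore layout:

```python
import os
import tempfile

def run_firstboot_once(marker_path: str) -> bool:
    """Run first-boot logic only if the marker file is absent.

    Returns True when the logic ran. If the marker never gets written
    (the bug described above), every boot looks like the first one.
    """
    if os.path.exists(marker_path):
        return False
    # ... first-boot logic (e.g. the /etc/hosts sed fix) would run here ...
    with open(marker_path, "w") as f:
        f.write("done\n")
    return True

marker = os.path.join(tempfile.mkdtemp(), "firstboot.done")
print(run_firstboot_once(marker))  # first boot: runs
print(run_firstboot_once(marker))  # later boots: skipped
```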