I noticed from a Puppet failure email that the agent run failed on acme-chief-2.cloudinfra-codfw1dev:
Date: Thu, 28 Apr 2022 08:15:05 +0000
From: root <root@acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][cloudinfra-codfw1dev] Puppet failure on acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (172.16.128.164)
When I checked the host, it had no puppet-enc binary or config:
dcaro@acme-chief-2:~$ sudo run-puppet-agent
2022-04-28 08:26:29.888612 WARN puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
2022-04-28 08:26:30.618051 WARN puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
...
dcaro@acme-chief-2:~$ /usr/local/bin/puppet-enc acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
-bash: /usr/local/bin/puppet-enc: No such file or directory
So I checked the puppetmaster, where Puppet was also failing:
dcaro@cloudinfra-internal-puppetmaster-01:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud: Failed to find cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
Warning: Not using cache on failed catalog
This host did have the binary, but running it manually failed with:
dcaro@cloudinfra-internal-puppetmaster-01:~$ /usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
...
requests.exceptions.ConnectionError: HTTPConnectionPool(host='puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org', port=8100): Max retries exceeded with url: /v1/cloudinfra-codfw1dev/node/cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbd01264160>: Failed to establish a new connection: [Errno 113] No route to host'))
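The "[Errno 113] No route to host" in the traceback points at the network path to the ENC service rather than at puppet-enc itself. A minimal sketch of how that could be confirmed without going through the ENC client, using only the Python standard library; the function name `can_connect` and the timeout value are my own choices for illustration, not part of any existing tooling:

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try a plain TCP connection and report whether it succeeded.

    Returns (True, "connected") on success, or (False, "<ErrorType>: <message>")
    when name resolution, routing, or the connection itself fails.
    """
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as error:
        # Covers DNS failures, timeouts, refused connections, and
        # "No route to host" (EHOSTUNREACH, errno 113 on Linux).
        return False, f"{type(error).__name__}: {error}"
```

Run from the puppetmaster, `can_connect("puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org", 8100)` would be expected to return a failure mentioning the same errno if the problem is at the network level.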
So I went to check the enc-1 host, but it was unreachable over SSH, so I checked whether OpenStack reported it as up:
root@cloudcontrol2004-dev:~# openstack --os-project-id=cloudinfra-codfw1dev server show enc-1
+-------------------------------------+--------------------------------------------------------------+
| Field                               | Value                                                        |
+-------------------------------------+--------------------------------------------------------------+
| OS-DCF:diskConfig                   | AUTO                                                         |
| OS-EXT-AZ:availability_zone         | nova                                                         |
| OS-EXT-SRV-ATTR:host                | cloudvirt2003-dev                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt2003-dev.codfw.wmnet                                |
| OS-EXT-SRV-ATTR:instance_name       | i-00000cdf                                                   |
| OS-EXT-STS:power_state              | Running                                                      |
...
It was, so I went to the hypervisor (cloudvirt2003-dev) to connect through the console, and found that the VM had no network:
root@cloudvirt2003-dev:~# virsh
Welcome to virsh, the virtualization interactive terminal.
Type: 'help' for help with commands
      'quit' to quit
virsh # console i-00000cdf
Connected to domain 'i-00000cdf'
Escape character is ^] (Ctrl + ])
root@enc-1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
The logs showed that it was failing to get DHCP replies, and since it had no stored leases it ended up with no IP address:
root@enc-1:~# journalctl | grep dhclient
...
Apr 28 09:01:13 enc-1 dhclient[350]: No DHCPOFFERS received.
Apr 28 09:01:13 enc-1 dhclient[350]: No working leases in persistent database - sleeping.
The last ack was:
Apr 27 07:35:26 enc-1 dhclient[350]: DHCPACK of 172.16.128.97 from 172.16.128.10
It seems to be happening on other VMs on that host too, such as cloudinfra-db-01, where the last ack was:
root@cloudinfra-db-01:~# journalctl | grep dhclient | grep -i ack
Apr 27 16:06:47 cloudinfra-db-01 dhclient[381]: DHCPACK of 172.16.128.23 from 172.16.128.14
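To compare the last successful lease across several VMs on the hypervisor, the grep above could be replaced by a small parser. This is only a sketch: it assumes the syslog-style line format seen in the journal output in this task, and the names `ACK_PATTERN` and `last_dhcp_ack` are made up for illustration.

```python
import re

# Matches lines such as:
#   Apr 27 07:35:26 enc-1 dhclient[350]: DHCPACK of 172.16.128.97 from 172.16.128.10
# The format is assumed from the journal output quoted in this task.
ACK_PATTERN = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+dhclient\[\d+\]:\s+"
    r"DHCPACK of (?P<address>\S+) from (?P<server>\S+)"
)


def last_dhcp_ack(journal_lines):
    """Return a dict describing the last DHCPACK line seen, or None if there is none."""
    last = None
    for line in journal_lines:
        match = ACK_PATTERN.match(line.strip())
        if match:
            # Later matches overwrite earlier ones, so the final value is the
            # most recent ack as long as the input is in chronological order.
            last = match.groupdict()
    return last
```

Feeding it the output of `journalctl | grep dhclient` from each VM gives the timestamp, the leased address, and the DHCP server that answered, which makes it easier to see when each VM stopped getting replies and from which server its last reply came.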