
codfw1dev: several VMs not getting DHCP acks
Closed, Resolved (Public)

Description

Noticed from a puppet failure email that puppet was failing on acme-chief-2.cloudinfra-codfw1dev:

Date: Thu, 28 Apr 2022 08:15:05 +0000
From: root <root@acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][cloudinfra-codfw1dev] Puppet failure on acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (172.16.128.164)

When checking on the host, it did not have a puppet-enc binary or config:

dcaro@acme-chief-2:~$ sudo run-puppet-agent
2022-04-28 08:26:29.888612 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
2022-04-28 08:26:30.618051 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
...

dcaro@acme-chief-2:~$ /usr/local/bin/puppet-enc acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
-bash: /usr/local/bin/puppet-enc: No such file or directory

So I went to the puppetmaster to check there, where puppet was also failing:

dcaro@cloudinfra-internal-puppetmaster-01:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Failed to find cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud: Failed to find cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud' returned 1:
Warning: Not using cache on failed catalog

This host did have the binary, but running it manually gives:

dcaro@cloudinfra-internal-puppetmaster-01:~$ /usr/local/bin/puppet-enc cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
...
requests.exceptions.ConnectionError: HTTPConnectionPool(host='puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org', port=8100): Max retries exceeded with url: /v1/cloudinfra-codfw1dev/node/cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbd01264160>: Failed to establish a new connection: [Errno 113] No route to host'))
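
The ENC lookup is just an HTTP GET against the endpoint shown in the traceback, so it can be reproduced without puppet to separate a service problem from a network one. A hedged sketch, with the endpoint taken from the error above:

curl -v "http://puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org:8100/v1/cloudinfra-codfw1dev/node/cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud"

A "No route to host" there as well would point at the enc backend (enc-1) rather than at the puppet-enc script itself.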

So I went to check on the enc-1 host, but it was unreachable over ssh, so I checked on openstack whether it was up:

root@cloudcontrol2004-dev:~# openstack --os-project-id=cloudinfra-codfw1dev server show enc-1
+-------------------------------------+--------------------------------------------------------------+
| Field                               | Value                                                        |
+-------------------------------------+--------------------------------------------------------------+
| OS-DCF:diskConfig                   | AUTO                                                         |
| OS-EXT-AZ:availability_zone         | nova                                                         |
| OS-EXT-SRV-ATTR:host                | cloudvirt2003-dev                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt2003-dev.codfw.wmnet                                |
| OS-EXT-SRV-ATTR:instance_name       | i-00000cdf                                                   |
| OS-EXT-STS:power_state              | Running                                                      |
...

It was up, so I went to the hypervisor (cloudvirt2003-dev) to connect through the console, and the VM had no network:

root@cloudvirt2003-dev:~# virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # console i-00000cdf
Connected to domain 'i-00000cdf'
Escape character is ^] (Ctrl + ])

root@enc-1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

Checking the logs, it was failing to get DHCP replies, and as it had no stored leases it ended up with no IP:

root@enc-1:~# journalctl | grep dhclient
...
Apr 28 09:01:13 enc-1 dhclient[350]: No DHCPOFFERS received.
Apr 28 09:01:13 enc-1 dhclient[350]: No working leases in persistent database - sleeping.

The last ack was:

Apr 27 07:35:26 enc-1 dhclient[350]: DHCPACK of 172.16.128.97 from 172.16.128.10
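
From the console one can also retry the lease manually to rule out a stuck dhclient. A hedged sketch (dhclient without an interface argument tries all configured broadcast interfaces):

root@enc-1:~# dhclient -v

If the DHCP server stays silent this eventually ends with the same "No DHCPOFFERS received." message.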

It seems to be happening on other VMs on that host too, like cloudinfra-db-01, where the last ack was:

root@cloudinfra-db-01:~# journalctl | grep dhclient | grep -i ack
Apr 27 16:06:47 cloudinfra-db-01 dhclient[381]: DHCPACK of 172.16.128.23 from 172.16.128.14

Related Objects

Status     Assigned
Resolved   Papaul
Resolved   dcaro

Event Timeline

dcaro triaged this task as High priority. Apr 28 2022, 9:05 AM
dcaro created this task.

Checking the cloudnet2006-dev and cloudnet2005-dev hosts, it seems that the neutron-dhcp-agent had some issues around 02:49:

root@cloudnet2005-dev:~# systemctl status neutron-\*
...
Apr 26 02:49:10 cloudnet2005-dev neutron-dhcp-agent[2575]: 2022-04-26 02:49:10.368 2575 ERROR neutron.agent.dhcp.agent [req-679a3416-99df-4b1e-aaeb-c4e01f09fb9d - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID cdd8065e9fc94e54b7efe1f069e01a2b
...

That happened on both hosts. I restarted the services (roughly as sketched below) and they did not complain; still looking...
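
The restart itself was nothing special; roughly (hedged sketch, unit names assumed to match the agent binaries listed further down):

root@cloudnet2005-dev:~# systemctl restart neutron-dhcp-agent neutron-l3-agent neutron-metadata-agent neutron-linuxbridge-agent
root@cloudnet2005-dev:~# systemctl status neutron-\* | grep -i active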

That did not help.
Looking at the cloudnet2005-dev logs (and also cloudnet2006-dev), I see the DHCP requests from the enc-1 VM (MAC fa:16:3e:ad:10:28) being dropped at the firewall:

Apr 28 09:12:40 cloudnet2006-dev ulogd[987]: [fw-in-drop] IN=br-internal OUT= MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ad:10:28:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=392 TOS=10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=372 MARK=0
Apr 28 09:12:41 cloudnet2006-dev ulogd[987]: [fw-in-drop] IN=br-internal OUT= MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ad:10:28:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=392 TOS=10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=372 MARK=0
Apr 28 09:12:57 cloudnet2006-dev ulogd[987]: [fw-in-drop] IN=br-internal OUT= MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ad:10:28:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=392 TOS=10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=372 MARK=0
Apr 28 09:13:01 cloudnet2006-dev ulogd[987]: [fw-in-drop] IN=br-internal OUT= MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ad:10:28:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=392 TOS=10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=372 MARK=0
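
To confirm that the requests were reaching the host and being eaten by the host firewall (rather than never arriving), the bridge traffic and the INPUT chain counters can be checked. A hedged sketch:

root@cloudnet2006-dev:~# tcpdump -ni br-internal 'udp and (port 67 or port 68)'
root@cloudnet2006-dev:~# iptables -L INPUT -n -v --line-numbers | grep -iE 'policy|udp'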

Listing the neutron agents showed leftover entries for agents that had already been removed, so I cleaned those up, though it had no effect:

root@cloudcontrol2004-dev:~# neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host              | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| 06d461b8-b9ec-45a3-8c6e-ef56f22c721b | DHCP agent         | cloudnet2006-dev  | nova              | :-)   | True           | neutron-dhcp-agent        |
| 228b6925-6b3e-464f-9d23-70e250b928f2 | Linux bridge agent | cloudnet2004-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
| 2f9bd1b1-e51f-47d4-b527-ccfd6b062f8b | DHCP agent         | cloudnet2004-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 46573e30-a4f0-4424-84c5-e18d7a1d0902 | Linux bridge agent | cloudvirt2003-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| 4a0e32d8-f231-4e50-9636-414b3e44cd53 | L3 agent           | cloudnet2002-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 4ce9e60e-797d-47db-8e60-5d01405799eb | L3 agent           | cloudnet2006-dev  | nova              | :-)   | True           | neutron-l3-agent          |
| 503a6978-1545-47e7-9272-8be3e1140825 | Metadata agent     | cloudnet2005-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| 5584e5f9-1e37-430c-b1cd-a3be0a1f1c5b | L3 agent           | cloudnet2004-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 59bc1a4d-5bbe-4035-a1cc-5e9a0cc790b2 | DHCP agent         | cloudnet2005-dev  | nova              | :-)   | True           | neutron-dhcp-agent        |
| 6be877da-0221-4d44-813a-7e77868a2364 | Metadata agent     | cloudnet2002-dev  |                   | xxx   | True           | neutron-metadata-agent    |
| 73206678-6394-4d0e-9668-2c6cdf28b595 | Linux bridge agent | cloudvirt2002-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| 73361b68-276d-45a6-87a4-2b704a56dedb | L3 agent           | cloudnet2005-dev  | nova              | :-)   | True           | neutron-l3-agent          |
| 865072bb-941d-4d89-bb39-282df7fe7110 | DHCP agent         | cloudnet2002-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 905782d2-fcd7-49ac-b499-8c068057c0a5 | Linux bridge agent | cloudnet2005-dev  |                   | :-)   | True           | neutron-linuxbridge-agent |
| 98f75540-ec40-4b32-be19-33dd3c24c5b5 | Linux bridge agent | cloudvirt2001-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| ac55fc68-6811-43eb-9d1c-f0a22f42eb18 | Metadata agent     | cloudnet2006-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| cf504178-7bfe-4972-b2c6-0872cb829f2a | Metadata agent     | cloudnet2004-dev  |                   | xxx   | True           | neutron-metadata-agent    |
| e4828358-0291-4d00-a493-a866183689ee | Linux bridge agent | cloudnet2002-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
| e9cde754-b603-47c1-97b9-9ac2d74d043a | Linux bridge agent | cloudnet2006-dev  |                   | :-)   | True           | neutron-linuxbridge-agent |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+

root@cloudcontrol2004-dev:~# neutron agent-delete cf504178-7bfe-4972-b2c6-0872cb829f2a
...
root@cloudcontrol2004-dev:~# neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host              | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| 06d461b8-b9ec-45a3-8c6e-ef56f22c721b | DHCP agent         | cloudnet2006-dev  | nova              | :-)   | True           | neutron-dhcp-agent        |
| 46573e30-a4f0-4424-84c5-e18d7a1d0902 | Linux bridge agent | cloudvirt2003-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| 4ce9e60e-797d-47db-8e60-5d01405799eb | L3 agent           | cloudnet2006-dev  | nova              | :-)   | True           | neutron-l3-agent          |
| 503a6978-1545-47e7-9272-8be3e1140825 | Metadata agent     | cloudnet2005-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| 59bc1a4d-5bbe-4035-a1cc-5e9a0cc790b2 | DHCP agent         | cloudnet2005-dev  | nova              | :-)   | True           | neutron-dhcp-agent        |
| 73206678-6394-4d0e-9668-2c6cdf28b595 | Linux bridge agent | cloudvirt2002-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| 73361b68-276d-45a6-87a4-2b704a56dedb | L3 agent           | cloudnet2005-dev  | nova              | :-)   | True           | neutron-l3-agent          |
| 905782d2-fcd7-49ac-b499-8c068057c0a5 | Linux bridge agent | cloudnet2005-dev  |                   | :-)   | True           | neutron-linuxbridge-agent |
| 98f75540-ec40-4b32-be19-33dd3c24c5b5 | Linux bridge agent | cloudvirt2001-dev |                   | :-)   | True           | neutron-linuxbridge-agent |
| ac55fc68-6811-43eb-9d1c-f0a22f42eb18 | Metadata agent     | cloudnet2006-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| e9cde754-b603-47c1-97b9-9ac2d74d043a | Linux bridge agent | cloudnet2006-dev  |                   | :-)   | True           | neutron-linuxbridge-agent |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
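
For reference, since the neutron CLI is deprecated, the same check and cleanup can be done with the openstack CLI. A hedged sketch, using one of the stale (xxx) agent IDs from the first listing above:

root@cloudcontrol2004-dev:~# openstack network agent list
root@cloudcontrol2004-dev:~# openstack network agent delete 865072bb-941d-4d89-bb39-282df7fe7110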

Checking the firewall rules on cloudnet1003 vs cloudnet2005-dev, they are quite different, especially the INPUT chain:

root@cloudnet1003:~# iptables -n -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
neutron-linuxbri-INPUT  all  --  0.0.0.0/0            0.0.0.0/0

# vs
root@cloudnet2005-dev:~# iptables -n -L
Chain INPUT (policy DROP)
target     prot opt source               destination
neutron-linuxbri-INPUT  all  --  0.0.0.0/0            0.0.0.0/0
DROP       all  --  23.226.133.0/24      0.0.0.0/0
DROP       all  --  31.184.250.10        0.0.0.0/0
DROP       all  --  92.200.101.215       0.0.0.0/0
DROP       all  --  93.184.216.34        0.0.0.0/0
DROP       all  --  169.45.120.238       0.0.0.0/0
DROP       all  --  170.52.77.162        0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            PKTTYPE = multicast
DROP       tcp  --  0.0.0.0/0            0.0.0.0/0            state NEW tcp flags:!0x17/0x02
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
ACCEPT     tcp  --  208.80.155.110       0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  208.80.153.54        0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  91.198.174.6         0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  198.35.26.13         0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  103.102.166.6        0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  185.15.58.6          0.0.0.0/0            tcp dpt:22
ACCEPT     all  --  208.80.154.88        0.0.0.0/0
ACCEPT     all  --  208.80.153.84        0.0.0.0/0
ACCEPT     all  --  10.192.16.75         0.0.0.0/0
ACCEPT     all  --  10.192.32.67         0.0.0.0/0
ACCEPT     tcp  --  10.64.32.25          0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  10.192.32.49         0.0.0.0/0            tcp dpt:22
DROP       udp  --  0.0.0.0/0            255.255.255.255      udp spt:67 dpt:68
NFLOG      all  --  0.0.0.0/0            0.0.0.0/0            limit: avg 1/sec burst 5 nflog-prefix  "[fw-in-drop]"
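
A compact way to spot the extra, non-neutron rules is to diff the rule specs from both hosts. A hedged sketch, assuming ssh access to both from the same place:

diff <(ssh cloudnet1003 'sudo iptables -S INPUT') <(ssh cloudnet2005-dev 'sudo iptables -S INPUT')

Everything that only shows up on the -dev host (the DROP policy and the long accept list) is not managed by neutron.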

Rebooted the latter to see if those were leftover rules from a previous puppet run, but that did not help; still digging.

After looking around (and thanks to taavi's comment on IRC), it turns out that the machines were reimaged with role(insetup) instead of role(insetup_noferm), which installed ferm on them. The later move to the final role does not remove ferm, so it was left there conflicting with the iptables rules that neutron manages.

I removed ferm from the machines (one by one, roughly as sketched below) and restarted them, and everything works now.
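
The cleanup on each cloudnet host was roughly (hedged sketch, not the exact commands run):

root@cloudnet2005-dev:~# apt-get purge -y ferm
root@cloudnet2005-dev:~# systemctl reboot

After the restart neutron re-creates only its own iptables rules, so the INPUT policy should be back to ACCEPT, as on cloudnet1003.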

So the next step is to reimage them with the expected process, to make sure they match what a clean cloudnet install should look like.

Reimaged both hosts and they are up and running :), will close.