
cloudvps: labtestn: neutron issue with vxlan and l2population
Closed, Resolved (Public)

Description

There is an issue in the labtestn setup regarding the neutron networking deployment using vxlan and l2population.

In this deployment, vxlan-2 is the interface that should connect all the virts and the network nodes.

If you inspect the packets on this interface on a virt node, you see something like this:

aborrero@labtestvirt2003:~ $ sudo tcpdump -i vxlan-2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vxlan-2, link-type EN10MB (Ethernet), capture size 262144 bytes
16:44:00.177241 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
16:44:00.231328 ARP, Request who-has 172.16.130.1 tell 172.16.130.15, length 28
16:44:01.201191 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
16:44:01.255605 ARP, Request who-has 172.16.130.1 tell 172.16.130.15, length 28
16:44:01.844882 ARP, Request who-has 172.16.130.1 tell 172.16.130.14, length 28
[...]

(i.e., no ARP replies)

However, if you inspect the packets on the networking node, you see something like this:

aborrero@labtestneutron2001:~ 8s $ sudo tcpdump -i vxlan-2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vxlan-2, link-type EN10MB (Ethernet), capture size 262144 bytes
16:28:31.402268 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
16:28:31.402311 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
16:28:31.916441 ARP, Request who-has 172.16.130.1 tell 172.16.130.14, length 28
16:28:31.916575 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
16:28:32.426641 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
16:28:32.426708 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
16:28:32.940036 ARP, Request who-has 172.16.130.1 tell 172.16.130.14, length 28
16:28:32.940067 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
16:28:33.261335 ARP, Request who-has 172.16.130.1 tell 172.16.130.15, length 28
16:28:33.261376 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
16:28:33.449983 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
16:28:33.450057 ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
[...]

(i.e., ARP replies are being sent)

This may mean there is a misconfiguration or bug somewhere that prevents vxlan from working properly as the network overlay.
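To compare the two captures without reading them line by line, a small helper can count requests and replies per target IP. This is only a sketch: the function name arp_summary is made up for illustration, and it assumes the default one-line tcpdump ARP output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any existing tooling): reads tcpdump text
# output on stdin and prints, per requested IP, how many ARP requests and
# how many ARP replies were seen.
arp_summary() {
  awk '
    / ARP, Request who-has / { req[$5]++ }   # $5 is the IP being asked for
    / ARP, Reply /           { rep[$4]++ }   # $4 is the IP being answered for
    END {
      for (ip in req)
        printf "%s requests=%d replies=%d\n", ip, req[ip], rep[ip] + 0
    }
  ' | sort
}

# Example use on a host (needs root, stops after 20 packets):
#   sudo tcpdump -l -n -i vxlan-2 -c 20 arp | arp_summary
```

On the virt node this would be expected to report replies=0 for 172.16.130.1, while the same command on the networking node should report a non-zero reply count.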

Also, @chasemp mentioned we may be affected by an upstream openstack bug (https://bugs.launchpad.net/neutron/+bug/1365476) which is related to the HA setup.

We don't fully know what's going on, but right now, we can't contact instances in the subnet which uses vxlan.

Right now the relevant servers are (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Deployments#Labtestn_deployment):

labtestcontrol2003.wikimedia.org
labtestneutron2001.codfw.wmnet
labtestneutron2002.codfw.wmnet
labtestservices2002.wikimedia.org
labtestservices2003.wikimedia.org
labtestvirt2003.codfw.wmnet
labtestmetal2001.codfw.wmnet (as virt)

Event Timeline

aborrero triaged this task as Medium priority. May 28 2018, 4:54 PM
aborrero created this task.

A few things are confusing: https://bugs.launchpad.net/neutron/+bug/1365476 seems to have been targeted for a fix in Liberty and we are on Mitaka, yet the symptoms are very close to the original bug reports. My thinking is that another variation of the same issue is probably at play. The good news is that VXLAN isn't required for the initial transition from nova-network, but it muddies the waters when trying to define a common HA story for the VPC gateways.

My thinking is:

  • test w/o l2pop
  • think on a different gateway HA mechanism (maybe similar to our current hacks until Newton or fix)
  • ?

On labtestvirt2003.codfw.wmnet I can see the vxlan traffic by sniffing on eth0:

aborrero@labtestvirt2003:~$ sudo tcpdump -i eth0 udp port 8472
[...]
10:18:15.878886 IP labtestvirt2003.codfw.wmnet.35918 > labtestneutron2002.codfw.wmnet.8472: OTV, flags [I] (0x08), overlay 0, instance 2
ARP, Request who-has 172.16.130.1 tell 172.16.130.15, length 28
10:18:15.879545 IP labtestneutron2001.codfw.wmnet.59104 > labtestvirt2003.codfw.wmnet.8472: OTV, flags [I] (0x08), overlay 0, instance 2
ARP, Reply 172.16.130.1 is-at fa:16:3e:3f:d3:9a (oui Unknown), length 28
[...]

However, sniffing on vxlan-2 doesn't show the ARP reply packets:

aborrero@labtestvirt2003:~ $ sudo tcpdump -i vxlan-2
[...]
10:18:15.878856 ARP, Request who-has 172.16.130.1 tell 172.16.130.15, length 28
10:18:16.016930 ARP, Request who-has 172.16.130.1 tell 172.16.130.13, length 28
[...]

This could mean there is some mechanism preventing the replies from entering the vxlan-2 interface even though the packets arrived at the server via eth0.
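One place to look, offered as an assumption rather than a confirmed diagnosis: with l2population enabled, the agent is expected to program the forwarding entries of the vxlan device itself, so the entries present on vxlan-2 are worth comparing between hosts. The sketch below lists the remote tunnel endpoints found in `bridge fdb show` output; the function name vxlan_fdb_remotes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `bridge fdb show dev <vxlan-dev>` output on stdin
# and prints the unique remote endpoint IPs (the value after each "dst").
vxlan_fdb_remotes() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dst") print $(i + 1) }' | sort -u
}

# Example use on a host:
#   bridge fdb show dev vxlan-2 | vxlan_fdb_remotes
# Device details (VNI, UDP port, whether learning is enabled):
#   ip -d link show vxlan-2
```

If the endpoint that sends the ARP replies (labtestneutron2001 in the capture above) is missing from the list on the virt node, or the entries differ from what the networking nodes have, that would be one thing to report upstream. It does not by itself explain the drop.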

I was able to get the setup working by disabling l2population, doing the following steps:

  • hosts are: labtestneutron2001, labtestneutron2002, labtestcontrol2003, labtestvirt2002, labtestmetal2001.
  • disable puppet on all of them
  • edit the file /etc/neutron/plugins/ml2/ml2_config.ini
  • in the [ml2] section, delete l2population from the mechanism_drivers= directive. It ends up as mechanism_drivers = linuxbridge
  • restart services: sudo systemctl restart neutron-l3-agent.service neutron-linuxbridge-agent.service neutron-metadata-agent.service (some hosts may not have all the services)
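The file edit in the steps above can be sketched as a small shell function. This is an illustration, not what was actually run: the name disable_l2pop is made up, and it assumes mechanism_drivers appears only in the [ml2] section and that l2population is listed next to linuxbridge on that line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: removes l2population from the mechanism_drivers line
# of the given ml2 config file, keeping a backup copy as <file>.bak.
disable_l2pop() {
  sed -E -i.bak '/^[[:space:]]*mechanism_drivers[[:space:]]*=/ {
    s/[[:space:]]*l2population[[:space:]]*,?//
    s/,[[:space:]]*$//
    s/=([^[:space:]])/= \1/
  }' "$1"
}

# Example use on each host, with puppet disabled first:
#   sudo cp /etc/neutron/plugins/ml2/ml2_config.ini /tmp/ml2_config.ini.orig
#   disable_l2pop /etc/neutron/plugins/ml2/ml2_config.ini   # run as root
#   grep mechanism_drivers /etc/neutron/plugins/ml2/ml2_config.ini
```

The three substitutions drop the driver name, strip a trailing comma if one is left behind, and restore the space after the equals sign when l2population was listed first. The services still need to be restarted afterwards as described in the last step.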

After these steps, traffic on the vxlan-2 interface crosses in both directions, so VMs in the 172.16.130.0/24 range are reachable (for example, the VM at 172.16.130.13).

This indicates that, as @chasemp suggested, there is some weird bug in the l2population mechanism (or some other misconfiguration on our side).

BTW this makes a floating IP assigned to an instance work again.

BTW if you assign a floating IP to an instance, you can SSH directly to it with this ssh config:

## labtestn
Host 172.16.*
    User root
    ProxyCommand ssh -W %h:%p labtestcontrol2003.wikimedia.org
    IdentityFile ~/.ssh/my_wmf_cloud_strong_key

Nice. A note that we decided to disable l2pop for now.

Change 436319 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: disable l2population

https://gerrit.wikimedia.org/r/436319

Change 436319 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: disable l2population

https://gerrit.wikimedia.org/r/436319
