analytics - hadoop-worker-3 can't talk to kdc
Closed, Resolved · Public

Description

kdc and hadoop-worker-3 cannot ping each other.

kdc can't resolve worker-3's MAC address via ARP. The ARP request reaches worker-3, which replies, but the reply never arrives at kdc's eth0 interface.

If a static ARP entry for worker-3 is added on kdc, ICMP echo requests go out, are received by worker-3, and are answered. However, the echo replies don't arrive at kdc's eth0 interface either.

Event Timeline

GTirloni created this task. Nov 9 2018, 5:51 PM
Restricted Application added a subscriber: Aklapper. Nov 9 2018, 5:51 PM
Paladox added a subscriber: Paladox. Nov 9 2018, 5:54 PM

Details about ARP issues below.

kdc:

18:08:07.525254 ARP, Request who-has 172.16.2.243 tell 172.16.2.235, length 28

hadoop-worker-3:

18:08:07.518395 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 172.16.2.243 tell 172.16.2.235, length 42
18:08:07.518421 fa:16:3e:c7:f7:e7 > fa:16:3e:91:a2:93, ethertype ARP (0x0806), length 42: Reply 172.16.2.243 is-at fa:16:3e:c7:f7:e7, length 28

But still:

kdc# arp 172.16.2.243
Address                  HWtype  HWaddress           Flags Mask            Iface
hadoop-worker-3.analyti          (incomplete)                              eth0

Krenair added a subscriber: Krenair. Nov 9 2018, 6:13 PM
GTirloni added a comment. Edited · Nov 9 2018, 7:14 PM

kdc (cloudvirt1018) cannot ping any VMs running on cloudvirt1023:

IP           HYPERVISOR    INSTANCE                FROM_kdc FROM_worker3
172.16.2.234 cloudvirt1023 jessietest-1            fail     ok
172.16.2.235 cloudvirt1018 kdc                     -        fail
172.16.2.236 cloudvirt1023 turnilo                 fail     ok
172.16.2.237 cloudvirt1023 hadoop-coordinator-2    fail     ok
172.16.2.238 cloudvirt1021 hadoop-master-4         ok       ok
172.16.2.239 cloudvirt1018 hadoop-master-3         ok       ok
172.16.2.240 cloudvirt1023 d-3                     fail     ok
172.16.2.241 cloudvirt1023 d-1                     fail     ok
172.16.2.242 cloudvirt1023 d-2                     fail     ok
172.16.2.243 cloudvirt1023 hadoop-worker-3         fail     ok
172.16.2.244 cloudvirt1021 hadoop-worker-2         ok       ok
172.16.2.245 cloudvirt1021 hadoop-worker-1         ok       ok
172.16.2.246 cloudvirt1021 zk1-3                   ok       ok
172.16.2.247 cloudvirt1023 zk1-2                   fail     ok
172.16.2.248 cloudvirt1023 k4-2                    fail     ok
172.16.2.249 cloudvirt1023 k4-1                    fail     ok
172.16.2.250 cloudvirt1021 zk1-1                   ok       ok

Note: integration-slave-jessie-1003 and compiler1002, which also run on cloudvirt1018, can ping all analytics VMs just fine.
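
To make the pattern in the table easier to see, here is a minimal Python sketch that groups the FROM_kdc results by hypervisor. The rows are copied from the table above (kdc's own row is left out because it has no FROM_kdc result); the function name `results_by_hypervisor` is only illustrative and not part of any tooling used in this task.

```python
from collections import defaultdict

# Rows copied from the table above: hypervisor, instance, ping result from kdc.
# kdc's own row is omitted because it has no FROM_kdc result.
ROWS = """\
cloudvirt1023 jessietest-1 fail
cloudvirt1023 turnilo fail
cloudvirt1023 hadoop-coordinator-2 fail
cloudvirt1021 hadoop-master-4 ok
cloudvirt1018 hadoop-master-3 ok
cloudvirt1023 d-3 fail
cloudvirt1023 d-1 fail
cloudvirt1023 d-2 fail
cloudvirt1023 hadoop-worker-3 fail
cloudvirt1021 hadoop-worker-2 ok
cloudvirt1021 hadoop-worker-1 ok
cloudvirt1021 zk1-3 ok
cloudvirt1023 zk1-2 fail
cloudvirt1023 k4-2 fail
cloudvirt1023 k4-1 fail
cloudvirt1021 zk1-1 ok
"""


def results_by_hypervisor(rows: str) -> dict[str, set[str]]:
    """Collect the distinct ping results observed for each hypervisor."""
    grouped = defaultdict(set)
    for line in rows.splitlines():
        hypervisor, _instance, result = line.split()
        grouped[hypervisor].add(result)
    return dict(grouped)


summary = results_by_hypervisor(ROWS)
for hypervisor in sorted(summary):
    print(hypervisor, sorted(summary[hypervisor]))
```

Every hypervisor ends up with a single result, which is what points at the path between cloudvirt1018 and cloudvirt1023 rather than at any individual VM.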

cloudvirt1018:~# tcpdump -n -e -i brq7425e328-56 '(ether host fa:16:3e:91:a2:93 or ether host fa:16:3e:c7:f7:e7) and (arp or icmp)'
19:12:14.823240 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
19:12:15.847389 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
...

cloudvirt1023:~# tcpdump -n -e -i brq7425e328-56 '(ether host fa:16:3e:91:a2:93 or ether host fa:16:3e:c7:f7:e7) and (arp or icmp)'
19:12:16.873781 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 172.16.2.243 tell 172.16.2.235, length 42
19:12:16.873903 fa:16:3e:c7:f7:e7 > fa:16:3e:91:a2:93, ethertype ARP (0x0806), length 42: Reply 172.16.2.243 is-at fa:16:3e:c7:f7:e7, length 28

The ARP reply sent by worker-3 leaves the VM and is seen on cloudvirt1023's bridge, but it never reaches cloudvirt1018's bridge.

The above is also true for the eth1 and eth1.1105 interfaces (replies do not arrive on the wire):

cloudvirt1018:~# tcpdump -n -e -i eth1.1105 '(ether host fa:16:3e:91:a2:93 or ether host fa:16:3e:c7:f7:e7) and (arp or icmp)'
19:38:27.685914 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
19:38:28.709903 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
19:38:29.733871 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28

cloudvirt1023:~# tcpdump -n -e -i eth1.1105 '(ether host fa:16:3e:91:a2:93 or ether host fa:16:3e:c7:f7:e7) and (arp or icmp)'
19:38:24.614251 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 172.16.2.243 tell 172.16.2.235, length 42
19:38:24.614443 fa:16:3e:c7:f7:e7 > fa:16:3e:91:a2:93, ethertype ARP (0x0806), length 42: Reply 172.16.2.243 is-at fa:16:3e:c7:f7:e7, length 28
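
The comparison of the two eth1.1105 captures can be sketched in Python as follows. This is only an illustration of the reasoning, not a tool that was used here: the regular expression assumes the `tcpdump -n -e` line format shown above, and the names `ARP_LINE` and `arp_kinds` are made up for this example.

```python
import re

# Matches the `tcpdump -n -e` ARP lines shown above (format assumed from them).
ARP_LINE = re.compile(
    r"^\S+ (?P<src>[0-9a-f:]{17}) > (?P<dst>[0-9a-f:]{17}), "
    r"ethertype ARP \(0x0806\), length \d+: (?P<kind>Request|Reply) "
)


def arp_kinds(capture: str) -> set[str]:
    """Return which ARP message kinds (Request/Reply) appear in a capture."""
    kinds = set()
    for line in capture.splitlines():
        match = ARP_LINE.match(line.strip())
        if match:
            kinds.add(match.group("kind"))
    return kinds


# eth1.1105 captures copied from the output above.
CAPTURE_1018 = """\
19:38:27.685914 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
19:38:28.709903 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
19:38:29.733871 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.243 tell 172.16.2.235, length 28
"""
CAPTURE_1023 = """\
19:38:24.614251 fa:16:3e:91:a2:93 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 56: Request who-has 172.16.2.243 tell 172.16.2.235, length 42
19:38:24.614443 fa:16:3e:c7:f7:e7 > fa:16:3e:91:a2:93, ethertype ARP (0x0806), length 42: Reply 172.16.2.243 is-at fa:16:3e:c7:f7:e7, length 28
"""

seen = {
    "cloudvirt1018": arp_kinds(CAPTURE_1018),
    "cloudvirt1023": arp_kinds(CAPTURE_1023),
}
# The reply is lost in transit if the replying side sees it but the requester does not.
reply_lost = "Reply" in seen["cloudvirt1023"] and "Reply" not in seen["cloudvirt1018"]
print(seen, "reply lost between hypervisors:", reply_lost)
```

The broadcast request shows up on both hosts, while the unicast reply only shows up on cloudvirt1023, so the loss happens somewhere between the two hypervisors' uplinks.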

GTirloni added a comment. Edited · Nov 9 2018, 8:08 PM

MAC addresses are in each bridge's table:

cloudvirt1018:~# brctl showmacs brq7425e328-56 | grep -E '(fa:16:3e:c7:f7:e7|fa:16:3e:91:a2:93)'
 11	fa:16:3e:91:a2:93	no		   0.16
  2	fa:16:3e:c7:f7:e7	no		   0.80

cloudvirt1023:~# brctl showmacs brq7425e328-56 | grep -E '(fa:16:3e:c7:f7:e7|fa:16:3e:91:a2:93)'
  2	fa:16:3e:91:a2:93	no		   0.23
 73	fa:16:3e:c7:f7:e7	no		   0.23

More debugging info about the bridge+interfaces:

cloudvirt1018$ virsh dumpxml i-0000014e
      <nova:name>kdc</nova:name>
      <target dev='tap378739fc-6c'/>

cloudvirt1018:~# brctl show brq7425e328-56 | grep tap378739fc-6c
							tap378739fc-6c


cloudvirt1018:~# brctl showstp brq7425e328-56
tap378739fc-6c (11)
 port id                800b                    state                forwarding
 designated root        8000.1866dafc9c59       path cost                100
 designated bridge      8000.1866dafc9c59       message age timer          0.00
 designated port        800b                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags    


cloudvirt1023$ virsh dumpxml i-00000156
      <nova:name>hadoop-worker-3</nova:name>
      <target dev='tap3d2f8e5b-a1'/>

cloudvirt1023:~# brctl show brq7425e328-56 | grep tap3d2f8e5b-a1
							tap3d2f8e5b-a1

cloudvirt1023:~# brctl showstp brq7425e328-56
tap3d2f8e5b-a1 (73)
 port id                8049                    state                forwarding
 designated root        8000.d094666117c8       path cost                100
 designated bridge      8000.d094666117c8       message age timer          0.00
 designated port        8049                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

Port #2 is the eth1.1105 interface:

eth1.1105 (2)
 port id                8002                    state                forwarding
 designated root        8000.1866dafc9c59       path cost                  4
 designated bridge      8000.1866dafc9c59       message age timer          0.00
 designated port        8002                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

So it seems the bridges know that, to reach the other VMs, they need to forward to port #2.
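
As a rough cross-check of that conclusion, the `brctl showmacs` output above can be parsed to confirm that each bridge has learned the remote VM's MAC address on port 2. This is a minimal Python sketch: `learned_ports`, `KDC_MAC`, `WORKER3_MAC` and `UPLINK_PORT` are names introduced only for this example, and the parser assumes the four-column layout shown in the output above.

```python
KDC_MAC = "fa:16:3e:91:a2:93"      # source MAC of kdc's ARP requests
WORKER3_MAC = "fa:16:3e:c7:f7:e7"  # MAC announced in worker-3's ARP reply
UPLINK_PORT = 2                    # eth1.1105, per the showstp output


def learned_ports(showmacs_output: str) -> dict[str, int]:
    """Map each MAC address to the bridge port it was learned on."""
    ports = {}
    for line in showmacs_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header; data rows have four fields
        # starting with a numeric port number.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        port, mac, _is_local, _age = fields
        ports[mac] = int(port)
    return ports


# Rows copied from the `brctl showmacs` output above.
SHOWMACS_1018 = """\
 11 fa:16:3e:91:a2:93 no 0.16
  2 fa:16:3e:c7:f7:e7 no 0.80
"""
SHOWMACS_1023 = """\
  2 fa:16:3e:91:a2:93 no 0.23
 73 fa:16:3e:c7:f7:e7 no 0.23
"""

ports_1018 = learned_ports(SHOWMACS_1018)
ports_1023 = learned_ports(SHOWMACS_1023)

# On each hypervisor the *remote* VM should be reachable through the uplink port.
print("cloudvirt1018 reaches worker-3 via uplink:", ports_1018[WORKER3_MAC] == UPLINK_PORT)
print("cloudvirt1023 reaches kdc via uplink:", ports_1023[KDC_MAC] == UPLINK_PORT)
```

Both checks hold for the tables shown, which suggests the Linux bridges themselves are forwarding correctly and the problem lies beyond the hypervisors.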

@ayounsi cleared the MAC address from the switch's Ethernet switching (MAC) table and it's now working:

@asw2-b-eqiad> clear ethernet-switching table fa:16:3e:91:a2:93

GTirloni closed this task as Resolved. Nov 9 2018, 8:51 PM
GTirloni triaged this task as Normal priority.