
kafka-jumbo1006 and stat1005 network issues
Closed, Resolved, Public

Description

At about 19:45 UTC today, kafka-jumbo1006.eqiad.wmnet paged as DOWN. I cannot reach it over the network, but I am able to log in via the mgmt interface. Once logged in, I cannot reach anything through the main network interface.

The interface seems to be up:

root@kafka-jumbo1006:~# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7c brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7d brd ff:ff:ff:ff:ff:ff
4: eno3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7e brd ff:ff:ff:ff:ff:ff
5: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7f brd ff:ff:ff:ff:ff:ff
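
A minimal sketch of how the output above can be checked mechanically rather than by eye. The function name `parse_ip_link` and the result keys are made up for illustration; the only thing taken from the ticket is the `ip link show` output format.

```python
import re

# Header lines of `ip link show` look like:
#   2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... state UP ...
HEADER = re.compile(
    r"^\d+:\s+(?P<name>[^:@\s]+)(?:@\S+)?:\s+<(?P<flags>[^>]*)>"
    r".*?\bstate\s+(?P<state>\S+)"
)

def parse_ip_link(text):
    """Return {interface: {admin_up, carrier, state}} (hypothetical helper)."""
    links = {}
    for line in text.splitlines():
        m = HEADER.match(line)
        if not m:
            continue  # indented link/ether lines are skipped
        flags = set(filter(None, m.group("flags").split(",")))
        links[m.group("name")] = {
            "admin_up": "UP" in flags,        # interface administratively up
            "carrier": "LOWER_UP" in flags,   # physical link detected
            "state": m.group("state"),
        }
    return links

# Two interfaces copied from the output in this task.
SAMPLE = """\
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7c brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 18:66:da:fc:d2:7d brd ff:ff:ff:ff:ff:ff
"""

if __name__ == "__main__":
    for name, info in parse_ip_link(SAMPLE).items():
        print(name, info)
```

Note that `UP` plus `LOWER_UP` only says the link is up at the physical layer; it does not show that frames are actually being received.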

Cdanis linked me to a kafka-jumbo1006-old interface:
https://librenms.wikimedia.org/device/device=149/tab=port/port=12086
(Not sure if that is relevant.)

Help! :)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-03-12T20:49:01Z] <ottomata> kafka-jumbo1006 - stopping kafka and powercycling - T247561

https://librenms.wikimedia.org/device/device=149/tab=port/port=12085/
https://librenms.wikimedia.org/device/device=149/tab=port/port=12086/

stat1005 and kafka-jumbo1006 are on the same switch, and they started showing the same problem at around the same time. Moreover, stat1005's interface has been showing a large amount of broadcast traffic (close to 1G) since the issue started.

I did some tests and the two hosts are definitely related. I logged in as root on both via the mgmt console and turned off their interfaces, and stat1005's broadcast traffic went away. I have now brought stat1005's interface up again, leaving kafka-jumbo1006's down, and I still don't see any spike in broadcast traffic.

I brought both interfaces up again and I no longer see the broadcast traffic, so what I wrote above does not really hold anymore. One thing that I noticed though is that the two switch ports appear to be next to each other (ge-1/0/4 and ge-1/0/5), and the hosts are in the same rack. @Cmjohnson @Jclark-ctr is it possible that the cables got swapped for some reason?

The other host in D1, the rack of stat1005 and jumbo1006, is kafka-jumbo1008, one of the new ones: https://netbox.wikimedia.org/dcim/devices/2510/

I checked the latest changes made on the switch yesterday via:

elukey@asw2-d-eqiad> show system rollback compare 3 0
[edit interfaces interface-range vlan-private1-d-eqiad]
     member ge-6/0/5 { ... }
+    member ge-1/0/7;
[edit interfaces interface-range disabled]
-    member ge-1/0/7;
[edit interfaces ge-1/0/5]
-   enable;
[edit interfaces ge-1/0/6]
-   enable;
[edit interfaces]
+   ge-1/0/7 {
+       description kafka-jumbo1008;
+   }
[edit interfaces ge-4/0/11]
-   description db1062;
+   description kafka-jumbo1009;
-   enable;

Was the change to ge-1/0/5 made on purpose? The port looks up when I run show interface etc. on the router, but I am wondering whether the enable bit is needed or not.
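
To make the question above easier to answer at a glance, here is a rough sketch that groups the compare output by `[edit ...]` section and lists the sections that lost a given statement. The helper names are hypothetical, and the parser only assumes the diff layout shown in this task, not the full Junos syntax.

```python
def changes_by_section(diff_text):
    """Group '+'/'-' lines of a rollback compare under their [edit ...] header."""
    sections = {}
    current = None
    for raw in diff_text.splitlines():
        line = raw.strip()
        if line.startswith("[edit") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        elif current is not None and line[:1] in ("+", "-"):
            sections[current].append((line[0], line[1:].strip()))
    return sections

def removed_statement(sections, statement):
    """Return the sections from which `statement` was removed."""
    return [name for name, changes in sections.items()
            if ("-", statement) in changes]

# Diff copied from the compare output in this task.
DIFF = """\
[edit interfaces interface-range vlan-private1-d-eqiad]
     member ge-6/0/5 { ... }
+    member ge-1/0/7;
[edit interfaces interface-range disabled]
-    member ge-1/0/7;
[edit interfaces ge-1/0/5]
-   enable;
[edit interfaces ge-1/0/6]
-   enable;
[edit interfaces]
+   ge-1/0/7 {
+       description kafka-jumbo1008;
+   }
[edit interfaces ge-4/0/11]
-   description db1062;
+   description kafka-jumbo1009;
-   enable;
"""

if __name__ == "__main__":
    for section in removed_statement(changes_by_section(DIFF), "enable;"):
        print(section)
```

This only lists where `enable;` was dropped; it says nothing about whether the statement is required for the port to pass traffic.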

I've had a look as well. I've checked that the MAC address of kafka-jumbo1006 is the one the switch learns on that port, and it is. I've also bounced the port on both sides. After that I started a tcpdump session on kafka-jumbo1006 and it sees 0 traffic from the switch (including no LLDP, even though it is reportedly configured on that port). So for some reason the host does not get the traffic from the switch. The other direction seems to work (since the switch knows the MAC of the host).

The only things that make sense to me now are that the cable needs reseating or that it has somehow been damaged.
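
The one-way symptom described above (host transmits, nothing comes back) can also be spotted from the kernel's per-interface packet counters, without tcpdump. The sketch below assumes a Linux host exposing `/sys/class/net/<iface>/statistics/`; the function names and the labels returned by `classify` are invented for illustration.

```python
from pathlib import Path

def read_counters(iface, root="/sys/class/net"):
    """Read rx/tx packet counters for one interface from sysfs."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text())
            for name in ("rx_packets", "tx_packets")}

def classify(before, after):
    """Compare two counter samples and label the traffic direction seen."""
    rx = after["rx_packets"] - before["rx_packets"]
    tx = after["tx_packets"] - before["tx_packets"]
    if tx > 0 and rx == 0:
        return "tx-only"        # matches the symptom in this task
    if rx > 0 and tx == 0:
        return "rx-only"
    if rx == 0 and tx == 0:
        return "idle"
    return "bidirectional"

if __name__ == "__main__":
    import time
    first = read_counters("eno1")   # interface name taken from the ticket
    time.sleep(5)
    print(classify(first, read_counters("eno1")))
```

A "tx-only" result over a reasonably long interval on a port that should be receiving at least LLDP or broadcast frames would be consistent with the switch-to-host direction being broken.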

Mentioned in SAL (#wikimedia-operations) [2020-03-13T16:02:18Z] <elukey> powercycle kafka-jumbo1006 after switch port changed - T247561

Chris moved the servers to different ports, and for kafka-jumbo1006 it helped, since it is now serving traffic. stat1005 is still suffering from the same issue though.

Mentioned in SAL (#wikimedia-operations) [2020-03-13T20:02:19Z] <mutante> stat1005 - ip link set en01 down ; ip link set en01 up (T247561)

Fixed by @Papaul for kafka-jumbo1006. We saw recoveries for kafka lag on other machines all at once.

I had @Jclark-ctr replace the cable to stat1005: same issue. I also had him disconnect the cable while I was watching the switch: the interface went from up/up to up/down, and when he plugged the cable back in it went from up/down to up/up again. So the cable is good and it is the right interface on the switch as well.

Mentioned in SAL (#wikimedia-operations) [2020-03-14T08:33:45Z] <elukey> run kafka preferred-replica-election on kafka-jumbo1001 - T247561

elukey renamed this task from kafka-jumbo1006 network issues to kafka-jumbo1006 and stat1005 network issues. Mar 14 2020, 8:34 AM

@Papaul @Jclark-ctr can we try to move stat1005 to a different switch port again?

I had a chat with Arzhel today and we didn't find much. From his perspective, it seems that something between the switch and stat1005 is not working (traffic leaves stat1005 and reaches the switch, but not the other way around). Since the cable was replaced, the next step should be to move to the first free port on the switch and see if that works.

elukey claimed this task.

stat1005 is back; John and Papaul switched it to port /43.