
Spread eqiad analytics Kafka nodes to multiple racks and rows
Closed, ResolvedPublic

Description

After a chat with @ayounsi, @faidon and @Ottomata, we decided that it would be better to move a couple of Kafka nodes (one at a time) to different racks/rows before moving the row D racks.

The main problem is that:

  1. kafka1018/kafka1020 are in Rack D2 - https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1562
  2. kafka1012/kafka1013 are in Rack A2 - https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1556
  3. kafka1014/kafka1022 are in Rack C7 - https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1558

While it should be fine to lose a couple of nodes at the same time in a six-node cluster, it might be something that we don't want to experiment with while doing network maintenance :)
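To illustrate the risk with a small sketch (the partition's replica assignment below is hypothetical and only for illustration; the broker-to-rack mapping is taken from the rack list above): with replication factor 3, a partition that happens to have two of its three replicas in rack D2 is left with a single live copy while D2 is down.

```python
# Illustrative sketch only -- the replica assignment is made up, not read
# from the real cluster. Broker-to-rack mapping comes from the task
# description above.

RACKS = {
    "kafka1018": "D2", "kafka1020": "D2",
    "kafka1012": "A2", "kafka1013": "A2",
    "kafka1014": "C7", "kafka1022": "C7",
}

def surviving_replicas(replica_list, down_brokers):
    """Count the replicas of one partition that are still online."""
    return sum(1 for broker in replica_list if broker not in down_brokers)

# Hypothetical partition with two of its three replicas in rack D2:
partition_replicas = ["kafka1018", "kafka1020", "kafka1012"]
d2_down = {broker for broker, rack in RACKS.items() if rack == "D2"}
print(surviving_replicas(partition_replicas, d2_down))  # prints 1
```

With only one replica left, any further broker loss during the maintenance window would make the partition unavailable, which is exactly why spreading the pair out of D2 first is attractive.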

So for the purpose of the Row D migration of the new stack, I'd ask, if possible, to migrate kafka1020 from D2 to BX before doing any maintenance on D2.

Afterwards, a very nice goal would be to move kafka1014 and kafka1012 to different racks.

@Cmjohnson really sorry to put more work on your shoulders, let me know if this will cause a lot of trouble or not. Thanks!

@faidon, @ayounsi - any issue from the network capabilities perspective to move a kafka node in row B?

Details

Related Gerrit Patches:
operations/dns : masterAdd analytics1-b-eqiad IPv6 stanza

Event Timeline

elukey created this task.Apr 14 2017, 2:51 PM

if possible to migrate kafka1022

I believe you mean kafka1020

any issue from the network capabilities perspective to move a kafka node in row B?

Row B7 and B8 have plenty of rack and switch space

elukey updated the task description. (Show Details)Apr 20 2017, 7:55 AM

if possible to migrate kafka1022

I believe you mean kafka1020

Definitely, fixed the task's description.

any issue from the network capabilities perspective to move a kafka node in row B?

Row B7 and B8 have plenty of rack and switch space

Thanks!

@Cmjohnson: any issue from your side?

@elukey move kafka1020 to row B8? Just want to be clear. I do not have an issue with this.

@Cmjohnson yep exactly! But it should be done before the 26th; the major goal is to avoid losing two Kafka nodes for extended maintenance at the same time.

Would it be possible to do it tomorrow or before the D2 rack maintenance? @Ottomata should be able to help and coordinate the node shutdown.

@Ottomata I would like to do this first thing in the morning (0830) 04/26, before the racks are shut down. I will update this task with the switch port info, new IP address, etc. Right before you power it off, update the network settings in /etc/network/interfaces, then I will move it.
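For reference, the /etc/network/interfaces change mentioned above would look roughly like the sketch below. The address and gateway are placeholders, to be filled in from the switch port and IP allocation that will be posted on this task:

```
# Hypothetical stanza -- actual row B values to come from this task
auto eth0
iface eth0 inet static
    address <new-row-B-IPv4>/24   # allocation to be posted on this task
    gateway <row-B-gateway>
```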

Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured. There is no IPv6 VLAN setup in Row B. I'm not sure what will happen if we move this node over and only allocate IPv4 for it. It should be fine, but if it isn't, we'd have to move it back to Row D quick.

Instead of moving this into B, we could move it into a row where there is Analytics VLAN IPv6, like Row C. This would put 3 brokers into Row C, making Row C a SPOF while this broker is there. We could then move it back to Row D once the networking outage is over.

Or, we could just leave this node in Row D and deal with 2 brokers being out. As long as we don't lose any other brokers during that outage, everything should be fine.

I'm inclined to leave this broker in Row D.

Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured. There is no IPv6 VLAN setup in Row B. I'm not sure what will happen if we move this node over and only allocate IPv4 for it. It should be fine, but if it isn't, we'd have to move it back to Row D quick.

If there is no IPv6 it would be a big problem, since the mw appservers are using it to contact the Kafka analytics cluster (as found in T157435), and falling back to IPv4 is not that graceful and could probably cause higher latencies.

Are we sure that there is no IPv6 for Row B? I checked one analytics host in row B and it seems to get a valid global IPv6 address:

elukey@analytics1050:~$ ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 44:a8:42:25:02:7b brd ff:ff:ff:ff:ff:ff
    inet 10.64.21.111/24 brd 10.64.21.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:105:46a8:42ff:fe25:27b/64 scope global mngtmpaddr dynamic
       valid_lft 2591976sec preferred_lft 604776sec
    inet6 fe80::46a8:42ff:fe25:27b/64 scope link
       valid_lft forever preferred_lft forever
elukey@analytics1050:~$ sudo lldpcli show neighbors | grep SysName
    SysName:      asw-b-eqiad

Is there anything that I am missing?

Instead of moving this into B, we could move it into a row where there is Analytics VLAN IPv6, like Row C. This would put 3 brokers into Row C, making Row C a SPOF while this broker is there. We could then move it back to Row D once the networking outage is over.

Given that the SPOF will be temporary (and better than having two nodes down), I'd vote for this one if row-b is not viable.

Or, we could just leave this node in Row D and deal with 2 brokers being out. As long as we don't lose any other brokers during that outage, everything should be fine.

Paranoid Luca doesn't like this option, but if you feel strongly about it I'll shut up :)

Change 350381 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns@master] Add analytics1-b-eqiad IPv6 stanza

https://gerrit.wikimedia.org/r/350381

analytics1-b-eqiad IPv6 does exist and it's 2620:0:861:105::/64. I've updated the DNS templates in the patch above to reflect that, but that patch is only informational, so feel free to move ahead with the move to row B.

Change 350381 merged by Alexandros Kosiaris:
[operations/dns@master] Add analytics1-b-eqiad IPv6 stanza

https://gerrit.wikimedia.org/r/350381

@Ottomata @elukey Do you still want to spread out the nodes or okay to resolve this task?

I'd like to hear what elukey thinks, but I think we can resolve this. This Kafka hardware is slated to be decommed anyway, but not until Q1 in next FY.

Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board.Apr 27 2017, 8:27 PM
elukey closed this task as Resolved.Apr 28 2017, 10:05 AM

I'd love to do it anyway, but Chris is super busy and this is only a "good to have" for the moment, so I am inclined to close :)