
Renumber office-DC interconnect link
Closed, Resolved · Public

Description

The office advertises 198.73.209.0/24 via BGP to both ulsfo and Zayo, and uses 198.73.209.240/28 (a subset of that /24) as the interconnect subnet between the DC and the office.

The office NAT is 198.73.209.241.

So when the DC link went down, traffic from the office to the DC correctly went through the Zayo DIA link. But even though the office's 198.73.209.0/24 was then only reachable from the DC via the Internet, that /28 still existed on the DC side as a directly connected interface/subnet, which is more specific than the /24. This caused the return traffic to any IP in that /28 (including the office NAT) to be blackholed.

This also explains why my tests to voip.corp.wikimedia.org worked fine (as it's outside that /28).
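The behaviour described above follows from longest-prefix matching. Here is a minimal sketch of that lookup, assuming a simplified two-entry DC-side routing table; the route labels and the 198.73.209.10 test address are illustrative assumptions, not values taken from the routers.

```python
import ipaddress

# Simplified DC-side view during the outage (labels are illustrative assumptions).
ROUTES = {
    ipaddress.ip_network("198.73.209.0/24"): "BGP, via the Internet",
    ipaddress.ip_network("198.73.209.240/28"): "directly connected interconnect",
}


def best_route(dst):
    """Return (network, label) for the most specific route containing dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, ROUTES[best]


# 198.73.209.241 is the office NAT; 198.73.209.10 is a hypothetical address
# standing in for any office IP outside the /28.
for dst in ("198.73.209.241", "198.73.209.10"):
    net, label = best_route(dst)
    print(f"{dst} -> {net} ({label})")
```

The NAT address matches both entries, and the /28 wins because it is more specific, so replies to it are sent towards the interconnect rather than the Internet path.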

The cleanest fix is to renumber that interconnect link to use a subnet outside of the /24 advertised by the office.

The ulsfo IP space has free IPs we can use; I picked 198.35.26.224/29. See the DNS CR below.
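As a quick sanity check of the chosen subnet, the sketch below uses Python's standard `ipaddress` module to confirm that the new /29 does not overlap the office /24 and to list its usable addresses. The variable names are assumptions made for illustration; the prefixes and the four addresses come from this task.

```python
import ipaddress

office = ipaddress.ip_network("198.73.209.0/24")
old_link = ipaddress.ip_network("198.73.209.240/28")
new_link = ipaddress.ip_network("198.35.26.224/29")

print(old_link.subnet_of(office))  # True: the old interconnect sits inside the /24
print(new_link.overlaps(office))   # False: the new one is outside it
print(new_link.netmask)            # 255.255.255.248

# Usable host addresses in the /29 (network and broadcast excluded).
usable = [str(h) for h in new_link.hosts()]
print(usable)

# Addresses used in the maintenance plan in this task.
planned = ["198.35.26.225", "198.35.26.226", "198.35.26.227", "198.35.26.228"]
print(all(ip in usable for ip in planned))  # True
```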

Event Timeline

ayounsi triaged this task as High priority. Oct 2 2018, 4:15 PM
ayounsi created this task.
Restricted Application added a project: Operations. Oct 2 2018, 4:15 PM
Restricted Application added a subscriber: Aklapper.

Change 463977 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Assign IPs for ulsfo-office interco

https://gerrit.wikimedia.org/r/463977

ayounsi updated the task description. Oct 2 2018, 4:22 PM

Change 463977 merged by Ayounsi:
[operations/dns@master] Assign IPs for ulsfo-office interco

https://gerrit.wikimedia.org/r/463977

Mentioned in SAL (#wikimedia-operations) [2018-10-17T09:34:42Z] <XioNoX> update interfaces and BGP IPs for office-DC link (DC side, interfaces still disabled) - T205985

ayounsi added a comment. Edited Oct 17 2018, 10:43 AM

Maintenance scheduled for Wednesday 10pm SF time (~2h).

I installed Quagga in a VM to verify the commands, but there will most likely be differences with the office Quagga.
The only thing that might need to be added is an iptables update.
All steps are easy to revert if complications occur.

Steps for the link renumbering:

  • Renumber DC-side interfaces and BGP sessions (keep interfaces deactivated)
  • Enable DC-side interfaces
  • Verify traffic to Wikimedia websites still works (as interfaces are renumbered)

Router2

  • Add new IP to router2 (office side): sudo ip addr add 198.35.26.228/29 dev eth2 (won't survive a restart)
  • Verify connectivity between the DC and that new router2 IP with ping
  • Update router2's Quagga configuration by editing /etc/quagga/bgpd.conf

and replacing 198.73.209.249 with 198.35.26.225 and 198.73.209.250 with 198.35.26.226

  • Restart the Quagga daemon: sudo service quagga restart
  • Verify BGP sessions are established and exchanging prefixes on the DC side

show bgp neighbor 198.35.26.228

  • Verify BGP sessions are established and exchanging prefixes on the Office side

sudo vtysh
show ip bgp summary

  • Verify with traceroute that Wikimedia websites are reached via the office/DC link
  • Make IP config permanent by adding it to /etc/network/interfaces

auto eth2:1
iface eth2:1 inet static
address 198.35.26.228
netmask 255.255.255.248

  • Restart networking process to verify it comes back up properly

sudo service networking restart
ip addr
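The bgpd.conf edit above is a plain address substitution. Here is a minimal sketch of it, working on a string and not on the real file. The sample configuration and its AS numbers are placeholders invented for illustration; the office's actual bgpd.conf will differ.

```python
import re

# Old -> new neighbor addresses, as listed in the bgpd.conf step above.
RENUMBER = {
    "198.73.209.249": "198.35.26.225",
    "198.73.209.250": "198.35.26.226",
}

# Match whole addresses only, so that e.g. 198.73.209.2490 is left untouched.
_PATTERN = re.compile(
    r"(?<![\d.])(" + "|".join(re.escape(ip) for ip in RENUMBER) + r")(?![\d.])"
)


def renumber(config_text):
    """Return config_text with every old address replaced by its new one."""
    return _PATTERN.sub(lambda m: RENUMBER[m.group(1)], config_text)


# Placeholder configuration: AS numbers are private-range stand-ins, not the real ones.
SAMPLE = """\
router bgp 64512
 neighbor 198.73.209.249 remote-as 64513
 neighbor 198.73.209.250 remote-as 64513
"""

print(renumber(SAMPLE))
```

Diffing the output against the original before restarting the daemon is a cheap way to confirm that only the two neighbor addresses changed.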

Router1

  • Add new IP to router1 (office side): sudo ip addr add 198.35.26.227/29 dev eth2 (won't survive a restart)
  • Verify connectivity between the DC and that new router1 IP with ping
  • Update router1's Quagga configuration by editing /etc/quagga/bgpd.conf

and replacing 198.73.209.249 with 198.35.26.225 and 198.73.209.250 with 198.35.26.226

  • Restart the Quagga daemon: sudo service quagga restart
  • Verify BGP sessions are established and exchanging prefixes on the DC side

show bgp neighbor 198.35.26.227

  • Verify BGP sessions are established and exchanging prefixes on the Office side

sudo vtysh
show ip bgp summary

  • Verify with traceroute that Wikimedia websites are reached via the office/DC link
  • Make IP config permanent by adding it to /etc/network/interfaces

auto eth2:1
iface eth2:1 inet static
address 198.35.26.227
netmask 255.255.255.248

  • Restart networking process to verify it comes back up properly (this might break Internet connectivity if there are any typos/issues)

sudo service networking restart
ip addr

  • Ensure Internet and Wikimedia websites are reachable from the office

Failover check

  • Unplug Zayo link, verify Internet and Wiki websites work
  • Unplug Office-DC link, verify Internet and Wiki websites work
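For the failover check, a small reachability probe run from an office machine can make the "verify" steps repeatable. The sketch below only tests whether a TCP connection can be opened; the target list is an assumption and should be replaced with whatever hosts represent "Internet" and "Wiki websites" for the test.

```python
import socket

# Hypothetical probe targets (host, port); adjust to taste.
TARGETS = [("en.wikipedia.org", 443), ("example.org", 443)]


def reachable(host, port, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
    conn.close()
    return True


def report(targets=TARGETS, probe=reachable):
    """Map each 'host:port' target to its reachability result."""
    return {f"{host}:{port}": probe(host, port) for host, port in targets}
```

Calling `report()` once before unplugging a link and once after gives a before/after comparison. It does not show which path the traffic took, so traceroute is still needed for that.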

Mentioned in SAL (#wikimedia-operations) [2018-10-18T05:05:03Z] <XioNoX> start office-DC link renumbering - T205985

ayounsi closed this task as Resolved. Oct 18 2018, 5:49 AM

The renumbering went as expected; BGP sessions are back up.
The failover tests were not done, as the exact links need to be properly identified on the switch stack.
They can be done at any other time, off hours.