Enable L3 routing on cloudsw nodes
Closed, ResolvedPublic

Description

From https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh/Implementation_details#stage_2:_enable_L3_routing_on_cloudsw_nodes


Baseline configuration

  • Cloudsw vlans (L2) - 1102, 1103, 1104, 1120
  • iBGP and OSPF between cloudsw
  • eBGP between core routers and cloudsw (185.15.56.0/24 and 172.16.0.0/21, receive 0/0)
  • Static route for 185.15.56.0/25 and 172.16.0.0/21 on cloudsw
  • Firewall filters - lo, cloud-in4 (on core routers)
  • Test connectivity

cloud-instances-transport migration (downtime required [!])

  • Ensure cr1 is VRRP master for all vlans, including 1120
  • Move cr2:ae2.1120 to cloudsw1-d5:irb.1120
  • Test cr1:ae2.1120 to cloudsw1-d5:irb.1120 connectivity (and VRRP sync)
  • Ensure static routes are active and aggregates propagated on cloudsw1-d5
  • [!] Move vlan 1120 VRRP master to cloudsw1-d5:irb.1120
  • [!] Remove static routes for 185.15.56.0/25 and 172.16.0.0/21 on core routers
  • Test connectivity
  • Move cr1:ae2.1120 to cloudsw1-c8:irb.1120
  • Cleanup (remove passive OSPF, trunked vlans, update Netbox)

Renumber cloud-instances-transport (downtime required [!]) (similar to T207663)

  • Allocate IPs
  • Configure 185.15.56.240/29 IPs on all devices
  • [!] Reconfigure cloudnet with new gateway IP (to be confirmed)
  • [!] Update static routes on cloudsw to point to new VIP
  • Cleanup 208.80.155.88/29 IPs and advertisement (+Netbox)

Event Timeline

ayounsi triaged this task as Medium priority. Oct 12 2020, 2:47 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 633683 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add cloud-in4 filters to cloudsw interfaces

https://gerrit.wikimedia.org/r/633683

Change 633683 merged by jenkins-bot:
[operations/homer/public@master] Add cloud-in4 filters to cloudsw interfaces

https://gerrit.wikimedia.org/r/633683

@aborrero when would be a good time to schedule those changes, knowing that a short downtime is needed?
Probably 2x5min, but I'd schedule 30min in case more issues arise.

This change impacts Toolforge (NFS, databases, etc.). We want to reduce the downtime (fail over services, etc.). For this it would be good if we could do this operation during the EU/US overlapping time.
Also, we would like to announce the operation window to the community 1 week prior.

So, @Bstorm will take a look at the calendar and try to propose something soon. Do you @ayounsi have any specific timeline or requirement in mind?

I'm ready, so any time that works for you.

Looking at our calendars, I think Tuesday, October 27 might work.

Please @Bstorm @ayounsi ack this window, then I can send an announcement to the community.

Note October 27 is also scheduled for the MediaWiki datacenter switchback -- please let's not have both events going at the same time. :) The switchback is scheduled to wrap up by 15:00 UTC, but it's possible we'll still be doing cleanup work for a while after, depending on how things go.

That's fair. I will try proposing a new date tomorrow.

New proposed date: 2020-11-03.

Would another day in the October 26th week be possible otherwise?
I want to make sure we have time to schedule follow-up work if something doesn't go as planned and we have to reschedule, as I'll be less available after Nov. 14th.

Mentioned in SAL (#wikimedia-operations) [2020-10-29T16:34:56Z] <XioNoX> force VRRP master on cr1-eqiad - T265288

Mentioned in SAL (#wikimedia-operations) [2020-10-29T16:38:35Z] <XioNoX> Move cr2-eqiad:ae2.1120 to cloudsw1-d5:irb.1120 - T265288

Mentioned in SAL (#wikimedia-operations) [2020-10-29T16:59:14Z] <XioNoX> Delete cr1-eqiad:ae2.1120 and related static routes - T265288

ayounsi updated the task description.

Will work on scheduling the renumber of cloud-instances-transport soon.

Regarding the renumber, a couple of things:

  • we renumbered this subnet from 10.64.22.0/24 to 208.80.155.88/29 in T207663 about 2 years ago.
  • we are now trying to renumber this subnet from 208.80.155.88/29 to 185.15.56.240/29.
  • I wonder why we didn't use this new subnet in the first place, 2 years ago? Could we clarify why this renumber is important or relevant at this point?
  • Could we allocate the mirror range for codfw1dev? Would it be 185.15.57.16/29 https://netbox.wikimedia.org/ipam/prefixes/354/ ?
  • the cloudgw project we are currently evaluating (see T261724) might change this subnet again. Perhaps it would be wise to hold on this subnet renumber until it is clear whether we can couple the change together with introducing cloudgw.

As explained previously on IRC,

208.80.155.88/29 is part of the eqiad IP space, 185.15.56.240/29 is part of the WMCS IP space.

When a "customer" connects to a provider network, the interco IPs use IPs from the provider. This makes it possible to aggregate the IPs, control security, routing, etc.

Previously the boundary between prod and WMCS was between the core routers and cloudnet, which was using the 208.80.155.88/29 space. Now that the boundary is between the core routers and the cloudsw routers, the left orange links (/31s) are the interco IPs. This means 208.80.155.88/29 (part of prod) is now "inside WMCS", between cloudsw and cloudnet, which requires special routing policies to accommodate this special case.

See:

WMCS_network-L2_L3.png (590×1 px, 144 KB)

Because of that, I'd rather not wait before doing that renumbering.
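To make the "inside WMCS" point concrete, here is a minimal sketch that checks which transport subnet falls under the WMCS aggregate. It assumes the WMCS space can be represented by the 185.15.56.0/24 prefix listed in the baseline eBGP configuration; the constant and function names are made up for illustration.

```python
from ipaddress import ip_network

# Assumption: the WMCS IP space is represented here by the 185.15.56.0/24
# aggregate mentioned in the baseline eBGP configuration of this task.
WMCS_AGGREGATE = ip_network("185.15.56.0/24")


def inside_wmcs(cidr: str) -> bool:
    """Return True if the given prefix is fully covered by the WMCS aggregate."""
    return ip_network(cidr).subnet_of(WMCS_AGGREGATE)


if __name__ == "__main__":
    # Old transport subnet (eqiad prod space) and new one (WMCS space).
    for cidr in ("208.80.155.88/29", "185.15.56.240/29"):
        print(cidr, "inside WMCS aggregate:", inside_wmcs(cidr))
```

A prefix that fails this check cannot be folded into the aggregate, which is why the old subnet needed its own routing policy entries.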

Could we allocate the mirror range for codfw1dev? Would it be 185.15.57.16/29

That's correct.

Change 638425 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Core routing for WMCS via cloudsw

https://gerrit.wikimedia.org/r/638425

Change 638425 merged by jenkins-bot:
[operations/homer/public@master] Core routing for WMCS via cloudsw

https://gerrit.wikimedia.org/r/638425

Thanks for the explanations @ayounsi .

the cloudgw project we are currently evaluating (see T261724) might change this subnet again. Perhaps it would be wise to hold on this subnet renumber until it is clear whether we can couple the change together with introducing cloudgw.

Given the short timelines for SRE in this case, I'm ok with changing now and again later if it comes to it. I do believe that the renumbering is somewhat impactful to WMCS services, right @aborrero ? So, while less ideal if we need to change again, I don't want to miss the timeline window SRE has for this quarter which is rapidly closing.

Can we plan this for next Monday or Tuesday? Perhaps 2020-11-09 @ 1300 UTC ? It's a short week, but I believe it will be harder to complete after next week for SRE.

Next Monday works for me!

Yes, it is an impactful change. Our "best effort best practices" include sending an announcement to the community 1 week prior to impactful maintenance operations.

For the D day:

# create new subnet
root@cloudcontrol1004:~# neutron subnet-create --gateway 185.15.56.241 --name cloud-instances-transport1-b-eqiad1 --ip-version 4 --disable-dhcp wan-transport-eqiad 185.15.56.240/29

# switch gateway (service impact)
root@cloudcontrol1004:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-eqiad1,ip_address=185.15.56.244 cloudinstances2b-gw wan-transport-eqiad

# check ports in router
root@cloudcontrol1004:~# neutron router-port-list cloudinstances2b-gw

# cleanup if all is correct 
root@cloudcontrol1004:~# neutron subnet-delete dcbb0f98-5e9d-4a93-8dfc-4e3ec3c44dcc

In case of rollback:

# In case of rollback, recreate the old subnet if it was already cleaned up
root@cloudcontrol1004:~# neutron subnet-create --gateway 208.80.155.89 --name cloud-instances-transport1-b-eqiad --ip-version 4 --disable-dhcp wan-transport-eqiad 208.80.155.88/29

# In case of rollback, switch again to the old gateway (service impact)
root@cloudcontrol1004:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-eqiad,ip_address=208.80.155.92 cloudinstances2b-gw wan-transport-eqiad

# In case of rollback, check ports in router
root@cloudcontrol1004:~# neutron router-port-list cloudinstances2b-gw

# In case of rollback, cleanup new subnet, ID unknown at the time of this writing
root@cloudcontrol1004:~# neutron subnet-delete $ID
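Before running either set of commands, the addresses can be sanity-checked offline. The sketch below only verifies that the gateway and the router's fixed IP are distinct usable hosts of the subnet they are paired with; it does not talk to Neutron, and the function name is hypothetical.

```python
from ipaddress import ip_address, ip_network


def plan_is_consistent(cidr: str, gateway: str, router_ip: str) -> bool:
    """Check that gateway and router IP are distinct usable hosts of cidr."""
    hosts = set(ip_network(cidr).hosts())  # excludes network/broadcast
    gw, router = ip_address(gateway), ip_address(router_ip)
    return gw in hosts and router in hosts and gw != router


if __name__ == "__main__":
    # Values copied from the planned and rollback commands above.
    plans = {
        "new": ("185.15.56.240/29", "185.15.56.241", "185.15.56.244"),
        "rollback": ("208.80.155.88/29", "208.80.155.89", "208.80.155.92"),
    }
    for name, args in plans.items():
        print(name, "plan consistent:", plan_is_consistent(*args))
```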

If all is OK, refresh docs on:

@ayounsi I made a diagram based on yours to update the one in our docs (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron)

Please verify these are the correct IP addresses and static routes:

Old diagram (it doesn't even include cloudsw):

image.png (612×1 px, 75 KB)

New diagram:

CloudVPS current edge network state.png (682×2 px, 101 KB)

Also, I don't see the CIDR object created in netbox for 185.15.56.240/29 . Could you please create it? Or I can create it, whatever you prefer!

https://netbox.wikimedia.org/search/?q=185.15.56.240%2F29+

Created https://netbox.wikimedia.org/ipam/prefixes/359/

Your diagram and commands (at least what I understand of them) LGTM. I defined the IPs in https://netbox.wikimedia.org/ipam/prefixes/359/ip-addresses/ as well.

Mentioned in SAL (#wikimedia-operations) [2020-11-09T10:36:35Z] <XioNoX> add 185.15.56.240/29 IPs to relevant cloudsw interfaces - T265288

Change 640093 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Remove 208.80.155.88/29 from cloud4 prefix list

https://gerrit.wikimedia.org/r/640093

To be pushed on both cloudsw at the same time as the neutron commands (or at least at the same time as the one deleting any 208.80.155. IPs on the Neutron side).

both cloudsw
[edit routing-options static route 172.16.0.0/21]
-    next-hop 208.80.155.92;
+    next-hop 185.15.56.244;
[edit routing-options static route 185.15.56.0/25]
-    next-hop 208.80.155.92;
+    next-hop 185.15.56.244;
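
As an illustration of what these two static routes cover after the change, here is a rough longest-prefix-match sketch. It models only the two static routes from the diff (directly connected and BGP routes are deliberately left out), and the table and function names are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Only the two static routes from the diff, with their new next-hop.
STATIC_ROUTES = {
    ip_network("172.16.0.0/21"): ip_address("185.15.56.244"),
    ip_network("185.15.56.0/25"): ip_address("185.15.56.244"),
}


def next_hop(destination: str, routes=STATIC_ROUTES):
    """Return the next-hop of the most specific matching static route, or None."""
    dest = ip_address(destination)
    matches = [net for net in routes if dest in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


if __name__ == "__main__":
    for dest in ("172.16.1.10", "185.15.56.10", "185.15.56.244"):
        print(dest, "->", next_hop(dest))
```

Note that the next-hop itself (185.15.56.244) matches neither static route, since 185.15.56.0/25 stops at .127; it has to be reachable through the directly connected 185.15.56.240/29 on irb.1120.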

Once all confirmed good, cleanup:

cloudsw1-c8
[edit interfaces irb unit 1120 family inet]
-       address 208.80.155.90/29 {
-           vrrp-group 121 {
-               virtual-address 208.80.155.89;
-               accept-data;
-           }
-       }
[edit policy-options prefix-list bgp-out]
-    208.80.155.88/29;
[edit policy-options policy-statement BGP_outfilter term aggregates_out from]
-     protocol [ aggregate direct static ];
+     protocol [ aggregate static ];
cloudsw1-d5
[edit interfaces irb unit 1120 family inet]
-       address 208.80.155.91/29 {
-           vrrp-group 121 {
-               virtual-address 208.80.155.89;
-               priority 70;
-               accept-data;
-           }
-       }
[edit policy-options prefix-list bgp-out]
-    208.80.155.88/29;
[edit policy-options policy-statement BGP_outfilter term aggregates_out from]
-     protocol [ aggregate direct static ];
+     protocol [ aggregate static ];

Push https://gerrit.wikimedia.org/r/c/operations/homer/public/+/640093
Cleanup old IPs and prefix from Netbox https://netbox.wikimedia.org/ipam/prefixes/289/
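
The two VRRP stanzas being removed differ only in that cloudsw1-d5 sets priority 70 while cloudsw1-c8 sets none. Here is a minimal sketch of how mastership would be decided under that configuration, assuming an unset priority means the VRRP default of 100 and that the higher priority wins, with the higher interface address as tie-breaker. Preemption and timers are ignored, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

# Assumption: a router without an explicit priority uses the VRRP default.
DEFAULT_PRIORITY = 100


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    address: IPv4Address
    priority: int = DEFAULT_PRIORITY


def elect_master(routers):
    """Pick the router with the highest priority; higher address breaks ties."""
    return max(routers, key=lambda r: (r.priority, r.address))


if __name__ == "__main__":
    # Addresses and the explicit priority are taken from the diff above.
    group_121 = [
        VrrpRouter("cloudsw1-c8", IPv4Address("208.80.155.90")),
        VrrpRouter("cloudsw1-d5", IPv4Address("208.80.155.91"), priority=70),
    ]
    print("expected master:", elect_master(group_121).name)
```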

Change 640096 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] cloud: refresh cloudinstances2b-gw router address

https://gerrit.wikimedia.org/r/640096

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:15:37Z] <arturo> icinga-downtime toolschecker for 2h (T265288)

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:19:16Z] <arturo> root@cloudcontrol1005:~# neutron subnet-create --gateway 185.15.56.241 --name cloud-instances-transport1-b-eqiad1 --ip-version 4 --disable-dhcp wan-transport-eqiad 185.15.56.240/29 (T265288)

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:19:52Z] <arturo> subnet 185.15.56.240/29 has id 7c6bcc12-212f-44c2-9954-5c55002ee371 in neutron (T265288)

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:40:58Z] <arturo> root@cloudcontrol1005:~# neutron router-gateway-set --fixed-ip subnet_id=7c6bcc12-212f-44c2-9954-5c55002ee371,ip_address=185.15.56.244 cloudinstances2b-gw wan-transport-eqiad (T265288)

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:41:28Z] <arturo> root@cloudcontrol1005:~# neutron subnet-delete dcbb0f98-5e9d-4a93-8dfc-4e3ec3c44dcc (T265288)

Mentioned in SAL (#wikimedia-cloud) [2020-11-09T12:42:11Z] <arturo> restarted neutron l3 agent in cloudnet1003 bc it still had the old default route (T265288)

Change 640093 merged by jenkins-bot:
[operations/homer/public@master] Remove 208.80.155.88/29 from cloud4 prefix list

https://gerrit.wikimedia.org/r/640093

Thanks, everything is done!
For the record, we had a 30s then ~4min connectivity interruption.

Change 640096 merged by Arturo Borrero Gonzalez:
[operations/dns@master] cloud: refresh cloudinstances2b-gw router address

https://gerrit.wikimedia.org/r/640096

Change 640149 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove 208.80.155.88/29 from DNS

https://gerrit.wikimedia.org/r/640149

Change 640149 merged by Ayounsi:
[operations/dns@master] Remove 208.80.155.88/29 from DNS

https://gerrit.wikimedia.org/r/640149