
Route cloud-hosts1-b-eqiad vlan through cloudsw
Closed, Resolved, Public

Description

Stage 1 of https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh

Scheduled for Thursday Sept. 3rd, 9am UTC.

  • Force VRRP master on cr1-eqiad
cr1-eqiad
[edit interfaces ae2 unit 1118 family inet address 10.64.20.2/24 vrrp-group 118]
+        priority 200;
[edit interfaces ae2 unit 1118 family inet6 address 2620:0:861:118:fe00::1/64 vrrp-inet6-group 118]
+        priority 200;

show vrrp interface ae2.1118 | match State:
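VRRP elects as master the router advertising the highest priority, and on a tie the router with the higher primary IP address wins (RFC 5798). Junos uses a priority of 100 when none is configured, so raising cr1-eqiad to 200 makes it win. The Python sketch below models only the election outcome (no timers or preemption). It assumes that cr2-eqiad keeps the default priority of 100, because its configured priority is not shown in this task. `VrrpRouter` and `elect_master` are hypothetical names.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    priority: int            # 1-254 unless the router owns the virtual IP (255)
    primary_ip: IPv4Address  # the router's own address on the vlan


def elect_master(routers: list[VrrpRouter]) -> VrrpRouter:
    """Return the election winner: highest priority, then highest primary IP."""
    return max(routers, key=lambda r: (r.priority, r.primary_ip))


# cr1 address from the config diff, cr2 address from the ping check.
cr1 = VrrpRouter("cr1-eqiad", 200, IPv4Address("10.64.20.2"))
cr2 = VrrpRouter("cr2-eqiad", 100, IPv4Address("10.64.20.3"))  # assumed default priority
print(elect_master([cr1, cr2]).name)
```

Under the same assumption, the later step that lowers cr1-eqiad to priority 70 flips the outcome to cr2-eqiad.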

  • Move inet/inet6 configuration on cr2-eqiad from ae2.1118 to xe-3/0/4.1118

Enable: https://netbox.wikimedia.org/dcim/interfaces/2146/
Enable: https://netbox.wikimedia.org/dcim/interfaces/7663/
Rename: https://netbox.wikimedia.org/dcim/interfaces/9010/ to xe-3/0/4.1118
Run Homer
cloudsw1-d5-eqiad# delete interfaces xe-0/0/0 disable
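Renaming the logical interface in Netbox and running Homer re-parents unit 1118. Its inet/inet6 configuration is unchanged and only the physical interface underneath it changes. The toy Python model below uses a hypothetical nested-dict representation of the router configuration. This is not Netbox's or Homer's data model, and `split_ifname` and `move_unit` are illustrative names.

```python
import copy

# physical interface -> unit number -> address family -> addresses
Config = dict[str, dict[int, dict[str, list[str]]]]


def split_ifname(name: str) -> tuple[str, int]:
    """Split a logical interface name such as 'xe-3/0/4.1118' into ('xe-3/0/4', 1118)."""
    physical, _, unit = name.rpartition(".")
    if not physical:
        raise ValueError(f"no logical unit in {name!r}")
    return physical, int(unit)


def move_unit(config: Config, src: str, dst: str) -> Config:
    """Return a copy of config with the families of src moved onto dst."""
    src_if, src_unit = split_ifname(src)
    dst_if, dst_unit = split_ifname(dst)
    new = copy.deepcopy(config)  # leave the caller's config untouched
    dst_units = new.setdefault(dst_if, {})
    if dst_unit in dst_units:
        raise ValueError(f"{dst} already has configuration")
    # pop() detaches the unit from its old parent; other units on it are kept.
    dst_units[dst_unit] = new[src_if].pop(src_unit)
    return new


cr2 = {"ae2": {1118: {"inet": ["10.64.20.3/24"]}}}
print(move_unit(cr2, "ae2.1118", "xe-3/0/4.1118"))
```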

  • Check if IP is working as expected

Reachability from cr1:ae2.1118 to cr2:xe-3/0/4.1118
cr1-eqiad> ping 10.64.20.3 source 10.64.20.2
VRRP state sharing between cr1 and cr2
cr1-eqiad> show vrrp interface ae2.1118
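The VRRP check above confirms that cr1 and cr2 still see each other's advertisements over the new path. If they do not, each router considers itself master (split brain). The Python sketch below evaluates the states as read from each router. It assumes the states were collected by hand from `show vrrp` output, and it does not parse that output. `check_vrrp_states` is a hypothetical helper.

```python
def check_vrrp_states(states: dict[str, str]) -> list[str]:
    """Return a list of problems for one VRRP group; an empty list means healthy.

    states maps router name -> VRRP state reported by that router,
    e.g. {"cr1-eqiad": "master", "cr2-eqiad": "backup"}.
    """
    normalized = {name: state.strip().lower() for name, state in states.items()}
    problems = []
    masters = sorted(n for n, s in normalized.items() if s == "master")
    if not masters:
        problems.append("no router is master")
    elif len(masters) > 1:
        # Typically means advertisements are not crossing the link.
        problems.append(f"multiple masters: {', '.join(masters)}")
    for name in sorted(normalized):
        if normalized[name] not in ("master", "backup"):
            problems.append(f"{name} in unexpected state {normalized[name]!r}")
    return problems


print(check_vrrp_states({"cr1-eqiad": "master", "cr2-eqiad": "backup"}))
```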

  • Move VRRP mastership to cr2
cr1-eqiad
[edit interfaces ae2 unit 1118 family inet address 10.64.20.2/24 vrrp-group 118]
+        priority 70;
[edit interfaces ae2 unit 1118 family inet6 address 2620:0:861:118:fe00::1/64 vrrp-inet6-group 118]
+        priority 70;
  • Check reachability of cloud-hosts devices
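The reachability check in this step could be scripted as a loop over the hosts in the vlan. The Python sketch below assumes Linux iputils `ping` (`-c 1 -W 1` sends one echo and waits one second). The host list is left to the operator, and `icmp_probe` and `unreachable` are hypothetical names. The probe is injectable so that the logic can be tried without a network.

```python
import subprocess
from typing import Callable, Iterable


def icmp_probe(host: str) -> bool:
    """Send one ICMP echo with a 1 s timeout; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable(hosts: Iterable[str], probe: Callable[[str], bool] = icmp_probe) -> list[str]:
    """Return the hosts for which probe() failed, in input order."""
    return [h for h in hosts if not probe(h)]
```

With an empty result the step passes. Any returned host points at a path that the move has broken.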
  • Move inet/inet6 configuration on cr1-eqiad from ae2.1118 to xe-3/0/4.1118

Enable: https://netbox.wikimedia.org/dcim/interfaces/7651/
Enable: https://netbox.wikimedia.org/dcim/interfaces/2082/
Rename: https://netbox.wikimedia.org/dcim/interfaces/9033/ to xe-3/0/4.1118
Remove the VRRP group from the vlan 1118 VRRP IPs (see T260363)
Edit homer-public to change the interface that references labs-in6
Run Homer
cloudsw1-c8-eqiad# delete interfaces xe-0/0/0 disable

  • Check if IP is working as expected
  • Cleanup (update OSPF/bootp)

Event Timeline

ayounsi triaged this task as Medium priority. Sep 2 2020, 1:45 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:01:51Z] <XioNoX> force ae2.1118 VRRP master on cr1-eqiad - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:06:19Z] <XioNoX> move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:13:30Z] <XioNoX> rolled back: move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866

Mentioned in SAL (#wikimedia-cloud) [2020-09-03T09:31:50Z] <arturo> downtime cloud* servers for 30 mins (T261866)

Mentioned in SAL (#wikimedia-cloud) [2020-09-03T09:31:57Z] <arturo> icinga downtime cloud* servers for 30 mins (T261866)

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:38:18Z] <XioNoX> move vlan 1118 IPv6 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:46:31Z] <XioNoX> move vlan 1118 IPv4 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:48:01Z] <XioNoX> move VRRP master from cr1-eqiad:ae2.1118 to cr2-eqiad:xe-3/0/4.1118 - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:56:43Z] <XioNoX> move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866

Mentioned in SAL (#wikimedia-operations) [2020-09-03T09:57:05Z] <XioNoX> rectification: move vlan 1118 from ae2.1118 to xe-3/0/4.1118 on cr1-eqiad - T261866

Change 623988 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Rename ae2.1118 to xe-3/0/4.1118

https://gerrit.wikimedia.org/r/623988

Mentioned in SAL (#wikimedia-operations) [2020-09-03T10:07:47Z] <XioNoX> re-apply vlan 1118 firewall filter and update OSPF/bootp on cr1/2-eqiad - T261866

Change 623995 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Only use VRRP bandwidth-threshold on ae links

https://gerrit.wikimedia.org/r/623995

Change 623988 merged by jenkins-bot:
[operations/homer/public@master] Rename ae2.1118 to xe-3/0/4.1118

https://gerrit.wikimedia.org/r/623988

Change 623995 merged by jenkins-bot:
[operations/homer/public@master] Only use VRRP bandwidth-threshold on ae links

https://gerrit.wikimedia.org/r/623995

There was one issue: the cr2-eqiad-facing interface on cloudsw1-d5 was misconfigured (set up as an L3 interface instead of an L2 trunk).

Even though the VRRP master for that vlan was cr1-eqiad, all the other vlans had their default gateway on cr2-eqiad. Traffic from, for example, icinga therefore tried to enter vlan 1118 through the cr2<->cloudsw1-d5 link and was blackholed.
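VRRP mastership only decides which router forwards traffic leaving vlan 1118. Traffic entering the vlan from elsewhere is routed by the source's own gateway, which reaches the subnet through its directly connected interface. The toy Python model below illustrates that failure. It uses hypothetical names (`SwitchPort`, `path_into_vlan`) and a deliberately simplified topology with one switch port facing each router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPort:
    mode: str               # "l2-trunk" or "l3"
    vlans: frozenset[int]   # vlans carried when the port is an L2 trunk

    def carries(self, vlan: int) -> bool:
        return self.mode == "l2-trunk" and vlan in self.vlans


def path_into_vlan(src_gateway: str, dst_vlan: int, ports: dict[str, SwitchPort]) -> str:
    """Describe what happens to traffic routed into dst_vlan by src_gateway.

    Which router is VRRP master for dst_vlan is irrelevant here: the packet
    leaves whichever router is the *source's* gateway.
    """
    if ports[src_gateway].carries(dst_vlan):
        return f"delivered via {src_gateway}"
    return f"blackholed at the {src_gateway}-facing switch port"


ports = {
    "cr1-eqiad": SwitchPort("l2-trunk", frozenset({1118})),
    "cr2-eqiad": SwitchPort("l3", frozenset()),  # the misconfiguration on cloudsw1-d5
}
print(path_into_vlan("cr2-eqiad", 1118, ports))
```

Turning the cr2-facing port into an L2 trunk that carries vlan 1118 makes the same call report delivery.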

Fixing the initial issue, then taking a more careful approach by moving IPv6 before IPv4, made it possible to complete the change.
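The task does not describe the careful approach beyond the ordering: the SAL shows IPv6 moved at 09:38 and IPv4 at 09:46. One way to express a change that moves one address family at a time behind a verification gate is the Python sketch below. `apply`, `verify` and `rollback` are hypothetical callbacks that stand in for the manual router changes and checks.

```python
from typing import Callable


def staged_migration(
    families: list[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> list[str]:
    """Move address families one at a time; on a failed check, roll back that
    family and stop. Returns the families that were migrated successfully."""
    done: list[str] = []
    for family in families:
        apply(family)
        if not verify(family):
            rollback(family)
            break
        done.append(family)
    return done
```

With this structure a failure in the second family leaves the first one in place, so a problem affects only one address family at a time.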