This ticket tracks the work to enable IPv6 in codfw1dev prior to doing so in eqiad1.
Description
Details
Title | Reference | Author | Source Branch | Dest Branch |
---|---|---|---|---|
codfw1dev: network: rename them to better reflect what they do | repos/cloud/cloud-vps/tofu-infra!157 | aborrero | arturo-360-codfw1dev-network-r | main |
Event Timeline
The steps I have in mind here are:
1/ Set up v6 on the transport network (most likely 2620:0:860:fe0a::/64)
2/ Assign a /48 for cloud codfw, see T187929 (Here we won't go into its subnetting)
3/ Advertise that /48 to the world, either from codfw only or from both eqiad and codfw (TBD, the former is preferred)
3a/ Shrink esams advertisement from 2a02:ec80::/32 to 2a02:ec80:500::/48
4/ Set up a basic firewall filter (cloud-in6)
My suggestion is to initially only allow traffic to the internet and maybe some selected "private" hosts needed for the testing. This is to minimize the workload of creating and managing that filter during and after the PoC.
5/ Statically route the whole /48 to the neutron gateway VIP (go live)
This would stay as is for now, for simplicity, but will most likely change down the road (for example, if the support network gets moved to that prefix but on a different network).
I assume that if the PoC is not successful, we would roll back the configuration, and the assignments would stay reserved for future usage.
If successful, we would need to start managing the firewall filter (like we currently do with v4). Does that mean the v4 exceptions will go away and be replaced by the v6 ones, or will both stay live?
Is there anything else that will change in that respect if we keep maintaining v6 in the long run?
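The prefix arithmetic behind step 3a can be sketched with Python's standard `ipaddress` module. This is only an illustration: the codfw /48 used here (`2a02:ec80:a100::/48`) is the one that appears in the later changes on this ticket, and reading step 3a as "stop the esams announcement from covering the other /48s" is an assumption, not something stated above.

```python
import ipaddress

# Prefixes quoted in step 3a (esams advertisement before and after the change).
old_esams = ipaddress.ip_network("2a02:ec80::/32")
new_esams = ipaddress.ip_network("2a02:ec80:500::/48")

# The WMCS codfw /48 named in the later homer changes on this ticket.
codfw_cloud = ipaddress.ip_network("2a02:ec80:a100::/48")

# With the /32 advertised, the codfw /48 sits inside the esams announcement.
covered_before = codfw_cloud.subnet_of(old_esams)

# With only the esams /48 advertised, the two prefixes are disjoint.
covered_after = codfw_cloud.overlaps(new_esams)

# Number of /48s that fit inside the /32 aggregate.
slash48_count = 2 ** (48 - old_esams.prefixlen)

print(covered_before, covered_after, slash48_count)
```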
Mostly agree with everything, some inline comments:
Is that range definitive? If not, could we pick one that would likely be used in the future too? Per my comment T187929#5984401 I suggest we use 2a02:ec80:1:0::/64.
2/ Assign a /48 for cloud codfw, see T187929 (Here we won't go into its subnetting)
OK. Same, I suggest we use 2a02:ec80:1:1::/64 (see T187929#5984401)
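As a quick sanity check of the two suggested ranges, the sketch below uses Python's `ipaddress` module to show where they sit relative to each other. The variable names are illustrative, and treating `2a02:ec80::/32` (the aggregate mentioned in step 3a) as their parent is an assumption made for the example; the actual addressing plan is the subject of T187929.

```python
import ipaddress

# Aggregate mentioned in step 3a of the plan.
aggregate = ipaddress.ip_network("2a02:ec80::/32")

# The two /64s suggested in the replies to steps 1 and 2.
suggested_step1 = ipaddress.ip_network("2a02:ec80:1:0::/64")
suggested_step2 = ipaddress.ip_network("2a02:ec80:1:1::/64")

# Both /64s share the same enclosing /48.
common_48 = suggested_step1.supernet(new_prefix=48)
same_48 = common_48 == suggested_step2.supernet(new_prefix=48)

# Both are carved out of the /32 aggregate and do not overlap each other.
inside_aggregate = (
    suggested_step1.subnet_of(aggregate) and suggested_step2.subnet_of(aggregate)
)
disjoint = not suggested_step1.overlaps(suggested_step2)

print(common_48, same_48, inside_aggregate, disjoint)
```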
3/ Advertise that /48 to the world, either from codfw only or from both eqiad and codfw (TBD, the former is preferred)
4/ Set up a basic firewall filter (cloud-in6)
My suggestion is to initially only allow traffic to the internet and maybe some selected "private" hosts needed for the testing. This is to minimize the workload of creating and managing that filter during and after the PoC.
I agree. I suggest we do this firewalling incrementally. Ideally we would implement some stateful firewalling and only allow connections started within the cloud, but I'm aware this is not possible using the prod core routers as upstream.
For the initial PoC, we only need 3 addresses allowed:
- 2620:0:860:3:208:80:153:76 (cloudservices2002-dev.wikimedia.org)
- 2620:0:860:2:208:80:153:59 (cloudcontrol2001-dev.wikimedia.org)
- 2620:0:860:3:208:80:153:75 (cloudcontrol2003-dev.wikimedia.org)
i.e., connections to/from the cloud to these physical hosts.
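The intent of that three-host allowlist can be sketched in Python as below. This is not the actual router filter (which would live in the core router configuration); `ALLOWED_HOSTS` and `is_allowed` are hypothetical names used only to illustrate the matching logic.

```python
import ipaddress

# The three physical hosts listed above as needed for the initial PoC.
ALLOWED_HOSTS = {
    ipaddress.ip_address("2620:0:860:3:208:80:153:76"): "cloudservices2002-dev.wikimedia.org",
    ipaddress.ip_address("2620:0:860:2:208:80:153:59"): "cloudcontrol2001-dev.wikimedia.org",
    ipaddress.ip_address("2620:0:860:3:208:80:153:75"): "cloudcontrol2003-dev.wikimedia.org",
}


def is_allowed(addr: str) -> bool:
    """Return True if addr is one of the allowlisted "private" hosts.

    Parsing through ipaddress normalizes the textual form, so
    differently written but equal addresses compare the same.
    """
    return ipaddress.ip_address(addr) in ALLOWED_HOSTS


for candidate in ("2620:0:860:3:208:80:153:76", "2620:0:860:3:208:80:153:77"):
    print(candidate, is_allowed(candidate))
```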
5/ Statically route the whole /48 to the neutron gateway VIP (go live)
This would stay as is for now, for simplicity, but will most likely change down the road (for example, if the support network gets moved to that prefix but on a different network).
I assume that if the PoC is not successful, we would roll back the configuration, and the assignments would stay reserved for future usage.
OK, that's why I think it may be important to agree on the addressing plan first: T187929: Cloud IPv6 subnets
If successful, we would need to start managing the firewall filter (like we currently do with v4). Does that mean the v4 exceptions will go away and be replaced by the v6 ones, or will both stay live?
Initially both will stay alive. I'm not sure what our next step would be from the network point of view:
- we get rid of our NAT dmz_cidr mechanism --> the v4 firewall needs adjustment in the prod core routers.
- we introduce a pair of cloud upstream routers/firewalls between neutron and the prod core routers --> we can drop all the firewalling from the prod core routers.
Is there anything else that will change in that respect if we keep maintaining v6 in the long run?
Other than what I commented already, I don't have any other concerns right now. Things may change in the near future once we really start playing with the PoC and learn what this looks like for real.
Change #1078990 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48
Change #1078990 merged by jenkins-bot:
[operations/homer/public@master] Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48
Change #1079237 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw
Change #1079237 merged by jenkins-bot:
[operations/homer/public@master] Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw
Change #1079252 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Fix typos in updated prefix-list for cloud ranges eqiad
Change #1079252 merged by jenkins-bot:
[operations/homer/public@master] Fix typos in updated prefix-list for cloud ranges eqiad
Change #1079254 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Fix similar typo in the codfw policy
Change #1079254 merged by jenkins-bot:
[operations/homer/public@master] Fix similar typo in the codfw policy
The edge (cloudsw/cr) networking is now complete, elements in the range are reachable externally.
cathal@officepc:~$ mtr -z -b -w -c 5 2a02:ec80:a100:fe03::1
Start: 2024-10-10T14:35:25+0100
HOST: officepc                                                                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS5466  pool-ipv6-pd.agg1.srl.blp-srl.eir.ie (2001:bb6:8b70:9e00::1)             0.0%     5    0.4   0.4   0.3   0.5   0.1
  2. AS5466  agg1.srl.blp-srl.eircom.net (2001:bb0:6:a11d::1)                         0.0%     5    8.2   5.9   4.6   8.2   1.7
  3. AS5466  2001:bb0:6:a197::1                                                       0.0%     5    4.7   4.8   4.7   4.9   0.1
  4. AS1299  dln-b3-link.ip.twelve99.net (2001:2035:0:9eb::1)                         0.0%     5    4.7   4.7   4.4   4.8   0.1
  5. AS1299  ldn-bb2-v6.ip.twelve99.net (2001:2034:1:ca::1)                           0.0%     5   14.7  14.8  14.6  15.0   0.1
  6. AS1299  prs-bb2-v6.ip.twelve99.net (2001:2034:1:c1::1)                           0.0%     5   32.5  32.8  32.5  33.0   0.2
  7. AS1299  rest-bb1-v6.ip.twelve99.net (2001:2034:1:73::1)                         20.0%     5   98.2  98.1  97.9  98.2   0.1
  8. AS1299  atl-bb1-v6.ip.twelve99.net (2001:2034:1:a1::1)                           0.0%     5  115.0 114.6 114.2 115.0   0.3
  9. AS???   ???                                                                     100.0     5    0.0   0.0   0.0   0.0   0.0
 10. AS???   ???                                                                     100.0     5    0.0   0.0   0.0   0.0   0.0
 11. AS???   irb-1120.cloudsw1-b1-codfw.wikimedia.org (2a02:ec80:a100:fe03::1)        0.0%     5  132.7 132.8 132.7 133.0   0.2
cmooney@cloudgw2002-dev:~$ sudo tcpdump -i vlan2120 -l -p -nn host 2001:bb6:8b70:9e00::187
listening on vlan2120, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:51:35.790703 IP6 2001:bb6:8b70:9e00::187 > 2a02:ec80:a100:1::29c: ICMP6, echo request, id 6205, seq 97, length 64
13:51:36.814913 IP6 2001:bb6:8b70:9e00::187 > 2a02:ec80:a100:1::29c: ICMP6, echo request, id 6205, seq 98, length 64
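For readers who want to confirm that the addresses in the two captures above belong to the new range, a minimal check with Python's `ipaddress` module could look like the following. The labels in `observed` are descriptive names chosen for this sketch, not identifiers from any tool.

```python
import ipaddress

# WMCS IPv6 range for codfw, as named in the homer changes on this ticket.
wmcs_codfw = ipaddress.ip_network("2a02:ec80:a100::/48")

# Addresses taken from the mtr and tcpdump output above.
observed = {
    "mtr target (irb-1120 on cloudsw1-b1-codfw)": "2a02:ec80:a100:fe03::1",
    "echo request destination on vlan2120": "2a02:ec80:a100:1::29c",
    "external ping source": "2001:bb6:8b70:9e00::187",
}

# Map each label to whether its address falls inside the WMCS codfw /48.
in_range = {
    label: ipaddress.ip_address(addr) in wmcs_codfw
    for label, addr in observed.items()
}

for label, result in in_range.items():
    print(f"{label}: {'inside' if result else 'outside'} {wmcs_codfw}")
```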
Change #1079288 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Add orlonger to policy on announced v6 routes from cloudsw
Change #1079288 abandoned by Cathal Mooney:
[operations/homer/public@master] Add orlonger to policy on announced v6 routes from cloudsw
Change #1079982 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/homer/public@master] Remove WMCS codfw prefix from CR aggregate conf and adjust outfilter
Change #1079982 merged by jenkins-bot:
[operations/homer/public@master] Remove WMCS codfw prefix from CR aggregate conf and adjust outfilter
aborrero opened https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/157
codfw1dev: network: rename them to better reflect what they do
aborrero merged https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/157
codfw1dev: network: rename them to better reflect what they do