Renumber cloud-instance-transport1-b-eqiad to public IPs
Open, Normal, Public

Description

Follow up from T122406 (and IRC chats)

cloud-instance-transport1-b-eqiad, which connects the OpenStack instances to the Wikimedia infrastructure, uses IPs in the 10/8 space. It should probably be renumbered to a public IP subnet (similar to any kind of customer interconnect link).

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

If I'm correct, we need 6 IPs in the subnet (cr1, cr2, cr1/2-vrrp, 2 hosts on the cloud side plus their VIP), so a /29 would work.
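The sizing above can be double-checked with a few lines of Python. This is only a sketch: the labels in `REQUIRED` are my shorthand for the six addresses listed in the description, and `192.0.2.0` is a documentation address used purely to build a network object of the right size.

```python
import ipaddress

# The six addresses the task description lists for the transport subnet.
REQUIRED = ["cr1", "cr2", "cr1/2 VRRP VIP", "cloud host A", "cloud host B", "cloud-side VIP"]

def usable_hosts(prefix_len):
    """Usable IPv4 host addresses in a subnet of this prefix length
    (network and broadcast addresses excluded)."""
    # Only the prefix length matters here, not the base address.
    net = ipaddress.ip_network(f"192.0.2.0/{prefix_len}", strict=False)
    return net.num_addresses - 2

def smallest_prefix_for(count):
    """Longest prefix (smallest subnet) that still offers `count` usable hosts."""
    for prefix_len in range(30, 0, -1):
        if usable_hosts(prefix_len) >= count:
            return prefix_len
    raise ValueError("no IPv4 prefix is large enough")

print(smallest_prefix_for(len(REQUIRED)))  # 29
print(usable_hosts(29))                    # 6
```

A /29 fits the six addresses exactly, so there is no spare address left in the subnet.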

Further down the road, the prefix 185.15.56.0/25 should be advertised by BGP as well (currently the routers have a static route to 10.64.22.4), and the cloud routers could possibly be connected directly to cr1 and cr2 (using two /31s).

ayounsi created this task. Oct 22 2018, 2:42 PM
ayounsi triaged this task as Normal priority.
Restricted Application added a project: Operations. Oct 22 2018, 2:42 PM
Restricted Application added a subscriber: Aklapper.

I don't see any problem with this off the top of my head. I would ask @chasemp to see if he can see any issue with this new setting.

Please note that doing this is a major modification of our environment and will probably require a medium disruption of the CloudVPS service.
It involves re-configuring neutron from the beginning. Almost a new bootstrap :-) since all neutron objects are interleaved in the database.
So I would ask to wait for this until next Q, at least.

Out of curiosity, I would expect some shortage of public IPv4 addressing, is that not the case?

chasemp added a comment. Edited Oct 22 2018, 4:46 PM

No technical blockers to this VLAN having public IPs that I know of. Agreed that the switchover could be difficult to make transparent to users. It's possible adding an interface to the neutron router for a new subnet in the existing VLAN would allow cutover to be faster but it's probably more trouble than it's worth -- would need to test it out a bit.

I don't think this changes desired behavior on the neutron router side, we still want to overload to a designated IP within the customer public block, and we still don't want to allocate IPs in this subnet as floating IPs.

If I'm correct, we need 6 IPs in the subnet (cr1, cr2, cr1/2-vrrp, 2 hosts on the cloud side plus their VIP), so a /29 would work.

It seems like a dual stack approach on the external provider interface for the neutron router would work out here but without testing it who knows https://docs.openstack.org/mitaka/networking-guide/config-ipv6.html

edit: I originally read 6 IPs as IPv6 for who knows what reason. Long day(s) :)

edit: few updates

Out of curiosity, I would expect some shortage of public IPv4 addressing, is that not the case?

We need to be careful with our public IPs indeed, but this is a good use of a /29 :)

It seems like a dual stack approach on the external provider interface for the neutron router would work out here but without testing it who knows https://docs.openstack.org/mitaka/networking-guide/config-ipv6.html

v6 is out of scope for this specific task but would indeed be great to add. I know we have T187929 for the space itself. Feel free to open a task for the v6 configuration itself.

It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migration and relatively few projects have migrated over? If not, is there a specific timeframe where we can schedule this? Could this be done by, say, end of Q2? Thanks!

It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migration and relatively few projects have migrated over? If not, is there a specific timeframe where we can schedule this? Could this be done by, say, end of Q2? Thanks!

We have several projects running on eqiad1 already. I will talk to my team to try to guesstimate a timeframe.

There are currently 23 projects running in the new region, and we're moving more over every day. This would have been a reasonable request when we were originally setting up the Neutron network but it is far from trivial now.

It may be that I misunderstand the motivation for the renumbering. Can you tell me what the benefits are that will outweigh the cost of work and downtime?

faidon added a comment. Edited Oct 23 2018, 6:11 PM

This is essentially part of T122406, which we resolved last week with the intention of making it more specific with this task (among others).

Basically, it shouldn't have been assigned into the 10/8 space to begin with, unfortunately, and my apologies for not catching that earlier. This is a guest/interconnect network that should not be in the production realm & IP space; it's a router-to-router interconnection with a network that we consider public (because anyone can get access to it, by design).

Where we'd like to be is to be treating the Neutron interfaces from the network perspective as "customer" ports, like how we e.g. are treating the OIT port and network right now. That means that we would apply our customer/border policy that would filter and block all traffic coming with a source IP of 10/8 (and eventually, also destination IP of 10/8) on the interconnection interface.
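To make the intended customer/border policy concrete, here is a rough Python model of the check described above. It is a sketch, not the actual router filter: the function name and the `filter_destination` switch are mine, and `198.51.100.1` is a documentation address standing in for an arbitrary external host.

```python
import ipaddress

# The production private space that should never appear on a customer port.
PRIVATE_10 = ipaddress.ip_network("10.0.0.0/8")

def border_policy_allows(src, dst, filter_destination=False):
    """Return True if a packet arriving on the interconnection interface passes.

    Packets with a source in 10/8 are always dropped. Packets with a
    destination in 10/8 are dropped only once the eventual, stricter
    stage is enabled (filter_destination=True).
    """
    if ipaddress.ip_address(src) in PRIVATE_10:
        return False
    if filter_destination and ipaddress.ip_address(dst) in PRIVATE_10:
        return False
    return True

# Old transport address as source: dropped.
print(border_policy_allows("10.64.22.4", "198.51.100.1"))     # False
# Public transport address as source: allowed.
print(border_policy_allows("208.80.155.92", "198.51.100.1"))  # True
```

With this policy in place, anything on the cloud side that still sources traffic from 10/8 stops working, which is why the transport subnet has to move first.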

I'm very sorry if this feels like a surprise :( This has been the plan since mid-2015 or so and we discussed it at length (me and people from the Labs team) at the Barcelona offsite in 2016, when we also decided to deprecate the 10/8 labs-support network and move those servers to public IPs, for the same reasons.

It was the prevalent idea even pre-2015, and we had just put it in the tech-debt/too-big-to-fix category, and have been deferring all this until post-Neutron for as long as the plans for Neutron (and Quantum) have existed! We have waited a long time, and I guess we can wait a little bit longer if it's complicated and hard, but I'd like for that extra time to be months and not years, if possible :)

I can investigate how difficult this is and give a better guesstimate of the disruption to end users.
I'll try the approach that @chasemp suggested: having both transport networks configured at the same time in neutron and swapping them as quickly as possible.

For that, I would try this setup in the labtestn deployment (which is a mirror of the eqiad1 one, but in codfw).

Please @ayounsi , do the following:

  • allocate a public IPv4 range (probably a /29 as described in task description) in codfw.
  • Select one IP from the range and let us know. We are using only one IP on our side right now, the neutron VIP.
  • configure the corresponding codfw core router.

At some point, we would have to switch the main routes to the internal openstack ranges in labtestn, but that will require some real-time coordination, while I swap the transport setting in neutron.

Change 469771 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] DNS: assign public /29 for cloud-instance-transport1-b-codfw

https://gerrit.wikimedia.org/r/469771

Thanks for investigating it!

See https://gerrit.wikimedia.org/r/c/operations/dns/+/469771 for the IPs, I took the same model as the existing subnet:
https://github.com/wikimedia/operations-dns/blob/96a7b2b706b5bd9ddac8505592ea0f2be3847d4e/templates/10.in-addr.arpa#L3273
The .189 might not be needed, though. In that case the .188 should have a PTR mentioning that it's a VIP.

To be pushed to the routers:

cr1-codfw
[edit interfaces ae2 unit 2120 family inet]
        address 10.192.22.2/24 { ... }
+       address 208.80.153.186/29 {
+           vrrp-group 121 {
+               virtual-address 208.80.153.185;
+               track {
+                   interface ae2.2120 {
+                       bandwidth-threshold 20g priority-cost 50;
+                       bandwidth-threshold 30g priority-cost 30;
+                   }
+               }
+           }
+       }
cr2-codfw
[edit interfaces ae2 unit 2120 family inet]
        address 10.192.22.2/24 { ... }
+       address 208.80.153.187/29 {
+           vrrp-group 121 {
+               virtual-address 208.80.153.185;
+               track {
+                   interface ae2.2120 {
+                       bandwidth-threshold 20g priority-cost 50;
+                       bandwidth-threshold 30g priority-cost 30;
+                   }
+               }
+           }
+       }

Then update the static route once the server is configured.

[edit routing-options static route 172.16.128.0/21]
-    next-hop 10.192.22.4;
+    next-hop 208.80.153.188;

Once done, clean up old IPs and DNS records.
Ideally update the VRRP group back to 120, but that might cause another brief outage.

Change 469771 merged by Ayounsi:
[operations/dns@master] DNS: assign public /29 for cloud-instance-transport1-b-codfw

https://gerrit.wikimedia.org/r/469771

Mentioned in SAL (#wikimedia-operations) [2018-10-25T21:29:45Z] <XioNoX> configure 208.80.153.185/29 on cr1/2-codfw - T207663

aborrero edited projects, added cloud-services-team (Kanban); removed Cloud-Services.
aborrero lowered the priority of this task from Normal to Low.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero claimed this task. Wed, Dec 5, 12:34 PM

Change 477769 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: labtestn: introduce new physical net mapping

https://gerrit.wikimedia.org/r/477769

Change 477769 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: labtestn: introduce new physical net mapping

https://gerrit.wikimedia.org/r/477769

root@labtestcontrol2003:~# neutron net-create 'wan-transport-codfw' --router:external=true --provider:network_type=flat --provider:physical_network=br-transport --shared
Created a new network:
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | True                                 |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2018-12-05T12:54:25                  |
| description               |                                      |
| id                        | 07d9efe1-bed6-4b44-85af-4a37d8e3c766 |
| ipv4_address_scope        |                                      |
| ipv6_address_scope        |                                      |
| is_default                | False                                |
| mtu                       | 1500                                 |
| name                      | wan-transport-codfw                  |
| port_security_enabled     | True                                 |
| provider:network_type     | flat                                 |
| provider:physical_network | br-transport                         |
| provider:segmentation_id  |                                      |
| router:external           | True                                 |
| shared                    | True                                 |
| status                    | ACTIVE                               |
| subnets                   |                                      |
| tags                      |                                      |
| tenant_id                 | admin                                |
| updated_at                | 2018-12-05T12:54:25                  |
+---------------------------+--------------------------------------+
root@labtestcontrol2003:~# neutron subnet-create --gateway 208.80.153.185 --name cloud-instances-transport1-b-codfw --ip-version 4 --disable-dhcp wan-transport-codfw 208.80.153.184/29
Created a new subnet:
+-------------------+------------------------------------------------------+
| Field             | Value                                                |
+-------------------+------------------------------------------------------+
| allocation_pools  | {"start": "208.80.153.186", "end": "208.80.153.190"} |
| cidr              | 208.80.153.184/29                                    |
| created_at        | 2018-12-05T12:57:55                                  |
| description       |                                                      |
| dns_nameservers   |                                                      |
| enable_dhcp       | False                                                |
| gateway_ip        | 208.80.153.185                                       |
| host_routes       |                                                      |
| id                | eb4db443-2184-4456-b414-6e53fa878bee                 |
| ip_version        | 4                                                    |
| ipv6_address_mode |                                                      |
| ipv6_ra_mode      |                                                      |
| name              | cloud-instances-transport1-b-codfw                   |
| network_id        | 07d9efe1-bed6-4b44-85af-4a37d8e3c766                 |
| subnetpool_id     |                                                      |
| tenant_id         | admin                                                |
| updated_at        | 2018-12-05T12:57:55                                  |
+-------------------+------------------------------------------------------+
root@labtestcontrol2003:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-codfw,ip_address=208.80.153.190  cloudinstances2b-gw wan-transport-codfw
Set gateway for router cloudinstances2b-gw
root@labtestcontrol2003:~# neutron router-list
+--------------------------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------+
| id                                   | name                | external_gateway_info                                                                                                                                                                      | distributed | ha   |
+--------------------------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------+
| 5712e22e-134a-40d3-a75a-1c9b441717ad | cloudinstances2b-gw | {"network_id": "07d9efe1-bed6-4b44-85af-4a37d8e3c766", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "eb4db443-2184-4456-b414-6e53fa878bee", "ip_address": "208.80.153.190"}]} | False       | True |
+--------------------------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------+
root@labtestcontrol2003:~# neutron router-show cloudinstances2b-gw
+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                                                                                      |
+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | True                                                                                                                                                                                       |
| availability_zone_hints |                                                                                                                                                                                            |
| availability_zones      | nova                                                                                                                                                                                       |
| description             |                                                                                                                                                                                            |
| distributed             | False                                                                                                                                                                                      |
| external_gateway_info   | {"network_id": "07d9efe1-bed6-4b44-85af-4a37d8e3c766", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "eb4db443-2184-4456-b414-6e53fa878bee", "ip_address": "208.80.153.190"}]} |
| ha                      | True                                                                                                                                                                                       |
| id                      | 5712e22e-134a-40d3-a75a-1c9b441717ad                                                                                                                                                       |
| name                    | cloudinstances2b-gw                                                                                                                                                                        |
| routes                  |                                                                                                                                                                                            |
| status                  | ACTIVE                                                                                                                                                                                     |
| tenant_id               | admin                                                                                                                                                                                      |
+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The moment I set the external gateway on the neutron virtual router, networking breaks for instances unless the new subnet is already configured on the core routers.

I will try reusing the same network object to try to make this even cleaner.

Change 477807 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "cloudvps: labtestn: introduce new physical net mapping"

https://gerrit.wikimedia.org/r/477807

Change 477807 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "cloudvps: labtestn: introduce new physical net mapping"

https://gerrit.wikimedia.org/r/477807

Trying now with only adding a new subnet object:

root@labtestcontrol2003:~# neutron subnet-create --gateway 208.80.153.185 --name cloud-instances-transport1-b-codfw --ip-version 4 --disable-dhcp flattransportb 208.80.153.184/29
Created a new subnet:
+-------------------+------------------------------------------------------+
| Field             | Value                                                |
+-------------------+------------------------------------------------------+
| allocation_pools  | {"start": "208.80.153.186", "end": "208.80.153.190"} |
| cidr              | 208.80.153.184/29                                    |
| created_at        | 2018-12-05T16:50:03                                  |
| description       |                                                      |
| dns_nameservers   |                                                      |
| enable_dhcp       | False                                                |
| gateway_ip        | 208.80.153.185                                       |
| host_routes       |                                                      |
| id                | 31214392-9ca5-4256-bff5-1e19a35661de                 |
| ip_version        | 4                                                    |
| ipv6_address_mode |                                                      |
| ipv6_ra_mode      |                                                      |
| name              | cloud-instances-transport1-b-codfw                   |
| network_id        | 57017d7c-3817-429a-8aa3-b028de82cdcc                 |
| subnetpool_id     |                                                      |
| tenant_id         | admin                                                |
| updated_at        | 2018-12-05T16:50:03                                  |
+-------------------+------------------------------------------------------+
root@labtestcontrol2003:~# neutron net-list
+--------------------------------------+-------------------------+--------------------------------------------------------+
| id                                   | name                    | subnets                                                |
+--------------------------------------+-------------------------+--------------------------------------------------------+
| 57017d7c-3817-429a-8aa3-b028de82cdcc | flattransportb          | 31214392-9ca5-4256-bff5-1e19a35661de 208.80.153.184/29 |
|                                      |                         | 5f646219-ce2c-4eb8-8a40-14848c4aab22 10.192.22.0/24    |
|                                      |                         | 9dd8c6f6-9b58-4a14-a920-72b201c6b325 172.16.129.0/24   |
| d967e056-efc3-46f2-b75b-c906bb5322dc | HA network tenant admin | 651250de-53ca-4487-97ce-e6f65dc4b8ec 169.254.192.0/18  |
| 3a3bfff3-d602-43c7-9178-89d7a90545a9 | compat-net              | 79c339b3-94ff-4d89-829c-44acfc9ef5cc 10.196.16.0/24    |
| 05a5494a-184f-4d5c-9e98-77ae61c56daa | flatcloudinstancesb     | 7adfcebe-b3d0-4315-92fe-e8365cc80668 172.16.128.0/24   |
+--------------------------------------+-------------------------+--------------------------------------------------------+
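Both subnet-create runs above report the same allocation pool (208.80.153.186 to 208.80.153.190). As a sanity check, that range is what you get by taking every usable host in the /29 and removing the gateway. The sketch below assumes that is how the default pool is derived, and it only handles a gateway at the first or last host address (a gateway in the middle would split the pool in two).

```python
import ipaddress

def default_allocation_pool(cidr, gateway):
    """Return (start, end) of the host range left after removing the gateway.

    Assumes the gateway is the first or last usable host, so the
    remaining addresses form one contiguous range.
    """
    net = ipaddress.ip_network(cidr)
    gw = ipaddress.ip_address(gateway)
    candidates = [host for host in net.hosts() if host != gw]
    return str(candidates[0]), str(candidates[-1])

print(default_allocation_pool("208.80.153.184/29", "208.80.153.185"))
# ('208.80.153.186', '208.80.153.190')
```

Note that the pool still contains the addresses configured on cr1 and cr2 (.186 and .187), so the neutron side has to be given its address explicitly, as done with `--fixed-ip`.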

Mentioned in SAL (#wikimedia-operations) [2018-12-05T17:11:26Z] <XioNoX> add public IPs to codfw cloud-instance-transport1-b T207663

Mentioned in SAL (#wikimedia-operations) [2018-12-05T17:19:53Z] <XioNoX> remove private IPs from codfw cloud-instance-transport1-b T207663

Change 477819 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove private IPs for labs-instance-transport1-b-codfw

https://gerrit.wikimedia.org/r/477819

Change 477819 merged by Ayounsi:
[operations/dns@master] Remove private IPs for labs-instance-transport1-b-codfw

https://gerrit.wikimedia.org/r/477819

@ayounsi and I did both of the following in real time:

  1. change routing in CRs
  2. introduce the new default gateway for neutron:
root@labtestcontrol2003:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-codfw,ip_address=208.80.153.190 cloudinstances2b-gw flattransportb
Set gateway for router cloudinstances2b-gw
root@labtestcontrol2003:~# neutron router-port-list cloudinstances2b-gw
+--------------------------------------+----------------------+-------------------+---------------------------------------------------------------------------------------+
| id                                   | name                 | mac_address       | fixed_ips                                                                             |
+--------------------------------------+----------------------+-------------------+---------------------------------------------------------------------------------------+
| 1290224c-b1b4-4120-a1fe-70d25b28a3bf |                      | fa:16:3e:35:9f:97 | {"subnet_id": "31214392-9ca5-4256-bff5-1e19a35661de", "ip_address": "208.80.153.190"} |
| 21e10025-d464-45a6-82ac-25894e9164e4 |                      | fa:16:3e:3c:11:01 | {"subnet_id": "7adfcebe-b3d0-4315-92fe-e8365cc80668", "ip_address": "172.16.128.1"}   |
| 586de0ea-52cf-4429-804b-bc4b535feec9 | compat-port          | fa:16:3e:5b:f0:05 | {"subnet_id": "79c339b3-94ff-4d89-829c-44acfc9ef5cc", "ip_address": "10.196.16.3"}    |
| db1b15f9-aca9-4282-bfda-d087c84e1396 | HA port tenant admin | fa:16:3e:eb:b0:3b | {"subnet_id": "651250de-53ca-4487-97ce-e6f65dc4b8ec", "ip_address": "169.254.192.2"}  |
| e7fcaf0d-ec11-4d3d-afdf-5ea1f0c4a486 | HA port tenant admin | fa:16:3e:f7:0c:54 | {"subnet_id": "651250de-53ca-4487-97ce-e6f65dc4b8ec", "ip_address": "169.254.192.1"}  |
+--------------------------------------+----------------------+-------------------+---------------------------------------------------------------------------------------+

(For this to work properly in real time, both the CRs and neutron had been prepared with the required configuration ahead of time.)

I was running a ping to test packet loss during the change:

ping from a VM to the outside
57 packets transmitted, 55 received, 3% packet loss, time 56138ms
rtt min/avg/max/mdev = 36.500/36.882/37.906/0.347 ms
ping from outside to a VM
58 packets transmitted, 55 received, 5% packet loss, time 57144ms
rtt min/avg/max/mdev = 36.371/36.802/37.919/0.372 ms

@ayounsi mentioned that the commit on his side can take about 5s to complete.
The change does drop some packets, but if both sides are switched in real time the loss is small and not a big deal.
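For reference, the loss figures in the two ping summaries can be recomputed as follows. This is a small sketch; the only assumption is that the percentages printed by ping are truncated to whole numbers, which is consistent with the figures above.

```python
def loss(transmitted, received):
    """Return (packets lost, exact loss percentage)."""
    lost = transmitted - received
    return lost, 100.0 * lost / transmitted

# Counts taken from the two ping summaries above.
for label, tx, rx in [("VM to outside", 57, 55), ("outside to VM", 58, 55)]:
    lost, pct = loss(tx, rx)
    print(f"{label}: {lost} lost, {pct:.1f}% exact, {int(pct)}% truncated")
# VM to outside: 2 lost, 3.5% exact, 3% truncated
# outside to VM: 3 lost, 5.2% exact, 5% truncated
```

The pings ran at roughly one packet per second (57 packets in about 56 s), so 2 to 3 lost packets correspond to an interruption of a few seconds, in line with the ~5s commit time.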

For the record, this was the labtestn network before the change: (diagram not included)

This is the labtestn network after the change: (diagram not included)

Mentioned in SAL (#wikimedia-cloud) [2018-12-05T17:59:53Z] <arturo> T207663 changed labtestn transport network addressing from private to public

aborrero raised the priority of this task from Low to Normal. Wed, Dec 12, 12:19 PM

My team agreed to follow up with eqiad1. The only requirement is that we have a clear rollback plan in case something goes wrong.

@ayounsi could you please prepare all the previous stuff (IPv4 range allocation, CR configurations, DNS patches, etc) for eqiad1?
https://netbox.wikimedia.org/ipam/vlans/90/

Change 479337 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Assign public /29 for cloud-instance-transport1-b-eqiad

https://gerrit.wikimedia.org/r/479337

To be pushed:

cr1-eqiad
[edit interfaces ae2 unit 1120 family inet]
        address 10.64.22.2/24 { ... }
+       address 208.80.155.90/29 {
+           vrrp-group 121 {
+               virtual-address 208.80.155.89;
+               track {
+                   interface ae2.1120 {
+                       bandwidth-threshold 20g priority-cost 50;
+                       bandwidth-threshold 30g priority-cost 30;
+                   }
+               }
+           }
+       }
cr2-eqiad
[edit interfaces ae2 unit 1120 family inet]
        address 10.64.22.3/24 { ... }
+       address 208.80.155.91/29 {
+           vrrp-group 121 {
+               virtual-address 208.80.155.89;
+               track {
+                   interface ae2.1120 {
+                       bandwidth-threshold 20g priority-cost 50;
+                       bandwidth-threshold 30g priority-cost 30;
+                   }
+               }
+           }
+       }

service impacting:

both
[edit routing-options static]
     route 62.115.145.25/32 { ... }
-    route 172.16.0.0/21 next-hop 10.64.22.4;
+    /* Cloud instances prefix via labnet100[45] */
+    route 172.16.0.0/21 next-hop 208.80.155.92;
[edit routing-options static]
     route 172.16.0.0/21 { ... }
-    route 185.15.56.0/25 next-hop 10.64.22.4;
+    /* Cloud public prefix via labnet100[45] */
+    route 185.15.56.0/25 next-hop 208.80.155.92;

@aborrero Everything is ready to be merged/committed.

I used the name vip-gw-cloudnet.wikimedia.org. Let me know if that's correct or should be changed.
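Before committing, the eqiad plan can be sanity-checked offline: the new next-hop has to sit inside the new /29 and must not collide with an address already used on the router side. The sketch below does just that with the values from the diffs above; the dictionary layout and the function name are mine.

```python
import ipaddress

TRANSPORT = ipaddress.ip_network("208.80.155.88/29")

# Router-side addresses from the interface diffs above.
ROUTER_SIDE = {
    "cr1-eqiad": "208.80.155.90",
    "cr2-eqiad": "208.80.155.91",
    "vrrp-vip": "208.80.155.89",
}

# Static routes from the routing-options diff above (prefix -> next-hop).
STATIC_ROUTES = {
    "172.16.0.0/21": "208.80.155.92",
    "185.15.56.0/25": "208.80.155.92",
}

def next_hop_problems(subnet, router_side, routes):
    """Return a list of human-readable problems; empty means the plan is consistent."""
    problems = []
    taken = {ipaddress.ip_address(addr) for addr in router_side.values()}
    for prefix, hop in routes.items():
        hop_ip = ipaddress.ip_address(hop)
        if hop_ip not in subnet:
            problems.append(f"{prefix}: next-hop {hop} is outside {subnet}")
        elif hop_ip in (subnet.network_address, subnet.broadcast_address):
            problems.append(f"{prefix}: next-hop {hop} is not a usable host address")
        elif hop_ip in taken:
            problems.append(f"{prefix}: next-hop {hop} is already used on the router side")
    return problems

print(next_hop_problems(TRANSPORT, ROUTER_SIDE, STATIC_ROUTES))  # []
```

Feeding it the old next-hop (10.64.22.4) reports one problem, since that address is not part of the new subnet.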

aborrero added a comment. Edited Thu, Dec 13, 11:04 AM

Thanks!

I love diagrams, they help me better understand topology and architectures. Please @ayounsi confirm the following are right.

eqiad1 transport network before the changes: (diagram not included)

eqiad1 transport network after the changes: (diagram not included)

For the D day:

# create new subnet
root@cloudcontrol1004:~# neutron subnet-create --gateway 208.80.155.89 --name cloud-instances-transport1-b-eqiad1 --ip-version 4 --disable-dhcp wan-transport-eqiad 208.80.155.88/29

# switch gateway (service impact)
root@cloudcontrol1004:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-eqiad1,ip_address=208.80.155.92 cloudinstances2b-gw wan-transport-eqiad

# check ports in router
root@cloudcontrol1004:~# neutron router-port-list cloudinstances2b-gw

# clean up the old subnet if all is correct
root@cloudcontrol1004:~# neutron subnet-delete e4fb2771-a361-4add-ac4e-280cc300c59f

In case of rollback:

# In case of rollback, recreate the old subnet if it was already cleaned up
root@cloudcontrol1004:~# neutron subnet-create --gateway 10.64.22.1 --name cloud-instances-transport1-b-eqiad --ip-version 4 --disable-dhcp wan-transport-eqiad 10.64.22.0/24

# In case of rollback, switch again to the old gateway (service impact)
root@cloudcontrol1004:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transport1-b-eqiad,ip_address=10.64.22.4 cloudinstances2b-gw wan-transport-eqiad

# In case of rollback, check ports in router
root@cloudcontrol1004:~# neutron router-port-list cloudinstances2b-gw

# In case of rollback, clean up the new subnet (ID unknown at the time of writing)
root@cloudcontrol1004:~# neutron subnet-delete $ID

If all is OK, refresh docs on:

aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

We also need to update the cloud-in4 filter in eqiad, cf. T211921

[edit firewall family inet filter cloud-in4 term allow-icmp from source-address]
+        /* cloud-instance-transport1-b-eqiad */
+        208.80.155.88/29;
-        /* cloud-instance-transport1-b-eqiad */
-        10.64.22.0/24;
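The effect of that filter change on the transport addresses can be illustrated like this. It is only a sketch: it models just the one source-address entry touched by the diff (the real term may list other prefixes), and the helper name is mine.

```python
import ipaddress

def term_matches(source, source_prefixes):
    """True if the source address falls in any prefix of the term's source-address list."""
    ip = ipaddress.ip_address(source)
    return any(ip in ipaddress.ip_network(prefix) for prefix in source_prefixes)

# Only the entry changed by the diff above is modelled.
BEFORE = ["10.64.22.0/24"]
AFTER = ["208.80.155.88/29"]

OLD_NEUTRON_IP = "10.64.22.4"     # old next-hop on the transport subnet
NEW_NEUTRON_IP = "208.80.155.92"  # new next-hop on the public /29

print(term_matches(OLD_NEUTRON_IP, BEFORE), term_matches(NEW_NEUTRON_IP, BEFORE))  # True False
print(term_matches(OLD_NEUTRON_IP, AFTER), term_matches(NEW_NEUTRON_IP, AFTER))    # False True
```

So the filter has to be updated together with the renumbering, otherwise ICMP sourced from the new transport address would no longer match the allow-icmp term.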