cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan
Closed, Resolved (Public)

Description

As part of T297596: have cloud hardware servers in the cloud realm using a dedicated LB layer and T324992: cloudlb: create PoC on codfw, the cloudservices nodes should be reachable in the cloud-private vlan. This also has implications for T307357: Move cloud vps ns-recursor IPs to host/row-independent addressing.
But as of this writing they are physically connected to asw switches which makes that impossible (that vlan is cloud realm and only defined in cloud switches).

For now, the simplest and most sensible way forward is to run an additional cable from the servers into cloudsw, so we can enable the cloud-private vlan on them and continue developing whatever service architecture we need with them.

In https://phabricator.wikimedia.org/T327919#8699523 there seems to be a switch port proposal:

Host                   Existing port (keep)      New port (additional, net new)
cloudservices2004-dev  asw-b1-codfw ge-1/0/28    cloudsw1-b1-codfw ge-0/0/36
cloudservices2005-dev  asw-b1-codfw ge-1/0/29    cloudsw1-b1-codfw ge-0/0/37
cloudweb2002-dev       asw-b1-codfw ge-1/0/30    cloudsw1-b1-codfw ge-0/0/38

Event Timeline

aborrero triaged this task as Medium priority. May 12 2023, 3:56 PM

@aborrero I've patched the new connections as you described them. (papaul is out this week, lmk if there's anything else I can help with. thanks)

Change 919363 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address

https://gerrit.wikimedia.org/r/919363

Change 919352 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices: codfw1dev: enable cloud-private subnet

https://gerrit.wikimedia.org/r/919352

@aborrero quick question on how best to set this up on the host-side. Re-opening for now until we get the netbox set up and switches configured.

In the long run I guess hosts like this will have the following:

  • 1 Physical network connection (to cloudsw)
  • Untagged vlan on this will be the prod-realm 'private' network, i.e. 10.x address used for SSH, other management
  • We also have a tagged interface on the device, which connects the 'cloud-private' network, similar to vlan2151 on cloudlb2002

But for now with this kind of hybrid setup how should we set up the newly connected NIC, eno1, on these? Will I set the host-side as tagged, and you can create vlan interface vlan2151 as a separate child interface of eno1? Or do you want to just place the IP directly on eno1 itself (and hence switch side is set to untagged in cloud-private vlan).

Hope the question makes sense, drop me a line on irc if any questions.
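The two options in the question can be sketched as the iproute2 commands each would imply on the host. This is only an illustration, not the configuration that was deployed: the function names are hypothetical, and the address 192.0.2.10/24 is a documentation-range placeholder rather than a real cloud-private allocation. The NIC name eno1 and vlan ID 2151 are taken from the comment above.

```python
from typing import List


def tagged_commands(parent: str, vlan_id: int, cidr: str) -> List[str]:
    """Option 1: switch port is tagged; the host adds a vlan child interface."""
    child = f"vlan{vlan_id}"
    return [
        f"ip link add link {parent} name {child} type vlan id {vlan_id}",
        f"ip addr add {cidr} dev {child}",
        f"ip link set {parent} up",
        f"ip link set {child} up",
    ]


def untagged_commands(nic: str, cidr: str) -> List[str]:
    """Option 2: switch port is untagged in cloud-private; the IP sits on the NIC."""
    return [
        f"ip addr add {cidr} dev {nic}",
        f"ip link set {nic} up",
    ]


if __name__ == "__main__":
    # Placeholder address (RFC 5737 documentation range), not a real allocation.
    for cmd in tagged_commands("eno1", 2151, "192.0.2.10/24"):
        print(cmd)
```

With the tagged option the parent NIC carries no address of its own, which leaves room to add a native vlan on the same link later without touching the cloud-private interface.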

Thanks for bringing this up. I forgot about these steps here.

In order to simplify the setup and support a more "seamless" near-future migration here in which these hosts are reimaged into the .codfw.wmnet address (thus, losing their connection to wikimedia.org) we should add the trunked vlan for cloud-private tagged in a similar fashion to other hosts.

So, this implies:

  • today, they should have 2 NICs connected
    • "primary" NIC (let's call this eth0) connected to ASW with native prod vlan and address from public IPv4 prod space with wikimedia.org domain
    • "secondary" NIC (let's call this eth1) connected to CLOUDSW with trunked/tagged vlan cloud-private
  • tomorrow, when all operations are finished, they should:
    • still have the "secondary" eth1 connected to CLOUDSW but with trunked vlans: cloud-host (native) cloud-private (tagged). No traces of public IPv4 addresses or wikimedia.org domain, or prod vlans (beyond cloud-hosts).
    • if we are annoyed by using eth1 instead of eth0 , we could recable them and reimage.
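The "today" and "tomorrow" layouts listed above can be written down as a small data model, which makes it easy to see what changes between them. This is a sketch for illustration only: the `Nic` class, the `vlans` helper and the label "prod" for the native prod vlan are assumptions, while the NIC names, switch types and the cloud-hosts / cloud-private vlan names come from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Nic:
    name: str
    switch: str                       # "asw" or "cloudsw"
    native_vlan: Optional[str]        # untagged vlan, if any
    tagged_vlans: List[str] = field(default_factory=list)


# Today: two NICs, prod traffic on asw, cloud-private tagged on cloudsw.
TODAY = [
    Nic("eth0", "asw", native_vlan="prod"),
    Nic("eth1", "cloudsw", native_vlan=None, tagged_vlans=["cloud-private"]),
]

# Tomorrow: a single NIC on cloudsw carrying both vlans.
TOMORROW = [
    Nic("eth1", "cloudsw", native_vlan="cloud-hosts", tagged_vlans=["cloud-private"]),
]


def vlans(nics: List[Nic]) -> Set[str]:
    """Return every vlan (native or tagged) carried by the given NICs."""
    result: Set[str] = set()
    for nic in nics:
        if nic.native_vlan is not None:
            result.add(nic.native_vlan)
        result.update(nic.tagged_vlans)
    return result


if __name__ == "__main__":
    print("dropped:", vlans(TODAY) - vlans(TOMORROW))
    print("added:", vlans(TOMORROW) - vlans(TODAY))
```

The cloud-private tagged vlan on eth1 is the only element present in both states, which is why configuring it as tagged from the start keeps the later migration simple.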

The transitional state between today and tomorrow is not clear to me, in particular whether they should have an intermediate step with addresses from 10.0. I don't really care, since the moment we introduce the BGP VIPs and the cloud-private vlans and reconfigure puppet, all meaningful cloud "customer" traffic will flow over that secondary NIC, and the other will just be used for SSH management/puppet/monitoring and the rest of prod traffic.

Does that make sense?

Change 919363 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: add cloudservices200[4/5]-dev cloud-private address

https://gerrit.wikimedia.org/r/919363

@aborrero in general that makes sense yes.

"if we are annoyed by using eth1 instead of eth0, we could recable them and reimage."

I think we should do this, yep. To keep everything clean and ensure we have a proper blueprint for these servers to provision them from scratch.

For now let me have a look at adding the links to Netbox, allocating IPs etc., and configuring the switch side ports for these hosts. I'll catch up with you when done and you can let me know if it looks ok.

@aborrero set up now, let me know if this looks correct:

cloudservices2004-dev: [image.png]

cloudservices2005-dev: [image.png]

cloudweb2002-dev: [image.png]

I've pushed the switch-code config to match that for now anyway so you should be able to test. Ports are up and connected.

Change 919352 merged by Andrew Bogott:

[operations/puppet@production] cloudservices: codfw1dev: enable cloud-private subnet

https://gerrit.wikimedia.org/r/919352

Couple of niggles getting this going on the host side, but I understand from the irc discussions that the above setup is indeed what you had in mind @aborrero. Both cloudservices hosts now have a working link to cloud-private vlan off eno2.

I haven't dug much, but designate is currently failing on cloudservices200[45]-dev because the services on those hosts are unable to contact mysql on cloudcontrols:

2023-06-01 23:56:09.549 561935 WARNING oslo_db.sqlalchemy.engines [None req-23c62696-f4f1-43d6-bb83-678685a18cbd - - - all - -] SQL connection failed. 10 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'openstack.codfw1dev.wikimediacloud.org' (timed out)")
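A timeout like the one in this log can be checked independently of designate with a plain TCP connection attempt. The sketch below is one way to do that, not the procedure used on this task: `can_connect` is a hypothetical helper, and port 3306 is the MySQL default, assumed here because the log line does not state a port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hostname taken from the log line; the port is an assumption (MySQL default).
    print(can_connect("openstack.codfw1dev.wikimediacloud.org", 3306))
```

Running this from the affected host helps separate a network path problem from a database or credentials problem, since it never speaks the MySQL protocol.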

Should be fixed now as part of T336808: Inconsistent connectivity between cloudservices200[45]-dev and codfw1dev cloudcontrols