Decision request - Choose a subdomain for new cloud-private subnets
Closed, Resolved (Public)

Description

Problem

As part of the latest network isolation project, new per-rack VLAN subnets called cloud-private have been allocated. See https://phabricator.wikimedia.org/T324992#8671971.

The data is copied here for reference:

"supernet": 172.20.0.0/16

Vlan Name | Vlan ID | Subnet
cloud-private-c8-eqiad | 1151 | 172.20.1.0/24
cloud-private-d5-eqiad | 1152 | 172.20.2.0/24
cloud-private-e4-eqiad | 1153 | 172.20.3.0/24
cloud-private-f4-eqiad | 1154 | 172.20.4.0/24
cloud-private-b1-codfw | 2151 | 172.20.5.0/24
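
As a rough illustration of how the allocation above could be used, the following Python sketch maps an address to its cloud-private VLAN and checks that each per-rack subnet sits inside the supernet. The dictionary layout and the function name vlan_for_address are assumptions made for this example, not part of any existing tooling.

```python
import ipaddress

# Supernet and per-rack subnets, copied from the allocation table.
SUPERNET = ipaddress.ip_network("172.20.0.0/16")

# VLAN name -> (VLAN ID, subnet). The dict layout is illustrative only.
CLOUD_PRIVATE_VLANS = {
    "cloud-private-c8-eqiad": (1151, ipaddress.ip_network("172.20.1.0/24")),
    "cloud-private-d5-eqiad": (1152, ipaddress.ip_network("172.20.2.0/24")),
    "cloud-private-e4-eqiad": (1153, ipaddress.ip_network("172.20.3.0/24")),
    "cloud-private-f4-eqiad": (1154, ipaddress.ip_network("172.20.4.0/24")),
    "cloud-private-b1-codfw": (2151, ipaddress.ip_network("172.20.5.0/24")),
}


def vlan_for_address(address):
    """Return the VLAN name whose subnet contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for name, (_vlan_id, subnet) in CLOUD_PRIVATE_VLANS.items():
        if ip in subnet:
            return name
    return None


def all_subnets_in_supernet():
    """Check that every per-rack subnet is contained in the supernet."""
    return all(
        subnet.subnet_of(SUPERNET)
        for _vlan_id, subnet in CLOUD_PRIVATE_VLANS.values()
    )
```

For instance, a hypothetical host address such as 172.20.5.10 would resolve to cloud-private-b1-codfw, while a 10.x.y.z address would return None.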

These new IP addresses will be allocated and assigned per physical hardware host, in parallel to the traditional 10.x.y.z addresses that we know and love for ssh/puppet/management, etc.
The 10.x.y.z addresses use the <datacenter>.wmnet naming, but since these new addresses are considered natively cloud realm (even though not virtual), we won't be using wmnet.

This decision request is to decide on the subdomain to use for them.

Our current "policy" for domain names is at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS, which should be updated with the results of this decision.
As of today, the policy suggests that the domain we should use is wikimedia.cloud, because it replaced eqiad.wmflabs, which was the cloud counterpart to eqiad.wmnet.

Constraints and risks

  • Make sure whatever domain we use makes it clear that they are HW servers and not virtual machines.
  • This won't really be exposed to end-users/customers, so we have a bit more freedom to pick one option and have second thoughts a couple of years later.
  • We already have some precedents in enwiki.analytics.db.svc.wikimedia.cloud FQDNs. They use the svc subdomain. That subdomain is not a good fit for this case, since these aren't service IP addresses.
  • The chosen subdomain must be hosted by wikiland DNS servers to avoid chicken-and-egg problems (the domain being unavailable because the cloud is down, while the config of some core cloud service relies on the FQDNs for startup).

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T332191_subdomain_for_new_cloud_private_subnets

Options

Option 1

Use <dc>.wikimedia.cloud.

Examples:

  • cloudcontrol1003.eqiad.wikimedia.cloud
  • cloudlb2001-dev.codfw.wikimedia.cloud

Pros:

  • Simple and straightforward 'mirror' of the <dc>.wmnet scheme.

Cons:

  • in some cases may be too similar to VM FQDNs, like whatever.project.eqiad1.wikimedia.cloud.

Option 2

Use <dc>.hw.wikimedia.cloud.

Examples:

  • cloudcontrol1003.eqiad.hw.wikimedia.cloud
  • cloudlb2001-dev.codfw.hw.wikimedia.cloud

Pros:

  • The explicit hw keyword (meaning: hardware) should help make clear that this is an IP on a hardware host and not on a virtual machine.

Cons:

  • Slightly longer to type.

Option 3

Use <vlan>.wikimedia.cloud

Examples:

  • cloudcontrol1003.cloud-private-c8-eqiad.wikimedia.cloud
  • cloudlb2001-dev.cloud-private-b1-codfw.wikimedia.cloud

Pros:

  • Extra clear what this is about, as it hardcodes in an explicit fashion the DC, the rack and the vlan name.

Cons:

  • Long and complex to type.
  • If a host is relocated into a different rack, the FQDN will need to be updated, making them less time-stable than other options.

Option 3bis

Use <vlan-shortname>.<dc>.wikimedia.cloud

Examples:

  • cloudcontrol1003.private.eqiad.wikimedia.cloud
  • cloudlb2001-dev.private.codfw.wikimedia.cloud

Pros:

  • Extra clear what this is about, as it hardcodes in an explicit fashion the DC, and the vlan [short] name.

Cons:

  • none!
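
To compare the four naming schemes side by side, here is a minimal Python sketch that builds an FQDN under each option. The function names and the BASE_DOMAIN constant are invented for this illustration; the label layout is taken from the examples listed under each option.

```python
BASE_DOMAIN = "wikimedia.cloud"


def fqdn_option1(host, dc):
    """Option 1: <host>.<dc>.wikimedia.cloud"""
    return f"{host}.{dc}.{BASE_DOMAIN}"


def fqdn_option2(host, dc):
    """Option 2: <host>.<dc>.hw.wikimedia.cloud"""
    return f"{host}.{dc}.hw.{BASE_DOMAIN}"


def fqdn_option3(host, vlan):
    """Option 3: <host>.<vlan>.wikimedia.cloud (full VLAN name, includes the rack)."""
    return f"{host}.{vlan}.{BASE_DOMAIN}"


def fqdn_option3bis(host, dc, vlan_shortname="private"):
    """Option 3bis: <host>.<vlan-shortname>.<dc>.wikimedia.cloud"""
    return f"{host}.{vlan_shortname}.{dc}.{BASE_DOMAIN}"


if __name__ == "__main__":
    # Print the same host under every scheme for a visual comparison.
    print(fqdn_option1("cloudcontrol1003", "eqiad"))
    print(fqdn_option2("cloudcontrol1003", "eqiad"))
    print(fqdn_option3("cloudcontrol1003", "cloud-private-c8-eqiad"))
    print(fqdn_option3bis("cloudcontrol1003", "eqiad"))
```

Only option 3 takes the full VLAN name as input, which is why it is the only scheme whose output changes when a host moves to a different rack.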

Event Timeline

aborrero triaged this task as Medium priority. Mar 15 2023, 3:37 PM
aborrero updated the task description.

I have a question: will these be shared in https://config-master.wikimedia.org/known_hosts.ecdsa?
I use https://pypi.org/project/wm-ssh/, and it uses that URL to fetch lists of hosts. Just curious.

For ssh management from our laptops we are expected to use eqiad.wmnet addresses, i.e., the wikiland-production realm connection.

In that regard, I don't expect there will be any ssh-related machinery for the new addresses, including known_hosts entries, fingerprints or similar.

Nice, so it's not a problem for it in any case.

I don't much like option 3 because it will be hard for me to remember. 1 and 2 are both fine with me, I'm leaning towards #1, since it's still distinct from VMs which are eqiad1/codfw1dev.

Between option 1 and 2, I prefer option 2 because it makes it explicit it's a hardware host and not a VM. But I can live with option 1 as well.

Given that those hostnames are not meant to be used anywhere except in one-time-set configs (if I understood correctly), and there's already a coupling between the IP and the vlan, I think that the more explicit the better, so I vote for 3 (weak vote though).

You can always dig +short -x <ip> to find out the name :)
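
For reference, dig -x asks for the PTR record under in-addr.arpa. A small Python sketch of the same idea follows; the helper names are made up for this example, and reverse_lookup only returns something useful if a PTR record actually exists for the address.

```python
import ipaddress
import socket


def ptr_name(address):
    """Return the reverse-DNS name that `dig -x <ip>` would query."""
    return ipaddress.ip_address(address).reverse_pointer


def reverse_lookup(address):
    """Resolve the PTR record for `address`, or return None if there is none.

    This needs working DNS resolution, so the result depends on the network
    the code runs from.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
    except OSError:
        return None
    return hostname
```

For a hypothetical address 172.20.5.7, ptr_name returns 7.5.20.172.in-addr.arpa.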

Just noticed that option #3 implies a FQDN rename when a host is relocated between racks. I don't like that :-(

I'd prefer going with options #1 or #2. Actually #2 since that would create additional consensus.

At the WMCS team meeting we decided to go with option #2.

Change 914310 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses

https://gerrit.wikimedia.org/r/914310

Change 914310 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses

https://gerrit.wikimedia.org/r/914310

I realize I'm quite late here, but hw seems problematic in case we ever have VMs (for example in Ganeti) that need addresses and names like these.

Agreed, see my comment on https://gerrit.wikimedia.org/r/c/operations/dns/+/914310/comments/62a641fd_5d4434a9
and on IRC:

yeah I think it's a bad idea to hardcode a server type in a "zone" name. To me matching the vlan is more appropriate

I'd suggest 3bis:

cloudlb2001-dev.private.codfw.wikimedia.cloud

  • Closer to the prod notation sretest1001.mgmt.eqiad.wmnet
  • Much shorter than option 3: no need to go as deep in the details as cloud-private-b1-codfw (cloud is assumed with the ending .cloud), site is in the FQDN (like in .hw.), rack info is not needed
  • No need to rename if the host is moved
  • More specific to less specific (host -> vlan -> DC)

Would option #1 work for you @ayounsi ? It was the second most voted option when this was decided.

I prefer my later suggestion over option 1, as it's more explicit. For example, in prod we have the analytics zone hosts with the same FQDN as private hosts, which adds confusion.
On the other hand, we could consider this zone as the "default" one (and go with option 1). Then we would only require an extra distinguisher for future zones if we add more of them.

Could you clarify what this <zone> thingy is? It is not clear to me. Do you think it can be of use for us in the future?

My current concern is encoding a private keyword in there that feels a bit redundant. All addresses in the wikimedia.cloud domain are explicitly private already.
The point of having (some) keyword was to help us differentiate from Cloud VPS virtual machines, but they encode the project name already.

I mean, with examples, in codfw:

  • cloudcontrol2003-dev.private.codfw.wikimedia.cloud <-- a hardware server
  • cloudcontrol2003-dev.someproject.codfw1dev.wikimedia.cloud <-- a virtual machine private address
  • cloudcontrol2003-dev.codfw.wikimedia.cloud <-- a hardware server

in eqiad:

  • cloudlb1001.private.eqiad.wikimedia.cloud <-- a hardware server
  • cloudlb1001.someproject.eqiad1.wikimedia.cloud <-- a virtual machine private address
  • cloudlb1001.eqiad.wikimedia.cloud <-- a hardware server

I actually find it clearer without any additional keyword. It is more visually different, perhaps because there is one fewer subdomain.
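
To make the comparison above concrete, here is a rough Python sketch that tells the three shapes apart by their labels. The function name classify_fqdn and the two sets are assumptions for illustration; the deployment names eqiad1 and codfw1dev and the private keyword come from the examples in this task.

```python
# Label sets are illustrative and only cover the names used in this task.
DATACENTERS = {"eqiad", "codfw"}
VM_DEPLOYMENTS = {"eqiad1", "codfw1dev"}


def classify_fqdn(fqdn):
    """Classify a wikimedia.cloud FQDN as 'hardware', 'vm' or 'unknown'."""
    labels = fqdn.rstrip(".").split(".")
    if labels[-2:] != ["wikimedia", "cloud"]:
        return "unknown"
    # Labels between the hostname and the base domain.
    middle = labels[1:-2]
    # <host>.<dc>.wikimedia.cloud
    if len(middle) == 1 and middle[0] in DATACENTERS:
        return "hardware"
    # <host>.private.<dc>.wikimedia.cloud
    if len(middle) == 2 and middle[0] == "private" and middle[1] in DATACENTERS:
        return "hardware"
    # <host>.<project>.<deployment>.wikimedia.cloud
    if len(middle) == 2 and middle[1] in VM_DEPLOYMENTS:
        return "vm"
    return "unknown"
```

Both hardware shapes stay distinguishable from VM names because the VM deployment label (eqiad1, codfw1dev) never equals a plain datacenter label.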

Zone here means a virtual organisation of hosts, in that case it's the vlan "cloud-private". It's like we have .mgmt. or closer to option 3 .hw..

My current concern is encoding a private keyword in there that feels a bit redundant. All addresses in the wikimedia.cloud domain are explicitly private already.

Yeah that's what I tried to say above by considering it as default, and thus not needing this extra identifier. It's redundant for now, but if we add some other vlans later on, hosts there will need a way to be distinguished from the ones in the private vlan. But it can be an issue for later.

I think I see now what you mean. The private keyword does not speak about the nature of the IPv4 addressing, but about the VLAN name (which happens to have private addressing coincidentally).

Ok, going with that!

aborrero updated the task description.

Marking this as resolved, option 3bis is documented and marked as taken!