
Decision request - Toolforge external infrastructure domain usage
Closed, Resolved, Public

Description

Problem

DNS names for Toolforge infrastructure services are in inconsistent DNS zones (examples: mail.tools.wmflabs.org., k8s.tools.eqiad1.wikimedia.cloud., jobs.svc.tools.eqiad1.wikimedia.cloud., login.toolforge.org., tools-redis., tools-prometheus.wmflabs.org.) and conflict with names of actual tools hosted on tool-name.toolforge.org (examples: login.toolforge.org.).
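To make the inconsistency concrete, here is a minimal Python sketch (a hypothetical helper, not existing Toolforge tooling) that groups the fully qualified example names above by their last two labels. Taking the last two labels is a simplification that ignores the Public Suffix List, and the unqualified name tools-redis. is left out.

```python
from collections import defaultdict

# Fully qualified examples from the task description. tools-redis. is
# unqualified, so it is omitted here.
EXAMPLES = [
    "mail.tools.wmflabs.org.",
    "k8s.tools.eqiad1.wikimedia.cloud.",
    "jobs.svc.tools.eqiad1.wikimedia.cloud.",
    "login.toolforge.org.",
    "tools-prometheus.wmflabs.org.",
]


def base_domain(fqdn: str) -> str:
    """Return the last two labels, e.g. 'wmflabs.org'.

    Simplification: this does not consult the Public Suffix List.
    """
    return ".".join(fqdn.rstrip(".").split(".")[-2:])


def group_by_base_domain(names: list[str]) -> dict[str, list[str]]:
    """Group names by base domain, keeping input order within each group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[base_domain(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for domain, names in group_by_base_domain(EXAMPLES).items():
        print(f"{domain}: {', '.join(names)}")
```

Even this short list spreads one project's infrastructure across three base domains (wmflabs.org, wikimedia.cloud and toolforge.org).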

We've mostly settled on using svc.$PROJECT.$DEPLOYMENT.wikimedia.cloud for services that are not exposed to the public internet, but the question of public names remains. New public services are rarer, and the only recent example (Harbor) used a Cloud VPS web proxy. However, that does not work for some use cases.
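The internal scheme is a fixed template. A minimal sketch, assuming each component is a single lowercase DNS label (the helper name internal_service_name is hypothetical, and it only formats a string; it does not touch DNS). With service jobs, project tools and deployment eqiad1 it reproduces jobs.svc.tools.eqiad1.wikimedia.cloud, one of the example names above.

```python
import re

# One DNS label (RFC 1123 style), lowercase by convention: letters, digits
# and hyphens, 1-63 characters, not starting or ending with a hyphen.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def internal_service_name(service: str, project: str, deployment: str) -> str:
    """Build <service>.svc.<project>.<deployment>.wikimedia.cloud (hypothetical helper)."""
    for label in (service, project, deployment):
        if not _LABEL.fullmatch(label):
            raise ValueError(f"not a valid single DNS label: {label!r}")
    return f"{service}.svc.{project}.{deployment}.wikimedia.cloud"


print(internal_service_name("jobs", "tools", "eqiad1"))
# jobs.svc.tools.eqiad1.wikimedia.cloud
```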

So in short, this decision request aims to decide which name to use for services that:

  • May be user-facing (seen by tool maintainers, and possibly end users)
  • Might not be able to use the shared Cloud VPS web proxy
    • For example due to custom TLS termination needs (example: API authentication)
    • Or due to specific load balancing / traffic needs (this is a risk for Harbor and static/object storage services)
    • Or maybe they're not using HTTP at all! (example: SSH access)
  • May be hosted in Kubernetes or on VMs, but are not running via tool accounts
    • For tool accounts *.toolforge.org will remain the only supported scheme for now.
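The scope criteria above can be read as a small decision function. This is a hypothetical sketch: the Service fields and the returned labels are illustrative, and it treats "cannot use the shared web proxy" as the trigger for a dedicated public name, which is one interpretation of the list rather than a rule stated in this task.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical description of an infrastructure service."""
    name: str
    public: bool                         # reachable from the public internet
    runs_as_tool_account: bool = False
    custom_tls: bool = False             # e.g. TLS termination for API authentication
    special_traffic_needs: bool = False  # load balancing / traffic constraints
    uses_http: bool = True               # False for e.g. SSH


def naming_scheme(svc: Service) -> str:
    """Return which naming scheme applies, under one reading of the criteria."""
    if svc.runs_as_tool_account:
        return "<tool>.toolforge.org"
    if not svc.public:
        return "<service>.svc.<project>.<deployment>.wikimedia.cloud"
    if svc.custom_tls or svc.special_traffic_needs or not svc.uses_http:
        return "dedicated public name (the subject of this decision)"
    return "shared Cloud VPS web proxy"


if __name__ == "__main__":
    ssh = Service("login", public=True, uses_http=False)
    print(naming_scheme(ssh))  # dedicated public name (the subject of this decision)
```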

Apart from tools-*.wmcloud.org, none of these really have any technical advantages or disadvantages: the services we're talking about here can't easily re-use any of our existing infrastructure and they're all controlled in Designate in the tools project so we can issue acme-chief certs for them.

An example task that is affected by this decision would be T332476: Toolforge: expose API gateway to the internet.

Also, just to make it clear: I'm not proposing an immediate renaming of all existing services. This is to make it clear what to use for new services; existing services can stay as is. (However, if they're getting major updates that would require changes to the network ingress setup or naming anyway, the people doing that should consider using the latest scheme. An example of this was the latest Redis migration, which moved Redis under the internal svc.tools.eqiad1.wikimedia.cloud domain.)

For additional context on the different domains we have, see: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS

Constraints and risks

  • Inconsistency in infrastructure
  • Confusion between infrastructure and user-created tools, also maybe a phishing risk
  • Use of deprecated terms like 'labs'

Decision record

coming soon

Options

Note that toolsbeta already exists. We already use *.beta.toolforge.org. for toolsbeta webservices, so if we end up choosing something under toolforge.org, the author assumes we will replicate the same under beta.toolforge.org for toolsbeta.
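Mirroring a toolforge.org scheme for toolsbeta amounts to swapping the base domain. A minimal sketch, assuming the production and toolsbeta base domains named above; the helper, the service name api and the internal label (standing in for whichever subdomain gets chosen) are all hypothetical.

```python
# Base domains for tool webservices, per deployment: toolsbeta webservices
# already live under beta.toolforge.org.
TOOL_BASE_DOMAIN = {
    "tools": "toolforge.org",
    "toolsbeta": "beta.toolforge.org",
}


def public_infra_name(service: str, subdomain: str, project: str) -> str:
    """Build <service>.<subdomain>.<base domain> for the given deployment."""
    try:
        base = TOOL_BASE_DOMAIN[project]
    except KeyError:
        raise ValueError(f"unknown project: {project!r}") from None
    return f"{service}.{subdomain}.{base}"


for project in ("tools", "toolsbeta"):
    print(public_infra_name("api", "internal", project))
# api.internal.toolforge.org
# api.internal.beta.toolforge.org
```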

A. Use something like *.internal.toolforge.org

This is the solution preferred by the author. It makes it clear that it's about Toolforge (since it's on toolforge.org) and that it's an admin-managed service and not a user-created tool (since it does not share the same namespace).

Alternative but very similar options:

  • *.admin.toolforge.org
  • *.infra.toolforge.org
  • *.mgmt.toolforge.org
  • *.svc.toolforge.org
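Any of these options only stays collision-free if the chosen label is never handed out as a tool name, since a tool called, say, svc would otherwise be served at svc.toolforge.org. A hypothetical helper that reports which tool name an infrastructure name would occupy or sit under; how tool-name reservation is actually enforced in Toolforge is not covered here.

```python
from typing import Optional


def shadowed_tool_name(fqdn: str, tool_zone: str = "toolforge.org") -> Optional[str]:
    """Return the tool name whose <tool>.<tool_zone> address `fqdn` equals or sits under.

    Returns None when `fqdn` is outside `tool_zone` (or is the zone apex itself).
    """
    name = fqdn.rstrip(".").lower()
    suffix = "." + tool_zone
    if not name.endswith(suffix):
        return None
    # The label immediately left of the zone is the one a tool of that name would use.
    return name[: -len(suffix)].split(".")[-1]


print(shadowed_tool_name("harbor.svc.toolforge.org"))  # svc
print(shadowed_tool_name("login.toolforge.org."))      # login
print(shadowed_tool_name("deb-tools.wmcloud.org"))     # None
```

For toolsbeta, passing tool_zone="beta.toolforge.org" gives the equivalent answer for that deployment.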

B. Use *.tools.wmcloud.org

This also follows the standard Cloud VPS model, but it means that possibly user-facing addresses will have inconsistency ('tools' vs 'toolforge', different domain).

C. Use *.toolforge.org

This is a simple option but means that the same naming scheme is used for both infrastructure and tools.

D. Use tools-*.wmcloud.org

This is an option, but mostly for services that will use the shared Cloud VPS web proxy. For other use cases *.tools.wmcloud.org is vastly superior, as that domain is already delegated to the tools project.

Event Timeline

I think that the subdomain for toolforge makes more sense (and is easier to remember, at least for me): *.infra.toolforge.org

No strong opinion, except avoiding *.toolforge.org (to avoid collisions) or otherwise changing the user tools domain (*.user.toolforge.org, though I prefer us using the more complex domain).

Unless I misunderstand the question, it seems like the obvious answer is:

*.svc.tools.eqiad1.wikimedia.cloud for internal things (VMs talking to VMs)

*.toolforge.org for public-facing things (e.g. login.toolforge.org)

That's approximately how we manage things in other projects... is this for a use case that's different from those?

> Unless I misunderstand the question, it seems like the obvious answer is:

I have two things I want from this task:

  • A proper decision on public-facing things (I'm not a huge fan of *.toolforge.org, which is why I want to discuss that before implementing anything)
  • *.toolforge.org equivalent for toolsbeta
dcaro renamed this task from Decision request template - Toolforge infrastructure domain usage to Decision request - Toolforge infrastructure domain usage. (May 26 2023, 7:52 AM)
taavi renamed this task from Decision request - Toolforge infrastructure domain usage to Decision request - Toolforge external infrastructure domain usage. (Jul 5 2023, 12:48 PM)
taavi updated the task description.

I'm still interested in getting something decided here for public names so I can move forward with T332476: Toolforge: expose API gateway to the internet. I've reworked this task a bit, although I'm not sure whether this should be a standalone decision request or whether I should combine it with a more general discussion on how we want to do public ingress for infra services[0]. Thoughts welcome.

[0]: How much can/should we rely on the shared web proxy? What about the services that can't use it?

@taavi we can follow the decision making process for this, that will get you both exposure and a resolution https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team#Decision_Making

If you want to do so, please add it to the weekly notes so people see it, and set a date for in-task resolution or for scheduling a decision meeting.

Besides that, some opinions :)

> A proper decision on public-facing things (I'm not a huge fan of *.toolforge.org, which is why I want to discuss that before implementing anything)

For public-facing things, maybe we should get a new subdomain?
Something like *.admin.toolforge.org?
We could also reuse admin.toolforge.org in place of login.toolforge.org for SSH.

> *.toolforge.org equivalent for toolsbeta

I would strongly prefer a new domain for toolsbeta, like toolsbeta.org or toolforgebeta.org or similar, instead of reusing toolforge.org for anything there.
That gives a more similar deployment and avoids conflicts, confusion and further coupling between the deployments.

Deadline for in-task decision is 12th of March

I'm perfectly happy with *.internal.toolforge.org or *.infra.toolforge.org, which seems to be what Taavi prefers as well :)

I assume that this is only for public IPv4 addresses (185.15.x.y, etc.), right? I guess for private addresses we will keep using svc.tools.eqiad1.wikimedia.cloud.

My preference would be .svc.toolforge.org, which is the shortest and reuses the svc keyword that we are already familiar with.
For toolsbeta, I'm fine with either .svc.beta.toolforge.org or trying to acquire a new domain (toolsbeta.org, toolforgebeta.org, etc.) as David suggested (which, on the other hand, is the more expensive option).
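The split suggested here (public addresses get a toolforge.org name, private ones keep the internal zone) can be sketched with Python's standard ipaddress module. The function name and the example addresses are hypothetical, and the sketch covers the tools project only. Note that ipaddress counts RFC 1918 ranges as private, along with some other reserved ranges such as loopback.

```python
import ipaddress


def name_for(service: str, address: str) -> str:
    """Pick a name for `service` based on whether `address` is private or public."""
    ip = ipaddress.ip_address(address)
    if ip.is_private:
        # Internal-only service: keep the existing internal zone.
        return f"{service}.svc.tools.eqiad1.wikimedia.cloud"
    # Publicly reachable service: use the proposed public scheme.
    return f"{service}.svc.toolforge.org"


print(name_for("api", "185.15.56.1"))    # api.svc.toolforge.org
print(name_for("redis", "172.16.0.10"))  # redis.svc.tools.eqiad1.wikimedia.cloud
```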

Also, as part of this decision request, I would suggest we:

  • add support for custom domains in nova-proxy, so for example we could create whatever.svc.toolforge.org HTTPS endpoints reusing the same infra.
  • schedule a migration of already present and (soon to be) inconsistent FQDNs to the new scheme. Examples:
    • deb-tools.wmcloud.org to deb.svc.toolforge.org
    • tools-harbor.wmcloud.org to harbor.svc.toolforge.org
    • nfs-tools.wmcloud.org to nfs.svc.toolforge.org
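Scheduling such a migration mostly means finding and rewriting references to the old names. A minimal sketch of that rewrite step, using only the three example mappings listed above; the function is hypothetical and only rewrites text, it does not create DNS records or certificates.

```python
import re

# Old -> new names, taken from the migration examples above.
MIGRATIONS = {
    "deb-tools.wmcloud.org": "deb.svc.toolforge.org",
    "tools-harbor.wmcloud.org": "harbor.svc.toolforge.org",
    "nfs-tools.wmcloud.org": "nfs.svc.toolforge.org",
}

# Match whole hostnames only: not preceded by a hostname character, and not
# followed by another hostname character or by ".<label>" (a longer name).
_OLD_NAMES = re.compile(
    r"(?<![\w.-])("
    + "|".join(re.escape(old) for old in sorted(MIGRATIONS, key=len, reverse=True))
    + r")(?![\w-]|\.\w)"
)


def rewrite_hostnames(text: str) -> str:
    """Replace every old hostname in `text` with its new svc.toolforge.org name."""
    return _OLD_NAMES.sub(lambda m: MIGRATIONS[m.group(1)], text)


print(rewrite_hostnames("mirror = https://deb-tools.wmcloud.org/"))
# mirror = https://deb.svc.toolforge.org/
```

The boundary checks keep unrelated names that merely contain an old hostname (for example mydeb-tools.wmcloud.org) untouched.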

Ok, so it seems like we agree that we want to go for the [something].toolforge.org route, and svc.toolforge.org is the only subdomain name that got an explicit vote. So, unless anyone objects to using svc in tomorrow's Toolforge monthly meeting, I think we can declare that the chosen option.

>   • add support for custom domains in nova-proxy, so for example we could create whatever.svc.toolforge.org HTTPS endpoints reusing the same infra.
>   • schedule a migration of already present and (soon to be) inconsistent FQDNs to the new scheme. Examples:

I agree we should do that, but I don't want to block this decision on implementing it. I will research what that would take on the dynamicproxy infra and will file a separate task for implementing support.

taavi claimed this task.

There have been no objections to the .svc.toolforge.org name, so I'm declaring that the accepted proposal. I will follow up on the novaproxy support for that name in T342398 once the grid engine proxy is gone (which will make that code much simpler to work with).

Change 1011273 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::services: use deb.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011273

Change 1011274 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: maintain-dbusers: use nfs.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011274

Change 1011275 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: use prometheus.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011275

Change 1011276 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/docker-images/toollabs-images@master] Use deb.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011276

Change 1011273 merged by Majavah:

[operations/puppet@production] P:wmcs::services: use deb.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011273

Change 1011276 merged by jenkins-bot:

[operations/docker-images/toollabs-images@master] Use deb.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011276

Change 1011274 merged by Majavah:

[operations/puppet@production] hieradata: maintain-dbusers: use nfs.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011274

Change 1011275 merged by Majavah:

[operations/puppet@production] P:toolforge: use prometheus.svc.toolforge.org

https://gerrit.wikimedia.org/r/1011275