cloudcontrol: decide on FQDN for service endpoints
Closed, Resolved, Public

Description

Current openstack endpoints all point to the existing active cloudcontrol server:

aborrero@cloudcontrol1004:~ 3s $ sudo wmcs-openstack endpoint list | grep public
| 0a1eb902933c4652ad41e6450fe436ee | eqiad1-r | neutron      | network      | True    | public    | http://cloudcontrol1003.wikimedia.org:9696                             |
| 4578c49346db479ab6d5b7961af8f60a | eqiad1-r | nova         | compute      | True    | public    | http://cloudcontrol1003.wikimedia.org:8774/v2.1                        |
| 465f331e03de4bfcbef1f8a7dbb4de15 | eqiad1-r | designate    | dns          | True    | public    | http://cloudservices1003.wikimedia.org:9001                            |
| be7a84a1af114f94bd1d6cc48b374413 | eqiad1-r | keystone     | identity     | True    | public    | http://cloudcontrol1003.wikimedia.org:5000/v3                          |
| ebf523fce03a4ff4b859277f0a3d2477 | eqiad1-r | proxy        | proxy        | True    | public    | http://proxy-eqiad1.wmflabs.org:5668/dynamicproxy-api/v1/$(tenant_id)s |
| f922afd9417448028ce02734e0420a0b | eqiad1-r | glance       | image        | True    | public    | http://cloudcontrol1003.wikimedia.org:9292                             |

We aim to replace the endpoints with something more generic, ideally introducing our new domain. Some options:

  • eqiad1.cloudvps.svc.wmcloud.org
  • eqiad1.openstack.svc.wmcloud.org
  • eqiad1.cloudvps.wmcloud.org
  • eqiad1.openstack.wmcloud.org
  • eqiad1.api.wmcloud.org

It's worth mentioning that perhaps this domain should be hosted outside of openstack (designate) so we avoid chicken-and-egg problems (example: we can't bootstrap the deployment because we can't resolve the FQDN).

If not using the new domain, some options:

  • eqiad1.cloudvps.wikimedia.org
  • eqiad1.openstack.wikimedia.org
  • eqiad1.api.cloudvps.wikimedia.org
  • eqiad1.svc.cloudvps.wikimedia.org
  • api.eqiad1.cloudvps.wikimedia.org

Once we decide on a FQDN, we should evaluate how to introduce the endpoints to external proxy/balancers (probably something similar to prod setup).

NOTE: we could introduce this in codfw1dev first for testing purposes.

Event Timeline

aborrero triaged this task as High priority. May 20 2019, 12:13 PM
aborrero created this task.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

Marking this as high priority as this is part of this quarter's goal: upgrade openstack to stretch.

Moving to the Needs discussion column in our workboard so we can discuss this in our next team meeting.

aborrero updated the task description. May 20 2019, 12:16 PM

Once we are capable of being multidomain within neutron, we wouldn't want the domain in keystone stuff, would we? Possibly in certain other things? We needed perhaps more separation than in the future because novanetwork wasn't even a thing in the new OpenStack, but plenty is segregated by region, I'm sure. I'm asking because I really don't know and am curious :-D

We have service endpoints for e.g. NFS and clouddb already, right? Do those establish an existing pattern that we should keep following?

Regarding the wmcloud.org domain -- does it make sense to use the same domain for control plane things as for on-VM things? I don't feel strongly but would've expected these endpoints to be .wikimedia.org.

Krenair added a comment. Edited May 20 2019, 2:26 PM

These endpoints should be .wikimedia.org.
Edit: Except proxy

I'm fine too if we name them .wikimedia.org.

aborrero updated the task description. May 21 2019, 11:35 AM

Following this for the NFS maps IP in T209527

aborrero added a subscriber: Vgutierrez.
  • we agree on using 'wikimedia.org'
  • we agree on using 'wmcs.'
  • if the thing is specific to a deployment, use the deployment name 'eqiad1.'
  • Final example: 'keystone.eqiad1.wmcs.wikimedia.org' (<service>.<region>.wmcs.wikimedia.org)
  • Other example: 'xxxx.wmcs.wikimedia.org'? (not deployment specific)
  • Arturo will talk to @Vgutierrez to validate the pattern
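The naming pattern agreed in these notes can be sketched in a few lines of Python. This is only an illustration of the `<service>.<region>.wmcs.wikimedia.org` scheme; the helper name `service_fqdn` and the constant `BASE_DOMAIN` are made up for this example and do not exist in any Wikimedia tooling.

```python
from typing import Optional

# Parent domain agreed in the meeting notes above (hypothetical constant name).
BASE_DOMAIN = "wmcs.wikimedia.org"


def service_fqdn(service: str, deployment: Optional[str] = None) -> str:
    """Build an endpoint name following <service>.<region>.wmcs.wikimedia.org.

    The deployment label (e.g. 'eqiad1') is only included when the
    service is specific to one deployment.
    """
    labels = [service]
    if deployment:
        labels.append(deployment)
    labels.append(BASE_DOMAIN)
    return ".".join(labels)


print(service_fqdn("keystone", "eqiad1"))  # deployment-specific service
print(service_fqdn("xxxx"))                # not deployment specific
```

Keeping the deployment label optional mirrors the two examples in the list: one name per deployment for things like keystone, and a shorter name for anything shared.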

I just noticed the proxy endpoint, that obviously cannot be under wikimedia.org, but the rest should be there.
If you're wondering about SSL certs, all the wikimedia.org ones should be fine. Interestingly, though, right now those are all hard-coded http:// for some reason; we should probably fix that.

so, after a quick check you should consider several things:

  • wikimedia.org is a canonical domain for WMF, everything is expected to use secure TLS settings.
  • if you aim to use the production caching layer, the hostnames must match *.wikimedia.org

Otherwise you could use acme-chief to get valid certificates, but you'll need to run your own TLS termination and caching layers.

Could we avoid SSL in the first iteration?

Something like:

  • introduce the new endpoints, put them in LVS
  • investigate how to introduce SSL to the new endpoints once the previous point is working
  • probably use acme-chief and manage our own TLS termination at local daemon level?

IMHO you should move away from *.wikimedia.org then and use another domain

So, to sum up:

  • we didn't consider SSL when deciding on the new API FQDNs
  • prod SSL certs for wikimedia.org can't cover our new API FQDNs (at least right now), since the cert is for *.wikimedia.org and that doesn't include *.wmcs.wikimedia.org.
  • we could add TLS to our daemons (nova-api, keystone, neutron-server, etc) by means of acme-chief and not reuse prod certificates/caching layer
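The wildcard point in the summary above follows from how certificate name matching works: a `*` stands for exactly one left-most DNS label, so it never spans a dot. A minimal sketch of that rule, assuming the common single-label behaviour (the function name `wildcard_covers` is hypothetical, and real TLS libraries apply further checks):

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if certificate name `pattern` covers `hostname`.

    Simplified rule: '*' is only honoured as the entire left-most label
    and matches exactly one label, so it cannot match across dots.
    """
    p = pattern.lower().rstrip(".").split(".")
    h = hostname.lower().rstrip(".").split(".")
    if len(p) != len(h):
        # Different label counts can never match under the one-label rule.
        return False
    if p[0] == "*":
        return bool(h[0]) and p[1:] == h[1:]
    return p == h


for name in (
    "keystone.wikimedia.org",
    "keystone.wmcs.wikimedia.org",
    "keystone.eqiad1.wmcs.wikimedia.org",
):
    print(name, wildcard_covers("*.wikimedia.org", name))
```

Under this rule only the first name is covered by `*.wikimedia.org`; each extra level of subdomain needs its own certificate subject.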

@Andrew @Bstorm @bd808 any thoughts?

I do think TLS should be on OpenStack service endpoints in general for a lot of reasons. Independent of the FQDN considerations, I strongly think that should factor in, if we can do it. A caching layer would benefit some read-only stuff, but I tend to imagine we'd want openstack stuff to dodge caching anyway since making that kind of API cache-friendly required quite a bit of tweaking and cache tuning the last time I did it elsewhere (and required me compiling in some varnish stuff to make auth work better through it). I generally have to imagine that OpenStack api caching won't look quite like MediaWiki api caching needs--but you never know. This all makes me think avoiding the prod caching layer might save us trouble at the outset.

I am saying all this in relative ignorance to the methods being employed so far for endpoints. Are we plugging into shared load balancer setups where we can terminate TLS wherever we want (like at the real servers), or are we standing up our own LVS/load balancers? If we can terminate at the API endpoints or whatever, then acme-chief seems like a great idea to me if it is compatible with how we are sharing out the endpoints. I recall last I checked we were doing some wsgi workers or something vs running behind apache, if that's correct.

Krenair added a comment. Edited May 23 2019, 5:08 PM

acme-chief seems like a great idea to me if it is compatible with how we are sharing out the endpoints. I recall last I checked we were doing some wsgi workers or something vs running behind apache, if that's correct.

You can just tell acme-chief to get a cert with subjects *.wmcs.wikimedia.org and *.eqiad1.wmcs.wikimedia.org, and which (prod realm) servers to provide access to the cert through puppet. acme-chief doesn't care how you end up using the cert.
acme-chief's job is to a) deal with the ACME (Let's Encrypt) API, b) deal with our DNS servers to prove ownership of the domain, and c) expose the certificates over Puppet's fileserver API.
It's been used with nginx, apache, and exim in prod so far AFAIK. See params in modules/acme_chief/manifests/cert.pp for how to include the cert resource on a host, and the file paths it will use locally, and see the acme-chief config in hieradata/role/common/acme_chief.yaml for how to configure new certificates.

Vgutierrez added a comment. Edited May 23 2019, 5:13 PM

That's right. Also take into account that you can get as many certificates as you need from acme-chief, so maybe you don't need the wildcard one.

However, wikimedia.org is on the HSTS preload lists, and some libraries like python requests enforce it. So TLS is a hard requirement if you go for *.wmcs.wikimedia.org endpoints.

Krenair added a comment. Edited May 23 2019, 5:14 PM

I do think TLS should be on OpenStack service endpoints in general for a lot of reasons. Independent of the FQDN considerations, I strongly think that should factor in, if we can do it. A caching layer would benefit some read-only stuff, but I tend to imagine we'd want openstack stuff to dodge caching anyway since making that kind of API cache-friendly required quite a bit of tweaking and cache tuning the last time I did it elsewhere (and required me compiling in some varnish stuff to make auth work better through it). I generally have to imagine that OpenStack api caching won't look quite like MediaWiki api caching needs--but you never know. This all makes me think avoiding the prod caching layer might save us trouble at the outset.

IIRC prod caching already makes the distinction between misc services (random misc services run in prod such as phabricator) and text (MediaWiki)?

Edit: Or, perhaps used to?

Restricted Application added a project: Operations. May 23 2019, 5:17 PM

so, after a quick check you should consider several things:

  • wikimedia.org is a canonical domain for WMF, everything is expected to use secure TLS settings.
  • if you aim to use the production caching layer, the hostnames must match *.wikimedia.org

Does this mean that ldap-ro.eqiad.wikimedia.org is already violating the TLS policy? It supports both TLS and unencrypted LDAP, and also doesn't match *.wikimedia.org

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496858/

Krenair added a comment. Edited May 23 2019, 8:02 PM

I don't think Valentin is saying it's against policy, I think he's saying that to go through the prod HTTP caching cluster (aka Varnish, which LDAP obviously does not), you need to be using a *.wikimedia.org domain because that's all the cert used by it includes, and it's not so easy to just add on extra certificate subjects there.

Right. That LDAP service certificate is being handled by acme-chief, and as Alex explained, the *.wikimedia.org limitation only affects services that need to use https caching.

Same thing regarding HSTS, it only affects HTTPS services. However, in my opinion enforcing TLS traffic in the LDAP services would be great.

I am fine with changing our proposed names to things like keystone-eqiad1.wikimedia.org or keystone-eqiad1-wmcs.wikimedia.org if that keeps things simple. If that gets us tls for no extra effort then I'm more than fine with it :)

Another thing to consider if we're really talking about using the prod caches is that currently those endpoints are not exposed to the world, only some Wikimedia hosts (certain prod hosts + all of labs). I'm not sure if such restrictions are actually necessary though.

BBlack added a subscriber: BBlack. May 23 2019, 10:07 PM

Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.

The comment about STS-preload is that if you point a browser (or other entities that preload the same STS lists) at anything underneath wikimedia.org, they have hardcoded policies that tell them to only use HTTPS, not unencrypted HTTP. So deploying an unencrypted HTTP service there that an STS-preloading UA (like all modern browsers) is expected to hit probably won't work. Separately, it's also a matter of policy that we require HTTPS (instead of HTTP) on all our canonical domains, including wikimedia.org (and require meeting some minimum standards of implementation: https://wikitech.wikimedia.org/wiki/HTTPS , which includes HSTS and such).
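To make the preload behaviour concrete, here is a rough Python sketch of what an STS-preloading client effectively does before sending a request. It assumes the preloaded entry covers all subdomains, and the `PRELOADED` set and function names are illustrative stand-ins, not a real preload list or library API.

```python
from urllib.parse import urlsplit

# Illustrative stand-in for a preload list; real lists are far longer.
PRELOADED = {"wikimedia.org"}


def is_preloaded(host: str) -> bool:
    """True if `host` or any parent domain is in the preload set
    (assumes the entry applies to subdomains as well)."""
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in PRELOADED for i in range(len(labels)))


def effective_url(url: str) -> str:
    """Return the URL such a client would actually request."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return url
    if not is_preloaded(parts.hostname):
        return url
    # An explicit port 80 becomes the HTTPS default; any other explicit
    # port is kept as-is. Userinfo in the URL is not handled in this sketch.
    netloc = parts.hostname if parts.port == 80 else parts.netloc
    return parts._replace(scheme="https", netloc=netloc).geturl()


print(effective_url("http://keystone.eqiad1.wmcs.wikimedia.org:5000/v3"))
print(effective_url("http://proxy-eqiad1.wmflabs.org:5668/dynamicproxy-api/v1/"))
```

The first URL is silently rewritten to `https://` on the same port, so a daemon that only speaks plain HTTP on that port would simply fail for such a client; the second URL is left alone because its domain is not in the set.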

Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.

Some context, we considered mainly 2 options:

  • wikimedia.org
  • wmcloud.org

The wmcloud.org domain is not used anywhere yet, and we thought of using it mostly for 'virtual' things, like VM addresses, Toolforge, etc, i.e. a domain dynamically managed by Openstack Designate.
The API endpoints we are discussing here live on physical hardware, the prod side of CloudVPS. I don't think we have a strong opinion on using either of them for the API endpoints.

Anyway, we may have HTTPS issues with both options: wikimedia.org because of the subdomains, wmcloud.org because it lacks any of the minimum HTTPS standards you mention.

As @Andrew said, I guess we could live with FQDNs like keystone-eqiad1-wmcs.wikimedia.org, although that seems less elegant :-P

BBlack added a subscriber: faidon. Edited May 24 2019, 12:30 PM

Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!).

I think wikimedia.org as the parent is less-than-ideal. There's already a lot of mixed concerns within that domain as it stands (production server hostnames, actual movement wikis, important meta-things like login.wikimedia.org, and Google-ification of it for our gsuite services, etc, as our canonical WMF email domain), and public (ish) WMCS control endpoints seems like an easy thing to carve off and separate here to avoid further confusion of mental and/or security contexts for various related things.

I'd recommend a whole separate domain for public-facing (even if currently firewalled a bit) cloud services stuff, like wmcloud.org or wikimedia.cloud. Policy is for us to define anyways, so shouldn't be a blocker.

@faidon also recommended that perhaps before making such a weighty decision (about how/where we use some of these valuable and nicely-named domains that are currently empty), there should be a higher-level plan properly decided and documented about all the domain name plans for WMCS (there could be up to 3-ish total public domains involved - which are going to be used for what (instances, public API endpoints like these, possibly the dmz-ish data services in between, etc)).... so we don't end up backtracking on decisions that are difficult to undo.

I'm going to propose it here because the IRC conversation this morning was a bit frantic and some breakfix is happening as well: whether or not this is blocking for some other tasks (since we have other ways around than enabling HA), what if we start using wikimediacloudservices.org instead for infrastructure services that are not inherently public facing? We have the domain name. It's long and unwieldy--and therefore likely better used on things that are contacted by scripts and services than human fingers. It lacks a plan currently, unlike wikimedia.org and wmcloud.org. I wanted to throw it out here for more slow and careful consideration.

My question is @Vgutierrez and @BBlack - does that remove the issues here?

@aborrero and @Andrew does that seem to work from our end? If we keep it internal only for firewall purposes and add TLS as appropriate, perhaps this would work well.

Honestly, wikimediacloudservices.org seems overly long. Just reading that makes me feel lazy :-( If possible, I would use it only for redirecting www. to wikitech or whatever our landing page is :-P

Let me do a quick summary/proposal:

  • wmcloud.org is originally intended to replace wmflabs.org as noted by @bd808 and as noted in the original labs re-branding docs [0]
  • wikimedia.org is, from my point of view, something to avoid, due to the reasons mentioned by @BBlack
  • we have wikimedia.cloud, as pointed out by @faidon which we can use for whatever we need.

The use cases we have are:

  • Toolforge tools and other stuff native to Toolforge: we plan to use toolforge.org for them, as stated in the original docs [0]
  • cloudVPS VMs and other virtual stuff managed by openstack: we plan to use wmcloud.org for them, as stated in the original docs [0]. This domain will be entirely managed by designate.
  • databases, like wikireplicas, toolsdb, etc. They are (or will be soon) all cloudVPS VMs. So they are covered by wmcloud.org already.
  • for Openstack endpoints (the original issue in this ticket) I think we have 2 options:
    • we could use $subdomain.wmcloud.org if this subdomain is not hosted by designate (to avoid chicken-egg problems)
    • we could use wikimedia.cloud, host this domain in prod DNS and dedicate it to services only. Examples: keystone.eqiad1.wikimedia.cloud, nova.codfw1dev.wikimedia.cloud
  • supporting services, like NFS or Ceph. We are in the middle of doing a PoC for this. It's not clear what the final networking topology/layout for this would be. We could probably either:
    • reuse standard prod domains (be it .wmnet or .wikimedia.org)
    • try to introduce them to wikimedia.cloud as well: cloudosd1001.wikimedia.cloud? Perhaps this will be the first case of real HW being assigned anything other than .wmnet|.wikimedia.org.

Worth noting that this is only talking about domain names, and I'm not making any assumption on IPv4/IPv6 private/public addressing or VLANs.

[0] https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Rebranding_Cloud_Services_products#Changing_domain_names

  • we could use $subdomain.wmcloud.org if this subdomain is not hosted by designate (to avoid chicken-egg problems)

wmcloud.org would be hosted by Designate, so we'd need an NS record inside designate for $subdomain to prod gdnsd for it to serve records relating to the prod services supporting openstack.

That cloud rebranding link above also mentions wikimediacloud.org, which is yet another option nobody's exploiting yet. So even without getting into the over-long wikimediacloudservices.org, we have sufficient names to cover all the cases here (feel free to re-arrange, esp the latter two):

  • wmcloud.org - replaces wmflabs.org, hosted directly on Designate, covers everything that's inside instances
  • wikimediacloud.org - Used for the official public API service (never mind that they're not yet fully public, initially) - The ~5 API endpoints in this ticket, and the DNS is hosted by prod DNS and not designate.
  • wikimedia.cloud - Used for supporting services (NFS, Ceph, etc) or any other semi-internal usage that's sitting in some bridging position between prod and wmcs. Probably hosted by prod DNS, but could delegate subdomains to Designate if appropriate.

If we're going to divide things up in that manner it would strike me as a bit weird to have the full purposes of the different domains be indistinguishable from the outside, covered only deep in some docs/comments somewhere.

T224324: LB for cloudelastic is probably related in some way at least when we get to figuring out the DNS name to expose that LB setup to Cloud VPS users as.

Reading the discussion here and in irc earlier today, I think the more general topic of which TLDs we are going to use for which purpose deserves a wiki page and some strawdog proposals that are easier to iterate on as a group than phab comments. I don't think we need to spend ages on finding a 'perfect' solution here, but nothing will collapse if we pick up the discussion again next week and spend a few days poking holes in draft ideas.

ema moved this task from Triage to DNS Names on the Traffic board. Jun 3 2019, 3:11 PM

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/DNS_domain_usage

We had a session about this during the SRE summit. The conclusions were:

  • Use HAProxy instead of trying to get into the LVS pool (This isn't exactly a naming decision but it informed the name choices).
  • openstack.eqiad1.wikimediacloud.org --- for openstack endpoints, and other endpoints like elasticsearch, using different subdomains. This domain will be hosted in prod DNS servers (or at least outside Designate).
  • wmcloud.org --- replaces wmflabs.org. This domain is managed by WMCS/designate.
  • wikimedia.cloud --- replaces .wmflabs. This domain is managed by WMCS/designate.
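As a rough illustration of what these conclusions mean for the endpoint list in the task description, the Python sketch below re-points the current per-host URLs at the agreed name while keeping each service's port and path. It assumes every OpenStack API (including designate, which currently lives on a different host) ends up behind the same `openstack.eqiad1.wikimediacloud.org` name and that the scheme is left untouched; both are assumptions for the example, and `repoint` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Name agreed at the SRE summit session for the eqiad1 deployment.
NEW_HOST = "openstack.eqiad1.wikimediacloud.org"

# Public endpoints copied from the task description (proxy left out,
# since it is not one of the OpenStack API daemons).
CURRENT = {
    "neutron": "http://cloudcontrol1003.wikimedia.org:9696",
    "nova": "http://cloudcontrol1003.wikimedia.org:8774/v2.1",
    "designate": "http://cloudservices1003.wikimedia.org:9001",
    "keystone": "http://cloudcontrol1003.wikimedia.org:5000/v3",
    "glance": "http://cloudcontrol1003.wikimedia.org:9292",
}


def repoint(url: str, new_host: str = NEW_HOST) -> str:
    """Swap only the hostname, preserving scheme, port and path."""
    parts = urlsplit(url)
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return parts._replace(netloc=netloc).geturl()


for service, url in CURRENT.items():
    print(f"{service:10} {repoint(url)}")
```

Because every service keeps its own port, a single name in front of a proxy is enough to tell the APIs apart; whether TLS is added at the same time is a separate question.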

The remaining task here is to make/update a wiki page about this.