
Move various support services for Cloud VPS currently in prod into their own instances
Open, Normal, Public

Description

I assume the Nova API and Neutron themselves can't be moved, and obviously the virt servers can't be. And LDAP, though named for 'labs', has outgrown that and is trusted by prod for various things, so it can't go anywhere.

Here are some potential ones to consider:

  • OpenStack Designate (auth DNS)
  • OpenStack Keystone (OpenStack identity/auth) (see T207536#4735654)
  • OpenStack Glance (instance images)
  • OpenStack Horizon (dashboard) (see T207536#4735633)
  • Wikimedia Striker (toolsadmin) (see T207536#4735633)
  • DB replicas (see T207536#4735648)
  • NFS - there is a wmcs-nfs project already but that is for testing
  • The new Elastic replicas (which haven't been set up yet, see T194186)

I'm not arguing in particular for any individual service on the list to be migrated; feel free to shoot down any that can't or shouldn't be moved in the comments and strike it through in the description. If you think one should be moved, open a subtask and edit it out of this description.


Event Timeline

Krenair created this task. Oct 20 2018, 11:08 AM
Restricted Application added a subscriber: Aklapper. Oct 20 2018, 11:08 AM
Krenair updated the task description. Oct 20 2018, 1:02 PM
Krenair updated the task description. Oct 20 2018, 1:13 PM
faidon added a subscriber: faidon. Oct 20 2018, 1:21 PM
Krenair updated the task description. Oct 20 2018, 1:24 PM
Krenair updated the task description. Oct 20 2018, 11:32 PM

For the record, I don't think we should move OpenStack components inside OpenStack just to avoid complex chicken-and-egg problems.

Specifically, I'm referring to the first 4 points on the task description:

* OpenStack Designate (auth DNS)
* OpenStack Keystone (OpenStack identity/auth)
* OpenStack Glance (instance images)
* OpenStack Horizon (dashboard)

My understanding of the problem is:

  • cloud supporting services on hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc.) share their addressing with production services, and are therefore considered part of the prod infra (or side-by-side with it)
  • we don't trust what runs on Cloud VPS instances, especially when it comes to interaction with prod infra
  • we have concerns regarding a possible VM --> supporting service --> prod escalation

The proposed alternative model in most cases is to move all hardware services to instances within Cloud VPS, so they change addressing (from 10.x to 172.16.x).
But this is just a change in addressing. The only difference from the networking point of view is a NAT in between; i.e., a compromised supporting service running on Cloud VPS will still be able to reach a given prod server via the egress NAT.
So egress connections from VMs to prod need to be strongly filtered somewhere even with the use of NAT, meaning the additional security provided by the egress NAT is limited; I would say low.
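The point about NAT can be made concrete with a minimal sketch. All addresses below are illustrative (the egress IP and the ACL entries are made up, not the real WMCS values): a source NAT rewrites where a packet appears to come *from*, but the destination is untouched, so only an explicit filter decides whether a prod address is reachable.

```python
import ipaddress

CLOUD_NET = ipaddress.ip_network("172.16.0.0/12")
NAT_SOURCE = ipaddress.ip_address("185.15.56.1")  # illustrative egress IP

def egress_nat(src, dst):
    """Source-NAT cloud traffic; the destination is left as-is."""
    if ipaddress.ip_address(src) in CLOUD_NET:
        return str(NAT_SOURCE), dst
    return src, dst

def egress_allowed(dst, acl):
    """Reachability is decided by an explicit ACL, not by the NAT."""
    d = ipaddress.ip_address(dst)
    return any(d in net for net in acl)

# A VM at 172.16.0.5 targeting a prod host at 10.64.0.10:
src, dst = egress_nat("172.16.0.5", "10.64.0.10")
assert dst == "10.64.0.10"           # the NAT did not block the prod target
acl = [ipaddress.ip_network("208.80.154.0/24")]  # e.g. public endpoints only
assert not egress_allowed(dst, acl)  # only the ACL stops the flow
```

The second assertion is the whole argument: without the ACL line, the NAT on its own would have let the flow through.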

Am I missing something? What benefit do you see in changing the addressing?

Also, what characteristics would you like to see in an improved model? I'm thinking of:

  • simple networking policies at the routing level, i.e., don't allow routing between 10.x and 172.16.x
  • a more 'visual' difference between IPv4 address ranges, to ease configuring firewalls and other ACLs
  • a central killswitch in case of attack: shutting down the entire OpenStack deployment to close off a whole attack surface
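The first two bullets above can be sketched in a few lines; since 10.0.0.0/8 and 172.16.0.0/12 are disjoint RFC 1918 ranges, a "no routing between realms" policy reduces to a prefix check (addresses here are illustrative):

```python
import ipaddress

PROD = ipaddress.ip_network("10.0.0.0/8")
CLOUD = ipaddress.ip_network("172.16.0.0/12")

# The two ranges are disjoint, which is what makes the 'visual'
# distinction usable in firewall rules and other ACLs.
assert not PROD.overlaps(CLOUD)

def forward_allowed(src, dst):
    """Drop any routing between the prod and cloud ranges."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    crosses = (s in PROD and d in CLOUD) or (s in CLOUD and d in PROD)
    return not crosses

assert not forward_allowed("172.16.0.5", "10.64.0.10")  # cloud -> prod: dropped
assert forward_allowed("172.16.0.5", "172.16.1.9")      # cloud -> cloud: fine
```

In a real deployment this predicate would live in router ACLs or firewall rules rather than application code; the sketch only shows how simple the policy becomes once the address ranges are visually distinct.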
aborrero claimed this task. Oct 23 2018, 1:28 PM
aborrero triaged this task as Normal priority.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
bd808 added a subscriber: bd808. Oct 23 2018, 2:12 PM

The new Elastic replicas (which haven't been set up yet, see T194186)

I suppose this could be done by treating the physical hardware we bought for this as cloudvirt boxes and keeping them isolated for use by a single project. A pro of this would be the obvious and complete network separation from production. I'm sure there are cons to consider as well, both for the WMCS team and for the Search team.

I'm sure there are cons to consider as well both for the WMCS team and for the Search team.

Well, this task is probably a decent place to write those cons.

My understanding of the problem is:

  • cloud supporting services on hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc.) share their addressing with production services, and are therefore considered part of the prod infra (or side-by-side with it)
  • we don't trust what runs on Cloud VPS instances, especially when it comes to interaction with prod infra
  • we have concerns regarding a possible VM --> supporting service --> prod escalation

Sort of! It's a bit broader than that. We have 3-4 different so-called realms right now: production, labs, and fr-tech, with OIT arguably being a fourth. (Using "labs" because that's what $::realm is set to in Puppet!)

These are all separate admin domains, with different purposes, data restrictions (PII, credit card data) and access and superuser policies (ToS/NDA/Donor NDA etc.). Note that while these realms are now managed by different Foundation teams, that's not the relevant part here -- and in fact it hasn't always been the case, as both fr-tech roots and Labs roots were in the TechOps team at one point in time.

In turn, these different needs and restrictions drive a lot of technical decisions, from network isolation (e.g. fr-tech behind a separate set of firewalls), access request processes, +2/deployment rights, puppetmasters and puppet trees (in labs' case labs-private, in fr-tech entirely different puppet tree), security update expectations, reliability expectations, monitoring etc.

Unfortunately, for historical/legacy/tech-debt reasons, production and labs/WMCS have been deeply intertwined. This is primarily a (serious) infrastructure security issue, although there are other aspects to this than just security. A cleaner separation between the realms is something we've talked about and agreed on multiple times over the years, but put off due to lack of time. I see this task and its subtasks as one of the building blocks, and the 10/8 -> 172.16/12 space migration as another, all progressing towards this wider goal.

Does that make sense/help?

The proposed alternative model in most cases is to move all hardware services to instances within Cloud VPS, so they change addressing (from 10.x to 172.16.x).
But this is just a change in addressing. The only difference from the networking point of view is a NAT in between; i.e., a compromised supporting service running on Cloud VPS will still be able to reach a given prod server via the egress NAT.
So egress connections from VMs to prod need to be strongly filtered somewhere even with the use of NAT, meaning the additional security provided by the egress NAT is limited; I would say low.

Yes and no. This is not just a change in addressing (per above), but even for addressing specifically:

You're not wrong that you can enforce whatever filtering/firewalling policy you want regardless of address space. However, that has proven over the years to be hard to reason about, maintain, and enforce (there are multiple examples I can think of).

Where we want and have long wanted to go is to treat the WMCS/Neutron routers as customer ports, where we apply our border policy, identical to the one we'd apply to other customer ports (like our OIT interconnection) and similar to the one we'd apply to a PNI with a peer. Basically, given that WMCS is, by design, accessible to everyone, treat its network exactly like we treat the Internet. Internally, both prod and WMCS can use one or more private networks, invisible to each other and the rest of the Internet.

(Also note that I just gave a similar response to this at T207663 that you may find relevant too!).

Hope this helps, and happy to chat about this over a more realtime medium as well! :)

@faidon the complete separation seems like a great goal from a security perspective, but considering there's a lot of legacy code out there, it's potentially a big one too (beyond just changing addresses). Would these tasks be part of a larger project, and if so, tracked as quarterly goals? In other words, do we have a timeline we should follow? One difficulty I had personally was understanding where all these little tasks fit in the grand scheme of things; thanks for all the background information so far.

Does that make sense/help?

Yes, thanks :-) the extra context is really appreciated.

Where we want and have long wanted to go is to treat the WMCS/Neutron routers as customer ports, where we apply our border policy, identical to the one we'd apply to other customer ports (like our OIT interconnection) and similar to the one we'd apply to a PNI with a peer. Basically, given that WMCS is, by design, accessible to everyone, treat its network exactly like we treat the Internet. Internally, both prod and WMCS can use one or more private networks, invisible to each other and the rest of the Internet.

Ok, as stated in other tickets, we can't really get rid of the supporting infra (physical servers) so easily, so I made some diagrams to better understand this subject myself. (The diagrams are here https://drive.google.com/open?id=1xqcomg7u7ibD1jc5fZWrFqCzEZAgJQzp; they should be editable with Google Drive + draw.io.)

This is what I understand as your ideal model:

Zooming in a bit in the WMCS side, we have this:

And with a 2x zoom into WMCS, we start seeing more stuff:


(I added a fake small router to show more differences from the later diagram)

Please @faidon confirm I'm understanding this right.

If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

@faidon the complete separation seems like a great goal from a security perspective, but considering there's a lot of legacy code out there, it's potentially a big one too (beyond just changing addresses). Would these tasks be part of a larger project, and if so, tracked as quarterly goals? In other words, do we have a timeline we should follow? One difficulty I had personally was understanding where all these little tasks fit in the grand scheme of things; thanks for all the background information so far.

@GTirloni That's a good question :) Given this is more heavy on WMCS, I think the organization and scheduling of this is something that really depends on you! We in SRE would definitely like to see some progress on this (the sooner the better!) and are also willing to help in ways that we can (e.g. like we did with the MX stuff).

@Krenair filed this particular task and it wasn't at my request, but I don't think the intention was to put any pressure about doing this now :) That said, it would be great IMHO if we could use this task to:

  • Document the (numerous) conversations that have happened in the past around all this
  • Discuss concerns and figure out ways to address them; i.e. agree on the path forward
  • Track the steps that we'll need to get to the desired end result (the various subtasks)
  • Keep this as a reference and make sure that we won't regress (e.g. see the discussion at T207321 that's happening ~now, which indirectly kinda triggered this conversation!)
  • Hopefully address some of the low-hanging fruit soon-ish (@Krenair has helpfully been working on some of this!)
  • Formulate and agree on a timeline for completing this work

Thoughts?

Please @faidon confirm I'm understanding this right.
If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.

Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.
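As a rough illustration of what "standard OpenStack provisioning" means in this proposal, the sketch below builds the kind of request body one would hand to a Nova boot call (e.g. openstacksdk's `create_server`). Every concrete name here (project, hostname, image, flavor, network) is hypothetical, chosen only for the example:

```python
def build_server_spec(name, image_id, flavor_id, network_id):
    """Assemble a standard Nova boot request: a Glance image, a flavor,
    and a Neutron network -- the same path any other Cloud VPS
    instance goes through, instead of a hand-installed physical box."""
    return {
        "name": name,
        "image_id": image_id,
        "flavor_id": flavor_id,
        "networks": [{"uuid": network_id}],
    }

# Hypothetical support-service instance in a 'cloudinfra' project;
# all identifiers below are illustrative, not real WMCS values.
spec = build_server_spec(
    name="cloudinfra-dnsrec-01",
    image_id="debian-stretch",
    flavor_id="m1.small",
    network_id="cloudinstances-net",
)
assert spec["networks"] == [{"uuid": "cloudinstances-net"}]
```

The point of the sketch is the shape of the request, not the values: a service provisioned this way automatically inherits the project's Puppetmaster, DNS recursors, and image lifecycle, rather than being managed as a prod server.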

Krenair added a comment. (Edited) Oct 25 2018, 12:35 AM

I don't think the intention was to put any pressure about doing this now

Yeah, I sometimes file tasks regardless of whether they're likely to happen soon or only in the future. This one is a tracker, because I noticed a pattern emerging; I've also stuck a list of potential services in the description for discussion. It's clear that this does not have priority, and that's fine: I only just filed the ticket and am not even convinced everyone is on board with the whole idea yet, let alone that it has been through the other processes that would be needed for staff to work on something of this scope. I am a little surprised to see it has been marked assigned, not only due to priority but also because this is essentially a discussion and tracker ticket.

  • Hopefully address some of the low-hanging fruit soon-ish (@Krenair has helpfully been working on some of this!)

I did part of the DNS recursor task (getting that puppet class actually working in labs) for the purposes of a different ticket (I don't think we're ready to start actually creating the cloudinfra instances for it yet). I've offered to have a go at the puppetmaster task, but that depends on cloud services team approval for me (a volunteer) to get what would effectively be root access across all Cloud VPS instances; that's kind of a big deal, though others have it.

Krenair added a comment. (Edited) Oct 25 2018, 12:37 AM

Please @faidon confirm I'm understanding this right.
If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.
Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.

He might be referring to the actual physical virtualisation machines which literally cannot be moved inside themselves (so I didn't put them in the bullet points) :)

Although I never really thought of these as 'support' services.

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.
Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.

He might be referring to the actual physical virtualisation machines which literally cannot be moved inside themselves (so I didn't put them in the bullet points) :)
Although I never really thought of these as 'support' services.

Ok, I think I understand this better now.

But we still have "supporting" services which are on the edge of what we would be able to move inside OpenStack: not only cloudvirts, but NFS servers, for example. Our NFS problems are to be discussed in other tasks though :-)
Other services, like the Wiki Replicas, could be moved inside OpenStack, but because of the way they work they would still need direct access to prod infra (so there is little benefit in moving them to VMs).

PS: This task is assigned and on our kanban board because of the way we are trying to track tasks now.

Ok, I think I understand this better now.
But we still have "supporting" services which are on the edge of what we would be able to move inside OpenStack: not only cloudvirts, but NFS servers, for example. Our NFS problems are to be discussed in other tasks though :-)
Other services, like the Wiki Replicas, could be moved inside OpenStack, but because of the way they work they would still need direct access to prod infra (so there is little benefit in moving them to VMs).

Ack, there are certainly borderline and hard to figure out cases! I won't pretend I have all the answers or even be aware of all the challenges -- let's figure it out together :)

Should the next step here be to make an exhaustive list of the "support services", indicating for each the server, the application, and whether it can be moved into OpenStack?
Then analyze the remaining services case by case, to figure out where they should reside depending on their requirements.
Then open tracking subtasks for each of them?

Should the next step here be to make an exhaustive list of the "support services", indicating for each the server, the application, and whether it can be moved into OpenStack?
Then analyze the remaining services case by case, to figure out where they should reside depending on their requirements.
Then open tracking subtasks for each of them?

There have been prior discussions about this (maybe a year ago?) and I wrote up a document listing some of them back then. I'll dig it out next week and can file some subtasks for specific services.

bd808 added a comment. Nov 9 2018, 6:05 PM
  • OpenStack Horizon (dashboard)
  • Wikimedia Striker (toolsadmin)

Both of these services receive developer account (LDAP) authentication credentials from end users (the usernames and passwords that also allow access to things like Gerrit, Wikitech, and Phabricator). They also both hold credentials for highly privileged accounts in other services (LDAP, Keystone). I think this means that we should not consider hosting these services inside the Cloud VPS environment, as the terms of use and sane operational practices for Cloud VPS prohibit collecting LDAP passwords. Am I missing some subtlety here, or do others agree?

bd808 added a comment. Nov 9 2018, 6:11 PM
  • DB replicas

The Wiki Replica servers contain information considered sensitive by our privacy policies. This is the reason that all end-user access is offered through a view layer rather than via direct access to the replicated tables themselves. Additionally, I have been told that there are strong security reasons to restrict root access to the MariaDB service itself to full production root users only. My understanding of the reasons for this is limited (possibly for WP:BEANS reasons), but I trust that they are real.

I'm going to cross this one off the list with a pointer to this comment. Rebuttals are welcome if I am misunderstanding what protections could reasonably be taken to reverse this decision.

bd808 updated the task description. Nov 9 2018, 6:12 PM
Krenair added a comment. (Edited) Nov 9 2018, 6:15 PM
  • OpenStack Horizon (dashboard)
  • Wikimedia Striker (toolsadmin)

Both of these services receive developer account (LDAP) authentication credentials from end users (the usernames and passwords that also allow access to things like Gerrit, Wikitech, and Phabricator). They also both hold credentials for highly privileged accounts in other services (LDAP, Keystone). I think this means that we should not consider hosting these services inside the Cloud VPS environment, as the terms of use and sane operational practices for Cloud VPS prohibit collecting LDAP passwords. Am I missing some subtlety here, or do others agree?

Agreed. I'm striking Keystone for the same reason.

Krenair updated the task description. Nov 9 2018, 6:15 PM
bd808 updated the task description. Nov 9 2018, 6:17 PM
Krenair updated the task description. Nov 9 2018, 6:18 PM
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:11 PM