
Move various support services for Cloud VPS currently in prod into their own instances
Open, Normal, Public

Description

I assume the Nova API and Neutron themselves can't be moved, and obviously the virt servers can't be. And LDAP, though named for 'labs', has outgrown that and is trusted by prod for various things, so it can't go anywhere.

Here are some potential ones to consider:

  • OpenStack Designate (auth DNS)
  • OpenStack Keystone (OpenStack identity/auth) (see T207536#4735654)
  • OpenStack Glance (instance images)
  • OpenStack Horizon (dashboard) (see T207536#4735633)
  • Wikimedia Striker (toolsadmin) (see T207536#4735633)
  • DB replicas (see T207536#4735648)
  • NFS - there is a wmcs-nfs project already but that is for testing
  • The new Elastic replicas (which haven't been set up yet, see T194186)

I'm not arguing in particular for any individual service on the list to be migrated; feel free to shoot down any that can't or shouldn't be moved in the comments and strike it through in the description. If you think one should be moved, open a subtask and edit it out of this description.


Event Timeline

Krenair created this task. Oct 20 2018, 11:08 AM
Restricted Application added a subscriber: Aklapper. Oct 20 2018, 11:08 AM
Krenair updated the task description. Oct 20 2018, 1:02 PM
Krenair updated the task description. Oct 20 2018, 1:13 PM
faidon added a subscriber: faidon. Oct 20 2018, 1:21 PM
Krenair updated the task description. Oct 20 2018, 1:24 PM
Krenair updated the task description. Oct 20 2018, 11:32 PM

For the record, I don't think we should move OpenStack components inside OpenStack just to avoid complex chicken-and-egg problems.

Specifically, I'm referring to the first 4 points on the task description:

* OpenStack Designate (auth DNS)
* OpenStack Keystone (OpenStack identity/auth)
* OpenStack Glance (instance images)
* OpenStack Horizon (dashboard)

My understanding of the problem is:

  • cloud supporting services on hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc.) share their addressing with production services, and are therefore considered part of the prod infra (or side-by-side with it)
  • we don't trust what runs on Cloud VPS instances, especially when it comes to interaction with prod infra
  • we have concerns regarding a possible VM --> supporting service --> prod escalation

The proposed alternative model in most cases is to move all hardware services to instances within Cloud VPS, so they change addressing (from 10.x to 172.16.x).
But this is just a change in addressing. The only difference from the networking point of view is a NAT in between; i.e., a compromised supporting service running on Cloud VPS will still be able to reach a given prod server via the egress NAT.
So egress connections from VMs to prod need to be strongly filtered somewhere even with the use of NAT, meaning the additional security provided by the egress NAT is limited; I would say low.
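The point about NAT can be made concrete with a minimal sketch. All addresses below are illustrative (the egress IP and the ACL entries are made up, not the real WMCS values): a source NAT rewrites where a packet appears to come *from*, but the destination is untouched, so only an explicit filter decides whether a prod address is reachable.

```python
import ipaddress

CLOUD_NET = ipaddress.ip_network("172.16.0.0/12")
NAT_SOURCE = ipaddress.ip_address("185.15.56.1")  # illustrative egress IP

def egress_nat(src, dst):
    """Source-NAT cloud traffic; the destination is left as-is."""
    if ipaddress.ip_address(src) in CLOUD_NET:
        return str(NAT_SOURCE), dst
    return src, dst

def egress_allowed(dst, acl):
    """Reachability is decided by an explicit ACL, not by the NAT."""
    d = ipaddress.ip_address(dst)
    return any(d in net for net in acl)

# A VM at 172.16.0.5 targeting a prod host at 10.64.0.10:
src, dst = egress_nat("172.16.0.5", "10.64.0.10")
assert dst == "10.64.0.10"           # the NAT did not block the prod target
acl = [ipaddress.ip_network("208.80.154.0/24")]  # e.g. public endpoints only
assert not egress_allowed(dst, acl)  # only the ACL stops the flow
```

The second assertion is the whole argument: without the ACL line, the NAT on its own would have let the flow through.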

Am I missing something? What benefit do you see in changing the addressing?

Also, what characteristics would you like to see in an improved model? I'm thinking of:

  • simple networking policies at the routing level, i.e., don't allow routing between 10.x and 172.16.x
  • a more 'visual' difference between IPv4 address ranges, to ease configuring firewalls and other ACLs
  • a central killswitch in case of attack: shutting down the entire OpenStack deployment to close off a whole attack surface
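The first two bullets above can be sketched in a few lines; since 10.0.0.0/8 and 172.16.0.0/12 are disjoint RFC 1918 ranges, a "no routing between realms" policy reduces to a prefix check (addresses here are illustrative):

```python
import ipaddress

PROD = ipaddress.ip_network("10.0.0.0/8")
CLOUD = ipaddress.ip_network("172.16.0.0/12")

# The two ranges are disjoint, which is what makes the 'visual'
# distinction usable in firewall rules and other ACLs.
assert not PROD.overlaps(CLOUD)

def forward_allowed(src, dst):
    """Drop any routing between the prod and cloud ranges."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    crosses = (s in PROD and d in CLOUD) or (s in CLOUD and d in PROD)
    return not crosses

assert not forward_allowed("172.16.0.5", "10.64.0.10")  # cloud -> prod: dropped
assert forward_allowed("172.16.0.5", "172.16.1.9")      # cloud -> cloud: fine
```

In a real deployment this predicate would live in router ACLs or firewall rules rather than application code; the sketch only shows how simple the policy becomes once the address ranges are visually distinct.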
aborrero claimed this task. Oct 23 2018, 1:28 PM
aborrero triaged this task as Normal priority.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
bd808 added a subscriber: bd808. Oct 23 2018, 2:12 PM

The new Elastic replicas (which haven't been set up yet, see T194186)

I suppose this could be done by treating the physical hardware we bought for this as cloudvirt boxes and keeping them isolated for use by a single project. A pro of this would be the obvious and complete network separation from production. I'm sure there are cons to consider as well, both for the WMCS team and for the Search team.

I'm sure there are cons to consider as well both for the WMCS team and for the Search team.

Well, this task is probably a decent place to write those cons.

My understanding of the problem is:

  • cloud supporting services on hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc.) share their addressing with production services, and are therefore considered part of the prod infra (or side-by-side with it)
  • we don't trust what runs on Cloud VPS instances, especially when it comes to interaction with prod infra
  • we have concerns regarding a possible VM --> supporting service --> prod escalation

Sort of! It's a bit broader than that. We have 3-4 different so-called realms right now: production, labs, and fr-tech, with OIT arguably being a fourth. (Using "labs" because that's what $::realm is set to in Puppet!)

These are all separate admin domains, with different purposes, data restrictions (PII, credit card data) and access and superuser policies (ToS/NDA/Donor NDA etc.). Note that while these realms are now managed by different Foundation teams, that's not the relevant part here -- and in fact it hasn't always been the case, as both fr-tech roots and Labs roots were in the TechOps team at one point in time.

In turn, these different needs and restrictions drive a lot of technical decisions, from network isolation (e.g. fr-tech behind a separate set of firewalls), access request processes, +2/deployment rights, puppetmasters and puppet trees (in labs' case labs-private, in fr-tech entirely different puppet tree), security update expectations, reliability expectations, monitoring etc.

Unfortunately, for historical/legacy/tech-debt reasons, production and labs/WMCS have been deeply intertwined. This is primarily a (serious) infrastructure security issue, although there are other aspects to this than just security. A cleaner separation between the realms is something we've talked about and agreed on multiple times over the years, but put off due to lack of time. I see this task and its subtasks as one of the building blocks, and the 10/8 -> 172.16/12 space migration as another, all progressing towards this wider goal.

Does that make sense/help?

The proposed alternative model in most cases is to move all hardware services to instances within Cloud VPS, so they change addressing (from 10.x to 172.16.x).
But this is just a change in addressing. The only difference from the networking point of view is a NAT in between; i.e., a compromised supporting service running on Cloud VPS will still be able to reach a given prod server via the egress NAT.
So egress connections from VMs to prod need to be strongly filtered somewhere even with the use of NAT, meaning the additional security provided by the egress NAT is limited; I would say low.

Yes and no. This is not just a change in addressing (per above), but even for addressing specifically:

You're not wrong that you can enforce whatever filtering/firewalling policy you want regardless of address space. However, that has proven over the years to be hard to reason about, maintain, and enforce (there are multiple examples I can think of).

Where we want and have long wanted to go is to treat the WMCS/Neutron routers as customer ports, where we apply our border policy, identical to the one we'd apply to other customer ports (like our OIT interconnection) and similar to the one we'd apply to a PNI with a peer. Basically, given that WMCS is, by design, accessible to everyone, treat its network exactly like we treat the Internet. Internally, both prod and WMCS can use one or more private networks, invisible to each other and the rest of the Internet.

(Also note that I just gave a similar response to this at T207663 that you may find relevant too!).

Hope this helps, and happy to chat about this over a more realtime medium as well! :)

@faidon the complete separation seems like a great goal from a security perspective, but considering there's a lot of legacy code out there, it's potentially a big one too (beyond just changing addresses). Would these tasks be part of a larger project, and if so, tracked as quarterly goals? In other words, do we have a timeline we should follow? One difficulty I had personally was understanding where all these little tasks fit in the grand scheme of things; thanks for all the background information so far.

Does that make sense/help?

Yes, thanks :-) the extra context is really appreciated.

Where we want and have long wanted to go is to treat the WMCS/Neutron routers as customer ports, where we apply our border policy, identical to the one we'd apply to other customer ports (like our OIT interconnection) and similar to the one we'd apply to a PNI with a peer. Basically, given that WMCS is, by design, accessible to everyone, treat its network exactly like we treat the Internet. Internally, both prod and WMCS can use one or more private networks, invisible to each other and the rest of the Internet.

Ok, as stated in other tickets, we can't really get rid of the supporting infra (physical servers) so easily, so I made some diagrams to better understand this subject myself. (The diagrams are here https://drive.google.com/open?id=1xqcomg7u7ibD1jc5fZWrFqCzEZAgJQzp; they should be editable with Google Drive + draw.io.)

This is what I understand as your ideal model:

Zooming in a bit in the WMCS side, we have this:

And with a 2x zoom into WMCS, we start seeing more stuff:


(I added a fake small router to show more differences from the later diagram)

Please @faidon confirm I'm understanding this right.

If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

@faidon the complete separation seems like a great goal from a security perspective, but considering there's a lot of legacy code out there, it's potentially a big one too (beyond just changing addresses). Would these tasks be part of a larger project, and if so, tracked as quarterly goals? In other words, do we have a timeline we should follow? One difficulty I had personally was understanding where all these little tasks fit in the grand scheme of things; thanks for all the background information so far.

@GTirloni That's a good question :) Given this is more heavy on WMCS, I think the organization and scheduling of this is something that really depends on you! We in SRE would definitely like to see some progress on this (the sooner the better!) and are also willing to help in ways that we can (e.g. like we did with the MX stuff).

@Krenair filed this particular task and it wasn't at my request, but I don't think the intention was to put any pressure about doing this now :) That said, it would be great IMHO if we could use this task to:

  • Document the (numerous) conversations that have happened in the past around all this
  • Discuss concerns and figure out ways to address them; i.e. agree on the path forward
  • Track the steps that we'll need to get to the desired end result (the various subtasks)
  • Keep this as a reference and make sure that we won't regress (e.g. see the discussion at T207321 that's happening ~now, which indirectly kinda triggered this conversation!)
  • Hopefully address some of the low-hanging fruit soon-ish (@Krenair has helpfully been working on some of this!)
  • Formulate and agree on a timeline for completing this work

Thoughts?

Please @faidon confirm I'm understanding this right.
If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.

Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.
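As a rough illustration of what "standard OpenStack provisioning" means in this proposal, the sketch below builds the kind of request body one would hand to a Nova boot call (e.g. openstacksdk's `create_server`). Every concrete name here (project, hostname, image, flavor, network) is hypothetical, chosen only for the example:

```python
def build_server_spec(name, image_id, flavor_id, network_id):
    """Assemble a standard Nova boot request: a Glance image, a flavor,
    and a Neutron network -- the same path any other Cloud VPS
    instance goes through, instead of a hand-installed physical box."""
    return {
        "name": name,
        "image_id": image_id,
        "flavor_id": flavor_id,
        "networks": [{"uuid": network_id}],
    }

# Hypothetical support-service instance in a 'cloudinfra' project;
# all identifiers below are illustrative, not real WMCS values.
spec = build_server_spec(
    name="cloudinfra-dnsrec-01",
    image_id="debian-stretch",
    flavor_id="m1.small",
    network_id="cloudinstances-net",
)
assert spec["networks"] == [{"uuid": "cloudinstances-net"}]
```

The point of the sketch is the shape of the request, not the values: a service provisioned this way automatically inherits the project's Puppetmaster, DNS recursors, and image lifecycle, rather than being managed as a prod server.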

Krenair added a comment. (Edited) Oct 25 2018, 12:35 AM

I don't think the intention was to put any pressure about doing this now

Yeah, I sometimes file tasks regardless of whether they're likely to happen soon or only in the future. This one is a tracker, because I noticed a pattern emerging; I've also stuck a list of potential services in the description for discussion. It's clear that this does not have priority, and that's fine: I only just filed the ticket and am not even convinced everyone is on board with the whole idea yet, let alone that it has been through the other processes that would be needed for staff to work on something of this scope. I am a little surprised to see it has been marked assigned, not only due to priority but also because this is essentially a discussion and tracker ticket.

  • Hopefully address some of the low-hanging fruit soon-ish (@Krenair has helpfully been working on some of this!)

I did part of the DNS recursor task (getting that puppet class actually working in labs) for the purposes of a different ticket (I don't think we're ready to start actually creating the cloudinfra instances for it yet). I've offered to have a go at the puppetmaster task, but that depends on cloud services team approval for me (a volunteer) to get what would effectively be root access across all Cloud VPS instances; that's kind of a big deal, though others have it.

Krenair added a comment. (Edited) Oct 25 2018, 12:37 AM

Please @faidon confirm I'm understanding this right.
If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we need to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, LDAP servers, DNS servers, etc.; i.e., they are still administered like any other prod server. Would you like to see a change in that aspect as well?

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.
Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.

He might be referring to the actual physical virtualisation machines which literally cannot be moved inside themselves (so I didn't put them in the bullet points) :)

Although I never really thought of these as 'support' services.

I'm a little confused! The ask (at least my ask!) is not to add another router in front of Neutron, nor to create another zone side-by-side with the "wmcs openstack network". Instead, my proposal is to move the servers supporting instances into OpenStack itself e.g. in the cloudinfra project. This would mean using standard OpenStack provisioning (Glance and whatnot), WMCS-wide or project-wide Puppetmasters, WMCS' DNS recursors etc.
Or were you referring specifically to the 4 first bullets (the OpenStack services) that you also mentioned before? I do agree with you that this is a separate can of worms and I wouldn't include it in the scope of this task.

He might be referring to the actual physical virtualisation machines which literally cannot be moved inside themselves (so I didn't put them in the bullet points) :)
Although I never really thought of these as 'support' services.

Ok, I think I understand this better now.

But we still have "supporting" services which are on the edge of what we would be able to move inside OpenStack: not only cloudvirts, but NFS servers, for example. Our NFS problems are to be discussed in other tasks though :-)
Other services, like the Wiki Replicas, could be moved inside OpenStack, but because of the way they work they would still need direct access to prod infra (so there is little benefit in moving them to VMs).

PS: This task is assigned and on our kanban board because of the way we are trying to track tasks now.

Ok, I think I understand this better now.
But we still have "supporting" services which are on the edge of what we would be able to move inside OpenStack: not only cloudvirts, but NFS servers, for example. Our NFS problems are to be discussed in other tasks though :-)
Other services, like the Wiki Replicas, could be moved inside OpenStack, but because of the way they work they would still need direct access to prod infra (so there is little benefit in moving them to VMs).

Ack, there are certainly borderline and hard to figure out cases! I won't pretend I have all the answers or even be aware of all the challenges -- let's figure it out together :)

Should the next step here be to make an exhaustive list of the "support services", indicating for each the server, the application, and whether it can be moved into OpenStack?
Then analyze the remaining services case by case, to figure out where they should reside depending on their requirements.
Then open tracking subtasks for each of them?

Should the next step here be to make an exhaustive list of the "support services", indicating for each the server, the application, and whether it can be moved into OpenStack?
Then analyze the remaining services case by case, to figure out where they should reside depending on their requirements.
Then open tracking subtasks for each of them?

There have been prior discussions about this (maybe a year ago?) and I wrote up a document listing some of them back then. I'll dig it out next week and can file some subtasks for specific services.

bd808 added a comment. Nov 9 2018, 6:05 PM
  • OpenStack Horizon (dashboard)
  • Wikimedia Striker (toolsadmin)

Both of these services receive developer account (LDAP) authentication credentials from end users (the usernames and passwords that also allow access to things like Gerrit, Wikitech, and Phabricator). They also both hold credentials for highly privileged accounts in other services (LDAP, Keystone). I think this means that we should not consider hosting these services inside the Cloud VPS environment, as the terms of use and sane operational practices for Cloud VPS prohibit collecting LDAP passwords. Am I missing some subtlety here, or do others agree?

bd808 added a comment. Nov 9 2018, 6:11 PM
  • DB replicas

The Wiki Replica servers contain information considered sensitive by our privacy policies. This is the reason that all end-user access is offered through a view layer rather than via direct access to the replicated tables themselves. Additionally, I have been told that there are strong security reasons to restrict root access to the MariaDB service itself to full production root users only. My understanding of the reasons for this is limited (possibly for WP:BEANS reasons), but I trust that they are real.

I'm going to cross this one off the list with a pointer to this comment. Rebuttals are welcome if I am misunderstanding what protections could reasonably be taken to reverse this decision.

bd808 updated the task description. Nov 9 2018, 6:12 PM
Krenair added a comment. (Edited) Nov 9 2018, 6:15 PM
  • OpenStack Horizon (dashboard)
  • Wikimedia Striker (toolsadmin)

Both of these services receive developer account (LDAP) authentication credentials from end users (the usernames and passwords that also allow access to things like Gerrit, Wikitech, and Phabricator). They also both hold credentials for highly privileged accounts in other services (LDAP, Keystone). I think this means that we should not consider hosting these services inside the Cloud VPS environment, as the terms of use and sane operational practices for Cloud VPS prohibit collecting LDAP passwords. Am I missing some subtlety here, or do others agree?

Agreed. I'm striking Keystone for the same reason.

Krenair updated the task description. Nov 9 2018, 6:15 PM
bd808 updated the task description. Nov 9 2018, 6:17 PM
Krenair updated the task description. Nov 9 2018, 6:18 PM
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:11 PM