Network isolation for production and semi-production services
Closed, Duplicate
Public
Assigned To

None

Authored By

	• GWicke
	Dec 11 2015, 5:16 PM

Description

Our production network environment contains several sensitive services with weak built-in security. There are few hard restrictions at the network level that prevent a compromised production host from exploiting those weaknesses. Overall, the level of network separation within the production network is not sufficient to let us run semi-trusted services.

To still provide a reasonable level of security, we need to be careful about which services we allow to operate in this networking environment. This creates hurdles for semi-production or volunteer projects like HTML dumps, revision scoring, maps and others. Almost none of these services actually need this privileged level of access, but they do need production-level hardware and reliability, so they can't currently be supported in labs VMs.

There are also many current production services without a need for access to a privileged network environment. This includes Parsoid, Mathoid, Citoid, the Reading Content Service, AQS, Kartotherian and Hierator. The consequences of an exploit in any of these services would be significantly less severe if we ran them in restricted network environments.

To summarize, we are looking for a secure way to run semi-trusted services that

a) have production-level hardware and reliability requirements, and
b) don't need access to the privileged production networking environment, but are accessible from production.

Related Objects

Mentioned In: T170111: Implement a pod networking policy approach
T94329: secure Cassandra/RESTBase cluster
T134241: graphoid should not use the http proxy to connect to the mediawiki api and other internal services
Mentioned Here: T117095: eqiad: 1 hardware access request for labs on real hardware (mwoffliner)
T106731: eqiad: 1 hardware access request for labs on real hardware
T121237: Create labs baremetal subnet?
T17017: Wikimedia static HTML dumps broken
T106867: [Epic] Deploy Revscoring/ORES service in Prod

Event Timeline

• GWicke created this task. Dec 11 2015, 5:16 PM

• GWicke set the priority of this task to Needs Triage.

• GWicke added a project: SRE.

• GWicke subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 11 2015, 5:16 PM

• GWicke renamed this task from "Provide a means to run production and semi-production services on separate vlans" to "Provide a means to run production and semi-production services without access to the catch-all production networking environment". Dec 11 2015, 5:24 PM

• GWicke set Security to None.

• GWicke added a project: Services. Dec 11 2015, 5:48 PM

• GWicke renamed this task from "Provide a means to run production and semi-production services without access to the catch-all production networking environment" to "Network isolation for production and semi-production services". Dec 11 2015, 8:17 PM

• GWicke added a project: Security-General. Dec 11 2015, 10:11 PM

• GWicke added subscribers: • mobrovac, mark, faidon, • csteipp.

• GWicke added subscribers: tomasz, yuvipanda, dr0ptp4kt.

• GWicke added subscribers: ori, BBlack.

Peachey88 subscribed. Dec 11 2015, 10:56 PM

Krenair subscribed. Dec 12 2015, 2:28 AM

The task is vague; can you give specific example scenarios or something? I'm assuming services would run as regular users and not have root privileges, and we'd be applying updates in a timely fashion to make the likelihood of root escalation low. It's really two exploits that have to be possible and exploited together in one attack, one embedded in the other: first breaking the app code, then utilizing the (often limited) access to that account to exploit a root privilege escalation on the host.

Also, the language at the top seems to sometimes talk about the "sensitive services with weak built-in security" (what are those, and how are they weak?) as platforms for launching attacks, and then the same as targets of said attack? If so, then putting all the services together in one "isolated network environment" doesn't fix that as they're all in that pool together and one can still be used to attack the other. Are you asking for a separate "isolated network environment" for each individual service?

At some other level of decoding this, what does "isolated network environment" even mean here? Isolated from being able to route traffic to each other (reach each other for cross-service API calls) at all? Firewall rules to only allow specific service ports (which we have in puppet at the host/role level)?

The task is vague, can you give specific example scenarios or something?

Anybody controlling one of these services has at least code execution privileges as the service user. This is already sufficient to attack many other services in the production network. Given the attack surface on bare hardware, motivated attackers also have a non-zero chance of elevating their privileges to root, which would let them circumvent purely IP-based restrictions.

Are you asking for a separate "isolated network environment" for each individual service?

Yes, that would provide the best protection, and I think is warranted especially for semi-trusted services developed by third-party volunteers. But, even one extra network segment with fairly harmless stateless services would improve security by isolating those from stateful and more sensitive services in the production environment.

At some other level of decoding this, what does "isolated network environment" even mean here?

It means that the level of isolation is sufficient to prevent an owned host from circumventing network protections and attacking other services. If we want to cover local root exploits or support handing out per-service root to developers, this means that the enforcement has to happen at the Ethernet / switch port level. One way to implement this that I'm familiar with would be VLANs plus explicitly whitelisted routes between those VLANs on an as-needed basis. This was met with ridicule on IRC earlier for unclear reasons, so I removed the explicit mentions, since I don't care too much whether VLANs or some other tech is used.
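
The idea of separate segments plus explicitly whitelisted routes can be sketched as a default-deny allowlist. The Python model below is only an illustration under stated assumptions: the segment names, the port number, and the `Route` / `is_allowed` helpers are hypothetical and not taken from any actual production configuration. Real enforcement would live at the switch port or router level, not in application code.

```python
from typing import NamedTuple


class Route(NamedTuple):
    """One explicitly whitelisted flow between two network segments."""
    src_segment: str
    dst_segment: str
    dst_port: int


# Hypothetical allowlist: production may call a service in the
# semi-trusted segment on one port; nothing is allowed the other way.
ALLOWED_ROUTES = frozenset({
    Route("production", "semi-trusted", 8000),
})


def is_allowed(src_segment: str, dst_segment: str, dst_port: int) -> bool:
    """Default-deny check for traffic crossing a segment boundary.

    Segment membership is assumed to be fixed by the switch port, so a
    host cannot change its own segment even after a local root exploit.
    """
    if src_segment == dst_segment:
        # Traffic inside one segment is not filtered in this model.
        return True
    return Route(src_segment, dst_segment, dst_port) in ALLOWED_ROUTES


if __name__ == "__main__":
    print(is_allowed("production", "semi-trusted", 8000))
    print(is_allowed("semi-trusted", "production", 11211))
```

In this model, hosts that share a segment can still reach each other freely, which is why one segment per service gives the strongest isolation, while a single shared semi-trusted segment only protects the services outside it.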

EBernhardson subscribed. Dec 12 2015, 3:15 AM

Do we actually care whether a service is developed by volunteers or not?

Do we actually care whether a service is developed by volunteers or not?

@Krenair: I agree with you that expertise and responsiveness of maintainers as well as the amount of scrutiny a service is receiving should matter more than the maintainer's professional affiliation. I used it as an example for services that might be experimental & do not receive as much ongoing security scrutiny. Ideally, most services (both internally & externally developed) would fall into this "semi-trusted" category, but this means that we need to enforce the needed security properties at a different level.

Removing myself and adding the other Tomasz instead :-)

tomasz edited subscribers, added: • Tfinc; removed: tomasz. Dec 12 2015, 8:40 AM

Ottomata triaged this task as Medium priority. Dec 14 2015, 8:03 PM

I've had this conversation with ops a few times, but I'll document it here for reference. I think we do need to find a way to segment our network more, to confine compromises. Starting with an "untrusted services" segment might be a good first step, although I'll leave that to ops to prioritize.

I've had a number of requests for security reviews of services that I would prefer to see run on an isolated network. WDQS was one example: the Blazegraph developers said it wasn't typically run exposed to the internet, but we wanted to do that. Citoid/Zotero is another one I can think of. We compromised by rolling those out confined with firejail, although I think that still leaves us with more risk than I like.

@BBlack, addressing the multiple exploits specifically: for a serious attacker, chaining multiple exploits is pretty standard. Also, MediaWiki completely trusts its caching mechanisms, so the ability to connect to memcached allows an attacker to escalate their MediaWiki privileges.

• GWicke mentioned this in T134241: graphoid should not use the http proxy to connect to the mediawiki api and other internal services. May 12 2016, 7:40 PM

• GWicke mentioned this in T94329: secure Cassandra/RESTBase cluster. Oct 19 2016, 7:20 PM

• dpatrick subscribed. Dec 1 2016, 7:22 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:45 PM

• GWicke mentioned this in T170111: Implement a pod networking policy approach. Jul 11 2017, 7:46 PM

ayounsi subscribed. Jul 12 2017, 10:03 AM

• GWicke moved this task from Backlog to attic on the Services board. Jul 12 2017, 5:24 PM

• GWicke edited projects, added Services (attic); removed Services.

• Phabricator_maintenance added a project: acl*security. Sep 20 2018, 9:15 AM

• mobrovac added a project: Platform Team Legacy (Attic). Dec 20 2018, 1:01 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:27 PM

• kchapman removed projects: Platform Team Legacy (Attic), Services (attic). Jul 3 2019, 8:46 PM