
Kubernetes Wikifunctions security and control measures
Closed, ResolvedPublic

Description

Intro

Wikifunctions is, at a conceptual level, a remote code execution system. We will be accepting arbitrary code from users and executing it on our platform. This poses great security concerns: running arbitrary untrusted code, even if we end up performing multiple validation steps, exposes us to a wide range of attacks that can be carried out against our infrastructure, projects and users. To mitigate the risk stemming from this, and with the fact in mind that container escapes are not uncommon, we will be implementing a set of defence-in-depth measures, each aimed at a variety of attack vectors.

Investigated measures

This task focuses on the infrastructure level measures, although some passing notes on application level measures are embedded here for completeness.

On the application level

The application itself will be taking care of setting some limits, e.g. the amount of time a workload can run for, after which the workload will be terminated. This will greatly mitigate the risk of long-running attack vectors.

On the infrastructure, aka WikiKube Kubernetes execution platform level

Resource based measures

First, we will go through the various resource-based measures. The point here is to limit computational resources: memory, CPU and network.

Memory limits

We will be using the Kubernetes platform's capabilities for this one. Without diving into too much detail, we will be defining requests and limits stanzas in the pods to limit memory usage. Containers/Pods that go above the specified limits will either not be allocated more memory when they request it, or will outright be killed by the OutOfMemory killer. This is a feature we've been using since the start of our Kubernetes platform; we are comfortable with it and we know what to expect and how to debug it. When the platform is under great stress, the requests specs are expected to de-prioritize scheduling Wikifunctions workloads and instead schedule more critical workloads.

CPU limits

Similar to memory: Pods will be capped to specific CPU limits, and Pods crossing that limit will be severely throttled in CPU usage. This is a feature we've been using since the start of our Kubernetes platform; we are comfortable with it and we know what to expect and how to debug it. requests specs will again be used when the platform is under great CPU stress, where Wikifunctions will be de-prioritized in favor of more critical workloads.
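
For illustration, both the memory and the CPU requests/limits are expressed in a resources stanza per container; a minimal sketch follows (the values are made up for the example, not the actual Wikifunctions settings):

apiVersion: v1
kind: Pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      # requests inform scheduling and determine priority under resource stress
      requests:
        memory: "256Mi"
        cpu: "250m"
      # limits are hard caps: memory overruns invoke the OutOfMemory killer,
      # CPU overruns result in throttling
      limits:
        memory: "512Mi"
        cpu: "1"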

Network

This isn't going to be discussed here in much detail, but the gist is very simply that we won't be allowing the execution units to talk to anything by default, progressively whitelisting some selected resources. This is on par with every other service that we run, where network access is blacklisted by default and we whitelist specific destinations. The functionality itself is provided by Kubernetes Network Policies (as the definition) and the implementation comes from our Calico CNI. This is a feature we've been using since the start of our Kubernetes platform; we are comfortable with it and we know what to expect and how to debug it.

As for the selected resources, ones that come to mind at a high level (implementation might end up being interesting) are wikidata.org and wikifunctions.org, at least at the beginning.
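
As an illustration of the mechanism, a default-deny egress policy whitelisting a single destination could look roughly like the sketch below (the namespace, CIDR and port are placeholders, not the actual implementation):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: wikifunctions-egress
  namespace: wikifunctions
spec:
  # select all pods in the namespace
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  # only destinations listed here are reachable; everything else is denied
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24   # placeholder for a whitelisted destination
    ports:
    - protocol: TCP
      port: 443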

Execution hardening measures

mount / as read-only

A typical hardening measure that makes an attacker's life more difficult is the inability to write arbitrary files in various places in the filesystem. This can be achieved in a variety of ways, including properly setting permissions everywhere. Kubernetes provides a way to set the root filesystem of each container in a Pod as read-only. This means the attacker will have to spend time probing the container for a writable location for their payload, increasing the cost of the attack. It is pretty simple to do in a Pod context; code follows:

apiVersion: v1
kind: Pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    securityContext:
      readOnlyRootFilesystem: true
  securityContext: {}

In the above extremely simple example, a pod is run with a single container, using the nginx image. The root filesystem of that container is set to read-only. Note: there are 2 securityContext objects here. One applies at the container level and the other at the Pod level. They have wildly different specifications, PodSecurityContext vs SecurityContext. In the few cases where the 2 overlap, the container level securityContext takes precedence.

PrivilegeEscalation

AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is always true when the container is: 1) run as Privileged, or 2) has CAP_SYS_ADMIN. Note that this field cannot be set when spec.os.name is windows.

We default this to false in our restricted PodSecurityPolicy already.
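
In Pod terms, and along the lines of the earlier example, setting it explicitly would look like this:

apiVersion: v1
kind: Pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    securityContext:
      # sets the no_new_privs flag on the container process
      allowPrivilegeEscalation: false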

Linux Capabilities

Linux capabilities provide a subset of the available root privileges to a process. In our environment we already don't allow typical workloads to run as root, and we have already set requiredDropCapabilities to ALL in our restricted PodSecurityPolicy, disallowing workloads from adding any Linux capabilities to themselves.
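
For reference, the relevant shape of such a restricted PodSecurityPolicy is sketched below (illustrative, not a verbatim copy of our actual policy):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  # drop all Linux capabilities; workloads cannot add any back
  requiredDropCapabilities:
  - ALL
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny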

Rule-based isolation measures

SELinux

Configuring SELinux can be a daunting task and we have no prior art/knowledge in the foundation. Since the SRE team has focused on providing Debian operating system support, providing a proper SELinux policy framework to be used by Kubernetes would require a significant amount of effort and investment, one which currently is NOT easily quantified and justified. For now, we will be avoiding this and revisit at a later time.

Seccomp

Seccomp, a way to limit the system calls allowed by pods, has been supported in Kubernetes for quite a while, but the feature only graduated to GA (General Availability) in version 1.19. The API changed from adding Annotations to pods (a typical way some features are first implemented) to actual securityContext fields in the Kubernetes Pod API. Furthermore, by default pods are run under the Unconfined seccomp profile, effectively disabling it. There is work in the upstream project to switch to enabled by default. KEP-2413: Enable seccomp by default has the gory details, but essentially we are talking about a feature gate and a new kubelet flag. This was done in Kubernetes 1.22 and is still considered Alpha.

For now, we will not be relying on the new seccomp by default approach as it is still in Alpha and prone to change.

Furthermore, seccomp requires that we create and ship application specific seccomp profiles after auditing the application to find out the system calls it makes, evaluating and whitelisting them. This is an arduous process, which eventually we might want to do, but right now we don't have the time for.

Finally, enabling seccomp has negative repercussions as far as performance goes, especially when it comes to workloads performing a lot of I/O. We've had experience with instances of increased runtimes and latency, especially in CI environments. The performance degradation increases with the number of syscalls a workload does (e.g. CI tends to do tons of syscalls for reading files, accessing the network, etc.) and requires some careful consideration of the workload beforehand. In our testing, the specific profile being applied made no discernible difference in the performance degradation; that is, a very permissive profile had similar performance degradation to a very restrictive one. The one exception is unconfined, which does not enable seccomp at all.

For completeness' sake, here's how specifying a seccomp profile in Kubernetes 1.19 and later (pre-1.19 isn't a target for this document) looks:

apiVersion: v1
kind: Pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    securityContext:
      seccompProfile:
        type: Unconfined
  securityContext: {}

Note: as in the earlier example, there are 2 securityContext objects here, one at the container level and one at the Pod level, with wildly different specifications (PodSecurityContext vs SecurityContext). In the few cases where the 2 overlap, the container level securityContext takes precedence; seccompProfile is one such case.

Here, we have an nginx container that runs with the seccompProfile Unconfined, which is the default. We could also use runtime/default, which would rely on the default of the chosen Container Runtime Engine. That is currently Docker, but since support for it is removed from Kubernetes after version 1.23 (the last version that supports Docker), we will have to first migrate to a new Container Runtime Engine and then decide how to populate seccomp profiles.

Note: Our current restricted PodSecurityPolicy doesn't allow defining anything other than runtime/default (and the older, equivalent docker/default) anyway, so no workloads will be scheduled if they try to override that.
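
That restriction is expressed via annotations on the PodSecurityPolicy; schematically (a sketch of the relevant annotations only, not our full policy):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    # profiles a pod is allowed to request
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default,docker/default'
    # profile applied when a pod does not request one
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
# spec omitted; same shape as the restricted PodSecurityPolicy sketched earlier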

For the above reasons, we will be avoiding seccomp for Wikifunctions and reconsider enabling it at a later date.

Apparmor

Following: Restrict a Container's Access to Resources with AppArmor (https://kubernetes.io/docs/tutorials/security/apparmor/)

Requirements
| Requirement | Status | Notes |
| --- | --- | --- |
| Kubernetes version should be >= 1.4 | Satisfied | We are at 1.16 and 1.23 |
| AppArmor kernel module enabled | Satisfied | sudo cumin 'kubernetes*' 'cat /sys/module/apparmor/parameters/enabled' returns Y |
| AppArmor profiles loaded | Satisfied | sudo cumin 'kubernetes*' 'cat /sys/kernel/security/apparmor/profiles' returns various profiles |
| Kubelet reports AppArmor enabled | Satisfied | kubemaster1001.eqiad.wmnet: kubelet is posting ready status. AppArmor enabled |
Securing a pod

Each container should have an annotation of the form:

container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>

Possible values include:

  • runtime/default to apply the runtime's default profile
  • localhost/<profile_name> to apply the profile loaded on the host with the name <profile_name>
  • unconfined to indicate that no profiles will be loaded

Our current restricted PodSecurityPolicy restricts the AppArmor profile used to runtime/default in case one is defined (if none is defined, runtime/default remains the chosen one, as it is specified as apparmor.security.beta.kubernetes.io/defaultProfileName in the restricted PodSecurityPolicy).
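
Schematically, the corresponding PodSecurityPolicy annotations look like this (again only the relevant annotations, not the full policy):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    # the only profile pods may request
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    # the profile applied when a pod does not request one
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
# spec omitted; same shape as the restricted PodSecurityPolicy sketched earlier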

A minimal pod example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
  labels:
    run: nginx
  name: nginx
  namespace: default
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
AppArmor profiles

Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto nodes. There are lots of ways to set up the profiles though, such as:

  • Through a DaemonSet that runs a Pod on each node to ensure the correct profiles are loaded. An example implementation can be found here.
  • At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or image.

Currently in our WikiKube Kubernetes nodes the following profiles are loaded:

  • docker-default (enforce)
  • tcpdump (enforce)
  • lsb_release (enforce)
  • man_groff (enforce)
  • man_filter (enforce)
  • /usr/bin/man (enforce)
  • nvidia_modprobe (enforce)
  • nvidia_modprobe//kmod (enforce)

All of these are shipped by some Debian package, and we currently don't ship a specific AppArmor profile to our WikiKube cluster. That being said, we do have some experience with AppArmor (having used it to sandbox some services in the past), as well as Puppet code for it.

A simple proposal for a Wikifunctions-targeted AppArmor profile would be:

# vim:syntax=apparmor
#include <tunables/global>

profile function-orchestrator {
  #include <abstractions/base>
  #include <abstractions/python>
  network inet tcp,
  network inet6 tcp,

  # Deny all file writes.
  deny /** w,
}

profile function-evaluator {
  #include <abstractions/base>
  #include <abstractions/python>
  network inet tcp,
  network inet6 tcp,

  # Deny all file writes.
  deny /** w,
}

The above profiles allow executing Python, allow TCP over IPv4/IPv6, and explicitly deny writes everywhere. Overall they also disallow pretty much any other kind of resource access. The profiles are named and not bound to a specific process (they are defined as "profile foo" and not as profile "/usr/bin/foo") so that they can be applied by Kubernetes to a container.

Note: We will clearly also need Lua. There is unfortunately no abstraction for that one yet; we will have to write it ourselves. We might also need to adapt the overly strict set of rules.

Here's how to apply them in a Kubernetes pod:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: localhost/function-orchestrator
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx

AppArmor profiles will be shipped from day 1 of the Wikifunctions deployment, possibly after fine-tuning the above.

Virtualization based measures

Firecracker

Amazon Firecracker is the technology that powers AWS Lambda and AWS Fargate today. It is a virtualization technology that creates lightweight virtual machines (MicroVMs) specifically for multi-tenant containers and serverless operational models. It provides an alternative to the rule-based measures above, with stronger isolation guarantees.

Unfortunately, at the time of this writing (2023-01-10) Firecracker does not support our Container Runtime Engine (Docker). It does support containerd and Kata Containers though. Since Docker support is removed from Kubernetes after version 1.23, we will revisit Firecracker once we have switched away from Docker to some other Container Runtime Engine.

Note: An interesting read is the unrolled Twitter thread by Micah Hausler as to why Firecracker and Kubernetes aren't yet particularly compatible.

Kata Containers

Kata Containers is pretty similar to Firecracker, albeit running workloads in a generic hypervisor (QEMU) instead of a dedicated microvm hypervisor. Interestingly, it is possible to use Firecracker as the VMM for Kata containers.

Docker support for Kata Containers was added in Docker 22.06. Unfortunately, at the time of this writing (2023-01-10) we are at Docker version 20.10.5. Since Docker support is removed from Kubernetes after version 1.23, we will have to move away from Docker to some other Container Runtime Engine that is supported by Kubernetes anyway. We will revisit Kata Containers once that's completed.

Note: The limitations document by Kata Containers is a pretty useful read anyway.

Application Level Kernel measures

gVisor

gVisor is an application kernel implementing a substantial portion of the Linux system call interface. It provides a third class of isolation mechanism, the other 2 classes being virtualization (e.g. Kata Containers, Firecracker) and rule-based (seccomp, SELinux, AppArmor), both discussed above. The technique sandboxes containers by providing implementations of system calls, maintaining a degree of compatibility with the Linux kernel while only utilizing a small number of actual Linux kernel system calls. The work to evaluate and test this measure is at T316706.
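
For reference, opting a pod into gVisor typically goes through a RuntimeClass referencing the runsc handler, roughly as below (a sketch that assumes a Container Runtime Engine configured with runsc, which our current Docker-based setup is not):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
# must match a runtime handler configured in the CRI (e.g. containerd)
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # run this pod's containers under the gVisor application kernel
  runtimeClassName: gvisor
  containers:
  - image: nginx
    name: nginx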


Event Timeline

Change 879282 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] wikifunctions: Add AppArmor profile usage

https://gerrit.wikimedia.org/r/879282

Thanks for writing this, very good read! Sounds pretty sensible to focus on AppArmor for now (apart from the obvious stuff) as good as we can and potentially re-iterate after moving away from docker-shim. Two notes from my side:

Network

As of now we allow all pods to egress to the pod network. Although this does not mean the receiving end does allow that ingress traffic, it would probably make sense to restrict egress in general for wikifunctions to prevent access in case of configuration errors/mistakes in other workloads.

SELinux

> [...] one which currently is not easily quantified and justified. [...]

> Thanks for writing this, very good read! Sounds pretty sensible to focus on AppArmor for now (apart from the obvious stuff) as good as we can and potentially re-iterate after moving away from docker-shim. Two notes from my side:
>
> Network
>
> As of now we allow all pods to egress to the pod network. Although this does not mean the receiving end does allow that ingress traffic, it would probably make sense to restrict egress in general for wikifunctions to prevent access in case of configuration errors/mistakes in other workloads.

Good point, fully agreed. I had forgotten about the all pods to all pods egress rule.

> SELinux
>
> [...] one which currently is not easily quantified and justified. [...]

Good catch. Fixed, thanks!

Change 907930 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/extensions/WikiLambda@master] Ensure that evaluator cannot make external web requests.

https://gerrit.wikimedia.org/r/907930

Change 907930 abandoned by Cory Massaro:

[mediawiki/extensions/WikiLambda@master] Ensure that evaluator cannot make external web requests.

Reason:

done better elsewhere

https://gerrit.wikimedia.org/r/907930

apine updated https://gitlab.wikimedia.org/repos/abstract-wiki/aw-ci-chart/-/merge_requests/11

Block egress for evaluators and add test demonstrating that egress rules prevent external calls in the evaluators.

apine merged https://gitlab.wikimedia.org/repos/abstract-wiki/aw-ci-chart/-/merge_requests/11

Block egress for evaluators and add test demonstrating that egress rules prevent external calls in the evaluators.

Hello! According to this Slack thread, no gVisor environment will be available in production. Does that mean we can proceed with Wikifunctions without enabling gVisor?

> Hello! According to this Slack thread, no gVisor environment will be available in production. Does that mean we can proceed with Wikifunctions without enabling gVisor?

Slack threads aren't public unfortunately, so pasting them to Phabricator ends up with a forever-broken link for anyone without explicit access to those threads (that is, everyone in the communities).

But yes, you can proceed with Wikifunctions without enabling gVisor.

Change 936316 had a related patch set uploaded (by JMeybohm; author: Cory Massaro):

[operations/deployment-charts@master] Add AppArmor configuration for the deployed function-evaluator service.

https://gerrit.wikimedia.org/r/936316

Change 940147 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Both charts are required to use readOnlyRootFilesystem

https://gerrit.wikimedia.org/r/940147

Change 940152 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes

https://gerrit.wikimedia.org/r/940152

Change 940147 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Both charts are required to use readOnlyRootFilesystem

https://gerrit.wikimedia.org/r/940147

Change 940371 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] admin: Add wikifunctions apparmor profiles to PSP

https://gerrit.wikimedia.org/r/940371

OK, so situation as I understand it right now at 2023-07-24Z20:55 is:

Is that it for putting the service live in prod?

> OK, so situation as I understand it right now at 2023-07-24Z20:55 is:
>
> Is that it for putting the service live in prod?

DNS changes and service::catalog entry for the orchestrator:
https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress

Change 940152 merged by Alexandros Kosiaris:

[operations/puppet@production] Kubernetes: add support for deployment apparmor profiles

https://gerrit.wikimedia.org/r/940152

And...

akosiaris@kubernetes1007:~$ sudo apparmor_status 
apparmor module is loaded.
10 profiles are loaded.
10 profiles are in enforce mode.
   /usr/bin/man
   docker-default
   lsb_release
   man_filter
   man_groff
   nvidia_modprobe
   nvidia_modprobe//kmod
   tcpdump
   wikifunctions-evaluator
   wikifunctions-orchestrator
<snip>

Change 940371 merged by jenkins-bot:

[operations/deployment-charts@master] admin: Add wikifunctions apparmor profiles to PSP

https://gerrit.wikimedia.org/r/940371

Change 879282 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Add AppArmor profile usage

https://gerrit.wikimedia.org/r/879282

The AppArmor changes have been merged. I think the goal of this task is done. I'll resolve, but feel free to reopen.

Change 936316 abandoned by Jforrester:

[operations/deployment-charts@master] Add AppArmor configuration for the deployed function-evaluator service.

Reason:

Done in other patches.

https://gerrit.wikimedia.org/r/936316

Change #1020700 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::node: Ensure apparmor profiles are loaded automatically

https://gerrit.wikimedia.org/r/1020700

Change #1020701 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunction: Move apparmor annotation to pod template

https://gerrit.wikimedia.org/r/1020701

Change #1020700 merged by JMeybohm:

[operations/puppet@production] kubernetes::node: Ensure apparmor profiles are loaded automatically

https://gerrit.wikimedia.org/r/1020700

Change #1020803 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::node: Remove apparmor cleanup code

https://gerrit.wikimedia.org/r/1020803

Change #1020805 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] apparmor::profile: Don't try to define /etc/apparmor.d resource

https://gerrit.wikimedia.org/r/1020805

Change #1020805 merged by JMeybohm:

[operations/puppet@production] apparmor::profile: Don't try to define /etc/apparmor.d resource

https://gerrit.wikimedia.org/r/1020805

Change #1020701 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Move apparmor annotation to pod template

https://gerrit.wikimedia.org/r/1020701

Change #1020803 merged by JMeybohm:

[operations/puppet@production] kubernetes::node: Remove apparmor cleanup code

https://gerrit.wikimedia.org/r/1020803