
allow tool users to attach strace to their processes (at least on exec hosts)
Closed, Declined · Public

Description

Makes debugging stuck processes much easier. Currently disallowed:

attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf

This restriction is mostly useful on systems where secret information is expected to be in process memory but not on disk. That sounds unlikely for Tool Labs, but I'd like to be sure there's no other security issue I'm overlooking.
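
For reference, the setting behind that error can be inspected directly on a host. The meanings of the values come from the Yama documentation: 0 is the classic behaviour (attach to any same-uid process), 1 is the current restricted behaviour (only direct descendants), 2 is admin-only, and 3 disables attaching entirely:

# 0 = classic (same-uid attach), 1 = restricted (descendants only),
# 2 = admin-only, 3 = no attach at all
cat /proc/sys/kernel/yama/ptrace_scope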

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw subscribed.
Restricted Application added a subscriber: Aklapper.
scfc renamed this task from "allow tool users to strace (at least on exec hosts)" to "allow tool users to attach strace to their processes (at least on exec hosts)". Oct 2 2015, 12:34 AM
scfc set Security to None.
valhallasw moved this task from Backlog to Ready to be worked on on the Toolforge board.

Some background is at https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection. I believe the necessary Puppet fragment would be to add:

'kernel.yama.ptrace_scope' => 0,

to sysctl::parameters in modules/toollabs/manifests/exec_environ.pp and then run echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope on all execution nodes (as an alternative to resetting the nodes).
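
A minimal sketch of what applying and verifying that live change could look like (the host names are placeholders, not actual Toolforge node names):

for host in tools-exec-01 tools-exec-02; do        # placeholder host names
    ssh "$host" 'echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope'
    ssh "$host" 'sysctl kernel.yama.ptrace_scope'  # should now report 0
done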

bd808 added subscribers: MoritzMuehlenhoff, bd808.

@MoritzMuehlenhoff Loosening this protection on the Toolforge exec nodes seems reasonable to me, but I wanted to check with you before putting in a Puppet patch and actually changing the live nodes. The protection seems to be primarily aimed at desktop systems to help prevent malware from inspecting processes. In our shared hosting environment it is not reasonable to allow sudo for gdb/ptrace and it is not scalable to have root holders perform inspections for tenants.

In the context of the tools themselves, I don't see a problem with secret data. However, the tools nodes do have other processes that might be carrying such secret information in memory for cluster operations.

It seems that enabling ptrace increases the attack surface considerably and there have been many exploits that leverage ptrace as a starting point (last one from 2017: CVE-2017-15537).

If the concern is just around troubleshooting stuck processes, I'd suggest tool authors increase logging instead (although that's not the same thing as peeking into syscalls).

In the context of the tools themselves, I don't see a problem with secret data. However, the tools nodes do have other processes that might be carrying such secret information in memory for cluster operations.

I don't understand this concern. Even with classic ptrace permissions, users may only attach to a process running under the same uid, so they cannot attach to a process belonging to the infrastructure. The problem would have to be infrastructure processes attaching to other infrastructure processes.

ptrace allows the attacher to read & modify the memory & execution state of the attachee.
Regarding tools running on the grid: they generally do not have a tty or any one-time-use source of secrets to use. I do not know of any tool utilizing a cross-uid one-time-use secret source. For a non-cross-uid source, the attacker could just generate another secret and rotate it, without needing the complexity of ptrace. For a multi-time-use source, the attacker could just authenticate to the secret source and obtain the secret, again without the complexity of ptrace.
In other words, the information contained in memory can generally only be derived from the filesystem (an already-authenticated source) and the network (and other later-authenticated sources). Reading such network or other later-authenticated sources itself requires credentials that ultimately come from the filesystem, leaving the filesystem as the only original source. With the same uid, the attacker could just read those files themselves and impersonate the attachee to get the secret.
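
To make the distinction concrete, here is a small illustration (assuming an ordinary login shell on an exec host) of what the two modes allow for a same-uid attach:

sleep 600 &      # a long-running process under your own uid
strace -p $!     # with ptrace_scope=1 this is denied, because strace is not
                 # an ancestor of the target; with ptrace_scope=0 the attach
                 # succeeds, since both processes share the same uid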

Regarding the infrastructure, the same argument as for tools applies to most of it. Let's look at some examples of infrastructure processes that contain, or have access to, secrets:

  • root-owned processes like puppet & systemd: Only root may attach to them anyway. An attacker shouldn't have root; if they do, the ptrace restriction is entirely moot, since root can attach to whatever userspace process it wants.
  • processes that inherently use secret information, like nginx (which uses SSL certificates): An attacker should never be running under the same uid as these infrastructure processes; only the services themselves are. In addition, these services do not use ptrace themselves. If the attacker gains arbitrary code execution in one of these service processes in order to use ptrace to read such secret data, they may as well read the secrets (like SSL certs) directly from the filesystem or from the process's own memory, using only the code execution and no ptrace at all.

Therefore, tl;dr: same-uid ptrace does not leak information in the context of the Toolforge grid. The restrictions apply more to a desktop system where, say, a game should not be able to ptrace a browser to grab the password typed into it.

It seems that enabling ptrace increases the attack surface considerably and there have been many exploits that leverage ptrace as a starting point (last one from 2017: CVE-2017-15537).

This would be similar to the argument that "we shouldn't use sudo because there are so many exploits that leverage setuid as a starting point". It doesn't make sense, does it?

I have not looked closely into how CVE-2017-15537 works, nor into how ptrace works internally, but the sample exploit code attaches ptrace to a child process -- something that is possible under both classic and restricted ptrace permissions, since a process may always trace its own children.

  • If the leaked information comes from the ptraced process, my argument above that same-uid ptrace does not leak information applies.
  • If the leaked information comes from a faulty context switch (i.e. from whatever process was running prior to the context switch), then the source is practically random. You could run the exploit just by attaching to a child process many times until the leak happens to come from your target. Whether the ptrace permissions are restricted or classic does not make much of a difference.
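
For instance, running a command under strace makes the tracee a direct child of the tracer, which is permitted regardless of the ptrace_scope setting (the command name below is just a placeholder):

strace -f -o /tmp/trace.log ./some-command    # tracing a child: allowed even
                                              # with ptrace_scope=1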

Yes, we may get attacked. The way forward to keep us secure is to keep the platform up to date on all the bug fixes.

If the concern is just around troubleshooting stuck processes, I'd suggest tool authors increase logging instead (although that's not the same thing as peeking into syscalls).

While being a root on Toolforge, I have debugged several tools (e.g. T185561, T195834, T198503, T200341) using strace & gdb.
Logging in many cases only helps where some specific weirdness is already expected; debugging hangs via logging is largely trial and error. Worse, sometimes the issue is in native code (like in the mono ticket), where the problem appears random from the perspective of the managed code but is more deterministic in the native code. I would very much like to document and teach tool authors how to debug their tools themselves, rather than have them rely on a root to debug their tools.
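
For the record, the kind of invocations involved are along these lines (the pid is a placeholder for the stuck process):

strace -f -tt -p <pid>                                    # follow threads, timestamp each syscall
gdb -p <pid> -batch -ex 'thread apply all bt' -ex detach  # dump every thread's backtrace, then detach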

I think you make valid points, but the fact that ptrace increases the attack surface needs to be weighed against the benefits it would provide on an open platform like ours. We already provide shell access, which is the easiest route to a security compromise. Making that even easier, considering ptrace's history, is a change that worries me.

If you search for "ptrace escape container", you may start to share my concerns. The Linux security mechanisms are porous and I wouldn't trust simple same-process-UID restrictions considering there have been other methods to abuse ptrace and increase privileges.

Maybe we can reach a compromise and have it enabled for certain users on request for a limited period of time, if that's even possible on a per-user basis. I think we should consider ptrace as a security-sensitive feature in an open platform that provides shell access.

If you search for "ptrace escape container", you may start to share my concerns. The Linux security mechanisms are porous and I wouldn't trust simple same-process-UID restrictions considering there have been other methods to abuse ptrace and increase privileges.

I don't find any particular article that blames the use of ptrace as the main vector of a container escape. Looking at the first page of my Google search results, the exploits are:

  • Escaping Docker container using waitid() – CVE-2017-5123: This was a kernel ring 0 arbitrary code execution exploit, not anything wrong with ptrace.
  • unprivileged guest to host real-root escape via lxc-attach: lxc-attach was broken in handling security policies. The container itself does not have the privileges to escape without the host messing up. (This is like root code accidentally giving out the root password to an unprivileged user.) Again, ptrace has nothing to do with it.
  • Abusing Privileged and Unprivileged Linux Containers: As far as I understand the section on "The ptrace(2) Hole", the issue is that ptrace might bypass seccomp, by using ptrace to modify the system call number after seccomp does its checks. However, this would apply to any process that is being traced, whether it is traced under classic or restricted permissions. Docker / LXC themselves should disallow any such unsafe behavior, though I don't see how skipping seccomp is a way to escape the container when the namespaces and capabilities are set up correctly (e.g. no CAP_SYS_MODULE, no /dev/kcore, etc.). The kernel syscall handlers themselves should check the capabilities.

( If you find better articles / papers I'd like to see them :) )

However, let's say PID namespacing is badly implemented, and it is possible to ptrace a process outside the container under classic permissions. Let's see where Linux namespaces are used on Toolforge:

  • Systemd -- These belong to the infrastructure. The goal of systemd's use of namespaces is to limit the ability of arbitrary services to do much damage to the system. Again, I'm not aware of any infrastructure service that utilizes ptrace, so if something is wrong with ptrace and an attacker wants to use these services for a container escape, they need arbitrary code execution in those services first.
  • Generic & PAWS Kubernetes clusters -- These are single-user Docker containers whose UIDs map directly to the host, and both use LDAP for UID -> username translation. Therefore, for any single LDAP-attached UID, the owner is the same in the container and on the host. If some process in the container ptraces a process outside the container in an attempt to escape, then under classic ptrace permissions it could only ptrace a process it already owns on the host, assuming such processes exist at all. (In the current situation they don't: the k8s workers are login-restricted to roots, so at best the attacker would just hop into another container, in the bad case of PID namespacing being badly implemented.)

( I'm unaware of anywhere else on Toolforge that uses Linux namespaces. If you know of any, I'd like to hear about them )
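
As a quick sanity check of the claim about UID mapping, one could compare the namespace and uid information from inside a container with the host (a sketch; run from inside a container shell):

cat /proc/self/uid_map    # an identity map ("0 0 4294967295") means no
                          # user-namespace remapping is in effect
id -u                     # should match the tool account's LDAP uid on the host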

Therefore to leverage such an attack, you would need:

  • an arbitrary code execution into a UID you don't own & badly implemented PID namespaces, or
  • an arbitrary code execution into a ring you don't own

After that you could indeed use ptrace to gain arbitrary code execution outside the container, which would indeed enable a container escape, but given those prerequisites, that is very unlikely.

Thanks for the detailed analysis, much appreciated. I liked the point you made about the difficulty of executing code under a UID you don't own.

It does look like the number of ptrace-related exploits in recent kernels is lower than in the era I was drawing my conclusions from (the 2.x/3.x kernel series). I'm updating my assumptions; however...

I did a cursory search for recent exploits related to ptrace:

  • CVE-2018-1000199 - Unprivileged user can crash the system or escalate privileges in kernels 3.18 or before (mitigation is to filter the argument in... the ptrace_set_debugreg function)
  • CVE-2017-15537 - Ptrace used to read FPU registers in kernels before 4.13.5 (might not be that great of an issue in our non-secrets world?)
  • CVE-2014-4699 - Race conditions exploited through ptrace allow DoS or privilege escalation in kernels before 3.15.4
  • Maybe more? I stopped looking.

Also, it seems we might be immune to this warning from the seccomp(2) man page thanks to our use of kernel 4.9 on the k8s workers, but it adds to my argument that Linux security is porous, although it has been improving lately with the push toward LXC-supporting technologies (and on the grid this wouldn't apply, since I don't think it uses seccomp for anything):

Before kernel 4.8, the seccomp check will not be run again after the tracer is notified. (This means that, on older kernels, seccomp-based sandboxes must not allow use of ptrace(2)—even of other sandboxed processes—without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)
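
(As an aside, whether a given process actually runs under seccomp can be read straight from /proc, which is one way to check the assumption about the grid:)

grep Seccomp /proc/self/status    # 0 = no seccomp, 1 = strict mode, 2 = filter mode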

I still think I need to make the point that this is a security-sensitive API that we shouldn't expose to end users unnecessarily if we can avoid it. And that moves me toward some quantitative analysis: how many users (that don't already have root) are requesting access to strace to debug their applications, do we know?

Finally, can we hope to teach users to use this advanced tool and understand what they're seeing? I've been playing this sysadmin game for a long time and am usually more surprised than bored by what I see in strace (my point is, it's a whole new adventure every time it's necessary to dig this deep into the stack). However, I haven't been here long enough to pretend I know who our users are, so I defer to the people with more experience to answer this.

Overall, if this solves a problem big enough for us, let's go for it. Hopefully by moving to Stretch we avoid these pesky kernel bugs (or can react faster).

I did a cursory search for recent exploits related to ptrace:

These all look like generic ptrace-related kernel bugs that are exploitable whether the ptrace permissions are classic or restricted.

Also, it seems we might be immune to this warning from the seccomp(2) man page thanks to our use of kernel 4.9 on the k8s workers, but it adds to my argument that Linux security is porous, although it has been improving lately with the push toward LXC-supporting technologies (and on the grid this wouldn't apply, since I don't think it uses seccomp for anything):

This is the same point I made earlier:

  • Abusing Privileged and Unprivileged Linux Containers: As far as I understand the section on "The ptrace(2) Hole", the issue is that ptrace might bypass seccomp, by using ptrace to modify the system call number after seccomp does its checks. However, this would apply to any process that is being traced, whether it is traced under classic or restricted permissions. Docker / LXC themselves should disallow any such unsafe behavior, though I don't see how skipping seccomp is a way to escape the container when the namespaces and capabilities are set up correctly (e.g. no CAP_SYS_MODULE, no /dev/kcore, etc.). The kernel syscall handlers themselves should check the capabilities.

I still think I need to make the point that this is a security-sensitive API that we shouldn't expose to end users unnecessarily if we can avoid it. And that moves me toward some quantitative analysis: how many users (that don't already have root) are requesting access to strace to debug their applications, do we know?

Me? :) I was dreaming of being able to strace my processes before I became a root. I believe more advanced people like Magnus and Danmichaelo would also benefit from it. Because it is unavailable, people might have given up on debugging they would otherwise have done and fallen back to a restart-periodically approach or a watchdog approach.

Finally, can we hope to teach users to use this advanced tool and understand what they're seeing?

I can help with this :) Maybe start a Help:Toolforge/Debugging page with some FAQs on using gdb & strace. We could even make some sort of script to abstract away the details of the grid (like our grid crontab setup).
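
A very rough sketch of what such a wrapper could look like, purely hypothetical (the script name, the qstat output layout it assumes, and the idea of passing the pid explicitly are all assumptions, not an existing Toolforge tool):

#!/bin/bash
# strace-job <job-id> <pid>: find the exec host a grid job is running on
# and attach strace to the given pid there. Hypothetical helper script.
set -eu
job_id="$1"
pid="$2"

# Assumes the default SGE qstat layout where column 8 is "<queue>@<host>";
# the exact columns may differ between grid engine versions.
host=$(qstat | awk -v id="$job_id" '$1 == id { split($8, q, "@"); print q[2] }')

exec ssh -t "$host" "strace -f -tt -p $pid"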

Hopefully by moving to Stretch we avoid these pesky kernel bugs (or can react faster).

Makes sense. If we want to encourage the use of historically-not-so-safe syscalls, it makes sense to keep the kernel up to date :)

@zhuyifei1999 thanks for your perspective, it's been enlightening to me because 1) the analysis you provided of the existing exploits was very informative and 2) you're actively using the platform and thus have first-hand experience with its shortcomings :) I'll let other team members weigh in as they see fit and hopefully we can reach a consensus here. Thanks again!

dcaro subscribed.

The grid will be shutting down on Mar 14th 2024.