CAP_PERFMON — and new capabilities in general

By Jonathan Corbet
February 21, 2020

The perf_event_open() system call is a complicated beast, requiring a fair amount of study to master. This call also has some interesting security implications: it can be used to obtain a lot of information about the running system, and the complexity of the underlying implementation has made it more than usually prone to unpleasant bugs. In current kernels, the security controls around perf_event_open() are simple, though: if you have the CAP_SYS_ADMIN capability, perf_event_open() is available to you (though the system administrator can make it available without any privilege at all). Some current work to create a new capability for the perf events subsystem would seem to make sense, raising the question of why adding new capabilities isn't done more often.

Capabilities are a longstanding effort to split apart the traditional Unix superuser's powers into something more fine-grained, allowing administrators to give limited privileges where needed without making the recipients into full superusers. There are 38 capabilities defined in current Linux kernels (numbered 0 through 37), controlling the ability to carry out a range of tasks including configuring terminal devices, overriding resource limits, installing kernel modules, or adjusting the system time. Among these capabilities, though, is CAP_SYS_ADMIN, nominally the capability needed to perform system-administration tasks. CAP_SYS_ADMIN has become the default capability to require when nothing else seems to fit; it enables so many actions that it has long been known as "the new root".
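To make the bit-per-privilege idea concrete, here is a minimal sketch that decodes a hexadecimal capability mask, such as the CapEff value shown in /proc/<pid>/status, into capability names. The table is deliberately partial, and the names CAP_NAMES and decode_caps are illustrative choices rather than anything the kernel or libcap provides.

```python
# Capability numbers as defined in include/uapi/linux/capability.h.
# This is a partial table for illustration; unlisted bits get a generic name.
CAP_NAMES = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
    23: "CAP_SYS_NICE",
    24: "CAP_SYS_RESOURCE",
    25: "CAP_SYS_TIME",
    26: "CAP_SYS_TTY_CONFIG",
    34: "CAP_SYSLOG",
    37: "CAP_AUDIT_READ",
}


def decode_caps(mask_hex):
    """Return the names of the capabilities whose bits are set in mask_hex."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    # Walk the mask from the lowest bit until no set bits remain above `bit`.
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names
```

A process holding only CAP_SYS_ADMIN would show a mask with bit 21 set, so decode_caps("0000000000200000") yields a single-element list naming that capability.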

A quick check shows well over 500 checks for CAP_SYS_ADMIN in the 5.6-rc kernel. During the 5.6 merge window, new checks were added to allow holders of CAP_SYS_ADMIN to send hardware-specific commands to obscure devices, configure time namespaces, load BPF programs for kernel operations structures, and open access to x86 MTRR registers. The perf events subsystem has also come to rely on CAP_SYS_ADMIN to keep unprivileged users out. As a result, to enable a user to call perf_event_open(), an administrator must also allow that user to mount filesystems, access PCI configuration spaces, tune memory-management policies, load BPF programs, and more. That is a lot of privilege to associate with a task that ordinary users are fairly likely to legitimately need to do.

This patch set from Alexey Budankov addresses that problem by creating a new capability called CAP_PERFMON to govern performance-monitoring tasks. With this patch installed, users (or their programs) could be granted CAP_PERFMON rather than CAP_SYS_ADMIN, enabling them to get performance data without adding all those other powers. Of course, CAP_SYS_ADMIN would still be sufficient to call perf_event_open(); otherwise the chances of breaking existing systems are high. But it would no longer be necessary if a user has CAP_PERFMON instead.
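The "either capability suffices" rule can be modeled in a few lines. This is a user-space sketch of the logic described above, not kernel code; the helper names are made up for illustration, and the number 38 for CAP_PERFMON is an assumption based on the value the capability carries in kernels where it was eventually merged.

```python
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # assumption: the number assigned once the capability was merged


def has_cap(effective, cap):
    """True if bit `cap` is set in the effective capability mask."""
    return bool(effective & (1 << cap))


def perfmon_capable(effective):
    """Model of the access rule: the new, narrow capability is enough,
    and CAP_SYS_ADMIN keeps working so existing setups do not break."""
    return has_cap(effective, CAP_PERFMON) or has_cap(effective, CAP_SYS_ADMIN)
```

The fallback to CAP_SYS_ADMIN is what preserves compatibility; it is also why the older capability never gets any weaker.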

At a first look, this change seems relatively obvious; it is hard to complain about separating out a relatively constrained, low-danger activity from a powerful capability. But it does lead one to wonder why this kind of change is done so rarely. The last time a new capability was added was in 2014, when CAP_AUDIT_READ joined the set. It would appear that the last time a capability was split out of CAP_SYS_ADMIN was the creation of CAP_SYSLOG in 2010. Once something becomes part of CAP_SYS_ADMIN, it seems, it stays there. Why might that be the case?

One reason, of course, is the aforementioned compatibility issue: once CAP_SYS_ADMIN allows an action, it can never lose that power without possibly breaking existing systems. When Serge Hallyn added CAP_SYSLOG, he added the usual code that made things continue to work if the process in question had CAP_SYS_ADMIN. In that case, though, the kernel issues a warning that use of CAP_SYS_ADMIN for these operations is deprecated. Nearly ten years later, the compatibility code — and the warning — remain. Splitting capabilities out of CAP_SYS_ADMIN is less than fully rewarding when the power of CAP_SYS_ADMIN itself can never be reduced.

Adding capabilities has hazards of its own, in that existing code will know nothing about a new capability and what it might control. A program that clears bits out of a capability mask is likely to clear the new one, but that capability might be needed going forward. Experience has shown that running a privileged program with selectively removed capabilities can open up surprising vulnerabilities; every new capability potentially creates just that sort of situation. So capabilities must be added with care. There is a reason why the SELinux build has a check that explicitly fails if new capabilities have been added without corresponding changes in SELinux itself.
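As a rough illustration of that hazard, consider a program written when capability 37 was the highest one defined, which "sanitizes" its mask by keeping only the bits it recognizes. Everything here is a hypothetical model: CAP_NEW stands in for any later addition, and drop_unknown is an invented helper.

```python
CAP_NET_RAW = 13
OLD_LAST_CAP = 37  # CAP_AUDIT_READ: the highest capability this "old" program knows
CAP_NEW = 38       # hypothetical capability added after the program was written


def known_caps_mask(last_cap):
    """Bitmask with one bit set for every capability number up to last_cap."""
    return (1 << (last_cap + 1)) - 1


def drop_unknown(effective):
    """Keep only the capabilities the old program recognizes, clearing the rest."""
    return effective & known_caps_mask(OLD_LAST_CAP)


# The administrator grants both an old and a new capability...
granted = (1 << CAP_NET_RAW) | (1 << CAP_NEW)
# ...but the old program silently discards the one it has never heard of.
kept = drop_unknown(granted)
```

Here the new bit disappears without any error, so whatever operation it was meant to authorize will start failing in a place the program's author never anticipated.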

Then, there is the unfortunate fact that capabilities in Linux are seen by many as a failed experiment. Nobody has ever made a practical, fully capability-based system using them, and many of the defined capabilities are relatively easily escalated to full root powers. Linux systems above the kernel level have made limited use of them, if indeed capabilities have been used at all. It can be hard to generate enthusiasm for refining a system that can never work as was originally intended and which may never be used in any serious way.

As an example, one obvious way to use capabilities to reduce privilege would be to remove the setuid bit on existing utilities and install just the needed capabilities instead. The kernel has supported file-based capabilities since the 2.6.24 release in 2008, after all. Your editor's current system, running Fedora 31 (which includes "first" among its goals), contains a grand total of nine binaries with capabilities attached:

    # getcap -r /
    /usr/bin/gnome-keyring-daemon = cap_ipc_lock+ep
    /usr/bin/clockdiff = cap_net_raw+p
    /usr/bin/arping = cap_net_raw+p
    /usr/bin/newuidmap = cap_setuid+ep
    /usr/bin/newgidmap = cap_setgid+ep
    /usr/bin/ping = cap_net_admin,cap_net_raw+p
    /usr/bin/gnome-shell = cap_sys_nice+ep
    /usr/sbin/mtr-packet = cap_net_raw+ep
    /usr/sbin/suexec = cap_setgid,cap_setuid+ep

It is good to know that gnome-shell does not run setuid root, so capabilities have brought some value here. But that compares with 31 setuid root binaries; it would appear that there is no prospect of this distribution becoming capability-only anytime soon.
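For readers who want to audit their own systems, a listing like the one above is easy to process mechanically. The sketch below parses lines in exactly the "path = caps+flags" layout shown here; other versions of getcap may format their output differently, and parse_getcap_line is an illustrative helper, not part of libcap.

```python
def parse_getcap_line(line):
    """Split one 'path = cap_a,cap_b+flags' line into its three parts.

    The flags are the usual file-capability sets: e (effective),
    p (permitted), and i (inheritable).
    """
    path, _, spec = line.strip().partition(" = ")
    names, _, flags = spec.partition("+")
    return path, names.split(","), set(flags)


def with_effective_bit(lines):
    """Paths whose capabilities are raised automatically at exec time."""
    return [path for path, _, flags in map(parse_getcap_line, lines)
            if "e" in flags]
```

Binaries marked only "+p", such as ping in the listing, have the capability in their permitted set and must raise it themselves before use.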

That said, there are signs of a shift with regard to capabilities. The never-ending desire to harden our systems against attacks is driving developers to take another look at Linux capabilities and how they might help. The Android system makes use of capabilities, for example. Systemd gives administrators extensive control over the capabilities granted to running programs. It may just be that, after many years of disuse, Linux capabilities are finally finding a place in deployed systems.

If that is the case, we may well see a renewed level of interest in increasing the granularity of the permissions controlled by capabilities. That could include splitting more powers out of CAP_SYS_ADMIN though, as noted above, that must be done carefully. CAP_SYS_ADMIN is unlikely to stop being the not-so-new root anytime soon, but perhaps it could be made into a capability that few programs need to have to get their work done.

CAP_PERFMON — and new capabilities in general

Posted Feb 21, 2020 18:03 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (14 responses)

Perhaps I just don't understand what the kernel developers are trying to do (which is a very real possibility as I don't read LKML religiously). But it appears that quite a lot of capability-guarded operations are inherently root-equivalent and cannot be meaningfully sandboxed without a complete redesign of Linux's security model. Some examples:

"Mount and unmount any filesystem" can be used to create a setuid-root binary, to backdoor anything in /bin or /sbin, and for a variety of other privilege escalation attacks.
"ptrace any process" can be used to execute arbitrary code as any user who is running code on the machine, which will generally include root.
"Load kernel modules" can be used to execute arbitrary code in kernel space, because that's exactly what it is meant to do.
"Call setuid(2) with any value" can be used to become root, and then full capabilities are regained on calling execve(2).

I don't really understand the purpose of trying to sandbox operations similar to the above. I suppose capabilities could be used to mitigate the confused-deputy problem in some cases, but they seem like a rather roundabout way of doing that (contrast seccomp, containers, etc.). Of course, there are privileged operations which are not root-equivalent, and sandboxing those does make sense. I just don't understand why capabilities are applied to literally every privileged operation under the sun.

CAP_PERFMON — and new capabilities in general

Posted Feb 21, 2020 19:17 UTC (Fri) by smurf (subscriber, #17840) [Link] (2 responses)

The operative words are "can be". These granular privileges aren't supposed to be granted to any random user process.

The idea is that the program that's been granted the privilege needs only be careful when using that exact privilege.

As an example, a program that has "mount any filesystem" privileges needs only be careful when actually mounting a file system, but not when opening the file that's backing the data for the file system (just as a random example). Similarly, the system profiler might be allowed to profile the system, but not to overwrite /etc/shadow with the resulting data.

CAP_PERFMON — and new capabilities in general

Posted Feb 21, 2020 19:54 UTC (Fri) by smcv (subscriber, #53363) [Link] (1 responses)

> The idea is that the program that's been granted the privilege needs only be careful when using that exact privilege

... and when defending itself against being subverted by processes that don't have the privilege, including its parent process.

CAP_PERFMON — and new capabilities in general

Posted Mar 12, 2020 16:29 UTC (Thu) by immibis (guest, #105511) [Link]

I think that was his/her point - you know that any subversion has to go through the mechanism to which permission is granted, so you only need to be especially careful there. You don't need to check the output file path the parent passed to you, because you don't have any special permission to write to files that the parent process couldn't write to anyway.

CAP_PERFMON — and new capabilities in general

Posted Feb 21, 2020 21:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (2 responses)

It depends on the use case. Some capabilities are not equivalent to root, and others can be paired with other defense mechanisms:

> "Mount and unmount any filesystem" can be used to create a setuid-root binary, to backdoor anything in /bin or /sbin, and for a variety of other privilege escalation attacks.

Not in combination with mount namespaces + seccomp to block exec, for example. A program that is launched as root can set them up before dropping all other capabilities.

> "Call setuid(2) with any value" can be used to become root, and then full capabilities are regained on calling execve(2).

Besides using seccomp to block execve, you can also use inheritable capabilities so that children do not keep them.

In other cases, the environment around the program can limit the root-equivalence of capabilities:

> "Load kernel modules" can be used to execute arbitrary code in kernel space, because that's exactly what it is meant to do.

You can use SELinux to prevent the program from loading a .ko file that wasn't given a particular SELinux label; or you can reject non-signed modules.

> "ptrace any process" can be used to execute arbitrary code as any user who is running code on the machine, which will generally include root.

A process that runs in a pid namespace will not be able to exit it and do ptrace outside its pid namespace (IIRC).

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 12:35 UTC (Sun) by ibukanov (subscriber, #3942) [Link] (1 responses)

Those examples actually prove the grandparent's point. In my experience things like no-new-privileges, namespaces, and syscall filters are vastly more useful for securing systems than capabilities. With those it is possible to secure a system even without restricting capabilities, while capabilities alone cannot realistically secure the system. Then again, why did it take so long to come up with ambient capabilities, which allow granting a particular capability to a particular invocation of a process rather than to each and every execution of a binary?

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 12:45 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

Capabilities alone are useless. Capabilities make no-new-privs and seccomp stronger, and seccomp makes capabilities usable.

CAP_PERFMON — and new capabilities in general

Posted Feb 22, 2020 8:03 UTC (Sat) by epa (subscriber, #39769) [Link] (7 responses)

On a large number of deployed systems, any ordinary user account can be escalated to root, either because of unpatched bugs or because the architecture is inherently not that secure. That does not mean the whole structure of user permissions is useless.

A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities. Those capabilities might get you root through a few tricks, but getting the daemon to perform those steps is harder than getting it to overwrite a random file because of missing path sanitization.

For human accounts it can also work to have specific administrative roles with their needed capabilities rather than an all-powerful root account. This is why even classical Unix systems have a sudoers file, so admins log in with an ordinary account and ‘sudo’ particular commands when needed. In principle this gives the same power as just logging in as root, but it gives better protection against mistakes and some auditing of what the admin does, even if half the time the command is ‘sudo bash’.

In simpler times there were efforts to split the human admin account into its capabilities too. Windows NT defined roles like ‘Backup operator’. Unfortunately the messy world we inhabit means that any admin probably does need full access to get anything done.

CAP_PERFMON — and new capabilities in general

Posted Feb 22, 2020 20:03 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities.
The problem is that there are almost no capabilities that are useful for regular daemons, with the exception of CAP_NET_BIND_SERVICE (which shouldn't have existed in the first place).

So if your daemon runs as root then it probably needs it for something that can't be expressed as capabilities anyway.

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 17:24 UTC (Mon) by imMute (guest, #96323) [Link] (3 responses)

Why do you believe CAP_NET_BIND_SERVICE shouldn't exist?
I, for one example, believe it would be useful to allow an HTTP server to bind to port 80/443 without needing to be started as root.

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 17:49 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Why do you believe CAP_NET_BIND_SERVICE shouldn't exist?
Because there should have been no restriction on <1024 ports to begin with (i.e. everything should have CAP_NET_BIND_SERVICE).

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 18:38 UTC (Mon) by smurf (subscriber, #17840) [Link] (1 responses)

You could pass the open port to the web server as an open file descriptor.
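A minimal sketch of that approach follows, assuming a privileged launcher that creates the listening socket and an unprivileged server that adopts it by descriptor number. The function names are illustrative, and the handoff mechanism itself (for example the pass_fds argument of subprocess.Popen) is left out.

```python
import os
import socket


def open_listener(port, host="127.0.0.1"):
    """Launcher side: bind and listen. For ports below 1024 this step is
    the only one that needs privilege (or CAP_NET_BIND_SERVICE)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(8)
    return s


def adopt_listener(fd):
    """Server side: wrap an inherited descriptor in a socket object.
    Family and type are detected from the descriptor itself."""
    return socket.socket(fileno=fd)
```

The server never calls bind() itself, so it needs no special privilege; the cost is that the set of listening sockets is fixed by whatever launched it.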

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 19:16 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

And then when the web server wants to e.g. use SO_REUSEPORT to have a separate socket for each socket / group of cores/core, you have to teach your system startup tooling that. And can't configure it in the application's config file anymore.

Not saying that passing the fd in is not a good solution in some cases, just that it does have its own set of implied limitations.

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 19:20 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> On a large number of deployed systems, any ordinary user account can be escalated to root, either because of unpatched bugs or because the architecture is inherently not that secure. That does not mean the whole structure of user permissions is useless.

The difference, I think, is that you are describing a bug, and I am describing how the system was designed to work.

Obviously, defense in depth is a Good Thing. I am not suggesting we eliminate capabilities entirely, or that we do anything at all, for that matter. The concern is that additional complexity in privileged code (such as the kernel) carries additional risk. So when adding new layers of security, we need to balance the security benefits with the complexity. It's not clear to me how capabilities strike that balance, and under what circumstances they ought to be used in concert with or in lieu of seccomp, containerization, SELinux, etc. As a sysadmin, I would like to know which security subsystems are actually best practices, and which ones are just there because somebody wanted them to be there.

> A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities. Those capabilities might get you root through a few tricks, but getting the daemon to perform those steps is harder than getting it to overwrite a random file because of missing path sanitization.

This is a reasonable point. As I said, capabilities do offer some defense against confused deputies. It's just not clear to me that they are the Right Way to go about doing that.

(Of course, this is a more general problem with Linux. The man pages are great at telling you what syscall X does, but often not so good at telling you why you might want that functionality, or how you might choose to compose it with other syscalls. Section 7 pages frequently do provide this information, but they can be hard to find because it's less obvious what name you should give to man. Section 2 pages, on the other hand, tend to be rather terse. I realize this is by design, but rightly or wrongly, many people learn to program Unix by reading man pages, and this is not a great first impression.)

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 13:28 UTC (Mon) by epa (subscriber, #39769) [Link]

True. I think that splitting the root account's powers into umpteen different capability bits is conceptually pretty simple. Instead of checking uid==0 you check whether the relevant bit is set. There's not too much to go wrong in that, and it's certainly less code than SELinux or seccomp. The hard part seems to be finding space for the bitmask in relevant structures and perhaps in filesystems.

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 7:08 UTC (Sun) by matthias (subscriber, #94967) [Link]

> One reason, of course, is the aforementioned compatibility issue: once CAP_SYS_ADMIN allows an action, it can never lose that power without possibly breaking existing systems. When Serge Hallyn added CAP_SYSLOG, he added the usual code that made things continue to work if the process in question had CAP_SYS_ADMIN. In that case, though, the kernel issues a warning that use of CAP_SYS_ADMIN for these operations is deprecated. Nearly ten years later, the compatibility code — and the warning — remain. Splitting capabilities out of CAP_SYS_ADMIN is less than fully rewarding when the power of CAP_SYS_ADMIN itself can never be reduced.

I do not buy this. The compatibility code could be made optional in kernel config. There already are a bunch of options that say in the help text "Only enable this if you want to run binaries from the stone age." Probably there is no demand for such an option because CAP_SYS_ADMIN is omnipotent anyway. The reward for splitting capabilities out of CAP_SYS_ADMIN is not that CAP_SYS_ADMIN becomes less powerful. The reward is that fewer processes need the power of CAP_SYS_ADMIN and processes can use less privileged capabilities instead.

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 12:10 UTC (Sun) by meyert (subscriber, #32097) [Link] (2 responses)

Increasing the major version number to 5 could have been used to introduce breaking changes that resolve deadlock situations like the one above, and to get rid of other legacy stuff.

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 18:35 UTC (Sun) by intelfx (subscriber, #130118) [Link]

Alas — Linux kernel does not use semantic versioning. The model is "we do not break userspace".

CAP_PERFMON — and new capabilities in general

Posted Feb 23, 2020 19:01 UTC (Sun) by mpr22 (subscriber, #60784) [Link]

A majver bump for Linux means "Linus woke up and felt like bumping the majver instead of the minver", and nothing more.

To be allowed to break a userspace interface, you have to be able to demonstrate that nobody who's paying attention is using that interface on a system that has a realistic prospect of being upgraded to the new kernel version.

CAP_PERFMON — and new capabilities in general

Posted Feb 24, 2020 7:53 UTC (Mon) by diconico07 (guest, #117416) [Link] (1 responses)

Another reason to add capabilities carefully is the fact that there can only be a limited number of them (64 if I remember correctly), so a badly defined or "useless" capability (e.g. CAP_SYS_PACCT or CAP_NET_BROADCAST) only lowers the number of capabilities left for future features that would need a clearly separated capability.

I really think CAP_SYS_ADMIN is bloated and unusable, and that some of its features could have gotten their own capabilities (e.g. the seccomp-related checks). For now I prefer giving root rather than CAP_SYS_ADMIN, as it shows more clearly that the process might do dangerous things in a quite uncontrollable manner (absent other measures like seccomp et al.). However, I don't think we can strike this balance of having some really usable and useful capabilities without also having some bloated ones (remember that nothing in the kernel just checks for root anymore; becoming root just means getting all capabilities).

CAP_PERFMON — and new capabilities in general

Posted Feb 25, 2020 17:35 UTC (Tue) by Freeaqingme (subscriber, #103259) [Link]

I concur.

A few years ago I had a process that spawned workers. The parent process would then assign jobs to these workers. I wanted every job to be performed in a specific namespace/cgroup. Therefore, I needed the master process to change the namespaces of the spawned workers. Such a syscall does not exist, so we decided the worker should switch to that namespace (setns()) itself.

Ideally, we'd not run the child processes as root because they also executed/processed user input. As such, I set out to implement a custom capability that would grant a process the rights to change its namespace/cgroups without having to run as root.

A few limitations I ran into:
- There's indeed a max of 64 (IIRC) capabilities. This makes it difficult to pick a number of which you're sure it won't be used by another capability (introduced by 'upstream') in the future.
- I don't entirely recall it anymore, but I believe we'd have to modify libcap, libapparmor, libc as well as the kernel itself.

These constraints make it hard to prototype something. Lack of prototypes will probably also - at least in part - be a reason why there's not much of it upstreamed.

Also, because it's very specific to our use case, I did expect that upstream would not be willing to accept this new privilege. That may be a reason why there's so relatively few capabilities. For every scenario a different capability could probably be thought of.

I'm not a seasoned kernel developer, so I may have had more challenges than someone more experienced in this regard would have had. However, after trying various options for a couple of days, our solution simply was to run the mentioned master process as root, and harden it through things like AppArmor instead.

Reducing CAP_SYS_ADMIN

Posted Mar 4, 2020 13:40 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

I think that as a new capability is added, that ability should be "deleted" from CAP_SYS_ADMIN. Have a boot flag that says "restrict CAP_SYS_ADMIN" and those abilities will no longer be there (okay, the default is don't restrict, and those abilities will still be available to anything that needs them).

But then, if we get the distros on board, especially long term distros like RHEL, they should state that "anything that won't compile and run when the flag is on, is not supported". If the long-term-kernel maintainers also agree that no capability-removal code will be back-ported, so the CAP_SYS_ADMIN capabilities are fixed for any individual x.y kernel, then there is clear pressure on upstream to support new capabilities, and users who run longer-term kernels can rely on the capability system to provide the protection it was designed to.

Cheers,
Wol

Reducing CAP_SYS_ADMIN

Posted Mar 4, 2020 14:20 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

Won't that break containers running older distros on newer kernels? Would we need capability namespaces then?