We are using a new facility called Seccomp-BPF, introduced in Linux 3.5 and developed by Will Drewry.
Seccomp-BPF builds on the ability to load small BPF (for BSD Packet Filter) programs into the kernel, where they are interpreted. This feature was originally designed for tcpdump, so that filters could run directly in the kernel for performance reasons.
BPF programs are untrusted by the kernel, so they are limited in a number of ways. Most notably, they can't have loops, which bounds their execution time by a monotonic function of their size and allows the kernel to know they will always terminate.
With Seccomp-BPF, BPF programs can now be used to evaluate system call numbers and their parameters.
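To make that concrete, here is a minimal sketch of such a filter in C. It is not Chrome's policy: the names sketch_filter and run_filtered_child are made up for illustration, and the low/high split of the first argument assumes a little-endian machine (x86-64 or aarch64). The filter checks the architecture, allows exit_group() and write() to file descriptor 2, and fails everything else with EPERM. It is installed in a forked child because a filter cannot be removed once installed.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and aarch64"
#endif

/* Offsets into struct seccomp_data, the only input a filter sees.
   The low/high halves of args[0] assume a little-endian machine. */
#define OFF_ARCH    offsetof(struct seccomp_data, arch)
#define OFF_NR      offsetof(struct seccomp_data, nr)
#define OFF_ARG0_LO offsetof(struct seccomp_data, args[0])
#define OFF_ARG0_HI (OFF_ARG0_LO + 4)

static struct sock_filter sketch_filter[] = {
    /* 0-2: refuse to run under an unexpected system call ABI. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF_ARCH),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 3-5: exit_group() is allowed; anything but write() is denied. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF_NR),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 6, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 4),
    /* 6-9: write() is allowed only when its first argument is exactly 2. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF_ARG0_HI),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF_ARG0_LO),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 1, 0),
    /* 10: deny with EPERM.  11: allow. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns 0 if the child observed the expected policy, 1 or 2 on a policy
   mismatch, 3 if the filter could not be installed, -1 on fork/wait failure. */
static int run_filtered_child(void) {
    struct sock_fprog prog = {
        .len = sizeof(sketch_filter) / sizeof(sketch_filter[0]),
        .filter = sketch_filter,
    };
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* no_new_privs lets an unprivileged process install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(3);
        /* write() to fd 1 is not on the allowlist: expect EPERM. */
        if (syscall(__NR_write, 1L, "", 0L) != -1 || errno != EPERM) _exit(1);
        /* write() to fd 2 passes (zero bytes, so nothing is printed). */
        if (syscall(__NR_write, 2L, "", 0L) != 0) _exit(2);
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Note how low-level this is: the filter compares raw 32-bit words and jumps by relative offsets, and it has no notion of what write() means.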
This is a huge change for sandboxing code in Linux, which, as you may recall, has been very limited in this area. It's also a change that recognizes and innovates in two important dimensions of sandboxing:
- Mandatory access control versus "discretionary privilege dropping". Something I always felt strongly about and have discussed before.
- Access control semantics, versus attack surface reduction.
Let's talk about the second topic. Having nice, high-level access control semantics is appealing and, one may argue, necessary. When you're designing a sandbox for your application, you may want to say things such as:
- I want this process to have access to this subset of the file system.
- I want this process to be able to allocate or de-allocate memory.
- I want this process to be able to interfere (debug, send signals) with this set of processes.
The capabilities-oriented framework Capsicum takes such an approach. This is very useful.
However, with such an approach it's difficult to assess the kernel's attack surface. When the whole kernel is in your trusted computing base "you're going to have a bad time", as a colleague recently put it.
Now, in that same dimension, at the other end of the spectrum, is the "attack surface reduction" oriented approach. The approach where you're close to the ugly guts of implementation details, the one taken by Seccomp-BPF.
In that approach, read()+write() and vmsplice() are completely different beasts, because you're not looking at their semantics, but at the attack surface they open in the kernel. They perform similar things, but perhaps ihaquer will have a harder time exploiting read()/write() on pipes than vmsplice(). Semantically, uselib() seems to be a subset of open() + mmap(), but similarly, the attack surface is different.
The drawback, of course, is that implementing particular sandbox semantics with such a mechanism looks ugly. For instance, let's say you want to allow opening any file in /public from within the sandbox. How would you implement that in seccomp-BPF?
Well, first you need to understand what set of system calls would be concerned by such an operation. That's not just open(), but also openat() (an ugly implementation-level detail: some libc implementations will happily use openat() with AT_FDCWD instead of open()). Then you realize that a BPF program in the kernel will only see a pointer to the file name, so you can't filter on that. Even if you could dereference pointers in BPF programs, it wouldn't be safe to do so: an attacker could create another thread that modifies the file name after the BPF program has evaluated it, so the kernel would also need to copy it to a safe location.
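The race in that last point can be illustrated with a small userspace simulation in C. This is not kernel code and not how seccomp works internally; the function names are hypothetical, and toctou_attacker_rewrites() simply stands in for a second thread that wins the race between the check and the use.

```c
#include <stdio.h>
#include <string.h>

#define TOCTOU_PREFIX "/public/"

static int toctou_is_public(const char *p) {
    return strncmp(p, TOCTOU_PREFIX, strlen(TOCTOU_PREFIX)) == 0;
}

/* Stand-in for another thread rewriting the shared buffer
   (which must be at least 12 bytes long). */
static void toctou_attacker_rewrites(char *shared) {
    strcpy(shared, "/etc/shadow");
}

/* Racy: validate the caller's buffer in place, then read it again later.
   Stores the path that would really be opened in `used`.
   Returns 1 if the check passed, 0 otherwise. */
static int toctou_check_then_use(char *shared, char *used, size_t n) {
    if (!toctou_is_public(shared)) return 0;
    toctou_attacker_rewrites(shared);      /* race window */
    snprintf(used, n, "%s", shared);       /* "use" sees the new contents */
    return 1;
}

/* Safe: snapshot the path first, then check and use only the private copy. */
static int toctou_copy_then_check(char *shared, char *used, size_t n) {
    char copy[256];
    snprintf(copy, sizeof copy, "%s", shared);
    toctou_attacker_rewrites(shared);      /* too late to matter */
    if (!toctou_is_public(copy)) return 0;
    snprintf(used, n, "%s", copy);
    return 1;
}
```

Starting from "/public/readme", the first function passes its check and then ends up using "/etc/shadow", while the second one keeps using the path it actually validated.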
In the end, what you need is a trusted helper process (or broker) that runs unsandboxed for this particular set of system calls. It accepts requests to open files over an IPC channel, makes the security decision, and sends the file descriptor back over IPC.
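A rough sketch of such a broker in C follows, assuming a Unix domain socket as the IPC channel and SCM_RIGHTS to pass the descriptor. The one-byte 'Y'/'N' reply protocol, the function names, and the crude path check are all invented for this example; this is not Chrome's broker, and it ignores symlinks entirely.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union broker_ctl { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; };

/* Crude policy: the path must start with prefix and contain no "..". */
static int broker_allows(const char *prefix, const char *path) {
    return strncmp(path, prefix, strlen(prefix)) == 0 &&
           strstr(path, "..") == NULL;
}

/* Send a one-byte status; when fd >= 0, attach it as SCM_RIGHTS. */
static int broker_send_reply(int sock, int fd) {
    char status = fd >= 0 ? 'Y' : 'N';
    struct iovec iov = { .iov_base = &status, .iov_len = 1 };
    union broker_ctl ctl;
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (fd >= 0) {
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof ctl.buf;
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
    }
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Broker side: read one request, decide, reply.
   Returns 1 if granted, 0 if denied, -1 on I/O error. */
static int broker_handle_one(int sock, const char *prefix) {
    char path[4096];
    ssize_t n = recv(sock, path, sizeof path - 1, 0);
    if (n <= 0) return -1;
    path[n] = '\0';
    int fd = broker_allows(prefix, path) ? open(path, O_RDONLY) : -1;
    int rc = broker_send_reply(sock, fd);
    if (fd >= 0) close(fd);  /* the receiver gets its own copy */
    if (rc != 0) return -1;
    return fd >= 0 ? 1 : 0;
}

/* Sandboxed side: send the path we would like opened. */
static int sandbox_request_open(int sock, const char *path) {
    size_t len = strlen(path);
    return send(sock, path, len, 0) == (ssize_t)len ? 0 : -1;
}

/* Sandboxed side: returns the received descriptor, or -1 if denied. */
static int sandbox_receive_fd(int sock) {
    char status = 0;
    struct iovec iov = { .iov_base = &status, .iov_len = 1 };
    union broker_ctl ctl;
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;
    if (recvmsg(sock, &msg, 0) != 1 || status != 'Y') return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

In a real setup the two sides would live in different processes connected by a socketpair(AF_UNIX, SOCK_SEQPACKET, ...) created before the sandbox is engaged, with the broker looping on broker_handle_one(sock, "/public/"). The sandboxed side only needs sendmsg() and recvmsg() on an existing descriptor, which is a much smaller surface than open().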
(If you're interested in that sort of approach, pushed to the extreme, look at Markus Gutschke's original seccomp mode 1 sandbox.)
That's tedious but doable. In comparison, Capsicum would make this a breeze.
There are other issues with such a low-level approach. By filtering system calls, you're breaking the kernel API. This means that third-party code (such as libraries) you include in your address space can break. For this reason, I suggested that Will implement an "exception" mechanism through signals, so that special handlers can be called when system calls are denied. Such handlers are now used and can, for instance, "broker out" system calls such as open().
In my opinion, the Capsicum and Seccomp-BPF approaches are trade-offs, sitting at opposite ends of the spectrum. Having both would be great. We could stack one on top of the other and have the best of both worlds.
In a similar, but very limited, fashion, this is what we have now in Chrome: we stacked the seccomp-bpf sandbox on top of the setuid sandbox. The setuid sandbox gives a few easy to understand semantic properties: no file system access, no process access outside of the sandbox, no network access. It makes it much easier to layer a seccomp-bpf sandbox on top.
Several people besides myself have worked on making this possible. In particular: Chris Evans, Jorge Lucangeli Obes, Markus Gutschke, Adam Langley (and others who made Chrome sandboxable under the setuid sandbox in the first place) and of course, for the actual kernel support, Will Drewry and Kees Cook.
We will continue to work on improving and tightening this new sandbox; this is just a start. Please give it a try, and report any bugs to crbug.com (feel free to cc: jln at chromium.org directly).
PS: to make sure that you have kernel support for seccomp BPF, use Linux 3.5 or Ubuntu 12.04. Check about:sandbox in Chrome 22+ and see if Seccomp-BPF is enabled. Also make sure you're using the 64-bit version of Chrome.
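If you would rather check for kernel support from code, one common probe is sketched below in C. It relies on the documented prctl() error codes: passing a NULL filter fails with EFAULT when seccomp filtering is built in, and with EINVAL when it is not. The function name is made up, and this is not necessarily what about:sandbox does.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Returns 1 if the running kernel accepts SECCOMP_MODE_FILTER, 0 otherwise.
   NULL can never be a valid filter, so nothing is ever installed. */
static int kernel_has_seccomp_bpf(void) {
    errno = 0;
    int rv = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, NULL, 0, 0);
    return rv == -1 && errno == EFAULT;
}
```

A result of 0 can also mean that something else (an outer sandbox or container policy, for instance) is blocking prctl(), so treat it as "not usable here" and not as proof of an old kernel.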