Saturday, June 12, 2021

A few thoughts on Fuchsia security

I want to say a few words about my current adventure. I joined the Fuchsia project at its inception and worked on the daunting task of building and shipping a brand new open-source operating system.

As my colleague Chris noted, pointing to this comparison of a device running a Linux-based OS vs Fuchsia, making Fuchsia invisible was not an easy feat.

Of course, under the hood, a lot is different. We built a brand new message-passing kernel, new connectivity stacks, component model, file-systems, you name it. And yes, there are a few security things I'm excited about.

Message-passing and capabilities

I wrote a few posts on this blog about the sandboxing technologies a few of us were building in Chrome/ChromeOS at the time. A while back, the situation was challenging on Linux to say the least. We had to build a special setuid binary to sandbox Chrome, and seccomp-bpf was essentially created to improve the state of sandboxing on ChromeOS, and Linux generally.

With lots of work, we got to a point where the Chrome renderer sandbox was *very* tight with respect to the rest of the system [initial announcement]. Most of the remaining attack surface was in IPC interfaces, and the remaining available system interfaces were as essential as they could get on Linux.

A hard problem in particular was to make sure that existing code, not written with sandboxing in mind, would "just" work under a very tight sandbox (I'm talking about zero file-system access - chroot-ed into an empty, deleted directory -, different namespaces, a small subset of syscalls available, etc.). One had to allow for "hooking" into some of the system calls that we would deny, so that we could dynamically rewrite them into IPCs (this is why the SIGSYS mechanism of seccomp was built). It was hard, and I dare say, pretty messy.

On Fuchsia, we have solved many of those issues. Sandboxing is trivial. In fact a new process with access to no capabilities can do exceedingly little (also see David Kaplan's exploration). FIDL, our IPC system, is a joy. I often smile when debating designs, because whether or not something is in-process or out-of-process can sometimes feel like a small implementation detail to people.

Verified execution

We will eventually write some good documentation about this. I believe that we have meaningfully expanded on ChromeOS' verified boot design.

The gist is that we store immutable code and data on a content-addressed file-system called BlobFS. You access what you want by specifying its hash (really, the root of a Merkle tree, for fast random access). Then we have an abstraction layer on top, which components can use to access files by name and which, under the hood, can verify signatures for those hashes. File-systems are of course in user-land, can layer nicely, and it's easy to create the right environment for any component.
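To make the content-addressing idea concrete, here is a minimal sketch in Python. It is not BlobFS: the block size, the choice of SHA-256, the way odd nodes are promoted, and the `BlobStore` class are all assumptions made for illustration. It only shows the general shape of "the name of a blob is the root of a Merkle tree over its contents".

```python
import hashlib

BLOCK_SIZE = 8192  # assumed block size, for illustration only


def merkle_root(data: bytes, block_size: int = BLOCK_SIZE) -> str:
    """Return a hex Merkle root computed over fixed-size blocks of data."""
    # Leaf level: one hash per block (an empty blob still gets one leaf).
    blocks = [data[i:i + block_size]
              for i in range(0, len(data), block_size)] or [b""]
    level = [hashlib.sha256(block).digest() for block in blocks]
    # Hash pairs of nodes until a single root remains.
    while len(level) > 1:
        parents = [hashlib.sha256(level[i] + level[i + 1]).digest()
                   for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            parents.append(level[-1])  # promote an unpaired node unchanged
        level = parents
    return level[0].hex()


class BlobStore:
    """Toy content-addressed store: blobs are keyed by their Merkle root."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        root = merkle_root(data)
        self._blobs[root] = data
        return root

    def get(self, root: str) -> bytes:
        data = self._blobs[root]
        # Verify on read: a tampered blob no longer matches its own name.
        if merkle_root(data) != root:
            raise ValueError("blob does not match its hash")
        return data
```

This toy recomputes the whole tree on every read. The point of using a tree rather than a flat hash is that a real implementation can check a single block against the root using only the hashes along its path, which is what makes fast random access possible.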

A key element is that we have made the ability to create executable pages a real permission, without disturbing the loading of BlobFS-backed, signed, dynamic libraries. For any process which doesn't need a JIT, it'll force attackers to ROP/JOP their way to the next stage of their attack.


For system-level folks, Rust is one of the most exciting security developments of the past few decades. It elegantly solves problems which smart people were saying could not be solved. Fuchsia has a lot of code, and we made sure that much of it (millions of LoC) was in Rust.

Our kernel, Zircon, is not in Rust. Not yet anyway. But it is in a nice, lean subset of C++ which I consider a vast improvement over C.


There is much more, which I may get to at some point. And there is a lot more to do. I am optimistic that we have created a sensible security foundation to iterate on. Time will tell. What did we miss? Fuchsia is covered by the Google VRP, so you can get paid by telling us!

Thursday, September 6, 2012

Introducing Chrome's next-generation Linux sandbox

Starting with Chrome 23.0.1255.0, recently released to the Dev Channel, you will see Chrome making use of our next-generation sandbox on Linux and ChromeOS for renderers.

We are using a new facility called Seccomp-BPF, introduced in Linux 3.5 and developed by Will Drewry.

Seccomp-BPF builds on the ability to send small BPF (for BSD Packet Filter) programs that can be interpreted by the kernel. This feature was originally designed for tcpdump, so that filters could directly run in the kernel for performance reasons.

BPF programs are untrusted by the kernel, so they are limited in a number of ways. Most notably, they can't have loops, which bounds their execution time by a monotonic function of their size and allows the kernel to know they will always terminate.

With Seccomp-BPF, BPF programs can now be used to evaluate system call numbers and their parameters.
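As a rough illustration of why "no loops" gives guaranteed termination, here is a toy filter interpreter in Python. This is not the real BPF instruction set or encoding: the `Insn` type, the two opcodes and the example policy are made up for this sketch. The one property it shares with BPF is that jumps can only go forward, so a program of n instructions runs at most n steps. The syscall numbers are the x86-64 ones.

```python
from typing import NamedTuple

ALLOW, KILL = "ALLOW", "KILL"

# x86-64 syscall numbers, used only to make the example readable.
SYS_READ, SYS_WRITE, SYS_VMSPLICE = 0, 1, 278


class Insn(NamedTuple):
    op: str        # "jeq" (compare and skip) or "ret" (return a verdict)
    k: object = 0  # constant to compare against, or the verdict to return
    jt: int = 0    # instructions to skip forward if equal
    jf: int = 0    # instructions to skip forward if not equal


def validate(prog):
    """Reject programs that could loop or run off the end."""
    if not prog or prog[-1].op != "ret":
        raise ValueError("program must end with ret")
    for pc, insn in enumerate(prog):
        if insn.op == "jeq":
            if insn.jt < 0 or insn.jf < 0:
                raise ValueError("backward jump")
            if pc + 1 + max(insn.jt, insn.jf) >= len(prog):
                raise ValueError("jump out of bounds")
        elif insn.op != "ret":
            raise ValueError("unknown op")


def run(prog, syscall_nr):
    """Evaluate a validated program against one system call number."""
    validate(prog)
    pc = 0
    while True:
        insn = prog[pc]
        if insn.op == "ret":
            return insn.k
        # pc strictly increases, so this loop cannot run forever.
        pc += 1 + (insn.jt if syscall_nr == insn.k else insn.jf)


# Example policy: allow read() and write(), kill on anything else.
POLICY = [
    Insn("jeq", SYS_READ, jt=2),
    Insn("jeq", SYS_WRITE, jt=1),
    Insn("ret", KILL),
    Insn("ret", ALLOW),
]
```

With this policy, `run(POLICY, SYS_READ)` yields `ALLOW` and `run(POLICY, SYS_VMSPLICE)` yields `KILL`. A real seccomp-BPF filter works on a structure that also carries the architecture and the system call arguments, which this sketch leaves out.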

This is a huge change for sandboxing code in Linux, which, as you may recall, has been very limited in this area. It's also a change that recognizes and innovates in two important dimensions of sandboxing:

  • Mandatory access control versus "discretionary privilege dropping". Something I always felt strongly about and have discussed before.
  • Access control semantics, versus attack surface reduction.
Let's talk about the second topic. Having a nice, high level, access control semantics is appealing and, one may argue, necessary. When you're designing a sandbox for your application, you may want to say things such as:
  • I want this process to have access to this subset of the file system.
  • I want this process to be able to allocate or de-allocate memory.
  • I want this process to be able to interfere (debug, send signals) with this set of processes.
The capabilities-oriented framework Capsicum takes such an approach. This is very useful.

However, with such an approach it's difficult to assess the kernel's attack surface. When the whole kernel is in your trusted computing base "you're going to have a bad time", as a colleague recently put it.

Now, in that same dimension, at the other end of the spectrum, is the "attack surface reduction" oriented approach. The approach where you're close to the ugly guts of implementation details, the one taken by Seccomp-BPF.

In that approach, read()+write() and vmsplice() are completely different beasts, because you're not looking at their semantics, but at the attack surface they open in the kernel. They perform similar things, but perhaps ihaquer will have a harder time exploiting read()/write() on pipes than vmsplice(). Semantically, uselib() seems to be a subset of open() + mmap(), but similarly, the attack surface is different.

The drawback of course is that implementing particular sandbox semantics with such a mechanism looks ugly. For instance, let's say you want to allow opening any file in /public from within the sandbox, how would you implement that in seccomp-BPF?

Well, first you need to understand what set of system calls would be concerned by such an operation. That's not just open(), but also openat() (an ugly implementation-level detail: some libc implementations will happily use openat() with AT_FDCWD instead of open()). Then you realize that a BPF program in the kernel will only see a pointer to the file name, so you can't filter on that (even if you could dereference pointers in BPF programs, it wouldn't be safe to do so, because an attacker could create another thread that would modify the file name after it was evaluated by the BPF program, so the kernel would also need to copy it in a safe location).

In the end, what you need to do is have a trusted helper process (or broker) that runs unsandboxed for this particular set of system calls and have it accept requests to open files over an IPC channel, have it make the security decision and send the file descriptor back over an IPC.
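The broker pattern can be sketched in a few lines of Python (3.9+, Unix only). This is an illustration of the idea, not Chrome's broker: the wire protocol, the function names and the "only files under one directory" policy are assumptions of this sketch. The broker makes the security decision and passes the file descriptor back over a Unix domain socket.

```python
import os
import socket
import threading


def broker(sock, allowed_dir):
    """Unsandboxed side: open files for the client, only under allowed_dir."""
    root = os.path.realpath(allowed_dir)
    while True:
        request = sock.recv(4096)
        if not request:
            break  # the client closed the channel
        # Resolve symlinks and ".." before making the security decision.
        # (A robust broker must also deal with the path changing between
        # this check and the open() below; this sketch does not.)
        path = os.path.realpath(request.decode())
        if os.path.commonpath([root, path]) != root:
            sock.send(b"DENIED")
            continue
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            sock.send(b"ERROR")
            continue
        socket.send_fds(sock, [b"OK"], [fd])
        os.close(fd)  # the receiver now holds its own copy


def brokered_open(sock, path):
    """Sandboxed side: ask the broker for a descriptor instead of open()."""
    sock.send(path.encode())
    msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    if msg != b"OK":
        raise PermissionError(msg.decode())
    return fds[0]


def start_broker(allowed_dir):
    """Start the broker in a thread and return the client end of the channel."""
    client, server = socket.socketpair()
    threading.Thread(target=broker, args=(server, allowed_dir),
                     daemon=True).start()
    return client
```

In a real sandbox the broker would be a separate process and the client would have no file system access of its own. A thread is used here only to keep the example short and self-contained.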

(If you're interested in that sort of approach, pushed to the extreme, look at Markus Gutschke's original seccomp mode 1 sandbox.)

That's tedious but doable. In comparison, Capsicum would make this a breeze.

There are other issues with such a low-level approach. By filtering system calls, you're breaking the kernel API. This means that third party code (such as libraries) you include in your address space can break. For this reason, I suggested to Will to implement an "exception" mechanism through signals, so that special handlers can be called when system calls are denied. Such handlers are now used and can for instance "broker out" system calls such as open().

In my opinion, the Capsicum and Seccomp-BPF approaches are trade-offs, sitting at opposite ends of the spectrum. Having both would be great. We could stack one on top of the other and have the best of both worlds.

In a similar, but very limited, fashion, this is what we have now in Chrome: we stacked the seccomp-bpf sandbox on top of the setuid sandbox. The setuid sandbox gives a few easy to understand semantic properties: no file system access, no process access outside of the sandbox, no network access. It makes it much easier to layer a seccomp-bpf sandbox on top.

Several people besides myself have worked on making this possible. In particular: Chris Evans, Jorge Lucangeli Obes, Markus Gutschke, Adam Langley (and others who made Chrome sandboxable under the setuid sandbox in the first place) and of course, for the actual kernel support, Will Drewry and Kees Cook.

We will continue to work on improving and tightening this new sandbox; this is just a start. Please give it a try, and report any bugs to (feel free to cc: jln at directly).

PS: to make sure that you have kernel support for seccomp BPF, use Linux 3.5 or Ubuntu 12.04. Check about:sandbox in Chrome 22+ and see if Seccomp-BPF is enabled. Also make sure you're using the 64-bit version of Chrome.

Friday, April 9, 2010


EDIT: Following its full disclosure, Sun fixed Tavis' Java deployment toolkit bug (CVE-2010-0886 and CVE-2010-0887) in a matter of days, wow! No doubt this will be used in the future as an argument for full disclosure.
However, this does not bring much security! An attacker can still automatically downgrade your version of Java (using installJRE) and exploit this bug or any other he likes!

Almost one year ago, I blogged about one of my favorite security bugs, found by Sami Koivu.

More specifically, I blogged about a class of Java bugs exposed by Sami Koivu and I mentioned this was the first instance of it.

Not only was it interesting from a technical perspective, but also high impact, allowing perfectly reliable (and relatively simple) cross platform exploitation of any system supporting Java applets (and that's a lot of systems). And this, through a widely deployed, but notoriously poorly updated component.

One year later [1], Sami strikes again. This time should be the final nail in Java applets' coffin for anyone with security expectations:
  • Another instance of the privileged deserialization class of bugs (CVE-2010-0094)
  • A new class of bugs: Java trusted method chaining. With one instance as a free sample (CVE-2010-0840). (This one is beautiful by the way, be sure to read!)
  • Free goodies for web security researchers: a flaw that completely breaks the web security model. The "Java-SOP" security was done in the compiler, not the runtime (CVE-2010-0095). Normally this would translate to "really bad", but why would one need your cookies when one can have your computer?
But Tavis would not let Sami have his party alone and between two kernel bugs took a quick look at the Java deployment toolkit and found this embarrassingly trivially exploitable issue. It's not corrected yet. And it's exploitable even if you have Java disabled in IE or Firefox, you only need to have Java installed.

It's so simple that it was obvious that many people had found (and were exploiting) this one. And we've already had confirmation of this, which led Tavis to release his advisory with mitigation instructions before a patch was available. Read his advisory for interesting thoughts on disclosure.

So, dear reader, if you don't want to get owned multiple times:
  • Disable Java in your web browsers
  • Uninstall Java completely or follow Tavis' mitigation instructions on Windows
Updating Java does not work: Sami has already mentioned that he would be very surprised if there weren't 10 other cases of "Java trusted method chaining" bugs. There are probably other deserialization ones too.
And anyway, a lazy attacker can just silently downgrade his up-to-date target to whatever vulnerable Java version he wants to exploit, using the aforementioned Java deployment toolkit. Really, it's a feature.

Moreover, not everyone can update Java. Let's see how long it takes for Apple to patch these ones this time. My bet is that up-to-date default MacOS X installations are going to be vulnerable for a while to even the publicly reported bugs.

This is Javocalypse.

[1] well technically, only a few months later, but it took 5 months before the public advisory. A delay that I would call "reasonable".

Sunday, March 28, 2010

There's a party at Ring0, and you're invited

Tavis and I have just come back from CanSecWest. The title of our talk was "There's a party at Ring0, and you're invited".

We went through some of the bugs that we have worked on this past year and mentioned some of our thoughts on kernel security in general:

  • We see an increasing attack surface, both locally and remotely (@font-face, webgl...)
  • The recent focus on sandboxes (Chrome, Office) makes the kernel an even more interesting target
  • Modern operating systems still generally lack facilities for discretionary privilege dropping or to reduce the kernel's attack surface (with the notable exception of SECCOMP on Linux)
  • While most operating systems have some degree of userland memory corruption exploitation prevention, kernel exploitation prevention is immature. On Linux, PaX/grsecurity leads the effort and Microsoft added safe unlinking in the Windows 7 kernel.
If you're interested, you can download our slides here.

Thursday, January 21, 2010

CVE-2010-0232: Microsoft Windows NT #GP Trap Handler Allows Users to Switch Kernel Stack

Two days ago, Tavis Ormandy published one of the most interesting vulnerabilities I've seen so far.

It's one of those rare, but fascinating design-level errors dealing with low-level system internals. Its exploitation requires skills and ingenuity.

The vulnerability lies in Windows' support for Intel's hardware 8086 emulation (virtual-8086, or VM86) and is believed to have been there since Windows NT 3.1 (1993!), making it 17 years old.

It uses two tricks that we have already published on this blog before, the #GP on pre-commit handling failure and the forging of cs:eip in VM86 mode.

This was intended to be mentioned in our talk at PacSec about virtualization this past November, but Tavis had agreed with Microsoft to postpone the release of this advisory.

Tavis was kind enough to write a blog post about it, you can read it below:

From Tavis Ormandy:

I've just published one of the most interesting bugs I've ever encountered, a simple authentication check in Windows NT that can incorrectly let users take control of the system. The bug exists in code hidden deep enough inside the kernel that it's gone unnoticed for as long as NT has existed.

If you've ever tried to run an MS-DOS or Win16 application on a modern NT machine, the chances are it worked. This is an impressive feat, these applications were written for a completely different execution environment and operating system, and yet still work today and run at almost native speed.

The secret that makes this possible behind the scenes is Virtual-8086 mode. Virtual-8086 mode is a hardware emulation facility built into all x86 processors since the i386, and allows modern operating systems to run 16-bit programs designed for real mode with very little overhead. These 16-bit programs run in a simulated real mode environment within a regular protected mode task, allowing them to co-exist in a modern multitasking environment.

Support for Virtual-8086 mode requires a monitor, the collective name for the software that handles any requests the program makes. These requests range from handling sensitive instructions to mapping low-level services onto system calls and are implemented partially in kernel mode and partially in user mode.

In Windows NT, the user mode component is called the NTVDM subsystem, and it interacts with the kernel via a native system service called NtVdmControl. NtVdmControl is unusual because it's authenticated, only authorised programs are permitted to access it, which is enforced using a special process flag called VdmAllowed which the kernel verifies is present before NtVdmControl will perform any action; if you don't have this flag, the kernel will always return STATUS_ACCESS_DENIED.

The bug we're talking about today involves how BIOS service calls are handled, which are a low level way of interacting with the system that's needed to support real-mode programs. The kernel implements BIOS service calls in two stages, the second stage begins when the interrupt handler for general protection faults (often shortened to #GP in technical documents) detects that the system has completed the first stage.

The details of how BIOS service calls are implemented are unimportant, what is important is that the two stages must be perfectly synchronised, if the kernel transitions to the second stage incorrectly, a hostile user can take advantage of this confusion to take control of the kernel and compromise the system. In theory, this shouldn't be a problem, Microsoft implemented a check that verifies that the trap occurred at a magic address (actually, a cs:eip pair) that unprivileged users can't reach.

The check seems reasonable at first, the hardware guarantees that unprivileged code can't arbitrarily make itself more privileged without a special request, and even if it could, only authorised programs are permitted to use NtVdmControl() anyway.

Unfortunately, it turns out these assumptions were wrong. The problem I noticed was that although unprivileged code cannot make itself more privileged arbitrarily, Virtual-8086 mode makes testing the privilege level of code more difficult because the segment registers lose their special meaning. This is because in protected mode, the segment registers (particularly ss and cs) can be used to test privilege level, whereas in Virtual-8086 mode they're used to create far pointers, which allow 16-bit programs to access the 20-bit real address space.
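The two meanings of a segment register can be shown with a little arithmetic. The Python below is an illustrative sketch, not the Windows check: the function names and the sample values are assumptions. What it relies on is architectural: in protected mode the low two bits of a selector carry a privilege level, in Virtual-8086 mode the same 16 bits are just a paragraph number, and the VM flag is bit 17 of EFLAGS.

```python
EFLAGS_VM = 1 << 17  # Virtual-8086 mode flag in EFLAGS


def far_pointer_to_linear(segment: int, offset: int) -> int:
    """Real/VM86 mode: a far pointer seg:off refers to seg * 16 + off."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def selector_rpl(selector: int) -> int:
    """Protected mode: the low two bits of a selector are a privilege level."""
    return selector & 0x3


def naive_is_kernel(cs: int) -> bool:
    """Looks only at cs, which is meaningless for a VM86 task."""
    return selector_rpl(cs) == 0


def is_kernel(cs: int, eflags: int) -> bool:
    """Also checks the VM flag, so VM86 tasks are never seen as privileged."""
    return not (eflags & EFLAGS_VM) and selector_rpl(cs) == 0


# A VM86 task can load any 16-bit value into cs, for example 0x0008.
# As a far pointer it simply means "paragraph 8", yet its low two bits
# are zero, so a test that ignores the VM flag mistakes it for ring 0.
assert far_pointer_to_linear(0x0008, 0x0000) == 0x80
assert naive_is_kernel(0x0008)
assert not is_kernel(0x0008, EFLAGS_VM)
```

Any privilege test that reads a saved cs therefore has to know whether the saved context belonged to a Virtual-8086 task.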

However, I still couldn't abuse this fact because NtVdmControl() can only be accessed by authorised programs, and there's no other way to request pathological operation on Virtual-8086 mode tasks. I was able to solve this problem by invoking the real NTVDM subsystem, and then loading my own code inside it using a combination of CreateRemoteThread(), VirtualAllocEx() and WriteProcessMemory().

Finally, I needed to find a way to force the kernel to transition to the vulnerable code while my process appeared to be privileged. My solution to this was to make the kernel fault when returning to user mode from kernel mode, thus creating the appearance of a legitimate trap for the fabricated execution context that I had installed. These steps all fit together perfectly, and can be used to convince the kernel to execute my code, giving me complete control of the system.


Could Microsoft have avoided this issue? It's difficult to imagine how. Errors like this will generally elude fuzz testing (in order to observe any problem, a fuzzer would need to guess a 46-bit magic number, as well as set up an intricate process state, not to mention the VdmAllowed flag), and any static analysis would need an incredibly accurate model of the Intel architecture.

The code itself was probably resistant to manual audit, it's remained fairly static throughout the history of NT, and is likely considered forgotten lore even inside Microsoft. In cases like this, security researchers are sometimes in a better position than those with the benefit of documentation and source code, all abstraction is stripped away and we can study what remains without being tainted by how documentation claims something is supposed to work.

If you want to mitigate future problems like this, reducing attack surface is always the key to security. In this particular case, you can use group policy to disable support for Application Compatibility (see the Application Compatibility policy template) which will prevent unprivileged users from accessing NtVdmControl(), certainly a wise move if your users don't need MS-DOS or Windows 3.1 applications.

Saturday, November 28, 2009

Virtualization security and the Intel privilege model

Earlier this month, Tavis and I spoke at PacSec 2009 in Tokyo about virtualisation security on Intel architectures, with a focus on CPU virtualisation.

During this talk, we briefly explained various techniques used for CPU virtualisation such as dynamic translation (QEmu), VMware-style binary translation or paravirtualisation (Xen) and we went through bugs found by us and others:

- We released some details about MS09-33 (CVE-2009-1542), a bug we found in Virtual PC's instruction decoding
- We mentioned two of the awesome bugs found by Derek Soeder in VMware, CVE-2008-4915 and CVE-2008-4279
- We explained and demo-ed the exploitation of the mishandled exception on page fault bug in VMware that I previously blogged about.
- We released information on CVE-2009-3827, a bug we discovered in Virtual PC's hardware virtualisation.
A funny fact is that the exact same bug was independently uncovered and corrected in KVM later by Avi Kivity (CVE-2009-3722). The reason may be that Intel's documentation is not perfectly clear about the differences between MOV_DR and MOV_CR events in hardware virtualisation.
This bug has already been addressed by Microsoft in Windows 7 and will get corrected in the next service pack for Virtual PC and Virtual Server.

If you are interested, you can download the slides here.

Friday, October 30, 2009

CVE-2009-2267: Mishandled exception on page fault in VMware

Tavis Ormandy and I have recently released an advisory for CVE-2009-2267.

This is a vulnerability in VMware's virtual CPU which can lead to privilege escalation in a guest. All VMware virtualisation products were affected, including in hardware virtualisation mode.

In a VMware guest, in the general case, unprivileged (Ring 3) code runs without VMM intervention until an exception or interrupt occurs. An exception to this is Virtual-8086 mode (VM86) where VMware will perform CPU emulation.

When VMware was emulating a far call instruction in VM86 mode, it was using supervisory access to push the CS and IP registers. Because of this, if this operation raised a Page Fault (#PF) exception, the resulting exception code would be invalid and would have its user/supervisor flag incorrectly set.
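For readers who don't have the #PF error code layout in their head, here is a small Python sketch of its low three bits on x86. The bit layout is architectural. The two sample values are illustrative assumptions based on the description above, not values captured from VMware.

```python
PF_PRESENT = 1 << 0  # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1    # 0: read access, 1: write access
PF_USER = 1 << 2     # 0: supervisor-mode access, 1: user-mode access


def decode_page_fault(error_code: int) -> dict:
    """Decode the low three bits of an x86 page fault error code."""
    return {
        "protection_violation": bool(error_code & PF_PRESENT),
        "write": bool(error_code & PF_WRITE),
        "user": bool(error_code & PF_USER),
    }


# A far call pushes CS and IP on the stack, so the faulting access is a
# write. Coming from VM86 code (which runs at CPL 3), a fault on a page
# that is not present should be reported as a user-mode write.
EXPECTED = PF_WRITE | PF_USER

# If the emulator performs the push with supervisor access, the same
# fault is reported without the user bit.
BUGGY = PF_WRITE

assert decode_page_fault(EXPECTED)["user"] is True
assert decode_page_fault(BUGGY)["user"] is False
```

A guest kernel that uses this bit to decide whether a fault came from its own code or from user space will take the wrong path when it receives the second value.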

This can be used to confuse a Guest kernel. Moreover, VM86 mode can be used to further confuse the guest kernel because it allows an attacker to load an arbitrary value in the code segment (CS) register.

We wrote a reliable proof of concept to elevate privileges on Linux guests. It turned out to be very easy because of the PNP BIOS recovery code.

For further details, check our advisory, VMware's advisory and the non-weaponized PoC (vmware86.c, vmware86.tar.gz), including Tavis' cool CODE32 macro.

Note that VMware silently patches their products until all of them are updated and then releases an advisory. If you updated VMware Workstation a few months ago, you were already protected against this vulnerability.

In theory, VMware's Virtual CPU flaws could be treated like Intel or AMD errata and worked around in operating systems. In practice, since VMware's software can be updated, this is unlikely to happen. Moreover, VMware doesn't release full details that could be used to produce workarounds.

If you like virtual CPU vulnerabilities, I suggest that you have a look at Derek Soeder's awesome advisory from last year.