Kernel Planet

October 22, 2016

Matthew Garrett: Microsoft aren't forcing Lenovo to block free operating systems

Update: Patches to fix this have been posted

There's a story going round that Lenovo have signed an agreement with Microsoft that prevents installing free operating systems. This is sensationalist, untrue and distracts from a genuine problem.

The background is straightforward. Intel platforms allow the storage to be configured in two different ways - "standard" (normal AHCI on SATA systems, normal NVMe on NVMe systems) or "RAID". "RAID" mode is typically just changing the PCI IDs so that the normal drivers won't bind, ensuring that drivers that support the software RAID mode are used. Intel have not submitted any patches to Linux to support the "RAID" mode.

In this specific case, Lenovo's firmware defaults to "RAID" mode and doesn't allow you to change that. Since Linux has no support for the hardware when configured this way, you can't install Linux (distribution installers will boot, but won't find any storage device to install the OS to).

Why would Lenovo do this? I don't know for sure, but it's potentially related to something I've written about before - recent Intel hardware needs special setup for good power management. The storage driver that Microsoft ship doesn't do that setup. The Intel-provided driver does. "RAID" mode prevents the Microsoft driver from binding and forces the user to use the Intel driver, which means they get the correct power management configuration, battery life is better and the machine doesn't melt.

(Why not offer the option to disable it? A user who does would end up with a machine that doesn't boot, and if they managed to figure that out they'd have worse power management. That increases support costs. For a consumer device, why would you want to? The number of people buying these laptops to run anything other than Windows is minuscule.)

Things are somewhat obfuscated due to a statement from a Lenovo rep: "This system has a Signature Edition of Windows 10 Home installed. It is locked per our agreement with Microsoft." It's unclear what this is meant to mean. Microsoft could be insisting that Signature Edition systems ship in "RAID" mode in order to ensure that users get a good power management experience. Or it could be a misunderstanding regarding UEFI Secure Boot - Microsoft do require that Secure Boot be enabled on all Windows 10 systems, but (a) the user must be able to manage the key database and (b) there are several free operating systems that support UEFI Secure Boot and have appropriate signatures. Neither interpretation indicates that there's a deliberate attempt to prevent users from installing their choice of operating system.

The real problem here is that Intel do very little to ensure that free operating systems work well on their consumer hardware - we still have no information from Intel on how to configure systems to ensure good power management, we have no support for storage devices in "RAID" mode and we have no indication that this is going to get better in future. If Intel had provided that support, this issue would never have occurred. Rather than be angry at Lenovo, let's put pressure on Intel to provide support for their hardware.


October 22, 2016 05:51 AM

Matthew Garrett: Fixing the IoT isn't going to be easy

A large part of the internet became inaccessible today after a botnet made up of IP cameras and digital video recorders was used to DoS a major DNS provider. This highlighted a bunch of things including how maybe having all your DNS handled by a single provider is not the best of plans, but in the long run there's no real amount of diversification that can fix this - malicious actors have control of a sufficiently large number of hosts that they could easily take out multiple providers simultaneously.

To fix this properly we need to get rid of the compromised systems. The question is how. Many of these devices are sold by resellers who have no resources to handle any kind of recall. The manufacturer may not have any kind of legal presence in many of the countries where their products are sold. There's no way anybody can compel a recall, and even if they could it probably wouldn't help. If I've paid a contractor to install a security camera in my office, and if I get a notification that my camera is being used to take down Twitter, what do I do? Pay someone to come and take the camera down again, wait for a fixed one and pay to get that put up? That's probably not going to happen. As long as the device carries on working, many users are going to ignore any voluntary request.

We're left with more aggressive remedies. If ISPs threaten to cut off customers who host compromised devices, we might get somewhere. But, inevitably, a number of small businesses and unskilled users will get cut off. Probably a large number. The economic damage is still going to be significant. And it doesn't necessarily help that much - if the US were to compel ISPs to do this, but nobody else did, public outcry would be massive, the botnet would not be much smaller and the attacks would continue. Do we start cutting off countries that fail to police their internet?

Ok, so maybe we just chalk this one up as a loss and have everyone build out enough infrastructure that we're able to withstand attacks from this botnet and take steps to ensure that nobody is ever able to build a bigger one. To do that, we'd need to ensure that all IoT devices are secure, all the time. So, uh, how do we do that?

These devices had trivial vulnerabilities in the form of hardcoded passwords and open telnet. It wouldn't take terribly strong skills to identify this at import time and block a shipment, so the "obvious" answer is to set up forces in customs who do a security analysis of each device. We'll ignore the fact that this would be a pretty huge set of people to keep up with the sheer quantity of crap being developed and skip straight to the explanation for why this wouldn't work.

Yeah, sure, this vulnerability was obvious. But what about the product from a well-known vendor that included a debug app listening on a high numbered UDP port that accepted a packet of the form "BackdoorPacketCmdLine_Req" and then executed the rest of the payload as root? A portscan's not going to show that up[1]. Finding this kind of thing involves pulling the device apart, dumping the firmware and reverse engineering the binaries. It typically takes me about a day to do that. Amazon has over 30,000 listings that match "IP camera" right now, so you're going to need 99 more of me and a year just to examine the cameras. And that's assuming nobody ships any new ones.
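The staffing estimate above is simple arithmetic, sketched here in Python. The inputs are the figures quoted in the paragraph (30,000 listings, about a day per device, the author plus 99 others); nothing here is measured data.

```python
# Figures taken from the paragraph above; they are rough estimates, not measurements.
listings = 30_000        # Amazon listings matching "IP camera"
days_per_device = 1      # rough reverse-engineering time per device
analysts = 100           # the author plus 99 more

total_days = listings * days_per_device
days_per_analyst = total_days / analysts

print(f"{total_days} analyst-days in total, {days_per_analyst:.0f} days per analyst")
```

At 300 days each, that is roughly a year of work per analyst once weekends are left out.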

Even that's insufficient. Ok, with luck we've identified all the cases where the vendor has left an explicit backdoor in the code[2]. But these devices are still running software that's going to be full of bugs and which is almost certainly still vulnerable to at least half a dozen buffer overflows[3]. Who's going to audit that? All it takes is one attacker to find one flaw in one popular device line, and that's another botnet built.

If we can't stop the vulnerabilities getting into people's homes in the first place, can we at least fix them afterwards? From an economic perspective, demanding that vendors ship security updates whenever a vulnerability is discovered no matter how old the device is is just not going to work. Many of these vendors are small enough that it'd be more cost effective for them to simply fold the company and reopen under a new name than it would be to put the engineering work into fixing a decade old codebase. And how does this actually help? So far the attackers building these networks haven't been terribly competent. The first thing a competent attacker would do would be to silently disable the firmware update mechanism.

We can't easily fix the already broken devices, we can't easily stop more broken devices from being shipped and we can't easily guarantee that we can fix future devices that end up broken. The only solution I see working at all is to require ISPs to cut people off, and that's going to involve a great deal of pain. The harsh reality is that this is almost certainly just the tip of the iceberg, and things are going to get much worse before they get any better.

Right. I'm off to portscan another smart socket.

[1] UDP connection refused messages are typically ratelimited to one per second, so it'll take almost a day to do a full UDP portscan, and even then you have no idea what the service actually does.

[2] It's worth noting that this is usually leftover test or debug code, not an overtly malicious act. Vendors should have processes in place to ensure that this isn't left in release builds, but ha well.

[3] My vacuum cleaner crashes if I send certain malformed HTTP requests to the local API endpoint, which isn't a good sign
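The "almost a day" estimate in footnote [1] can be checked with quick arithmetic. This sketch assumes a full scan of all 65,535 UDP ports and exactly one useful reply per second, which is the rate limit the footnote describes.

```python
# Assumption: the scanner learns about one closed port per second,
# because "connection refused" replies are rate limited to one per second.
ports = 65_535
probes_per_second = 1

scan_seconds = ports / probes_per_second
scan_hours = scan_seconds / 3600

print(f"{scan_hours:.1f} hours for a full UDP scan")
```

That works out to a little over 18 hours per device, before any work on figuring out what the discovered services do.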


October 22, 2016 05:14 AM

October 20, 2016

Kees Cook: CVE-2016-5195

My prior post showed my research from earlier in the year at the 2016 Linux Security Summit on kernel security flaw lifetimes. Now that CVE-2016-5195 is public, here are updated graphs and statistics. Due to their rarity, the Critical bug average has now jumped from 3.3 years to 5.2 years. There aren’t many, but, as I mentioned, they still exist, whether you know about them or not. CVE-2016-5195 was sitting on everyone’s machine when I gave my LSS talk, and there are still other flaws on all our Linux machines right now. (And, I should note, this problem is not unique to Linux.) Dealing with knowing that there are always going to be bugs present requires proactive kernel self-protection (to minimize the effects of possible flaws) and vendors dedicated to updating their devices regularly and quickly (to keep the exposure window minimized once a flaw is widely known).

So, here are the graphs updated for the 668 CVEs known today:

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 20, 2016 11:02 PM

October 19, 2016

Kees Cook: Security bug lifetime

In several of my recent presentations, I’ve discussed the lifetime of security flaws in the Linux kernel. Jon Corbet did an analysis in 2010, and found that security bugs appeared to have roughly a 5 year lifetime. As in, the flaw gets introduced in a Linux release, and then goes unnoticed by upstream developers until another release 5 years later, on average. I updated this research for 2011 through 2016, and used the Ubuntu Security Team’s CVE Tracker to assist in the process. The Ubuntu kernel team already does the hard work of trying to identify when flaws were introduced in the kernel, so I didn’t have to re-do this for the 557 kernel CVEs since 2011.

As the README details, the raw CVE data is spread across the active/, retired/, and ignored/ directories. By scanning through the CVE files to find any that contain the line “Patches_linux:”, I can extract the details on when a flaw was introduced and when it was fixed. For example CVE-2016-0728 shows:

 break-fix: 3a50597de8635cd05133bd12c95681c82fe7b878 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2

This means that CVE-2016-0728 is believed to have been introduced by commit 3a50597de8635cd05133bd12c95681c82fe7b878 and fixed by commit 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2. If there are multiple lines, then there may be multiple SHAs identified as contributing to the flaw or the fix. And a “-” is just short-hand for the start of Linux git history.
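As a rough illustration of the extraction step, here is a minimal Python sketch that pulls (introduced, fixed) pairs out of text containing break-fix lines. The function name `parse_break_fix` is hypothetical, and the sketch only handles the simple two-field form shown above; the real tracker files may use forms it does not cover.

```python
from typing import List, Tuple

def parse_break_fix(text: str) -> List[Tuple[str, str]]:
    """Return (introduced, fixed) pairs from 'break-fix:' lines.

    Hypothetical helper: only the two-field form shown in the post is handled.
    A "-" in the first field stands for the start of Linux git history.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("break-fix:"):
            continue
        fields = line[len("break-fix:"):].split()
        if len(fields) != 2:
            continue  # skip anything that is not "<introduced> <fixed>"
        pairs.append((fields[0], fields[1]))
    return pairs

sample = """Patches_linux:
 break-fix: 3a50597de8635cd05133bd12c95681c82fe7b878 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2
"""
print(parse_break_fix(sample))
```

Returning a list keeps the multiple-line case simple: each additional break-fix line just adds another pair.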

Then for each SHA, I queried git to find its corresponding release, and made a mapping of release version to release date, wrote out the raw data, and rendered graphs. Each vertical line shows a given CVE from when it was introduced to when it was fixed. Red is “Critical”, orange is “High”, blue is “Medium”, and black is “Low”:

CVE lifetimes 2011-2016

And here it is zoomed in to just Critical and High:

Critical and High CVE lifetimes 2011-2016

The line in the middle is the date from which I started the CVE search (2011). The vertical axis is actually linear time, but it’s labeled with kernel releases (which are pretty regular). The numerical summary is:

This comes out to roughly 5 years lifetime again, so not much has changed from Jon’s 2010 analysis.
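The lifetime of a single flaw is just the gap between the release that introduced it and the release that fixed it. The sketch below assumes the SHA-to-release mapping has already been done (for example with something like `git describe --contains`, which is a guess at the tooling and not confirmed by the post). The release labels and dates are made up for illustration.

```python
from datetime import date

# Hypothetical release labels and dates, for illustration only.
release_dates = {
    "release-A": date(2011, 1, 1),
    "release-B": date(2016, 1, 1),
}

def lifetime_years(introduced: str, fixed: str) -> float:
    """Years between the release that introduced a flaw and the one that fixed it."""
    delta = release_dates[fixed] - release_dates[introduced]
    return delta.days / 365.25

print(round(lifetime_years("release-A", "release-B"), 2))
```

Averaging this value over all CVEs of a given severity would give the kind of per-severity summary described above.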

While we’re getting better at fixing bugs, we’re also adding more bugs. And for many devices that have been built on a given kernel version, there haven’t been frequent (or sometimes any) security updates, so the bug lifetime for those devices is even longer. To really create a safe kernel, we need to get proactive about self-protection technologies. The systems using a Linux kernel are right now running with security flaws. Those flaws are just not known to the developers yet, but they’re likely known to attackers, as there have been prior boasts/gray-market advertisements for at least CVE-2010-3081 and CVE-2013-2888.

(Edit: see my updated graphs that include CVE-2016-5195.)

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 19, 2016 04:46 AM

October 18, 2016

LPC 2016: Hotel Blocks now Expired

All our block bookings at the hotels have now expired.  However, if you still haven’t booked a hotel, it may still be possible for the Linux Foundation to get you a room at one of them at the conference rate (availability permitting).  Please email if you are interested in this option.

October 18, 2016 08:06 PM

Gustavo F. Padovan: Mainline Explicit Fencing – part 2

In the first post we covered the main concepts behind Explicit Synchronization for the Linux Kernel. Now in the second post of the series we are going to look at the Android Sync Framework, the first (out-of-tree) Explicit Fencing implementation for the Linux Kernel.

The Sync Framework was the Android solution to implement Explicit Fencing in AOSP. It uses file descriptors to communicate fencing information between userspace and kernel and between userspace processes.

In the Sync Framework it all starts with the creation of a Sync Timeline, a struct created for each driver context to represent a monotonically increasing counter. It is the Sync Timeline that guarantees the ordering between fences in the same Timeline. The driver contexts could be different GPU rings, or different Displays on your hardware.

Sync Timeline

Then we have Sync Points (sync_pt), the name Android gave to fences. They represent a specific value in the Sync Timeline. When created, the Sync Point is initialized in the Active state, and when it signals, i.e., the job it was associated with finishes, it transitions to the Signaled state and informs the Sync Timeline to update the value of the last signaled Sync Point.

Sync Point
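The relationship between a Sync Timeline and its Sync Points can be modelled in a few lines of Python. This is a toy model of the concepts described above, not the kernel's data structures; the class and attribute names are invented for the sketch.

```python
class SyncTimeline:
    """Toy model: one monotonically increasing counter per driver context."""

    def __init__(self, name):
        self.name = name
        self.last_signaled = 0   # value of the last signaled Sync Point
        self._next_value = 0     # value handed to the next Sync Point

    def create_point(self):
        self._next_value += 1
        return SyncPoint(self, self._next_value)


class SyncPoint:
    """Toy model of a fence: a specific value on a timeline."""

    def __init__(self, timeline, value):
        self.timeline = timeline
        self.value = value
        self.state = "active"    # every point starts in the Active state

    def signal(self):
        # Called when the job tied to this point finishes.
        self.state = "signaled"
        self.timeline.last_signaled = max(self.timeline.last_signaled, self.value)


ring = SyncTimeline("gpu-ring-0")   # hypothetical driver context
first, second = ring.create_point(), ring.create_point()
first.signal()
print(first.state, second.state, ring.last_signaled)
```

Because values only ever increase, comparing a point's value with `last_signaled` is enough to order fences on the same timeline.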

To export and import Sync Points to/from userspace the Sync Fence struct is used. Under the hood the Sync Fence is a Linux file, and we use the Sync Fence to store Sync Point information. To export it to userspace, an unused file descriptor (fd) is associated with the Sync Fence file. Drivers can then use the file descriptor to pass the Sync Point information around.

Sync Fence

The Sync Fence is usually created just after the Sync Point creation. It then travels through the pipeline, via userspace, until it reaches the driver that is going to wait for the Sync Fence to signal. The Sync Fence signals when the Sync Point inside it signals.

One of the most important features of the Android Sync Framework is the ability to merge Sync Fences into a new Sync Fence containing all Sync Points from both Sync Fences. It can contain as many Sync Points as your resources allow. A merged Sync Fence will only signal when all its Sync Points signal.

Sync Fence with Merged Fences. Here we merge two Sync Points into one Sync File.
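The merge rule (a merged fence signals only once every contained point has signaled) can be sketched as follows. This is a simplified Python model with invented names; it ignores file descriptors entirely and simply concatenates the points of both fences.

```python
class SyncPoint:
    """Toy fence point: just a label and a signaled flag."""

    def __init__(self, label):
        self.label = label
        self.signaled = False


class SyncFence:
    """Toy container of Sync Points, standing in for the file-backed struct."""

    def __init__(self, points):
        self.points = list(points)

    @staticmethod
    def merge(a, b):
        # Simplification: the new fence holds every point from both inputs.
        return SyncFence(a.points + b.points)

    def is_signaled(self):
        return all(p.signaled for p in self.points)


render = SyncPoint("render")     # hypothetical job names
display = SyncPoint("display")
merged = SyncFence.merge(SyncFence([render]), SyncFence([display]))

render.signaled = True
print(merged.is_signaled())      # one point is still pending
display.signaled = True
print(merged.is_signaled())      # both points have signaled
```

The inputs to `merge` are left untouched, which matches the description of merging into a new Sync Fence.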

When it comes to the userspace API, the Sync Framework implements three ioctl calls. The first one is to wait on a sync_fence to signal. There is also a call to merge two sync_fences into a third, new sync_fence. And finally there is also a call to grab information about the sync_fence and all its sync_points.

The Sync Fence fds are passed to/from the kernel in the calls that ask the kernel to render or display a buffer.

This was intended to be an overview of the Sync Framework, as we will see some of these concepts in the next article, where we will talk about the effort to add explicit fencing to the mainline kernel. If you want to learn more about the Sync Framework you can find more info here and here.

October 18, 2016 06:16 PM

October 08, 2016

Michael Kerrisk (manpages): man-pages-4.07 is released

I've released man-pages-4.07. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from around 50 contributors. The release includes changes to over 140 man pages. Among the more significant changes in man-pages-4.07 are the following:

October 08, 2016 12:20 PM

Michael Kerrisk (manpages): man-pages-4.06 is released

I've released man-pages-4.06. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from around 20 contributors. The release includes changes to just over 40 man pages. Among the more significant changes in man-pages-4.06 are the following:

October 08, 2016 12:20 PM

Michael Kerrisk (manpages): man-pages-4.05 is released

I've released man-pages-4.05. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from more than 70 contributors. The release includes changes to more than 400 man pages. Among the more significant changes in man-pages-4.05 are the following:

October 08, 2016 12:20 PM

Michael Kerrisk (manpages): man-pages-4.08 is released

I've released man-pages-4.08. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release includes changes to nearly 200 man pages. Among the more significant changes in man-pages-4.08 are the following:

October 08, 2016 12:19 PM

October 07, 2016

Daniel Vetter: Neat drm/i915 Stuff for 4.8

I procrastinated rather badly on this one, so instead of this post appearing when the previous kernel release happened, the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report.

Since I’m this late I figured instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community the mid-layer mistake, or helper library design pattern (see the linked article from LWN), is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there’s a midlayer, it will get in the way.
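The helper library pattern can be illustrated with a deliberately tiny Python sketch. All names here are invented; the point is only that each driver owns its request handling and calls shared code when it fits, so a driver with odd requirements can skip the helper.

```python
def helper_alloc_buffer(size):
    """Shared helper: optional common code that a driver may call."""
    return bytearray(size)


class SimpleDriver:
    """The common case: the helper does exactly what this driver needs."""

    def handle_request(self, size):
        return helper_alloc_buffer(size)


class QuirkyDriver:
    """Hypothetical hardware that needs buffers padded to 64 bytes."""

    ALIGN = 64

    def handle_request(self, size):
        # The driver is in control, so it can bypass the helper entirely.
        padded = -(-size // self.ALIGN) * self.ALIGN   # round up to ALIGN
        return bytearray(padded)


print(len(SimpleDriver().handle_request(100)))
print(len(QuirkyDriver().handle_request(100)))
```

With a midlayer, the allocation would happen before the driver was ever called, and the quirky hardware would need a special case added to the shared code.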

Due to the shared history with BSD kernels DRM originally had a full-blown midlayer, but over time this has been fixed. For example kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on that until the interrupt handler wakes them up. The trouble now is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they needed to wait for completed, and if not, again block on the wait queue until the next batch job completed. That’s all rather inefficient. On top there’s only one per-engine knob to enable interrupts. Which means even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in-flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what’s going on. On top there’s now also an efficient search structure of all current waiting processes. With that the interrupt handler can quickly check whether the just completed GPU job is of interest, and if so, which exact process should be woken up.
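The idea of waking only the interested waiters can be sketched with an ordered structure keyed by the request each process is waiting for. The Python below uses a heap as a stand-in for the driver's search structure, which the post does not name; class and method names are invented.

```python
import heapq

class WaitQueue:
    """Toy model: waiters ordered by the request number they are waiting for."""

    def __init__(self):
        self._heap = []   # entries are (seqno, waiter name)

    def add_waiter(self, seqno, name):
        heapq.heappush(self._heap, (seqno, name))

    def on_interrupt(self, completed_seqno):
        """Return only the waiters whose request has completed."""
        woken = []
        while self._heap and self._heap[0][0] <= completed_seqno:
            woken.append(heapq.heappop(self._heap)[1])
        return woken


queue = WaitQueue()
queue.add_waiter(5, "compositor")   # hypothetical processes
queue.add_waiter(2, "game")

print(queue.on_interrupt(1))   # nobody is waiting for request 1
print(queue.on_interrupt(2))   # only the waiter for request 2 wakes up
```

In the old scheme every completion would have woken both processes, and each would have rechecked its own request before going back to sleep.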

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation to implement proper GPU scheduling on top of. And it’s also needed to interface the completion tracking with other drivers, to finally fix tearing for multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.

October 07, 2016 12:00 PM

October 06, 2016

James Morris: LinuxCon Europe Kernel Security Slides

Yesterday I gave an update on the Linux kernel security subsystem at LinuxCon Europe, in Berlin.

The slides are available here:

The talk began with a brief overview and history of the Linux kernel security subsystem, and then I provided an update on significant changes in the v4 kernel series, up to v4.8.  Some expected upcoming features were also covered.  Skip to slide 31 if you just want to see the changes.  There are quite a few!

It’s my first visit to Berlin, and it’s been fascinating to see the remnants of the Cold War, which dominated life in the 1980s when I was at school, but which also seemed so impossibly far from Australia.

Brandenburg Gate, Berlin. Unity Day 2016.

I hope to visit again with more time to explore.

October 06, 2016 12:56 PM

Pavel Machek: FlightGame

FlightGear is a very nice simulator, but it is not a lot of fun: a page with "places to fly" helps. But when you set up your flight details, including weather and failures, you can kind of expect what is going to happen. FlightGame was designed to address this (not for me, unfortunately, although... if you ever debugged a piece of software you know unexpected things happen): levels are prepared to be interesting, yet they try to provide enough information so that you don't need to study maps and aircraft specifications before the flight.
Don't expect anything great/too complex, this is just python getting data from gpsd, and causing your aircraft problems over the internal webserver. But it still should be fun.
Code is at
. I guess I should really create a better README.
Who wants to play?

October 06, 2016 08:56 AM

October 05, 2016

Kees Cook: security things in Linux v4.8

Previously: v4.7. Here are a bunch of security things I’m excited about in Linux v4.8:

SLUB freelist ASLR

Thomas Garnier continued his freelist randomization work by adding SLUB support.

x86_64 KASLR text base offset physical/virtual decoupling

On x86_64, to implement the KASLR text base offset, the physical memory location of the kernel was randomized, which resulted in the virtual address being offset as well. Due to how the kernel’s “-2GB” addressing works (gcc’s “-mcmodel=kernel”), it wasn’t possible to randomize the physical location beyond the 2GB limit, leaving any additional physical memory unused as a randomization target. In order to decouple the physical and virtual location of the kernel (to make physical address exposures less valuable to attackers), the physical location of the kernel needed to be randomized separately from the virtual location. This required a lot of work for handling very large addresses spanning terabytes of address space. Yinghai Lu, Baoquan He, and I landed a series of patches that ultimately did this (and in the process fixed some other bugs too). This expands the physical offset entropy to roughly $physical_memory_size_of_system / 2MB bits.
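To get a feel for the size of that figure, here is a small Python calculation. It reads the quoted expression as a count of possible 2MB-aligned positions, which is an interpretation and not something the post states, and also shows the base-2 logarithm of that count. The 8 GiB machine is a hypothetical example.

```python
import math

def physical_slots(memory_bytes, align_bytes=2 * 1024 * 1024):
    """Number of 2 MiB-aligned positions in a given amount of memory."""
    return memory_bytes // align_bytes

memory = 8 * 1024**3            # hypothetical machine with 8 GiB of RAM
slots = physical_slots(memory)

print(slots, "positions,", math.log2(slots), "as a base-2 logarithm")
```

Real systems will offer fewer usable positions than this, since the kernel image has to fit and some memory ranges are reserved.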

x86_64 KASLR memory base offset

Thomas Garnier rolled out KASLR to the kernel’s various statically located memory ranges, randomizing their locations with CONFIG_RANDOMIZE_MEMORY. One of the more notable things randomized is the physical memory mapping, which is a known target for attacks. Also randomized is the vmalloc area, which makes targets vmalloced during boot (which tend to always end up in the same location on a given system) harder for attackers to locate. (The vmemmap region randomization accidentally missed the v4.8 window and will appear in v4.9.)

x86_64 KASLR with hibernation

Rafael Wysocki (with Thomas Garnier, Borislav Petkov, Yinghai Lu, Logan Gunthorpe, and myself) worked on a number of fixes to hibernation code that, even without KASLR, were coincidentally exposed by the earlier W^X fix. With that original problem fixed, then memory KASLR exposed more problems. I’m very grateful everyone was able to help out fixing these, especially Rafael and Thomas. It’s a hard place to debug. The bottom line, now, is that hibernation and KASLR are no longer mutually exclusive.

gcc plugin infrastructure

Emese Revfy ported the PaX/Grsecurity gcc plugin infrastructure to upstream. If you want to perform compiler-based magic on kernel builds, now it’s much easier with CONFIG_GCC_PLUGINS! The plugins live in scripts/gcc-plugins/. Current plugins are a short example called “Cyclic Complexity” which just emits the complexity of functions as they’re compiled, and “Sanitizer Coverage” which provides the same functionality as gcc’s recent “-fsanitize-coverage=trace-pc” but back through gcc 4.5. Another notable detail about this work is that it was the first Linux kernel security work funded by Linux Foundation’s Core Infrastructure Initiative. I’m looking forward to more plugins!

If you’re on Debian or Ubuntu, the required gcc plugin headers are available via the gcc-$N-plugin-dev package (and similarly for all cross-compiler packages).

hardened usercopy

Along with work from Rik van Riel, Laura Abbott, Casey Schaufler, and many other folks doing testing on the KSPP mailing list, I ported part of PAX_USERCOPY (the basic runtime bounds checking) to upstream as CONFIG_HARDENED_USERCOPY. One of the interface boundaries between the kernel and user-space is the copy_to_user()/copy_from_user() family of functions. Frequently, the size of a copy is known at compile-time (“built-in constant”), so there’s not much benefit in checking those sizes (hardened usercopy avoids these cases). In the case of dynamic sizes, hardened usercopy checks for 3 areas of memory: slab allocations, stack allocations, and kernel text. Direct kernel text copying is simply disallowed. Stack copying is allowed as long as it is entirely contained by the current stack memory range (and on x86, only if it does not include the saved stack frame and instruction pointers). For slab allocations (e.g. those allocated through kmem_cache_alloc() and the kmalloc()-family of functions), the copy size is compared against the size of the object being copied. For example, if copy_from_user() is writing to a structure that was allocated as size 64, but the copy gets tricked into trying to write 65 bytes, hardened usercopy will catch it and kill the process.
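The slab check boils down to a bounds comparison, which the following Python sketch models. It is not the kernel's code: the function and exception names are invented, and raising an exception stands in for killing the process.

```python
class UsercopyViolation(Exception):
    """Stand-in for the kernel rejecting the copy and killing the process."""


def check_slab_copy(object_size, offset, copy_len):
    """Allow a copy only if it stays inside the slab object it targets."""
    if offset < 0 or copy_len < 0 or offset + copy_len > object_size:
        raise UsercopyViolation(
            f"copy of {copy_len} bytes at offset {offset} "
            f"exceeds object of size {object_size}"
        )


check_slab_copy(64, 0, 64)          # fits exactly: allowed

try:
    check_slab_copy(64, 0, 65)      # the 65-byte case from the text
except UsercopyViolation as err:
    print("rejected:", err)
```

The offset parameter covers copies that start partway into an object, which must also end before the object does.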

For testing hardened usercopy, lkdtm gained several new tests: USERCOPY_HEAP_SIZE_TO, USERCOPY_HEAP_SIZE_FROM, USERCOPY_STACK_FRAME_TO, USERCOPY_STACK_FRAME_FROM, USERCOPY_STACK_BEYOND, and USERCOPY_KERNEL. Additionally, USERCOPY_HEAP_FLAG_TO and USERCOPY_HEAP_FLAG_FROM were added to test what will be coming next for hardened usercopy: flagging slab memory as “safe for copy to/from user-space”, effectively whitelisting certain slab caches, as done by PAX_USERCOPY. This further reduces the scope of what’s allowed to be copied to/from, since most kernel memory is not intended to ever be exposed to user-space. Adding this logic will require some reorganization of usercopy code to add some new APIs, as PAX_USERCOPY’s approach to handling special-cases is to add bounce-copies (copy from slab to stack, then copy to userspace) as needed, which is unlikely to be acceptable upstream.

seccomp reordered after ptrace

By its original design, seccomp filtering happened before ptrace so that seccomp-based ptracers (i.e. SECCOMP_RET_TRACE) could explicitly bypass seccomp filtering and force a desired syscall. Nothing actually used this feature, and as it turns out, it’s not compatible with process launchers that install seccomp filters (e.g. systemd, lxc) since as long as the ptrace and fork syscalls are allowed (and fork is needed for any sensible container environment), a process could spawn a tracer to help bypass a filter by injecting syscalls. After Andy Lutomirski convinced me that ordering ptrace first does not change the attack surface of a running process (unless all syscalls are blacklisted, the entire ptrace attack surface will always be exposed), I rearranged things. Now there is no (expected) way to bypass seccomp filters, and containers with seccomp filters can allow ptrace again.

That’s it for v4.8! The merge window is open for v4.9…

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 05, 2016 12:26 AM

October 03, 2016

Matthew Garrett: The importance of paying attention in building community trust

Trust is important in any kind of interpersonal relationship. It's inevitable that there will be cases where something you do will irritate or upset others, even if only to a small degree. Handling small cases well helps build trust that you will do the right thing in more significant cases, whereas ignoring things that seem fairly insignificant (or saying that you'll do something about them and then failing to do so) suggests that you'll also fail when there's a major problem. Getting the small details right is a major part of creating the impression that you'll deal with significant challenges in a responsible and considerate way.

This isn't limited to individual relationships. Something that distinguishes good customer service from bad customer service is getting the details right. There are many industries where significant failures happen infrequently, but minor ones happen a lot. Would you prefer to give your business to a company that handles those small details well (even if they're not overly annoying) or one that just tells you to deal with them?

And the same is true of software communities. A strong and considerate response to minor bug reports makes it more likely that users will be patient with you when dealing with significant ones. Handling small patch contributions quickly makes it more likely that a submitter will be willing to do the work of making more significant contributions. These things are well understood, and most successful projects have actively worked to reduce barriers to entry and to be responsive to user requests in order to encourage participation and foster a feeling that they care.

But what's often ignored is that this applies to other aspects of communities as well. Failing to use inclusive language may not seem like a big thing in itself, but it leaves people with the feeling that you're less likely to do anything about more egregious exclusionary behaviour. Allowing a baseline level of sexist humour gives the impression that you won't act if there are blatant displays of misogyny. The more examples of these "insignificant" issues people see, the more likely they are to choose to spend their time somewhere else, somewhere they can have faith that major issues will be handled appropriately.

There's a more insidious aspect to this. Sometimes we can believe that we are handling minor issues appropriately, that we're acting in a way that handles people's concerns, while actually failing to do so. If someone raises a concern about an aspect of the community, it's important to discuss solutions with them. Putting effort into "solving" a problem without ensuring that the solution has the desired outcome is not only a waste of time, it alienates those affected even more - they're now not only left with the feeling that they can't trust you to respond appropriately, but that you will actively ignore their feelings in the process.

It's not always possible to satisfy everybody's concerns. Sometimes you'll be left in situations where you have conflicting requests. In that case the best thing you can do is to explain the conflict and why you've made the choice you have, and demonstrate that you took this issue seriously rather than ignoring it. Depending on the issue, you may still alienate some number of participants, but it'll be fewer than if you just pretend that it's not actually a problem.

One warning, though: while building trust in this way enhances people's willingness to join your community, it also builds expectations. If a significant issue does arise, and if you fail to handle it well, you'll burn a lot of that trust in the process. The fact that you've built that trust in the first place may be what saves your community from disintegrating completely, but people will feel even more betrayed if you don't actively work to rebuild it. And if there's a pattern of mishandling major problems, no amount of getting the details right will matter.

Communities that ignore these issues are, long term, likely to end up weaker than communities that pay attention to them. Making sure you get this right in the first place, and setting expectations that you will pay attention to your contributors, is a vital part of building a meaningful relationship between your community and its members.


October 03, 2016 05:14 PM

Gustavo F. Padovan: Collabora Contributions to Linux Kernel 4.8

Linux Kernel 4.8 is out and once more Collabora engineers made a significant contribution to the kernel. For 4.8, Collabora contributed 101 patches by 8 engineers, our record to date in a single kernel release! We’ve also seen the first contribution from Frederic Dalleau since he joined Collabora. The new features of the new kernel were covered in three different posts.

On the Collabora side of the contributions we touched a few different areas in the kernel. Bob Ham, who recently left Collabora, added support for the Alea I Random Number Generator, while Enric Balletbo improved the audio support on the Rockchip rk3288 SoC. Frederic Dalleau fixed an important memory leak on the Bluetooth stack.

Gustavo Padovan continued his work to add Explicit Synchronization for Buffer Sharing to the kernel. In this release he added fence_array support and prepared the SW_SYNC interfaces for de-staging; SW_SYNC is meant to be used for Explicit Synchronization testing. He also worked on removing some of the legacy functions in drm_irq.c from the kernel.

Helen Koike added some improvements and clean ups to the ASoC subsystem mainly on the max9877 and tpa6130a2 drivers. Nicolas Dufresne fixed the bytes per line calculation on YUV planes on the uvcvideo driver.

Thierry Escande added many improvements to the NFC digital layer, and Tomeu Vizoso added a new helper for the ChromeOS Embedded Controller and improved usage of DRM Core APIs on the Rockchip driver. He also fixed an issue with the Analogix DP on Rockchip that was not enabling clocks in the correct order.

Bob Ham (2):

Enric Balletbo i Serra (8):

Frederic Dalleau (1):

Gustavo Padovan (50):

Helen Koike (8):

Nicolas Dufresne (1):

Thierry Escande (26):

Tomeu Vizoso (5):

October 03, 2016 01:59 PM

Pavel Machek: Linux V4.8 on N900

Basics work, good. GSM does not work too well, which is kind of a problem. Camera broke between 4.7 and 4.8. That is not good, either.

If you want to talk about Linux and phones, I'll probably be at LinuxDays in Prague this weekend, and will have a talk about it at Ubucon Europe.

October 03, 2016 11:13 AM

Kees Cook: security things in Linux v4.7

Previously: v4.6. Onward to security things I found interesting in Linux v4.7:

KASLR text base offset for MIPS

Matt Redfearn added text base address KASLR to MIPS, similar to what’s available on x86 and arm64. As done with x86, MIPS attempts to gather entropy from various build-time, run-time, and CPU locations in an effort to find reasonable sources during early-boot. MIPS doesn’t yet have anything as strong as x86’s RDRAND (though most have an instruction counter like x86’s RDTSC), but it does have the benefit of being able to use Device Tree (i.e. the “/chosen/kaslr-seed” property) like arm64 does. By my understanding, even without Device Tree, MIPS KASLR entropy should be as strong as pre-RDRAND x86 entropy, which is more than sufficient for what is, similar to x86, not a huge KASLR range anyway: default 8 bits (a span of 16MB with 64KB alignment), though CONFIG_RANDOMIZE_BASE_MAX_OFFSET can be tuned to the device’s memory, giving a maximum of 11 bits on 32-bit, and 15 bits on EVA or 64-bit.
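The range figures in that paragraph follow from simple arithmetic: N bits of entropy at a fixed alignment give 2^N candidate offsets, spanning 2^N times the alignment. A minimal sketch, assuming the 64KB alignment quoted for the default case also applies to the larger settings (the function name is invented for illustration and is not a kernel interface):

```python
ALIGNMENT = 64 * 1024  # 64KB alignment, as quoted for the default MIPS case


def kaslr_span_bytes(bits, alignment=ALIGNMENT):
    """Bytes spanned by `bits` of KASLR entropy at the given alignment."""
    return (1 << bits) * alignment


# Default, 32-bit maximum, and EVA/64-bit maximum from the text.
for bits in (8, 11, 15):
    span_mb = kaslr_span_bytes(bits) // (1024 * 1024)
    print(f"{bits} bits -> {1 << bits} slots, {span_mb} MB span")
```

The 8-bit case reproduces the 16MB span stated in the post; the other two lines are only as good as the alignment assumption.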

SLAB freelist ASLR

Thomas Garnier added CONFIG_SLAB_FREELIST_RANDOM to make slab allocation layouts less deterministic with a per-boot randomized freelist order. This raises the bar for successful kernel slab attacks. Attackers will need to either find additional bugs to help leak slab layout information or will need to perform more complex grooming during an attack. Thomas wrote a post describing the feature in more detail here: Randomizing the Linux kernel heap freelists. (SLAB is done in v4.7, and SLUB in v4.8.)
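As a conceptual illustration only (this is not the kernel's code, and the function name is hypothetical), the idea of a randomized freelist can be modelled as shuffling the order in which a slab's objects are handed out, instead of handing them out sequentially:

```python
import random


def randomized_freelist(num_objects, rng):
    """Return the object indexes of one slab in a shuffled order (Fisher-Yates)."""
    order = list(range(num_objects))
    for i in range(num_objects - 1, 0, -1):
        j = rng.randrange(i + 1)  # pick a position in [0, i]
        order[i], order[j] = order[j], order[i]
    return order


rng = random.SystemRandom()  # OS-provided randomness, standing in for boot-time entropy
print("sequential:", list(range(8)))
print("randomized:", randomized_freelist(8, rng))
```

With a sequential freelist an attacker can predict that two consecutive allocations land next to each other; with a shuffled one, adjacency has to be discovered or forced by other means.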

eBPF JIT constant blinding

Daniel Borkmann implemented constant blinding in the eBPF JIT subsystem. With strong kernel memory protections (CONFIG_DEBUG_RODATA) in place, and with the segregation of user-space memory execution from kernel (i.e SMEP, PXN, CONFIG_CPU_SW_DOMAIN_PAN), having a place where user-space can inject content into an executable area of kernel memory becomes very high-value to an attacker. The eBPF JIT was exactly such a thing: the use of BPF constants could result in the JIT producing instruction flows that could include attacker-controlled instructions (e.g. by directing execution into the middle of an instruction with a constant that would be interpreted as a native instruction). The eBPF JIT already uses a number of other defensive tricks (e.g. random starting position), but this added randomized blinding to any BPF constants, which makes building a malicious execution path in the eBPF JIT memory much more difficult (and helps block attempts at JIT spraying to bypass other protections).
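The general technique of constant blinding can be sketched in a few lines. This is a toy model under stated assumptions, not the kernel's JIT: the tuple-based "instructions", the register name "tmp" and all function names are invented for illustration. Instead of embedding an attacker-chosen constant directly in the emitted code, the JIT emits the constant XORed with a random value, followed by an instruction that XORs the random value back in at run time:

```python
import secrets

MASK32 = 0xFFFFFFFF


def blind_constant(imm):
    """Return (imm XOR rnd, rnd) for a fresh random 32-bit rnd."""
    rnd = secrets.randbits(32)
    return (imm ^ rnd) & MASK32, rnd


def emit_blinded_load(imm):
    """Emit a two-instruction sequence that loads imm without embedding it verbatim.

    The raw constant only shows up in the output if rnd happens to be 0.
    """
    blinded, rnd = blind_constant(imm)
    return [("mov", "tmp", blinded), ("xor", "tmp", rnd)]


def run(program):
    """Interpret the toy instruction list and return the register file."""
    regs = {}
    for op, reg, value in program:
        if op == "mov":
            regs[reg] = value
        elif op == "xor":
            regs[reg] ^= value
    return regs


program = emit_blinded_load(0x90909090)
print(program)
print(hex(run(program)["tmp"]))
```

The program still computes the same value, but the bytes an attacker wanted to plant in executable memory differ on every compilation.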

Elena Reshetova updated a 2012 proof-of-concept attack to succeed against modern kernels to help provide a working example of what needed fixing in the JIT. This serves as a thorough regression test for the protection.

The cBPF JITs that exist in ARM, MIPS, PowerPC, and Sparc still need to be updated to eBPF, but when they are, they’ll gain all these protections immediately.

Bottom line is that if you enable the (disabled-by-default) bpf_jit_enable sysctl, be sure to set the bpf_jit_harden sysctl to 2 (to perform blinding even for root).
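A read-only check of the two sysctls might look like the sketch below. It assumes they are exposed at their usual procfs locations under /proc/sys/net/core/, and the helper names are hypothetical. Reading may fail without sufficient privilege, in which case the value is reported as unknown rather than guessed:

```python
from pathlib import Path

BPF_JIT_ENABLE = "/proc/sys/net/core/bpf_jit_enable"
BPF_JIT_HARDEN = "/proc/sys/net/core/bpf_jit_harden"


def read_sysctl(path):
    """Return the sysctl value as a string, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def jit_needs_hardening(enable, harden):
    """True when the JIT is known to be on but blinding is not forced for all users."""
    return enable not in (None, "0") and harden != "2"


enable = read_sysctl(BPF_JIT_ENABLE)
harden = read_sysctl(BPF_JIT_HARDEN)
print("bpf_jit_enable:", enable, "bpf_jit_harden:", harden)
if jit_needs_hardening(enable, harden):
    print("JIT is enabled without bpf_jit_harden=2")
```

Changing the values requires root and is deliberately left out of the sketch.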

fix brk ASLR weakness on arm64 compat

There have been a few ASLR fixes recently (e.g. ET_DYN, x86 32-bit unlimited stack), and while reviewing some suggested fixes to arm64 brk ASLR code from Jon Medhurst, I noticed that arm64’s brk ASLR entropy was slightly too low (less than 1 bit) for 64-bit and noticeably lower (by 2 bits) for 32-bit compat processes when compared to native 32-bit arm. I simplified the code by using literals for the entropy. Maybe we can add a sysctl some day to control brk ASLR entropy, as was done for mmap ASLR entropy.

LoadPin LSM

LSM stacking is well-defined since v4.2, so I finally upstreamed a “small” LSM that implements a protection I wrote for Chrome OS several years back. On systems with a static root of trust that extends to the filesystem level (e.g. Chrome OS’s coreboot+depthcharge boot firmware chaining to dm-verity, or a system booting from read-only media), it’s redundant to sign kernel modules (you’ve already got the modules on read-only media: they can’t change). The kernel just needs to know they’re all coming from the correct location. (And this solves loading known-good firmware too, since there is no convention for signed firmware in the kernel yet.) LoadPin requires that all modules, firmware, etc come from the same mount (and assumes that the first loaded file defines which mount is “correct”, hence load “pinning”).
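The pinning rule described in that paragraph is small enough to model directly. The sketch below is a toy model, not the LSM itself: the class name and the use of plain strings as mount identifiers are assumptions made for illustration.

```python
class LoadPinModel:
    """Toy model: the first load pins a mount; later loads must come from it."""

    def __init__(self):
        self.pinned_mount = None

    def allow_load(self, mount_id):
        if self.pinned_mount is None:
            # The first file loaded defines which mount is "correct".
            self.pinned_mount = mount_id
            return True
        return mount_id == self.pinned_mount


pin = LoadPinModel()
print(pin.allow_load("rootfs"))  # first load pins "rootfs"
print(pin.allow_load("rootfs"))  # same mount: allowed
print(pin.allow_load("usb0"))    # different mount: refused
```

The security of the scheme rests entirely on the first load coming from the trusted, read-only filesystem, which is why it only makes sense on systems with a root of trust that extends to that filesystem.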

That’s it for v4.7. Prepare yourself for v4.8 next!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 03, 2016 07:47 AM

October 01, 2016

Kees Cook: security things in Linux v4.6

Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella.

seccomp support for parisc

Helge Deller added seccomp support for parisc, which included plumbing support for PTRACE_GETREGSET to get the self-tests working.

x86 32-bit mmap ASLR vs unlimited stack fixed

Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. “ulimit -s unlimited”) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, a system that sees collisions between unlimited stack and mmap ASLR can just adjust the 32-bit ASLR entropy instead.

x86 execute-only memory

Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for “execute only” memory in user-space. On pkeys-supporting CPUs, using mmap(..., PROT_EXEC) (i.e. without PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I’m looking forward to either emulated QEmu support or access to one of these fancy CPUs.

CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86

Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel.

On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar’s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we’ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it’s reasonable to continue to leave an “out” for developers that find themselves tripping over it.

arm64 KASLR text base offset

Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the “/chosen/kaslr-seed” property) or from UEFI (via EFI_RNG_PROTOCOL), so if you’re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader.

zero-poison after free

Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity’s PAX_MEMORY_SANITIZE feature. This feature means that memory is cleared at free, wiping any sensitive data so it doesn’t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called “sanity checking”), which can catch another small subset of flaws.
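A toy user-space model may help make the two behaviours concrete: wiping on free, and the optional check at allocation time. Nothing here reflects the kernel's actual allocator; the class and method names are invented for illustration.

```python
class PoisoningPool:
    """Toy buffer pool that zeroes buffers on free and can verify them on alloc."""

    def __init__(self, nbufs, size):
        self.free = [bytearray(size) for _ in range(nbufs)]

    def alloc(self, sanity_check=False):
        buf = self.free.pop()
        # Optional "sanity checking": the buffer must still be all zeroes.
        if sanity_check and any(buf):
            raise RuntimeError("buffer was written to after it was freed")
        return buf

    def release(self, buf):
        buf[:] = bytes(len(buf))  # zero-poison: wipe contents at free time
        self.free.append(buf)


pool = PoisoningPool(nbufs=1, size=8)
buf = pool.alloc()
buf[:6] = b"secret"
pool.release(buf)
print(bytes(buf))  # contents were wiped by release()

buf[0] = 0x41      # simulated write-after-free through a stale reference
try:
    pool.alloc(sanity_check=True)
except RuntimeError as exc:
    print("caught:", exc)
```

The wipe on release is what removes stale sensitive data; the check on alloc is the extra, more expensive step that can catch writes through dangling references.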

To understand the pieces of this, it’s worth describing that the kernel’s higher level allocator, the “page allocator” (e.g. __get_free_pages()) is used by the finer-grained “slab allocator” (e.g. kmem_cache_alloc(), kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator).

Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn’t worth the performance trade-off.

Tests for the features can be found in lkdtm as READ_AFTER_FREE and READ_BUDDY_AFTER_FREE. If you’re feeling especially paranoid and have enabled sanity-checking, WRITE_AFTER_FREE and WRITE_BUDDY_AFTER_FREE can test these as well.

To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:


and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:

page_poison=on slub_debug=P

To add sanity-checking, change PAGE_POISONING_NO_SANITY=n, and add “F” to slub_debug as “slub_debug=PF“.
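To see which of these boot options a running system was actually started with, one option is to parse the kernel command line (readable from /proc/cmdline). The sketch below only inspects explicit values as written in this post; it assumes slub_debug flags come before an optional comma-separated slab list, and it does not try to interpret a bare "slub_debug" with no flags. The function names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict ('' if no value)."""
    opts = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        opts[key] = value
    return opts


def poisoning_status(cmdline):
    """Report which poisoning options from the post appear on the command line."""
    opts = parse_cmdline(cmdline)
    slub_flags = opts.get("slub_debug", "").split(",")[0]
    return {
        "page_poison": opts.get("page_poison") == "on",
        "slab_poison": "P" in slub_flags,
        "slab_sanity": "F" in slub_flags,
    }


print(poisoning_status("root=/dev/sda1 ro page_poison=on slub_debug=P"))
print(poisoning_status("root=/dev/sda1 ro page_poison=on slub_debug=PF"))
```

On a live system the argument would be the contents of /proc/cmdline rather than a literal string.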

read-only after init

I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity’s KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on CONFIG_DEBUG_RODATA above).

Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared “const” in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked “__init“) and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc.

As it turns out, most architectures with kernel memory protection already delay making their data read-only until after __init (see mark_rodata_ro()), so it’s trivial to declare a new data section (“.data..ro_after_init“) and add it to the existing read-only data section (“.rodata“). Kernel structures can be annotated with the new section (via the “__ro_after_init” macro), and they’ll become read-only once boot has finished.

The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel’s attack surface can be made read-only for the majority of its lifetime.

As people identify places where __ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of __read_only on variables that are only written to during __init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses pax_open_kernel()/pax_close_kernel() for these).

That’s it for v4.6, next up will be v4.7!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 01, 2016 07:45 AM

September 30, 2016

LPC 2016: Last batch of LPC registrations available on October 1

The last batch of registrations for the 2016 Linux Plumbers Conference will be available starting at noon Eastern Time (EDT) on October 1. This will be the last chance to register to attend the conference. Those interested should visit the registration web site after that time.

The schedule for the conference has been posted, which includes information on the microconferences (and the discussions planned for those) as well as the refereed talks. Any conflicts noted should be sent to

We hope to see you in Santa Fe!

September 30, 2016 02:49 PM

James Morris: Linux Security Summit 2016 Wrapup

Here’s a summary of the 2016 Linux Security Summit, which was held last month in Toronto.

Presentation slides are available at

This year, videos were made of the sessions, and they may be viewed at — many thanks to Intel for sponsoring the recordings!

LWN has published some excellent coverage:

This is a pretty good representation of the main themes which emerged in the conference: container security, kernel self-protection, and integrity / secure boot.

Many of the core or low level security technologies (such as access control, integrity measurement, crypto, and key management) are now fairly mature. There’s more focus now on how to integrate these components into higher-level systems and architectures.

One talk I found particularly interesting was Design and Implementation of a Security Architecture for Critical Infrastructure Industrial Control Systems in the Era of Nation State Cyber Warfare. (The title, it turns out, was a hack to bypass limited space for the abstract in the cfp system).  David Safford presented an architecture being developed by GE to protect a significant portion of the world’s electrical grid from attack.  This is being done with Linux, and is a great example of how the kernel’s security mechanisms are being utilized for such purposes.  See the slides or the video.  David outlined gaps in the kernel in relation to their requirements, and a TPM BoF was held later in the day to work on these.  The BoF was reportedly very successful, as several key developers in the area of TPM and Integrity were present.

#linuxsecuritysummit TPM BOF session

— LinuxSecuritySummit (@LinuxSecSummit) August 25, 2016

Attendance at LSS was the highest yet with well over a hundred security developers, researchers and end users.

Special thanks to all of the LF folk who manage the logistics for the event.  There’s no way we could stage something on this scale without their help.

Stay tuned for the announcement of next year’s event!


September 30, 2016 11:19 AM

Daniel Vetter: Commit Rights in the Linux Kernel?!

For about a year we’ve been running the Intel graphics driver with a new process: besides the two established maintainers, we’ve added all regular contributors as committers to the main feature branch feeding into -next. This turned out to be a tremendous success, but did require some initial adjustments to how we run things in the first few months.

I’ve presented the new model here at Kernel Recipes in Paris, and I will also talk about it at Kernel Summit in Santa Fe. Since LWN is present at both I won’t bother with a full writeup, but leave that to much better editors. Update: LWN on kernel maintainer scalability.

Anyway, there’s a video recording and the slides. Our process is also documented - scroll down to the bottom for the more interesting bits around what’s expected of committers.

On a related note: At XDC, and a bit before, Eric Anholt started a discussion about improving our patch submission process, especially for new contributors. He used the Rust community as a great example, and presented about it at XDC. Rather interesting to hear his perspective as a first-time contributor confirm what I learned at LCA this year in Emily Dunham’s awesome talk on Life is better with Rust’s community automation.

September 30, 2016 05:32 AM

September 28, 2016

Kees Cook: security things in Linux v4.5

Previously: v4.4. Some things I found interesting in the Linux kernel v4.5:


CONFIG_IO_STRICT_DEVMEM

The CONFIG_STRICT_DEVMEM setting that has existed for a long time already protects system RAM from being accessible through the /dev/mem device node to root in user-space. Dan Williams added CONFIG_IO_STRICT_DEVMEM to extend this so that if a kernel driver has reserved a device memory region for use, it will become unavailable to /dev/mem also. The reservation in the kernel was to keep other kernel things from using the memory, so this is just common sense to make sure user-space can’t stomp on it either. Everyone should have this enabled. (And if you have a system where you discover you need IO memory access from userspace, you can boot with “iomem=relaxed” to disable this at runtime.)

If you’re looking to create a very bright line between user-space having access to device memory, it’s worth noting that if a device driver is a module, a malicious root user can just unload the module (freeing the kernel memory reservation), fiddle with the device memory, and then reload the driver module. So either just leave out /dev/mem entirely (not currently possible with upstream), build a monolithic kernel (no modules), or otherwise block (un)loading of modules (/proc/sys/kernel/modules_disabled).

ptrace fsuid checking

Jann Horn fixed some corner-cases in how ptrace access checks were handled on special files in /proc. For example, prior to this fix, if a setuid process temporarily dropped privileges to perform actions as a regular user, the ptrace checks would not notice the reduced privilege, possibly allowing a regular user to trick a privileged process into disclosing things out of /proc (ASLR offsets, restricted directories, etc) that they normally would be restricted from seeing.

ASLR entropy sysctl

Daniel Cashman standardized the way architectures declare their maximum user-space ASLR entropy (CONFIG_ARCH_MMAP_RND_BITS_MAX) and then created a sysctl (/proc/sys/vm/mmap_rnd_bits) so that system owners could crank up entropy. For example, the default entropy on 32-bit ARM was 8 bits, but the maximum could be as much as 16. If your 64-bit kernel is built with CONFIG_COMPAT, there’s a compat version of the sysctl as well, for controlling the ASLR entropy of 32-bit processes: /proc/sys/vm/mmap_rnd_compat_bits.

Here’s how to crank your entropy to the max, without regard to what architecture you’re on:

for i in "" "compat_"; do f=/proc/sys/vm/mmap_rnd_${i}bits; n=$(cat $f); while echo $n > $f ; do n=$(( n + 1 )); done; done

strict sysctl writes

Two years ago I added a sysctl for treating sysctl writes more like regular files (i.e. what’s written first is what appears at the start), rather than like a ring-buffer (what’s written last is what appears first). At the time it wasn’t clear what might break if this was enabled, so a WARN was added to the kernel. Since only one such string showed up in searches over the last two years, the strict writing mode was made the default. The setting remains available as /proc/sys/kernel/sysctl_writes_strict.
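The difference between the two modes is easiest to see with a write that arrives in two pieces on the same file descriptor. The following is a toy model of a string sysctl only, based on the behaviour described in the kernel's sysctl documentation (legacy mode rewrites the value on every write; strict mode respects the file position, so writes append). It does not model numeric sysctls or buffer size limits, and the function name is invented.

```python
def write_sysctl_string(chunks, strict):
    """Model the value of a string sysctl after successive write() calls."""
    value = ""
    pos = 0
    for chunk in chunks:
        if strict:
            # Strict: honour the file position, so later writes append.
            value = value[:pos] + chunk
            pos += len(chunk)
        else:
            # Legacy: every write starts over at the beginning of the value.
            value = chunk
    return value


print(write_sysctl_string(["core.", "%p"], strict=False))
print(write_sysctl_string(["core.", "%p"], strict=True))
```

A program that relied on the legacy behaviour would have been splitting one value across several writes and getting only the last piece, which is the kind of breakage the WARN was added to detect.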

seccomp UM support

Mickaël Salaün added seccomp support (and selftests) for user-mode Linux. Moar architectures!

seccomp NNP vs TSYNC fix

Jann Horn noticed and fixed a problem where if a seccomp filter was already in place on a process (after being installed by a privileged process like systemd, a container launcher, etc) then the setting of the “no new privs” flag could be bypassed when adding filters with the SECCOMP_FILTER_FLAG_TSYNC flag set. Bypassing NNP meant it might be possible to trick a buggy setuid program into doing things as root after a seccomp filter forced a privilege drop to fail (generally referred to as the “sendmail setuid flaw”). With NNP set, a setuid program can’t be run in the first place.
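For readers who have not used the flag, "no new privs" is queried and set through prctl(). The sketch below assumes a Linux system where libc is reachable through ctypes; the option numbers are the values from <linux/prctl.h>, while the wrapper function names are hypothetical. Only the query is executed here, because setting the flag is irreversible for the calling process.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # values from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0):
    # prctl() is variadic, so pass every argument explicitly as unsigned long.
    args = [ctypes.c_ulong(a) for a in (arg2, 0, 0, 0)]
    ret = _libc.prctl(ctypes.c_int(option), *args)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def get_no_new_privs():
    """Return 1 if the calling thread has no_new_privs set, else 0."""
    return _prctl(PR_GET_NO_NEW_PRIVS)


def set_no_new_privs():
    """Set no_new_privs. This is one-way and is inherited across fork and exec."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


print("no_new_privs:", get_no_new_privs())
```

An unprivileged process is expected to call the equivalent of set_no_new_privs() before installing a seccomp filter, which is the requirement the TSYNC bug allowed to be sidestepped.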

That’s it! Next I’ll cover v4.6.

Edit: Added notes about “iomem=…”

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

September 28, 2016 09:58 PM

September 27, 2016

Kees Cook: security things in Linux v4.4

Previously: v4.3. Continuing with interesting security things in the Linux kernel, here’s v4.4. As before, if you think there’s stuff I missed that should get some attention, please let me know.

seccomp Checkpoint/Restore-In-Userspace

Tycho Andersen added a way to extract and restore seccomp filters from running processes via PTRACE_SECCOMP_GET_FILTER under CONFIG_CHECKPOINT_RESTORE. This is a continuation of his work (that I failed to mention in my prior post) from v4.3, which introduced a way to suspend and resume seccomp filters. As I mentioned at the time (and for which he continues to quote me) “this feature gives me the creeps.” :)

x86 W^X detection

Stephen Smalley noticed that there was still a range of kernel memory (just past the end of the kernel code itself) that was incorrectly marked writable and executable, defeating the point of CONFIG_DEBUG_RODATA which seeks to eliminate these kinds of memory ranges. He corrected this in v4.3 and added CONFIG_DEBUG_WX in v4.4 which performs a scan of memory at boot time and yells loudly if unexpected memory protections are found. To nobody’s delight, it was shortly discovered that UEFI leaves chunks of memory in this state too, which posed an ugly-to-solve problem (which Matt Fleming addressed in v4.6).

x86_64 vsyscall CONFIG

I introduced a way to control the mode of the x86_64 vsyscall with a build-time CONFIG selection, though the choice I really care about is CONFIG_LEGACY_VSYSCALL_NONE, to force the vsyscall memory region off by default. The vsyscall memory region was always mapped into process memory at a fixed location, and it originally posed a security risk as a ROP gadget execution target. The vsyscall emulation mode was added to mitigate the problem, but it still left fixed-position static memory content in all processes, which could still pose a security risk. The good news is that glibc since version 2.15 doesn’t need vsyscall at all, so it can just be removed entirely. Anyone running a kernel built this way who discovers they need to support a pre-2.15 glibc can still re-enable it at the kernel command line with “vsyscall=emulate”.
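Whether a process still has the vsyscall region mapped can be checked from its memory map. The sketch below assumes the region is listed in /proc/<pid>/maps under the name "[vsyscall]" when present and is absent from the listing when disabled; the function name and the sample lines are illustrative.

```python
def has_vsyscall(maps_text):
    """Return True if a /proc/<pid>/maps listing contains a [vsyscall] mapping."""
    for line in maps_text.splitlines():
        fields = line.split()
        # The mapping name, when present, is the last whitespace-separated field.
        if fields and fields[-1] == "[vsyscall]":
            return True
    return False


sample = (
    "7ffd5a5f2000-7ffd5a5f4000 r-xp 00000000 00:00 0          [vdso]\n"
    "ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]\n"
)
print(has_vsyscall(sample))
```

On a live system the input would come from reading /proc/self/maps; note the fixed address in the sample, which is exactly the property that makes the region attractive to attackers.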

That’s it for v4.4. Tune in tomorrow for v4.5!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

September 27, 2016 10:47 PM

Dave Airlie: radv: status update or is Talos Principle rendering yet?

The answer is YES!!

I fixed the last bug with instance rendering and Talos renders great on radv now.

Also, with the semi-interesting branch, vkQuake renders too. There are some upstream bugs that need fixing in spirv/nir, for which I'm awaiting an upstream resolution, but I've included some prelim fixes in semi-interesting for now; those will go away when the upstream fixes are decided on.


September 27, 2016 04:33 AM

September 26, 2016

Kees Cook: security things in Linux v4.3

When I gave my State of the Kernel Self-Protection Project presentation at the 2016 Linux Security Summit, I included some slides covering some quick bullet points on things I found of interest in recent Linux kernel releases. Since there wasn’t a lot of time to talk about them all, I figured I’d make some short blog posts here about the stuff I was paying attention to, along with links to more information. This certainly isn’t everything security-related or generally of interest, but they’re the things I thought needed to be pointed out. If there’s something security-related you think I should cover from v4.3, please mention it in the comments. I’m sure I haven’t caught everything. :)

A note on timing and context: the momentum for starting the Kernel Self Protection Project got rolling well before it was officially announced on November 5th last year. To that end, I included stuff from v4.3 (which was developed in the months leading up to November) under the umbrella of the project, since the goals of KSPP aren’t unique to the project nor must the goals be met by people that are explicitly participating in it. Additionally, not everything I think worth mentioning here technically falls under the “kernel self-protection” ideal anyway — some things are just really interesting userspace-facing features.

So, to that end, here are things I found interesting in v4.3:


CONFIG_CPU_SW_DOMAIN_PAN

Russell King implemented this feature for ARM which provides emulated segregation of user-space memory when running in kernel mode, by using the ARM Domain access control feature. This is similar to a combination of Privileged eXecute Never (PXN, in later ARMv7 CPUs) and Privileged Access Never (PAN, coming in future ARMv8.1 CPUs): the kernel cannot execute user-space memory, and cannot read/write user-space memory unless it was explicitly prepared to do so. This stops a huge set of common kernel exploitation methods, where either a malicious executable payload has been built in user-space memory and the kernel was redirected to run it, or where malicious data structures have been built in user-space memory and the kernel was tricked into dereferencing the memory, ultimately leading to a redirection of execution flow.

This raises the bar for attackers since they can no longer trivially build code or structures in user-space where they control the memory layout, locations, etc. Instead, an attacker must find areas in kernel memory that are writable (and in the case of code, executable), where they can discover the location as well. For an attacker, there are vastly fewer places where this is possible in kernel memory as opposed to user-space memory. And as we continue to reduce the attack surface of the kernel, these opportunities will continue to shrink.

While hardware support for this kind of segregation exists in s390 (natively separate memory spaces), ARM (PXN and PAN as mentioned above), and very recent x86 (SMEP since Ivy-Bridge, SMAP since Skylake), ARM is the first upstream architecture to provide this emulation for existing hardware. Everyone running ARMv7 CPUs with this kernel feature enabled suddenly gains the protection. Similar emulation protections (PAX_MEMORY_UDEREF) have been available in PaX/Grsecurity for a while, and I’m delighted to see a form of this land in upstream finally.

To test this kernel protection, the ACCESS_USERSPACE and EXEC_USERSPACE triggers for lkdtm have existed since Linux v3.13, when they were introduced in anticipation of the x86 SMEP and SMAP features.

Ambient Capabilities

Andy Lutomirski (with Christoph Lameter and Serge Hallyn) implemented a way for processes to pass capabilities across exec() in a sensible manner. Until Ambient Capabilities, any capabilities available to a process would only be passed to a child process if the new executable was correctly marked with filesystem capability bits. This turns out to be a real headache for anyone trying to build an even marginally complex “least privilege” execution environment. The case that Chrome OS ran into was having a network service daemon responsible for calling out to helper tools that would perform various networking operations. Keeping the daemon not running as root and retaining the needed capabilities in children required conflicting or crazy filesystem capabilities organized across all the binaries in the expected tree of privileged processes. (For example you may need to set filesystem capabilities on bash!) By being able to explicitly pass capabilities at runtime (instead of based on filesystem markings), this becomes much easier.
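The effect on exec() can be modelled with plain sets, following the transformation rules documented in capabilities(7). This is a simplified sketch: the function name is hypothetical, setuid/setgid binaries (which also count as privileged) are not modelled, and the invariant that ambient capabilities must already be both permitted and inheritable is assumed rather than enforced.

```python
def caps_after_exec(inheritable, ambient, bounding=None,
                    file_permitted=frozenset(), file_inheritable=frozenset(),
                    file_effective=False):
    """Compute a process's capability sets after exec(), per capabilities(7)."""
    inheritable, ambient = set(inheritable), set(ambient)
    file_permitted, file_inheritable = set(file_permitted), set(file_inheritable)
    if bounding is None:
        bounding = set(file_permitted)  # model an unrestricted bounding set

    # A file carrying capabilities is "privileged" and clears the ambient set.
    file_privileged = bool(file_permitted or file_inheritable)
    new_ambient = set() if file_privileged else ambient

    new_permitted = ((inheritable & file_inheritable)
                     | (file_permitted & set(bounding))
                     | new_ambient)
    new_effective = new_permitted if file_effective else new_ambient
    return {"permitted": new_permitted, "effective": new_effective,
            "ambient": new_ambient}


cap = {"CAP_NET_ADMIN"}
# A plain helper binary with no filesystem capability bits:
print(caps_after_exec(inheritable=cap, ambient=set()))  # capability is lost
print(caps_after_exec(inheritable=cap, ambient=cap))    # capability survives
```

The first call shows the pre-v4.3 situation: without file capability bits on the helper, the inheritable set alone carries nothing across exec(). The second shows the ambient set doing the job with no filesystem markings at all.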

For more details, the commit message is well-written, almost twice as long as the code changes, and contains a test case. If that isn’t enough, there is a self-test available in tools/testing/selftests/capabilities/ too.

PowerPC and Tile support for seccomp filter

Michael Ellerman added support for seccomp to PowerPC, and Chris Metcalf added support to Tile. As the seccomp maintainer, I get excited when an architecture adds support, so here we are with two. Also included were updates to the seccomp self-tests (in tools/testing/selftests/seccomp), to help make sure everything continues working correctly.

That’s it for v4.3. If I missed stuff you found interesting, please let me know! I’m going to try to get more per-version posts out in time to catch up to v4.8, which appears to be tentatively scheduled for release this coming weekend. Next: v4.4.

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

September 26, 2016 10:54 PM

LPC 2016: Refereed Talks now Posted to Plumbers Schedule

The Linux Plumbers conference schedule has now been updated to include the accepted refereed talk proposals.  As usual, we’ve tried to make sure the conflicts are minimised, but if anyone needs a change to the timing of their talk, please email

September 26, 2016 08:50 PM

September 25, 2016

Gustavo F. Padovan: My talk about Mainline Explicit Fencing at XDC 2016!

Last week I was at XDC in Helsinki where I presented the Explicit Fencing work we’ve been doing on the mainline Linux kernel over the last few months. There was a livestream of all presentations during the conference and recordings are available. You can check the video of my presentation. Check out the slides too.

If you want to check the code we’ve been writing, it is available here:

Linux Kernel:




Soon we will get Explicit Fencing on Android’s drm_hwcomposer as well so expect updates on this blog with more information about that. :)

Also I would like to take the opportunity to thank Collabora for sponsoring my travel to XDC and Martin Peres for organizing such a great conference. It was my first time attending XDC and my time there was absolutely great. I have learnt a lot about what the Graphics community has been doing lately and I met the people doing this work. I was happy to see a lot of interest from many people in the Explicit Fencing work we’ve been doing.


September 25, 2016 09:44 PM

September 24, 2016

Pavel Machek: Audio fun

Documentation for audio on Linux... is pretty much nonexistent.


There is a hidden pointer somewhere in this text to a page containing deeper information about using audio. You should have perfect understanding about the features described in this page before jumping into more complicated information. Just make sure you read this text carefully enough so you will be able to find the link.
Oh, thank you, so we are now on treasure hunt?
Under construction!
This page is currently being written. A more complete version should be released shortly.
Last updated Fri 16 Aug 1996 (minor changes).
Seems like the complete page is not going to be available any time soon.
Still, that was the best page explaining how audio is supposed to work on Linux. Ouch. I could not get ALSA to work. OSS works fine. (I guess that also says a bit about the state of audio on Linux.) And then I discovered that the modem does not work in kernel 4.8, so my problems were not pulseaudio problems but modem problems. Oh well.

September 24, 2016 10:05 AM

September 19, 2016

LPC 2016: Preliminary Microconference Schedule Up

Every year we get a number of constraints on Microconferences which we try hard to accommodate.  Accounting for all of those, we’ve put the preliminary schedule up here.  If you notice any problems, please email and we’ll try to fix it

Also note, this is preliminary; the Microconferences may still move around as we get requests to change them. Also note that the times of talks within Microconferences are highly likely to change (please see the MC leaders if you want this to change).

September 19, 2016 07:00 PM

September 10, 2016

LPC 2016: Git microconference accepted into LPC 2016

The Linux kernel community has been using Git for more than a decade, but it is still under active development, with more than 2,000 non-merge commits from almost 200 contributors over the past year. Rather than review this extensive history, this Micro Git Together instead focuses on what the next few years might bring. In addition, Junio Hamano will present on the state of the Git Union, Josh Triplett will present on the git-series project, and Steve Rostedt will present “A Maze Of Git Scripts All Alike”, in which Steve puts forward the radical notion that common function in various maintainers’ scripts could be pulled into Git itself. This should help lead into a festive discussion about the future of Git.

Please join us for an important discussion!

September 10, 2016 06:20 PM

Paul E. Mc Kenney: Git Microconference Accepted into 2016 Linux Plumbers Conference

The Linux kernel community has been using Git for more than a decade, but it is still under active development, with more than 2,000 non-merge git commits from almost 200 contributors over the past year. Rather than review this extensive history, this Micro Git Together instead focuses on what the next few years might bring. In addition, Junio will present on the state of the Git Union, Josh Triplett will present on the git-series project, and Steve Rostedt will present "A Maze Of Git Scripts All Alike", in which Steve puts forward the radical notion that common function in various maintainers' scripts could be pulled into git itself. This should help lead into a festive discussion about the future of git.

Please join us for an important and festive discussion!

September 10, 2016 10:25 AM

September 09, 2016

Mel Gorman: Stabilising performance after a major kernel revision

A topic related to upstreaming patches on kernel forks related to embedded platforms is currently being discussed for Kernel Summit 2016. This is an age-old topic related to whether it is better to work upstream and backport or apply patches to a product-specific kernel and worry about forward-porting later. The points being raised have not changed over the years and still come down to getting something out the door quickly versus long-term maintenance overhead. I’m not directly affected so had nothing new to add to the thread.

However, I’ve had recent experience stabilising the performance of an upstream kernel after a major kernel revision in the context of a distribution kernel. The kernel in question follows an upstream-first-and-then-backport policy with very rare exceptions. The backports are almost always related to hardware enablement but performance-related patches are also cherry-picked, which is my primary concern as Performance Team Lead. The difficulty we face is that the distribution kernel is faster than both the baseline upstream stable kernel and the mainline kernel we rebase to for a new release. There are usually multiple root causes and, because of the cherry-picking, it’s not a simple case of bisecting.

Performance is always workload and hardware specific so I’m not going to get into the performance figures and profiles used to make decisions but the patches in question are on a public git tree if someone was sufficiently motivated. There may be an attempt to update the -stable kernel involved without a guarantee it’ll be picked up. Right now, it’s still a work in progress but this list gives an idea of the number of patches involved;

This is an incomplete list and it’s a single case that may or may not apply to other people and products. I do have anecdotal evidence that other companies carry far fewer patches when stabilising performance but in many cases, those same companies have a fixed set of well-known workloads whereas this is a distribution kernel for general use.

This is unrelated to the difficulties embedded vendors have when shipping a product but let’s just say that I have a certain degree of sympathy when a major kernel revision is required. That said, my experience suggests that the effort required to stabilise a major release periodically is lower than carrying ever-increasing numbers of backports that get harder and harder to backport.

September 09, 2016 10:56 AM

September 08, 2016

Pavel Machek: Security getting hard/impossible on recent systems

Cache attacks: this is not good. Ok, so we have a rowhammer: basically very common, hard-to-work-around, hardware problem. Bits in your memory may flip. Deal with it.

And now, there are cache attacks, too. Users should not be able to spy on each other on a multiuser system, but they very probably can. In particular, other users can tell which parts of emacs you are executing, and when. They probably cannot distinguish which characters you are typing, but they can probably learn when you are typing a space, a normal letter, or moving the cursor. Ouch. And if they indeed can spy on individual characters... you can hardly blame emacs. With a plain keyboard, a cache attack on individual letters is probably not feasible. With a T9-like system on a touchscreen... it probably is. Deal with it. But how?

September 08, 2016 10:46 AM

Pavel Machek: fcam-dev now gets autofocus on 4.7 kernel

Ok, without proper timing support, everything is really, really slow, but hey - I already got one usable photo out of the system :-).

Oh, and this is the reason to run Debian on your phone: .

September 08, 2016 10:36 AM

Pavel Machek: 25 years of Linux

25 years of linux and yes, I know Linux is popular. Still it was unexpected when I was asked in public transport if I know about Linux. Man wanted me to help with X restarting due to bad graphics drivers... I asked how he realized... and he told me about my T-shirt. I realized I have UnitedLinux T-shirt on... Given SCO's involvement in that one... should I burn the shirt?

September 08, 2016 10:32 AM

Pavel Machek: ext4 encryption incompatible with grub

You encrypt a directory -- sounds easy, right? Support is in 4.4 kernel, my machines run newer kernels than that. Encrypting root would be hard, but encrypting parts of data partition should be easy.

Ok, lets follow howto... Need to do tune2fs. Right. Aha, still does not work, looks like I'll need to reboot.
Hmm. Will not boot. Grub no longer recognizes my /data partition, and that's where new kernels are. Old kernels are in /boot, but those are now useless. Lets copy new kernel on machine using USB stick. Does not boot. Fun.
tune2fs on root filesystem is useless, as it is too old. New one is ... on the data partition. Right. Ok, lets bring newer version of tune2fs in. "encryption" feature can not be cleared.
Argh! Come on, I did not even create single encrypted directory on the partition. I want the damn bit to go off, so I can go back to working configuration. "Old kernels can not read encrypted files" sounds ok, but "old kernels can not mount filesystem at all" is not acceptable here :-(.

Ok, it seems it is possible to go back, as long as encryption was not actually used. fsck -fn; debugfs -w -R "feature -encrypt" /dev/device; fsck -fn;. I guess I was too optimistic. Using ext4 encryption would require at least new e2fsprogs at the root filesystem, which was something I was hoping to avoid.

September 08, 2016 10:31 AM

Pavel Machek: Anyone with x60 and working gigabit?

On the lists, I was told that I probably have broken wire inside my notebook. I believe broken wires simply don't happen, so... is there anyone with working gigabit on x60?

September 08, 2016 10:28 AM

September 07, 2016

LPC 2016: Limited number of LPC registrations available starting September 8

LPC registration will open up on September 8 at noon Eastern Time (EDT) with a very limited number of slots available. Those interested in attending the conference who have not yet registered will want to visit the registration web site after that time. There will also be a very limited number of late registrations that will be available starting on October 1.

Another way to get a pass to the nearly sold out conference would be to submit a refereed track proposal before September 8. Each accepted talk will get one free pass to LPC.

September 07, 2016 01:49 PM

September 06, 2016

LPC 2016: Audio workshop accepted for Linux Plumbers Conference and Kernel Summit

Audio is an increasingly important component of the Linux plumbing, given increased use of Linux for media workloads and of the Linux kernel for smartphones. Topics include low-latency audio, use of the clock API, propagating digital configuration through dynamic audio power management (DAPM), integration of HDA and ASoC, SoundWire, ALSA use-case manager (UCM) scalability, standardizing HDMI and DisplayPort interfaces, Media Controller API integration, and a number of topics relating to the multiple userspace users of Linux-kernel audio, including Android and ChromeOS as well as the various desktop-oriented Linux distributions.

As with many Linux-kernel components, upstreaming of vendor drivers and handling of stable and long-term-stable (LTS) trees are also important topics.

Please join us for a timely and important discussion!

September 06, 2016 04:38 PM

Greg Kroah-Hartman: 4.9 == next LTS kernel

As I briefly mentioned a few weeks ago on my G+ page, the plan is for the 4.9 Linux kernel release to be the next “Long Term Supported” (LTS) kernel.

Last year, at the Linux Kernel Summit, we discussed just how to pick the LTS kernel. Many years ago, we tried to let everyone know ahead of time what the kernel version would be, but that caused a lot of problems as people threw crud in there that really wasn’t ready to be merged, just to make it easier for their “day job”. That was many years ago, and people insist they aren’t going to do this again, so let’s see what happens.

I reserve the right to not pick 4.9 and support it for two years, if it’s a major pain because people abused this notice. If so, I’ll possibly drop back to 4.8, or just wait for 4.10 to be released. I’ll let everyone know by updating the releases page when it’s time (many months from now.)

If people have questions about this, email me and I will be glad to discuss it.

September 06, 2016 07:59 AM

September 05, 2016

Gustavo F. Padovan: Mainline Explicit Fencing – part 1

When it comes to buffer sharing synchronization in the kernel there are two ways of doing it: Implicit Fencing and Explicit Fencing. The difference between them is whether the kernel shares synchronization information with userspace: it is either implicit, with no fencing information provided, or explicit, with all information available to userspace.

The fencing synchronization mechanism allows buffers to be shared without the risk of a driver or userspace reading an incomplete buffer or writing to a buffer that is still in use somewhere else in the system. Fencing orders these operations so that reads or writes happen only when the buffer is no longer used by other drivers. For example, when a GPU job is queued a fence is associated with the buffer in the job. That fence can be used by other drivers for synchronization purposes: they won’t use the buffer until a signal from the fence is received. The signal means the buffer is now free to be used. Similarly, the GPU driver can wait for the buffer to come off the screen before rendering to it again.

The central piece here is the fence, an element that is attached to each buffer whenever a request involving the buffer is sent to the kernel. The fence can be used by userspace or other drivers to wait for the work to finish. So once the work is finished the fence signals and the waiter can proceed and do whatever they want with the buffer.

While Implicit Fencing  helps a lot with buffer synchronization there are a few cases where the whole desktop compositing could stall. Imagine the following compositor flow: there are 3 buffers to process, A, B and C. A and B are sent for rendering in parallel while C is going to be composed of both A and B. But the compositor will only be notified when both buffers are rendered thus if B takes too long the compositing of the whole desktop will be blocked waiting for B and C won’t be displayed in time.


A compositor processing two buffers in parallel, with Implicit Fencing if B takes too long the desktop compositor freezes.

However with Explicit Fencing the compositor has one fence for each buffer and is notified when each buffer is rendered. So if A renders fast and B takes too long the compositor can decide not to wait for B and proceed with the scanout of C using buffer A and an old version of B. The fencing information allows the compositor to be smart and make decisions that, for example, keep the screen from freezing.
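The difference between the two compositor behaviours can be sketched in a few lines of Python. This is only a toy model under stated assumptions: Fence, Buffer, implicit_ready and pick_sources are hypothetical names invented here, and a fence is reduced to a boolean flag rather than a kernel object.

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Toy fence: just a flag that the producer sets when rendering is done."""
    signaled: bool = False

    def signal(self):
        self.signaled = True


@dataclass
class Buffer:
    name: str
    version: int
    fence: Fence = field(default_factory=Fence)


def implicit_ready(pending):
    """Implicit model: the compositor only hears back once everything is done."""
    return all(buf.fence.signaled for buf in pending.values())


def pick_sources(pending, previous):
    """Explicit model: per-buffer fences let the compositor fall back to the
    previous frame's buffer for anything that has not finished rendering."""
    return {
        name: buf if buf.fence.signaled else previous[name]
        for name, buf in pending.items()
    }


# Previous frame: both buffers fully rendered.
previous = {"A": Buffer("A", 1, Fence(True)), "B": Buffer("B", 1, Fence(True))}
# New frame: A has finished rendering, B is still busy.
pending = {"A": Buffer("A", 2), "B": Buffer("B", 2)}
pending["A"].fence.signal()

chosen = pick_sources(pending, previous)
print(implicit_ready(pending), {n: b.version for n, b in chosen.items()})
```

With only the all-or-nothing signal the frame is not ready, while the per-buffer fences let the compositor build C from the new A and the old B.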

As of today the Linux Kernel only has generic APIs for Implicit Fencing. Some drivers already have Explicit Fencing, but their APIs are device specific. Android currently has its own implementation through the Android Sync Framework – which will be explained in the next article.

Explicit Fencing works in a Producer-Consumer fashion. In a GPU rendering + scanout to the screen pipeline it synchronizes between the kernel drivers: when submitting a new rendering job to the GPU (Producer side), userspace gets back a fence related to the submitted buffer. That means userspace doesn’t need to block waiting for the job to complete; a signal is sent when the job is finished. As userspace doesn’t need to block and has a fence for the buffer, it can proceed right away with the syscall that asks the display hardware (Consumer) to scan out the buffer that is yet to be processed. With explicit fencing the kernel is taught to wait for the fence to signal before starting the scanout process.

A new fence is returned to userspace when the buffer is submitted to the kernel for scanout on the display hardware. That fence will signal when the buffer is no longer being displayed and is thus ready for reuse by another rendering job. When userspace gets this fence back it can submit a new rendering job to the GPU without waiting. The wait is done on the kernel side by the GPU driver: once the fence signals, rendering on that buffer can begin.
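The render and scanout hand-off described above can be modelled as a small event-driven simulation. This is a sketch under assumptions, not the kernel API: Fence and queue_job are hypothetical names, "the kernel" is a callback list, and completion of a job is represented by someone calling signal() on its fence.

```python
class Fence:
    """Toy fence that runs queued callbacks when it signals."""

    def __init__(self, label):
        self.label = label
        self.signaled = False
        self._waiters = []

    def add_callback(self, fn):
        # Run immediately if already signaled, otherwise defer.
        if self.signaled:
            fn()
        else:
            self._waiters.append(fn)

    def signal(self):
        self.signaled = True
        waiters, self._waiters = self._waiters, []
        for fn in waiters:
            fn()


def queue_job(kind, buf, log, wait_on=None):
    """Queue a job and return its completion fence without blocking.

    The job only starts once wait_on has signaled; that wait happens
    "kernel side" (here: in a callback), never in the caller.
    """
    out = Fence(f"{kind}:{buf}")
    log.append(f"{kind} queued {buf}")

    def start():
        log.append(f"{kind} started {buf}")

    if wait_on is None:
        start()
    else:
        wait_on.add_callback(start)
    return out


log = []
render_done = queue_job("render", "fb0", log)                    # producer (GPU)
released = queue_job("scanout", "fb0", log, wait_on=render_done)  # consumer (display)
log.append("userspace continues")
render_done.signal()   # GPU finishes: scanout may start now
next_render = queue_job("render", "fb0", log, wait_on=released)
released.signal()      # buffer left the screen: the next render may start
print("\n".join(log))
```

Userspace never blocks in this model: every call returns a fence right away, and the ordering between render and scanout is enforced by whoever consumes the fence.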

Explicit Fencing

The fence travels all the way to userspace and the next element on the pipeline. The yellow arrows represent the fences on userspace.

Last but not least, debuggability of the graphics pipeline is improved. Having access to the fence in userspace helps a lot in understanding what is happening in the pipeline. Previously, with Implicit Fencing there was no information available, so it was hard to figure out what was happening in the pipeline; also, each vendor was trying to implement their own Implicit Fencing mechanism. Now with a standard Explicit Fencing mechanism it is easier to build debug/tracing infrastructure that can be used to investigate issues in any system.

The next article will explain the Android Sync Framework and later the work on mainline to support explicit fencing will be described.

September 05, 2016 09:15 PM

September 04, 2016

Paul E. Mc Kenney: Audio Workshop Accepted into 2016 Linux Kernel Summit and Linux Plumbers Conference

Audio is an increasingly important component of the Linux plumbing, given increased use of Linux for media workloads and of the Linux kernel for smartphones. Topics include low-latency audio, use of the clock API, propagating digital configuration through dynamic audio power management (DAPM), integration of HDA and ASoC, SoundWire, ALSA use-case manager (UCM) scalability, standardizing HDMI and DisplayPort interfaces, Media Controller API integration, and a number of topics relating to the multiple userspace users of Linux-kernel audio, including Android and ChromeOS as well as the various desktop-oriented Linux distributions.

As with many Linux-kernel components, upstreaming of vendor drivers and handling of stable and long-term-stable (LTS) trees are also important topics.

Please join us for a timely and important discussion!

September 04, 2016 11:46 AM

September 02, 2016

LPC 2016: Submission deadline for LPC refereed track proposals extended by a week

The deadline for submitting refereed track proposals for the 2016 Linux Plumbers Conference has been extended until September 8, 2016 at 11:59PM CET. The refereed track will have 50-minute presentations on a specific aspect of Linux “plumbing” (e.g. core libraries, media creation/playback, display managers, init systems, kernel APIs/ABIs, etc.) that are chosen by the LPC committee to be given during the four days of the conference.

Registration for the conference has largely sold out at this point, but accepted talks for the refereed track will receive one free pass to the conference.

September 02, 2016 02:12 PM

Pete Zaitcev: Russian Joke

Supposedly from, via

Author's Bio: Andrey Pan'gin [ref — zaitcev]. Programmer in the Odnoklassniki company, specializing in highly loaded back-ends. Knows JVM like the back of his hand, since he developed the HotSpot VM at Sun Microsystems and Oracle for several years. Loves assembly and systems programming.
A comment: Fallen angel.

September 02, 2016 04:12 AM

August 31, 2016

Vegard Nossum: Debugging a kernel crash found by syzkaller

Having done quite a bit of kernel fuzzing and debugging lately I’ve decided to take one of the very latest crashes and write up the whole process from start to finish as I work through it. As you will see, I'm not very familiar with the site of this particular crash, the block layer. Being familiar with some existing kernel code helps, of course, since you recognise a lot of code patterns, but the kernel is so large that nobody can be familiar with everything and the crashes found by trinity and syzkaller can show up almost anywhere.

So I got this with syzkaller after running it for a few hours:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff880110762cc0 task.stack: ffff880102290000
RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880102297cd0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90000bb4000
RDX: 0000000000000097 RSI: 0000000000000000 RDI: 00000000000004b8
RBP: ffff880102297cd8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88011a010a90
R13: ffff88011a594568 R14: ffff88011a010890 R15: 7fffffffffffffff
FS: 00007f2445174700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200047c8 CR3: 0000000107eb5000 CR4: 00000000000006f0
DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
1ffff10020452f9e ffff880102297db8 ffffffff81508daa 0000000000000000
0000000041b58ab3 ffffffff844e89e1 ffffffff81508b30 ffffed0020452001
7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff
Call Trace:
[<ffffffff81508daa>] __filemap_fdatawrite_range+0x27a/0x2e0
[<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
[<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
[<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
[<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
[<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
[<ffffffff817341b4>] iterate_bdevs+0x194/0x210
[<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff817196be>] sys_sync+0xce/0x160
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
[<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
[<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
RIP [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
RSP <ffff880102297cd0>
The very first thing to do is to look up the code in the backtrace:
$ addr2line -e vmlinux -i ffffffff81f04b7a ffffffff81508daa ffffffff81508e36 ffffffff817191b0 ffffffff817341b4 ffffffff817196be
The actual site of the crash is this:
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
	return bdev->bd_disk->queue;	/* this is never NULL */
}
Because we’re using KASAN we can’t look at CR2 to find the bad pointer because KASAN triggers before the page fault (or to be completely honest, KASAN tries to access the shadow memory for the bad pointer, which is itself a bad pointer and causes the GPF above).

Let’s look at the “Code:” line to try to find the exact dereference causing the error:
$ echo 'Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00 ' | scripts/decodecode 
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
All code
0: 89 fa mov %edi,%edx
2: 48 c1 ea 03 shr $0x3,%rdx
6: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
a: 75 35 jne 0x41
c: 48 8b 9b e0 00 00 00 mov 0xe0(%rbx),%rbx
13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1a: fc ff df
1d: 48 8d bb b8 04 00 00 lea 0x4b8(%rbx),%rdi
24: 48 89 fa mov %rdi,%rdx
27: 48 c1 ea 03 shr $0x3,%rdx
2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2f: 75 17 jne 0x48
31: 48 8b 83 b8 04 00 00 mov 0x4b8(%rbx),%rax
38: 5b pop %rbx
39: 5d pop %rbp
3a: 48 05 10 02 00 00 add $0x210,%rax
I’m using CONFIG_KASAN_INLINE=y so most of the code above is actually generated by KASAN which makes things a bit harder to read. The movabs with a weird 0xdffff… address is how it generates the address for the shadow memory bytemap and the cmpb that crashed is where it tries to read the value of the shadow byte.
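As a quick cross-check of the decodecode output, the byte wrapped in angle brackets on the "Code:" line marks where RIP pointed, and its position should equal the offset flagged with an asterisk (2b). A small Python sketch, where trapping_offset is a hypothetical helper written for this illustration:

```python
# The "Code:" line copied from the oops above.
CODE = ("Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 "
        "48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa "
        "48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d "
        "48 05 10 02 00 00")


def trapping_offset(code_line):
    """Return the index of the byte marked <xx>, i.e. the faulting RIP."""
    tokens = code_line.replace("Code:", "").split()
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return index
    return None


print(hex(trapping_offset(CODE)))
```

The marked byte sits 0x2b bytes into the dump, which is exactly the cmpb that decodecode labels as the trapping instruction.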

The address is %rdx + %rax and we know that %rax is 0xdffffc0000000000. Let’s look at %rdx in the crash above… RDX: 0000000000000097; yup, that’s a NULL pointer dereference all right.
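The arithmetic can be checked with a short Python sketch. The constants are read off the oops and the disassembly (the movabs value and the shr $0x3); the function names are hypothetical, and the canonical-address test assumes 48-bit virtual addresses, as on x86-64 with 4-level paging.

```python
KASAN_SHADOW_OFFSET = 0xdffffc0000000000  # value loaded by the movabs (%rax)
KASAN_SHADOW_SCALE_SHIFT = 3              # shr $0x3: one shadow byte per 8 bytes
MASK64 = (1 << 64) - 1


def shadow_address(addr):
    """Address of the KASAN shadow byte that covers addr."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def is_canonical(addr):
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = addr >> 47
    return top in (0, 0x1ffff)


rdi = 0x4b8                      # lea 0x4b8(%rbx),%rdi with %rbx == 0
rdx = rdi >> KASAN_SHADOW_SCALE_SHIFT
bad_shadow = shadow_address(rdi)

r12 = 0xffff88011a010a90         # a real kernel pointer from the register dump
good_shadow = shadow_address(r12)

print(hex(rdx), hex(bad_shadow), is_canonical(bad_shadow))
print(hex(good_shadow), is_canonical(good_shadow))
```

The near-NULL pointer maps to a shadow address that is not canonical, which is why the CPU raises a general protection fault instead of a page fault, whereas a genuine kernel pointer maps into the valid shadow region.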

But the line in question has two pointer dereferences, bdev->bd_disk and bd_disk->queue, and which one is the crash? The lea 0x4b8(%rbx), %rdi is what gives it away, since that gives us the offset into the structure that is being dereferenced (also, NOT coincidentally, %rbx is 0). Let’s use pahole:
$ pahole -C 'block_device' vmlinux
struct block_device {
dev_t bd_dev; /* 0 4 */
int bd_openers; /* 4 4 */
struct inode * bd_inode; /* 8 8 */
struct super_block * bd_super; /* 16 8 */
struct mutex bd_mutex; /* 24 128 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
void * bd_claiming; /* 152 8 */
void * bd_holder; /* 160 8 */
int bd_holders; /* 168 4 */
bool bd_write_holder; /* 172 1 */

/* XXX 3 bytes hole, try to pack */

struct list_head bd_holder_disks; /* 176 16 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct block_device * bd_contains; /* 192 8 */
unsigned int bd_block_size; /* 200 4 */

/* XXX 4 bytes hole, try to pack */

struct hd_struct * bd_part; /* 208 8 */
unsigned int bd_part_count; /* 216 4 */
int bd_invalidated; /* 220 4 */
struct gendisk * bd_disk; /* 224 8 */
struct request_queue * bd_queue; /* 232 8 */
struct list_head bd_list; /* 240 16 */
/* --- cacheline 4 boundary (256 bytes) --- */
long unsigned int bd_private; /* 256 8 */
int bd_fsfreeze_count; /* 264 4 */

/* XXX 4 bytes hole, try to pack */

struct mutex bd_fsfreeze_mutex; /* 272 128 */
/* --- cacheline 6 boundary (384 bytes) was 16 bytes ago --- */

/* size: 400, cachelines: 7, members: 21 */
/* sum members: 389, holes: 3, sum holes: 11 */
/* last cacheline: 16 bytes */
0x4b8 is 1208 in decimal, which is way bigger than this struct. Let’s try the other one:
$ pahole -C 'gendisk' vmlinux
struct gendisk {
int major; /* 0 4 */
int first_minor; /* 4 4 */
int minors; /* 8 4 */
char disk_name[32]; /* 12 32 */

/* XXX 4 bytes hole, try to pack */

char * (*devnode)(struct gendisk *, umode_t *); /* 48 8 */
unsigned int events; /* 56 4 */
unsigned int async_events; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct disk_part_tbl * part_tbl; /* 64 8 */
struct hd_struct part0; /* 72 1128 */
/* --- cacheline 18 boundary (1152 bytes) was 48 bytes ago --- */
const struct block_device_operations * fops; /* 1200 8 */
struct request_queue * queue; /* 1208 8 */
/* --- cacheline 19 boundary (1216 bytes) --- */
void * private_data; /* 1216 8 */
int flags; /* 1224 4 */

/* XXX 4 bytes hole, try to pack */

struct kobject * slave_dir; /* 1232 8 */
struct timer_rand_state * random; /* 1240 8 */
atomic_t sync_io; /* 1248 4 */

/* XXX 4 bytes hole, try to pack */

struct disk_events * ev; /* 1256 8 */
struct kobject integrity_kobj; /* 1264 64 */
/* --- cacheline 20 boundary (1280 bytes) was 48 bytes ago --- */
int node_id; /* 1328 4 */

/* XXX 4 bytes hole, try to pack */

struct badblocks * bb; /* 1336 8 */
/* --- cacheline 21 boundary (1344 bytes) --- */

/* size: 1344, cachelines: 21, members: 20 */
/* sum members: 1328, holes: 4, sum holes: 16 */
1208 is ->queue, so that fits well with what we’re seeing; therefore, bdev->bd_disk must be NULL.
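The same lookup can be written down as a tiny Python sketch. The offsets and sizes are copied from the pahole output above (only the relevant members are listed); member_at is a hypothetical helper for this illustration.

```python
# (name, offset, size) triples taken from the pahole listings above.
BLOCK_DEVICE = [("bd_disk", 224, 8), ("bd_queue", 232, 8)]
BLOCK_DEVICE_SIZE = 400
GENDISK = [("fops", 1200, 8), ("queue", 1208, 8), ("private_data", 1216, 8)]


def member_at(layout, offset):
    """Return the name of the member covering offset, or None."""
    for name, start, size in layout:
        if start <= offset < start + size:
            return name
    return None


# mov 0xe0(%rbx),%rbx  -> first dereference, loads bdev->bd_disk
print(member_at(BLOCK_DEVICE, 0xe0))
# lea 0x4b8(%rbx),%rdi -> second dereference, address of bd_disk->queue
print(0x4b8 > BLOCK_DEVICE_SIZE, member_at(GENDISK, 0x4b8))
```

Both offsets in the disassembly line up: 0xe0 is bd_disk inside struct block_device, and 0x4b8 does not fit in that struct at all but lands exactly on queue inside struct gendisk.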

At this point I would go up the stack of function to see if anything sticks out – although unlikely, it’s possible that it’s an “easy” bug where you can tell just from looking at the code in a single function that it sets the pointer to NULL just before calling the function that crashed or something like that.

Probably the most interesting function in the stack trace (at a glance) is iterate_bdevs() in fs/block_dev.c:
void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
{
	struct inode *inode, *old_inode = NULL;

	spin_lock(&blockdev_superblock->s_inode_list_lock);
	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
		struct address_space *mapping = inode->i_mapping;

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
		    mapping->nrpages == 0) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&blockdev_superblock->s_inode_list_lock);
		/*
		 * We hold a reference to 'inode' so it couldn't have been
		 * removed from s_inodes list while we dropped the
		 * s_inode_list_lock We cannot iput the inode now as we can
		 * be holding the last reference and we cannot iput it under
		 * s_inode_list_lock. So we keep the reference and iput it
		 * later.
		 */
		iput(old_inode);
		old_inode = inode;

		func(I_BDEV(inode), arg);

		spin_lock(&blockdev_superblock->s_inode_list_lock);
	}
	spin_unlock(&blockdev_superblock->s_inode_list_lock);
	iput(old_inode);
}
I can’t quite put my finger on it, but it looks interesting because it has a bunch of locking in it and it seems to be what’s getting the block device from a given inode. I ran git blame on the file/function in question since that might point to a recent change there, but the most interesting thing is commit 74278da9f7 changing some locking logic. Maybe relevant, maybe not, but let’s keep it in mind.

Remember that bdev->bd_disk is NULL. Let’s try to check if ->bd_disk is assigned NULL anywhere:
$ git grep -n '\->bd_disk.*=.*NULL'
block/blk-flush.c:470: if (bdev->bd_disk == NULL)
drivers/block/xen-blkback/xenbus.c:466: if (vbd->bdev->bd_disk == NULL) {
fs/block_dev.c:1295: bdev->bd_disk = NULL;
fs/block_dev.c:1375: bdev->bd_disk = NULL;
fs/block_dev.c:1615: bdev->bd_disk = NULL;
kernel/trace/blktrace.c:1624: if (bdev->bd_disk == NULL)
This by no means necessarily includes the code that set ->bd_disk to NULL in our case (since there could be code that looks like x = NULL; bdev->bd_disk = x; which wouldn’t be found with the regex above), but this is a good start and I’ll look at the functions above just to see if it might be relevant. Actually, for this I’ll just add -W to the git grep above to quickly look at the functions.
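The limits of that grep can be illustrated with a short sketch. It uses Python's re module with the same pattern (the leading backslash in the shell command presumably only keeps grep from reading the pattern as an option). The stricter ASSIGN_PATTERN is a hypothetical refinement, not something used in the investigation above.

```python
import re

# Same pattern as the git grep above, minus the shell-level backslash.
GREP_PATTERN = re.compile(r"->bd_disk.*=.*NULL")

# Hypothetical stricter pattern: a single '=' that is not part of '=='.
ASSIGN_PATTERN = re.compile(r"->bd_disk\s*=(?!=)\s*NULL")

SAMPLES = [
    "if (bdev->bd_disk == NULL)",    # comparison, but the grep still hits
    "bdev->bd_disk = NULL;",         # direct assignment
    "x = NULL; bdev->bd_disk = x;",  # indirect assignment, invisible to both
]


def classify(line):
    """Return (matched by the grep pattern, matched as a direct assignment)."""
    return (bool(GREP_PATTERN.search(line)), bool(ASSIGN_PATTERN.search(line)))


for sample in SAMPLES:
    print(classify(sample), sample)
```

The comparison is a false positive for the broad pattern, and the indirect assignment is missed by both. That is why the grep is only a starting point.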

The first two and last hits are comparisons so they are uninteresting. The third and fourth ones are part of error paths in __blkdev_get(). That might be interesting if the process that crashed somehow managed to get a reference to the block device just after the NULL assignment (if so, that would probably be a locking bug in either __blkdev_get() or one of the functions in the crash stack trace – OR it might be a bug where the struct block_device * is made visible/reachable before it’s ready). The fifth one is in __blkdev_put(). I’m going to read over __blkdev_get() and __blkdev_put() to figure out what they do and if there’s maybe something going on in either of those.

In all these cases, it seems to me that &bdev->bd_mutex is locked; that’s a good sign. That’s also maybe an indication that we should be taking &bdev->bd_mutex in the other code path, so let’s check if we are. There’s nothing that I can see in any of the functions from inode_to_bdi() and up. Although inode_to_bdi() itself looks interesting, because that’s where the block device pointer comes from; it calls I_BDEV(inode) which returns a struct block_device *. Although if we follow the stack even further up, we see that fdatawrite_one_bdev() in fs/sync.c also knows about a struct block_device *. This by the way appears to be what is called through the function pointer in iterate_bdevs():
1908                 func(I_BDEV(inode), arg);
This in turn is called from the sync() system call. In other words, I cannot see any caller that takes &bdev->bd_mutex. There may yet be another mechanism (maybe a lock) intended to prevent somebody from seeing bdev->bd_disk == NULL, but this seems like a strong indication of what the problem might be.

Let’s try to figure out more about ->bd_mutex, maybe there’s some documentation somewhere telling us what it’s supposed to protect. There is this:
include/linux/fs.h=454=struct block_device {
include/linux/fs.h-455- dev_t bd_dev; /* not a kdev_t - it's a search key */
include/linux/fs.h-456- int bd_openers;
include/linux/fs.h-457- struct inode * bd_inode; /* will die */
include/linux/fs.h-458- struct super_block * bd_super;
include/linux/fs.h:459: struct mutex bd_mutex; /* open/close mutex */
There is this:
include/linux/genhd.h-681- * Any access of part->nr_sects which is not protected by partition
include/linux/genhd.h:682: * bd_mutex or gendisk bdev bd_mutex, should be done using this
include/linux/genhd.h-683- * accessor function.
include/linux/genhd.h-684- *
include/linux/genhd.h-685- * Code written along the lines of i_size_read() and i_size_write().
include/linux/genhd.h-686- * CONFIG_PREEMPT case optimizes the case of UP kernel with preemption
include/linux/genhd.h-687- * on.
include/linux/genhd.h-688- */
include/linux/genhd.h=689=static inline sector_t part_nr_sects_read(struct hd_struct *part)
And there is this:
include/linux/genhd.h:712: * Should be called with mutex lock held (typically bd_mutex) of partition
include/linux/genhd.h-713- * to provide mutual exlusion among writers otherwise seqcount might be
include/linux/genhd.h-714- * left in wrong state leaving the readers spinning infinitely.
include/linux/genhd.h-715- */
include/linux/genhd.h-716-static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
Under Documentation/ there is also this:
--------------------------- block_device_operations -----------------------
locking rules:
                        bd_mutex
open:                   yes
release:                yes
ioctl:                  no
compat_ioctl:           no
direct_access:          no
media_changed:          no
unlock_native_capacity: no
revalidate_disk:        no
getgeo:                 no
swap_slot_free_notify:  no      (see below)
Looking at __blkdev_get() again, there’s also one comment above it hinting at locking rules:
1233 /*                  
1234 * bd_mutex locking:
1235 *
1236 * mutex_lock(part->bd_mutex)
1237 * mutex_lock_nested(whole->bd_mutex, 1)
1238 */
1240 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
__blkdev_get() is called as part of blkdev_get(), which is what is called when you open a block device. In other words, it seems likely that we may have a race between opening/closing a block device and calling sync() – although for the sync() call to reach the block device, we should have some inode open on that block device (since we start out with an inode that is mapped to a block device with I_BDEV(inode)).

Looking at the syzkaller log file, there is a sync() call just before the crash, and I also see references to [sr0] unaligned transfer (and sr0 is a block device, so that seems slightly suspicious):
2016/08/25 05:45:02 executing program 0:
mmap(&(0x7f0000001000)=nil, (0x4000), 0x3, 0x31, 0xffffffffffffffff, 0x0)
mbind(&(0x7f0000004000)=nil, (0x1000), 0x8003, &(0x7f0000002000)=0x401, 0x9, 0x2)
shmat(0x0, &(0x7f0000001000)=nil, 0x4000)
dup2(0xffffffffffffffff, 0xffffffffffffff9c)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
clock_gettime(0x0, &(0x7f0000000000)={0x0, 0x0})
sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
kasan: CONFIG_KASAN_INLINE enabled
2016/08/25 05:45:03 result failed=false hanged=false:

2016/08/25 05:45:03 executing program 1:
mmap(&(0x7f0000002000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)
readahead(r0, 0xcb84, 0x10001)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
syz_open_dev$mixer(&(0x7f0000002000-0x8)="2f6465762f6d6978657200", 0x0, 0x86000)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x6, 0x12, r0, 0x0)
mount$fs(&(0x7f0000001000-0x6)="6d73646f7300", &(0x7f0000001000-0x6)="2e2f62757300", &(0x7f0000001000-0x6)="72616d667300", 0x880, &(0x7f0000000000)="1cc9417348")
kasan: GPF could be caused by NULL-ptr deref or user memory access
Here we see both the sync() call and the syz_open_dev$sr() call and we see that the GPF seems to happen some time shortly after opening sr0:
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)

The path argument is hex-encoded; decoding it in Python 2 gives the device name:
>>> "2f6465762f73723000".decode('hex')
'/dev/sr0\x00'
There’s also a mount$fs() call there that looks interesting. Its arguments are:
>>> "6d73646f7300".decode('hex')
'msdos\x00'
>>> "2e2f62757300".decode('hex')
'./bus\x00'
>>> "72616d667300".decode('hex')
'ramfs\x00'
However, I can’t see any references to any block devices in fs/ramfs, so I think this is unlikely to be it. I do still wonder how opening /dev/sr0 can do anything for us if it doesn’t have a filesystem or even a medium. [Note from the future: block devices are represented as inodes on the “bdev” pseudo-filesystem. Go figure!] Grepping for sr0 in the rest of the syzkaller log shows this bit, which seems to indicate we do in fact have inodes for sr0:
VFS: Dirty inode writeback failed for block device sr0 (err=-5).
Grepping for “Dirty inode writeback failed”, I find bdev_write_inode() in fs/block_dev.c, called only from… __blkdev_put(). It definitely feels like we’re on to something now – maybe a race between sync() and open()/close() for /dev/sr0.
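The `.decode('hex')` idiom used on the syzkaller string arguments above only exists in Python 2. A minimal Python 3 equivalent might look like the sketch below; decode_syz_string is a made-up helper name, and stripping the trailing NUL byte is a convenience choice rather than anything syzkaller prescribes.

```python
def decode_syz_string(hex_string):
    """Decode a hex-encoded, NUL-terminated string argument from the log."""
    raw = bytes.fromhex(hex_string)
    return raw.rstrip(b"\x00").decode("ascii")


# String arguments copied from the syzkaller programs quoted above.
for arg in (
    "2f6465762f73723000",
    "6d73646f7300",
    "2e2f62757300",
    "72616d667300",
):
    print(arg, "->", decode_syz_string(arg))
```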

syzkaller comes with some scripts to rerun the programs from a log file. I’m going to try that and see where it gets us – if we can reproduce the crash. I’ll first try to convert the two programs (the one with sync() and the one with the open(/dev/sr0)) to C and compile them. If that doesn’t work, syzkaller also has an option to auto-reproduce based on all the programs in the log file, but that’s likely slower and not always likely to succeed.

I use syz-prog2c and launch the two programs in parallel in a VM, but it doesn’t show anything at all. I switch to syz-repro to see if it can reproduce anything given the log file, but this fails too. I see that there are other sr0-related messages in the kernel log, so there must be a way to open the device without just getting ENOMEDIUM. I do a stat on /dev/sr0 to find the device numbers:
$ stat /dev/sr0 
File: ‘/dev/sr0’
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 7867 Links: 1 Device type: b,0
So the device major is 0xb (11 decimal). We can find this in include/uapi/linux/major.h and it gives us:
include/uapi/linux/major.h:#define SCSI_CDROM_MAJOR     11
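The hex-to-decimal step can be double-checked with a few lines of Python. This is just a sketch: parse_device_type is a hypothetical helper for the "Device type" field that stat prints, while os.makedev(), os.major() and os.minor() are the standard library's device-number helpers.

```python
import os


def parse_device_type(field):
    """Parse the 'major,minor' pair that stat prints in hexadecimal."""
    major_hex, minor_hex = field.split(",")
    return int(major_hex, 16), int(minor_hex, 16)


major, minor = parse_device_type("b,0")
print(major, minor)

# Round-trip through the combined device number to confirm the split.
dev = os.makedev(major, minor)
print(os.major(dev), os.minor(dev))
```

On a machine that actually has the device node, os.major(os.stat("/dev/sr0").st_rdev) should give the same major number without going through the stat command at all.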
We see that this is the driver responsible for /dev/sr0:
drivers/scsi/sr.c:      rc = register_blkdev(SCSI_CDROM_MAJOR, "sr");
(I could have guessed this as well, but there are so many systems and subsystems and drivers that I often double check just to make sure I’m in the right place.) I look for an open() function and I find two – sr_open() and sr_block_open(). sr_block_open() does cdrom_open() – from drivers/cdrom/cdrom.c – and this has an interesting line:
        /* if this was a O_NONBLOCK open and we should honor the flags,
* do a quick open without drive/disc integrity checks. */
if ((mode & FMODE_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
ret = cdi->ops->open(cdi, 1);
So we need to pass O_NONBLOCK to get the device to open. When I add this to the test program from the syzkaller log and run sync() in parallel… ta-da!
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 3 PID: 1333 Comm: sync1 Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff880114114080 task.stack: ffff880112bf0000
RIP: 0010:[<ffffffff8170654d>] [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
RSP: 0018:ffff880112bf7ca0 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff880112bf7d10 RCX: ffff8801141147d0
RDX: 0000000000000093 RSI: ffff8801170f8750 RDI: 0000000000000498
RBP: ffff880112bf7cd8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8801141147e8 R11: 0000000000000000 R12: ffff8801170f8750
R13: 0000000000000000 R14: ffff880112bf7d38 R15: ffff880112bf7d10
FS: 00007fd533aa2700(0000) GS:ffff88011ab80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000601028 CR3: 0000000112b04000 CR4: 00000000000006e0
ffff8801170f8750 0000000000000000 1ffff1002257ef9e ffff8801170f8950
ffff8801170f8750 0000000000000000 ffff880112bf7d10 ffff880112bf7db8
ffffffff81508d70 0000000000000000 0000000041b58ab3 ffffffff844e89e1
Call Trace:
[<ffffffff81508d70>] __filemap_fdatawrite_range+0x240/0x2e0
[<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
[<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
[<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
[<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
[<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
[<ffffffff817341b4>] iterate_bdevs+0x194/0x210
[<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff817196be>] sys_sync+0xce/0x160
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
[<ffffffff82001a47>] ? check_preemption_disabled+0x37/0x1e0
[<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
[<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 04 00 00 49 8d bd 98 04 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 63 30 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 83 04 00 00 4d 8b bd 98 04 00 00 48 b8 00 00
RIP [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
RSP <ffff880112bf7ca0>
---[ end trace 50fffb72f7adb3e5 ]---
This is not exactly the same oops that we saw before, but it’s close enough that it’s very likely to be a related crash. The reproducer is actually taking quite a while to trigger the issue, though. Even though I’ve reduced to two threads/processes executing just a handful of syscalls it still takes nearly half an hour to reproduce in a tight loop. I spend some time playing with the reproducer, trying out different things (read() instead of readahead(), just open()/close() with no reading at all, 2 threads doing sync(), etc.) to see if I can get it to trigger faster. In the end, I find that having many threads doing sync() in parallel seems to be the key to a quick reproducer, on the order of a couple of seconds.
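For orientation, the shape of such a reproducer can be sketched in Python. This is not the program used above (that one was generated from the syzkaller log and modified by hand). The thread count, the duration and the function names are assumptions, and whether Python threads hit the race window as readily is untested.

```python
import os
import threading
import time


def open_close_once(path):
    """Open `path` read-only with O_NONBLOCK and close it again.

    Returns True if the open succeeded and False otherwise, for example
    when the device node does not exist.
    """
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError:
        return False
    os.close(fd)
    return True


def run_reproducer(path="/dev/sr0", sync_threads=8, seconds=5.0):
    """Race open()/close() on `path` against many concurrent sync() calls."""
    deadline = time.monotonic() + seconds

    def sync_loop():
        while time.monotonic() < deadline:
            os.sync()

    def open_close_loop():
        while time.monotonic() < deadline:
            open_close_once(path)

    workers = [threading.Thread(target=sync_loop) for _ in range(sync_threads)]
    workers.append(threading.Thread(target=open_close_loop))
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

run_reproducer() is deliberately not called here. On an affected kernel it is meant to crash the machine, so it belongs in a throwaway VM.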

Now that I have a fairly small reproducer it should be a lot easier to figure out the rest. I can add as many printk()s as I need to validate my theory that sync() should be taking the bd_mutex. For cases like this I set up a VM so that I can start the VM and run the reproducer by running a single command. I also actually like to use trace_printk() instead of plain printk() and boot with ftrace_dump_on_oops on the kernel command line – this way, the messages don’t get printed until the crash actually happens (and have a lower probability of interfering with the race itself; printk() goes directly to the console, which is usually pretty slow).

I apply this patch and recompile the kernel:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..fb9d5c5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1292,6 +1292,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_part = NULL;
+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
bdev->bd_queue = NULL;
@@ -1372,6 +1373,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
@@ -1612,6 +1614,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)

bdev->bd_part = NULL;
+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
@@ -1905,6 +1908,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
old_inode = inode;

+ trace_printk("%p->bd_disk = %p\n", I_BDEV(inode), I_BDEV(inode)->bd_disk);
func(I_BDEV(inode), arg);

With this patch applied, I get this output on a crash:
   sync1-1343    3.... 8303954us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1340 0.... 8303955us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1343 3.... 8303961us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1335 1.... 8304043us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync2-1327 1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL
CPU: 2 PID: 1336 Comm: sync1 Not tainted 4.8.0-rc2+ #170
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff88011212d600 task.stack: ffff880112190000
RIP: 0010:[<ffffffff81f04c3a>] [<ffffffff81f04c3a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880112197cd0 EFLAGS: 00010202
Since __blkdev_put() is the very last line of output before the crash (and I don’t see any other call setting ->bd_disk to NULL in the last few hundred lines or so), there is a very strong indication that this is the problematic assignment. Rerunning this a couple of times shows that it tends to crash with the same symptoms every time.
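When the ftrace dump runs to hundreds of lines, finding the last NULL assignment can be automated. The sketch below parses lines in the format produced by the trace_printk() calls from the patch above. The regular expression and the helper name are assumptions fitted to the sample output, not a general ftrace parser.

```python
import re

# Matches lines such as:
#   sync2-1327 1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL
TRACE_LINE = re.compile(
    r"^(?P<task>\S+)-(?P<pid>\d+)\s+(?P<cpu>\d+)\S*\s+(?P<ts>\d+)us : "
    r"(?P<func>\w+): (?P<bdev>[0-9a-f]+)->bd_disk = (?P<value>\S+)$"
)


def last_null_writer(lines):
    """Return (function, task, pid, timestamp_us) of the last NULL store."""
    last = None
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match and match.group("value") == "NULL":
            last = (
                match.group("func"),
                match.group("task"),
                int(match.group("pid")),
                int(match.group("ts")),
            )
    return last


SAMPLE = [
    "sync1-1343 3.... 8303954us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880",
    "sync1-1340 0.... 8303955us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880",
    "sync1-1335 1.... 8304043us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880",
    "sync2-1327 1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL",
]

print(last_null_writer(SAMPLE))
```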

To get slightly more information about the context in which __blkdev_put() is called in, I apply this patch instead:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..298bf70 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,6 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)

bdev->bd_part = NULL;
+ trace_dump_stack(0);
bdev->bd_disk = NULL;
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
With that, I get the following output:
   <...>-1328    0.... 9309173us : <stack trace>
=> blkdev_close
=> __fput
=> ____fput
=> task_work_run
=> exit_to_usermode_loop
=> do_syscall_64
=> return_from_SYSCALL_64
CPU: 3 PID: 1352 Comm: sync1 Not tainted 4.8.0-rc2+ #171
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff88011248c080 task.stack: ffff880112568000
RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
One thing that’s a bit surprising to me is that this actually isn’t called directly from close(), but as deferred task work that runs when the task returns to user space (note task_work_run() in the trace). But in any case we can tell it comes from close() since fput() is called when closing a file descriptor.

Now that I have a fairly good idea of what’s going wrong, it’s time to focus on the fix. This is almost more difficult than what we’ve done so far because it’s such an open-ended problem. Of course I could add a brand new global spinlock to provide mutual exclusion between sync() and close(), but that would be a bad solution and the wrong thing to do. Usually the author of the code in question had a specific locking scheme or design in mind and the bug is just due to a small flaw or omission somewhere. In other words, it’s usually not a bug in the general architecture of the code (which might require big changes to fix), but a small bug somewhere in the implementation, which would typically require just a few changed lines to fix. It’s fairly obvious that close() is trying to prevent somebody else from seeing bdev->bd_disk == NULL by wrapping most of the __blkdev_put() code in the ->bd_mutex. This makes me think that it’s the sync() code path that is missing some locking.

Looking around __blkdev_put() and iterate_bdevs(), another thing that strikes me is that iterate_bdevs() is able to get a reference to a block device which is nevertheless in the process of being destroyed – maybe the real problem is that the block device is being destroyed too soon (while iterate_bdevs() is holding a reference to it). So it’s possible that iterate_bdevs() simply needs to formally take a reference to the block device by bumping its reference count while it does its work.

There is a function called bdgrab() which is supposed to take an extra reference to a block device – but only if you already have one. Thus, using this would be just as racy, since we’re not already formally holding a reference to it. Another function, bd_acquire(), seems to formally acquire a reference through a struct inode *. That seems quite promising. It is using the bdev_lock spinlock to prevent the block device from disappearing. I try this tentative patch:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..489473d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
+ struct block_device *bdev;

if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1905,7 +1906,11 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
old_inode = inode;

- func(I_BDEV(inode), arg);
+ bdev = bd_acquire(inode);
+ if (bdev) {
+ func(bdev, arg);
+ bdput(bdev);
+ }

My reasoning is that the call to bd_acquire() will prevent close() from actually reaching the bits in __blkdev_put() that do the final cleanup (i.e. setting bdev->bd_disk to NULL) and so prevent the crash from happening.

Unfortunately, running the reproducer again shows no change that I can see. It seems that I was wrong about this preventing __blkdev_put() from running: blkdev_close() calls blkdev_put() unconditionally, which calls __blkdev_put() unconditionally.

Another idea might be to remove the block device from the list that iterate_bdevs() is traversing before setting bdev->bd_disk to NULL. However, it seems that this is all handled by the VFS and we can’t really change it just for block devices.

Reading over most of fs/block_dev.c, I decide to fall back to my first (and more obvious) idea: take bd_mutex in iterate_bdevs(). This should be safe since both the s_inode_list_lock and inode->i_lock are dropped before calling the iterate_bdevs() callback function. However, I am still getting the same crash… On second thought, even taking bd_mutex is not enough because bdev->bd_disk will still be NULL when __blkdev_put() releases the mutex. Maybe there’s a condition we can test while holding the mutex that will tell us whether the block device is “useable” or not. We could test ->bd_disk directly, which is what we’re really interested in, but that seems like a derived property and not a real indication of whether the block device has been closed or not; ->bd_holders or ->bd_openers MAY be better candidates.

While digging around trying to figure out whether to check ->bd_disk, ->bd_holders, or ->bd_openers, I came across this comment in one of the functions in the crashing call chain:
 106 /**
107 * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
108 * @bdev: device
109 *
110 * Locates the passed device's request queue and returns the address of its
111 * backing_dev_info. This function can only be called if @bdev is opened
112 * and the return value is never NULL.
113 */
114 struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
115 {
116 struct request_queue *q = bdev_get_queue(bdev);
118 return &q->backing_dev_info;
119 }
In particular, the “This function can only be called if @bdev is opened” requirement seems to be violated in our case.

Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a fairly reliable test of whether it’s safe to call filemap_fdatawrite() for the block device inode. The underlying problem here is that sync() is able to get a reference to a struct block_device without having it open as a file. Doing something like this does fix the bug:
diff --git a/fs/sync.c b/fs/sync.c
index 2a54c1f..9189eeb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -81,7 +81,10 @@ static void sync_fs_one_sb(struct super_block *sb, void *arg)

static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
- filemap_fdatawrite(bdev->bd_inode->i_mapping);
+ mutex_lock(&bdev->bd_mutex);
+ if (bdev->bd_disk)
+ filemap_fdatawrite(bdev->bd_inode->i_mapping);
+ mutex_unlock(&bdev->bd_mutex);

static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
What I don’t like about this patch is that it simply skips block devices which we don’t have any open file descriptors for. That seems wrong to me because sync() should do writeback on (and wait for) all devices, not just the ones that we happen to have an open file descriptor for. Imagine if we opened a device, wrote a lot of data to it, closed it, called sync(), and sync() returns. Now we should be guaranteed the data was written, but I’m not sure we are in this case.

Another slightly ugly thing is that we’re now holding a new mutex over a potentially big chunk of code (everything that happens inside filemap_fdatawrite()).
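The check-under-lock pattern in the patch above, and the skipping behaviour it brings with it, can be modelled in user space. The sketch below is only an analogy in Python: ModelBlockDevice and the `written` list are invented stand-ins, and nothing here is kernel code.

```python
import threading


class ModelBlockDevice:
    """User-space stand-in for struct block_device (illustration only)."""

    def __init__(self, disk):
        self.bd_mutex = threading.Lock()
        self.bd_disk = disk

    def put(self):
        """Mimic the end of __blkdev_put(): clear bd_disk under bd_mutex."""
        with self.bd_mutex:
            self.bd_disk = None


def fdatawrite_one_bdev(bdev, written):
    """Mimic the patched callback: only use the disk if it is still set."""
    with bdev.bd_mutex:
        if bdev.bd_disk is None:
            return False  # device already closed, writeback is skipped
        written.append(bdev.bd_disk)  # stands in for filemap_fdatawrite()
        return True


written = []
dev = ModelBlockDevice("sr0")
print(fdatawrite_one_bdev(dev, written))  # device still open
dev.put()
print(fdatawrite_one_bdev(dev, written))  # device closed in the meantime
```

The second call returns False without touching the disk. That avoids the crash, but it is also the silent skip that the paragraphs above are uneasy about.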

I’m not sure I can do much better in terms of a small patch at the moment, so I will submit this to the linux-block mailing list with a few relevant people on Cc (Jens Axboe for being the block maintainer, Tejun Heo for having written a lot of the code involved according to git blame, Jan Kara for writing iterate_bdevs(), and Al Viro for probably knowing both the block layer and VFS quite well).

I submitted my patch here: thread

Rabin Vincent answered pretty quickly that he already sent a fix for the very same issue. Oh well, at least his patch is quite close to what I came up with and I learned quite a few new things about the kernel.

Tejun Heo also responded that a better fix would probably be to prevent the disk from going away by getting a reference to it. I tried a couple of different patches without much luck. The currently last patch from me in that thread seemed to prevent the crash, but as I only realised a few minutes after sending it: we’re decrementing the reference count without doing anything when it reaches 0! Of course we don’t get a NULL pointer dereference if we never do the cleanup/freeing in the first place…
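The reference-counting mistake described above is easy to make, and a toy model shows why it hides the crash. This is a generic sketch in Python, not a reconstruction of any patch from the thread; the class names and the release callback are invented for illustration.

```python
class RefCounted:
    """Toy reference count that runs a release callback at zero."""

    def __init__(self, release):
        self.count = 1
        self._release = release

    def get(self):
        self.count += 1

    def put(self):
        self.count -= 1
        if self.count == 0:
            self._release()  # cleanup happens exactly once, at zero


class LeakyRefCounted(RefCounted):
    """Same interface, but put() forgets the cleanup step."""

    def put(self):
        self.count -= 1  # bug: nothing happens when the count reaches zero


events = []
good = RefCounted(lambda: events.append("good released"))
leaky = LeakyRefCounted(lambda: events.append("leaky released"))
good.put()
leaky.put()
print(events)
```

Only the correct version ever runs its cleanup. The leaky one never frees anything, so nothing can be observed half-destroyed, which is why such a patch appears to fix the crash.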

If you liked this post and you enjoy fixing bugs like this one, you may enjoy working with us in the Ksplice group at Oracle. Ping me at my Oracle email address :-)

August 31, 2016 09:00 PM

August 30, 2016

LPC 2016: Most LPC passes sold out; refereed track proposals deadline nears

All of the regular and early bird registrations for the 2016 Linux Plumbers Conference have now sold out. There will be a very limited number of late registrations available starting on October 1.

Those interested in attending the conference should also note that each refereed track talk gets one free pass to the conference. The deadline for refereed track proposals is Thursday September 1.

We hope to see you at LPC 2016!

August 30, 2016 02:15 PM

August 29, 2016

LPC 2016: Coherent Accelerators, FPGAs, and PLD Microconference Accepted into LPC 2016

It has been more than a decade since CPU core clock frequencies stopped doubling every 18 months, which has shifted the search for performance from the “hardware free lunch” to concurrency and, more recently, hardware accelerators. Beyond accelerating computational offload, field-programmable gate arrays (FPGAs) and programmable logic devices (PLDs) have long been used in the embedded space to provide ways to offload I/O or to implement timing-sensitive algorithms as close as possible to the pin.

Regardless of how they are used, however, there exists a common class of problems which accompany the use of FPGAs, accelerators, and PLDs on Linux. Perhaps most important are the probing, discovery, and enumeration of these devices, which can be a challenge given the wide variety of interconnects to which they may be attached.

The purpose of this microconference is to discuss these problems, and figure out what it would take to make these devices first-class citizens on Linux. We will be looking at important use cases, including the much-maligned network-offload case as well as the more general topic of workload acceleration.

For more details on coherent accelerators, FPGAs, and PLDs, please see this microconference’s wiki page.

We hope to see you there!

August 29, 2016 06:40 PM

August 28, 2016

LPC 2016: TPM Microconference Accepted into LPC 2016

Although trusted platform modules (TPMs) have been the subject of some controversy over the years, it is quite likely that they have important roles to play in preventing firmware-based attacks, protecting user keys, and so on. However, some work is required to enable TPMs to successfully play these roles, including getting TPM support into bootloaders, securely distributing known-good hashes, and providing robust and repeatable handling of upgrades.

In short, given the ever-more-hostile environments that our systems must operate in, it seems quite likely that much help will be needed, including from TPMs. For more details, see the TPM Microconference wiki page.

We hope to see you there!

August 28, 2016 05:16 PM

August 26, 2016

Dave Airlie: radv: status update or is dota2 working yet?

Clickbait titles for the win!

First up, massive thanks to my major co-conspirator on radv, Bas Nieuwenhuizen, for putting in so much effort on getting radv going.

So where are we at?

Well this morning I finally found the last bug that was causing missing rendering on Dota 2. We were missing support for a compressed texture format that dota2 used. So currently dota 2 renders, I've no great performance comparison to post yet because my CPU is 5 years old, and can barely get close to 30fps with GL or Vulkan. I think we know of a couple of places that could be bottlenecking us on the CPU side. The radv driver is currently missing hyper-z (90% done), fast color clears and DCC, which are all GPU side speedups in theory. Also running the phoronix-test-suite dota2 tests works sometimes, hangs in a thread lock sometimes, or crashes sometimes. I think we have some memory corruption somewhere that it collides with.

Other status bits: the Vulkan CTS test suite contains 114598 tests, a piglit run a few hours before I fixed dota2 was at:
[114598/114598] skip: 50388, pass: 62932, fail: 1193, timeout: 2, crash: 83 - |/-\

So that isn't too bad a showing, we know some missing features are accounting for some of the fails. A lot of the crashes are an assert in CTS hitting, that I don't think is a real problem.

We render most of the Sascha Willems demos fine.

I've tested the Talos Principle as well, the texture fix renders a lot more stuff on the screen, but we are still seeing large chunks of blackness where I think there should be trees in-game, the menus etc all seem to load fine.

All this work is on the semi-interesting branch of

It only has been tested on VI AMD GPUs, Polaris worked previously but something derailed it, but we should fix it once we get the finished bisect. CIK GPUs kinda work with the amdgpu kernel driver loaded. SI GPUs are nowhere yet.

Here's a screenshot:

August 26, 2016 03:05 AM

Matthew Garrett: Priorities in security

I read this tweet a couple of weeks ago:

to me, an inclusive security community would focus as much (or at all) on surveillance of women by abusive partners as it does the state

— kelsey ᕕ( ᐛ )ᕗ (@_K_E_L_S_E_Y) August 2, 2016

and it got me thinking. Security research is often derided as unnecessary stunt hacking, proving insecurity in things that are sufficiently niche or in ways that involve sufficient effort that the realistic probability of any individual being targeted is near zero. Fixing these issues is basically defending you against nation states (who (a) probably don't care, and (b) will probably just find some other way) and, uh, security researchers (who (a) probably don't care, and (b) see (a)).

Unfortunately, this may be insufficient. As basically anyone who's spent any time anywhere near the security industry will testify, many security researchers are not the nicest people. Some of them will end up as abusive partners, and they'll have both the ability and desire to keep track of their partners and ex-partners. As designers and implementers, we owe it to these people to make software as secure as we can rather than assuming that a certain level of adversary is unstoppable. "Can a state-level actor break this" may be something we can legitimately write off. "Can a security expert continue reading their ex-partner's email" shouldn't be.

August 26, 2016 12:02 AM

August 25, 2016

Gustavo F. Padovan: Slides for my LinuxCon talk on Mainline Explicit Fencing

For those of you who are interested, here are the slides from my presentation at LinuxCon North America this week. The conference was great, with very good talks and very interesting meetings on the hallway track.

My presentation covered the effort to create the Explicit Fencing mechanism in the Linux kernel, which is to be used mainly by the graphics pipeline. In short, Explicit Fencing is a way to give userspace information about the current state of shared buffers inside the kernel. This is done through fences, which can then be passed around to userspace and/or other kernel drivers for synchronization purposes. This allows both userspace and the kernel to wait for kernel jobs to finish without blocking. It also significantly helps the compositor make more efficient and smarter decisions when scheduling frames to display on the screen. I’ll be posting an article with more details on it soon. :)
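To make the "wait without blocking" idea a little more concrete, here is a minimal sketch in Python. It assumes that a fence reaches userspace as a file descriptor that becomes readable once the kernel job has finished, which is my assumption about the interface and is not spelled out above. The function name wait_for_fence is hypothetical, and a pipe stands in for a real fence so the sketch runs anywhere.

```python
import os
import select


def wait_for_fence(fence_fd, timeout_ms):
    """Return True if fence_fd became readable (signaled) within timeout_ms.

    Assumption: the fence is exposed as a pollable file descriptor.
    A timeout of 0 turns this into a pure non-blocking check.
    """
    poller = select.poll()
    poller.register(fence_fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(fd == fence_fd and bool(mask & select.POLLIN)
               for fd, mask in events)


if __name__ == "__main__":
    # A pipe is only a stand-in for a fence fd handed out by a driver.
    read_end, write_end = os.pipe()
    print(wait_for_fence(read_end, 0))  # nothing has signaled yet
    os.write(write_end, b"x")           # stand-in for the kernel job finishing
    print(wait_for_fence(read_end, 0))  # the stand-in fence is now signaled
    os.close(read_end)
    os.close(write_end)
```

A compositor could use a check like this to decide whether a buffer is ready for the next frame or whether it should keep showing the previous one.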

Finally I would like to thank Collabora for sponsoring my travel to LinuxCon.

August 25, 2016 04:28 PM

Paul E. Mc Kenney: Coherent Accelerators, FPGAs, and PLD Microconference Accepted into 2016 Linux Plumbers Conference

It has been more than a decade since CPU core clock frequencies stopped doubling every 18 months, which has shifted the search for performance from the "hardware free lunch" to concurrency and, more recently, hardware accelerators. Beyond accelerating computational offload, field-programmable gate arrays (FPGAs) and programmable logic devices (PLDs) have long been used in the embedded space to provide ways to offload I/O or to implement timing-sensitive algorithms as close as possible to the pin.

Regardless of how they are used, however, there exists a common class of problems which accompany the use of FPGAs, accelerators, and PLDs on Linux. Perhaps most important are the probing, discovery, and enumeration of these devices, which can be a challenge given the wide variety of interconnects to which they may be attached.

The purpose of this microconference is to discuss these problems, and figure out what it would take to make these devices first-class citizens on Linux. We will be looking at important use cases, including the much-maligned network-offload case as well as the more general topic of workload acceleration.

For more details on coherent accelerators, FPGAs, and PLDs, please see this microconference's wiki page.

We hope to see you there!

August 25, 2016 11:32 AM

August 24, 2016

Pete Zaitcev: Curse you, Jon Masters! Why do you always have to be right!

My friend and colleague Jon is proud of his disdain for Linux on desktop (or tablet, for that matter), and always goes around telling people how OSX always works on a Mac, because Apple performs integrated testing etc. etc. The latest episode involved him buying a lemon Dell XPS 13. Pretty much nothing worked right on that pitiful excuse for a computer, and I am sorry to admit, I felt a little smug telling Jon on Facebook how well my ASUS UX303LB worked under Fedora. I've not had a failure to resume even once in the years I've had it (yeah, my standards are this low).

Long story short, Fedora 24 came out and I'm given a taste of the same medicine: the video on the ASUS is completely busted. I was able to limp along for now by using the old kernel 4.4.6-301.fc23, but come on, this is clearly a massive regression. Think anyone is there to bisect and find the culprit? Of course not. I have to do it myself.

So, how did F24 ship? Well... I didn't test beta versions, so I don't have much ground to complain.

UPDATE: While upstream is working on the fix in the next release, I'm using i915.enable_psr=0.
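The workaround is a kernel command-line parameter, so one quick way to confirm it took effect after a reboot is to look for it in /proc/cmdline. The sketch below parses a command-line string; the helper name kernel_param and the sample string are made up for illustration.

```python
def kernel_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # if the parameter repeats, this sketch keeps the last one
    return value


if __name__ == "__main__":
    # Hypothetical sample; on a real machine read the string from /proc/cmdline.
    sample = "BOOT_IMAGE=/vmlinuz ro quiet i915.enable_psr=0"
    print(kernel_param(sample, "i915.enable_psr"))
```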

August 24, 2016 06:56 PM

Pete Zaitcev: The sound ID of telemarketers

I noticed one strange thing recently. Every telemarketing call starts with a particular sound that resembles a modulated data block. It's very short, about 250 ms, but very audible. I'm a little curious what it is. Is it possible to capture and decode?

The regular calls are not preceded by this block, so I'm certain that it's something that telemarketers mix in. But to what purpose?

UPDATE: Someone shared this post on Hacker News. I'm happy to report they didn't conclude that I'm making shit up, but the best the hive-mind came up with was that the Caller-ID is getting sent even after the receiver is picked up. I really should record this somehow.

August 24, 2016 05:16 PM

August 19, 2016

Pete Zaitcev: Fedora, Swift, and xattr>=0.4

If one tries to run Swift tests with "PYTHONPATH=$(pwd) ./.unittests" on a stock Fedora, a bunch of them fail with "DistributionNotFound: xattr>=0.4". This is fixed easily with the following patch:

diff -urp pyxattr-0.5.1-p3/ pyxattr-0.5.1/
--- pyxattr-0.5.1-p3/	2012-05-15 16:58:20.000000000 -0600
+++ pyxattr-0.5.1/	2014-05-29 14:21:54.223317477 -0600
@@ -29,3 +29,11 @@ setup(name = "pyxattr",
       test_suite = "test",
       platforms = ["Linux"],
+# Add a dummy egg so "xattr>=0.4" works in requirements.txt for paste-deploy.
+# This primarily helps with running unit tests of Swift, because for
+# packaging we already disable all this.
+      version = version,
+      description = "Alias to pyxattr",
+      ext_modules = [Extension("xattr", [])]
+     )

IIRC I proposed this as a fix, but the maintainer of pyxattr in Fedora was not glad to see it, so I threw together a spec and RPM for pyxattr, kept on my page.
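For anyone who wants to see which distribution names their Python installation actually knows about, here is a small sketch. My reading of the failure is that the requirement asks for a distribution called "xattr" while the Fedora package registers itself as "pyxattr"; treat that as an assumption. The sketch uses the standard library's importlib.metadata rather than the pkg_resources machinery that raises DistributionNotFound, and distribution_version is a name I made up.

```python
from importlib import metadata


def distribution_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The output depends entirely on what is installed on the machine.
    for dist in ("pyxattr", "xattr"):
        print(dist, distribution_version(dist))
```

If "pyxattr" reports a version and "xattr" reports None, a requirement of "xattr>=0.4" has nothing to match against, which would be consistent with the error above.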

This was going on for 3 years or more. Rebuilding the patched pyxattr again for Fedora 24, I started wondering idly, why is it that nobody else ran into this problem? I suspect the answer is that I am the only human in the world who tests OpenStack Swift on Fedora. Everyone else uses Ubuntu (or pip).

August 19, 2016 12:23 AM

August 16, 2016

Pavel Machek: Vala -- seems ideal so far

I was searching for a language to write the phone GUI with... python3+gtk3 is way too slow; 9 seconds for a trivial application is a bit too much (on N900). python2+gtk2 is a lot better at 2 seconds. Lua should be even faster.

But while searching for a good language, Vala caught my eye. It is a compiled language, designed to be integrated with gtk/dbus. I was worried about error messages and errors from vala->c->binary compilation, but it seems good so far.

Oh and it seems that emacs org mode is the right thing to use for a calendar. It looks a bit too complex at first, but it seems the complexity is well justified... and I was doing similar things manually. Still have to search for a component to notify using popup / audio when an event is upcoming.

August 16, 2016 09:42 PM

August 11, 2016

Matthew Garrett: Microsoft's compromised Secure Boot implementation

There's been a bunch of coverage of this attack on Microsoft's Secure Boot implementation, a lot of which has been somewhat confused or misleading. Here's my understanding of the situation.

Windows RT devices were shipped without the ability to disable Secure Boot. Secure Boot is the root of trust for Microsoft's User Mode Code Integrity (UMCI) feature, which is what restricts Windows RT devices to running applications signed by Microsoft. This restriction is somewhat inconvenient for developers, so Microsoft added support in the bootloader to disable UMCI. If you were a member of the appropriate developer program, you could give your device's unique ID to Microsoft and receive a signed blob that disabled image validation. The bootloader would execute a (Microsoft-signed) utility that verified that the blob was appropriately signed and matched the device in question, and would then insert it into an EFI Boot Services variable[1]. On reboot, the boot loader reads the blob from that variable and integrates that policy, telling later stages to disable code integrity validation.

The problem here is that the signed blob includes the entire policy, and so any policy change requires an entirely new signed blob. The Windows 10 Anniversary Update added a new feature to the boot loader, allowing it to load supplementary policies. These must also be signed, but aren't tied to a device id - the idea is that they'll be ignored unless a device-specific policy has also been loaded. This way you can get a single device-specific signed blob that allows you to set an arbitrary policy later by using a combination of supplementary policies.

This is all fine in the Anniversary Edition. Unfortunately older versions of the boot loader will happily load a supplementary policy as if it were a full policy, ignoring the fact that it doesn't include a device ID. The loaded policy replaces the built-in policy, so in the absence of a base policy a supplementary policy as simple as "Enable this feature" will effectively remove all other restrictions.

Unfortunately for Microsoft, such a supplementary policy leaked. Installing it as a base policy on pre-Anniversary Edition boot loaders will then allow you to disable all integrity verification, including in the boot loader. Which means you can ask the boot loader to chain to any other executable, in turn allowing you to boot a compromised copy of any operating system you want (not just Windows).
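As a way of making the difference between the two boot loader behaviours concrete, here's a toy model in Python. It is a sketch of the logic as described above, not of Microsoft's actual policy format or code: the Policy class, its fields and both function names are hypothetical, and signature checking is reduced to a boolean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    signed: bool
    device_id: Optional[str]          # None models a supplementary policy
    disables_validation: bool = False


def old_loader_disables_validation(policies: List[Policy], device: str) -> bool:
    """Pre-Anniversary behaviour in this model: any signed policy that is not
    bound to a *different* device is treated as a full policy."""
    for p in policies:
        if not p.signed:
            continue
        if p.device_id is not None and p.device_id != device:
            continue
        # The flaw: a missing device ID is not treated as a reason to refuse.
        if p.disables_validation:
            return True
    return False


def new_loader_disables_validation(policies: List[Policy], device: str) -> bool:
    """Anniversary behaviour in this model: supplementary policies only count
    once a signed, device-specific base policy has been loaded."""
    has_base = any(p.signed and p.device_id == device for p in policies)
    if not has_base:
        return False
    return any(p.signed and p.disables_validation
               and p.device_id in (None, device) for p in policies)


if __name__ == "__main__":
    leaked = Policy(signed=True, device_id=None, disables_validation=True)
    print(old_loader_disables_validation([leaked], "device-A"))
    print(new_loader_disables_validation([leaked], "device-A"))
```

In this model the leaked supplementary policy is enough on its own for the old loader, while the new loader ignores it unless a device-specific base policy is also present.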

This does require you to be able to install the policy, though. The PoC released includes a signed copy of SecureBootDebug.efi for ARM, which is sufficient to install the policy on ARM systems. There doesn't (yet) appear to be a public equivalent for x86, which means it's not (yet) practical for arbitrary attackers to subvert the Secure Boot process on x86. I've been doing my testing on a setup where I've manually installed the policy, which isn't practical in an automated way.

How can this be prevented? Installing the policy requires the ability to run code in the firmware environment, and by default the boot loader will only load signed images. The number of signed applications that will copy the policy to the Boot Services variable is presumably limited, so if the Windows boot loader supported blacklisting second-stage bootloaders Microsoft could simply blacklist all policy installers that permit installation of a supplementary policy as a primary policy. If that's not possible, they'll have to blacklist the vulnerable boot loaders themselves. That would mean all pre-Anniversary Edition install media would stop working, including recovery and deployment images. That's, well, a problem. Things are much easier if the first case is true.

Thankfully, if you're not running Windows this doesn't have to be an issue. There are two commonly used Microsoft Secure Boot keys. The first is the one used to sign all third party code, including drivers in option ROMs and non-Windows operating systems. The second is used purely to sign Windows. If you delete the second from your system, Windows boot loaders (including all the vulnerable ones) will be rejected by your firmware, but non-Windows operating systems will still work fine.

From what we know so far, this isn't an absolute disaster. The ARM policy installer requires user intervention, so if the x86 one is similar it'd be difficult to use this as an automated attack vector[2]. If Microsoft are able to blacklist the policy installers without blacklisting the boot loader, it's also going to be minimally annoying. But if it's possible to install a policy without triggering any boot loader blacklists, this could end up being embarrassing.

Even outside the immediate harm, this is an interesting vulnerability. Presumably when the older boot loaders were written, Microsoft policy was that they would never sign policy files that didn't include a device ID. That policy changed when support for supplemental policies was added. Without this policy change, the older boot loaders could still be considered secure. Adding new features can break old assumptions, and your design needs to take that into account.

[1] EFI variables come in two main forms - those accessible at runtime (Runtime Services variables) and those only accessible in the early boot environment (Boot Services variables). Boot Services variables can only be accessed before ExitBootServices() is called, and in Secure Boot environments all code executing before this point is (theoretically) signed. This means that Boot Services variables are nominally tamper-resistant.

[2] Shim has explicit support for allowing a physically present machine owner to disable signature validation - this is basically equivalent
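As an aside on footnote [1], the distinction between the two kinds of variable comes down to attribute bits defined in the UEFI specification. The sketch below uses those three attribute values; the helper name is_boot_services_only is mine.

```python
# Attribute bits from the UEFI specification.
EFI_VARIABLE_NON_VOLATILE = 0x00000001
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x00000002
EFI_VARIABLE_RUNTIME_ACCESS = 0x00000004


def is_boot_services_only(attributes):
    """True if a variable is visible before ExitBootServices() but not after."""
    has_bs = bool(attributes & EFI_VARIABLE_BOOTSERVICE_ACCESS)
    has_rt = bool(attributes & EFI_VARIABLE_RUNTIME_ACCESS)
    return has_bs and not has_rt


if __name__ == "__main__":
    boot_only = EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS
    runtime = boot_only | EFI_VARIABLE_RUNTIME_ACCESS
    print(is_boot_services_only(boot_only))
    print(is_boot_services_only(runtime))
```

A variable without the runtime bit is never handed to the running operating system, which is why only code that runs before ExitBootServices() can write it.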

August 11, 2016 09:58 PM

August 10, 2016

LPC 2016: Refereed Talk Deadline Approaching

The refereed talk deadline for Linux Plumbers Conference is only a few weeks off, September 1, 2016 at 11:59PM CET. So there is still some time to get your proposals in, but time is growing short.

Note that this year’s Plumbers is co-located with Linux Kernel Summit rather than LinuxCon, so the refereed track is all Plumbers this year. We are therefore looking forward to seeing your all-Plumbers refereed-track submission!

As you might have noticed, earlybird registration has closed, but normal-rate registration will be opening up on August 27th—however, accepted refereed speaking proposals will receive a free pass.

The conference itself is in Santa Fe, New Mexico on November 1-4, 2016. Looking forward to seeing you there!

August 10, 2016 11:05 PM

August 08, 2016

Paul E. Mc Kenney: Refereed Talk Deadline Approaching for Linux Plumbers Conference

The refereed talk deadline for Linux Plumbers Conference is only a few weeks off, September 1, 2016 at 11:59PM CET. So there is still some time to get your proposals in, but time is growing short.

Note that this year's Plumbers is co-located with Linux Kernel Summit rather than LinuxCon, so the refereed track is all Plumbers this year. We are therefore looking forward to seeing your all-Plumbers refereed-track submission!

As you might have noticed, earlybird registration has closed, but normal-rate registration will be opening up on August 27th—however, accepted refereed speaking proposals will receive a free pass.

The conference itself is in Santa Fe, New Mexico on November 1-4, 2016. Looking forward to seeing you there!

August 08, 2016 08:53 PM