Kernel Planet

December 06, 2016

LPC 2016: Linux Plumbers Conference 2017

It’s our pleasure to announce that Linux Plumbers Conference 2017 will take place on September 13-15, 2017, in Los Angeles, California, USA. The conference will be co-located with the Linux Foundation Open Source Summit North America.

Stay tuned for more information as the Linux Plumbers Conference committee is starting to plan for the 2017 edition.

We hope you’ll join us in 2017.

The LPC Planning Team.

 

December 06, 2016 07:51 PM

December 05, 2016

James Bottomley: Using Your TPM as a Secure Key Store

One of the new features of Linux Plumbers Conference this year was the TPM Microconference, which facilitated great discussions both in the session itself and in the hallways.  Quite a bit of discussion was generated by the Beginner’s Guide to the TPM talk I gave, mostly because I blamed the Trusted Computing Group for the abject failure to adopt TPMs for anything, citing the incredible complexity of their stack.

The main thing that came out of this discussion was that a lot of this stack complexity can be hidden from users and we should concentrate on making the TPM “just work” for all cryptographic functions where we have parallels in the existing security layers (like the keystore).  One of the great advantages of the TPM, compared with messing about with USB pkcs11 tokens, is that it has a file format for TPM keys (I’ll explain this later) which can be used directly in place of standard private key files.  However, before we get there, let’s discuss some of the basics of how your TPM works and how to make use of it.

TPM Basics

Note that all of what I’m saying below applies to a 1.2 TPM (the type most people have in their laptops).  2.0 TPMs are now appearing on the market, but chances are you have a 1.2.

A TPM is traditionally delivered in your laptop in an uninitialised state.  In older laptops, the TPM is usually disabled and you have to find an entry in the BIOS menu to enable it.  In more modern laptops (thanks to Windows 10) the TPM is enabled in the BIOS and ready for the OS install to make use of it.  All TPMs are delivered with one manufacturer set key called the Endorsement Key (EK).  This key is unique to your TPM (like an identifying label) and is used as part of the attestation protocol.  Because the EK is a unique label, the attestation protocol is rather complex, involving a so called privacy CA to protect your identity, but because it isn’t needed for using the TPM as a secure keystore, I won’t cover it further.

The other important key, which you have to generate, is called the Storage Root Key (SRK).  This key is generated internally within the TPM once somebody takes ownership of it.  The package you need to begin using the TPM is tpm-tools, which is packaged by most distros. You must also have the Linux TSS stack trousers installed (just installing tpm-tools will often pull this in) and have the tcsd part of trousers running (usually systemctl start tcsd; systemctl enable tcsd). I tend to configure my TPM with an owner password (for things like resetting dictionary attacks) but a well known storage root key authority.  To do this from a fully cleared and enabled TPM, execute

tpm_takeownership -z

And give your chosen owner password when prompted.  If you get an error, chances are you need to go back to the BIOS menu and actively clear and reset the TPM (usually under the security options).
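For reference, the whole preparation sequence, from a fresh system to an owned TPM, is roughly the following (the package install line is illustrative; package manager and names vary by distro):

yum install tpm-tools trousers
systemctl start tcsd
systemctl enable tcsd
tpm_takeownership -z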

Aside about Authority and the Trusted Security Stack

To the TPM, an “authority” is a 20 byte number you use to prove you’re allowed to manipulate whatever object you’re trying to use.  The TPM typically has a well known way of converting typed passwords into these 20 byte codes.  The way you prove you know the authority is to add a Hashed Message Authentication Code (HMAC) to your TPM command.  This means that the hash can only be generated by someone who knows the authority for the object, but anyone seeing the hash cannot derive the authority from it.  The utility of this is that the trousers library (tspi) generates the HMAC before the TPM command is passed to the central daemon (tcsd), meaning that nothing except you and the TPM knows the authority.
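As a rough illustration of the idea (a simplification, not the exact OIAP/OSAP computation the TPM 1.2 specification defines):

import hashlib, hmac, os

# the 20 byte authority is derived from the typed password and never
# leaves your machine
authority = hashlib.sha1(b"my owner password").digest()

# tspi proves knowledge of it by HMACing a digest of the command plus
# an anti-replay nonce; the TPM recomputes the HMAC from its own copy
# of the authority and rejects the command on any mismatch
command_digest = hashlib.sha1(b"serialized TPM command").digest()
nonce = os.urandom(20)
proof = hmac.new(authority, command_digest + nonce, hashlib.sha1).digest()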

The final thing about authority you need to know is that the TPM has a concept of a “well known authority”, which simply means supplying 20 bytes of zeros.  It’s kind of paradoxical to have a secret everyone knows; however, there are reasons for this: for most objects in the TPM, whether you require authority to use them is optional, but for some it is mandatory.  For objects (like the SRK) where authority is mandatory, using the well known authority is equivalent to saying “actually, I don’t need authorization for this object”.

The Storage Root Key (SRK)

Once you’ve taken ownership as above, the TPM keeps the secret part of the SRK permanently hidden, but can be persuaded to give anyone the public part.  In TPM 1.2, the SRK is an RSA 2048 key.  On most modern TPMs, you have to tell the TPM you want anyone to be able to read the public part of the storage root key, which you do with this command

tpm_restrictsrk -a

You’ll get prompted for the owner password.  Once you execute this command, anyone who knows the SRK authority (which you’ve set to be well known) is allowed to read the public part.

Why all this fuss about SRK authorization?  Well, traditionally, the TPM is designed for use in a hostile multi-user environment.  In the relaxed, no-authorization environment I’ve advised you to set up, anyone who knows the SRK can upload any storage object (like a key or protected blob) into the TPM.  This means, since the TPM has very limited storage, that they could in theory do a DoS attack against the TPM simply by filling it with objects.  On a laptop where there’s only one user (you) this is not usually a concern, hence the advice to use a well known authority, which makes the TPM much easier to use.

The way external objects (like keys or data blobs) are uploaded into the TPM is that they all have a parent (which must be a storage key) and they are encrypted to the public part of this key (in TPM parlance, this is called wrapping).  The TPM can have deep key hierarchies (all eventually parented to the SRK), but for a laptop, it makes sense simply to use the SRK as the only storage key and wrap everything for it as the parent.  Now here’s the reason for the well known authority: to upload an object into the TPM, it not only needs to be wrapped to the parent key, you also need to use the parent key authority to perform the upload.  The object you’re using also has a separate authority.  This means that when you upload and use a key, if you’ve set a SRK password, you’ll end up having to type both the SRK password and the key password pretty much every time you use it, which is a bit of a pain.

The tools used to create wrapped keys are found in the openssl_tpm_engine package.  I’ve done a few patches to make it easier to use (mostly by trying the well known authority first before asking for the SRK password), so you can see my patched version here.  The first thing you can do is take any PEM key file you have and wrap it for your TPM

create_tpm_key -m -w test.key test.tpm.key

This creates a TPM key file test.tpm.key containing a wrapped key for your TPM with no authority (to add an authority password, use the -a option).  If you cat the test.tpm.key file, you’ll see it looks like a standard PEM file, except the guards are now

-----BEGIN TSS KEY BLOB-----
-----END TSS KEY BLOB-----

This key is now wrapped for your TPM’s SRK and would only be usable on your laptop.  If you’re fortunate enough to be using an application linked with gnutls, you can simply use this key with the URI  tpmkey:file=<path to test.tpm.key>.  If you’re using openssl, you need to patch it to get it to use TPM keys easily (see below).
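For the gnutls case, a typical invocation looks something like this (the host and certificate names are placeholders, and your gnutls build must include TPM support):

gnutls-cli -p 443 example.com --x509certfile test.crt --x509keyfile 'tpmkey:file=/path/to/test.tpm.key'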

The ideal, however, would be that, since these are PEM files with unique guards, any SSL provider should simply recognise the guards and load the key into the TPM.  This means that to use a TPM key, you take a standard PEM private key file, transform it into a TPM key file and then simply copy it back to where the original key file was being used from and voila! you’re using a TPM based key.  This is what the openssl patches below do.

Getting TPM keys to “just work” with openssl

In openssl, external encryption processors, like the TPM or USB keys, are driven by things called engines.  The engine you need for the TPM is also in the openssl_tpm_engine package, so once you’ve installed that package, the engine is available.  Unfortunately, openssl doesn’t naturally use a particular engine unless told to do so (most of the openssl tools have a -engine option for this).  However, having to specify the engine in every application somewhat spoils the “just works” aspect we’re looking for, so the openssl patches here allow an engine to specify that it knows how to parse a PEM file and can load a key from it.  This allows you simply to replace the original key file with a TPM protected key file and have your application continue working with it.
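For comparison, without the patches you have to name the engine explicitly in every tool; something like this (assuming the engine registers under the id tpm, with hypothetical file names):

openssl s_client -engine tpm -keyform engine -key test.tpm.key -cert test.crt -connect example.com:443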

As a demo of the usefulness, I’m using it on my current laptop with all my VPN keys.  It is also possible to use it with openssh keys, since they’re standard PEM files.  However, the way openssh works with agents means that the agent cannot handle the keys and you have to type the password (if you set one on the key) each time you use it.

It should be noted that the idea of having PEM based TPM keys just work in openssl is encountering resistance.  However, it does just work in gnutls (provided you change the file name to be a tpmkey:file= URL).

Conclusions (or How Well is it Working?)

As I said above, I’m currently using this scheme for my openvpn and ssh keys.  I have to confess, since I use openssh a lot, I got very tired of having to type the password on every ssh operation, so I’ve gone back to using non-TPM based keys which can be handled by the agent.  Fixing this is on my list of things to look at.  However, I am still using TPM based keys for my openvpn.

Even for openvpn, though, there are hiccoughs: the trousers daemon, tcsd, crashes periodically on my platform.  When it does, the vpn goes down (because the VPN needs a key based authentication transaction every hour to rotate the symmetric encryption keys).  Unfortunately, just restarting tcsd isn’t enough because the design of trousers doesn’t seem to be robust to this failure (even though the tspi part linked with the application could recreate all the keys), so the VPN itself must be restarted when this happens, which makes it rather user unfriendly.  Fixing trousers to cope with tcsd failure is also on my list of things to fix …

December 05, 2016 04:41 PM

December 02, 2016

Matthew Garrett: Ubuntu still isn't free software

Mark Shuttleworth just blogged about their stance against unofficial Ubuntu images. The assertion is that a cloud hoster is providing unofficial and modified Ubuntu images, and that these images are meaningfully different from upstream Ubuntu in terms of their functionality and security. Users are attempting to make use of these images, are finding that they don't work properly and are assuming that Ubuntu is a shoddy product. This is an entirely legitimate concern, and if Canonical are acting to reduce user confusion then they should be commended for that.

The appropriate means to handle this kind of issue is trademark law. If someone claims that something is Ubuntu when it isn't, that's probably an infringement of the trademark and it's entirely reasonable for the trademark owner to take action to protect the value associated with their trademark. But Canonical's IP policy goes much further than that - it can be interpreted as meaning[1] that you can't distribute works based on Ubuntu without paying Canonical for the privilege, even if you call it something other than Ubuntu.

This remains incompatible with the principles of free software. The freedom to take someone else's work and redistribute it is a vital part of the four freedoms. It's legitimate for Canonical to insist that you not pass it off as their work when doing so, but their IP policy continues to insist that you remove all references to Canonical's trademarks even if their use would not infringe trademark law.

If you ask a copyright holder if you can give a copy of their work to someone else (assuming it doesn't infringe trademark law), and they say no or insist you need an additional contract, it's not free software. If they insist that you recompile source code before you can give copies to someone else, it's not free software. Asking that you remove trademarks that would otherwise infringe trademark law is fine, but if you can't use their trademarks in non-infringing ways, that's still not free software.

Canonical's IP policy continues to impose restrictions on all of these things, and therefore Ubuntu is not free software.

[1] And by "interpreted as meaning" I mean that's what it says and Canonical refuse to say otherwise


December 02, 2016 09:37 AM

November 16, 2016

Pavel Machek: Linux did not win, yet

http://www.cio.com/article/3141918/linux/linux-has-won-microsoft-joins-the-linux-foundation.html Yes, Linux won on servers. Unfortunately... servers are not that important, and Linux still has not won on desktops (and is not much closer now than it was in 1998, AFAICT). We kind-of won on phones, but are not getting any benefits from that. Android is incompatible with X applications. Kernels on phones are so patched that updating the kernel on a phone is impossible... :-(. So this means that Microsoft sponsors the Linux Foundation. Well, nice, but not a big deal. Has Microsoft promised not to use their patents against Linux? Does their kernel actually contain vfat code? Can I even get source for "their" Linux kernel? [Searching for Linux on microsoft.com does not reveal anything interesting; maybe switching to English would help...]

November 16, 2016 11:49 PM

November 15, 2016

Paul E. Mc Kenney: Another great Linux Plumbers Conference!

A big “thank you” to the program committee, to the microconference leads, to the refereed-track speakers, and, most of all, to the attendees! We had a great Linux Plumbers Conference this year, and we could not have done it without all of you!!!

November 15, 2016 10:47 PM

November 14, 2016

Pavel Machek: foxtrotgps: not suitable for spacecraft navigation

Subject: foxtrotgps: not suitable for spacecraft navigation
Package: foxtrotgps
Version: 1.2.0-1
Severity: normal
Dear Maintainer,
Trying to use foxtrotgps in the spacecraft leads to some interesting
glitches.
When the date line is reached, "track traveled" jumps over the whole
world, and "your position" gets de-synchronized from the point where
the red line is painted.
Reproduced with Vostok-1 spacecraft.

November 14, 2016 10:22 AM

November 10, 2016

Matthew Garrett: Tor, TPMs and service integrity attestation

One of the most powerful (and most scary) features of TPM-based measured boot is the ability for remote systems to request that clients attest to their boot state, allowing the remote system to determine whether the client has booted in the correct state. This involves each component in the boot process writing a hash of the next component into the TPM and logging it. When attestation is requested, the remote site gives the client a nonce and asks for an attestation, the client OS passes the nonce to the TPM and asks it to provide a signed copy of the hashes and the nonce and sends them (and the log) to the remote site. The remote site then replays the log to ensure it matches the signed hash values, and can examine the log to determine whether the system is trustworthy (whatever trustworthy means in this context).
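The replay step is mechanical: each measurement is folded into a Platform Configuration Register using the TPM's extend operation. A minimal sketch of the verifier side, assuming a single SHA-1 PCR (TPM 1.2 style):

import hashlib

def replay_log(measurements):
    # PCRs start at twenty zero bytes; each logged event extends the
    # register: PCR' = SHA1(PCR || measurement)
    pcr = b"\x00" * 20
    for m in measurements:
        pcr = hashlib.sha1(pcr + m).digest()
    return pcr

# the remote site accepts the log only if the replayed value matches
# the PCR value in the TPM-signed quote (which also covers the nonce)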

When this was first proposed people were (justifiably!) scared that remote services would start refusing to work for users who weren't running (for instance) an approved version of Windows with a verifiable DRM stack. Various practical matters made this impossible. The first was that, until fairly recently, there was no way to demonstrate that the key used to sign the hashes actually came from a TPM[1], so anyone could simply generate a set of valid hashes, sign them with a random key and provide that. The second is that even if you have a signature from a TPM, you have no way of proving that it's from the TPM that the client booted with (you can MITM the request and either pass it to a client that did boot the appropriate OS or to an external TPM that you've plugged into your system after boot and then programmed appropriately). The third is that, well, systems and configurations vary so much that outside very controlled circumstances it's impossible to know what a "legitimate" set of hashes even is.

As a result, so far remote attestation has tended to be restricted to internal deployments. Some enterprises use it as part of their VPN login process, and we've been working on it at CoreOS to enable Kubernetes clusters to verify that workers are in a trustworthy state before running jobs on them. While useful, this isn't terribly exciting for most people. Can we do better?

Remote attestation has generally been thought of in terms of remote systems requiring that clients attest. But there's nothing that requires things to be done in that direction. There's nothing stopping clients from being able to request that a server attest to its state, allowing clients to make informed decisions about whether they should provide confidential data. But the problems that apply to clients apply equally well to servers. Let's work through them in reverse order.

We have no idea what expected "good" values are

Yes, and this is a problem. CoreOS ships with an expected set of good values, and we had general agreement at the Linux Plumbers Conference that other distributions would start looking at what it would take to do the same. But how do we know that those values are themselves trustworthy? In an ideal world this would involve reproducible builds, allowing anybody to grab the source code for the OS, build it locally and verify that they have the same hashes.

Ok. So we're able to verify that the booted OS was good. But how about the services? The rkt container runtime supports measuring each container into the TPM, which means we can verify which container images were started. If container images are also built in such a way that they're reproducible, users can grab the source code, rebuild the container locally and again verify that it has the same hashes. Users can then be sure that the remote site is running the code they're looking at.

Or can they? Not really - a general purpose OS has all kinds of ways to inject code into containers, so an admin could simply replace the binaries inside the container after it's been measured, or ptrace() the server, or modify rkt so it generates correct measurements regardless of the image or, well, there's lots they could do. So a general purpose OS is probably a bad idea here. Instead, let's imagine an immutable OS that does nothing other than bring up networking and then reads a config file that tells it which container images to download and run. This reduces the amount of code that needs to support reproducible builds, making it easier for a client to verify that the source corresponds to the code the remote system is actually running.

Is this sufficient? Eh sadly no. Even if we know the valid values for the entire OS and every container, we don't know the legitimate values for the system firmware. Any modified firmware could tamper with the rest of the trust chain, making it possible for you to get valid OS values even if the OS has been subverted. This isn't a solved problem yet, and really requires hardware vendor support. Let's handwave this for now, or assert that we'll have some sidechannel for distributing valid firmware values.

Avoiding TPM MITMing

This one's more interesting. If I ask the server to attest to its state, it can simply pass that through to a TPM running on another system that's running a trusted stack and happily serve me content from a compromised stack. Suboptimal. We need some way to tie the TPM identity and the service identity to each other.

Thankfully, we have one. Tor supports running services in the .onion TLD. The key used to identify the service to the Tor network is also used to create the "hostname" of the system. I wrote a pretty hacky implementation that generates that key on the TPM, tying the service identity to the TPM. You can ask the TPM to prove that it generated a key, and that allows you to tie both the key used to run the Tor service and the key used to sign the attestation hashes to the same TPM. You now know that the attestation values came from the same system that's running the service, and that means you know the TPM hasn't been MITMed.

How do you know it's a TPM at all?

This is much easier. See [1].



There's still various problems around this, including the fact that we don't have this immutable minimal container OS, that we don't have the infrastructure to ensure that container builds are reproducible, that we don't have any known good firmware values and that we don't have a mechanism for allowing a user to perform any of this validation. But these are all solvable, and it seems like an interesting project.

"Interesting" isn't necessarily the right metric, though. "Useful" is. And I think this is very useful. If I'm about to upload documents to a SecureDrop instance, it seems pretty important that I be able to verify that it is a SecureDrop instance rather than something pretending to be one. This gives us a mechanism.

The next few years seem likely to raise interest in ensuring that people have secure mechanisms to communicate. I'm not emotionally invested in this one, but if people have better ideas about how to solve this problem then this seems like a good time to talk about them.

[1] More modern TPMs have a certificate that chains from the TPM's root key back to the TPM manufacturer, so as long as you trust the TPM manufacturer to have kept control of that you can prove that the signature came from a real TPM


November 10, 2016 08:48 PM

November 04, 2016

LPC 2016: Closing party at the compound

We have a map here for you to locate the compound.

November 04, 2016 10:18 PM

October 28, 2016

Matthew Garrett: Of course smart homes are targets for hackers

The Wirecutter, an in-depth comparative review site for various electrical and electronic devices, just published an opinion piece on whether users should be worried about security issues in IoT devices. The summary: avoid devices that don't require passwords (or don't force you to change a default) and devices that want you to disable security; follow general network security best practices, but otherwise don't worry - criminals aren't likely to target you.

This is terrible, irresponsible advice. It's true that most users aren't likely to be individually targeted by random criminals, but that's a poor threat model. As I've mentioned before, you need to worry about people with an interest in you. Making purchasing decisions based on the assumption that you'll never end up dating someone with enough knowledge to compromise a cheap IoT device (or even meeting an especially creepy one in a bar) is not safe, and giving advice that doesn't take that into account is a huge disservice to many potentially vulnerable users.

Of course, there's also the larger question raised by last week's problems. Insecure IoT devices still pose a threat to the wider internet, even if the owner's data isn't at risk. I may not be optimistic about the ease of fixing this problem, but that doesn't mean we should just give up. It is important that we improve the security of devices, and many vendors are just bad at that.

So, here's a few things that should be a minimum when considering an IoT device:

  • Does the vendor publish a security contact? (If not, they don't care about security)
  • Does the vendor provide frequent software updates, even for devices that are several years old? (If not, they don't care about security)
  • Has the vendor ever denied a security issue that turned out to be real? (If so, they care more about PR than security)
  • Is the vendor able to provide the source code to any open source components they use? (If not, they don't know which software is in their own product and so don't care about security, and also they're probably infringing my copyright)
  • Do they mark updates as fixing security bugs? (If not, they care more about hiding security issues than fixing them)
  • Has the vendor ever threatened to prosecute a security researcher? (If so, again, they care more about PR than security)
  • Does the vendor provide a public minimum support period for the device? (If not, they don't care about security or their users)

    I've worked with big name vendors who did a brilliant job here. I've also worked with big name vendors who responded with hostility when I pointed out that they were selling a device with arbitrary remote code execution. Going with brand names is probably a good proxy for many of these requirements, but it's insufficient.

    So here's my recommendations to The Wirecutter - talk to a wide range of security experts about the issues that users should be concerned about, and figure out how to test these things yourself. Don't just ask vendors whether they care about security, ask them what their processes and procedures look like. Look at their history. And don't assume that just because nobody's interested in you, everybody else's level of risk is equal.



    October 28, 2016 05:23 PM

    October 25, 2016

    Valerie Aurora: Why I won’t be attending Systems We Love

    Systems We Love is a one day event in San Francisco to talk excitedly about systems computing. When I first heard about it, I was thrilled! I love systems so much that I moved from New Mexico to the Bay Area when I was 23 years old purely so that I could talk to more people about them. I’m the author of the Kernel Hacker’s Bookshelf series, in which I enthusiastically described operating systems research papers I loved in the hopes that systems programmers would implement them. The program committee of Systems We Love includes many people I respect and enjoy being around. And the event is so close to me that I could walk to it.

    So why am I not going to Systems We Love? Why am I warning my friends to think twice before attending? And why am I writing a blog post warning other people about attending Systems We Love?

    The answer is that I am afraid that Bryan Cantrill, the lead organizer of Systems We Love, will say cruel and humiliating things to people who attend. Here’s why I’m worried about that.

    I worked with Bryan in the Solaris operating systems group at Sun from 2002 to 2004. We didn’t work on the same projects, but I often talked to him at the weekly Monday night Solaris kernel dinner at Osteria in Palo Alto, participated in the same mailing lists as him, and stopped by his office to ask him questions every week or two. Even 14 years ago, Bryan was one of the best systems programmers, writers, and speakers I have ever met. I admired him and learned a lot from him. At the same time, I was relieved when I left Sun because I knew I’d never have to work with Bryan again.

    Here’s one way to put it: to me, Bryan Cantrill is the opposite of another person I admire in operating systems (whom I will leave unnamed). This person makes me feel excited and welcome and safe to talk about and explore operating systems. I’ve never seen them shame or insult or put down anyone. They enthusiastically and openly talk about learning new systems concepts, even when other people think they should already know them. By doing this, they show others that it’s safe to admit that they don’t know something, which is the first step to learning new things. They are helping create the kind of culture I want in systems programming – the kind of culture promoted by Papers We Love, which Bryan cites as the inspiration for Systems We Love.

    By contrast, when I’m talking to Bryan I feel afraid, cautious, and fearful. Over the years I worked with Bryan, I watched him shame and insult hundreds of people, in public and in private, over email and in person, in papers and talks. Bryan is no Linus Torvalds – Bryan’s insults are usually subtle, insinuating, and beautifully phrased, whereas Linus’ insults tend towards the crude and direct. Even as you are blushing in shame from what Bryan just said about you, you are also admiring his vocabulary, cadence, and command of classical allusion. When I talked to Bryan about any topic, I felt like I was engaging in combat with a much stronger foe who only wanted to win, not help me learn. I always had the nagging fear that I probably wouldn’t even know how cleverly he had insulted me until hours later. I’m sure other people had more positive experiences with Bryan, but my experience matches that of many others. In summary, Bryan is supporting the status quo of the existing culture of systems programming, which is a culture of combat, humiliation, and domination.

    People admire and sometimes hero-worship Bryan because he’s a brilliant technologist, an excellent communicator, and a consummate entertainer. But all that brilliance, sparkle, and wit are often used in the service of mocking and humiliating other people. We often laugh and are entertained by what Bryan says, but most of the time we are laughing at another person, or at a person by proxy through their work. I think we rationalize taking part in this kind of cruelty by saying that the target “deserves” it because they made a short-sighted design decision, or wrote buggy code, or accidentally made themselves appear ridiculous. I argue that no one deserves to be humiliated or laughed at for making an honest mistake, or learning in public, or doing the best they could with the resources they had. And if that means that people like Bryan have to learn how to be entertaining without humiliating people, I’m totally fine with that.

    I stopped working with Bryan in 2004, which was 12 years ago. It’s fair to wonder if Bryan has had a change of heart since then. As far as I can tell, the answer is no. I remember speaking to Bryan in 2010 and 2011 and it was déjà vu all over again. The first time, I had just co-founded a non-profit for women in open technology and culture, and I was astonished when Bryan delivered a monologue to me on the “right” way to get more women involved in computing. The second time I was trying to catch up with a colleague I hadn’t seen in a while and Bryan was invited along. Bryan dominated the conversation and the two of us the entire evening, despite my best efforts. I tried one more time about a month ago: I sent Bryan a private message on Twitter telling him honestly and truthfully what my experience of working with him was like, and asking if he’d had a change of heart since then. His reply: “I don’t know what you’re referring to, and I don’t feel my position on this has meaningfully changed — though I am certainly older and wiser.” Then he told me to google something he’d written about women in computing.

    But you don’t have to trust my word on what Bryan is like today. The blog post Bryan wrote announcing Systems We Love sounds exactly like the Bryan I knew: erudite, witty, self-praising, and full of elegant insults directed at a broad swathe of people. He gaily recounts the time he gave a highly critical keynote speech at USENIX, bashfully links to a video praising him at a Papers We Love event, elegantly puts down most of the existing operating systems research community, and does it all while using the words “ancillary,” “verve,” and “quadrennial.” Once you know the underlying structure – a layer cake of vituperation and braggadocio, frosted with eloquence – you can see the same pattern in most of his writing and talks.

    So when I heard about Systems We Love, my first thought was, “Maybe I can go but just avoid talking to Bryan and leave the room when he is speaking.” Then I thought, “I should warn my friends who are going.” Then I realized that my friends are relatively confident and successful in this field, but the people I should be worried about are the ones just getting started. Based on the reputation of Papers We Love and the members of the Systems We Love program committee, they probably fully expect to be treated respectfully and kindly. I’m old and scarred and know what to expect when Bryan talks, and my stomach roils at the thought of attending this event. How much worse would it be for someone new and open and totally unprepared?

    Bryan is a better programmer than I am. Bryan is a better systems architect than I am. Bryan is a better writer and speaker than I am. The one area I feel confident that I know more about than Bryan is increasing diversity in computing. And I am certain that the environment that Bryan creates and fosters is more likely to discourage and drive off women of all races, people of color, queer and trans folks, and other people from underrepresented groups. We’re already standing closer to the exit; for many of us, it doesn’t take much to make us slip quietly out the door and never return.

    I’m guessing that Bryan will respond to me saying that he humiliates, dominates, and insults people by trying to humiliate, dominate, and insult me. I’m not sure if he’ll criticize my programming ability, my taste in operating systems, or my work on increasing diversity in tech. Maybe he’ll criticize me for humiliating, dominating, and insulting people myself – and I’ll admit, I did my fair share of that when I was trying to emulate leaders in my field such as Bryan Cantrill and Linus Torvalds. It’s gone now, but for years there was a quote from me on a friend’s web site, something like: “I’m an elitist jerk, I fit right in at Sun.” It took me years to detox and unlearn those habits and I hope I’m a kinder, more considerate person now.

    Even if Bryan doesn’t attack me, people who like the current unpleasant culture of systems programming will. I thought long and hard about the friendships, business opportunities, and social capital I would lose over this blog post. I thought about getting harassed and threatened on social media. I thought about a week of cringing whenever I check my email. Then I thought about the people who might attend Systems We Love: young folks, new developers, a trans woman at her first computing event since coming out – people who are looking for a friendly and supportive place to talk about systems at the beginning of their careers. I thought about them being deeply hurt and possibly discouraged for life from a field that gave me so much joy.

    Come at me, Bryan.

    Note: comments are now closed on this post. You can read and possibly comment on the follow-up post, When is naming abuse itself abusive?


    Tagged: conferences, feminism, kernel

    October 25, 2016 03:24 AM

    October 24, 2016

    LPC 2016: Things to remember about the altitude in Santa Fe

    Santa Fe is at an altitude of 7,200 feet (2,200m). There are a few things that attendees who are not used to higher altitudes may want to bear in mind:

    October 24, 2016 12:36 AM

    October 23, 2016

    James Bottomley: Home Automation: Coping with Insecurity in the IoT

    Reading Matthew Garret’s exposés of home automation IoT devices makes most engineers think “hell no!” or “over my dead body!”.  However, there’s also the siren lure that the ability to program your home, or update its settings from anywhere in the world is phenomenally useful:  for instance, the outside lights in my house used to depend on two timers (located about 50m from each other).  They were old, loud (to the point the neighbours used to wonder what the buzzing was when they visited) and almost always wrongly set for turning the lights on at sunset.  The final precipitating factor for me was the need to replace our thermostat, whose thermistor got so eccentric it started cooling in winter; so away went all the timers and their loud noises and in came a z-wave based home automation system, and the guilty pleasure of having an IoT based home automation system.  Now the lights precisely and quietly turn on at sunset and off at 23:00 (adjusting themselves for daylight savings); the thermostat is accessible from my phone, meaning I can adjust it from wherever I happen to be (including Hong Kong airport when I realised I’d forgotten to set it to energy saving mode before we went on holiday).  Finally, there’s waking up at 3am to realise your wife has fallen asleep over her book again and being able to turn off her reading light from your alarm clock without having to get out of bed … Automation bliss!

    We all want the convenience; the trick is to work around the rampant insecurity that comes with today’s IoT to avoid your home automation system being part of the DDoS bot net that brings down the internet.

    Selecting your network

    For me, nothing IP/Wifi based was an option, partly due to Matthew’s blog and partly because my home Wifi network looks different from everyone else’s: I actually run an internal, secure, home network that is wired and have my Wifi sit unsecured and outside the firewall.  This goes back to the good old days of expecting to find wifi wherever you travelled and returning the courtesy by ensuring your wifi was accessible, but it does mean that any wifi connected device would be outside my firewall and open to all, which, given the general insecurity of the devices, makes this a non-starter.

    The next level down is to use a private network, like zigbee or z-wave.  I chose z-wave because it covers longer distances (which I need) and it doesn’t interfere with wifi (I have a hard time covering the entire house, even with two wifi access points).  Z-wave also looks secure, but, if you dig deeply, you find that there are flaws in the protocol that lay you open to a local attacker.  This, by the way, shows the futility of demanding security from IoT vendors who really don’t understand how to do it: a flawed security implementation is pretty much as bad as no security at all.

    Once this decision is made, the next is to choose a gateway to the internet that does what you want, namely give you remote control without giving up your security.

    Gateway Phone Home?

    A surprising number of z-wave controllers are of the phone home type (this means phone their manufacturer’s home, not you), and almost all of these simply won’t work if they’re not allowed to phone home.  Google comprehensively demonstrated the issues this raises with Nest: lots of early adopters now have so much non-functional junk.

    For me, there was also the burned hand experience with Google services: whenever I travel, I invariably get locked out because of some pseudo-security issue and it takes a fight to get back in again. This ultimately precipitated my move away from the Google cloud and on to Owncloud for calendar and contacts, but also means I really don’t want to have to trust another external service for my home automation.

    Given the significantly limited choice of non-phone home z-wave controllers, I chose the HomeSeer Zee S2.  It’s basically a Raspberry Pi with a z-wave dongle and Linux.  If you’re into Linux on everything, you should be aware that the home automation system is actually written in .net and it uses mono to bridge the gap; an odd choice given that there’s no known Windows platform that could actually possibly run this system.

    Secure Internet based Automation

    The ZS2 does actually come with wifi, but given my already listed wifi problems, it’s actually plugged into my secure wired network with all phone home capabilities disabled.  Great, but that means it’s only accessible over a VPN and I want to be able to control it from things like my phone, where running a VPN is cumbersome, so let’s do some magic tricks to make it securely accessible by any member of the family from any device.

    Obviously, since I already run Owncloud, I have a server of my own in a co-located site.  It’s this server I propose to use as my secure gateway.  The obvious way of doing this is simply proxying the ZS2 controller web page, but there are a couple of problems: firstly if I do it globally the ZS2 will be visible to port scans and secondly it only actually has an unencrypted web page with http authentication, meaning the login credentials would go over the internet in clear text … oops!

    The solution to the first of these is to make the web page only accessible to authenticated devices.  My current method is to use firewall whitelisting and a hook to an existing service authentication to open up the port.  So in the firewall mangle table, all the ports which require whitelisting are marked.  Then, in the input firewall, any packet so marked is checked against the whitelist for a matching source IP.  If a match is found, then the packet is permitted, otherwise it is denied.
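    In iptables terms, the rules look roughly like this (the port, mark value and timeout here are placeholders, not my exact configuration):

    iptables -t mangle -A PREROUTING -p tcp --dport 8443 -j MARK --set-mark 0x100
    iptables -A INPUT -m mark --mark 0x100 -m recent --rcheck --name whitelist --seconds 1800 -j ACCEPT
    iptables -A INPUT -m mark --mark 0x100 -j DROP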

    Whitelisting itself is done by a simple pam script

    #!/usr/bin/perl
    # pam_exec session script: whitelist the address of any remote
    # host that has just authenticated successfully.
    use Socket;
    
    # the xt_recent list consulted by the firewall input rules
    $xt_file = '/proc/net/xt_recent/whitelist';
    
    # PAM_RHOST may be either a dotted-quad address or a hostname
    $name = $ENV{'PAM_RHOST'};
    if ($name =~ m/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
     $addr = $name;
    } else {
     $ip = gethostbyname($name);
     $addr = inet_ntoa($ip);
    }
    
    # writing "+<addr>" adds the address to the xt_recent list
    open(FD, ">$xt_file");
    print FD "+$addr\n";
    close(FD);
    
    exit 0

    And this script is executed from the dovecot pam file as

    # add session to cause ip address of successful login to be whitelisted
    session optional pam_exec.so /etc/pam.d/whitelist.pl

    This means that any IP address that gets an authenticated imap connection (which is basically everybody’s internet device, since they all connect to email) is now allowed to access the authenticated ports.  Since imap requires re-authentication after a configurable timeout, the whitelist entry only lasts for just over that timeout and hey presto, we have our secured port system.

    Obviously, this isn’t foolproof: in particular whitelisting by external IP means that anyone sharing the same ip address via nat (like at a hotel) also has access to the secured ports, but it does cut down enormously on generic internet visibility.

    The final thing is to add security to the insecure web page, so anyone in the path to my internet host can’t sniff the password.  This is easily achieved by an stunnel redirect from the secure incoming port to the ZS2 over the VPN that connects to the internal network.  The beauty of this is that stunnel can now use the existing web certificate for my internet host to afford protection from man in the middle attacks as well.
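    The stunnel service definition for this is only a few lines; schematically (the port, addresses and paths are placeholders):

    [zs2]
    accept  = 443
    connect = 10.0.0.10:80
    cert = /etc/ssl/certs/myhost.pem
    key = /etc/ssl/private/myhost.key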

    Last thoughts about Security

    Obviously, the security above isn’t perfect.  Anyone sharing my external IP would be able to run a port scan and (if they’re clever) detect the https port the ZS2 is on.  However, it does require a lot of luck to do this and, obviously, even if they’re in the fortunate position of sharing an IP address, I’ve changed the default password, so the recent Mirai attack wouldn’t have been able to compromise the device.

    Do I think this is good enough security: absolutely.  In security, the bear principle applies: in that when escaping from a ravenous bear, you don’t have to be able to run faster than the bear itself, you merely need to be able to run faster than the slowest other potential food source …  In internet terms, this means that while there are so many completely insecure devices out there, no-one can be bothered to hack a moderately secure system like mine because the customisation makes it quite a bit harder.  It’s also instructive to think that the bear principle is why Linux has such a security reputation: it’s not that we have perfect security against virus and trojan systems, it’s just that Windows was always so much worse …

    Eventually, something like Mirai will look to attack the ZS2 web server itself (it is .net based, after all) rather than simply try a list of default passwords and then I’ll need to be a bit more clever, but while everyone else is so much more insecure, that day will be long delayed.

    October 23, 2016 07:20 PM

    Pete Zaitcev: FAA proposes to ban NavWorx

    Seen a curious piece of news today. As a short preamble, an aircraft in the U.S. may receive useful information from a ground station (TIS-B and FIS-B), but it has to transmit a certain ADS-B packet for that to happen. And all ADS-B packets include a field that specifies the system's claim that it operates according to a certain level of precision and integrity. The idea is, roughly, that if you detect that e.g. one of your redundant GPS receivers is off-line, you should broadcast that you're downgraded. The protocol field is called SIL. The maximum level you can claim is determined by how crazily redundant and paranoid your design is. We are talking something in the order of $20,000 worth of cost, most of which is amortization of FAA certification paperwork, and for that you are entitled to claim a SIL of 2. I lied about this explanation being short, BTW.

    So, apparently, NavWorx shipped cheap ADS-B boxes, which were made with a Raspberry Pi and a cellphone GPS chip (or such). They honestly transmitted a SIL of 0. Who cares, right? Well, FAA decided that TIS should stop replying to airplanes flying around with SIL Zero ADS-B boxes, because fuck the citizens, they should pay their $20k. Pilots called NavWorx and complained that their iPads hooked to the ADS600 no longer displayed the weather reliably. NavWorx issued a software update that programmed their boxes to transmit a SIL of 2. No other change: the actual transmitted positions remained exactly as before, only the claimed reliability was faked. When FAA got wind of this happening, they went nuclear on NavWorx users' asses. The proposed emergency directive orders owners to remove the offending equipment from their aircraft. They are grounded until they comply.

    Now the good thing is, the ADS-B mandate comes in 2020. They still have 3 years to find a more compliant (and expensive) supplier before they are prohibited from the vicinity of major cities. So it's only money.

    I don't have a dog in this fight, personally, so I can sympathize with both the bureaucrats who saw cheaters and threw a book at them, and the company that employed a workaround against a meaningless and capricious rule. However, here's a couple of observations.

    First, note how FAA maintains a database of individual (not aggregate) protocol compliance for each ADS-B ID. They will even helpfully send you a report about what they know about you (it's intended so you can test the performance of your ADS-B equipment). Imagine if the government saved every query that your browser made, and could tell if your Chrome were not compliant with a certain RFC. This detailed tracking of everything is actually very necessary because the protocol has no encryption whatsoever and is trivially spoofed. Nothing stops a bad actor from using your ID in ADS-B. The only recourse is for the government to investigate reported issues and find the culprit. And they need the absolute tracking for it.

    Second, about the 2020 mandate. The airspace prohibition amounts to not letting someone into a city if the battery is flat in their EZ-pass transponder. Only in this case, the government sent you a letter saying that your transponder is banned, and you must buy a new one before you can get to work. In theory, your freedom of travel is not limited - you can take a bus. In practice though, not everyone has $20k, and the waiting list for the installer is 6 months.

    October 23, 2016 04:27 AM

    October 22, 2016

    Matthew Garrett: Microsoft aren't forcing Lenovo to block free operating systems

    Update: Patches to fix this have been posted

    There's a story going round that Lenovo have signed an agreement with Microsoft that prevents installing free operating systems. This is sensationalist, untrue and distracts from a genuine problem.

    The background is straightforward. Intel platforms allow the storage to be configured in two different ways - "standard" (normal AHCI on SATA systems, normal NVMe on NVMe systems) or "RAID". "RAID" mode is typically just changing the PCI IDs so that the normal drivers won't bind, ensuring that drivers that support the software RAID mode are used. Intel have not submitted any patches to Linux to support the "RAID" mode.
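    You can check which mode the firmware has selected from any live Linux environment by looking at the controller's PCI ID; an illustrative example (the exact strings and IDs vary by platform):

    lspci -nn | grep -i 'sata\|raid'
    00:17.0 RAID bus controller [0104]: Intel Corporation SATA Controller [RAID mode] [8086:2822]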

    In this specific case, Lenovo's firmware defaults to "RAID" mode and doesn't allow you to change that. Since Linux has no support for the hardware when configured this way, you can't install Linux (distribution installers will boot, but won't find any storage device to install the OS to).

    Why would Lenovo do this? I don't know for sure, but it's potentially related to something I've written about before - recent Intel hardware needs special setup for good power management. The storage driver that Microsoft ship doesn't do that setup. The Intel-provided driver does. "RAID" mode prevents the Microsoft driver from binding and forces the user to use the Intel driver, which means they get the correct power management configuration, battery life is better and the machine doesn't melt.

    (Why not offer the option to disable it? A user who does would end up with a machine that doesn't boot, and if they managed to figure that out they'd have worse power management. That increases support costs. For a consumer device, why would you want to? The number of people buying these laptops to run anything other than Windows is minuscule)

    Things are somewhat obfuscated due to a statement from a Lenovo rep: "This system has a Signature Edition of Windows 10 Home installed. It is locked per our agreement with Microsoft." It's unclear what this is meant to mean. Microsoft could be insisting that Signature Edition systems ship in "RAID" mode in order to ensure that users get a good power management experience. Or it could be a misunderstanding regarding UEFI Secure Boot - Microsoft do require that Secure Boot be enabled on all Windows 10 systems, but (a) the user must be able to manage the key database and (b) there are several free operating systems that support UEFI Secure Boot and have appropriate signatures. Neither interpretation indicates that there's a deliberate attempt to prevent users from installing their choice of operating system.

    The real problem here is that Intel do very little to ensure that free operating systems work well on their consumer hardware - we still have no information from Intel on how to configure systems to ensure good power management, we have no support for storage devices in "RAID" mode and we have no indication that this is going to get better in future. If Intel had provided that support, this issue would never have occurred. Rather than be angry at Lenovo, let's put pressure on Intel to provide support for their hardware.


    October 22, 2016 05:51 AM

    Matthew Garrett: Fixing the IoT isn't going to be easy

    A large part of the internet became inaccessible today after a botnet made up of IP cameras and digital video recorders was used to DoS a major DNS provider. This highlighted a bunch of things including how maybe having all your DNS handled by a single provider is not the best of plans, but in the long run there's no real amount of diversification that can fix this - malicious actors have control of a sufficiently large number of hosts that they could easily take out multiple providers simultaneously.

    To fix this properly we need to get rid of the compromised systems. The question is how. Many of these devices are sold by resellers who have no resources to handle any kind of recall. The manufacturer may not have any kind of legal presence in many of the countries where their products are sold. There's no way anybody can compel a recall, and even if they could it probably wouldn't help. If I've paid a contractor to install a security camera in my office, and if I get a notification that my camera is being used to take down Twitter, what do I do? Pay someone to come and take the camera down again, wait for a fixed one and pay to get that put up? That's probably not going to happen. As long as the device carries on working, many users are going to ignore any voluntary request.

    We're left with more aggressive remedies. If ISPs threaten to cut off customers who host compromised devices, we might get somewhere. But, inevitably, a number of small businesses and unskilled users will get cut off. Probably a large number. The economic damage is still going to be significant. And it doesn't necessarily help that much - if the US were to compel ISPs to do this, but nobody else did, public outcry would be massive, the botnet would not be much smaller and the attacks would continue. Do we start cutting off countries that fail to police their internet?

    Ok, so maybe we just chalk this one up as a loss and have everyone build out enough infrastructure that we're able to withstand attacks from this botnet and take steps to ensure that nobody is ever able to build a bigger one. To do that, we'd need to ensure that all IoT devices are secure, all the time. So, uh, how do we do that?

    These devices had trivial vulnerabilities in the form of hardcoded passwords and open telnet. It wouldn't take terribly strong skills to identify this at import time and block a shipment, so the "obvious" answer is to set up forces in customs who do a security analysis of each device. We'll ignore the fact that this would be a pretty huge set of people to keep up with the sheer quantity of crap being developed and skip straight to the explanation for why this wouldn't work.

    Yeah, sure, this vulnerability was obvious. But what about the product from a well-known vendor that included a debug app listening on a high numbered UDP port that accepted a packet of the form "BackdoorPacketCmdLine_Req" and then executed the rest of the payload as root? A portscan's not going to show that up[1]. Finding this kind of thing involves pulling the device apart, dumping the firmware and reverse engineering the binaries. It typically takes me about a day to do that. Amazon has over 30,000 listings that match "IP camera" right now, so you're going to need 99 more of me and a year just to examine the cameras. And that's assuming nobody ships any new ones.

    Even that's insufficient. Ok, with luck we've identified all the cases where the vendor has left an explicit backdoor in the code[2]. But these devices are still running software that's going to be full of bugs and which is almost certainly still vulnerable to at least half a dozen buffer overflows[3]. Who's going to audit that? All it takes is one attacker to find one flaw in one popular device line, and that's another botnet built.

    If we can't stop the vulnerabilities getting into people's homes in the first place, can we at least fix them afterwards? From an economic perspective, demanding that vendors ship security updates whenever a vulnerability is discovered no matter how old the device is is just not going to work. Many of these vendors are small enough that it'd be more cost effective for them to simply fold the company and reopen under a new name than it would be to put the engineering work into fixing a decade old codebase. And how does this actually help? So far the attackers building these networks haven't been terribly competent. The first thing a competent attacker would do would be to silently disable the firmware update mechanism.

    We can't easily fix the already broken devices, we can't easily stop more broken devices from being shipped and we can't easily guarantee that we can fix future devices that end up broken. The only solution I see working at all is to require ISPs to cut people off, and that's going to involve a great deal of pain. The harsh reality is that this is almost certainly just the tip of the iceberg, and things are going to get much worse before they get any better.

    Right. I'm off to portscan another smart socket.

    [1] UDP connection refused messages are typically ratelimited to one per second, so it'll take almost a day to do a full UDP portscan, and even then you have no idea what the service actually does.

    [2] It's worth noting that this is usually leftover test or debug code, not an overtly malicious act. Vendors should have processes in place to ensure that this isn't left in release builds, but ah well.

[3] My vacuum cleaner crashes if I send certain malformed HTTP requests to the local API endpoint, which isn't a good sign.


    October 22, 2016 05:14 AM

    October 20, 2016

    Kees Cook: CVE-2016-5195

    My prior post showed my research from earlier in the year at the 2016 Linux Security Summit on kernel security flaw lifetimes. Now that CVE-2016-5195 is public, here are updated graphs and statistics. Due to their rarity, the Critical bug average has now jumped from 3.3 years to 5.2 years. There aren’t many, but, as I mentioned, they still exist, whether you know about them or not. CVE-2016-5195 was sitting on everyone’s machine when I gave my LSS talk, and there are still other flaws on all our Linux machines right now. (And, I should note, this problem is not unique to Linux.) Dealing with knowing that there are always going to be bugs present requires proactive kernel self-protection (to minimize the effects of possible flaws) and vendors dedicated to updating their devices regularly and quickly (to keep the exposure window minimized once a flaw is widely known).

    So, here are the graphs updated for the 668 CVEs known today:

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    October 20, 2016 11:02 PM

    October 19, 2016

    Kees Cook: Security bug lifetime

    In several of my recent presentations, I’ve discussed the lifetime of security flaws in the Linux kernel. Jon Corbet did an analysis in 2010, and found that security bugs appeared to have roughly a 5 year lifetime. As in, the flaw gets introduced in a Linux release, and then goes unnoticed by upstream developers until another release 5 years later, on average. I updated this research for 2011 through 2016, and used the Ubuntu Security Team’s CVE Tracker to assist in the process. The Ubuntu kernel team already does the hard work of trying to identify when flaws were introduced in the kernel, so I didn’t have to re-do this for the 557 kernel CVEs since 2011.

    As the README details, the raw CVE data is spread across the active/, retired/, and ignored/ directories. By scanning through the CVE files to find any that contain the line “Patches_linux:”, I can extract the details on when a flaw was introduced and when it was fixed. For example CVE-2016-0728 shows:

    Patches_linux:
     break-fix: 3a50597de8635cd05133bd12c95681c82fe7b878 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2
    

    This means that CVE-2016-0728 is believed to have been introduced by commit 3a50597de8635cd05133bd12c95681c82fe7b878 and fixed by commit 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2. If there are multiple lines, then there may be multiple SHAs identified as contributing to the flaw or the fix. And a “-” is just short-hand for the start of Linux git history.

    Then for each SHA, I queried git to find its corresponding release, and made a mapping of release version to release date, wrote out the raw data, and rendered graphs. Each vertical line shows a given CVE from when it was introduced to when it was fixed. Red is “Critical”, orange is “High”, blue is “Medium”, and black is “Low”:

    CVE lifetimes 2011-2016

    And here it is zoomed in to just Critical and High:

    Critical and High CVE lifetimes 2011-2016

    The line in the middle is the date from which I started the CVE search (2011). The vertical axis is actually linear time, but it’s labeled with kernel releases (which are pretty regular). The numerical summary is:

    This comes out to roughly 5 years lifetime again, so not much has changed from Jon’s 2010 analysis.

While we’re getting better at fixing bugs, we’re also adding more bugs. And for many devices that have been built on a given kernel version, there haven’t been frequent (or sometimes any) security updates, so the bug lifetime for those devices is even longer. To really create a safe kernel, we need to get proactive about self-protection technologies. The systems using a Linux kernel are right now running with security flaws. Those flaws are just not known to the developers yet, but they’re likely known to attackers, as there have been prior boasts/gray-market advertisements for at least CVE-2010-3081 and CVE-2013-2888.

    (Edit: see my updated graphs that include CVE-2016-5195.)

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    October 19, 2016 04:46 AM

    October 18, 2016

    LPC 2016: Hotel Blocks now Expired

    All our block bookings at the hotels have now expired.  However, if you still haven’t booked a hotel, it may still be possible for the Linux Foundation to get you a room at one of them at the conference rate (availability permitting).  Please email mphillips@linuxfoundation.org if you are interested in this option.

    October 18, 2016 08:06 PM

    Gustavo F. Padovan: Mainline Explicit Fencing – part 2

In the first post we covered the main concepts behind Explicit Synchronization for the Linux Kernel. Now in the second post of the series we are going to look at the Android Sync Framework, the first (out-of-tree) Explicit Fencing implementation for the Linux Kernel.

The Sync Framework was the Android solution to implement Explicit Fencing in AOSP. It uses file descriptors to communicate fencing information between userspace and the kernel, and between userspace processes.

In the Sync Framework it all starts with the creation of a Sync Timeline, a struct created for each driver context to represent a monotonically increasing counter. It is the Sync Timeline that guarantees ordering between fences on the same Timeline. The driver contexts could be different GPU rings, or different displays on your hardware.

Sync Timeline

Then we have Sync Points (sync_pt), the name Android gave to fences; they represent a specific value on the Sync Timeline. When created, a Sync Point is initialized in the Active state, and when it signals, i.e., when the job it was associated with finishes, it transitions to the Signaled state and informs the Sync Timeline to update the value of the last signaled Sync Point.

Sync Point

To export and import Sync Points to/from userspace, the Sync Fence struct is used. Under the hood the Sync Fence is a Linux file, and the Sync Fence is used to store Sync Point information. To export it to userspace, an unused file descriptor (fd) is associated with the Sync Fence file. Drivers can then use the file descriptor to pass the Sync Point information around.

Sync Fence

The Sync Fence is usually created just after the Sync Point; it then travels through the pipeline, via userspace, until it reaches the driver that is going to wait for the Sync Fence to signal. The Sync Fence signals when the Sync Point inside it signals.

One of the most important features of the Android Sync Framework is the ability to merge Sync Fences into a new Sync Fence containing all the Sync Points from both. It can contain as many Sync Points as your resources allow. A merged Sync Fence will only signal when all of its Sync Points signal.

Sync Fence with Merged Fences. Here we merge two Sync Points into one Sync File.

When it comes to the userspace API, the Sync Framework implements three ioctl calls. The first one waits on a sync_fence to signal. There is also a call to merge two sync_fences into a third, new sync_fence. And finally there is a call to grab information about a sync_fence and all its sync_points.

The Sync Fence fds are passed to/from the kernel in the calls that ask the kernel to render or display a buffer.
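As a rough illustration of the merge and wait calls from userspace, here is a sketch in C. The struct layout and ioctl numbers mirror the old staging header (drivers/staging/android/uapi/sync.h), but treat this as illustrative rather than a reference; the helper names are made up.

    /* Sketch only: merging and waiting on Sync Fence fds with the
     * legacy Android sync ABI. */
    #include <linux/ioctl.h>
    #include <linux/types.h>
    #include <string.h>
    #include <sys/ioctl.h>

    struct sync_merge_data {
            __s32 fd2;     /* fence fd to merge with the fd the ioctl targets */
            char name[32]; /* name for the new merged fence */
            __s32 fence;   /* out: fd of the newly created merged fence */
    };

    #define SYNC_IOC_MAGIC '>'
    #define SYNC_IOC_WAIT  _IOW(SYNC_IOC_MAGIC, 0, __s32)
    #define SYNC_IOC_MERGE _IOWR(SYNC_IOC_MAGIC, 1, struct sync_merge_data)

    int merge_fences(int fd_a, int fd_b)
    {
            struct sync_merge_data data;

            memset(&data, 0, sizeof(data));
            data.fd2 = fd_b;
            strncpy(data.name, "merged", sizeof(data.name) - 1);

            if (ioctl(fd_a, SYNC_IOC_MERGE, &data) < 0)
                    return -1;
            return data.fence; /* signals only once both inputs have signaled */
    }

    int wait_fence(int fd, __s32 timeout_ms)
    {
            /* blocks until the fence signals or the timeout expires */
            return ioctl(fd, SYNC_IOC_WAIT, &timeout_ms);
    }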

This was intended to be an overview of the Sync Framework, as we will see some of these concepts in the next article, where we will talk about the effort to add explicit fencing to the mainline kernel. If you want to learn more about the Sync Framework you can find more info here and here.

    October 18, 2016 06:16 PM

    October 08, 2016

    Michael Kerrisk (manpages): man-pages-4.07 is released

    I've released man-pages-4.07. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from around 50 contributors. The release includes changes to over 140 man pages. Among the more significant changes in man-pages-4.07 are the following:

    October 08, 2016 12:20 PM

    Michael Kerrisk (manpages): man-pages-4.06 is released

    I've released man-pages-4.06. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from around 20 contributors. The release includes changes to just over 40 man pages. Among the more significant changes in man-pages-4.06 are the following:

    October 08, 2016 12:20 PM

    Michael Kerrisk (manpages): man-pages-4.05 is released

    I've released man-pages-4.05. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from more than 70 contributors. The release includes changes to more than 400 man pages. Among the more significant changes in man-pages-4.05 are the following:

    October 08, 2016 12:20 PM

    Michael Kerrisk (manpages): man-pages-4.08 is released

    I've released man-pages-4.08. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release includes changes to nearly 200 man pages. Among the more significant changes in man-pages-4.08 are the following:

    October 08, 2016 12:19 PM

    October 07, 2016

    Daniel Vetter: Neat drm/i915 Stuff for 4.8

I procrastinated rather badly on this one, so instead of this landing right after the previous kernel release, the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report.

    Since I’m this late I figured instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

    Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community the mid-layer mistake, or helper library design pattern (see the linked article from LWN), is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there's a midlayer, it will get in the way.

Due to the shared history with BSD kernels, DRM originally had a full-blown midlayer, but over time this has been fixed. For example, kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

    Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on that until the interrupt handler wakes them up. The trouble now is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they needed to wait for completed, and if not, they blocked on the wait queue again until the next batch job completed. That’s all rather inefficient. On top of that there’s only one per-engine knob to enable interrupts, which means even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in-flight.

    In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what’s going on. On top there’s now also an efficient search structure of all current waiting processes. With that the interrupt handler can quickly check whether the just completed GPU job is of interest, and if so, which exact process should be woken up.
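A minimal sketch of the difference, using generic kernel primitives rather than the actual i915 code (the struct and helpers here are hypothetical):

    /* Illustrative only -- not the i915 implementation. */
    #include <linux/wait.h>

    struct gpu_request {
            wait_queue_head_t waiters;      /* per-request wait queue */
            bool completed;
    };

    /* Old shape: one wait queue per engine. Every completion interrupt
     * wakes every waiter; each re-checks its own request and usually
     * goes back to sleep. */
    static void wait_for_request_old(wait_queue_head_t *engine_wq,
                                     struct gpu_request *rq)
    {
            wait_event(*engine_wq, READ_ONCE(rq->completed));
    }

    /* New shape: waiters are tracked per request, so the interrupt
     * handler wakes only the processes waiting on the job that just
     * finished. */
    static void engine_irq_complete(struct gpu_request *rq)
    {
            WRITE_ONCE(rq->completed, true);
            wake_up(&rq->waiters);          /* only rq's waiters wake */
    }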

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation for implementing proper GPU scheduling on top of it. And it’s also needed to interface the completion tracking with other drivers, to finally fix tearing on multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.

    October 07, 2016 12:00 PM

    October 06, 2016

    James Morris: LinuxCon Europe Kernel Security Slides

    Yesterday I gave an update on the Linux kernel security subsystem at LinuxCon Europe, in Berlin.

    The slides are available here: http://namei.org/presentations/linux_kernel_security_linuxconeu2016.pdf

    The talk began with a brief overview and history of the Linux kernel security subsystem, and then I provided an update on significant changes in the v4 kernel series, up to v4.8.  Some expected upcoming features were also covered.  Skip to slide 31 if you just want to see the changes.  There are quite a few!

It’s my first visit to Berlin, and it’s been fascinating to see the remnants of the Cold War, which dominated life in the 1980s when I was at school, but which also seemed so impossibly far from Australia.

    Brandenburg Gate, Berlin. Unity Day 2016.

    I hope to visit again with more time to explore.

    October 06, 2016 12:56 PM

    Pavel Machek: FlightGame

FlightGear is a very nice simulator, but it is not a lot of fun: the page with "places to fly" helps. But when you set up your flight details, including weather and failures, you can kind of expect what is going to happen. FlightGame was designed to address this (not for me, unfortunately, although... if you have ever debugged a piece of software you know unexpected things happen): levels are prepared to be interesting, yet they try to provide enough information so that you don't need to study maps and aircraft specifications before the flight.

Don't expect anything great or too complex; this is just Python getting data from gpsd and causing your aircraft problems over an internal webserver. But it still should be fun.

Code is at https://gitlab.com/tui/tui/tree/master/fgame. I guess I should really create a better README.

Who wants to play?

    October 06, 2016 08:56 AM

    October 05, 2016

    Kees Cook: security things in Linux v4.8

    Previously: v4.7. Here are a bunch of security things I’m excited about in Linux v4.8:

    SLUB freelist ASLR

    Thomas Garnier continued his freelist randomization work by adding SLUB support.

    x86_64 KASLR text base offset physical/virtual decoupling

    On x86_64, to implement the KASLR text base offset, the physical memory location of the kernel was randomized, which resulted in the virtual address being offset as well. Due to how the kernel’s “-2GB” addressing works (gcc‘s “-mcmodel=kernel“), it wasn’t possible to randomize the physical location beyond the 2GB limit, leaving any additional physical memory unused as a randomization target. In order to decouple the physical and virtual location of the kernel (to make physical address exposures less valuable to attackers), the physical location of the kernel needed to be randomized separately from the virtual location. This required a lot of work for handling very large addresses spanning terabytes of address space. Yinghai Lu, Baoquan He, and I landed a series of patches that ultimately did this (and in the process fixed some other bugs too). This expands the physical offset entropy to roughly $physical_memory_size_of_system / 2MB bits.

    x86_64 KASLR memory base offset

Thomas Garnier rolled out KASLR to the kernel’s various statically located memory ranges, randomizing their locations with CONFIG_RANDOMIZE_MEMORY. One of the more notable things randomized is the physical memory mapping, which is a known target for attacks. Also randomized is the vmalloc area, which means targets vmalloced during boot (which tend to always end up in the same location on a given system) are now harder to locate. (The vmemmap region randomization accidentally missed the v4.8 window and will appear in v4.9.)

    x86_64 KASLR with hibernation

    Rafael Wysocki (with Thomas Garnier, Borislav Petkov, Yinghai Lu, Logan Gunthorpe, and myself) worked on a number of fixes to hibernation code that, even without KASLR, were coincidentally exposed by the earlier W^X fix. With that original problem fixed, then memory KASLR exposed more problems. I’m very grateful everyone was able to help out fixing these, especially Rafael and Thomas. It’s a hard place to debug. The bottom line, now, is that hibernation and KASLR are no longer mutually exclusive.

    gcc plugin infrastructure

    Emese Revfy ported the PaX/Grsecurity gcc plugin infrastructure to upstream. If you want to perform compiler-based magic on kernel builds, now it’s much easier with CONFIG_GCC_PLUGINS! The plugins live in scripts/gcc-plugins/. Current plugins are a short example called “Cyclic Complexity” which just emits the complexity of functions as they’re compiled, and “Sanitizer Coverage” which provides the same functionality as gcc’s recent “-fsanitize-coverage=trace-pc” but back through gcc 4.5. Another notable detail about this work is that it was the first Linux kernel security work funded by Linux Foundation’s Core Infrastructure Initiative. I’m looking forward to more plugins!

    If you’re on Debian or Ubuntu, the required gcc plugin headers are available via the gcc-$N-plugin-dev package (and similarly for all cross-compiler packages).
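For a feel of the shape of a plugin, here is a minimal skeleton (illustrative only; the callback does nothing, and it assumes the plugin headers from the packages above):

    /* Minimal gcc plugin skeleton -- illustrative only. */
    #include "gcc-plugin.h"
    #include "plugin-version.h"

    int plugin_is_GPL_compatible;   /* required, or gcc refuses to load us */

    static void on_finish(void *gcc_data, void *user_data)
    {
            /* runs once when compilation finishes; real plugins register
             * passes or instrument the code being compiled instead */
    }

    int plugin_init(struct plugin_name_args *info,
                    struct plugin_gcc_version *version)
    {
            if (!plugin_default_version_check(version, &gcc_version))
                    return 1;       /* built against a different gcc */
            register_callback(info->base_name, PLUGIN_FINISH, on_finish, NULL);
            return 0;
    }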

    hardened usercopy

Along with work from Rik van Riel, Laura Abbott, Casey Schaufler, and many other folks doing testing on the KSPP mailing list, I ported part of PAX_USERCOPY (the basic runtime bounds checking) to upstream as CONFIG_HARDENED_USERCOPY. One of the interface boundaries between the kernel and user-space is the copy_to_user()/copy_from_user() family of functions. Frequently, the size of a copy is known at compile-time (“built-in constant”), so there’s not much benefit in checking those sizes (hardened usercopy avoids these cases). In the case of dynamic sizes, hardened usercopy checks for 3 areas of memory: slab allocations, stack allocations, and kernel text. Direct kernel text copying is simply disallowed. Stack copying is allowed as long as it is entirely contained by the current stack memory range (and on x86, only if it does not include the saved stack frame and instruction pointers). For slab allocations (e.g. those allocated through kmem_cache_alloc() and the kmalloc()-family of functions), the copy size is compared against the size of the object being copied. For example, if copy_from_user() is writing to a structure that was allocated as size 64, but the copy gets tricked into trying to write 65 bytes, hardened usercopy will catch it and kill the process.
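Conceptually, the dynamic-size slab check looks something like this (a sketch, not the upstream code; object_start(), object_size(), and usercopy_abort() are hypothetical stand-ins for the slab allocator's metadata lookups and the reporting path):

    const void *object_start(const void *ptr);      /* hypothetical */
    unsigned long object_size(const void *ptr);     /* hypothetical */
    void usercopy_abort(void);                      /* hypothetical: kills the process */

    static void check_heap_object(const void *ptr, unsigned long n)
    {
            const char *start = object_start(ptr);
            unsigned long size = object_size(ptr);

            /* e.g. a 65-byte copy targeting a 64-byte kmalloc object fails */
            if ((const char *)ptr < start ||
                (const char *)ptr + n > start + size)
                    usercopy_abort();
    }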

    For testing hardened usercopy, lkdtm gained several new tests: USERCOPY_HEAP_SIZE_TO, USERCOPY_HEAP_SIZE_FROM, USERCOPY_STACK_FRAME_TO,
USERCOPY_STACK_FRAME_FROM, USERCOPY_STACK_BEYOND, and USERCOPY_KERNEL. Additionally, USERCOPY_HEAP_FLAG_TO and USERCOPY_HEAP_FLAG_FROM were added to test what will be coming next for hardened usercopy: flagging slab memory as “safe for copy to/from user-space”, effectively whitelisting certain slab caches, as done by PAX_USERCOPY. This further reduces the scope of what’s allowed to be copied to/from, since most kernel memory is not intended to ever be exposed to user-space. Adding this logic will require some reorganization of usercopy code to add some new APIs, as PAX_USERCOPY’s approach to handling special-cases is to add bounce-copies (copy from slab to stack, then copy to userspace) as needed, which is unlikely to be acceptable upstream.

    seccomp reordered after ptrace

    By its original design, seccomp filtering happened before ptrace so that seccomp-based ptracers (i.e. SECCOMP_RET_TRACE) could explicitly bypass seccomp filtering and force a desired syscall. Nothing actually used this feature, and as it turns out, it’s not compatible with process launchers that install seccomp filters (e.g. systemd, lxc) since as long as the ptrace and fork syscalls are allowed (and fork is needed for any sensible container environment), a process could spawn a tracer to help bypass a filter by injecting syscalls. After Andy Lutomirski convinced me that ordering ptrace first does not change the attack surface of a running process (unless all syscalls are blacklisted, the entire ptrace attack surface will always be exposed), I rearranged things. Now there is no (expected) way to bypass seccomp filters, and containers with seccomp filters can allow ptrace again.

    That’s it for v4.8! The merge window is open for v4.9…

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    October 05, 2016 12:26 AM

    October 03, 2016

    Matthew Garrett: The importance of paying attention in building community trust

    Trust is important in any kind of interpersonal relationship. It's inevitable that there will be cases where something you do will irritate or upset others, even if only to a small degree. Handling small cases well helps build trust that you will do the right thing in more significant cases, whereas ignoring things that seem fairly insignificant (or saying that you'll do something about them and then failing to do so) suggests that you'll also fail when there's a major problem. Getting the small details right is a major part of creating the impression that you'll deal with significant challenges in a responsible and considerate way.

    This isn't limited to individual relationships. Something that distinguishes good customer service from bad customer service is getting the details right. There are many industries where significant failures happen infrequently, but minor ones happen a lot. Would you prefer to give your business to a company that handles those small details well (even if they're not overly annoying) or one that just tells you to deal with them?

    And the same is true of software communities. A strong and considerate response to minor bug reports makes it more likely that users will be patient with you when dealing with significant ones. Handling small patch contributions quickly makes it more likely that a submitter will be willing to do the work of making more significant contributions. These things are well understood, and most successful projects have actively worked to reduce barriers to entry and to be responsive to user requests in order to encourage participation and foster a feeling that they care.

    But what's often ignored is that this applies to other aspects of communities as well. Failing to use inclusive language may not seem like a big thing in itself, but it leaves people with the feeling that you're less likely to do anything about more egregious exclusionary behaviour. Allowing a baseline level of sexist humour gives the impression that you won't act if there are blatant displays of misogyny. The more examples of these "insignificant" issues people see, the more likely they are to choose to spend their time somewhere else, somewhere they can have faith that major issues will be handled appropriately.

    There's a more insidious aspect to this. Sometimes we can believe that we are handling minor issues appropriately, that we're acting in a way that handles people's concerns, while actually failing to do so. If someone raises a concern about an aspect of the community, it's important to discuss solutions with them. Putting effort into "solving" a problem without ensuring that the solution has the desired outcome is not only a waste of time, it alienates those affected even more - they're now not only left with the feeling that they can't trust you to respond appropriately, but that you will actively ignore their feelings in the process.

    It's not always possible to satisfy everybody's concerns. Sometimes you'll be left in situations where you have conflicting requests. In that case the best thing you can do is to explain the conflict and why you've made the choice you have, and demonstrate that you took this issue seriously rather than ignoring it. Depending on the issue, you may still alienate some number of participants, but it'll be fewer than if you just pretend that it's not actually a problem.

    One warning, though: while building trust in this way enhances people's willingness to join your community, it also builds expectations. If a significant issue does arise, and if you fail to handle it well, you'll burn a lot of that trust in the process. The fact that you've built that trust in the first place may be what saves your community from disintegrating completely, but people will feel even more betrayed if you don't actively work to rebuild it. And if there's a pattern of mishandling major problems, no amount of getting the details right will matter.

    Communities that ignore these issues are, long term, likely to end up weaker than communities that pay attention to them. Making sure you get this right in the first place, and setting expectations that you will pay attention to your contributors, is a vital part of building a meaningful relationship between your community and its members.


    October 03, 2016 05:14 PM

    Gustavo F. Padovan: Collabora Contributions to Linux Kernel 4.8

Linux Kernel 4.8 is out, and once more Collabora engineers made a significant contribution to the kernel. For 4.8 Collabora contributed 101 patches from 8 engineers, our record to date in a single kernel release! We’ve also seen the first contribution from Frederic Dalleau since he joined Collabora. LWN.net covered the new features of the new kernel in three different posts, here, here and here.

    On the Collabora side of the contributions we touched a few different areas in the kernel. Bob Ham, who recently left Collabora, added support for the Alea I Random Number Generator, while Enric Balletbo improved the audio support on the Rockchip rk3288 SoC. Frederic Dalleau fixed an important memory leak on the Bluetooth stack.

Gustavo Padovan continued his work to add Explicit Synchronization for Buffer Sharing to the kernel. In this release he added fence_array support and prepared the SW_SYNC interfaces for de-staging; SW_SYNC is meant to be used for Explicit Synchronization testing. He also worked on removing some of the legacy functions from drm_irq.c in the kernel.

    Helen Koike added some improvements and clean ups to the ASoC subsystem mainly on the max9877 and tpa6130a2 drivers. Nicolas Dufresne fixed the bytes per line calculation on YUV planes on the uvcvideo driver.

Thierry Escande added many improvements to the NFC digital layer, and Tomeu Vizoso added a new helper for the ChromeOS Embedded Controller and improved the usage of DRM Core APIs in the Rockchip driver. He also fixed an issue with the Analogix DP on Rockchip, which was not enabling clocks in the correct order.

    Bob Ham (2):

    Enric Balletbo i Serra (8):

    Frederic Dalleau (1):

    Gustavo Padovan (50):

    Helen Koike (8):

    Nicolas Dufresne (1):

    Thierry Escande (26):

    Tomeu Vizoso (5):

    October 03, 2016 01:59 PM

    Pavel Machek: Linux V4.8 on N900

    Basics work, good. GSM does not work too well, which is kind of a problem. Camera broke between 4.7 and 4.8. That is not good, either.

    If you want to talk about Linux and phones, I'll probably be on LinuxDays in Prague this weekend, and will have a talk about it at Ubucon Europe.

    October 03, 2016 11:13 AM

    Kees Cook: security things in Linux v4.7

    Previously: v4.6. Onward to security things I found interesting in Linux v4.7:

    KASLR text base offset for MIPS

    Matt Redfearn added text base address KASLR to MIPS, similar to what’s available on x86 and arm64. As done with x86, MIPS attempts to gather entropy from various build-time, run-time, and CPU locations in an effort to find reasonable sources during early-boot. MIPS doesn’t yet have anything as strong as x86′s RDRAND (though most have an instruction counter like x86′s RDTSC), but it does have the benefit of being able to use Device Tree (i.e. the “/chosen/kaslr-seed” property) like arm64 does. By my understanding, even without Device Tree, MIPS KASLR entropy should be as strong as pre-RDRAND x86 entropy, which is more than sufficient for what is, similar to x86, not a huge KASLR range anyway: default 8 bits (a span of 16MB with 64KB alignment), though CONFIG_RANDOMIZE_BASE_MAX_OFFSET can be tuned to the device’s memory, giving a maximum of 11 bits on 32-bit, and 15 bits on EVA or 64-bit.

    SLAB freelist ASLR

    Thomas Garnier added CONFIG_SLAB_FREELIST_RANDOM to make slab allocation layouts less deterministic with a per-boot randomized freelist order. This raises the bar for successful kernel slab attacks. Attackers will need to either find additional bugs to help leak slab layout information or will need to perform more complex grooming during an attack. Thomas wrote a post describing the feature in more detail here: Randomizing the Linux kernel heap freelists. (SLAB is done in v4.7, and SLUB in v4.8.)

    eBPF JIT constant blinding

    Daniel Borkmann implemented constant blinding in the eBPF JIT subsystem. With strong kernel memory protections (CONFIG_DEBUG_RODATA) in place, and with the segregation of user-space memory execution from kernel (i.e SMEP, PXN, CONFIG_CPU_SW_DOMAIN_PAN), having a place where user-space can inject content into an executable area of kernel memory becomes very high-value to an attacker. The eBPF JIT was exactly such a thing: the use of BPF constants could result in the JIT producing instruction flows that could include attacker-controlled instructions (e.g. by directing execution into the middle of an instruction with a constant that would be interpreted as a native instruction). The eBPF JIT already uses a number of other defensive tricks (e.g. random starting position), but this added randomized blinding to any BPF constants, which makes building a malicious execution path in the eBPF JIT memory much more difficult (and helps block attempts at JIT spraying to bypass other protections).
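The core idea can be sketched like this (conceptual only, not the kernel's JIT code; the jit_ctx struct and emit helpers are hypothetical):

    /* Conceptual sketch of constant blinding. Instead of embedding an
     * attacker-supplied constant K directly in the JITed instruction
     * stream, emit K ^ R plus a runtime XOR with random R. */
    #include <linux/random.h>

    static void emit_blinded_imm(struct jit_ctx *ctx, int reg, u32 K)
    {
            u32 rnd = prandom_u32();

            emit_mov_imm(ctx, reg, K ^ rnd);  /* hypothetical emitter */
            emit_xor_imm(ctx, reg, rnd);      /* reg ends up holding K, but
                                               * the raw bytes of K never
                                               * appear in the JITed stream */
    }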

    Elena Reshetova updated a 2012 proof-of-concept attack to succeed against modern kernels to help provide a working example of what needed fixing in the JIT. This serves as a thorough regression test for the protection.

The cBPF JITs that exist in ARM, MIPS, PowerPC, and Sparc still need to be updated to eBPF, but when they do, they’ll gain all these protections immediately.

    Bottom line is that if you enable the (disabled-by-default) bpf_jit_enable sysctl, be sure to set the bpf_jit_harden sysctl to 2 (to perform blinding even for root).

    fix brk ASLR weakness on arm64 compat

    There have been a few ASLR fixes recently (e.g. ET_DYN, x86 32-bit unlimited stack), and while reviewing some suggested fixes to arm64 brk ASLR code from Jon Medhurst, I noticed that arm64′s brk ASLR entropy was slightly too low (less than 1 bit) for 64-bit and noticeably lower (by 2 bits) for 32-bit compat processes when compared to native 32-bit arm. I simplified the code by using literals for the entropy. Maybe we can add a sysctl some day to control brk ASLR entropy like was done for mmap ASLR entropy.

    LoadPin LSM

    LSM stacking is well-defined since v4.2, so I finally upstreamed a “small” LSM that implements a protection I wrote for Chrome OS several years back. On systems with a static root of trust that extends to the filesystem level (e.g. Chrome OS’s coreboot+depthcharge boot firmware chaining to dm-verity, or a system booting from read-only media), it’s redundant to sign kernel modules (you’ve already got the modules on read-only media: they can’t change). The kernel just needs to know they’re all coming from the correct location. (And this solves loading known-good firmware too, since there is no convention for signed firmware in the kernel yet.) LoadPin requires that all modules, firmware, etc come from the same mount (and assumes that the first loaded file defines which mount is “correct”, hence load “pinning”).

    That’s it for v4.7. Prepare yourself for v4.8 next!

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    October 03, 2016 07:47 AM

    October 01, 2016

    Kees Cook: security things in Linux v4.6

    Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella.

    seccomp support for parisc

Helge Deller added seccomp support for parisc, which included plumbing support for PTRACE_GETREGSET to get the self-tests working.

    x86 32-bit mmap ASLR vs unlimited stack fixed

Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. “ulimit -s unlimited“) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, if a system sees collisions between unlimited stack and mmap ASLR, it can just adjust the 32-bit ASLR entropy instead.

    x86 execute-only memory

    Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for “execute only” memory in user-space. On pkeys-supporting CPUs, using mmap(..., PROT_EXEC) (i.e. without PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I’m looking forward to either emulated QEmu support or access to one of these fancy CPUs.
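A minimal sketch of how user-space would request execute-only memory (assuming a pkeys-capable CPU; code_bytes/code_len are placeholders):

    /* Sketch: populate a buffer, then flip it to execute-only. On
     * pkeys-supporting hardware the final mapping can be executed but
     * any read of it faults; on older CPUs PROT_EXEC still implies
     * readable. */
    #include <string.h>
    #include <sys/mman.h>

    void *make_execute_only(const void *code_bytes, size_t code_len)
    {
            void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return NULL;
            memcpy(p, code_bytes, code_len);        /* fill in the code */
            if (mprotect(p, 4096, PROT_EXEC))       /* no PROT_READ: execute-only */
                    return NULL;
            return p;
    }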

    CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86

    Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel.

    On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar’s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we’ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it’s reasonable to continue to leave an “out” for developers that find themselves tripping over it.

    arm64 KASLR text base offset

    Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the “/chosen/kaslr-seed” property) or from UEFI (via EFI_RNG_PROTOCOL), so if you’re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader.

    zero-poison after free

    Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity’s PAX_MEMORY_SANITIZE feature. This feature means that memory is cleared at free, wiping any sensitive data so it doesn’t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called “sanity checking”), which can catch another small subset of flaws.

    To understand the pieces of this, it’s worth describing that the kernel’s higher level allocator, the “page allocator” (e.g. __get_free_pages()) is used by the finer-grained “slab allocator” (e.g. kmem_cache_alloc(), kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator).

    Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn’t worth the performance trade-off.

    Tests for the features can be found in lkdtm as READ_AFTER_FREE and READ_BUDDY_AFTER_FREE. If you’re feeling especially paranoid and have enabled sanity-checking, WRITE_AFTER_FREE and WRITE_BUDDY_AFTER_FREE can test these as well.

    To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:

    CONFIG_DEBUG_PAGEALLOC=n
    CONFIG_PAGE_POISONING=y
    CONFIG_PAGE_POISONING_NO_SANITY=y
    CONFIG_PAGE_POISONING_ZERO=y
    CONFIG_SLUB_DEBUG=y

    and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:

    page_poison=on slub_debug=P

    To add sanity-checking, change PAGE_POISONING_NO_SANITY=n, and add “F” to slub_debug as “slub_debug=PF“.

    read-only after init

    I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity’s KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on CONFIG_DEBUG_RODATA above).

    Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared “const” in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked “__init“) and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc.

    As it turns out, most architectures with kernel memory protection already delay making their data read-only until after __init (see mark_rodata_ro()), so it’s trivial to declare a new data section (“.data..ro_after_init“) and add it to the existing read-only data section (“.rodata“). Kernel structures can be annotated with the new section (via the “__ro_after_init” macro), and they’ll become read-only once boot has finished.
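Usage is a one-line annotation. A minimal sketch (the __ro_after_init macro is real, from <linux/cache.h>; the ops table and init function are invented for illustration):

    #include <linux/cache.h>
    #include <linux/init.h>

    struct my_ops { void (*handler)(void); };       /* hypothetical */
    static const struct my_ops default_ops;

    static const struct my_ops *ops_table[8] __ro_after_init;

    static int __init my_subsys_init(void)
    {
            ops_table[0] = &default_ops;    /* still writable during __init */
            return 0;
    }

    /* once mark_rodata_ro() runs at the end of boot, any later
     * write to ops_table faults */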

    The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel’s attack surface can be made read-only for the majority of its lifetime.

    As people identify places where __ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of __read_only on variables that are only written to during __init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses pax_open_kernel()/pax_close_kernel() for these).

    That’s it for v4.6, next up will be v4.7!

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    October 01, 2016 07:45 AM

    September 30, 2016

    LPC 2016: Last batch of LPC registrations available on October 1

    The last batch of registrations for the 2016 Linux Plumbers Conference will be available starting at noon Eastern Time (EDT) on October 1. This will be the last chance to register to attend the conference. Those interested should visit the registration web site after that time.

    The schedule for the conference has been posted, which includes information on the microconferences (and the discussions planned for those) as well as the refereed talks. Any conflicts noted should be sent to contact@linuxplumbersconf.org.

    We hope to see you in Santa Fe!

    September 30, 2016 02:49 PM

    James Morris: Linux Security Summit 2016 Wrapup

    Here’s a summary of the 2016 Linux Security Summit, which was held last month in Toronto.

    Presentation slides are available at http://events.linuxfoundation.org/events/archive/2016/linux-security-summit/program/slides.

    This year, videos were made of the sessions, and they may be viewed at https://www.linux.com/news/linux-security-summit-videos — many thanks to Intel for sponsoring the recordings!

    LWN has published some excellent coverage:

    This is a pretty good representation of the main themes which emerged in the conference: container security, kernel self-protection, and integrity / secure boot.

    Many of the core or low level security technologies (such as access control, integrity measurement, crypto, and key management) are now fairly mature. There’s more focus now on how to integrate these components into higher-level systems and architectures.

    One talk I found particularly interesting was Design and Implementation of a Security Architecture for Critical Infrastructure Industrial Control Systems in the Era of Nation State Cyber Warfare. (The title, it turns out, was a hack to bypass limited space for the abstract in the cfp system).  David Safford presented an architecture being developed by GE to protect a significant portion of the world’s electrical grid from attack.  This is being done with Linux, and is a great example of how the kernel’s security mechanisms are being utilized for such purposes.  See the slides or the video.  David outlined gaps in the kernel in relation to their requirements, and a TPM BoF was held later in the day to work on these.  The BoF was reportedly very successful, as several key developers in the area of TPM and Integrity were present.

    #linuxsecuritysummit TPM BOF session pic.twitter.com/l1ko9Meiud

    — LinuxSecuritySummit (@LinuxSecSummit) August 25, 2016

    Attendance at LSS was the highest yet with well over a hundred security developers, researchers and end users.

    Special thanks to all of the LF folk who manage the logistics for the event.  There’s no way we could stage something on this scale without their help.

    Stay tuned for the announcement of next year’s event!


    September 30, 2016 11:19 AM

    Daniel Vetter: Commit Rights in the Linux Kernel?!

For about a year now we’ve been running the Intel graphics driver with a new process: besides the two established maintainers, we’ve added all regular contributors as committers to the main feature branch feeding into -next. This turned out to be a tremendous success, but it did require some initial adjustments to how we run things in the first few months.

    I’ve presented the new model here at Kernel Recipes in Paris, and I will also talk about it at Kernel Summit in Santa Fe. Since LWN is present at both I won’t bother with a full writeup, but leave that to much better editors. Update: LWN on kernel maintainer scalability.

    Anyway, there’s a video recording and the slides. Our process is also documented - scroll down to the bottom for the more interesting bits around what’s expected of committers.

    On a related note: At XDC, and a bit before, Eric Anholt started a discussion about improving our patch submission process, especially for new contributors. He used the Rust community as a great example, and presented about it at XDC. Rather interesting to hear his perspective as a first-time contributor confirm what I learned in LCA this year in Emily Dunham’s awesome talk on Life is better with Rust’s community automation.

    September 30, 2016 05:32 AM

    September 28, 2016

    Kees Cook: security things in Linux v4.5

    Previously: v4.4. Some things I found interesting in the Linux kernel v4.5:

    CONFIG_IO_STRICT_DEVMEM

    The CONFIG_STRICT_DEVMEM setting that has existed for a long time already protects system RAM from being accessible through the /dev/mem device node to root in user-space. Dan Williams added CONFIG_IO_STRICT_DEVMEM to extend this so that if a kernel driver has reserved a device memory region for use, it will become unavailable to /dev/mem also. The reservation in the kernel was to keep other kernel things from using the memory, so this is just common sense to make sure user-space can’t stomp on it either. Everyone should have this enabled. (And if you have a system where you discover you need IO memory access from userspace, you can boot with “iomem=relaxed” to disable this at runtime.)

    If you’re looking to create a very bright line between user-space having access to device memory, it’s worth noting that if a device driver is a module, a malicious root user can just unload the module (freeing the kernel memory reservation), fiddle with the device memory, and then reload the driver module. So either just leave out /dev/mem entirely (not currently possible with upstream), build a monolithic kernel (no modules), or otherwise block (un)loading of modules (/proc/sys/kernel/modules_disabled).

    ptrace fsuid checking

    Jann Horn fixed some corner-cases in how ptrace access checks were handled on special files in /proc. For example, prior to this fix, if a setuid process temporarily dropped privileges to perform actions as a regular user, the ptrace checks would not notice the reduced privilege, possibly allowing a regular user to trick a privileged process into disclosing things out of /proc (ASLR offsets, restricted directories, etc) that they normally would be restricted from seeing.

    ASLR entropy sysctl

    Daniel Cashman standardized the way architectures declare their maximum user-space ASLR entropy (CONFIG_ARCH_MMAP_RND_BITS_MAX) and then created a sysctl (/proc/sys/vm/mmap_rnd_bits) so that system owners could crank up entropy. For example, the default entropy on 32-bit ARM was 8 bits, but the maximum could be as much as 16. If your 64-bit kernel is built with CONFIG_COMPAT, there’s a compat version of the sysctl as well, for controlling the ASLR entropy of 32-bit processes: /proc/sys/vm/mmap_rnd_compat_bits.

    Here’s how to crank your entropy to the max, without regard to what architecture you’re on:

    for i in "" "compat_"; do f=/proc/sys/vm/mmap_rnd_${i}bits; n=$(cat $f); while echo $n > $f ; do n=$(( n + 1 )); done; done
    

    strict sysctl writes

    Two years ago I added a sysctl for treating sysctl writes more like regular files (i.e. what’s written first is what appears at the start), rather than like a ring-buffer (what’s written last is what appears first). At the time it wasn’t clear what might break if this was enabled, so a WARN was added to the kernel. Since only one such string showed up in searches over the last two years, the strict writing mode was made the default. The setting remains available as /proc/sys/kernel/sysctl_writes_strict.

    seccomp UM support

    Mickaël Salaün added seccomp support (and selftests) for user-mode Linux. Moar architectures!

    seccomp NNP vs TSYNC fix

    Jann Horn noticed and fixed a problem where if a seccomp filter was already in place on a process (after being installed by a privileged process like systemd, a container launcher, etc) then the setting of the “no new privs” flag could be bypassed when adding filters with the SECCOMP_FILTER_FLAG_TSYNC flag set. Bypassing NNP meant it might be possible to trick a buggy setuid program into doing things as root after a seccomp filter forced a privilege drop to fail (generally referred to as the “sendmail setuid flaw”). With NNP set, a setuid program can’t be run in the first place.
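For reference, here is a hedged sketch of installing a filter across all threads with TSYNC (an allow-everything filter, via the raw seccomp(2) syscall since libc may lack a wrapper):

    /* Sketch: install an allow-everything filter on all threads with
     * SECCOMP_FILTER_FLAG_TSYNC. NNP must be set first (or the caller
     * needs CAP_SYS_ADMIN). */
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static struct sock_filter filter[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    static struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
    };

    static int install_filter_all_threads(void)
    {
            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                    return -1;
            return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                           SECCOMP_FILTER_FLAG_TSYNC, &prog);
    }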

That’s it! Next I’ll cover v4.6.

    Edit: Added notes about “iomem=…”

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    September 28, 2016 09:58 PM

    September 27, 2016

    Kees Cook: security things in Linux v4.4

    Previously: v4.3. Continuing with interesting security things in the Linux kernel, here’s v4.4. As before, if you think there’s stuff I missed that should get some attention, please let me know.

    seccomp Checkpoint/Restore-In-Userspace

    Tycho Andersen added a way to extract and restore seccomp filters from running processes via PTRACE_SECCOMP_GET_FILTER under CONFIG_CHECKPOINT_RESTORE. This is a continuation of his work (that I failed to mention in my prior post) from v4.3, which introduced a way to suspend and resume seccomp filters. As I mentioned at the time (and for which he continues to quote me) “this feature gives me the creeps.” :)
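The extraction side looks roughly like this (a sketch; it assumes an attached, stopped tracee and a kernel built with CONFIG_CHECKPOINT_RESTORE, and older userspace headers may need <linux/ptrace.h> for the request constant):

    /* Sketch: dump a tracee's most recently installed seccomp filter. */
    #include <linux/filter.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>

    long dump_seccomp_filter(pid_t pid, struct sock_filter *insns)
    {
            /* the third argument selects the filter (0 = most recent);
             * a NULL data pointer just returns the instruction count */
            long count = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
            if (count < 0)
                    return count;
            return ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
    }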

    x86 W^X detection

Stephen Smalley noticed that there was still a range of kernel memory (just past the end of the kernel code itself) that was incorrectly marked writable and executable, defeating the point of CONFIG_DEBUG_RODATA which seeks to eliminate these kinds of memory ranges. He corrected this in v4.3 and added CONFIG_DEBUG_WX in v4.4, which performs a scan of memory at boot time and yells loudly if unexpected memory protections are found. To nobody’s delight, it was shortly discovered that UEFI leaves chunks of memory in this state too, which posed an ugly-to-solve problem (which Matt Fleming addressed in v4.6).

    x86_64 vsyscall CONFIG

    I introduced a way to control the mode of the x86_64 vsyscall with a build-time CONFIG selection, though the choice I really care about is CONFIG_LEGACY_VSYSCALL_NONE, to force the vsyscall memory region off by default. The vsyscall memory region was always mapped into process memory at a fixed location, and it originally posed a security risk as a ROP gadget execution target. The vsyscall emulation mode was added to mitigate the problem, but it still left fixed-position static memory content in all processes, which could still pose a security risk. The good news is that glibc since version 2.15 doesn’t need vsyscall at all, so it can just be removed entirely. Any kernel built this way that discovered they needed to support a pre-2.15 glibc could still re-enable it at the kernel command line with “vsyscall=emulate”.

    That’s it for v4.4. Tune in tomorrow for v4.5!

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

    September 27, 2016 10:47 PM

    Dave Airlie: radv: status update or is Talos Principle rendering yet?

    The answer is YES!!

    I fixed the last bug with instance rendering and Talos renders great on radv now.

Also, with the semi-interesting branch, vkQuake renders as well. There are some upstream bugs in spirv/nir that need fixing and that I'm awaiting an upstream resolution on, but I've included some preliminary fixes in semi-interesting for now; they'll go away when the upstream fixes are decided on.

    Here's a screenshot:

    September 27, 2016 04:33 AM

    September 26, 2016

    Kees Cook: security things in Linux v4.3

    When I gave my State of the Kernel Self-Protection Project presentation at the 2016 Linux Security Summit, I included some slides covering some quick bullet points on things I found of interest in recent Linux kernel releases. Since there wasn’t a lot of time to talk about them all, I figured I’d make some short blog posts here about the stuff I was paying attention to, along with links to more information. This certainly isn’t everything security-related or generally of interest, but they’re the things I thought needed to be pointed out. If there’s something security-related you think I should cover from v4.3, please mention it in the comments. I’m sure I haven’t caught everything. :)

    A note on timing and context: the momentum for starting the Kernel Self Protection Project got rolling well before it was officially announced on November 5th last year. To that end, I included stuff from v4.3 (which was developed in the months leading up to November) under the umbrella of the project, since the goals of KSPP aren’t unique to the project nor must the goals be met by people that are explicitly participating in it. Additionally, not everything I think worth mentioning here technically falls under the “kernel self-protection” ideal anyway — some things are just really interesting userspace-facing features.

    So, to that end, here are things I found interesting in v4.3:

    CONFIG_CPU_SW_DOMAIN_PAN

    Russell King implemented this feature for ARM which provides emulated segregation of user-space memory when running in kernel mode, by using the ARM Domain access control feature. This is similar to a combination of Privileged eXecute Never (PXN, in later ARMv7 CPUs) and Privileged Access Never (PAN, coming in future ARMv8.1 CPUs): the kernel cannot execute user-space memory, and cannot read/write user-space memory unless it was explicitly prepared to do so. This stops a huge set of common kernel exploitation methods, where either a malicious executable payload has been built in user-space memory and the kernel was redirected to run it, or where malicious data structures have been built in user-space memory and the kernel was tricked into dereferencing the memory, ultimately leading to a redirection of execution flow.

    This raises the bar for attackers since they can no longer trivially build code or structures in user-space where they control the memory layout, locations, etc. Instead, an attacker must find areas in kernel memory that are writable (and in the case of code, executable), where they can discover the location as well. For an attacker, there are vastly fewer places where this is possible in kernel memory as opposed to user-space memory. And as we continue to reduce the attack surface of the kernel, these opportunities will continue to shrink.

    While hardware support for this kind of segregation exists in s390 (natively separate memory spaces), ARM (PXN and PAN as mentioned above), and very recent x86 (SMEP since Ivy-Bridge, SMAP since Skylake), ARM is the first upstream architecture to provide this emulation for existing hardware. Everyone running ARMv7 CPUs with this kernel feature enabled suddenly gains the protection. Similar emulation protections (PAX_MEMORY_UDEREF) have been available in PaX/Grsecurity for a while, and I’m delighted to see a form of this land in upstream finally.

    To test this kernel protection, the ACCESS_USERSPACE and EXEC_USERSPACE triggers for lkdtm have existed since Linux v3.13, when they were introduced in anticipation of the x86 SMEP and SMAP features.
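For instance, a minimal sketch (my own illustration, not from the post) of poking one of those triggers from userspace looks roughly like this, assuming CONFIG_LKDTM is enabled and debugfs is mounted at /sys/kernel/debug; on a kernel with working PAN emulation the write should oops the kernel (killing the writing process) rather than succeed:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* Ask lkdtm to dereference user-space memory from kernel mode;
         * with CONFIG_CPU_SW_DOMAIN_PAN (or SMAP/PAN) this should oops. */
        int fd = open("/sys/kernel/debug/provoke-crash/DIRECT", O_WRONLY);
        if (fd < 0)
                return 1;
        write(fd, "ACCESS_USERSPACE", 16);
        close(fd);      /* likely not reached if the protection fired */
        return 0;
}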

    Ambient Capabilities

    Andy Lutomirski (with Christoph Lameter and Serge Hallyn) implemented a way for processes to pass capabilities across exec() in a sensible manner. Until Ambient Capabilities, any capabilities available to a process would only be passed to a child process if the new executable was correctly marked with filesystem capability bits. This turns out to be a real headache for anyone trying to build an even marginally complex “least privilege” execution environment. The case that Chrome OS ran into was having a network service daemon responsible for calling out to helper tools that would perform various networking operations. Keeping the daemon not running as root and retaining the needed capabilities in children required conflicting or crazy filesystem capabilities organized across all the binaries in the expected tree of privileged processes. (For example you may need to set filesystem capabilities on bash!) By being able to explicitly pass capabilities at runtime (instead of based on filesystem markings), this becomes much easier.
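To make that concrete, here is a minimal sketch (my own illustration, not from the post) of how a process on a v4.3+ kernel could raise a capability into its ambient set before exec()ing a helper. It assumes libcap (link with -lcap), that CAP_NET_RAW is already in the process's permitted set, and that ping merely stands in for an arbitrary helper; older headers may lack the PR_CAP_AMBIENT constants, hence the fallback defines:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/capability.h>

#ifndef PR_CAP_AMBIENT
#define PR_CAP_AMBIENT          47
#define PR_CAP_AMBIENT_RAISE    2
#endif

int main(void)
{
        cap_t caps = cap_get_proc();
        cap_value_t cap = CAP_NET_RAW;

        /* Ambient caps must first be in the inheritable set. */
        cap_set_flag(caps, CAP_INHERITABLE, 1, &cap, CAP_SET);
        if (cap_set_proc(caps))
                perror("cap_set_proc");
        cap_free(caps);

        /* Raise CAP_NET_RAW into the ambient set; it then survives exec()
         * without any filesystem capability bits on the target binary. */
        if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_NET_RAW, 0, 0))
                perror("PR_CAP_AMBIENT_RAISE");

        /* "ping" is just a hypothetical stand-in for a privileged helper. */
        execlp("ping", "ping", "-c", "1", "127.0.0.1", (char *)NULL);
        return 1;
}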

For more details, the commit message is well-written (almost twice as long as the code changes) and contains a test case. If that isn’t enough, there is a self-test available in tools/testing/selftests/capabilities/ too.

    PowerPC and Tile support for seccomp filter

    Michael Ellerman added support for seccomp to PowerPC, and Chris Metcalf added support to Tile. As the seccomp maintainer, I get excited when an architecture adds support, so here we are with two. Also included were updates to the seccomp self-tests (in tools/testing/selftests/seccomp), to help make sure everything continues working correctly.
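For readers who haven't seen one, a minimal seccomp filter of the kind those self-tests exercise looks roughly like the following sketch (my own, not code from the kernel tree; a production filter should also validate seccomp_data.arch):

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
        /* Deny getpid() with EPERM, allow everything else. */
        struct sock_filter filter[] = {
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(filter) / sizeof(filter[0]),
                .filter = filter,
        };

        /* Required so an unprivileged process may install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                perror("seccomp");

        /* Should print -1, with errno set to EPERM. */
        printf("getpid() = %ld\n", (long)syscall(__NR_getpid));
        return 0;
}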

    That’s it for v4.3. If I missed stuff you found interesting, please let me know! I’m going to try to get more per-version posts out in time to catch up to v4.8, which appears to be tentatively scheduled for release this coming weekend. Next: v4.4.

    © 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
    Creative Commons License

    September 26, 2016 10:54 PM

    LPC 2016: Refereed Talks now Posted to Plumbers Schedule

    The Linux Plumbers conference schedule has now been updated to include the accepted refereed talk proposals.  As usual, we’ve tried to make sure the conflicts are minimised, but if anyone needs a change to the timing of their talk, please email contact@linuxplumbersconf.org.

    September 26, 2016 08:50 PM

    September 25, 2016

    Gustavo F. Padovan: My talk about Mainline Explicit Fencing at XDC 2016!

Last week I was at XDC in Helsinki, where I presented the Explicit Fencing work we’ve been doing on the mainline Linux kernel over the last few months. All presentations during the conference were livestreamed and the recorded sessions are available. You can check the video of my presentation. Check out the slides too.

    If you want to check the code we’ve been writing they are available here:

    Linux Kernel: https://git.kernel.org/cgit/linux/kernel/git/padovan/linux.git/log/?h=fences

    Mesa: https://git.collabora.com/cgit/user/padovan/mesa.git/log/?h=fences

    libdrm: https://git.collabora.com/cgit/user/padovan/libdrm.git/log/?h=fences

    kmscube: https://github.com/robclark/kmscube/tree/atomic-fence

    Soon we will get Explicit Fencing on Android’s drm_hwcomposer as well so expect updates on this blog with more information about that. :)

Also, I would like to take the opportunity to thank Collabora for sponsoring my travel to XDC and Martin Peres for organizing such a great conference. It was my first time attending XDC, and my time there was absolutely great: I learnt a lot about what the graphics community has been doing lately and met the people doing this work. I was happy to see a lot of interest from many people in the Explicit Fencing work we’ve been doing.

     

    September 25, 2016 09:44 PM

    September 24, 2016

    Pavel Machek: Audio fun

    Documentation for audio on Linux... is pretty much nonexistent.

    Notice!

    There is a hidden pointer somewhere in this text to a page containing deeper information about using audio. You should have perfect understanding about the features described in this page before jumping into more complicated information. Just make sure you read this text carefully enough so you will be able to find the link.
Oh, thank you, so we are now on a treasure hunt?
    Under construction!
    This page is currently being written. A more complete version should be released shortly.
    ....
    Last updated Fri 16 Aug 1996 (minor changes).
    Seems like the complete page is not going to be available any time soon.
Still, that was the best page explaining how audio is supposed to work on Linux. Ouch. I could not get ALSA to work; OSS works fine. (I guess that also says something about the state of audio on Linux.) And then I discovered that the modem does not work in kernel 4.8, so my problems were not pulseaudio problems but modem problems. Oh well.
    --

    September 24, 2016 10:05 AM

    September 19, 2016

    LPC 2016: Preliminary Microconference Schedule Up

Every year we get a number of constraints on Microconferences which we try hard to accommodate.  Accounting for all of those, we’ve put the preliminary schedule up here.  If you notice any problems, please email contact@linuxplumbersconf.org and we’ll try to fix it.

Also note that this is preliminary: the Microconferences may still move around as we get requests to change them, and the times of talks within Microconferences are highly likely to change (please see the MC leaders if you want this to change).

    September 19, 2016 07:00 PM

    September 10, 2016

    LPC 2016: Git microconference accepted into LPC 2016

    The Linux kernel community has been using Git for more than a decade, but it is still under active development, with more than 2,000 non-merge commits from almost 200 contributors over the past year. Rather than review this extensive history, this Micro Git Together instead focuses on what the next few years might bring. In addition, Junio Hamano will present on the state of the Git Union, Josh Triplett will present on the git-series project, and Steve Rostedt will present “A Maze Of Git Scripts All Alike”, in which Steve puts forward the radical notion that common function in various maintainers’ scripts could be pulled into Git itself. This should help lead into a festive discussion about the future of Git.

    Please join us for an important discussion!

    September 10, 2016 06:20 PM

    Paul E. Mc Kenney: Git Microconference Accepted into 2016 Linux Plumbers Conference

The Linux kernel community has been using Git for more than a decade, but it is still under active development, with more than 2,000 non-merge git commits from almost 200 contributors over the past year. Rather than review this extensive history, this Micro Git Together instead focuses on what the next few years might bring. In addition, Junio will present on the state of the Git Union, Josh Triplett will present on the git-series project, and Steve Rostedt will present "A Maze Of Git Scripts All Alike", in which Steve puts forward the radical notion that common function in various maintainers' scripts could be pulled into git itself. This should help lead into a festive discussion about the future of git.

    Please join us for an important and festive discussion!

    September 10, 2016 10:25 AM

    September 09, 2016

    Mel Gorman: Stabilising performance after a major kernel revision

A topic about upstreaming patches from embedded-platform kernel forks is currently being discussed for Kernel Summit 2016. It is an age-old question: is it better to work upstream and backport, or to apply patches to a product-specific kernel and worry about forward-porting later? The points being raised have not changed over the years and still come down to getting something out the door quickly versus long-term maintenance overhead. I’m not directly affected so had nothing new to add to the thread.

However, I’ve had recent experience stabilising the performance of an upstream kernel after a major kernel revision in the context of a distribution kernel. The kernel in question follows an upstream-first-and-then-backport policy with very rare exceptions. The backports are almost always related to hardware enablement, but performance-related patches are also cherry-picked, which is my primary concern as Performance Team Lead. The difficulty we face is that the distribution kernel is faster than the baseline upstream stable kernel and faster than the mainline kernel we rebase to for a new release. There are usually multiple root causes, and because of the cherry-picking, it’s not a simple case of bisecting.

    Performance is always workload and hardware specific so I’m not going to get into the performance figures and profiles used to make decisions but the patches in question are on a public git tree if someone was sufficiently motivated. There may be an attempt to update the -stable kernel involved without a guarantee it’ll be picked up. Right now, it’s still a work in progress but this list gives an idea of the number of patches involved;

    This is an incomplete list and it’s a single case that may or may not apply to other people and products. I do have anecdotal evidence that other companies carry far fewer patches when stabilising performance but in many cases, those same companies have a fixed set of well-known workloads where as this is a distribution kernel for general use.

    This is unrelated to the difficulties embedded vendors have when shipping a product but lets just say that I have a certain degree of sympathy when a major kernel revision is required. That said, my experience suggests that the effort required to stabilise a major release periodically is lower than carrying ever-increasing numbers of backports that get harder and harder to backport.

    September 09, 2016 10:56 AM

    September 08, 2016

    Pavel Machek: Security getting hard/impossible on recent systems

Cache attacks: this is not good. Ok, so we have rowhammer: a very common, hard-to-work-around hardware problem. Bits in your memory may flip. Deal with it.

And now there are cache attacks, too. Users should not be able to spy on each other on a multiuser system, but they very probably can. In particular, other users can tell which parts of emacs you are executing, and when. They probably cannot distinguish which characters you are typing, but they can probably learn when you are typing a space, typing a normal letter, or moving the cursor. Ouch. And if they indeed can spy on individual characters... you can hardly blame emacs. With a plain keyboard, a cache attack on individual letters is probably not feasible. With a T9-like system on a touchscreen... it probably is. Deal with it. But how?

    September 08, 2016 10:46 AM

    Pavel Machek: fcam-dev now gets autofocus on 4.7 kernel

    Ok, without proper timing support, everything is really, really slow, but hey - I already got one usable photo out of the system :-).

    Oh, and this is the reason to run Debian on your phone: https://citizenlab.org/2016/08/million-dollar-dissident-iphone-zero-day-nso-group-uae/ .

    September 08, 2016 10:36 AM

    Pavel Machek: 25 years of Linux

25 years of Linux, and yes, I know Linux is popular. Still, it was unexpected to be asked on public transport whether I know about Linux. The man wanted me to help with X restarting due to bad graphics drivers... I asked how he picked me... and he told me about my T-shirt. I realized I had a UnitedLinux T-shirt on... Given SCO's involvement in that one... should I burn the shirt?

    September 08, 2016 10:32 AM

    Pavel Machek: ext4 encryption incompatible with grub

You encrypt a directory -- sounds easy, right? Support is in the 4.4 kernel, and my machines run newer kernels than that. Encrypting root would be hard, but encrypting parts of the data partition should be easy.

Ok, let's follow the howto... Need to run tune2fs. Right. Aha, still does not work; looks like I'll need to reboot.
Hmm. Will not boot. Grub no longer recognizes my /data partition, and that's where the new kernels are. Old kernels are in /boot, but those are now useless. Let's copy a new kernel onto the machine using a USB stick. Does not boot. Fun.
tune2fs on the root filesystem is useless, as it is too old. The new one is ... on the data partition. Right. Ok, let's bring a newer version of tune2fs in. The "encryption" feature can not be cleared.
Argh! Come on, I did not even create a single encrypted directory on the partition. I want the damn bit to go off, so I can go back to a working configuration. "Old kernels can not read encrypted files" sounds ok, but "old kernels can not mount the filesystem at all" is not acceptable here :-(.

Ok, it seems it is possible to go back, as long as encryption was not actually used: fsck -fn; debugfs -w -R "feature -encrypt" /dev/device; fsck -fn. I guess I was too optimistic. Using ext4 encryption would require at least a new e2fsprogs on the root filesystem, which was something I was hoping to avoid.

    September 08, 2016 10:31 AM

    Pavel Machek: Anyone with x60 and working gigabit?

On the lists, I was told that I probably have a broken wire inside my notebook. I believe broken wires simply don't happen, so... is there anyone with working gigabit on an x60?

    September 08, 2016 10:28 AM

    September 07, 2016

    LPC 2016: Limited number of LPC registrations available starting September 8

    LPC registration will open up on September 8 at noon Eastern Time (EDT) with a very limited number of slots available. Those interested in attending the conference who have not yet registered will want to visit the registration web site after that time. There will also be a very limited number of late registrations that will be available starting on October 1.

    Another way to get a pass to the nearly sold out conference would be to submit a refereed track proposal before September 8. Each accepted talk will get one free pass to LPC.

    September 07, 2016 01:49 PM

    September 06, 2016

    LPC 2016: Audio workshop accepted for Linux Plumbers Conference and Kernel Summit

Audio is an increasingly important component of the Linux plumbing, given increased use of Linux for media workloads and of the Linux kernel for smartphones. Topics include low-latency audio, use of the clock API, propagating digital configuration through dynamic audio power management (DAPM), integration of HDA and ASoC, SoundWire ALSA use-case manager (UCM) scalability, standardizing HDMI and DisplayPort interfaces, Media Controller API integration, and a number of topics relating to the multiple userspace users of Linux-kernel audio, including Android and ChromeOS as well as the various desktop-oriented Linux distributions.

As with many Linux-kernel components, upstreaming of vendor drivers and handling of stable and long-term-stable (LTS) trees are also important topics.

    Please join us for a timely and important discussion!

    September 06, 2016 04:38 PM

    Greg Kroah-Hartman: 4.9 == next LTS kernel

    As I briefly mentioned a few weeks ago on my G+ page, the plan is for the 4.9 Linux kernel release to be the next “Long Term Supported” (LTS) kernel.

    Last year, at the Linux Kernel Summit, we discussed just how to pick the LTS kernel. Many years ago, we tried to let everyone know ahead of time what the kernel version would be, but that caused a lot of problems as people threw crud in there that really wasn’t ready to be merged, just to make it easier for their “day job”. That was many years ago, and people insist they aren’t going to do this again, so let’s see what happens.

    I reserve the right to not pick 4.9 and support it for two years, if it’s a major pain because people abused this notice. If so, I’ll possibly drop back to 4.8, or just wait for 4.10 to be released. I’ll let everyone know by updating the kernel.org releases page when it’s time (many months from now.)

    If people have questions about this, email me and I will be glad to discuss it.

    September 06, 2016 07:59 AM

    September 05, 2016

    Gustavo F. Padovan: Mainline Explicit Fencing – part 1

When it comes to buffer sharing synchronization in the kernel there are two ways of doing it: Implicit Fencing and Explicit Fencing. The difference between them lies in whether the kernel shares synchronization information with userspace: it is either implicit, with no fencing information provided, or explicit, with all information available to userspace.

The fencing synchronization mechanism allows the sharing of buffers without the risk of a driver or userspace reading an incomplete buffer or writing to a buffer that is still in use somewhere else in the system. Fencing orders these operations so that reads or writes happen only when the buffer is no longer used by other drivers. For example, when a GPU job is queued, a fence is associated with the buffer in the job; that fence can be used by other drivers for synchronization purposes, and they won’t use the buffer until a signal from the fence is received. The signal means the buffer is now free to be used. Similarly, the GPU driver can wait for the buffer to come off the screen before rendering to it again.

    The central piece here is the fence, an element that is attached to each buffer whenever a request involving the buffer is sent to the kernel. The fence can be used by userspace or other drivers to wait for the work to finish. So once the work is finished the fence signals and the waiter can proceed and do whatever they want with the buffer.

While Implicit Fencing helps a lot with buffer synchronization, there are a few cases where the whole desktop compositing could stall. Imagine the following compositor flow: there are 3 buffers to process, A, B and C. A and B are sent for rendering in parallel, while C is going to be composed from both A and B. But the compositor will only be notified when both buffers are rendered, so if B takes too long, the compositing of the whole desktop will be blocked waiting for B, and C won’t be displayed in time.

A compositor processing two buffers in parallel; with Implicit Fencing, if B takes too long, the whole desktop compositor freezes.

However, with Explicit Fencing the compositor has one fence for each buffer and is notified when each buffer is rendered. So if A renders fast and B takes too long, the compositor can decide not to wait for B and proceed with the scanout of C using buffer A and an old version of B. The fencing information allows the compositor to be smart and take decisions that avoid freezing the whole screen, for example.

As of today the Linux kernel only has generic APIs for Implicit Fencing; although some drivers already have Explicit Fencing, their APIs are device-specific. Android currently has its own implementation through the Android Sync Framework – which will be explained in the next article.

Explicit Fencing works in a consumer-producer fashion. In a GPU-rendering-plus-scanout pipeline it would synchronize between the kernel drivers: when submitting a new rendering job to the GPU (the producer side), userspace gets back a fence related to the buffer submitted. That means userspace doesn’t need to block waiting for the job to complete; a signal is sent when the job is finished. As userspace doesn’t need to block, and has a fence for the buffer, it can proceed right away with the syscall asking the display hardware (the consumer) to scan out the buffer that is yet to be processed. With explicit fencing the kernel is taught to wait for the fence to signal before starting the scanout process.

A new fence is returned to userspace when the buffer is submitted to the kernel for scanout on the display hardware; that fence will signal when the buffer is no longer being displayed and is thus ready for reuse by another rendering job. When userspace gets this fence back it can submit a new rendering job to the GPU without waiting. The wait is done on the kernel side by the GPU driver: once the fence signals, rendering on that buffer can begin.

The fence travels all the way to userspace and to the next element in the pipeline; the yellow arrows represent the fences in userspace.
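As a rough illustration, here is a sketch (my own, and it assumes the fence reaches userspace as a poll()able file descriptor, which is how the sync-file fences described here behave) of how a compositor could wait on a fence with a timeout and fall back to an older buffer when the deadline passes:

#include <poll.h>

/* Returns 1 if the fence fd signalled within timeout_ms, 0 on timeout,
 * -1 on error. A timeout is how a compositor could decide to reuse an
 * old buffer instead of stalling the whole screen on one slow client. */
static int wait_fence(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret < 0)
                return -1;
        return ret == 1 && (pfd.revents & POLLIN);
}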

Last but not least, debuggability of the graphics pipeline is improved. Having access to the fence in userspace helps a lot in understanding what is happening in the pipeline. Previously, with Implicit Fencing, there was no information available, so it was hard to figure out what was happening in the pipeline, and each vendor was trying to implement their own Implicit Fencing mechanism. Now, with a standard Explicit Fencing mechanism, it is easier to build debug/tracing infrastructure that can be used to investigate issues in any system.

    The next article will explain the Android Sync Framework and later the work on mainline to support explicit fencing will be described.

    September 05, 2016 09:15 PM

    September 04, 2016

    Paul E. Mc Kenney: Audio Workshop Accepted into 2016 Linux Kernel Summit and Linux Plumbers Conference

Audio is an increasingly important component of the Linux plumbing, given increased use of Linux for media workloads and of the Linux kernel for smartphones. Topics include low-latency audio, use of the clock API, propagating digital configuration through dynamic audio power management (DAPM), integration of HDA and ASoC, SoundWire ALSA use-case manager (UCM) scalability, standardizing HDMI and DisplayPort interfaces, Media Controller API integration, and a number of topics relating to the multiple userspace users of Linux-kernel audio, including Android and ChromeOS as well as the various desktop-oriented Linux distributions.

    As with many Linux-kernel components, upstreaming of vendor drivers and handling of stable and long-term-stable (LTS) trees are also important topics.

    Please join us for a timely and important discussion!

    September 04, 2016 11:46 AM

    September 02, 2016

    LPC 2016: Submission deadline for LPC refereed track proposals extended by a week

    The deadline for submitting refereed track proposals for the 2016 Linux Plumbers Conference has been extended until September 8, 2016 at 11:59PM CET. The refereed track will have 50-minute presentations on a specific aspect of Linux “plumbing” (e.g. core libraries, media creation/playback, display managers, init systems, kernel APIs/ABIs, etc.) that are chosen by the LPC committee to be given during the four days of the conference.

    Registration for the conference has largely sold out at this point, but accepted talks for the refereed track will receive one free pass to the conference.

    September 02, 2016 02:12 PM

    Pete Zaitcev: Russian Joke

    Supposedly from Habrahabr.ru, via bash.org.ru:

Author's Bio: Andrey Pan'gin [ref — zaitcev]. A programmer at the Odnoklassniki company, specializing in highly loaded back-ends. Knows the JVM like the back of his hand, since he developed the HotSpot VM at Sun Microsystems and Oracle for several years. Loves assembly and systems programming.
    A comment: Fallen angel.

    September 02, 2016 04:12 AM

    August 31, 2016

    Vegard Nossum: Debugging a kernel crash found by syzkaller

    Having done quite a bit of kernel fuzzing and debugging lately I’ve decided to take one of the very latest crashes and write up the whole process from start to finish as I work through it. As you will see, I'm not very familiar with the site of this particular crash, the block layer. Being familiar with some existing kernel code helps, of course, since you recognise a lot of code patterns, but the kernel is so large that nobody can be familiar with everything and the crashes found by trinity and syzkaller can show up almost anywhere.

    So I got this with syzkaller after running it for a few hours:

    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880110762cc0 task.stack: ffff880102290000
    RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
    RSP: 0018:ffff880102297cd0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90000bb4000
    RDX: 0000000000000097 RSI: 0000000000000000 RDI: 00000000000004b8
    RBP: ffff880102297cd8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88011a010a90
    R13: ffff88011a594568 R14: ffff88011a010890 R15: 7fffffffffffffff
    FS: 00007f2445174700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000200047c8 CR3: 0000000107eb5000 CR4: 00000000000006f0
    DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
    1ffff10020452f9e ffff880102297db8 ffffffff81508daa 0000000000000000
    0000000041b58ab3 ffffffff844e89e1 ffffffff81508b30 ffffed0020452001
    7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff
    Call Trace:
    [<ffffffff81508daa>] __filemap_fdatawrite_range+0x27a/0x2e0
    [<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
    [<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
    [<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
    [<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
    [<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
    [<ffffffff817341b4>] iterate_bdevs+0x194/0x210
    [<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
    [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
    [<ffffffff817196be>] sys_sync+0xce/0x160
    [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
    [<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
    [<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
    [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
    [<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
    RIP [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
    RSP <ffff880102297cd0>
    The very first thing to do is to look up the code in the backtrace:
    $ addr2line -e vmlinux -i ffffffff81f04b7a ffffffff81508daa ffffffff81508e36 ffffffff817191b0 ffffffff817341b4 ffffffff817196be
    ./include/linux/blkdev.h:844
    block/blk-core.c:116
    ./include/linux/backing-dev.h:186
    ./include/linux/backing-dev.h:229
    mm/filemap.c:316
    mm/filemap.c:334
    fs/sync.c:85
    ./include/linux/spinlock.h:302
    fs/block_dev.c:1910
    fs/sync.c:116
    The actual site of the crash is this:
     842 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
    843 {
    844 return bdev->bd_disk->queue; /* this is never NULL */
    845 }
    Because we’re using KASAN we can’t look at CR2 to find the bad pointer because KASAN triggers before the page fault (or to be completely honest, KASAN tries to access the shadow memory for the bad pointer, which is itself a bad pointer and causes the GPF above).

    Let’s look at the “Code:” line to try to find the exact dereference causing the error:
    $ echo 'Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00 ' | scripts/decodecode 
    Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
    All code
    ========
    0: 89 fa mov %edi,%edx
    2: 48 c1 ea 03 shr $0x3,%rdx
    6: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
    a: 75 35 jne 0x41
    c: 48 8b 9b e0 00 00 00 mov 0xe0(%rbx),%rbx
    13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
    1a: fc ff df
    1d: 48 8d bb b8 04 00 00 lea 0x4b8(%rbx),%rdi
    24: 48 89 fa mov %rdi,%rdx
    27: 48 c1 ea 03 shr $0x3,%rdx
    2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
    2f: 75 17 jne 0x48
    31: 48 8b 83 b8 04 00 00 mov 0x4b8(%rbx),%rax
    38: 5b pop %rbx
    39: 5d pop %rbp
    3a: 48 05 10 02 00 00 add $0x210,%rax
    I’m using CONFIG_KASAN_INLINE=y so most of the code above is actually generated by KASAN which makes things a bit harder to read. The movabs with a weird 0xdffff… address is how it generates the address for the shadow memory bytemap and the cmpb that crashed is where it tries to read the value of the shadow byte.

    The address is %rdx + %rax and we know that %rax is 0xdffffc0000000000. Let’s look at %rdx in the crash above… RDX: 0000000000000097; yup, that’s a NULL pointer dereference all right.
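In fact the two values tie together neatly; a small sketch of the generic x86-64 KASAN shadow mapping (every 8 bytes of memory map to one shadow byte) shows that %rdx is exactly the shifted offset being probed:

/* shadow(addr) = (addr >> 3) + 0xdffffc0000000000 on x86-64.
 * For addr = NULL + 0x4b8, addr >> 3 = 0x4b8 >> 3 = 0x97, which is
 * precisely the %rdx in the crash above: the kernel was computing
 * the shadow address for offset 0x4b8 off a NULL base pointer. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000UL

static inline unsigned char *kasan_shadow(const void *addr)
{
        return (unsigned char *)(((unsigned long)addr >> 3)
                                 + KASAN_SHADOW_OFFSET);
}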

    But the line in question has two pointer dereferences, bdev->bd_disk and bd_disk->queue, and which one is the crash? The lea 0x4b8(%rbx), %rdi is what gives it away, since that gives us the offset into the structure that is being dereferenced (also, NOT coincidentally, %rbx is 0). Let’s use pahole:
    $ pahole -C 'block_device' vmlinux
    struct block_device {
    dev_t bd_dev; /* 0 4 */
    int bd_openers; /* 4 4 */
    struct inode * bd_inode; /* 8 8 */
    struct super_block * bd_super; /* 16 8 */
    struct mutex bd_mutex; /* 24 128 */
    /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
    void * bd_claiming; /* 152 8 */
    void * bd_holder; /* 160 8 */
    int bd_holders; /* 168 4 */
    bool bd_write_holder; /* 172 1 */

    /* XXX 3 bytes hole, try to pack */

    struct list_head bd_holder_disks; /* 176 16 */
    /* --- cacheline 3 boundary (192 bytes) --- */
    struct block_device * bd_contains; /* 192 8 */
    unsigned int bd_block_size; /* 200 4 */

    /* XXX 4 bytes hole, try to pack */

    struct hd_struct * bd_part; /* 208 8 */
    unsigned int bd_part_count; /* 216 4 */
    int bd_invalidated; /* 220 4 */
    struct gendisk * bd_disk; /* 224 8 */
    struct request_queue * bd_queue; /* 232 8 */
    struct list_head bd_list; /* 240 16 */
    /* --- cacheline 4 boundary (256 bytes) --- */
    long unsigned int bd_private; /* 256 8 */
    int bd_fsfreeze_count; /* 264 4 */

    /* XXX 4 bytes hole, try to pack */

    struct mutex bd_fsfreeze_mutex; /* 272 128 */
    /* --- cacheline 6 boundary (384 bytes) was 16 bytes ago --- */

    /* size: 400, cachelines: 7, members: 21 */
    /* sum members: 389, holes: 3, sum holes: 11 */
    /* last cacheline: 16 bytes */
    };
    0x4b8 is 1208 in decimal, which is way bigger than this struct. Let’s try the other one:
    $ pahole -C 'gendisk' vmlinux
    struct gendisk {
    int major; /* 0 4 */
    int first_minor; /* 4 4 */
    int minors; /* 8 4 */
    char disk_name[32]; /* 12 32 */

    /* XXX 4 bytes hole, try to pack */

    char * (*devnode)(struct gendisk *, umode_t *); /* 48 8 */
    unsigned int events; /* 56 4 */
    unsigned int async_events; /* 60 4 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct disk_part_tbl * part_tbl; /* 64 8 */
    struct hd_struct part0; /* 72 1128 */
    /* --- cacheline 18 boundary (1152 bytes) was 48 bytes ago --- */
    const struct block_device_operations * fops; /* 1200 8 */
    struct request_queue * queue; /* 1208 8 */
    /* --- cacheline 19 boundary (1216 bytes) --- */
    void * private_data; /* 1216 8 */
    int flags; /* 1224 4 */

    /* XXX 4 bytes hole, try to pack */

    struct kobject * slave_dir; /* 1232 8 */
    struct timer_rand_state * random; /* 1240 8 */
    atomic_t sync_io; /* 1248 4 */

    /* XXX 4 bytes hole, try to pack */

    struct disk_events * ev; /* 1256 8 */
    struct kobject integrity_kobj; /* 1264 64 */
    /* --- cacheline 20 boundary (1280 bytes) was 48 bytes ago --- */
    int node_id; /* 1328 4 */

    /* XXX 4 bytes hole, try to pack */

    struct badblocks * bb; /* 1336 8 */
    /* --- cacheline 21 boundary (1344 bytes) --- */

    /* size: 1344, cachelines: 21, members: 20 */
    /* sum members: 1328, holes: 4, sum holes: 16 */
    };
    1208 is ->queue, so that fits well with what we’re seeing; therefore, bdev->bd_disk must be NULL.

At this point I would go up the stack of functions to see if anything sticks out – although unlikely, it’s possible that it’s an “easy” bug where you can tell just from looking at the code in a single function that it sets the pointer to NULL just before calling the function that crashed, or something like that.

    Probably the most interesting function in the stack trace (at a glance) is iterate_bdevs() in fs/block_dev.c:
    1880 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
    1881 {
    1882 struct inode *inode, *old_inode = NULL;
    1883
    1884 spin_lock(&blockdev_superblock->s_inode_list_lock);
    1885 list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
    1886 struct address_space *mapping = inode->i_mapping;
    1887
    1888 spin_lock(&inode->i_lock);
    1889 if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
    1890 mapping->nrpages == 0) {
    1891 spin_unlock(&inode->i_lock);
    1892 continue;
    1893 }
    1894 __iget(inode);
    1895 spin_unlock(&inode->i_lock);
    1896 spin_unlock(&blockdev_superblock->s_inode_list_lock);
    1897 /*
    1898 * We hold a reference to 'inode' so it couldn't have been
    1899 * removed from s_inodes list while we dropped the
    1900 * s_inode_list_lock We cannot iput the inode now as we can
    1901 * be holding the last reference and we cannot iput it under
    1902 * s_inode_list_lock. So we keep the reference and iput it
    1903 * later.
    1904 */
    1905 iput(old_inode);
    1906 old_inode = inode;
    1907
    1908 func(I_BDEV(inode), arg);
    1909
    1910 spin_lock(&blockdev_superblock->s_inode_list_lock);
    1911 }
    1912 spin_unlock(&blockdev_superblock->s_inode_list_lock);
    1913 iput(old_inode);
    1914 }
    I can’t quite put my finger on it, but it looks interesting because it has a bunch of locking in it and it seems to be what’s getting the block device from a given inode. I ran git blame on the file/function in question since that might point to a recent change there, but the most interesting thing is commit 74278da9f7 changing some locking logic. Maybe relevant, maybe not, but let’s keep it in mind.

Remember that bdev->bd_disk is NULL. Let’s try to check if ->bd_disk is assigned NULL anywhere:
    $ git grep -n '\->bd_disk.*=.*NULL'
    block/blk-flush.c:470: if (bdev->bd_disk == NULL)
    drivers/block/xen-blkback/xenbus.c:466: if (vbd->bdev->bd_disk == NULL) {
    fs/block_dev.c:1295: bdev->bd_disk = NULL;
    fs/block_dev.c:1375: bdev->bd_disk = NULL;
    fs/block_dev.c:1615: bdev->bd_disk = NULL;
    kernel/trace/blktrace.c:1624: if (bdev->bd_disk == NULL)
    This by no means necessarily includes the code that set ->bd_disk to NULL in our case (since there could be code that looks like x = NULL; bdev->bd_disk = x; which wouldn’t be found with the regex above), but this is a good start and I’ll look at the functions above just to see if it might be relevant. Actually, for this I’ll just add -W to the git grep above to quickly look at the functions.

    The first two and last hits are comparisons so they are uninteresting. The third and fourth ones are part of error paths in __blkdev_get(). That might be interesting if the process that crashed somehow managed to get a reference to the block device just after the NULL assignment (if so, that would probably be a locking bug in either __blkdev_get() or one of the functions in the crash stack trace – OR it might be a bug where the struct block_device * is made visible/reachable before it’s ready). The fifth one is in __blkdev_put(). I’m going to read over __blkdev_get() and __blkdev_put() to figure out what they do and if there’s maybe something going on in either of those.

    In all these cases, it seems to me that &bdev->bd_mutex is locked; that’s a good sign. That’s also maybe an indication that we should be taking &bdev->bd_mutex in the other code path, so let’s check if we are. There’s nothing that I can see in any of the functions from inode_to_bdi() and up. Although inode_to_bdi() itself looks interesting, because that’s where the block device pointer comes from; it calls I_BDEV(inode) which returns a struct block_device *. Although if we follow the stack even further up, we see that fdatawrite_one_bdev() in fs/sync.c also knows about a struct block_device *. This by the way appears to be what is called through the function pointer in iterate_bdevs():
    1908                 func(I_BDEV(inode), arg);
    This in turn is called from the sync() system call. In other words, I cannot see any caller that takes &bdev->bd_mutex. There may yet be another mechanism (maybe a lock) intended to prevent somebody from seeing bdev->bd_disk == NULL, but this seems like a strong indication of what the problem might be.

    Let’s try to figure out more about ->bd_mutex, maybe there’s some documentation somewhere telling us what it’s supposed to protect. There is this:
    include/linux/fs.h=454=struct block_device {
    include/linux/fs.h-455- dev_t bd_dev; /* not a kdev_t - it's a search key */
    include/linux/fs.h-456- int bd_openers;
    include/linux/fs.h-457- struct inode * bd_inode; /* will die */
    include/linux/fs.h-458- struct super_block * bd_super;
    include/linux/fs.h:459: struct mutex bd_mutex; /* open/close mutex */
    There is this:
    include/linux/genhd.h-680-/*
    include/linux/genhd.h-681- * Any access of part->nr_sects which is not protected by partition
    include/linux/genhd.h:682: * bd_mutex or gendisk bdev bd_mutex, should be done using this
    include/linux/genhd.h-683- * accessor function.
    include/linux/genhd.h-684- *
    include/linux/genhd.h-685- * Code written along the lines of i_size_read() and i_size_write().
    include/linux/genhd.h-686- * CONFIG_PREEMPT case optimizes the case of UP kernel with preemption
    include/linux/genhd.h-687- * on.
    include/linux/genhd.h-688- */
    include/linux/genhd.h=689=static inline sector_t part_nr_sects_read(struct hd_struct *part)
    And there is this:
    include/linux/genhd.h-711-/*
    include/linux/genhd.h:712: * Should be called with mutex lock held (typically bd_mutex) of partition
    include/linux/genhd.h-713- * to provide mutual exlusion among writers otherwise seqcount might be
    include/linux/genhd.h-714- * left in wrong state leaving the readers spinning infinitely.
    include/linux/genhd.h-715- */
    include/linux/genhd.h-716-static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
    Under Documentation/ there is also this:
    --------------------------- block_device_operations -----------------------
    [...]
    locking rules:
    bd_mutex
    open: yes
    release: yes
    ioctl: no
    compat_ioctl: no
    direct_access: no
    media_changed: no
    unlock_native_capacity: no
    revalidate_disk: no
    getgeo: no
    swap_slot_free_notify: no (see below)
    Looking at __blkdev_get() again, there’s also one comment above it hinting at locking rules:
    1233 /*                  
    1234 * bd_mutex locking:
    1235 *
    1236 * mutex_lock(part->bd_mutex)
    1237 * mutex_lock_nested(whole->bd_mutex, 1)
    1238 */
    1239
    1240 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
    __blkdev_get() is called as part of blkdev_get(), which is what is called when you open a block device. In other words, it seems likely that we may have a race between opening/closing a block device and calling sync() – although for the sync() call to reach the block device, we should have some inode open on that block device (since we start out with an inode that is mapped to a block device with I_BDEV(inode)).

    Looking at the syzkaller log file, there is a sync() call just before the crash, and I also see references to [sr0] unaligned transfer (and sr0 is a block device, so that seems slightly suspicious):
    2016/08/25 05:45:02 executing program 0:
    mmap(&(0x7f0000001000)=nil, (0x4000), 0x3, 0x31, 0xffffffffffffffff, 0x0)
    mbind(&(0x7f0000004000)=nil, (0x1000), 0x8003, &(0x7f0000002000)=0x401, 0x9, 0x2)
    shmat(0x0, &(0x7f0000001000)=nil, 0x4000)
    dup2(0xffffffffffffffff, 0xffffffffffffff9c)
    mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    sync()
    mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    clock_gettime(0x0, &(0x7f0000000000)={0x0, 0x0})
    sr0] unaligned transfer
    sr 1:0:0:0: [sr0] unaligned transfer
    sr 1:0:0:0: [sr0] unaligned transfer
    sr 1:0:0:0: [sr0] unaligned transfer
    kasan: CONFIG_KASAN_INLINE enabled
    2016/08/25 05:45:03 result failed=false hanged=false:

    2016/08/25 05:45:03 executing program 1:
    mmap(&(0x7f0000002000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)
    readahead(r0, 0xcb84, 0x10001)
    mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    mmap(&(0x7f0000001000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
    syz_open_dev$mixer(&(0x7f0000002000-0x8)="2f6465762f6d6978657200", 0x0, 0x86000)
    mmap(&(0x7f0000001000)=nil, (0x1000), 0x6, 0x12, r0, 0x0)
    mount$fs(&(0x7f0000001000-0x6)="6d73646f7300", &(0x7f0000001000-0x6)="2e2f62757300", &(0x7f0000001000-0x6)="72616d667300", 0x880, &(0x7f0000000000)="1cc9417348")
    kasan: GPF could be caused by NULL-ptr deref or user memory access
Here we see both the sync() call and the syz_open_dev$sr() call, and we see that the GPF seems to happen some time shortly after opening sr0:
    r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)

    >>> "2f6465762f73723000".decode('hex')
    '/dev/sr0\x00'
    There’s also a mount$fs() call there that looks interesting. Its arguments are:
    >>> "6d73646f7300".decode('hex')
    'msdos\x00'
    >>> "2e2f62757300".decode('hex')
    './bus\x00'
    >>> "72616d667300".decode('hex')
    'ramfs\x00'
    However, I can’t see any references to any block devices in fs/ramfs, so I think this is unlikely to be it. I do still wonder how opening /dev/sr0 can do anything for us if it doesn’t have a filesystem or even a medium. [Note from the future: block devices are represented as inodes on the “bdev” pseudo-filesystem. Go figure!] Grepping for sr0 in the rest of the syzkaller log shows this bit, which seems to indicate we do in fact have inodes for sr0:
    VFS: Dirty inode writeback failed for block device sr0 (err=-5).
    Grepping for “Dirty inode writeback failed”, I find bdev_write_inode() in fs/block_dev.c, called only from… __blkdev_put(). It definitely feels like we’re on to something now – maybe a race between sync() and open()/close() for /dev/sr0.

    syzkaller comes with some scripts to rerun the programs from a log file. I’m going to try that and see where it gets us – if we can reproduce the crash. I’ll first try to convert the two programs (the one with sync() and the one with the open(/dev/sr0)) to C and compile them. If that doesn’t work, syzkaller also has an option to auto-reproduce based on all the programs in the log file, but that’s likely slower and not always likely to succeed.

    I use syz-prog2c and launch the two programs in parallel in a VM, but it doesn’t show anything at all. I switch to syz-repro to see if it can reproduce anything given the log file, but this fails too. I see that there are other sr0-related messages in the kernel log, so there must be a way to open the device without just getting ENOMEDIUM. I do a stat on /dev/sr0 to find the device numbers:
    $ stat /dev/sr0 
    File: ‘/dev/sr0’
    Size: 0 Blocks: 0 IO Block: 4096 block special file
    Device: 5h/5d Inode: 7867 Links: 1 Device type: b,0
    So the device major is 0xb (11 decimal). We can find this in include/uapi/linux/major.h and it gives us:
    include/uapi/linux/major.h:#define SCSI_CDROM_MAJOR     11
    We see that this is the driver responsible for /dev/sr0:
    drivers/scsi/sr.c:      rc = register_blkdev(SCSI_CDROM_MAJOR, "sr");
    (I could have guessed this as well, but there are so many systems and subsystems and drivers that I often double check just to make sure I’m in the right place.) I look for an open() function and I find two – sr_open() and sr_block_open(). sr_block_open() does cdrom_open() – from drivers/cdrom/cdrom.c – and this has an interesting line:
            /* if this was a O_NONBLOCK open and we should honor the flags,
    * do a quick open without drive/disc integrity checks. */
    cdi->use_count++;
    if ((mode & FMODE_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
    ret = cdi->ops->open(cdi, 1);
    So we need to pass O_NONBLOCK to get the device to open. When I add this to the test program from the syzkaller log and run sync() in parallel… ta-da!
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    CPU: 3 PID: 1333 Comm: sync1 Not tainted 4.8.0-rc2+ #169
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880114114080 task.stack: ffff880112bf0000
    RIP: 0010:[<ffffffff8170654d>] [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
    RSP: 0018:ffff880112bf7ca0 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: ffff880112bf7d10 RCX: ffff8801141147d0
    RDX: 0000000000000093 RSI: ffff8801170f8750 RDI: 0000000000000498
    RBP: ffff880112bf7cd8 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff8801141147e8 R11: 0000000000000000 R12: ffff8801170f8750
    R13: 0000000000000000 R14: ffff880112bf7d38 R15: ffff880112bf7d10
    FS: 00007fd533aa2700(0000) GS:ffff88011ab80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000601028 CR3: 0000000112b04000 CR4: 00000000000006e0
    Stack:
    ffff8801170f8750 0000000000000000 1ffff1002257ef9e ffff8801170f8950
    ffff8801170f8750 0000000000000000 ffff880112bf7d10 ffff880112bf7db8
    ffffffff81508d70 0000000000000000 0000000041b58ab3 ffffffff844e89e1
    Call Trace:
    [<ffffffff81508d70>] __filemap_fdatawrite_range+0x240/0x2e0
    [<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
    [<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
    [<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
    [<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
    [<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
    [<ffffffff817341b4>] iterate_bdevs+0x194/0x210
    [<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
    [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
    [<ffffffff817196be>] sys_sync+0xce/0x160
    [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
    [<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
    [<ffffffff82001a47>] ? check_preemption_disabled+0x37/0x1e0
    [<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
    [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
    [<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 04 00 00 49 8d bd 98 04 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 63 30 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 83 04 00 00 4d 8b bd 98 04 00 00 48 b8 00 00
    RIP [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
    RSP <ffff880112bf7ca0>
    ---[ end trace 50fffb72f7adb3e5 ]---
    This is not exactly the same oops that we saw before, but it’s close enough that it’s very likely to be a related crash. The reproducer is actually taking quite a while to trigger the issue, though. Even though I’ve reduced to two threads/processes executing just a handful of syscalls it still takes nearly half an hour to reproduce in a tight loop. I spend some time playing with the reproducer, trying out different things (read() instead of readahead(), just open()/close() with no reading at all, 2 threads doing sync(), etc.) to see if I can get it to trigger faster. In the end, I find that having many threads doing sync() in parallel seems to be the key to a quick reproducer, on the order of a couple of seconds.
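The reduced reproducer ends up shaped roughly like this (a reconstruction from the description above, not the exact program; it assumes /dev/sr0 exists, relies on the O_NONBLOCK quick-open path discussed earlier, and builds with -pthread):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Several threads hammering sync()... */
static void *syncer(void *arg)
{
        for (;;)
                sync();
        return NULL;
}

int main(void)
{
        pthread_t t;
        int i;

        for (i = 0; i < 8; i++)
                pthread_create(&t, NULL, syncer, NULL);

        /* ...racing against open()/close() of the block device. */
        for (;;) {
                int fd = open("/dev/sr0", O_RDONLY | O_NONBLOCK);
                if (fd >= 0)
                        close(fd);
        }
        return 0;
}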

    Now that I have a fairly small reproducer it should be a lot easier to figure out the rest. I can add as many printk()s as I need to validate my theory that sync() should be taking the bd_mutex. For cases like this I set up a VM so that I can start the VM and run the reproducer by running a single command. I also actually like to use trace_printk() instead of plain printk() and boot with ftrace_dump_on_oops on the kernel command line – this way, the messages don’t get printed until the crash actually happens (and have a lower probability of interfering with the race itself; printk() goes directly to the console, which is usually pretty slow).

    I apply this patch and recompile the kernel:
    diff --git a/fs/block_dev.c b/fs/block_dev.c
    index e17bdbd..fb9d5c5 100644
    --- a/fs/block_dev.c
    +++ b/fs/block_dev.c
    @@ -1292,6 +1292,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
    */
    disk_put_part(bdev->bd_part);
    bdev->bd_part = NULL;
    + trace_printk("%p->bd_disk = NULL\n", bdev);
    bdev->bd_disk = NULL;
    bdev->bd_queue = NULL;
    mutex_unlock(&bdev->bd_mutex);
    @@ -1372,6 +1373,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

    out_clear:
    disk_put_part(bdev->bd_part);
    + trace_printk("%p->bd_disk = NULL\n", bdev);
    bdev->bd_disk = NULL;
    bdev->bd_part = NULL;
    bdev->bd_queue = NULL;
    @@ -1612,6 +1614,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)

    disk_put_part(bdev->bd_part);
    bdev->bd_part = NULL;
    + trace_printk("%p->bd_disk = NULL\n", bdev);
    bdev->bd_disk = NULL;
    if (bdev != bdev->bd_contains)
    victim = bdev->bd_contains;
    @@ -1905,6 +1908,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
    iput(old_inode);
    old_inode = inode;

    + trace_printk("%p->bd_disk = %p\n", I_BDEV(inode), I_BDEV(inode)->bd_disk);
    func(I_BDEV(inode), arg);

    spin_lock(&blockdev_superblock->s_inode_list_lock);
    With this patch applied, I get this output on a crash:
       sync1-1343    3.... 8303954us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
    sync1-1340 0.... 8303955us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
    sync1-1343 3.... 8303961us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
    sync1-1335 1.... 8304043us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
    sync2-1327 1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL
    ---------------------------------
    CPU: 2 PID: 1336 Comm: sync1 Not tainted 4.8.0-rc2+ #170
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff88011212d600 task.stack: ffff880112190000
    RIP: 0010:[<ffffffff81f04c3a>] [<ffffffff81f04c3a>] blk_get_backing_dev_info+0x4a/0x70
    RSP: 0018:ffff880112197cd0 EFLAGS: 00010202
    Since __blkdev_put() is the very last line of output before the crash (and I don’t see any other call setting ->bd_disk to NULL in the last few hundred lines or so), there is a very strong indication that this is the problematic assignment. Rerunning this a couple of times shows that it tends to crash with the same symptoms every time.

    To get slightly more information about the context in which __blkdev_put() is called in, I apply this patch instead:
    diff --git a/fs/block_dev.c b/fs/block_dev.c
    index e17bdbd..298bf70 100644
    --- a/fs/block_dev.c
    +++ b/fs/block_dev.c
    @@ -1612,6 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)

    disk_put_part(bdev->bd_part);
    bdev->bd_part = NULL;
    + trace_dump_stack(0);
    bdev->bd_disk = NULL;
    if (bdev != bdev->bd_contains)
    victim = bdev->bd_contains;
    With that, I get the following output:
       <...>-1328    0.... 9309173us : <stack trace>
    => blkdev_close
    => __fput
    => ____fput
    => task_work_run
    => exit_to_usermode_loop
    => do_syscall_64
    => return_from_SYSCALL_64
    ---------------------------------
    CPU: 3 PID: 1352 Comm: sync1 Not tainted 4.8.0-rc2+ #171
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff88011248c080 task.stack: ffff880112568000
    RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
One thing that’s a bit surprising to me is that this actually isn’t called synchronously from close() itself, but as deferred task work (note the task_work_run() frame) on the return to userspace. But in any case we can tell it comes from close(), since fput() is called when closing a file descriptor.

Now that I have a fairly good idea of what’s going wrong, it’s time to focus on the fix. This is almost more difficult than what we’ve done so far because it’s such an open-ended problem. Of course I could add a brand new global spinlock to provide mutual exclusion between sync() and close(), but that would be a bad solution and the wrong thing to do. Usually the author of the code in question had a specific locking scheme or design in mind and the bug is just due to a small flaw or omission somewhere. In other words, it’s usually not a bug in the general architecture of the code (which might require big changes to fix), but a small bug somewhere in the implementation, which would typically require just a few changed lines to fix. It’s fairly obvious that close() is trying to prevent somebody else from seeing bdev->bd_disk == NULL by wrapping most of the __blkdev_put() code in ->bd_mutex. This makes me think that it’s the sync() code path that is missing some locking.

    Looking around __blkdev_put() and iterate_bdevs(), another thing that strikes me is that iterate_bdevs() is able to get a reference to a block device which is nevertheless in the process of being destroyed – maybe the real problem is that the block device is being destroyed too soon (while iterate_bdevs() is holding a reference to it). So it’s possible that iterate_bdevs() simply needs to formally take a reference to the block device by bumping its reference count while it does its work.

There is a function called bdgrab() which is supposed to take an extra reference to a block device – but only if you already have one. Thus, using this would be just as racy, since we’re not already formally holding a reference to it. Another function, bd_acquire(), seems to formally acquire a reference through a struct inode *. That seems quite promising. It is using the bdev_lock spinlock to prevent the block device from disappearing. I try this tentative patch:
    diff --git a/fs/block_dev.c b/fs/block_dev.c
    index e17bdbd..489473d 100644
    --- a/fs/block_dev.c
    +++ b/fs/block_dev.c
    @@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
             spin_lock(&blockdev_superblock->s_inode_list_lock);
             list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
                     struct address_space *mapping = inode->i_mapping;
    +                struct block_device *bdev;
     
                     spin_lock(&inode->i_lock);
                     if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
    @@ -1905,7 +1906,11 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
                     iput(old_inode);
                     old_inode = inode;
     
    -                func(I_BDEV(inode), arg);
    +                bdev = bd_acquire(inode);
    +                if (bdev) {
    +                        func(bdev, arg);
    +                        bdput(bdev);
    +                }
     
                     spin_lock(&blockdev_superblock->s_inode_list_lock);
             }
    My reasoning is that the call to bd_acquire() will prevent close() from actually reaching the bits in __blkdev_put() that do the final cleanup (i.e. setting bdev->bd_disk to NULL) and so prevent the crash from happening.

    Unfortunately, running the reproducer again shows no change that I can see. It seems that I was wrong about this preventing __blkdev_put() from running: blkdev_close() calls blkdev_put() unconditionally, which calls __blkdev_put() unconditionally.
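    Indeed, blkdev_close() is essentially just this (sketched from memory for this kernel version):
    static int blkdev_close(struct inode *inode, struct file *filp)
    {
            struct block_device *bdev = I_BDEV(bdev_file_inode(filp));

            blkdev_put(bdev, filp->f_mode); /* unconditional */
            return 0;
    }
    In hindsight this makes sense: the extra reference taken by bd_acquire() keeps the struct block_device’s memory from being freed, but it does nothing to stop the last opener from tearing down ->bd_disk once bd_openers reaches zero.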

    Another idea might be to remove the block device from the list that iterate_bdevs() is traversing before setting bdev->bd_disk to NULL. However, it seems that this is all handled by the VFS and we can’t really change it just for block devices.

    Reading over most of fs/block_dev.c, I decide to fall back to my first (and more obvious) idea: take bd_mutex in iterate_bdevs(). This should be safe since both s_inode_list_lock and inode->i_lock are dropped before the iterate_bdevs() callback is invoked. However, I am still getting the same crash… On second thought, even taking bd_mutex is not enough, because __blkdev_put() may already have run to completion: bdev->bd_disk will still be NULL after it releases the mutex. Maybe there’s a condition we can test while holding the mutex that will tell us whether the block device is usable or not. We could test ->bd_disk directly, which is what we’re really interested in, but that seems like a derived property rather than a real indication of whether the block device has been closed; ->bd_holders or ->bd_openers may be better candidates.

    While digging around trying to figure out whether to check ->bd_disk, ->bd_holders, or ->bd_openers, I came across this comment in one of the functions in the crashing call chain:
    /**
     * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
     * @bdev: device
     *
     * Locates the passed device's request queue and returns the address of its
     * backing_dev_info. This function can only be called if @bdev is opened
     * and the return value is never NULL.
     */
    struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);

            return &q->backing_dev_info;
    }
    In particular, the “This function can only be called if @bdev is opened” requirement seems to be violated in our case.
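    This also explains the exact faulting access: bdev_get_queue() is a trivial helper (quoted from memory, from include/linux/blkdev.h) that dereferences ->bd_disk, so once close() has set ->bd_disk to NULL, the very first thing blk_get_backing_dev_info() does is a NULL pointer dereference:
    static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
    {
            return bdev->bd_disk->queue;    /* oopses when bd_disk == NULL */
    }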

    Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a fairly reliable test of whether it’s safe to call filemap_fdatawrite() for the block device inode. The underlying problem here is that sync() is able to get a reference to a struct block_device without having it open as a file. Doing something like this does fix the bug:
    diff --git a/fs/sync.c b/fs/sync.c
    index 2a54c1f..9189eeb 100644
    --- a/fs/sync.c
    +++ b/fs/sync.c
    @@ -81,7 +81,10 @@ static void sync_fs_one_sb(struct super_block *sb, void *arg)
     
     static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
     {
    -        filemap_fdatawrite(bdev->bd_inode->i_mapping);
    +        mutex_lock(&bdev->bd_mutex);
    +        if (bdev->bd_disk)
    +                filemap_fdatawrite(bdev->bd_inode->i_mapping);
    +        mutex_unlock(&bdev->bd_mutex);
     }
     
     static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
    What I don’t like about this patch is that it simply skips block devices for which nobody holds an open file descriptor. That seems wrong to me because sync() should do writeback on (and wait for) all devices, not just the ones we happen to have open. Imagine opening a device, writing a lot of data to it, closing it, and then calling sync(): once sync() returns, we should be guaranteed that the data was written out, but I’m not sure we are in this case.
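    In other words, consider this hypothetical userspace sequence (the device path and sizes are made up for illustration):
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096] = { 0 };
            int fd = open("/dev/sdb", O_WRONLY);    /* hypothetical device */

            write(fd, buf, sizeof(buf));    /* data may sit in the page cache */
            close(fd);                      /* last opener goes away; ->bd_disk is cleared */
            sync();                         /* should guarantee the data is on disk, but the
                                             * patched fdatawrite_one_bdev() now skips this bdev */
            return 0;
    }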

    Another slightly ugly thing is that we’re now holding bd_mutex across a potentially big chunk of work (everything that happens inside filemap_fdatawrite()).

    I’m not sure I can do much better in terms of a small patch at the moment, so I will submit this to the linux-block mailing list with a few relevant people on Cc (Jens Axboe for being the block maintainer, Tejun Heo for having written a lot of the code involved according to git blame, Jan Kara for writing iterate_bdevs(), and Al Viro for probably knowing both the block layer and VFS quite well).

    I submitted my patch here: lkml.org thread

    Rabin Vincent answered pretty quickly that he already sent a fix for the very same issue. Oh well, at least his patch is quite close to what I came up with and I learned quite a few new things about the kernel.

    Tejun Heo also responded that a better fix would probably be to prevent the disk from going away by taking a reference to it. I tried a couple of different patches along those lines without much luck. The last patch from me in that thread seemed to prevent the crash, but, as I only realised a few minutes after sending it, it decrements the reference count without doing anything when the count reaches 0! Of course we don’t get a NULL pointer dereference if the cleanup/freeing never happens in the first place…
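    To spell out the mistake in its generic form (a hypothetical sketch, not the actual patch from the thread):
    #include <linux/atomic.h>
    #include <linux/slab.h>

    struct obj {
            atomic_t refcount;
    };

    /* Broken: the count reaches 0, but nothing is ever freed. The
     * use-after-free "disappears" only because the object now leaks. */
    static void obj_put_broken(struct obj *o)
    {
            atomic_dec(&o->refcount);
    }

    /* Correct: the final put performs the actual cleanup. */
    static void obj_put(struct obj *o)
    {
            if (atomic_dec_and_test(&o->refcount))
                    kfree(o);
    }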

    If you liked this post and you enjoy fixing bugs like this one, you may enjoy working with us in the Ksplice group at Oracle. Ping me at my Oracle email address :-)

    August 31, 2016 09:00 PM

    August 30, 2016

    LPC 2016: Most LPC passes sold out; refereed track proposals deadline nears

    All of the regular and early bird registrations for the 2016 Linux Plumbers Conference have now sold out. There will be a very limited number of late registrations available starting on October 1.

    Those interested in attending the conference should also note that each refereed track talk gets one free pass to the conference. The deadline for refereed track proposals is Thursday September 1.

    We hope to see you at LPC 2016!

    August 30, 2016 02:15 PM