Kernel Planet

August 17, 2018

Paul E. Mc Kenney: Testing & Fuzzing Microconference Accepted into 2018 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have greatly increased the robustness of the Linux ecosystem, but embarrassing bugs still escape to end users. Furthermore, a million-year bug would happen several tens of times per day across Linux's installed base (said to number more than 20 billion), so the best we can possibly do is hardly good enough.

The Testing and Fuzzing Microconference intends to raise the bar with further progress on syzbot/syzkaller, distribution/stable testing, kernel continuous integration, and unit testing. The best evidence of progress in these efforts will of course be the plethora of bug reports produced by these and similar tools!

Join us for an important and spirited discussion!

August 17, 2018 07:38 PM

August 11, 2018

Linux Plumbers Conference: Early Registration Ending Soon!

The early registration deadline is August 18, 2018, after which the regular-registration period will begin.  So to save $150, register for the Linux Plumbers Conference before August 18th!

August 11, 2018 06:25 PM

August 08, 2018

Linux Plumbers Conference: Containers Microconference Accepted into 2018 Linux Plumbers Conference

The Containers Micro-conference at Linux Plumbers is the yearly gathering of container runtime developers, kernel developers and container users. It is the one opportunity to have everyone in the same room to both look back at the past year in the container space and discuss the year ahead.

In the past, topics such as use of cgroups by containers, system call filtering and interception (Seccomp), improvements/additions of kernel namespaces, interaction with the Linux Security Modules (AppArmor, SELinux, SMACK), TPM based validation (IMA), mount propagation and mount API changes, uevent isolation, unprivileged filesystem mounts and more have been discussed in this micro-conference.

There will also no doubt be some discussions around performance to make up for the overhead caused by the recent Spectre and Meltdown set of mitigations that in some cases have had a significant impact on container runtimes.

This year’s edition will be combined with what was formerly the Checkpoint-Restart micro-conference. Expect continued discussion about integration of CRIU with the container runtimes, addressing performance issues of checkpoint and restart and possible optimizations, as well as (in)stability of rarely used kernel ABIs. Another hot new topic would be time namespacing and its usage for container snapshotting and migration.

August 08, 2018 04:23 PM

August 03, 2018

Linux Plumbers Conference: BPF Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the BPF Microconference has been accepted into the 2018 Linux Plumbers Conference!

BPF (Berkeley Packet Filter) is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP (eXpress Data Path), tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.

BPF has seen a lot of progress since last year’s Plumbers conference and many of the discussed BPF tracing Microconference improvements have been tackled since then such as the introduction of BPF type format (BTF) to name one. This year’s BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year’s event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.

August 03, 2018 09:17 PM

August 01, 2018

Dave Airlie (blogspot): virgl - exposes GLES3.1/3.2 and GL4.3

I'd had a bit of a break from adding feature to virgl while I was working on radv, but recently Google and Collabora have started to invest in virgl as a solution. A number of developers from both companies have joined the project.

This meant trying to get virgl to pass their dEQP suite and adding support for newer GL/GLES feature levels. They also have a goal for the renderer to run on a host GLES implementation whereas it currently only ran on a host GL.

Over the past few months I've worked with the group to add support for all the necessary features needed for guest GLES3.1 support (for them) and GL4.3 (for me).

The feature list was roughly:
tessellation shaders
fp64 support
ARB_gpu_shader5 support
Shader buffer objects
Shader image objects
Compute shaders
Copy Image
Texture views

With this list implemented we achieved GL4.3 and GLES3.1.

However Marek@AMD did some work on exposing ASTC for gallium drivers,
and with some extra work on EXT_shader_framebuffer_fetch, we now expose GLES3.2.

There was also plenty of work done on avoiding crashes from rogue guests (rewrote the whole feature/capability bit handling), and lots of bug fixes. There are still ongoing fixes to finish the dEQP tests, but it looks like all the feature work should be landed now.

What next?

Well there is one big problem facing virgl in exposing GL 4.4. GL_ARB_buffer_storage requires exposing coherent buffer memory, and with the virgl architecture, we currently don't have a way to map the pages behind a host GL buffer mapping into a guest GL buffer mapping in order to achieve coherency. This is going to require some thought and it may even require exposing some new GL extensions to export a buffer to a dma-buf.

There has also been a GSoC student Nathan working on vulkan/virgl support, he's made some iniital progess, however vulkan also has requirements on coherent memory so that tricky problems needs to be solved.

Thanks again to all the contributors to the virgl project.

August 01, 2018 04:26 AM

July 31, 2018

Matthew Garrett: Porting Coreboot to the 51NB X210

The X210 is a strange machine. A set of Chinese enthusiasts developed a series of motherboards that slot into old Thinkpad chassis, providing significantly more up to date hardware. The X210 has a Kabylake CPU, supports up to 32GB of RAM, has an NVMe-capable M.2 slot and has eDP support - and it fits into an X200 or X201 chassis, which means it also comes with a classic Thinkpad keyboard . We ordered some from a Facebook page (a process that involved wiring a large chunk of money to a Chinese bank which wasn't at all stressful), and a couple of weeks later they arrived. Once I'd put mine together I had a quad-core i7-8550U with 16GB of RAM, a 512GB NVMe drive and a 1920x1200 display. I'd transplanted over the drive from my XPS13, so I was running stock Fedora for most of this development process.

The other fun thing about it is that none of the firmware flashing protection is enabled, including Intel Boot Guard. This means running a custom firmware image is possible, and what would a ridiculous custom Thinkpad be without ridiculous custom firmware? A shadow of its potential, that's what. So, I read the Coreboot[1] motherboard porting guide and set to.

My life was made a great deal easier by the existence of a port for the Purism Librem 13v2. This is a Skylake system, and Skylake and Kabylake are very similar platforms. So, the first job was to just copy that into a new directory and start from there. The first step was to update the Inteltool utility so it understood the chipset - this commit shows what was necessary there. It's mostly just adding new PCI IDs, but it also needed some adjustment to account for the GPIO allocation being different on mobile parts when compared to desktop ones. One thing that bit me - Inteltool relies on being able to mmap() arbitrary bits of physical address space, and the kernel doesn't allow that if CONFIG_STRICT_DEVMEM is enabled. I had to disable that first.

The GPIO pins got dropped into gpio.h. I ended up just pushing the raw values into there rather than parsing them back into more semantically meaningful definitions, partly because I don't understand what these things do that well and largely because I'm lazy. Once that was done, on to the next step.

High Definition Audio devices (or HDA) have a standard interface, but the codecs attached to the HDA device vary - both in terms of their own configuration, and in terms of dealing with how the board designer may have laid things out. Thankfully the existing configuration could be copied from /sys/class/sound/card0/hwC0D0/init_pin_configs[2] and then hda_verb.h could be updated.

One more piece of hardware-specific configuration is the Video BIOS Table, or VBT. This contains information used by the graphics drivers (firmware or OS-level) to configure the display correctly, and again is somewhat system-specific. This can be grabbed from /sys/kernel/debug/dri/0/i915_vbt.

A lot of the remaining platform-specific configuration has been split out into board-specific config files. and this also needed updating. Most stuff was the same, but I confirmed the GPE and genx_dec register values by using Inteltool to dump them from the vendor system and copy them over. lspci -t gave me the bus topology and told me which PCIe root ports were in use, and lsusb -t gave me port numbers for USB. That let me update the root port and USB tables.

The final code update required was to tell the OS how to communicate with the embedded controller. Various ACPI functions are actually handled by this autonomous device, but it's still necessary for the OS to know how to obtain information from it. This involves writing some ACPI code, but that's largely a matter of cutting and pasting from the vendor firmware - the EC layout depends on the EC firmware rather than the system firmware, and we weren't planning on changing the EC firmware in any way. Using ifdtool told me that the vendor firmware image wasn't using the EC region of the flash, so my assumption was that the EC had its own firmware stored somewhere else. I was ready to flash.

The first attempt involved isis' machine, using their Beaglebone Black as a flashing device - the lack of protection in the firmware meant we ought to be able to get away with using flashrom directly on the host SPI controller, but using an external flasher meant we stood a better chance of being able to recover if something went wrong. We flashed, plugged in the power and… nothing. Literally. The power LED didn't turn on. The machine was very, very dead.

Things like managing battery charging and status indicators are up to the EC, and the complete absence of anything going on here meant that the EC wasn't running. The most likely reason for that was that the system flash did contain the EC's firmware even though the descriptor said it didn't, and now the system was very unhappy. Worse, the flash wouldn't speak to us any more - the power supply from the Beaglebone to the flash chip was sufficient to power up the EC, and the EC was then holding onto the SPI bus desperately trying to read its firmware. Bother. This was made rather more embarrassing because isis had explicitly raised concern about flashing an image that didn't contain any EC firmware, and now I'd killed their laptop.

After some digging I was able to find EC firmware for a related 51NB system, and looking at that gave me a bunch of strings that seemed reasonably identifiable. Looking at the original vendor ROM showed very similar code located at offset 0x00200000 into the image, so I added a small tool to inject the EC firmware (basing it on an existing tool that does something similar for the EC in some HP laptops). I now had an image that I was reasonably confident would get further, but we couldn't flash it. Next step seemed like it was going to involve desoldering the flash from the board, which is a colossal pain. Time to sleep on the problem.

The next morning we were able to borrow a Dediprog SPI flasher. These are much faster than doing SPI over GPIO lines, and also support running the flash at different voltage. At 3.5V the behaviour was the same as we'd seen the previous night - nothing. According to the datasheet, the flash required at least 2.7V to run, but flashrom listed 1.8V as the next lower voltage so we tried. And, amazingly, it worked - not reliably, but sufficiently. Our hypothesis is that the chip is marginally able to run at that voltage, but that the EC isn't - we were no longer powering the EC up, so could communicated with the flash. After a couple of attempts we were able to write enough that we had EC firmware on there, at which point we could shift back to flashing at 3.5V because the EC was leaving the flash alone.

So, we flashed again. And, amazingly, we ended up staring at a UEFI shell prompt[3]. USB wasn't working, and nor was the onboard keyboard, but we had graphics and were executing actual firmware code. I was able to get USB working fairly quickly - it turns out that Linux numbers USB ports from 1 and the FSP numbers them from 0, and fixing that up gave us working USB. We were able to boot Linux! Except there were a whole bunch of errors complaining about EC timeouts, and also we only had half the RAM we should.

After some discussion on the Coreboot IRC channel, we figured out the RAM issue - the Librem13 only has one DIMM slot. The FSP expects to be given a set of i2c addresses to probe, one for each DIMM socket. It is then able to read back the DIMM configuration and configure the memory controller appropriately. Running i2cdetect against the system SMBus gave us a range of devices, including one at 0x50 and one at 0x52. The detected DIMM was at 0x50, which made 0x52 seem like a reasonable bet - and grepping the tree showed that several other systems used 0x52 as the address for their second socket. Adding that to the list of addresses and passing it to the FSP gave us all our RAM.

So, now we just had to deal with the EC. One thing we noticed was that if we flashed the vendor firmware, ran it, flashed Coreboot and then rebooted without cutting the power, the EC worked. This strongly suggested that there was some setup code happening in the vendor firmware that configured the EC appropriately, and if we duplicated that it would probably work. Unfortunately, figuring out what that code was was difficult. I ended up dumping the PCI device configuration for the vendor firmware and for Coreboot in case that would give us any clues, but the only thing that seemed relevant at all was that the LPC controller was configured to pass io ports 0x4e and 0x4f to the LPC bus with the vendor firmware, but not with Coreboot. Unfortunately the EC was supposed to be listening on 0x62 and 0x66, so this wasn't the problem.

I ended up solving this by using UEFITool to extract all the code from the vendor firmware, and then disassembled every object and grepped them for port io. x86 systems have two separate io buses - memory and port IO. Port IO is well suited to simple devices that don't need a lot of bandwidth, and the EC is definitely one of these - there's no way to talk to it other than using port IO, so any configuration was almost certainly happening that way. I found a whole bunch of stuff that touched the EC, but was clearly depending on it already having been enabled. I found a wide range of cases where port IO was being used for early PCI configuration. And, finally, I found some code that reconfigured the LPC bridge to route 0x4e and 0x4f to the LPC bus (explaining the configuration change I'd seen earlier), and then wrote a bunch of values to those addresses. I mimicked those, and suddenly the EC started responding.

It turns out that the writes that made this work weren't terribly magic. PCs used to have a SuperIO chip that provided most of the legacy port functionality, including the floppy drive controller and parallel and serial ports. Individual components (called logical devices, or LDNs) could be enabled and disabled using a sequence of writes that was fairly consistent between vendors. Someone on the Coreboot IRC channel recognised that the writes that enabled the EC were simply using that protocol to enable a series of LDNs, which apparently correspond to things like "Working EC" and "Working keyboard". And with that, we were done.

Coreboot doesn't currently have ACPI support for the latest Intel graphics chipsets, so right now my image doesn't have working backlight control.Backlight control also turned out to be interesting. Most modern Intel systems handle the backlight via registers in the GPU, but the X210 uses the embedded controller (possibly because it supports both LVDS and eDP panels). This means that adding a simple display stub is sufficient - all we have to do on a backlight set request is store the value in the EC, and it does the rest.

Other than that, everything seems to work (although there's probably a bunch of power management optimisation to do). I started this process knowing almost nothing about Coreboot, but thanks to the help of people on IRC I was able to get things working in about two days of work[4] and now have firmware that's about as custom as my laptop.

[1] Why not Libreboot? Because modern Intel SoCs haven't had their memory initialisation code reverse engineered, so the only way to boot them is to use the proprietary Intel Firmware Support Package.
[2] Card 0, device 0
[3] After a few false starts - it turns out that the initial memory training can take a surprisingly long time, and we kept giving up before that had happened
[4] Spread over 5 or so days of real time

comment count unavailable comments

July 31, 2018 08:44 AM

July 29, 2018

Pavel Machek: Pretty big side-effect

Timing and side-channels are not normally considered side-effects, meaning compilers and cpus feel free to do whatever they want. And they do. Unfortunately, I consider leaking my passwords to remote attackers prety significant side-effect... Imagine simple function.

void handle(char secret) {}
That's obviously safe, right? And now
void handle(char secret) { int i; for (i=0; i<secret*1000000; i++) ; }
That's obviously bad idea, because now secret is exposed via timing. Now, that used to be the only sidechannel for a while, but then, caches were invented. These days,
static char font[16*256]; void handle(char secret) { font[secret*16]; }
may be bad idea. But C has not changed, it knows nothing about caches, and nothing about side-channels. Caches are old news. But today we have complex branch predictors, and speculative execution. It is called spectre. This is bad idea:
static char font[16*256]; void handle(char secret) { if (0) font[secret*16]; }
as is this:
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) big[small[untrusted]]; }
CPU bug... unfortunately it is tricky to fix, and the bug only affects caches / timing, so it "does not exist" as far as C is concerned. Canonical fix is something like
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) { asm volatile("lfence"); big[small[untrusted]]; }}
which is okay as long as compiler compiles it the obvious way. But again, compiler knows nothing about caches / side channels, so it may do something unexpected, and re-introduce the bug. Unfortunately, it seems that there's not even agreement whose bug it is. Is it time C was extended to know about side-channels? What about
void handle(int please_do_not_leak_this secret) {}
? Do we need new language to handle modern (speculative, multi-core, fast, side-channels all around) CPUs?
(Now, you may say that it is impossible to eliminate all the side-channels. I believe eliminating most of them is well possible, if we are willing to live with ... huge slowdown. You can store each variable twice, to at least detectrowhammer. Caches still can be disabled -- row buffer in DRAM will be more problematic, and if you disable hyperthreading and make every second instruction  lfence, you can get rid of Spectre-like problems. You may get 100x? 1000x? slowdown, but if that's applied only to software that needs it, it may be acceptable. You probably want your editors & mail readers protected. You probably don't need gcc to be protected. No, running modern web browser does not look sustainable).

July 29, 2018 06:20 AM

July 24, 2018

Linux Plumbers Conference: RDMA Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2018 Linux Plumbers Conference!

RDMA (remote direct memory access) is a well-established technology that is used in environments requiring both maximum throughputs and minimum latencies. For a long time, this technology was used primary in high-performance computing, high frequency trading, and supercomputing. For example, the three most powerful computers are based on Linux and RDMA (in the guise of Infiniband).

However, the latest trends in cloud computing (more bandwidth at larger scales) and storage (more IOPS) makes RDMA increasingly important outside of its initial niches. Therefore, clean integration between RDMA and various kernel susbsystems is paramount. We are therefore looking to build on previous years’ successful RDMA microconferences, this year discussing our 2018-2019 plans and roadmap.

Topics proposed for this year’s event include the interaction between RDMA and DAX (direct access for files), how to solve the get_user_pages() problem (see https://lwn.net/Articles/753027/ and https://lwn.net/Articles/753272/), IOMMU and PCI-E issues, continuous integration, python integration, and Syzkaller testing.

July 24, 2018 05:40 PM

July 23, 2018

Paul E. Mc Kenney: RDMA Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2018 Linux Plumbers Conference!

RDMA (remote direct memory access) is a well-established technology that is used in environments requiring both maximum throughputs and minimum latencies. For a long time, this technology was used primary in high-performance computing, high frequency trading, and supercomputing. For example, the three most powerful computers are based on Linux and RDMA (in the guise of Infiniband).

However, the latest trends in cloud computing (more bandwidth at larger scales) and storage (more IOPS) makes RDMA increasingly important outside of its initial niches. Therefore, clean integration between RDMA and various kernel susbsystems is paramount. We are therefore looking to build on previous years' successful RDMA microconferences, this year discussing our 2018-2019 plans and roadmap.

Topics proposed for this year's event include the interaction between RDMA and DAX (direct access for files), how to solve the get_user_pages() problem (see https://lwn.net/Articles/753027/ and https://lwn.net/Articles/753272/), IOMMU and PCI-E issues, continuous integration, python integration, and Syzkaller testing.

July 23, 2018 08:07 PM

Linux Plumbers Conference: Two-day Networking Track added to LPC

A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”. Talk proposals on a variety of networking topics are now under consideration; that page will be updated with the accepted talks soon. The Networking Track will be open to all LPC attendees.

LPC will be held in Vancouver, British Columbia, Canada from Tuesday, November 13 through Thursday, November 15. We look forward to the Networking Track as well as the rest of the LPC content (microconferences, Kernel Summit Track, refereed talks, and BoFs) and hope to see you there.

July 23, 2018 06:32 PM

Paul E. Mc Kenney: Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU

There was a time when I felt that Linux-kernel RCU was too low-level to possibly be the subject of a security exploit, but Rowhammer put paid to that naive notion. And it finally happened earlier this year. Now, I could claim that I did nothing wrong. After all, RCU worked as advertised. The issue was instead that RCU has multiple flavors:

 


  1. RCU-bh for code that is subject to network-based denial-of-service attacks.
  2. RCU-sched for code that must interact with interrupt/NMI handlers or with preemption-disabled regions of code, and for general-purpose use in CONFIG_PREEMPT=n kernels.
  3. RCU-preempt for general-purpose use in CONFIG_PREEMPT=y kernels.


The real problem was that someone used one flavor in one part of their RCU algorithm, and another flavor in another part. This has roughly the same effect on your kernel's health and well-being as does acquiring the wrong lock. And, as luck would have it, the resulting bug proved to be exploitable. To his credit, Linus Torvalds noted that having multiple RCU flavors was a root cause, and so he asked that I do something to prevent future similar security-exploitable confusion. After some discussion, it was decided that I try to merge the three flavors of RCU into “one flavor to rule them all”.

Which I have now done in the relative privacy of my -rcu git tree (as in “git clone https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git” followed by “git checkout dev”).

So what has this got to do with validation in general or formal verification in particular?

Just this: Over the past few months, I have taken a meataxe to Linux-kernel RCU, which implies the injection of any number of bugs. If you would like your formal-verification tool/methodology to be the first to find a bug in Linux-kernel RCU that I don't already know about, this would be an excellent time to give it a try. And yes, all those qualifiers are necessary, as several groups have used formal-verification tools to find bugs in Linux-kernel RCU that I did already know about.

More generally, given the large number of swings I took with said meataxe, if your formal verification tool cannot find bugs in the current dev version of RCU, you might need to entertain the possibility that your formal verification tool cannot find bugs!

July 23, 2018 04:18 PM

July 16, 2018

Pete Zaitcev: Finally a use for code 451

Saw today at a respectable news site, which does not even nag about adblock:

451

We recognise you are attempting to access this website from a country belonging to the European Economic Area (EEA) including the EU which enforces the General Data Protection Regulation (GDPR) and therefore cannot grant you access at this time. For any issues, e-mail us at xxxxxxxx@xxxxxx.com or call us at xxx-xxx-4000.

What a way to brighten one's day. The phone without a country code is a cherry on top.

P.S. The only fly in this ointment is, I wasn't accessing it from the GDPR area. It was a geolocation failure.

July 16, 2018 12:46 AM

July 15, 2018

James Bottomley: Measuring the Horizontal Attack Profile of Nabla Containers

One of the biggest problems with the current debate about Container vs Hypervisor security is that no-one has actually developed a way of measuring security, so the debate is all in qualitative terms (hypervisors “feel” more secure than containers because of the interface breadth) but no-one actually has done a quantitative comparison.  The purpose of this blog post is to move the debate forwards by suggesting a quantitative methodology for measuring the Horizontal Attack Profile (HAP).  For more details about Attack Profiles, see this blog post.  I don’t expect this will be the final word in the debate, but by describing how we did it I hope others can develop quantitative measurements as well.

Well begin by looking at the Nabla technology through the relatively uncontroversial metric of performance.  In most security debates, it’s acceptable that some performance is lost by securing the application.  As a rule of thumb, placing an application in a hypervisor loses anywhere between 10-30% of the native performance.  Our goal here is to show that, for a variety of web tasks, the Nabla containers mechanism has an acceptable performance penalty.

Performance Measurements

We took some standard benchmarks: redis-bench-set, redis-bench-get, python-tornado and node-express and in the latter two we loaded up the web servers with simple external transactional clients.  We then performed the same test for docker, gVisor, Kata Containers (as our benchmark for hypervisor containment) and nabla.  In all the figures, higher is better (meaning more throughput):

The red Docker measure is included to show the benchmark.  As expected, the Kata Containers measure is around 10-30% down on the docker one in each case because of the hypervisor penalty.  However, in each case the Nabla performance is the same or higher than the Kata one, showing we pay less performance overhead for our security.  A final note is that since the benchmarks are network ones, there’s somewhat of a penalty paid by userspace networking stacks (which nabla necessarily has) for plugging into docker network, so we show two values, one for the bridging plug in (nabla-containers) required to orchestrate nabla with kubernetes and one as a direct connection (nabla-raw) showing where the performance would be without the network penalty.

One final note is that, as expected, gVisor sucks because ptrace is a really inefficient way of connecting the syscalls to the sandbox.  However, it is more surprising that gVisor-kvm (where the sandbox connects to the system calls of the container using hypercalls instead) is also pretty lacking in performance.  I speculate this is likely because hypercalls exact their own penalty and hypervisors usually try to minimise them, which using them to replace system calls really doesn’t do.

HAP Measurement Methodology

The Quantitative approach to measuring the Horizontal Attack Profile (HAP) says that we take the bug density of the Linux Kernel code  and multiply it by the amount of unique code traversed by the running system after it has reached a steady state (meaning that it doesn’t appear to be traversing any new kernel paths). For the sake of this method, we assume the bug density to be uniform and thus the HAP is approximated by the amount of code traversed in the steady state.  Measuring this for a running system is another matter entirely, but, fortunately, the kernel has a mechanism called ftrace which can be used to provide a trace of all of the functions called by a given userspace process and thus gives a reasonable approximation of the number of lines of code traversed (note this is an approximation because we measure the total number of lines in the function taking no account of internal code flow, primarily because ftrace doesn’t give that much detail).  Additionally, this methodology works very well for containers where all of the control flow emanates from a well known group of processes via the system call information, but it works less well for hypervisors where, in addition to the direct hypercall interface, you also have to add traces from the back end daemons (like the kvm vhost kernel threads or dom0 in the case of Xen).

HAP Results

The results are for the same set of tests as the performance ones except that this time we measure the amount of code traversed in the host kernel:

As stated in our methodology, the height of the bar should be directly proportional to the HAP where lower is obviously better.  On these results we can say that in all cases the Nabla runtime tender actually has a better HAP than the hypervisor contained Kata technology, meaning that we’ve achieved a container system with better HAP (i.e. more secure) than hypervisors.

Some of the other results in this set also bear discussing.  For instance the Docker result certainly isn’t 10x the Kata result as a naive analysis would suggest.  In fact, the containment provided by docker looks to be only marginally worse than that provided by the hypervisor.  Given all the hoopla about hypervisors being much more secure than containers this result looks surprising but you have to consider what’s going on: what we’re measuring in the docker case is the system call penetration of normal execution of the systems.  Clearly anything malicious could explode this result by exercising all sorts of system calls that the application doesn’t normally use.  However, this does show clearly that a docker container with a well crafted seccomp profile (which blocks unexpected system calls) provides roughly equivalent security to a hypervisor.

The other surprising result is that, in spite of their claims to reduce the exposure to Linux System Calls, gVisor actually is either equivalent to the docker use case or, for the python tornado test, significantly worse than the docker case.  This too is explicable in terms of what’s going on under the covers: gVisor tries to improve containment by rewriting the Linux system call interface in Go.  However, no-one has paid any attention to the amount of system calls the Go runtime is actually using, which is what these results are really showing.  Thus, while current gVisor doesn’t currently achieve any containment improvement on this methodology, it’s not impossible to write a future version of the Go runtime that is much less profligate in the way it uses system calls by developing a Secure Go using the same methodology we used to develop Nabla.

Conclusions

On both tests, Nabla is far and away the best containment technology for secure workloads given that it sacrifices the least performance over docker to achieve the containment and, on the published results, is 2x more secure even than using hypervisor based containment.

Hopefully these results show that it is perfectly possible to have containers that are more secure than hypervisors and lays to rest, finally, the arguments about which is the more secure technology.  The next step, of course, is establishing the full extent of exposure to a malicious application and to do that, some type of fuzz testing needs to be employed.  Unfortunately, right at the moment, gVisor is simply crashing when subjected to fuzz testing, so it needs to become more robust before realistic measurements can be taken.

July 15, 2018 05:54 AM

James Bottomley: A New Method of Containment: IBM Nabla Containers

In the previous post about Containers and Cloud Security, I noted that most of the tenants of a Cloud Service Provider (CSP) could safely not worry about the Horizontal Attack Profile (HAP) and leave the CSP to manage the risk.  However, there is a small category of jobs (mostly in the financial and allied industries) where the damage done by a Horizontal Breach of the container cannot be adequately compensated by contractual remedies.  For these cases, a team at IBM research has been looking at ways of reducing the HAP with a view to making containers more secure than hypervisors.  For the impatient, the full open source release of the Nabla Containers technology is here and here, but for the more patient, let me explain what we did and why.  We’ll have a follow on post about the measurement methodology for the HAP and how we proved better containment than even hypervisor solutions.

The essence of the quest is a sandbox that emulates the interface between the runtime and the kernel (usually dubbed the syscall interface) with as little code as possible and a very narrow interface into the kernel itself.

The Basics: Looking for Better Containment

The HAP attack worry with standard containers is shown on the left: that a malicious application can breach the containment wall and attack an innocent application.  This attack is thought to be facilitated by the breadth of the syscall interface in standard containers so the guiding star in developing Nabla Containers was a methodology for measuring the reduction in the HAP (and hence the improvement in containment), but the initial impetus came from the observation that unikernel systems are nicely modular in the libOS approach, can be used to emulate systemcalls and, thanks to rumprun, have a wide set of support for modern web friendly languages (like python, node.js and go) with a fairly thin glue layer.  Additionally they have a fairly narrow set of hypercalls that are actually used in practice (meaning they can be made more secure than conventional hypervisors).  Code coverage measurements of standard unikernel based kvm images confirmed that they did indeed use a far narrower interface.

Replacing the Hypervisor Interface

One of the main elements of the hypervisor interface is the transition from a less privileged guest kernel to a more privileged host one via hypercalls and vmexits.  These CPU mediated events are actually quite expensive, certainly a lot more expensive than a simple system call, which merely involves changing address space and privilege level.  It turns out that the unikernel based kvm interface is really only nine hypercalls, all of which are capable of being rewritten as syscalls, so the approach to running this new sandbox as a container is to do this rewrite and seccomp restrict the interface to being only what the rewritten unikernel runtime actually needs (meaning that the seccomp profile is now CSP enforced).  This vision, by the way, of a broad runtime above being mediated to a narrow interface is where the name Nabla comes from: The symbol for Nabla is an inverted triangle (∇) which is broad at the top and narrows to a point at the base.

Using this formulation means that the nabla runtime (or nabla tender) can be run as a single process within a standard container and the narrowness of the interface to the host kernel prevents most of the attacks that a malicious application would be able to perform.

DevOps and the ParaVirt conundrum

Back at the dawn of virtualization, there were arguments between Xen and VMware over whether a hypervisor should be fully virtual (capable of running any system supported by the virtual hardware description) or paravirtual (the system had to be modified to run on the virtualization system and thus would be incapable of running on physical hardware).  Today, thanks in a large part to CPU support for virtualization primtives, fully paravirtual systems have long since gone the way of the dodo and everyone nowadays expects any OS running on a hypervisor to be capable of running on physical hardware1.  The death of paravirt also left the industry with an aversion to ever reviving it, which explains why most sandbox containment systems (gVisor, Kata) try to require no modifications to the image.

With DevOps, the requirement is that images be immutable and that to change an image you must take it through the full develop build, test, deploy cycle.  This development centric view means that, provided there’s no impact to the images you use as the basis for your development, you can easily craft your final image to suit the deployment environment, which means a step like linking with the nabla tender is very easy.  Essentially, this comes down to whether you take the Dev (we can rebuild to suit the environment) or the Ops (the deployment environment needs to accept arbitrary images) view.  However, most solutions take the Ops view because of the anti-paravirt bias.  For the Nabla tender, we take the Dev view, which is born out by the performance figures.

Conclusion

Like most sandbox models, the Nabla containers approach is an alternative to namespacing for containment, but it still requires cgroups for resource management.  The figures show that the containment HAP is actually better than that achieved with a hypervisor and the performance, while being marginally less than a namespaced container, is greater than that obtained by running a container inside a hypervisor.  Thus we conclude that for tenants who have a real need for HAP reduction, this is a viable technology.

July 15, 2018 05:54 AM

July 12, 2018

Pete Zaitcev: Guido van Rossum steps down

See a mailing list message:

I would like to remove myself entirely from the decision process. // I am not going to appoint a successor.

July 12, 2018 06:01 PM

June 29, 2018

Pete Zaitcev: The Proprietary Mind

Regarding the Huston missive, two quotes jumped at me the most. The first is just beautiful:

It may be slightly more disconcerting to realise that your electronic wallet is on a device that is using a massive compilation of open source software of largely unknown origin [...]

Yeah, baby. This moldy canard is still operational.

The second is from the narrative of the smartphone revolution:

Apple’s iPhone, released in 2007, was a revolutionary device. [...] Apple’s early lead was rapidly emulated by Windows and Nokia with their own offerings. Google’s position was more as an active disruptor, using an open licensing framework for the Android platform [...]

Again, it's not like he's actually lying. He merely implies heavily that Nokia came next. I don't think the Nokia blunder even deserve a footnote, but to Huston, Google was too open. Google, Carl!

June 29, 2018 12:58 PM

June 26, 2018

James Morris: Linux Security Summit North America 2018: Schedule Published

The schedule for the Linux Security Summit North America (LSS-NA) 2018 is now published.

Highlights include:

and much more!

LSS-NA 2018 will be co-located with the Open Source Summit, and held over 27th-28th August, in Vancouver, Canada.  The attendance fee is $100 USD.  Register here.

See you there!

June 26, 2018 09:11 PM

June 25, 2018

Vegard Nossum: Compiler fuzzing, part 1

Much has been written about fuzzing compilers already, but there is not a lot that I could find about fuzzing compilers using more modern fuzzing techniques where coverage information is fed back into the fuzzer to find more bugs.

If you know me at all, you know I'll throw anything I can get my hands on at AFL. So I tried gcc. (And clang, and rustc -- but more about Rust in a later post.)

Levels of fuzzing


First let me summarise a post by John Regehr called Levels of Fuzzing, which my approach builds heavily on. Regehr presents a very important idea (which stems from earlier research/papers by others), namely that fuzzing can operate at different "levels". These levels correspond somewhat loosely to the different stages of compilation, i.e. lexing, parsing, type checking, code generation, and optimisation. In terms of fuzzing, the source code that you pass to the compiler has to "pass" one stage before it can enter the next; if you give the compiler a completely random binary file, it is unlikely to even get past the lexing stage, never mind to the point where the compiler is actually generating code. So it is in our interest (assuming we want to fuzz more than just the lexer) to generate test cases more intelligently than just using random binary data.

If we simply try to compile random data, we're not going to get very far.
 
In a "naïve" approach, we simply compile gcc with AFL instrumentation and run afl-fuzz on it as usual. If we give a reasonable corpus of existing C code, it is possible that the fuzzer will find something interesting by randomly mutating the test cases. But more likely than not, it is mostly going to end up with random garbage like what we see above, and never actually progress to more interesting stages of compilation. I did try this -- and the results were as expected. It takes a long time before the fuzzer hits anything interesting at all. Now, Sami Liedes did this with clang back in 2014 and obtained some impressive results ("34 distinct assertion failures in the first 11 hours"). So clearly it was possible to find bugs in this way. When I tried this myself for GCC, I did not find a single crash within a day or so of fuzzing. And looking at the queue of distinct testcases it had found, it was very clear that it was merely scratching the very outermost surface of the input handling in the compiler -- it was not able to produce a single program that would make it past the parsing stage.

AFL has a few built-in mutation strategies: bit flips, "byte flips", arithmetic on bytes, 2-bytes, and 4-bytes, insertion of common boundary values (like 0, 1, powers of 2, -1, etc.), insertions of and substitution by "dictionary strings" (basically user-provided lists of strings), along with random splicing of test cases. We can already sort of guess that most of these strategies will not be useful for C and C++ source code. Perhaps the "dictionary strings" is the most promising for source code as it allows you to insert keywords and snippets of code that have at least some chance of ending up as a valid program. For the other strategies, single bit flips can change variable names, but changing variable names is not that interesting unless you change one variable into another (which both have to exist, as otherwise you would hit a trivial "undeclared" error). They can also create expressions, but if you somehow managed to change a 'h' into a '(', source code with this mutation would always fail unless you also inserted a ')' somewhere else to balance the expression. Source code has a lot of these "correspondances" where changing one thing also requires changing another thing somewhere else in the program if you want it to still compile (even though you don't generate an equivalent program -- that's not what we're trying to do here). Variable uses match up with variable declarations. Parantheses, braces, and brackets must all match up (and in the right order too!).

These "correspondences" remind me a lot of CRCs and checksums in other file formats, and they give the fuzzer problems for the exact same reason: without extra code it's hard to overcome having to change the test case simultaneously in two or more places, never mind making the exact change that will preserve the relationship between these two values. It's a game of combinatorics; the more things we have to change at once and the more possibilities we have for those changes, the harder it will be to get that exact combination when you're working completely at random. For checksums the answer is easy, and there are two very good strategies: either you disable the checksum verification in the code you're fuzzing, or you write a small wrapper to "fix up" your test case so that the checksum always matches the data it protects (of course, after mutating an input you may not really know where in the file the checksum will be located anymore, but that's a different problem).

For C and C++ source code it's not so obvious how to help the fuzzer overcome this. You can of course generate programs with a grammar (and some heuristics), which is what several C random code generators such as Csmith, ccg, and yarpgen do. This is in a sense on the completely opposite side of the spectrum when it comes to the levels of fuzzing. By generating programs that you know are completely valid (and correct, and free of undefined behaviour), you will breeze through the lexing, the parsing, and the type checking and target the code generation and optimization stages. This is what Regehr et al. did in "Taming compiler fuzzers", another very interesting read. (Their approach does not include instrumentation feedback, however, so it is more of a traditional black-box fuzzing approach than AFL, which is considered grey-box fuzzing.)

But if you use a C++ grammar to generate C++ programs, that will also exclude a lot of inputs that are not valid but nevertheless accepted by the compiler. This approach relies on our ability to express all programs that should be valid, but there may also be programs non-valid programs that crash the compiler. As an example, if our generator knows that you cannot add an integer to a function, or assign a value to a constant, then the code paths checking for those conditions in the compiler would never be exercised, despite the fact that those errors are more interesting than mere syntax errors. In other words, there is a whole range of "interesting" test cases which we will never be able to generate if we restrict ourselves only to those programs that are actually valid code.

Please note that I am not saying that one approach is better than the other! I believe we need all of them to successfully find bugs in all the areas of the compiler. By realising exactly what the limits of each method are, we can try to find other ways to fill the gaps.

Fuzzing with a loose grammar


So how can we fill the gap between the shallow syntax errors in the front end and the very deep of the code generation in the back end? There are several things we can do.

The main feature of my solution is to use a "loose" grammar. As opposed to a "strict" grammar which would follow the C/C++ specs to the dot, the loose grammar only really has one type of symbol, and all the production rules in the grammar create this type of symbol. As a simple example, a traditional C grammar will not allow you to put a statement where an expression is expected, whereas the loose grammar has no restrictions on that. It does, however, take care that your parantheses and braces match up. My grammar file therefore looks something like this (also see the full grammar if you're curious!):
"[void] [f] []([]) { [] }"
"[]; []"
"{ [] }"
"[0] + [0]"
...
Here, anything between "[" and "]" (call it a placeholder) can be substituted by any other line from the grammar file. An evolution of a program could therefore plausibly look like this:
void f () { }           // using the "[void] [f] []([]) { [] }" rule
void f () { ; } // using the "[]; []" rule
void f () { 0 + 0; } // using the "[0] + [0]" rule
void f ({ }) { 0 + 0; } // using the "{ [] }" rule
...
Wait, what happened at the end there? That's not valid C. No -- but it could still be an interesting thing to try to pass to the compiler. We did have a placeholder where the arguments usually go, and according to the grammar we can put any of the other rules in there. This does quickly generate a lot of nonsensical programs that stop the compiler completely dead in its track at the parsing stage. We do have another trick to help things along, though...

AFL doesn't care at all whether what we pass it is accepted by the compiler or not; it doesn't distinguish between success and failure, only between graceful termination and crashes. However, all we have to do is teach the fuzzer about the difference between exit codes 0 and 1; a 0 means the program passed all of gcc's checks and actually resulted in an object file. Then we can discard all the test cases that result in an error, and keep a corpus of test cases which compile successfully. It's really a no-brainer, but makes such a big difference in what the fuzzer can generate/find.

Enter prog-fuzz


prog-fuzz output


If it's not clear by now, I'm not using afl-fuzz to drive the main fuzzing process for the techniques above. I decided it was easier to write a fuzzer from scratch, just reusing the AFL instrumentation and some of the setup code to collect the coverage information. Without the fork server, it's surprisingly little code, on the order of 15-20 lines of code! (I do have support for the fork server on a different branch and it's not THAT much harder to implement, but I simply haven't gotten around to it yet; and it also wasn't really needed to find a lot of bugs).

You can find prog-fuzz on GitHub: https://github.com/vegard/prog-fuzz

The code is not particularly clean, it's a hacked-up fuzzer that gets the job done. I'll want to clean that up at some point, document all the steps to build gcc with AFL instrumentation, etc., and merge a proper fork server. I just want the code to be out there in case somebody else wants to have a poke around.

Results


From the end of February until some time in April I ran the fuzzer on and off and reported just over 100 distinct gcc bugs in total (32 of them fixed so far, by my count):
Now, there are a few things to be said about these bugs.

First, these bugs are mostly crashes: internal compiler errors ("ICEs"), assertion failures, and segfaults. Compiler crashes are usually not very high priority bugs -- especially when you are dealing with invalid programs. Most of the crashes would never occur "naturally" (i.e. as the result of a programmer trying to write some program). They represent very specific edge cases that may not be important at all in normal usage. So I am under no delusions about the relative importance of these bugs; a compiler crash is hardly a security risk.

However, I still think there is value in fuzzing compilers. Personally I find it very interesting that the same technique on rustc, the Rust compiler, only found 8 bugs in a couple of weeks of fuzzing, and not a single one of them was an actual segfault. I think it does say something about the nature of the code base, code quality, and the relative dangers of different programming languages, in case it was not clear already. In addition, compilers (and compiler writers) should have these fuzz testing techniques available to them, because it clearly finds bugs. Some of these bugs also point to underlying weaknesses or to general cases where something really could go wrong in a real program. In all, knowing about the bugs, even if they are relatively unimportant, will not hurt us.

Second, I should also note that I did have conversations with the gcc devs while fuzzing. I asked if I should open new bugs or attach more test cases to existing reports if I thought the area of the crash looked similar, even if it wasn't the exact same stack trace, etc., and they always told me to file a new report. In fact, I would like to praise the gcc developer community: I have never had such a pleasant bug-reporting experience. Within a day of reporting a new bug, somebody (usually Martin Liška or Marek Polacek) would run the test case and mark the bug as confirmed as well as bisect it using their huge library of precompiled gcc binaries to find the exact revision where the bug was introduced. This is something that I think all projects should strive to do -- the small feedback of having somebody acknowledge the bug is a huge encouragement to continue the process. Other gcc developers were also very active on IRC and answered almost all my questions, ranging from silly "Is this undefined behaviour?" to "Is this worth reporting?". In summary, I have nothing but praise for the gcc community.

I should also add that I played briefly with LLVM/clang, and prog-fuzz found 9 new bugs (2 of them fixed so far):
In addition to those, I also found a few other bugs that had already been reported by Sami Liedes back in 2014 which remain unfixed.

For rustc, I will write a more detailed blog post about how to set it up, as compiling rustc itself with AFL instrumentation is non-trivial and it makes more sense to detail those exact steps apart from this post.

What next?


I mentioned the efforts by Regehr et al. and Dmitry Babokin et al. on Csmith and yarpgen, respectively, as fuzzers that generate valid (UB-free) C/C++ programs for finding code generation bugs. I think there is work to be done here to find more code generation bugs; as far as I can tell, nobody has yet combined instrumentation feedback (grey-box fuzzing) with this kind of test case generator. Well, I tried to do it, but it requires a lot of effort to generate valid programs that are also interesting, and I stopped before finding any actual bugs. But I really think this is the future of compiler fuzzing, and I will outline the ideas that I think will have to go into it:
I don't have the time to continue working on this at the moment, but please do let me know if you would like to give it a try and I'll do my best to answer any questions about the code or the approach.

Acknowledgements


Thanks to John Regehr, Martin Liška, Marek Polacek, Jakub Jelinek, Richard Guenther, David Malcolm, Segher Boessenkool, and Martin Jambor for responding to my questions and bug reports!

Thanks to my employer, Oracle, for allowing me to do part of this fuzzing effort using company time and resources.

June 25, 2018 07:35 AM

June 22, 2018

Paul E. Mc Kenney: Stupid RCU Tricks: Changes to -rcu Workflow

The -rcu tree also takes LKMM patches, and I have been handling these completely separately, with one branch for RCU and another for LKMM. But this can be a bit inconvenient, and more important, can delay my response to patches to (say) LKMM if I am doing (say) extended in-tree RCU testing. So it is time to try something a bit different.

My current thought is continue to have separate LKMM and RCU branches (or more often, sets of branches) containing the commits to be offered up to the next merge window. The -rcu branch lkmm would flag the LKMM branch (or, more often, merge commit) and a new -rcu branch rcu would flag the RCU branch (or, again more often, merge commit). Then the lkmm and rcu merge commits would be merged, with new commits on top. These new commits would be intermixed RCU and LKMM commits.

The tip of the -rcu development effort (both LKMM and RCU) would be flagged with a new dev branch, with the old rcu/dev branch being retired. The rcu/next branch will continue to mark the commit to be pulled into the -next tree, and will point to the merge of the rcu and lkmm branches during the merge window.

I will create the next-merge-window branches sometime around -rc1 or -rc2, as I have in the past. I will send RFC patches to LKML shortly thereafter. I will send a pull request for the rcu branch around -rc5, and will send final patches from the lkmm branch at about that same time.

Should continue to be fun! :–)

June 22, 2018 09:17 PM

June 21, 2018

James Bottomley: Containers and Cloud Security

Introduction

The idea behind this blog post is to take a new look at how cloud security is measured and what its impact is on the various actors in the cloud ecosystem.  From the measurement point of view, we look at the vertical stack: all code that is traversed to provide a service all the way from input web request to database update to output response potentially contains bugs; the bug density is variable for the different components but the more code you traverse the higher your chance of exposure to exploitable vulnerabilities.  We’ll call this the Vertical Attack Profile (VAP) of the stack.  However, even this axis is too narrow because the primary actors are the cloud tenant and the cloud service provider (CSP).  In an IaaS cloud, part of the vertical profile belongs to the tenant (The guest kernel, guest OS and application) and part (the hypervisor and host OS) belong to the CSP.  However, the CSP vertical has the additional problem that any exploit in this piece of the stack can be used to jump into either the host itself or any of the other tenant virtual machines running on the host.  We’ll call this exploit causing a failure of containment the Horizontal Attack Profile (HAP).  We should also note that any Horizontal Security failure is a potentially business destroying event for the CSP, so they care deeply about preventing them.  Conversely any exploit occurring in the VAP owned by the Tenant can be seen by the CSP as a tenant only problem and one which the Tenant is responsible for locating and fixing.  We correlate size of profile with attack risk, so the large the profile the greater the probability of being exploited.

From the Tenant point of view, improving security can be done in one of two ways, the first (and mostly aspirational) is to improve the security and monitoring of the part of the Vertical the Tenant is responsible for and the second is to shift responsibility to the CSP, so make the CSP responsible for more of the Vertical.  Additionally, for most Tenants, a Horizontal failure mostly just means they lose trust in the CSP, unless the Tenant is trusting the CSP with sensitive data which can be exfiltrated by the Horizontal exploit.  In this latter case, the Tenant still cannot do anything to protect the CSP part of the Security Profile, so it’s mostly a contractual problem: SLAs and penalties for SLA failures.

Examples

To see how these interpretations apply to the various cloud environments, lets look at some of the Cloud (and pre-Cloud) models:

Physical Infrastructure

The left hand diagram shows a standard IaaS rented physical system.  Since the Tenant rents the hardware it is shown as red indicating CSP ownership and the the two Tenants are shown in green and yellow.  In this model, barring attacks from the actual hardware, the Tenant owns the entirety of the VAP.  The nice thing for the CSP is that hardware provides air gap security, so there is no HAP which means it is incredibly secure.

However, there is another (much older) model shown on the right, called the shared login model,  where the Tenant only rents a login on the physical system.  In this model, only the application belongs to the Tenant, so the CSP is responsible for much of the VAP (the expanded red area).  Here the total VAP is the same, but the Tenant’s VAP is much smaller: the CSP is responsible for maintaining and securing everything apart from the application.  From the Tenant point of view this is a much more secure system since they’re responsible for much less of the security.  From the CSP point of view there is now a  because a tenant compromising the kernel can control the entire system and jump to other tenant processes.  This actually has the worst HAP of all the systems considered in this blog.

Hypervisor based Virtual Infrastructure

In this model, the total VAP is unquestionably larger (worse) than the physical system above because there’s simply more code to traverse (a guest and a host kernel).  However, from the Tenant’s point of view, the VAP should be identical to that of unshared physical hardware because the CSP owns all the additional parts.  However, there is the possibility that the Tenant may be compromised by vulnerabilities in the Virtual Hardware Emulation.  This can be a worry because an exploit here doesn’t lead to a Horizontal security problem, so the CSP is apt to pay less attention to vulnerabilities in the Virtual Hardware simply because each guest has its own copy (even though that copy is wholly under the control of the CSP).

The HAP is definitely larger (worse) than the physical host because of the shared code in the Host Kernel/Hypervisor, but it has often been argued that because this is so deep in the Vertical stack that the chances of exploit are practically zero (although venom gave the lie to this hope: stack depth represents obscurity, not security).

However, there is another way of improving the VAP and that’s to reduce the number of vulnerabilities that can be hit.  One way that this can be done is to reduce the bug density (the argument for rewriting code in safer languages) but another is to restrict the amount of code which can be traversed by narrowing the interface (for example, see arguments in this hotcloud paper).  On this latter argument, the host kernel or hypervisor does have a much lower VAP than the guest kernel because the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the syscall interface).

The important takeaways here are firstly that simply transferring ownership of elements in the VAP doesn’t necessarily improve the Tenant VAP unless you have some assurance that the CSP is actively monitoring and fixing them.  Conversely, when the threat is great enough (Horizontal Exploit), you can trust to the natural preservation instincts of the CSP to ensure correct monitoring and remediation because a successful Horizontal attack can be a business destroying event for the CSP.

Container Based Virtual Infrastructure

The total VAP here is identical to that of physical infrastructure.  However, the Tenant component is much smaller (the kernel accounting for around 50% of all vulnerabilities).  It is this reduction in the Tenant VAP that makes containers so appealing: the CSP is now responsible for monitoring and remediating about half of the physical system VAP which is a great improvement for the Tenant.  Plus when the CSP remediates on the host, every container benefits at once, which is much better than having to crack open every virtual machine image to do it.  Best of all, the Tenant images don’t have to be modified to benefit from these fixes, simply running on an updated CSP host is enough.  However, the cost for this is that the HAP is the entire linux kernel syscall interface meaning the HAP is much larger than then hypervisor virtual infrastructure case because the latter benefits from interface narrowing to only the hypercalls (qualitatively, assuming the hypercall interface is ~30 calls and the syscall interface is ~300 calls, then the HAP is 10x larger in the container case than the hypervisor case); however, thanks to protections from the kernel namespace code, the HAP is less than the shared login server case.  Best of all, from the Tenant point of view, this entire HAP cost is borne by the CSP, which makes this an incredible deal: not only does the Tenant get a significant reduction in their VAP but the CSP is hugely motivated to keep on top of all vulnerabilities in their part of the VAP and remediate very fast because of the business implications of a successful horizontal attack.  The flip side of this is that a large number of the world’s CSPs are very unhappy about this potential risks and costs and actually try to shift responsibility (and risk) back to the Tenant by advocating nested virtualization solutions like running containers in hypervisors. So remember, you’re only benefiting from the CSP motivation to actively maintain their share of the VAP if your CSP runs bare metal containers because otherwise they’ve quietly palmed the problem back off on you.

Other Avenues for Controlling Attack Profiles

The assumption above was that defect density per component is roughly constant, so effectively the more code the more defects.  However, it is definitely true that different code bases have different defect densities, so one way of minimizing your VAP is to choose the code you rely on carefully and, of course, follow bug reduction techniques in the code you write.

Density Reduction

The simplest way of reducing defects is to find and fix the ones in the existing code base (while additionally being careful about introducing new ones).  This means it is important to know how actively defects are being searched for and how quickly they are being remediated.  In general, the greater the user base for the component, the greater the size of the defect searchers and the faster the speed of their remediation, which means that although the Linux Kernel is a big component in the VAP and HAP, a diligent patch routine is a reasonable line of defence because a fixed bug is not an exploitable bug.

Another way of reducing defect density is to write (or rewrite) the component in a language which is less prone to exploitable defects.  While this approach has many advocates, particularly among language partisans, it suffers from the defect decay issue: the idea that the maximum number of defects occurs in freshly minted code and the number goes down over time because the more time from release the more chance they’ve been found.  This means that a newly rewritten component, even in a shiny bug reducing language, can still contain more bugs than an older component written in a more exploitable language, simply because a significant number of bugs introduced on creation have been found in the latter.

Code Reduction (Minimization Techniques)

It also stands to reason that, for a complex component, simply reducing the amount of code that is accessible to the upper components reduces the VAP because it directly reduces the number of defects.  However, reducing the amount of code isn’t as simple as it sounds: it can only really be done by components that are configurable and then only if you’re not using the actual features you eliminate.  Elimination may be done in two ways, either physically, by actually removing the code from the component or virtually by blocking access using a guard (see below).

Guarding and Sandboxing

Guarding is mostly used to do virtual code elimination by blocking access to certain code paths that the upper layers do not use.  For instance, seccomp  in the Linux Kernel can be used to block access to system calls you know the application doesn’t use, meaning it also blocks any attempt to exploit code that would be in those system calls, thus reducing the VAP (and also reducing the HAP if the kernel is shared).

The deficiencies in the above are obvious: if the application needs to use a system call, you cannot block it although you can filter it, which leads to huge and ever more complex seccomp policies.  The solution for the system call an application has to use problem can sometimes be guarding emulation.  In this mode the guard code actually emulates all the effects of the system call without actually making the actual system call into the kernel.  This approach, often called sandboxing, is certainly effective at reducing the HAP since the guards usually run in their own address space which cannot be used to launch a horizontal attack.  However, the sandbox may or may not reduce the VAP depending on the bugs in the emulation code vs the bugs in the original.  One of the biggest potential disadvantages to watch out for with sandboxing is the fact that the address space the sandbox runs in is often that of the tenant, often meaning the CSP has quietly switched ownership of that component back to the tenant as well.

Conclusions

First and foremost: security is hard.  As a cloud Tenant, you really want to offload as much of it as possible to people who are much more motivated to actually do it than you are (i.e. the Cloud Service Provider).

The complete Vertical Attack Profile of a container bare metal system in the cloud is identical to a physical system and better than a Hypervisor based system; plus the tenant owned portion is roughly 50% of the total VAP meaning that Containers are by far the most secure virtualization technology available today from the Tenant perspective.

The increased Horizontal Attack profile that containers bring should all rightly belong to the Cloud Service Provider.  However, CSPs are apt to shirk this responsibility and try to find creative ways to shift responsibility back to the tenant including spreading misinformation about the container Attack profiles to try to make Tenants demand nested solutions.

Before you, as a Tenant, start worrying about the CSP owned Horizontal Attack Profile, make sure that contractual remedies (like SLAs or Reputational damage to the CSP) would be insufficient to cover the consequences of any data loss that might result from a containment breach.  Also remember that unless you, as the tenant, are under external compliance obligations like HIPPA or PCI, contractual remedies for a containment failure are likely sufficient and you should keep responsibility for the HAP where it belongs: with the CSP.

June 21, 2018 05:31 AM

June 19, 2018

Pete Zaitcev: Slasti py3

Got Slasti 2.1 released today, the main feature being a support for Python 3. Some of the changes were somewhat... horrifying maybe? I tried to adhere to a general plan, where the whole of the application operates in unicode, and the UTF-8 data is encoded/decoded at the boundary. Unfortunately, in practice the boundary was rather leaky, so in several places I had to resort to isinstance(). I expected to always assign a type to all variables and fields, and then rigidly convert as needed. But WSGI had its own ideas.

Overall, the biggest source of issues was not the py3 model, but trying to make the code compatible. I'm not going to do that again if I can help it: either py2 or py3, but not both.

UPDATE: Looks like CKS agrees that compatible code is usually too hard. I'm glad the recommendation to avoid Python 3 entirely is no longer operational.

June 19, 2018 02:54 AM

June 18, 2018

James Morris: Linux Security BoF at Open Source Summit Japan

This is a reminder for folks attending OSS Japan this week that I’ll be leading a  Linux Security BoF session  on Wednesday at 6pm.

If you’ve been working on a Linux security project, feel welcome to discuss it with the group.  We will have a whiteboard and projector.   This is also a good opportunity to raise topics for discussion, and to ask questions about Linux security.

See you then!

June 18, 2018 08:26 AM

June 16, 2018

Linux Plumbers Conference: Registration for Linux Plumbers Conference is Now Open

The 2018 Linux Plumbers Conference organizing committee is pleased to announce that the registration for this year’s conference is now open. Information on how to register can be found here. Registration prices and cutoff dates are published in the ATTEND page. A reminder that we are following a quota system to release registration slots. Therefore the early registration rate will remain in effect until early registration closes on 10 August 2018, or the quota limit is reached, whatever comes earlier. As usual, contact us if you have questions.

June 16, 2018 05:57 PM

June 15, 2018

Pete Zaitcev: Fedora 28 and IPv6 Neighbor Discovery

Finally updated my laptop to F28 and ssh connections started hanging. They hang for 15-20 seconds, then unstuck for a few seconds, then hang, and so on, cycling. I thought it was a WiFi problem at first. But eventually I narrowed it down to IPv6 ND being busted.

A packet trace on the laptop shows that traffic flows until the laptop issues a neighbor solicitation. The router replies with an advertisement, which I presume is getting dropped. Traffic stops — although what's strange, tcpdump still captures outgoing packets that the laptop sends. In a few seconds, the router sends a neighbor solicitation, but the laptop never replies. Presumably, dropped as well. This continues until a router advertisement resets the cycle.

Stopping firewalld lets solicitations in and the traffic resumes, so obviously a rule is busted somewhere. The IPv6 ICMP appears allowed, but the ip6tables rules generated by Firewalld are fairly opaque, I cannot be sure. Ended filing bug 1591867 for the time being and forcing ssh -4.

UPDATE: Looks like the problem is a "reverse path filter". Setting IPv6_rpfilter=no in /etc/firewalld/firewalld.conf fixes the the issue (thanks to Victor for the tip). Here's an associated comment in the configuration file:

# Performs a reverse path filter test on a packet for IPv6. If a reply to the
# packet would be sent via the same interface that the packet arrived on, the
# packet will match and be accepted, otherwise dropped.
# The rp_filter for IPv4 is controlled using sysctl.

Indeed there's no such sysctl for v6. Obviously the problem is that packets with the source of fe80::/16 are mistakenly assumed to be martians and dropped. That's easy enough to fix, I hope. But it's fascinating that we have an alternative configuration method nowadays, only exposed by certain specialist tools. If I don't have firewalld installed, and want this setting changed, what then?

Remarkably, the problem was reported first in March (it's June now). This tells me that most likely the erroneous check itself is in the kernel somewhere, and firewalld is not at fault, which is why Erik isn't fixing it. He should've reassigned the bug to kernel, if so, but...

The commit cede24d1b21d68d84ac5a36c44f7d37daadcc258 looks like the fix. Unfortunately, it just missed the 4.17.

June 15, 2018 05:39 PM

June 14, 2018

Kees Cook: security things in Linux v4.17

Previously: v4.16.

Linux kernel v4.17 was released last week, and here are some of the security things I think are interesting:

Jailhouse hypervisor

Jan Kiszka landed Jailhouse hypervisor support, which uses static partitioning (i.e. no resource over-committing), where the root “cell” spawns new jails by shrinking its own CPU/memory/etc resources and hands them over to the new jail. There’s a nice write-up of the hypervisor on LWN from 2014.

Sparc ADI

Khalid Aziz landed the userspace support for Sparc Application Data Integrity (ADI or SSM: Silicon Secured Memory), which is the hardware memory coloring (tagging) feature in Sparc M7. I’d love to see this extended into the kernel itself, as it would kill linear overflows between allocations, since the base pointer being used is tagged to belong to only a certain allocation (sized to a multiple of cache lines). Any attempt to increment beyond, into memory with a different tag, raises an exception. Enrico Perla has some great write-ups on using ADI in allocators and a comparison of ADI to Intel’s MPX.

new kernel stacks cleared on fork

It was possible that old memory contents would live in a new process’s kernel stack. While normally not visible, “uninitialized” memory read flaws or read overflows could expose these contents (especially stuff “deeper” in the stack that may never get overwritten for the life of the process). To avoid this, I made sure that new stacks were always zeroed. Oddly, this “priming” of the cache appeared to actually improve performance, though it was mostly in the noise.

MAP_FIXED_NOREPLACE

As part of further defense in depth against attacks like Stack Clash, Michal Hocko created MAP_FIXED_NOREPLACE. The regular MAP_FIXED has a subtle behavior not normally noticed (but used by some, so it couldn’t just be fixed): it will replace any overlapping portion of a pre-existing mapping. This means the kernel would silently overlap the stack into mmap or text regions, since MAP_FIXED was being used to build a new process’s memory layout. Instead, MAP_FIXED_NOREPLACE has all the features of MAP_FIXED without the replacement behavior: it will fail if a pre-existing mapping overlaps with the newly requested one. The ELF loader has been switched to use MAP_FIXED_NOREPLACE, and it’s available to userspace too, for similar use-cases.

pin stack limit during exec

I used a big hammer and pinned the RLIMIT_STACK values during exec. There were multiple methods to change the limit (through at least setrlimit() and prlimit()), and there were multiple places the limit got used to make decisions, so it seemed best to just pin the values for the life of the exec so no games could get played with them. Too much assumed the value wasn’t changing, so better to make that assumption actually true. Hopefully this is the last of the fixes for these bad interactions between stack limits and memory layouts during exec (which have all been defensive measures against flaws like Stack Clash).

Variable Length Array removals start

Following some discussion over Alexander Popov’s ongoing port of the stackleak GCC plugin, Linus declared that Variable Length Arrays (VLAs) should be eliminated from the kernel entirely. This is great because it kills several stack exhaustion attacks, including weird stuff like stepping over guard pages with giant stack allocations. However, with several hundred uses in the kernel, this wasn’t going to be an easy job. Thankfully, a whole bunch of people stepped up to help out: Gustavo A. R. Silva, Himanshu Jha, Joern Engel, Kyle Spiers, Laura Abbott, Lorenzo Bianconi, Nikolay Borisov, Salvatore Mesoraca, Stephen Kitt, Takashi Iwai, Tobin C. Harding, and Tycho Andersen. With Linus Torvalds and Martin Uecker, I also helped rewrite the max() macro to eliminate false positives seen by the -Wvla compiler option. Overall, about 1/3rd of the VLA instances were solved for v4.17, with many more coming for v4.18. I’m hoping we’ll have entirely eliminated VLAs by the time v4.19 ships.

That’s in for now! Please let me know if you think I missed anything. Stay tuned for v4.18; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

June 14, 2018 11:23 PM

June 07, 2018

Pete Zaitcev: Fundamental knowledge

Colleagues working in space technologies discussed recently if fundamental education were necessary for a programmer, so just for a reference, here's a list of fundamental-ish areas I had trouble with in practice over a 30 year career.

Statistics. This should be obvious. Although in theory I'm educated in the topic, I always had difficulty with it, and barely passed my tests, decades ago.

Error correction. To be entirely honest, I blew this. Every time I had to do it, I ended either using Phil Karn's library, or relying on Kevin Greenan's erasure coding package. I think the only time I implemented something that worked was the UAT.

The DSP on Inphase/Quadrature data. This one is really vexing. I ended with some ridiculous ad-hoc code, even though it's very interesting. In my excuse, there were some difficult performance constraints, so even if I knew the underlying math, there would be no way to apply it.

Other than the above, I don't feel like I was held back by any kind of fundamental background, most of all not in CS. About the only time it mattered was when an interviewer asked me to implement an R-B tree.

June 07, 2018 04:03 PM

Pavel Machek: Complex cameras coming to PCs

It seems PCs are getting complex cameras. Which is bad news for PCs, because existing libv4l2 will not work there, but good news for OMAP3, as there will be bigger pressure to fix stuff.

June 07, 2018 12:34 PM

June 05, 2018

Davidlohr Bueso: Linux v4.17: Performance Goodies

With Linux v4.17 now released, there are some interesting performance changes that went worth looking at. As always, the term 'performance' can be vague in that some gains in one area can negatively affect another so take everything with a grain of salt.


sysvipc: introduce STAT_ANY commands

There was a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and getting stat info (such as via SHM_STAT shmctl command). The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it.

Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases. For this, the new {SEM,SHM,MSG}_STAT_ANY commands have been introduced.
[Commit c21a6970ae72, a280d6dc77eb, 23c8cec8cf67]


kvm: x86 paravirtualization hints and KVM_HINTS_DEDICATED

When dealing with CPU virtualization, many in-kernel heuristics and optimizations revolve around the overcommited scenario.  By introducing KVM_HINTS_DEDICATED, the hypervisor administrator can select this option when there are pinned 1:1 virtual to physical CPU scenarios; particularly reducing the paravirt overhead in locking and TLB flushing as the vCPU is most unlikely to get preempted. In these cases, native qspinlock may perform better than pvqspinlock as it disables paravirt spinlock slowpath optimizations. There is an older Xen equivalent available as a kernel parameter: xen_nopvspin.
[Commit b2798ba0b876, 34226b6b7098, 6beacf74c257]


sched: rework idle loop

Rework the idle loop in order to prevent CPUs from spending too much time in shallow idle states by making it stop the scheduler tick before putting the CPU into an idle state only if the idle duration predicted by the idle governor is long enough. It reduces idle power on some systems by 10% or more and may improve performance of workloads in which the idle loop overhead matters. This required the code to be reordered to invoke the idle governor before stopping the tick, among other things
[Commit 0e7767687fda, 2aaf709a518d, ed98c3491998]


 mm: pcpu pages optimizations around zone lock

Two optimizations around zone->lock in free_pcpupages_bulk() that yield around a 5% performance improvement in page-fault benchmarks (will-it-scale in this case). The first reduces the scope of the  when freeing a batch of pages from back to buddy. Considering the per-cpu semantics, the lock was unnecessarily  held while pages are chosen from the pcpu page's migratetype list.

The second improvement adds a prefetch to the to-be-freed page's buddy outside of  the lock in hope that accessing the buddy's page structure later with the lock held will be faster. Normally prefetching is froundupon, particularly for microbenchmarks, however in the particular case the prefetched pointer will always be used.
[Commit 0a5f4e5b4562, 97334162e4d7]


mm: lockless list_lru_count_one()

During the reclaiming slab of a memcg, shrink_slab() iterates over all registered shrinkers in the system, trying to count and consume objects related to the cgroup. In case of memory pressure, the operation was had a bottlenecking while trying to acquire the nlru->lock. By applying RCU to the data structure, the lookup can be done without taking the lock, which translates in the overall contention pretty much disappearing.
[Commit 0c7c1bed7e13]


memory hotplug optimizations

Such optimizations reduce the amount of times struct pages is traversed during a memory hotplug operation, from three to one. Among other benefits, the memory hotplug is made similar to the boot memory initialization path because it initializes struct pages only in one function. Finally, this improves memory hotplug performance because the cache is not being evicted several times and also reduce loop branching overhead.
[Commit d0dc12e86b31]


procfs: miscellaneous optimizations

Access to various files within procfs have been optimized by replacing calls to seq_printf() with lower cost alternatives. Changes show some performance benefits for ad-hoc microbenchmarks.


btrfs: relax barrier when unlocking an extent buffer

Serializing checks for active waitqueue requires a barrier as it can race with  the waiter side. Such is the case with btrfs_tree_unlock(), which was abusing the barrier semantics on architectures where atomic operations are ordered, such as x86. A performance improvement is immediately noticeable by optimizing barrier usage while maintaining the necessary semantics.
[Commit 2e32ef87b074]


x86/pti: leave kernel text global for no PCID

From the patch: Global pages are bad for hardening because they potentially let an exploit read the kernel image via a Meltdown-style attack. But, global pages are good for performance because they reduce TLB misses when making user/kernel transitions, especially when PCIDs are not available, such as on older hardware, or where a hypervisor has disabled them for some reason.

This change implements a basic, sane policy: If PCIDs are available, only map a minimal amount of kernel text global. If no PCIDs, map all kernel text global. This translates into a considerable throughput increase on an lseek microbenhmark.
[Commit 8c06c7740d19]


lib/raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

This enhancement uses the vpermxor instruction to optimize the raid6 Q syndrome. This instruction was made available with POWER8, ISA version 2.07. It allows for both vperm and vxor instructions to be done in a single instruction. The benchmark results show a 35% speed increase over the best existing algorithm for powerpc (altivec).
[Commit 751ba79cc552]

June 05, 2018 02:51 PM

June 04, 2018

James Bottomley: Why Microsoft is a good steward for GitHub

There seems to be a lot of hysteria going on in various communities that depend on GitHub for their project hosting around the Microsoft acquisition (just look in the comments here and here).  Obviously a lot of social media ink will be expended on this, so I’d just like to explain why as a committed open source developer, I think this will actually be a good thing.

Firstly, it’s very important to remember that git may be open source, but GitHub isn’t: none of the scripts that run the service have much published source code at all.  It may be a closed source hosting infrastructure that a lot of open source projects rely on but that doesn’t make it open source itself.  So why is GitHub not open source?  Well, it all goes back to the business model.  Notwithstanding fantastic market valuations there are lots of companies that play in the open source ecosystem, like GitHub, which struggle to find a sustainable business model (or even revenue).  This leads to a lot of open closed/open type models like GitHub (the reason GitHub keeps the code closed is so they can sell it to other companies for internal source management) or Docker Enterprise.

Secondly, even if GitHub were fully open source, as I’ve argued in my essays about the GPL, to trust a corporate player in the ecosystem, you need to be able to understand fully its business motivation for being there and verify the business goals align with the community ones.  As long as the business motivation is transparent and aligned with the community, you know you can trust it.  However, most of the new supposedly “open source” companies don’t have clear business models at all, which means their business motivation is anything but transparent.  Paradoxically this means that most of the new corporate idols in the open source ecosystem are remarkably untrustworthy because their business model changes from week to week as they struggle to please their venture capitalist overlords.  There’s no way you can get the transparency necessary for open source trust if the company itself doesn’t know what its business model will be next week.

Finally, this means that companies with well established open source business models and motivations that don’t depend on the whims of VCs are much more trustworthy in open source in the long term.  Although it’s a fairly recent convert, Microsoft is now among these because it’s clearly visible how its conversion from desktop to cloud both requires open source and requires Microsoft to play nicely with open source.  The fact that it has a trust deficit from past actions is a bonus because from the corporate point of view it has to be extra vigilant in maintaining its open source credentials.  The clinching factor is that GitHub is now ancillary to Microsoft’s open source strategy, not its sole means of revenue, so lots of previous less community oriented decisions, like keeping the GitHub code closed source, can be revisited in time as Microsoft seeks to gain community trust.

For the record, I should point out that although I have a github account, I host all my code on kernel.org mostly because the GitHub workflow really annoys me, having spent a lot of time trying to deduce commit motivations in a sparse git commit messages which then require delving into github issues and pull requests only to work out that most of the necessary details are in some private slack back channel well away from public view.  Regardless of who owns GitHub, I don’t see this workflow problem changing any time soon, so I’ll be sticking to my current hosting setup.

June 04, 2018 06:31 PM

June 02, 2018

Linux Plumbers Conference: Call For Participation in 2018 Linux Plumbers Conference!

Referred-track, microconference, and BoF proposals all welcome, see below!

Submissions close: September 2, 2018
Speakers notified: September 23, 2018
Slides due: November 9, 2018

Microconference slots often fill before the deadline (so don’t wait to submit yours!) but BoF submissions can come late.

Call for Refereed-Track Proposals
We are pleased to announce the Call for Refereed-Track Proposals for the 2018 edition of the Linux Plumbers Conference, which will held be in Vancouver, BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with Open Source Summit this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also provides a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the refereed-track talks and the Microconferences, but we never claimed that the world was perfect.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submisssion that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration. We also encourage submitters to consider BoF sessions and the unconference.

To submit a refereed track talk proposal follow the instructions at this website.

Please note that we have a completely different submission system than last year, so please do not let your muscle memory take over.

Submissions are due on or before Friday September 2, 2018 at noon Mountain Time. Since this is after the closure of early registration, speakers may register before this date and we’ll refund the registration for any selected presentation’s speaker, but for only one speaker per presentation.

Call for Microconference Proposals
We are pleased to announce the Call for Microconferences for the 2018 edition of the Linux Plumbers Conference, which will be held in Vancouver BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see this website.

Again, please note that we have a completely different submission system than last year, so please do not let your muscle memory take over. In particular, unlike last year, there is no wiki. So instead of creating an entry for you microconference on a wiki, you submit it using the above URL.

Call for Bird of a Feather (BoF) Session Proposals
Last, but by no means least, we are also pleased to announce a call for BoF sessions. These are free-form get-togethers for people wishing to discuss a particular topic. As always, you only need to submit proposals for BoFs you want to hold on-site. In contrast, and again as always, informal BoFs may be held at local drinking establishments or in the “hallway track” at your convenience.

June 02, 2018 07:54 AM

May 30, 2018

Paul E. Mc Kenney: Call For Participation in 2018 Linux Plumbers Conference!

Referred-track, microconference, and BoF proposals all welcome, see below!

Submissions close: September 2, 2018
Speakers notified: September 23, 2018
Slides due: November 9, 2018

Microconference slots often fill before the deadline (so don't wait to submit yours!) but BoF submissions can come late.

Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track Proposals for the 2018 edition of the Linux Plumbers Conference, which will held be in Vancouver, BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the "plumbing" in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with Open Source Summit this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also provides a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the refereed-track talks and the Microconferences, but we never claimed that the world was perfect.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submisssion that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration. We also encourage submitters to consider BoF sessions and the unconference.

To submit a refereed track talk proposal follow the instructions at this website.

Please note that we have a completely different submission system than last year, so please do not let your muscle memory take over.

Submissions are due on or before Friday September 2, 2018 at noon Mountain Time. Since this is after the closure of early registration, speakers may register before this date and we'll refund the registration for any selected presentation's speaker, but for only one speaker per presentation.

Call for Microconference Proposals

We are pleased to announce the Call for Microconferences for the 2018 edition of the Linux Plumbers Conference, which will be held in Vancouver BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see this website.

Again, please note that we have a completely different submission system than last year, so please do not let your muscle memory take over. In particular, unlike last year, there is no wiki. So instead of creating an entry for you microconference on a wiki, you submit it using the above URL.

Call for Bird of a Feather (BoF) Session Proposals

Last, but by no means least, we are also pleased to announce a call for BoF sessions. These are free-form get-togethers for people wishing to discuss a particular topic. As always, you only need to submit proposals for BoFs you want to hold on-site. In contrast, and again as always, informal BoFs may be held at local drinking establishments or in the “hallway track” at your convenience.

May 30, 2018 07:32 PM

May 29, 2018

Linux Plumbers Conference: New Linux Plumbers Conference web site

We invite you to check out the new web site for this year’s Plumbers Conference at linuxplumbersconf.org. It is based on the Indico conference-management web application that comes from CERN in Switzerland.  The LPC site will be used for submissions, acceptance, and scheduling of refereed talks, BoFs, and microconferences. Stay tuned for instructions on all of that. We are still working out the kinks, so please be gentle — but do try it out and report any problems you run into to contact@linuxplumbersconf.org.

May 29, 2018 10:40 PM

Pete Zaitcev: Advogato.org is gone

The domain now redirects to Wayback Machine. The last captured post is the official farewell message from June 22.

Personally, I would prefer people hack upon trust metrics than blockchain. But they did not agree, and personally I've done nothing to advance the field, so I don't have room to complain. And now the flagship open-source implementation is no more (well, of course Google still exists and so the trust metrics stay with us; possibly even get developed further).

May 29, 2018 08:14 PM

May 25, 2018

Gustavo F. Padovan: CFP for linuxdev-br conference extended until 7th of June

We already received some great talk proposals for this year’s event, but to bring in even more good content to our attendees we are extending the Call for Presentation until the 7th of June.

Linux Developer Conference Brazil – linuxdev-br for short – aims to be a meeting point for the worldwide Linux development community. We are looking for talks that deal with the most recent as well as the most relevant topics in FOSS development, including but not limited to kernel and drivers, bootloaders, networking and protocols, containers and virtualization, security, IoT, industry challenges and more. No matter what your background or level is, come share your views with the FOSS community at large.

Details on the topics accepted and how to submit can be found at the conference’s CFP page. Submit your talk now!

The post CFP for linuxdev-br conference extended until 7th of June appeared first on Gustavo Padovan.

May 25, 2018 01:19 AM

May 23, 2018

Pete Zaitcev: Jack Baruth on the agile development

As seen at a blog about cars:

Every software shop from Hyderabad to Cleveland now faithfully, and idiotically, replicates a cargo-cult version of the “standups” and “kanban methods” that were designed to work on a factory floor.

The “standups” are particularly miserable: Toyota’s version was best understood as a five-minute meeting where any potential issues in a given assembly-line department would be sorted out before the shift began, but under the corrupting influence of IBM, Accenture, and other “body shops,” the concept has degenerated into a 45-minute hellscape of offshore “engineers” mumbling a list of their miniature accomplishments out of a speakerphone while everybody else shifts from leg to leg and attempts not to fall asleep.

Shit, man. If even a pro racer turned autojourno can tell, we in software are past the point of ridiculous. That said, morning assembly is nothing new - it was a thing in the 1950s, long before Toyota. It even had native names: in Russia it was called "lineyka", in Japan it was "cho~rei".

May 23, 2018 06:49 PM

May 17, 2018

Pete Zaitcev: Amazon AI plz

Not being a native speaker, I get amusing results sometimes when searching on Amazon. For example, "floor scoop" brings up mostly fancy dresses. Apparently, a scoop is a type of dress, which can be floor-length, and so. The correct request is actually "dust pan". Today though, searching for "Peliton termite" ended with a bunch of bicycle saddles. Apparently, Amazon force-replaced it with "peloton", and I know of no syntax to force my spelling. I suspect that Peliton may have trouble selling their products at Amazon. This sort of thing is making me wary of Alexa. I don't see myself ever winning an argument with a robot who knows better, and is implemented in proprietary software that I cannot adjust.

UPDATE: The "plus prefix" works, e.g. "+peliton" (thanks to elisteran).

May 17, 2018 05:48 PM

May 09, 2018

Pete Zaitcev: The space-based ADS-B

Today, I want to build a satellite that receives ADS-B signals from airplanes over the open ocean, far away from land. With a decent receiver and a simple antenna, it should be possible on a gravity-stabilized cubesat. I know about terrestrial receivers picking signals 200..300 km out, surely with care one can do better. But I highly doubt that it's possible to finance such a toy — unless someone has already done that. I know that people somehow manage to finance AIS receivers, which are basically the same thing, only for ships. How do they do that?

UPDATE: Reportedly, hosted payloads by Aireon on Iridium NEXT satellites do ADS-B. The working altitude of the previous generation of Iridium was 780 km, the NEXT is probably the same.

May 09, 2018 03:27 AM

May 07, 2018

Davidlohr Bueso: Linux v4.16: Performance Goodies


Linux v4.16 was released a few weeks ago and continues the mitigation of meltdown and spectre bugs for x86-64, as well as for arm64 and IBM s390. While v4.16 is not the most exciting kernel version in terms of performance and scalability, the following is an unsorted and incomplete list of changes that went in which I have cherry-picked. As always, the term 'performance' can be vague in that some gains in one area can negatively affect another so take everything with a grain of salt.

sched: reduce migrations and spreading of load to multiple CPUs

The scheduler decisions are biased towards reducing latency of searches but tends to spread load across an entire socket, unnecessarily. On low CPU usage, this means the load on each individual CPU is low which can be good but cpufreq decides that utilization on individual CPUs is too low to increase P-state and overall throughput suffers.

When a cpufreq driver is completely under the control of the OS, it can be compensated for. For example, intel_pstate can decide to boost apparent cpu utilization if a task recently slept on a CPU for idle. However, if hardware-based cpufreq is in play (e.g. hardware P-states HWP) then very poor decisions can be made and the OS cannot do much about it. This only gets worse as HWP becomes more prevalent, sockets get larger and the p-state for individual cores can be controlled. Just setting the performance governor is not an answer given that plenty of people really do worry about power utilization and still want a reasonable balance between performance and power. Experiments show performance benefits for network benchmarks running on localhost (at ~10% on netperf RR for UDP and TCP, depending on the machine). Hackbench also has some small improvements with ~6-11%, depending on machine and thread count.
[Commit 89a55f56fd1c, 3b76c4a33959, 806486c377e3, 32e839dda3ba]


printk: new locking scheme

Problems around the kernel's printk() call aren't new and traditionally must overcome issues with the console lock. Considering that the kernel printing out to the console is very generic operation which can be called from virtually anywhere at any time, relying on any sort of lock can cause deadlocks. Similarly, the call to printk() must proceed regardless of the availability of the console lock. As such, what would happen is that upon contention, the task buffers the output for the console lock owner to flush as when it releases the lock.

On large multi-core systems this scheme can lead to the console owner to pile up a lot unbound work before it can release the lock, triggering watchdog lockups. This was replaced with a new mechanism that, upon contention, the task will not delay the work to the console lock owner and return, but it'll stay around spinning until it is available. The heuristics imply a console owner and waiter such that if multiple CPUs are generating output, the console lock will circulate between them, and none will end up printing output for too long.
[Commit dbdda842fe96]

idr tree optimizations

With the extensions and improvements of the ID allocation API, there is a performance enhancement for ID numbering schemes that don't start at 0; which, according to the patch, accounts for ~20% of all the kernel users. So by using the new idr functions with the _base() suffix users can immediately benefit from unnecessary iterations in the underlying radix tree.
[Commit 6ce711f27500]

 arm64: 52-bit physical address support

With ARMv8.2 the physical address space is extended from 48 to 52-bit, thus tasks are now able to address up to 4 pebibytes (PiB).
[Commit fa2a8445b1d3, 193383043f14, 529c4b05a3cb, 787fd1d019b2]

May 07, 2018 05:53 PM

April 30, 2018

Michael Kerrisk (manpages): man-pages-4.16 is released

I've released man-pages-4.16. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 29 contributors. Somewhat more than 160 commits changed around 60 pages. A summary of the changes can be found here.

April 30, 2018 07:28 PM

April 29, 2018

Pavel Machek: Crazy complexity

Its crazy how complex PCs have become. ARMs are not exactly simple with the TrustZone and similar stuff, but.. this is crazy. If you thought you understand x86 architecture... this is likely to prove you wrong. There's now non-x86 cpu inside x86 that performs a lot of rather critical functions...
https://eprint.iacr.org/2016/086.pdf
...and shows that SGX indeed is evil.

April 29, 2018 08:52 PM

Pavel Machek: Microsoft sabotaging someone else's computers

My father got himself in a nice trap: he let his Lenovo notebook to update to Windows 10. Hard to blame him, as user interface was confusing on purpose.Now 2 out of 3 USB ports are non-functional (USB 2 port works USB 3 ports don't), and there's no way to fix that. And apparently, Microsoft knew about the problem. Congratulations, Microsoft...

Ouch and they are also sending people to jail for producing CDs neccessary to use licenses they already sold. Microsoft still is evil.

April 29, 2018 08:50 PM

Pavel Machek: O2 attacking their own customers

Just because you are paying for internet service does not mean O2 will not try to replace web-pages with advertising. Ouch. Seems like everyone needs to use https, we need better network-neutrality laws, and probably also class-action lawsuits.

April 29, 2018 08:45 PM

Pavel Machek: Dark design patterns

Got Jolla installed. Ok, it looks cool. But already some unnice things can be seen. You _need_ jolla account to install apps. You need to agree to nasty legaleese. You are asked for name and password, it looks like that's all, and then it wants to know real name, email address, birthday... Appstore looks cool... but does not list licenses for software being installed. Still better than Android. Miles away from Debian.

It also seems to require login separate from app store login to get the "really" interesting stuff. Unfortunately, I don't know how to get that one.
I'd quite like to get python/gtk to work on Jolla (or maybe Android). If someone knows how to do that, I'd like to know. But I guess running Maemo Leste is easier at the moment.

April 29, 2018 08:43 PM

Pavel Machek: Motorola Droid 4 is now usable

23.4.2018, around 12:34... I realized how unix ttys are sabotaging my attempts to send SMS.. and solved it. So now I have Motorola Droid 4, running 4.17-rc1 kernel, with voice calls working, SMSes, data connection, GPS working and have some basic GUIs to control the stuff. WIFI works. Screen locks, and keyboard map still could be improved. Battery life will probably will not be great. But hey, its a start -- I have GNU/Linux working on a cellphone. More precisely Maemo Leste, based on Devuan, based on Debian. Sure, some kernel patches are still needed, and there's a lot more work to do in userland... Today, Microsoft sold out last Windows Mobile phones. I guess that's just a coincidence.

April 29, 2018 08:40 PM

April 23, 2018

Pete Zaitcev: Azure Sphere

Oh Microsoft, you card:

[Azure Sphere OS] combines security innovations pioneered in Windows, a security monitor, and a custom Linux kernel [...]</p>

Kinda like Oracle shipping "Unbreakable Linux". Still in the "embrace" phase.

April 23, 2018 06:37 PM

Daniel Vetter: Linux Kernel Maintainer Statistics

As part of preparing my last two talks at LCA on the kernel community, “Burning Down the Castle” and “Maintainers Don’t Scale”, I have looked into how the Kernel’s maintainer structure can be measured. One very interesting approach is looking at the pull request flows, for example done in the LWN article “How 4.4’s patches got to the mainline”. Note that in the linux kernel process, pull requests are only used to submit development from entire subsystems, not individual contributions. What I’m trying to work out here isn’t so much the overall patch flow, but focusing on how maintainers work, and how that’s different in different subsystems.

Methodology

In my presentations I claimed that the kernel community is suffering from too steep hierarchies. And worse, the people in power don’t bother to apply the same rules to themselves as anyone else, especially around purported quality enforcement tools like code reviews.

For our purposes a contributor is someone who submits a patch to a mailing list, but needs a maintainer to apply it for them, to get the patch merged. A maintainer on the other hand can directly apply a patch to a subsystem tree, and will then send pull requests up the maintainer hierarchy until the patch lands in Linus’ tree. This is relatively easy to measure accurately in git: If the recorded patch author and committer match, it’s a maintainer self-commit, if they don’t match it’s a contributor commit.

There’s a few annoying special cases to handle:

Also note that this is a property of each commit - the same person can be both a maintainer and a contributor, depending upon how each of their patches gets merged.

The ratio of maintainer self-commits compared to overall commits then gives us a crude, but fairly useful metric to measure how steep the kernel community overall is organized.

Measuring review is much harder. For contributor commits review is not recorded consistently. Many maintainers forgo adding an explicit Reviewed-by tag since they’re adding their own Signed-off-by tag anyway. And since that’s required for all contributor commits, it’s impossible to tell whether a patch has seen formal review before merging. A reasonable assumption though is that maintainers actually look at stuff before applying. For a minimal definition of review, “a second person looked at the patch before merging and deemed the patch a good idea” we can assume that merged contributor patches have a review ratio of 100%. Whether that’s a full formal review or not can unfortunately not be measured with the available data.

A different story is maintainer self-commits - if there is no tag indicating review by someone else, then either it didn’t happen, or the maintainer felt it’s not important enough work to justify the minimal effort to record it. Either way, a patch where the git author and committer match, and which sports no review tags in the commit message, strongly suggests it has indeed seen none.

An objection would be that these patches get reviewed by the next maintainer up, when the pull request gets merged. But there’s well over a thousand such patches each kernel release, and most of the pull requests containing them go directly to Linus in the 2 week long merge window, when the over 10k feature patches of each kernel release land in the mainline branch. It is unrealistic to assume that Linus carefully reviews hundreds of patches himself in just those 2 weeks, while getting hammered by pull requests all around. Similar considerations apply at a subsystem level.

For counting reviews I looked at anything that indicates some kind of patch review, even very informal ones, to stay consistent with the implied oversight the maintainer’s Signed-off-by line provides for merged contributor patches. I therefore included both Reviewed-by and Acked-by tags, including a plethora of misspelled and combined versions of the same.

The scripts also keep track of how pull requests percolate up the hierarchy, which allows filtering on a per-subsystem level. Commits in topic branches are accounted to the subsystem that first lands in Linus’ tree. That’s fairly arbitrary, but simplest to implement.

Last few years of GPU subsystem history

Since I’ve pitched the GPU subsystem against the kernel at large in my recent talks, let’s first look at what things look like in graphics:

GPU maintainer commit statistics Fig. 1 GPU total commits, maintainer self-commits and reviewed maintainer self-commits
GPU relative maintainer commit statistics Fig. 2 GPU percentage maintainer self-commits and reviewed maintainer self-commits

In absolute numbers it’s clear that graphics has grown tremendously over the past few years. Much faster than the kernel at large. Depending upon the metric you pick, the GPU subsystem has grown from being 3% of the kernel to about 10% and now trading spots for 2nd largest subsystem with arm-soc and staging (depending who’s got a big pull for that release).

Maintainer commits keep up with GPU subsystem growth

The relative numbers have a different story. First, commit rights and the fairly big roll out of group maintainership we’ve done in the past 2 years aren’t extreme by historical graphics subsystem standards. We’ve always had around 30-40% maintainer self-commits. There’s a bit of a downward trend in the years leading towards v4.4, due to the massive growth of the i915 driver, and our failure to add more maintainers and committers for a few releases. Adding lots more committers and creating bigger maintainer groups from v4.5 on forward, first for the i915 driver, then to cope with the influx of new small drivers, brought us back to the historical trend line.

There’s another dip happening in the last few kernels, due to AMD bringing in a big new team of contributors to upstream. v4.15 was even more pronounced, in that release the entirely rewritten DC display driver for AMD GPUs landed. The AMD team is already using a committer model for their staging and internal trees, but not (yet) committing directly to their upstream branch. There’s a few process holdups, mostly around the CI flow, that need to be fixed first. As soon as that’s done I expect this recent dip will again be over.

In short, even when facing big growth like the GPU subsystem has, it’s very much doable to keep training new maintainers to keep up with the increased demand.

Review of maintainer self-commits established in the GPU subsystem

Looking at relative changes in how consistently maintainer self-commits are reviewed, there’s a clear growth from mostly no review to 80+% of all maintainer self-commits having seen some formal oversight. We didn’t just keep up with the growth, but scaled faster and managed to make review a standard practice. Most of the drivers, and all the core code, are now consistently reviewed. Even for tiny drivers with small to single person teams we’ve managed to pull this off, through combining them into larger teams run with a group maintainership model.

Last few years of kernel w/o GPU history

kernel w/o GPU maintainer commit statistics Fig. 3 kernel w/o GPU maintainer self-commits and reviewed maintainer self-commits
kernel w/o GPU relative maintainer commit statistics Fig. 4 kernel w/o GPU percentage maintainer self-commits and reviewed maintainer self-commits

Kernel w/o graphics is an entirely different story. Overall, review is much less a thing that happens, with only about 30% of all maintainer self-commits having any indication of oversight. The low ratio of maintainer self-commits is why I removed the total commit number from the absolute graph - it would have dwarfed the much more interesting data on self-commits and reviewed self-commits. The positive thing is that there’s at least a consistent, if very small upward trend in maintainer self-commit reviews, both in absolute and relative numbers. But it’s very slow, and will likely take decades until there’s no longer a double standard on review between contributors and maintainers.

Maintainers are not keeping up with the kernel growth overall

Much more worrying is the trend on maintainer self-commits. Both in absolute, and much more in relative numbers, there’s a clear downward trend, going from around 25% to below 15%. This indicates that the kernel community fails to mentor and train new maintainers at a pace sufficient to keep up with growth. Current maintainers are ever more overloaded, leaving ever less time for them to write patches of their own and get them merged.

Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged and not contributing anything of their own. The kernel community imploding under its own bureaucratic weight being the likely outcome of that.

This is a huge contrast to the “everything is getting better, bigger, and the kernel community is very healthy” fanfare touted at keynotes and the yearly kernel report. In my opinion, the kernel community is very much not looking like it is coping with its growth well and an overall healthy community. Even when ignoring all the issues around conduct that I’ve raised.

It is also a huge contrast to what we’ve experienced in the GPU subsystem since aggressively rolling out group maintainership starting with the v4.5 release; by spreading the bureaucratic side of applying patches over many more people, maintainers have much more time to create their own patches and get them merged. More crucially, experienced maintainers can focus their limited review bandwidth on the big architectural design questions since they won’t get bogged down in the minutiae of every single simple patch.

4.16 by subsystem

Let’s zoom into how this all looks at a subsystem level, looking at just the recently released 4.16 kernel.

Most subsystems have unsustainable maintainer ratios

Trying to come up with a reasonable list of subsystems that have high maintainer commit ratios is tricky; some rather substantial pull requests are essentially just maintainers submitting their own work, giving them an easy 100% score. But of course that’s just an outlier in the larger scope of the kernel overall having a maintainer self-commit ratio of just 15%. To get a more interesting list of subsystems we need to look at only those with a group of regular contributors and more than just 1 maintainer. A fairly arbitrary cut-off of 200 commits or more in total seems to get us there, yielding the following top ten list:

subsystem total commits maintainer self-commits maintainer ratio
GPU 1683 614 36%
KVM 257 91 35%
arm-soc 885 259 29%
linux-media 422 111 26%
tip (x86, core, …) 792 125 16%
linux-pm 201 31 15%
staging 650 61 9%
linux-block 249 20 8%
sound 351 26 7%
powerpc 235 16 7%

In short there’s very few places where it’s easier to become a maintainer than in the already rather low, roughly 15%, the kernel scores overall. Outside of these few subsystems, the only realistic way is to create a new subsystem, somehow get it merged, and become its maintainer. In most subsystems being a maintainer is an elite status, and the historical trends suggest it will only become more so. If this trend isn’t reversed, then maintainer overload will get a lot worse in the coming years.

Of course subsystem maintainers are expected to spend more time reviewing and managing other people’s contribution. When looking at individual maintainers it would be natural to expect a slow decline in their own contributions in patch form, and hence a decline in self-commits. But below them a new set of maintainers should grow and receive mentoring, and those more junior maintainers would focus more on their own work. That sustainable maintainer pipeline seems to not be present in many kernel subsystems, drawing a bleak future for them.

Much more interesting is the review statistics, split up by subsystem. Again we need a cut-off for noise and outliers. The big outliers here are all the pull requests and trees that have seen zero review, not even any Acked-by tags. As long as we only look at positive examples we don’t need to worry about those. A rather low cut-off of at least 10 maintainer self-commits takes care of other random noise:

subsystem total commits maintainer self-commits maintainer review ratio
f2fs 72 12 100%
XFS 105 78 100%
arm64 166 23 91%
GPU 1683 614 83%
linux-mtd 99 12 75%
KVM 257 91 74%
linux-pm 201 31 71%
pci 145 37 65%
remoteproc 19 14 64%
clk 139 14 64%
dma-mapping 63 60 60%

Yes, XFS and f2fs have their shit together. More interesting is how wide the spread in the filesystem code is; there’s a bunch of substantial fs pulls with a review ratio of flat out zero. Not even a single Acked-by. XFS on the other hand insists on full formal review of everything - I spot checked the history a bit. f2fs is a bit of an outlier with 4.16, barely getting above the cut-off. Usually it has fewer patches and would have been excluded.

Everyone not in the top ten taken together has a review ratio of 27%.

Review double standards in many big subsystems

Looking at the big subsystems with multiple maintainers and huge groups of contributors - I picked 500 patches as the cut-off - there’s some really low review ratios: Staging has 7%, networking 9% and tip scores 10%. Only arm-soc is close to the top ten, with 50%, at the 14th position.

Staging having no standard is kinda the point, but the other core subsystems eschewing review is rather worrisome. More than 9 out of 10 maintainer self-commits merged into these core subsystem do not carry any indication that anyone else ever looked at the patch and deemed it a good idea. The only other subsystem with more than 500 commits is the GPU subsystem, at 4th position with a 83% review ratio.

Compared to maintainers overall the review situation is looking a lot less bleak. There’s a sizeable group of subsystems who at least try to make this work, by having similar review criteria for maintainer self-commits than normal contributors. This is also supported by the rather slow, but steady overall increase of reviews when looking at historical trend.

But there’s clearly other subsystems where review only seems to be a gauntlet inflicted on normal contributors, entirely optional for maintainers themselves. Contributors cannot avoid review, because they can’t commit their own patches. When maintainers outright ignore review for most of their patches this creates a clear double standard between maintainers and mere contributors.

One year ago I wrote “Review, not Rocket Science” on how to roll out review in your subsystem. Looking at this data here I can close with an even shorter version:

What would Dave Chinner do?

Thanks a lot to Daniel Stone, Dave Chinner, Eric Anholt, Geoffrey Huntley, Luce Carter and Sean Paul for reading and commenting on drafts of this article.

April 23, 2018 12:00 AM

April 20, 2018

Kees Cook: UEFI booting and RAID1

I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.

With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running BOOT/EFI/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.

So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.

The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1 mdadm: Note: this array has metadata at the start and may not be suitable as a boot device. If you plan to store '/boot' on this device please ensure that your boot-loader understands md/v1.x metadata, or use --metadata=0.90

Reading from the mdadm man page:

-e, --metadata= ... 1, 1.0, 1.1, 1.2 default Use the new version-1 format superblock. This has fewer restrictions. It can easily be moved between hosts with different endian-ness, and a recovery operation can be checkpointed and restarted. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred 1.x format). "default" is equivalent to "1.2".

First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:

# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00001000 fc 4e 2b a9 01 00 00 00 00 00 00 00 00 00 00 00 |.N+.............| ... # file -s /dev/sda1 /dev/sda1: Linux Software RAID version 1.2 ...

So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1 ... # mkfs.fat -F32 /dev/md0 # dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd 00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac FAT32 ...w|. # file -s /dev/sda1 /dev/sda1: ... FAT (32 bit)

Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.

However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, it returns nothing, and tries to run efibootmgr with an empty -d argument:

Installing for x86_64-efi platform. efibootmgr: option requires an argument -- 'd' ... grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted. Failed: grub-install --target=x86_64-efi WARNING: Bootloader is not properly installed, system may not be bootable

Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64

So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).

To deal with this “external write” situation, I see some solutions:

Since mdadm has the “--update=resync” assembly option, I can actually do the latter option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:

ARRAY <ignore> metadata=1.0 UUID=123...

(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:

/dev/md100 /boot/efi vfat noauto,defaults 0 0

And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:

[Unit] Description=Resync /boot/efi RAID DefaultDependencies=no After=local-fs.target [Service] Type=oneshot ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync ExecStart=/bin/mount /boot/efi RemainAfterExit=yes [Install] WantedBy=sysinit.target

(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /dev/mdadm/mdadm.conf.)

If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!

And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:

# truncate --size 512M /boot/efi.img # losetup -f --show /boot/efi.img /dev/loop0 # mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1

And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

April 20, 2018 12:34 AM

April 16, 2018

Pete Zaitcev: Suddenly Liferea tonight

Liferea irritated me for many years with a strange behavior when dragging a subscription. You mouse down on the feed, it becomes selected — so far so good. Then you drag it somewhere — possibly far off screen, making the view scroll — then drop it. Drops fine, updates the DB, model, and the view fine. But! The selection then jumps to a completely random feed somewhere.

Well, it's not actually random. What happens instead, the GtkTreeView implements DnD by removing a row, then re-inserting it. When a selected row is removed, obviously the selection has to disappear, but instead it's set to the next row after the removed one. I suppose I may be uniquely vulnerable to this because I have 300+ feeds and I drag them around all the time. If Liferea weren't kind enough to remember the preferred order, this would not matter so much.

I meant to fix this for a long time, but somehow a wrong information got stuck in my head: I thought that Liferea was written in C++, so it took years to gather the motivation. Imagine my surprise when I found plain old C. I spent a good chunk of Sunday figuring out GTK's tree view thingie, but in the end it was quite simple.

April 16, 2018 03:09 PM

April 13, 2018

Kees Cook: security things in Linux v4.16

Previously: v4.15.

Linux kernel v4.16 was released last week. I really should write these posts in advance, otherwise I get distracted by the merge window. Regardless, here are some of the security things I think are interesting:

KPTI on arm64

Will Deacon, Catalin Marinas, and several other folks brought Kernel Page Table Isolation (via CONFIG_UNMAP_KERNEL_AT_EL0) to arm64. While most ARMv8+ CPUs were not vulnerable to the primary Meltdown flaw, the Cortex-A75 does need KPTI to be safe from memory content leaks. It’s worth noting, though, that KPTI does protect other ARMv8+ CPU models from having privileged register contents exposed. So, whatever your threat model, it’s very nice to have this clean isolation between kernel and userspace page tables for all ARMv8+ CPUs.

hardened usercopy whitelisting
While whole-object bounds checking was implemented in CONFIG_HARDENED_USERCOPY already, David Windsor and I finished another part of the porting work of grsecurity’s PAX_USERCOPY protection: usercopy whitelisting. This further tightens the scope of slab allocations that can be copied to/from userspace. Now, instead of allowing all objects in slab memory to be copied, only the whitelisted areas (where a subsystem has specifically marked the memory region allowed) can be copied. For example, only the auxv array out of the larger mm_struct.

As mentioned in the first commit from the series, this reduces the scope of slab memory that could be copied out of the kernel in the face of a bug to under 15%. As can be seen, one area of work remaining are the kmalloc regions. Those are regularly used for copying things in and out of userspace, but they’re also used for small simple allocations that aren’t meant to be exposed to userspace. Working to separate these kmalloc users needs some careful auditing.

Total Slab Memory: 48074720 Usercopyable Memory: 6367532 13.2% task_struct 0.2% 4480/1630720 RAW 0.3% 300/96000 RAWv6 2.1% 1408/64768 ext4_inode_cache 3.0% 269760/8740224 dentry 11.1% 585984/5273856 mm_struct 29.1% 54912/188448 kmalloc-8 100.0% 24576/24576 kmalloc-16 100.0% 28672/28672 kmalloc-32 100.0% 81920/81920 kmalloc-192 100.0% 96768/96768 kmalloc-128 100.0% 143360/143360 names_cache 100.0% 163840/163840 kmalloc-64 100.0% 167936/167936 kmalloc-256 100.0% 339968/339968 kmalloc-512 100.0% 350720/350720 kmalloc-96 100.0% 455616/455616 kmalloc-8192 100.0% 655360/655360 kmalloc-1024 100.0% 812032/812032 kmalloc-4096 100.0% 819200/819200 kmalloc-2048 100.0% 1310720/1310720

This series took quite a while to land (you can see David’s original patch date as back in June of last year). Partly this was due to having to spend a lot of time researching the code paths so that each whitelist could be explained for commit logs, partly due to making various adjustments from maintainer feedback, and partly due to the short merge window in v4.15 (when it was originally proposed for merging) combined with some last-minute glitches that made Linus nervous. After baking in linux-next for almost two full development cycles, it finally landed. (Though be sure to disable CONFIG_HARDENED_USERCOPY_FALLBACK to gain enforcement of the whitelists — by default it only warns and falls back to the full-object checking.)

automatic stack-protector

While the stack-protector features of the kernel have existed for quite some time, it has never been enabled by default. This was mainly due to needing to evaluate compiler support for the feature, and Kconfig didn’t have a way to check the compiler features before offering CONFIG_* options. As a defense technology, the stack protector is pretty mature. Having it on by default would have greatly reduced the impact of things like the BlueBorne attack (CVE-2017-1000251), as fewer systems would have lacked the defense.

After spending quite a bit of time fighting with ancient compiler versions (*cough*GCC 4.4.4*cough*), I landed CONFIG_CC_STACKPROTECTOR_AUTO, which is default on, and tries to use the stack protector if it is available. The implementation of the solution, however, did not please Linus, though he allowed it to be merged. In the future, Kconfig will gain the knowledge to make better decisions which lets the kernel expose the availability of (the now default) stack protector directly in Kconfig, rather than depending on rather ugly Makefile hacks.

execute-only memory for PowerPC

Similar to the Protection Keys (pkeys) hardware support that landed in v4.6 for x86, Ram Pai landed pkeys support for Power7/8/9. This should expand the scope of what’s possible in the dynamic loader to avoid having arbitrary read flaws allow an exploit to read out all of executable memory in order to find ROP gadgets.

That’s it for now; let me know if you think I should add anything! The v4.17 merge window is open. :)

Edit: added details on ARM register leaks, thanks to Daniel Micay.

Edit: added section on protection keys for POWER, thanks to Florian Weimer.

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

April 13, 2018 12:04 AM

April 11, 2018

James Morris: Linux Security Summit North America 2018 CFP Announced

lss logo

The CFP for the 2018 Linux Security Summit North America (LSS-NA) is announced.

LSS will be held this year as two separate events, one in North America
(LSS-NA), and one in Europe (LSS-EU), to facilitate broader participation in
Linux Security development. Note that this CFP is for LSS-NA; a separate CFP
will be announced for LSS-EU in May. We encourage everyone to attend both
events.

LSS-NA 2018 will be held in Vancouver, Canada, co-located with the Open Source Summit.

The CFP closes on June 3rd and the event runs from 27th-28th August.

To make a CFP submission, click here.

April 11, 2018 11:29 PM

April 10, 2018

Linux Plumbers Conference: Welcome to the 2018 LPC blog

Planning for the 2018 Linux Plumbers Conference is well underway at this point. The planning committee will be posting various informational blurbs here, including information on hotels, microconference acceptance, evening events, scheduling, and so on. Next up will be a “call for proposals” that should appear soon.

LPC will be held at the Sheraton Vancouver Wall Center in Vancouver, British Columbia, Canada, November 13-15, colocated with the Linux Kernel Summit.

April 10, 2018 04:40 PM

April 06, 2018

Pete Zaitcev: With Blockchain Technology

Recently it became common to see a mocking of startup founders that add "blockchain" to something, then sell it to gullible VCs and reap the green harvest. Apparently it has become quite a thing. But now they went a step further.

The other day I was watching some anime at Crunchyroll, when a commercial came up. It pitched a fantasy sports site "with blockchain technology" and smart contracts. The remarkable part about it is, it wasn't aimed at investors. It was a consumer advertisement. Its creators apparently expect members of the public — who play fantasy sports, no less — to know that blockchain exists and think about it in positive terms.

April 06, 2018 05:19 AM

April 05, 2018

Matthew Garrett: Linux kernel lockdown and UEFI Secure Boot

David Howells recently published the latest version of his kernel lockdown patchset. This is intended to strengthen the boundary between root and the kernel by imposing additional restrictions that prevent root from modifying the kernel at runtime. It's not the first feature of this sort - /dev/mem no longer allows you to overwrite arbitrary kernel memory, and you can configure the kernel so only signed modules can be loaded. But the present state of things is that these security features can be easily circumvented (by using kexec to modify the kernel security policy, for instance).

Why do you want lockdown? If you've got a setup where you know that your system is booting a trustworthy kernel (you're running a system that does cryptographic verification of its boot chain, or you built and installed the kernel yourself, for instance) then you can trust the kernel to keep secrets safe from even root. But if root is able to modify the running kernel, that guarantee goes away. As a result, it makes sense to extend the security policy from the boot environment up to the running kernel - it's really just an extension of configuring the kernel to require signed modules.

The patchset itself isn't hugely conceptually controversial, although there's disagreement over the precise form of certain restrictions. But one patch has, because it associates whether or not lockdown is enabled with whether or not UEFI Secure Boot is enabled. There's some backstory that's important here.

Most kernel features get turned on or off by either build-time configuration or by passing arguments to the kernel at boot time. There's two ways that this patchset allows a bootloader to tell the kernel to enable lockdown mode - it can either pass the lockdown argument on the kernel command line, or it can set the secure_boot flag in the bootparams structure that's passed to the kernel. If you're running in an environment where you're able to verify the kernel before booting it (either through cryptographic validation of the kernel, or knowing that there's a secret tied to the TPM that will prevent the system booting if the kernel's been tampered with), you can turn on lockdown.

There's a catch on UEFI systems, though - you can build the kernel so that it looks like an EFI executable, and then run it directly from the firmware. The firmware doesn't know about Linux, so can't populate the bootparam structure, and there's no mechanism to enforce command lines so we can't rely on that either. The controversial patch simply adds a kernel configuration option that automatically enables lockdown when UEFI secure boot is enabled and otherwise leaves it up to the user to choose whether or not to turn it on.

Why do we want lockdown enabled when booting via UEFI secure boot? UEFI secure boot is designed to prevent the booting of any bootloaders that the owner of the system doesn't consider trustworthy[1]. But a bootloader is only software - the only thing that distinguishes it from, say, Firefox is that Firefox is running in user mode and has no direct access to the hardware. The kernel does have direct access to the hardware, and so there's no meaningful distinction between what grub can do and what the kernel can do. If you can run arbitrary code in the kernel then you can use the kernel to boot anything you want, which defeats the point of UEFI Secure Boot. Linux distributions don't want their kernels to be used to be used as part of an attack chain against other distributions or operating systems, so they enable lockdown (or equivalent functionality) for kernels booted this way.

So why not enable it everywhere? There's a couple of reasons. The first is that some of the features may break things people need - for instance, some strange embedded apps communicate with PCI devices by mmap()ing resources directly from sysfs[2]. This is blocked by lockdown, which would break them. Distributions would then have to ship an additional kernel that had lockdown disabled (it's not possible to just have a command line argument that disables it, because an attacker could simply pass that), and users would have to disable secure boot to boot that anyway. It's easier to just tie the two together.

The second is that it presents a promise of security that isn't really there if your system didn't verify the kernel. If an attacker can replace your bootloader or kernel then the ability to modify your kernel at runtime is less interesting - they can just wait for the next reboot. Appearing to give users safety assurances that are much less strong than they seem to be isn't good for keeping users safe.

So, what about people whose work is impacted by lockdown? Right now there's two ways to get stuff blocked by lockdown unblocked: either disable secure boot[3] (which will disable it until you enable secure boot again) or press alt-sysrq-x (which will disable it until the next boot). Discussion has suggested that having an additional secure variable that disables lockdown without disabling secure boot validation might be helpful, and it's not difficult to implement that so it'll probably happen.

Overall: the patchset isn't controversial, just the way it's integrated with UEFI secure boot. The reason it's integrated with UEFI secure boot is because that's the policy most distributions want, since the alternative is to enable it everywhere even when it doesn't provide real benefits but does provide additional support overhead. You can use it even if you're not using UEFI secure boot. We should have just called it securelevel.

[1] Of course, if the owner of a system isn't allowed to make that determination themselves, the same technology is restricting the freedom of the user. This is abhorrent, and sadly it's the default situation in many devices outside the PC ecosystem - most of them not using UEFI. But almost any security solution that aims to prevent malicious software from running can also be used to prevent any software from running, and the problem here is the people unwilling to provide that policy to users rather than the security features.
[2] This is how X.org used to work until the advent of kernel modesetting
[3] If your vendor doesn't provide a firmware option for this, run sudo mokutil --disable-validation

comment count unavailable comments

April 05, 2018 01:07 AM

April 04, 2018

Pete Zaitcev: Jim Whitehurst on OpenStack in 2018

Remarks of our CEO, as captured in an interview by TechCrunch:

The other major open-source project Red Hat is betting on is OpenStack . That may come as a bit of a surprise, given that popular opinion in the last year or so has shifted against the massive project that wants to give enterprises an open source on-premise alternative to AWS and other cloud providers. “There was a sense among big enterprise tech companies that OpenStack was going to be their savior from Amazon,” Whitehurst said. “But even OpenStack, flawlessly executed, put you where Amazon was five years ago. If you’re Cisco or HP or any of those big OEMs, you’ll say that OpenStack was a disappointment. But from our view as a software company, we are seeing good traction.”

He's over-simplifying things for the constraints of an interview: the last sencence needs unpacking. Why do you think that "traction" happens? Because OpenStack gives its users something that Amazon does not. For example, Swift isn't trying to match features of S3. Attempting to do that would cause the exact lag he's referring. Instead, Swift works to solve the problem of people who want to own their own data in general. So, it's mostly about the implementation: how to make it scalable, inexpensive, etc. And, of course, keeing it open source, preserving user's freedom to modify. This is why often you see people installing a truncated OpenStack that only has Swift. I'm sure this applies to other parts of OpenStack, in particular the SDN/NFV.

April 04, 2018 04:44 PM

April 03, 2018

Paul E. Mc Kenney: A Linux-kernel memory model!

A big “thank you” to all my partners in LKMM crime, most especially to Jade, Luc, Andrea, and Alan! Jade presented our paper (slides, supplementary material) at ASPLOS, which was well-received. A number of people asked how they could learn more about LKMM, which is what much of this blog post is about.

Approaches to learning LKMM include:


  1. Read the documentation, starting with explanation.txt. This documentation replaces most of the older LWN series.
  2. Go through Ted Cooper's coursework for Portland State University's CS510 Advanced Topics in Concurrency class, taught by Jon Walpole.
  3. Those interested in the history of LKMM might wish to look at my 2017 linux.conf.au presentation (video).
  4. Play with the actual model.
The first three options are straightforward, but playing with the model requires some installation. However, playing with the model is probably key to gaining a full understanding of LKMM, so this installation step is well worth the effort.

Installation instructions may be found here (see the “REQUIREMENTS” section). The ocaml language is a prerequisite, which is fortunately included in many Linux distros. If you choose to install ocaml from source (for example, because you need a more recent version), do yourself a favor and read the instructions completely before starting the build process! Otherwise, you will find yourself learning of the convenient one-step build process only after carrying out the laborious five-step process, which can be a bit frustrating.

Of course, if you come across better methods to quickly, easily, and thoroughly learn LKMM, please do not keep them a secret!

Those wanting a few rules of thumb safely approximating LKMM should look at slide 96 (PDF page 78) of the aforementioned linux.conf.au presentation. Please note that material earlier in the presentation is required to make sense of the three rules of thumb.

We also got some excellent questions during Jade's ASPLOS talk, mainly from the renowned and irrepressible Sarita Adve:There were of course a great many other excellent presentations at ASPLOS, but that is a topic for another post!

April 03, 2018 06:35 PM

April 02, 2018

Pete Zaitcev: Wayland versus Glib in Liferea on F27

I decided to build Liferea over the weekend, and the build crashes at the introspection phase.

Apparently, GTK+ programs are set up to introspect themselves: basically the binary can look at its own types or whatnot, then output the result. I'm not quite clear what the purpose of that is, the online docs imply that it's for API documentation mostly. Anyhow, the build runs the liferea binary itself, with arguments that make it run the introspection, then this happens:

(gdb) where
#0  0x00007fa90a2a93b0 in wl_list_insert_list ()
    at /lib64/libwayland-server.so.0
#1  0x00007fa90a2a4e6f in wl_priv_signal_emit ()
    at /lib64/libwayland-server.so.0
#2  0x00007fa90a2a5477 in wl_display_destroy ()
    at /lib64/libwayland-server.so.0
#3  0x00007fa916d163d9 in \
  WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() () at \
  /lib64/libwebkit2gtk-4.0.so.37
#4  0x00007fa916d163e9 in \
  WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() () at \
  /lib64/libwebkit2gtk-4.0.so.37
#5  0x00007fa91100cb58 in __run_exit_handlers () at /lib64/libc.so.6
#6  0x00007fa91100cbaa in  () at /lib64/libc.so.6
#7  0x00007fa911e9d367 in  () at /lib64/libgirepository-1.0.so.1
#8  0x00007fa91197d188 in parse_arg.isra () at /lib64/libglib-2.0.so.0
#9  0x00007fa91197d8ca in parse_long_option () at /lib64/libglib-2.0.so.0
#10 0x00007fa91197f2d6 in g_option_context_parse () at \
  /lib64/libglib-2.0.so.0
#11 0x00007fa91197fd84 in g_option_context_parse_strv ()
    at /lib64/libglib-2.0.so.0
#12 0x00007fa912164558 in g_application_real_local_command_line ()
    at /lib64/libgio-2.0.so.0
#13 0x00007fa912164bf6 in g_application_run () at /lib64/libgio-2.0.so.0
#14 0x000000000041b9ff in main (argc=2, argv=0x7fff2e1203d8) at main.c:77

As much as I can tell, despite being asked only to do the introspection, Liferea (unknowingly, through GTK+) pokes Wayland, which sets exit handlers. However, Wayland is never used (introspection, duh), and not initialized completely, so when its exit handlers run, it crashes.

Well, now what?

I supplse the cleanest approach might be to modify Glib so it avoids provoking Wayland when merely introspecting. But honestly I have no clue about desktop apps and do not know where to even start looking.

UPDATE: Much thanks to Branko Grubic, who pointed me to a bug in WebKit. Currently building with this as a workaround:

--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -82,6 +82,7 @@ INTROSPECTION_GIRS = Liferea-3.0.gir
 
 Liferea-3.0.gir: liferea$(EXEEXT)
 INTROSPECTION_SCANNER_ARGS = -I$(top_srcdir)/src --warn-all -......
+INTROSPECTION_SCANNER_ENV = WEBKIT_DISABLE_COMPOSITING_MODE=1
 Liferea_3_0_gir_NAMESPACE = Liferea
 Liferea_3_0_gir_VERSION = 3.0
 Liferea_3_0_gir_PROGRAM = $(builddir)/liferea$(EXEEXT)

April 02, 2018 05:59 PM

March 20, 2018

Davidlohr Bueso: Linux v4.15: Performance Goodies

With the Meltdown and Spectre fiascos, performance isn't a very hot topic at the moment. In fact, with Linux v4.15 released, it is one of the rare times I've seen security win over performance in such a one sided way. Normally security features are tucked away under a kernel config option nobody really uses. Of course the software fixes are also backported in one way or another, so this isn't really specific to the latest kernel release.

All this said, v4.15 came out with a few performance enhancements across subsystems. The following is an unsorted and incomplete list of changes that went in. Note that the term 'performance' can be vague in that some gains in one area can negatively affect another, so take everything with a grain of salt and reach your own conclusions.

epoll: scale nested calls

Nested epolls are necessary to allow semantics where a file descriptor in the epoll interested-list is also an epoll instance. Such calls are not all that common, but some real world applications suffered severe performance issues in that it relied on global spinlocks, acquired throughout the callbacks in the epoll state machine. By removing them, we can speed up adding fds to the instance as well as polling, such that epoll_wait() can improve by 100x, scaling linearly when increasing amounts of cores block an an event.
[Commit 57a173bdf5ba,  37b5e5212a44]


pvspinlock: hybrid fairness paravirt semantics

Locking under virtual environments can be tricky, balancing performance and fairness while avoiding artifacts such as starvation and lock holder/waiter preemption. The current paravirtual queued spinlocks, while free from starvation, can perform less optimally than an unfair lock in guests with CPU over-commitment. With Linux v4.15, guest spinlocks now combine the best of both worlds, with an unfair and a queued mode. The idea is that, upon contention, extend the lock stealing attempt in the slowpath (unfair mode) as long as there are queued MCS waiters present, hence improving performance while avoiding starvation. Kernel build experiments show that as a VM becomes more and more over-committed, the ratio of locks acquired in unfair mode increases.
[Commit 11752adb68a3]


mm,x86: avoid saving/restoring interrupts state in gup

When x86 was converted to use the generic get_user_pages_fast() call a performance regression was introduced at a microbenchmark level. The generic gup function attempts to walk the page tables without acquiring any locks, such as the mmap semaphore. In order to do this, interrupts must be disabled, which is where things went different between the arch-specific and generic flavors. The later must save and restore the current state of interrupt, introducing extra overhead when compared to a simple local_irq_enable/disable().
[Commit 5b65c4677a57]



ipc: scale INFO commands

Any syscall used to get info from sysvipc (such as semctl(IPC_INFO) or shmctl(SHM_INFO)) requires internally computing the last ipc identifier. For cases with large amounts of keys, this operation alone can consume a large amount of cycles as it looked up on-demand, in O(N). In order to make this information available in constant time, we keep track of it whenever a new identifier is added.
[Commit 15df03c87983]



ext4:  improve smp scalability for inode generation

The superblock's inode generation number was currently sequentially increased (from a randomly initialized value) and protected by a spinlock, making the usage pattern quite primitive and not very friendly to workloads that are generating files/inodes concurrently. The inode generation path was optimized to remove the lock altogether and simply rely on prandom_u32() such that a fast/seeded pseudo random-number algorithm is used for computing the i_generation.
[Commit 232530680290]

March 20, 2018 05:37 PM

March 15, 2018

Pete Zaitcev: The more you tighten your grip

Seen at the webpage for RancherOS:

Everything in RancherOS is a Docker container. We accomplish this by launching two instances of Docker. One is what we call System Docker, the first process on the system. All other system services, like ntpd, syslog, and console, are running in Docker containers. System Docker replaces traditional init systems like systemd, and can be used to launch additional system services.

March 15, 2018 10:33 PM

March 13, 2018

Pete Zaitcev: You Are Not Uber: Only Uber Are Uber

Remember how FAA shut down the business of NavWorx, with heavy monetary and loss-of-use consequences for its customers? Imagine receiving a letter from U.S. Government telling you that your car is not compatible with roads, and therefore you are prohibited from continuing to drive it. Someone sure forgot that the power to regulate is the power to destroy. This week, we have this report by IEEE Spectrum:

IEEE Spectrum can reveal that the SpaceBees are almost certainly the first spacecraft from a Silicon Valley startup called Swarm Technologies, currently still in stealth mode. Swarm was founded in 2016 by one engineer who developed a spacecraft concept for Google and another who sold his previous company to Apple. The SpaceBees were built as technology demonstrators for a new space-based Internet of Things communications network.

The only problem is, the Federal Communications Commission (FCC) had dismissed Swarm’s application for its experimental satellites a month earlier, on safety grounds.

On Wednesday, the FCC sent Swarm a letter revoking its authorization for a follow-up mission with four more satellites, due to launch next month. A pending application for a large market trial of Swarm’s system with two Fortune 100 companies could also be in jeopardy.

Swarm Technologies, based in Menlo Park, Calif., is the brainchild of two talented young aerospace engineers. Sara Spangelo, its CEO, is a Canadian who worked at NASA’s Jet Propulsion Laboratory, before moving to Google in 2016. Spangelo’s astronaut candidate profile at the Canadian Space Agency says that while at Google, she led a team developing a spacecraft concept for its moonshot X division, including both technical and market analyses.

Swarm CFO Benjamin Longmier has an equally impressive resume. In 2015, he sold his near-space balloon company Aether Industries to Apple, before taking a teaching post at the University of Michigan. He is also co-founder of Apollo Fusion, a company producing an innovative electric propulsion system for satellites.

Although a leading supplier in its market, NavWorx was a bit player at the government level. Not that many people have small private airplanes anymore. But Swarm operates at a different level, an may be able to grease a enough palms in the Washington, D.C., enough to survive this debacle. Or, they may reconstitute as a notionally new company, then claim a clean start. Again unlike the NavWorx, there's no installed base.

March 13, 2018 03:45 PM

March 11, 2018

Greg Kroah-Hartman: My affidavit in the Geniatech vs. McHardy case

As many people know, last week there was a court hearing in the Geniatech vs. McHardy case. This was a case brought claiming a license violation of the Linux kernel in Geniatech devices in the German court of OLG Cologne.

Harald Welte has written up a wonderful summary of the hearing, I strongly recommend that everyone go read that first.

In Harald’s summary, he refers to an affidavit that I provided to the court. Because the case was withdrawn by McHardy, my affidavit was not entered into the public record. I had always assumed that my affidavit would be made public, and since I have had a number of people ask me about what it contained, I figured it was good to just publish it for everyone to be able to see it.

There are some minor edits from what was exactly submitted to the court such as the side-by-side German translation of the English text, and some reformatting around some footnotes in the text, because I don’t know how to do that directly here, and they really were not all that relevant for anyone who reads this blog. Exhibit A is also not reproduced as it’s just a huge list of all of the kernel releases in which I felt that were no evidence of any contribution by Patrick McHardy.

AFFIDAVIT

I, the undersigned, Greg Kroah-Hartman,
declare in lieu of an oath and in the
knowledge that a wrong declaration in
lieu of an oath is punishable, to be
submitted before the Court:

I. With regard to me personally:

1. I have been an active contributor to
   the Linux Kernel since 1999.

2. Since February 1, 2012 I have been a
   Linux Foundation Fellow.  I am currently
   one of five Linux Foundation Fellows
   devoted to full time maintenance and
   advancement of Linux. In particular, I am
   the current Linux stable Kernel maintainer
   and manage the stable Kernel releases. I
   am also the maintainer for a variety of
   different subsystems that include USB,
   staging, driver core, tty, and sysfs,
   among others.

3. I have been a member of the Linux
   Technical Advisory Board since 2005.

4. I have authored two books on Linux Kernel
   development including Linux Kernel in a
   Nutshell (2006) and Linux Device Drivers
   (co-authored Third Edition in 2009.)

5. I have been a contributing editor to Linux
   Journal from 2003 - 2006.

6. I am a co-author of every Linux Kernel
   Development Report. The first report was
   based on my Ottawa Linux Symposium keynote
   in 2006, and the report has been published
   every few years since then. I have been
   one of the co-author on all of them. This
   report includes a periodic in-depth
   analysis of who is currently contributing
   to Linux. Because of this work, I have an
   in-depth knowledge of the various records
   of contributions that have been maintained
   over the course of the Linux Kernel
   project.

   For many years, Linus Torvalds compiled a
   list of contributors to the Linux kernel
   with each release. There are also usenet
   and email records of contributions made
   prior to 2005. In April of 2005, Linus
   Torvalds created a program now known as
   “Git” which is a version control system
   for tracking changes in computer files and
   coordinating work on those files among
   multiple people. Every Git directory on
   every computer contains an accurate
   repository with complete history and full
   version tracking abilities.  Every Git
   directory captures the identity of
   contributors.  Development of the Linux
   kernel has been tracked and managed using
   Git since April of 2005.

   One of the findings in the report is that
   since the 2.6.11 release in 2005, a total
   of 15,637 developers have contributed to
   the Linux Kernel.

7. I have been an advisor on the Cregit
   project and compared its results to other
   methods that have been used to identify
   contributors and contributions to the
   Linux Kernel, such as a tool known as “git
   blame” that is used by developers to
   identify contributions to a git repository
   such as the repositories used by the Linux
   Kernel project.

8. I have been shown documents related to
   court actions by Patrick McHardy to
   enforce copyright claims regarding the
   Linux Kernel. I have heard many people
   familiar with the court actions discuss
   the cases and the threats of injunction
   McHardy leverages to obtain financial
   settlements. I have not otherwise been
   involved in any of the previous court
   actions.

II. With regard to the facts:

1. The Linux Kernel project started in 1991
   with a release of code authored entirely
   by Linus Torvalds (who is also currently a
   Linux Foundation Fellow).  Since that time
   there have been a variety of ways in which
   contributions and contributors to the
   Linux Kernel have been tracked and
   identified. I am familiar with these
   records.

2. The first record of any contribution
   explicitly attributed to Patrick McHardy
   to the Linux kernel is April 23, 2002.
   McHardy’s last contribution to the Linux
   Kernel was made on November 24, 2015.

3. The Linux Kernel 2.5.12 was released by
   Linus Torvalds on April 30, 2002.

4. After review of the relevant records, I
   conclude that there is no evidence in the
   records that the Kernel community relies
   upon to identify contributions and
   contributors that Patrick McHardy made any
   code contributions to versions of the
   Linux Kernel earlier than 2.4.18 and
   2.5.12. Attached as Exhibit A is a list of
   Kernel releases which have no evidence in
   the relevant records of any contribution
   by Patrick McHardy.

March 11, 2018 01:51 AM