Kernel Planet

June 28, 2016

Paul E. Mc Kenney: Stupid RCU Tricks: Ordering of RCU Callback Invocation

Suppose a Linux-kernel task registers a pair of RCU callbacks, as follows:


call_rcu(&p->rcu, myfunc);  /* first callback registered */
smp_mb();                   /* full memory barrier between the two registrations */
call_rcu(&q->rcu, myfunc);  /* second callback registered */

Given that these two callbacks are guaranteed to be registered in order,
are they also guaranteed to be invoked in order?

June 28, 2016 05:58 PM

June 21, 2016

Matthew Garrett: I've bought some more awful IoT stuff

I bought some awful WiFi lightbulbs a few months ago. The short version: they introduced terrible vulnerabilities on your network, they violated the GPL and they were also just bad at being lightbulbs. Since then I've bought some other Internet of Things devices, and since people seem to have a bizarre level of fascination with figuring out just what kind of fractal of poor design choices these things frequently embody, I thought I'd oblige.

Today we're going to be talking about the KanKun SP3, a plug that's been around for a while. The idea here is pretty simple - there's lots of devices that you'd like to be able to turn on and off in a programmatic way, and rather than rewiring them the simplest thing to do is just to insert a control device in between the wall and the device and now you can turn your foot bath on and off from your phone. Most vendors go further and also allow you to program timers and even provide some sort of remote tunneling protocol so you can turn off your lights from the comfort of somebody else's home.

The KanKun has all of these features and a bunch more, although when I say "features" I kind of mean the opposite. I plugged mine in and followed the install instructions. As is pretty typical, this took the form of the plug bringing up its own Wifi access point, the app on the phone connecting to it and sending configuration data, and the plug then using that data to join your network. Except it didn't work. I connected to the plug's network, gave it my SSID and password and waited. Nothing happened. No useful diagnostic data. Eventually I plugged my phone into my laptop and ran adb logcat, and the Android debug logs told me that the app was trying to modify a network that it hadn't created. Apparently this isn't permitted as of Android 6, but the app was handling this denial by just trying again. I deleted the network from the system settings, restarted the app, and this time the app created the network record and could modify it. It still didn't work, but that's because it let me give it a 5GHz network and it only has a 2.4GHz radio, so one reset later and I finally had it online.

The first thing I normally do to one of these things is run nmap with the -O argument, which gives you an indication of what OS it's running. I didn't really need to in this case, because if I just telnetted to port 22 I got a dropbear ssh banner. Googling turned up the root password ("p9z34c") and I was logged into a lightly hacked (and fairly obsolete) OpenWRT environment.

It turns out that there's a whole community of people playing with these plugs, and it's common for people to install CGI scripts on them so they can turn them on and off via an API. At first this sounds somewhat confusing, because if the phone app can control the plug then there clearly is some kind of API, right? Well ha yeah ok that's a great question and oh good lord do things start getting bad quickly at this point.

I'd grabbed the apk for the app and a copy of jadx, an incredibly useful piece of code that's surprisingly good at turning compiled Android apps into something resembling Java source. I dug through that for a while before figuring out that before packets were being sent, they were being handed off to some sort of encryption code. I couldn't find that in the app, but there was a native ARM library shipped with it. Running strings on that showed functions with names matching the calls in the Java code, so that made sense. There were also references to AES, which explained why when I ran tcpdump I only saw bizarre garbage packets.

But what was surprising was that most of these packets were substantially similar. There were a load that were identical other than a 16-byte chunk in the middle. That, plus the fact that every payload length was a multiple of 16 bytes, strongly indicated that AES was being used in ECB mode. In ECB mode the plaintext is split into 16-byte chunks and each chunk is encrypted independently with the same key, so the same plaintext block will always result in the same encrypted output. This implied that the packets carried largely identical plaintext and that the encryption key was static.
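
ECB's defining weakness is easy to demonstrate. Here's a minimal sketch in Rust, using a toy 16-byte block cipher as a stand-in for AES (not real crypto, just enough to show the mode's behaviour):

```rust
// Toy 16-byte "block cipher": NOT real crypto, just a deterministic
// stand-in for AES to illustrate the ECB property that identical
// plaintext blocks always produce identical ciphertext blocks.
fn encrypt_block(block: &[u8; 16], key: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        // XOR with the key byte, then rotate: deterministic per (block, key).
        out[i] = (block[i] ^ key[i]).rotate_left(3);
    }
    out
}

// ECB mode: split the plaintext into 16-byte blocks and encrypt each
// one independently with the same key. No IV, no chaining.
fn ecb_encrypt(plaintext: &[u8], key: &[u8; 16]) -> Vec<[u8; 16]> {
    plaintext
        .chunks(16)
        .map(|c| {
            let mut block = [0u8; 16]; // zero-pad the final short chunk
            block[..c.len()].copy_from_slice(c);
            encrypt_block(&block, key)
        })
        .collect()
}

fn main() {
    let key = *b"0123456789abcdef";
    // Two packets that differ only in the middle 16-byte chunk.
    let p1 = b"HEADER__________COMMAND=open____TRAILER_________";
    let p2 = b"HEADER__________COMMAND=shut____TRAILER_________";
    let c1 = ecb_encrypt(p1, &key);
    let c2 = ecb_encrypt(p2, &key);
    assert_eq!(c1[0], c2[0]); // identical header blocks: identical ciphertext
    assert_ne!(c1[1], c2[1]); // only the differing middle block changes
    assert_eq!(c1[2], c2[2]); // identical trailer blocks: identical ciphertext
    println!("ECB leaks structure: only the middle block differs");
}
```

This is exactly the pattern seen on the wire: mostly-identical ciphertexts with one differing 16-byte chunk.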

Some more digging showed that someone had figured out the encryption key last year, and that someone else had written some tools to control the plug without needing to modify it. The protocol is basically ASCII and consists mostly of the MAC address of the target device, a password and a command. This is then encrypted and sent to the device's IP address. The device then sends a challenge packet containing a random number. The app has to decrypt this, obtain the random number, create a response, encrypt that and send it before the command takes effect. This avoids the most obvious weakness around using ECB - since the same plaintext always encrypts to the same ciphertext, you could just watch encrypted packets go past and replay them to get the same effect, even if you didn't have the encryption key. Using a random number in a challenge forces you to prove that you actually have the key.
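
The handshake logic looks roughly like the following; the names and structure here are my own reconstruction from the description above, with the encryption layer elided so the sequencing is visible:

```rust
// Sketch of the challenge-response flow. The types and method names are
// hypothetical; the real protocol wraps all of this in AES-ECB.
struct Plug {
    expected: Option<u32>, // outstanding challenge, if any
}

impl Plug {
    // On receiving a command, the plug answers with a fresh challenge
    // and defers the command until that challenge is echoed back.
    fn receive_command(&mut self, next_random: u32) -> u32 {
        self.expected = Some(next_random);
        next_random
    }

    // The command only takes effect if the response matches the
    // most recent challenge.
    fn receive_response(&mut self, response: u32) -> bool {
        let ok = self.expected == Some(response);
        self.expected = None;
        ok
    }
}

fn main() {
    let mut plug = Plug { expected: None };

    // Legitimate client: decrypts the challenge and echoes it back.
    let challenge = plug.receive_command(0x1234_5678);
    assert!(plug.receive_response(challenge));

    // Passive eavesdropper replaying an old response: a new challenge
    // has been issued, so the stale answer is rejected -- provided the
    // challenge really is unpredictable.
    let _fresh = plug.receive_command(0x9abc_def0);
    assert!(!plug.receive_response(challenge));
    println!("stale replay rejected");
}
```

The whole scheme hinges on the challenge being unpredictable, which is where the next paragraph comes in.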

At least, it would do if the numbers were actually random. It turns out that the plug is just calling rand(). Further, it turns out that it never calls srand(). This means that the plug will always generate the same sequence of challenges after a reboot, which means you can still carry out replay attacks if you can reboot the plug. Strong work.
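
Why that matters is easy to show. C's rand() is a deterministic generator, and without srand() it always starts from the same default seed of 1. A sketch using the classic C-standard LCG constants (not necessarily the exact libc implementation on the plug):

```rust
// Minimal linear congruential generator mimicking an unseeded C rand().
// Constants are the well-known C-standard example values; they're here
// to illustrate determinism, not to model this plug's libc exactly.
struct LibcRand {
    state: u32,
}

impl LibcRand {
    fn new() -> Self {
        LibcRand { state: 1 } // the default seed when srand() is never called
    }
    fn rand(&mut self) -> u32 {
        self.state = self.state.wrapping_mul(1103515245).wrapping_add(12345);
        (self.state >> 16) & 0x7fff
    }
}

fn main() {
    // Two "reboots" of the plug: same default seed, same challenge sequence.
    let boot1: Vec<u32> = {
        let mut r = LibcRand::new();
        (0..5).map(|_| r.rand()).collect()
    };
    let boot2: Vec<u32> = {
        let mut r = LibcRand::new();
        (0..5).map(|_| r.rand()).collect()
    };
    assert_eq!(boot1, boot2);
    println!("challenges after every reboot: {:?}", boot1);
}
```

So an attacker who can force a reboot knows every challenge in advance, and the replay protection evaporates.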

But there was still the question of how the remote control works, since the code on github only worked locally. tcpdumping the traffic from the server and trying to decrypt it in the same way as local packets worked fine, and showed that the only difference was that the packet started "wan" rather than "lan". The server decrypts the packet, looks at the MAC address, re-encrypts it and sends it over the tunnel to the plug that registered with that address.

That's not really a great deal of authentication. The protocol permits a password, but the app doesn't insist on it - some quick playing suggests that about 90% of these devices still use the default password. And the devices are all based on the same wifi module, so the MAC addresses are all in the same range. The process of sending status check packets to the server with every MAC address wouldn't take that long and would tell you how many of these devices are out there. If they're using the default password, that's enough to have full control over them.

There are some other failings. The github repo mentioned earlier includes a script that allows arbitrary command execution - the wifi configuration information is passed to the system() command, so leaving a semicolon in the middle of it will result in your own commands being executed. Thankfully this doesn't seem to be true of the daemon that's listening for the remote control packets, which seems to restrict its use of system() to data entirely under its control. But even if you change the default root password, anyone on your local network can get root on the plug. So that's a thing. It also downloads firmware updates over http and doesn't appear to check signatures on them, so there's the potential for MITM attacks on the plug itself. The remote control server is on AWS unless your timezone is GMT+8, in which case it's in China. Sorry, Western Australia.
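
The injection works as in this sketch (the field names and commands are illustrative, not the plug's actual script, and it assumes a Unix shell is available):

```rust
use std::process::Command;

// Vulnerable pattern: untrusted data interpolated into a shell command
// string, as with C's system(). A semicolon in the SSID smuggles in an
// attacker-controlled command.
fn run_via_shell(ssid: &str) -> String {
    let cmdline = format!("echo configuring ssid {}", ssid);
    let out = Command::new("sh").arg("-c").arg(cmdline).output().unwrap();
    String::from_utf8_lossy(&out.stdout).into_owned()
}

// Safe variant: the untrusted value is passed as a single argv element,
// so no shell ever parses it.
fn run_as_argument(ssid: &str) -> String {
    let out = Command::new("echo")
        .arg("configuring ssid")
        .arg(ssid)
        .output()
        .unwrap();
    String::from_utf8_lossy(&out.stdout).into_owned()
}

fn main() {
    let malicious = "home; echo INJECTED";
    // The shell version executes the injected command...
    assert!(run_via_shell(malicious).contains("INJECTED\n"));
    // ...while the argv version just echoes the literal string.
    assert!(run_as_argument(malicious).contains("home; echo INJECTED"));
    println!("shell interpolation ran the injected command; argv did not");
}
```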

It's running Linux and includes Busybox and dnsmasq, so plenty of GPLed code. I emailed the manufacturer asking for a copy and got told that they wouldn't give it to me, which is unsurprising but still disappointing.

The use of AES is still somewhat confusing, given the relatively small amount of security it provides. One thing I've wondered is whether it's not actually intended to provide security at all. The remote servers need to accept connections from anywhere and funnel decent amounts of traffic around from phones to switches. If that weren't restricted in any way, competitors would be able to use existing servers rather than setting up their own. Using AES at least provides a minor obstacle that might encourage them to set up their own server.

Overall: the hardware seems fine, the software is shoddy and the security is terrible. If you have one of these, set a strong password. There's no rate-limiting on the server, so a weak password will be broken pretty quickly. It's also infringing my copyright, so I'd recommend against it on that point alone.

June 21, 2016 11:11 PM

June 15, 2016

LPC 2016: Android/Mobile Microconference Accepted into 2016 Linux Plumbers Conference

Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other topics include efficient operation on big.LITTLE systems, support for HiKey in AOSP (and multi-device support in general), and the upcoming migration to Clang for Android builds.

Android continues to be a very exciting and dynamic project, with the above topics merely scratching the surface. For more details, see the Android/Mobile Microconference wiki page.

June 15, 2016 11:31 PM

Paul E. Mc Kenney: Android/Mobile Microconference Accepted into 2016 Linux Plumbers Conference

Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other topics include efficient operation on big.LITTLE systems, support for HiKey in AOSP (and multi-device support in general), and the upcoming migration to Clang for Android builds.

Android continues to be a very exciting and dynamic project, with the above topics merely scratching the surface. For more details, see the Android/Mobile Microconference wiki page.

June 15, 2016 03:25 PM

Pete Zaitcev: Go go go

Can you program in Go without knowing a thing about it? Why, yes. A barrier to entry, where are you?

June 15, 2016 02:57 PM

Rusty Russell: Minor update on transaction fees: users still don’t care.

I ran some quick numbers on the last retargeting period (blocks 415296 through 416346 inclusive) which is roughly a week’s worth.

Blocks were full: median 998k, mean 818k (some miners blind mining on top of unknown blocks). Yet of the 1,618,170 non-coinbase transactions, 48% were still paying dumb, round fees (like 5000 satoshis). Another 5% were paying dumb round-numbered per-byte fees (like 80 satoshi per byte).

The mean fee was 24051 satoshi (~16c), the mean fee rate 60 satoshi per byte. But if we look at the amount you needed to pay to get into a block (using the second cheapest tx which got in), the mean was 16.81 satoshis per byte, or about 5c.

tl;dr: It’s like a tollbridge charging vehicles 7c per ton, but half the drivers are just throwing a quarter as they drive past and hoping it’s enough. It really shows fees aren’t high enough to notice, and transactions don’t get stuck often enough to notice. That’s surprising; at what level will they notice? What wallets or services are they using?

June 15, 2016 03:00 AM

June 12, 2016

Daniel Vetter: On Getting Patches Merged

In some projects there's an awesome process for handling newcomers' contributions - an autobuilder picks up your pull and runs full CI on it, coding style checkers automatically do basic review, and even the functional review load is assigned with tooling.

Then there's projects where utter chaos and ad-hoc process reign, like the Linux kernel or the X.org community, and it's much harder for new folks to get a foot in the door. Of course there's documentation trying to bridge that gap, tools like get_maintainers.pl to figure out whom to ping, but that's kinda the details. In the end you need someone from the inside to care about what you're doing and guide you through the maze the first few times.

I've been pinged about this a few times recently on IRC, so I figured I'll type up my recommended best practices.


The crucial bit is that such unstructured developer communities run entirely on mutual trust, and patches get reviewed through a market of favours as in "I review yours and you review my patches". As a newcomer you have neither. The goal is to build up enough trust and review favours owed to you to get your patches in.

And finally your patches have landed, and in the process you've gotten to know a few interesting people. And getting to know new folks is in my opinion really what open source and community is all about. Congrats, and all the best for the next step in your journey!

And finally for the flip side, there's a great write up from Sarah Sharp about doing review, which applies especially to reviewing newcomers' patches.

June 12, 2016 08:45 PM

June 10, 2016

Andy Grover: Why Rust for Low-level Linux programming?

I think Rust is extremely well-suited for low level Linux systems userspace programming — daemons, services, command-line tools, that sort of thing.

Low-level userspace code on Linux is almost universally written in C — until one gets to a certain point where it’s acceptable for Python to be used. Undoubtedly this springs from Linux’s GNU & Unix heritage, but there are also many recent and Linux-specific pieces that are written in C. I think Rust is a better choice for new projects, and here’s why.

Coding is challenging because of mental context-keeping

Coding is hard and distractions are bad because of how much context the developer needs to keep straight as they look at the code. Buffers allocated, locks taken, local variables — these all create little mental things that I need to remember if I’m going to understand a chunk of code, and fix or improve it. Why have proper indentation? Because it helps us keep things straight in our heads. Why keep functions short? Same reason.

Rust reduces the amount of state I need to keep track of in my brain. It checks things that before I depended on myself to check. It gives me tools to express what I want in fewer lines, but still allows maximum control when needed. The same functionality with less code, and checks to ensure it’s better code, these make me more productive and introduce fewer bugs.

Strong types help the compiler help you

Strong typing gives the compiler information it can use to spot errors. This is important as a program grows from a toy into a useful thing. Assumptions within the code change, and strong typing checks those assumptions so that each version of the program globally uses either the old assumptions or the new ones, but not both.

The key to this is being able to describe to the compiler the intended constraints of our code as clearly as possible.
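
One way to hand those constraints to the compiler is the newtype pattern; here's a small illustrative sketch (the names are invented for this example):

```rust
// Byte counts and block counts are both "just integers" in C; wrapping
// them in distinct types means mixing them up won't compile.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bytes(u64);

#[derive(Debug, Clone, Copy, PartialEq)]
struct Blocks(u64);

// The signature itself states the constraint: lengths in, blocks out.
fn blocks_needed(len: Bytes, block_size: Bytes) -> Blocks {
    Blocks((len.0 + block_size.0 - 1) / block_size.0) // round up
}

fn main() {
    let file_len = Bytes(4100);
    let bs = Bytes(4096);
    let n = blocks_needed(file_len, bs);
    assert_eq!(n, Blocks(2));
    // blocks_needed(n, bs); // compile error: expected Bytes, found Blocks
    println!("{:?} needs {:?}", file_len, n);
}
```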

Expressive types prevent needing type “escape hatches”

One problem with weakly-typed languages is when you need to do something that the type system doesn’t quite let you describe. This leads to needing to use the language’s escape hatches, the “do what I mean!” outs, like casting, that let you do what you need to do, but also inherently create places where the compiler can’t help check things.

Or, there may be ambiguity because a type serves two purposes, depending on the context. One example would be returning a pointer. Is NULL a “good” return value, or is it an error? The programmer needs to know based upon external information (docs), and the compiler can’t help by checking the return value is used properly. Or, say a function returns int. Usually a negative value is an error, but not always. And, negative values are nonzero so they evaluate as true in a conditional statement! It’s just…loose.

Rust distinguishes between valid and error results much more explicitly. It has a richer type system with sum types like Option and Result. These eliminate using a single value for both error and success return cases, and let the compiler help the programmer get it right. A richer type system lets us avoid needing escapes.
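
A small sketch of how those two C conventions map onto Option and Result (the function names here are invented for illustration):

```rust
// Option replaces "NULL means not found": the caller must handle None,
// because an Option<&str> cannot be used where a &str is expected.
fn find_user(id: u32) -> Option<&'static str> {
    match id {
        1 => Some("alice"),
        _ => None,
    }
}

// Result replaces "negative int means error": success and error are
// distinct variants, so an error value can't accidentally flow into
// code expecting a valid result.
fn read_config(path: &str) -> Result<String, String> {
    if path.is_empty() {
        Err("empty path".to_string())
    } else {
        Ok(format!("config loaded from {}", path))
    }
}

fn main() {
    // The compiler forces us to decide what None means right here.
    match find_user(2) {
        Some(name) => println!("found {}", name),
        None => println!("no such user"),
    }
    assert!(find_user(1).is_some());
    assert!(read_config("").is_err());
    assert_eq!(read_config("/etc/app").unwrap(), "config loaded from /etc/app");
}
```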

Memory Safety, Lifetimes, and the Borrow Checker

For me, this is another case where Rust is enabling the verification of something that C programmers learned painfully how to do right — or else. In C I’ve had functions that “borrowed” a pointer versus ones that “took ownership”, but this was not enforced in the language, only documented in the occasional comment above the function that it had one behavior or the other. So for me it was like “duh”, yeah, we notate this, have terms that express what’s happening, and the compiler can check it. Not having to use ref-counting or garbage collection is great; for most cases it’s not strictly needed. And if we do need it, it’s available.
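
In Rust, the borrowed-versus-took-ownership distinction lives in the function signature, so the compiler checks what the C comment could only describe; a minimal sketch:

```rust
// Borrowing: the caller keeps ownership, we only get read access.
fn borrows(s: &str) -> usize {
    s.len()
}

// Taking ownership: after this call the caller can no longer use the
// value -- and that fact is in the type, not in a comment.
fn takes_ownership(s: String) -> String {
    s.to_uppercase()
}

fn main() {
    let name = String::from("kernel");
    let n = borrows(&name);                // a shared borrow
    println!("{} has {} bytes", name, n);  // `name` is still usable
    let shouting = takes_ownership(name);  // `name` is moved here
    // println!("{}", name);  // compile error: use of moved value
    assert_eq!(shouting, "KERNEL");
    assert_eq!(n, 6);
}
```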

Cargo and Libraries

Cargo makes using libraries easy. Easy libraries mean your program can focus more on doing its thing, and in turn make it easier for others to use what you provide. Efficient use of libraries reduces duplicated work and keeps lines of code down.

Functional Code, Functional thinking

I like iterators and methods like map, filter, zip, and chain because they make it easier to break down what I’m doing to a sequence into easier to understand fundamental steps, and also make it easier for other coders to understand the code’s intent.
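
A small example of that style, where the pipeline reads as a sequence of named steps:

```rust
// Drop invalid (negative) samples, square the rest, and keep each
// value's original position -- each step is one named, composable idea.
fn process(readings: &[i32]) -> Vec<(usize, i32)> {
    readings
        .iter()
        .enumerate()
        .filter(|&(_, &v)| v >= 0)
        .map(|(i, &v)| (i, v * v))
        .collect()
}

fn main() {
    let readings = [3, -1, 7, 0, 12, -5, 9];
    assert_eq!(
        process(&readings),
        vec![(0, 9), (2, 49), (3, 0), (4, 144), (6, 81)]
    );

    // zip composes the same way: pairwise sums of two slices.
    let a = [1, 2, 3];
    let b = [10, 20, 30];
    let sums: Vec<i32> = a.iter().zip(b.iter()).map(|(x, y)| x + y).collect();
    assert_eq!(sums, vec![11, 22, 33]);
    println!("pipeline ok");
}
```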

Rewrite everything?

It’s starting to be a cliche. Let’s rewrite everything in Rust! OpenSSL, Tor, the kernel, the browser. Wheeee! Of course this isn’t realistic, but why do people exposed to Rust keep thinking things would be better off in Rust?

I think for two reasons, each from a different group. First, from coders coming from Python, Ruby, and JavaScript, Rust offers some of the higher-level conveniences and productivity they expect. They’re familiar with Rust’s development model and its Cargo-based, GitHub-powered ecosystem. Types and the borrow checker are a learning curve for them, but the result is blazingly fast code, and the ability to do systems-level things, like calling ioctls, where a C extension would’ve been called for — but these people don’t want to learn C. These people might call for a rewrite in Rust because it brings that component into the realm of things they can hack on.

Second, there are people like me, people working in C and Python on Linux systems-level stuff — the “plumbing”, who are frustrated with low productivity. C and Python have diametrically-opposed advantages and disadvantages. C is fast to run but slow to write, and hard to write securely. Python is more productive but too slow and RAM-hungry for something running all the time, on every system. We must deal with getting C components to talk to Python components all the time, and it isn’t fun. Rust is the first language that gives a systems programmer both performance and productivity. These people might see Rust as a chance to increase security, to increase their own productivity, to never have to touch libtool/autoconf ever again, and to solve the C/Python dilemma with a one-language solution.

Incremental Evolution of the Linux Platform

Fun to think about for a coder, but then, what, now we have C, Python, AND Rust that need to interact? “If only everything were Rust… and it would be so easy… how hard could it be?” 🙂 Even in Rust, a huge, not-terribly-fun task. I think Rust has great promise, but success lies in incremental evolution of the Linux platform. We’ve seen service consolidation in systemd and the idea of Linux as a platform distinct from Unix. We have a very useful language-agnostic IPC mechanism — DBus — that gives us more freedom to link things written in new languages. I’m hopeful Rust can find places it can be useful as Linux gains new capabilities, and then perhaps converting existing components may happen as maintainers gain exposure and experience with Rust, and recognize its virtues.

The Future

Rust is not standing still. Recent developments like native debugging support in GDB, and the ongoing MIR work, show that Rust will become even better over time. But don’t wait. Rust can be used to rapidly develop high-quality programs today. Learning Rust can benefit you now, and also yield dividends as Rust and its ecosystem continue to improve.

June 10, 2016 04:00 PM

June 07, 2016

Pete Zaitcev: You are in a maze of twisted little directories, all alike


[root@rhev-a24c-01 go]# make get
go get -t ./...
go install: no install location for directory /root/hail/swift-go/go/bench outside GOPATH
For more details see: go help gopath
[root@rhev-a24c-01 go]# pwd
/root/hail/swift-go/go
[root@rhev-a24c-01 go]# ls -l /root/go/src/github.com/openstack/swift
lrwxrwxrwx. 1 root root 25 Jun 6 21:50 /root/go/src/github.com/openstack/swift -> ../../../../hail/swift-go
[root@rhev-a24c-01 go]# cd /root/go/src/github.com/openstack/swift/go
[root@rhev-a24c-01 go]# pwd
/root/go/src/github.com/openstack/swift/go
[root@rhev-a24c-01 go]# make get
go get -t ./...
[root@rhev-a24c-01 go]#

June 07, 2016 09:21 PM

Matthew Garrett: Be wary of heroes

Inspiring change is difficult. Fighting the status quo typically means being able to communicate so effectively that powerful opponents can't win merely by outspending you. People need to read your work or hear you speak and leave with enough conviction that they in turn can convince others. You need charisma. You need to be smart. And you need to be able to tailor your message depending on the audience, even down to telling an individual exactly what they need to hear to take your side. Not many people have all these qualities, but those who do are powerful and you want them on your side.

But the skills that allow you to convince people that they shouldn't listen to a politician's arguments are the same skills that allow you to convince people that they shouldn't listen to someone you abused. The ability that allows you to argue that someone should change their mind about whether a given behaviour is of social benefit is the same ability that allows you to argue that someone should change their mind about whether they should sleep with you. The visibility that gives you the power to force people to take you seriously is the same visibility that makes people afraid to publicly criticise you.

We need these people, but we also need to be aware that their talents can be used to hurt as well as to help. We need to hold them to higher standards of scrutiny. We need to listen to stories about their behaviour, even if we don't want to believe them. And when there are reasons to believe those stories, we need to act on them. That means people need to feel safe in coming forward with their experiences, which means that nobody should have the power to damage them in reprisal. If you're not careful, allowing charismatic individuals to become the public face of your organisation gives them that power.

There's no reason to believe that someone is bad merely because they're charismatic, but this kind of role allows a charismatic abuser both a great deal of cover and a great deal of opportunity. Sometimes people are just too good to be true. Pretending otherwise doesn't benefit anybody but the abusers.

June 07, 2016 03:33 AM

June 06, 2016

Daniel Vetter: Awesome Atomic Advances

Also, silly titles. Atomic has taken off for real: right now there are 17 drivers supporting atomic modesetting merged into the DRM subsystem, with a pile more pending review and merging each release. But it's not just new drivers, there's also been a steady stream of small improvements over the past year, so I think it's time for an update.


It seems small, but a big improvement made over the past few months is that most driver callbacks used by the helper libraries are now optional. This means tons and tons of dummy functions and boilerplate code can be removed from drivers, leading to less clutter and easier-to-understand driver code. Aside: not all drivers have been decluttered yet; doing that is a great starter project for contributing a first few patches to the DRM subsystem. Many thanks to Boris Brezillon, Noralf Trønnes and many others for making this happen.

A long standing complaint about DRM kernel mode setting is that it's too complicated, especially compared to fbdev when all you have is one dumb framebuffer and nothing else. And yes, in that case there's really no point in having distinct CRTC, plane and encoder objects, and now finally Noralf Trønnes has volunteered to write a helper library for simple display pipelines to hide all that complexity from drivers. It's not yet merged but I'm positive it'll land in 4.8. And it will help to make writing DRM drivers for simple hardware easy and the driver code clutter-free.

Another piece many dumb framebuffer drivers need is support for manually uploading new contents to the screen. Often on these simple panels there's no point in doing page-flipping of entire buffers since a real render engine is nowhere to be seen. And the panels are often behind a really slow bus, making full screen uploads too expensive. Instead it's all done by directly drawing into the frontbuffer, and then telling the driver what changed so that it can shovel the new bits over the bus to the panel. DRM has full support for this through a dirty interface and IOCTL, and legacy fbdev also has some support for this. But the fbdev emulation helpers in DRM never wired these two bits together, forcing each driver to type out its own boilerplate. Noralf has fixed this by implementing fbdev deferred I/O support for the DRM fbdev helpers.

A related improvement is generic support to disable the fbdev emulation from Archit Taneja, both through a Kconfig option and a module option. Most distributions still expect fbdev to work for the boot splash, recovery console and emergency logging. But some, like ChromeOS, are entirely legacy-free and don't need any of this. Thus far every DRM driver had to implement fbdev emulation support, and the option to disable it, itself. Now that's all done in the library using dummy stub functions in the disabled case, again simplifying driver code.

Somehow most ARM-SoC display drivers start out their system suspend/resume support with a dumb register save/restore. I guess because with simple hardware that works, and regmap provides it almost for free. And then everyone learns the lessons why the atomic modeset helpers have a very strict state transition model the hard way: Display hardware gets upset extremely easily when things are done in the wrong order, or without the required delays, obeying the dependencies between components and much more. Dumb register restoring does none of that. To fix this Thierry Reding implemented suspend/resume helpers for atomic drivers. Unfortunately not many drivers use this support yet, which is again a nice opportunity to get a kernel patch merged if you have the hardware for such a driver.

Another big gap in the original atomic infrastructure that's finally getting close is generic support for nonblocking commits. The tricky part there is getting the dependency tracking between commits on different display parts right, and I secretly hoped that with a few examples it would be easier to implement something that's useful for most drivers. With 17 examples I've finally run out of excuses to postpone this, after more than 1 year.

But even more important than making the code prettier for atomic drivers and removing boilerplate with better helpers and libraries is in my opinion explaining it all, and making sure all drivers work the same. Over the past few months there's been massive improvements to the driver interface documentation. One of the big items there is certainly documenting the expected behaviour, return codes and special cases of every driver callback. But there is lots more improved than just this example, so go and read them! And of course, when you spot an inconsistency or typo, just send in a patch to fix it. And it's not just contents, but also presentation: Jani Nikula's sphinx-based documentation toolchain will hopefully land in 4.8 - the above links are already generated using it, for a peek at all the new pretty.


The flip side is testing, and on that front Collabora's effort to convert all the kernel mode-setting tests in intel-gpu-tools to be generic and useful on any DRM driver is progressing nicely. For a bit more detail, read Tomeu Vizoso's blog on validating changes to KMS drivers.

Finally it's not all improvements to make it easier to write great drivers, there's also some new feature work. Lionel Landwerlin added new atomic properties to implement color management support. And there's work on-going to implement Z-order and blending properties, and lots more, but that's not yet ready for merging.

June 06, 2016 07:06 PM

LPC 2016: LPC 2016 registration is now Open

Early bird rate is available until our quota of 140 runs out (or we reach 26 August).  Please click here to register.

June 06, 2016 05:38 PM

June 03, 2016

LPC 2016: Registration Opening Delayed until Monday 6 June

A while ago, we decided to combine the Kernel Summit Open Day with Linux Plumbers, meaning that Plumbers itself now runs from 1-4 November (as you can see from the updated site banners).  Unfortunately, we neglected to test that the registration system was ready for this change, so we’re scrambling now to fix it and we should have registration open by Monday.  Sorry for the delay.

June 03, 2016 03:20 PM

May 30, 2016

Andi Kleen: Literature survey for practical TSX lock elision applications

Introduction

This is a short, non-comprehensive (and somewhat biased) literature survey on how lock elision with Intel TSX can be used to improve the performance of programs by increasing parallelism. The focus is on practical incremental improvements in existing software.

A basic introduction of TSX lock elision is available in Scaling existing lock based applications with Lock elision.

Lock libraries

The papers below are on actual code using lock elision, not just how to implement lock elision itself. The basic rules of how to implement lock elision in a locking library are in chapter 12 of the Intel Optimization manual. Common anti-patterns (mistakes) while doing so are described in TSX Anti-patterns.

Existing lock libraries that implement lock elision using TSX include glibc pthreads, Threading Building Blocks, tsx-tools, and ConcurrencyKit.

Databases

Lock elision has been used widely to speed up both production and research databases. A lot of work has been done on SAP HANA, an in-memory database which uses TSX in production today to improve the performance of its B+Tree index and delta tree data structures. This is described in Improving In-Memory Database Index Performance with TSX.

Several research databases go beyond just using TSX to speed up specific data structures, and map complete database transactions to hardware transactions. This requires many more changes to the databases, and careful memory layout, but can also give larger gains. It generally goes beyond simple lock elision. This approach has been implemented by Leis in Exploiting HTM in Main memory databases. A similar approach is used in Using Restricted Transactional Memory to Build a Scalable In-Memory Database. Both see large gains on TPC-C style workloads.

For non-in-memory databases, a group at Berkeley used lock elision to improve parallelism in LevelDB, getting a 25% speedup on a 4-core system. Standard LevelDB is essentially a single-lock system, so this was a nice speedup with only minor work compared to other efforts that used manual fine-grained locking to improve parallelism in LevelDB. However, it required special handling of condition variables, which are used for commit. For simpler key-value stores, Diegues used an automatic tuner with TSX transactions to get a 2x gain with memcached.

DrTM uses TSX together with RDMA to implement a fast distributed transaction manager using 2PL. TSX provides local isolation, while RDMA (which aborts transactions) provides remote isolation.

Languages

An attractive use of lock elision is to speed up the locks implicit in language runtimes. Yoo implemented transparent support for using TSX for Java synchronized sections in Early Experience on Transactional Execution of Java TM Programs. The runtime automatically detects when a synchronized region does not benefit from TSX and disables lock elision for it. This works transparently, but using it successfully may still need some changes in the program to minimize conflicts and other aborts, as described by Gil Tene in Understanding Hardware Transactional Memory. This support is available in JDK 8u40 and can be enabled with the -XX:+UseRTMLocking option.

Another interesting case for lock elision is improving parallelism under the Global Interpreter Lock (GIL) used in interpreters. Odaira implemented this for Ruby in Eliminating Global Interpreter Locks in Ruby through HTM. They saw a 4.4x speedup on Ruby NPB, 1.6x in WEBrick and a 1.2x speedup in Ruby on Rails on a 12-core system.

Hardware transactions can be used to auto-parallelize existing loops, even when the compiler cannot prove that iterations are independent, by using the transactions to isolate individual iterations. Odaira implemented TSX loop speculation manually for workloads in SPEC CPU and reports an 11% speedup on a 4-core system. There are some limitations to this technique due to the lazy subscription problem described by Dice, but in principle it can be implemented directly in compilers.

Salamanca used TSX to implement the recovery code for Speculative Trace Optimization (STO) of loops. The basic principles are similar to the previous paper, but they implemented an automated prototype. They report a 9% improvement on 4 cores across a number of benchmarks.

An older paper, which predates TSX, by Dice describes how to use Hardware Transactional Memory to simplify Work Stealing schedulers.

High Performance Computing

Yoo et al. use TSX lock elision to benchmark a number of HPC applications in Performance Evaluation of Intel TSX for High-Performance Computing. They report an average 41% speedup on a 4-core system. They also report an average 31% improvement in bandwidth when applying TSX to a user-space TCP stack.

Hay explored lock elision for improving Parallel Discrete Event Simulations. He reports speedups of up to 28%.

Data structures

Lock elision has been used widely to speed up parallel data structures. Normally applying lock elision to an existing data structure is very simple: elide the lock protecting it. But some tweaking of the data structure can often give better performance. Dementiev explores TSX for general fast scalable hash tables. Li uses TSX to implement scalable Cuckoo hash tables. Using TSX for hash tables is generally very straightforward. For tree data structures one needs to be careful that tree rebalancing does not overwhelm the write set capacity of the hardware transactions. Repetti uses TSX to scale Patricia tries. Siakavas explores TSX usage for scalable Red-Black trees, as does this paper. Bonnichsen uses HTM to improve BT trees, reporting a speedup of 2x to 3x compared to earlier implementations. The database papers described above give the rules needed to successfully elide B-Trees.

Calciu uses TSX to implement a more scalable priority queue based on skip lists and reports increased parallelism.

Memory allocation and Garbage Collection

One challenge with using garbage collection is that the worst case “stop the world” pauses from parallel garbage collectors limit the total heap size that can be supported in copying garbage collectors. The Chihuahua GC paper implements a prototype TSX-based collector for the Jikes research Java VM. They report up to a 101% speedup in concurrent copying speed, and show that a simple parallel garbage collector can be implemented with limited effort.

Another dog-themed GC, the Collie garbage collector (whose original paper predates TSX), is a production quality parallel collector that minimizes pauses and allows scaling to large heaps. Opdahl has another description of the Collie algorithm. It is presumably deployed on TSX by now in Azul's commercial Zing JVM product, which claims to scale up to 2 TB of heap memory.

StackTrack is an efficient algorithm for automatic memory reclamation in parallel data structures using hardware transactions, outperforming existing techniques such as hazard pointers. It requires recompiling the program with a specially patched gcc compiler, which automatically creates variable-length transactions for functions that free memory. The technique could potentially be used even without special compilers.

Kuszmaul uses TSX to implement the scalable SuperMalloc and reports good performance combined with relatively simple code. Dice et al. report how a cache-index-aware malloc can improve TSX performance by improving utilization of the L1 cache.

Other usages

Peters uses lock elision to parallelize a microkernel in For a Microkernel, a Big Lock Is Fine and finds that RTM lock elision outperforms fine-grained locking due to lower single-thread overhead.

May 30, 2016 04:40 AM

May 29, 2016

Pete Zaitcev: Encrypt everything? Please reconsider.

Somehow it became fashionable among site admins to set things up so that accessing a site over HTTP immediately redirects to https://. But doing that adds new ways to fail, such as an expired certificate:

Notice that Firefox provides no way to ignore the problem and access the website (which was supposed to be accessible over HTTP to begin with). Solution? Use Chrome, which does:

Or, disable NTP and change your PC's clock two days back (be careful with running make while doing so).

This was discussed by CKS previously (of course), and he seems to think that the benefits outweigh the downsides of an occasional fuck-up, like the only website in the world that has the information I want right now suddenly becoming unavailable with no recourse.

UPDATE: Chris discussed the problem some more and brought up other examples, such as outdated KVM appliances that use obsolete ciphers.

One thing I'm wondering about is whether a redirect from http:// to https:// makes a lot of sense. If you do not support access by plain HTTP, why not return ECONNREFUSED? I'm sure it's an extremely naive idea.

May 29, 2016 05:16 PM

May 26, 2016

Vegard Nossum: Writing a reverb filter from first principles

WARNING/DISCLAIMER: Audio programming always carries the risk of damaging your speakers and/or your ears if you make a mistake. Therefore, remember to always turn down the volume completely before and after testing your program. And whatever you do, don't use headphones or earphones. I take no responsibility for damage that may occur as a result of this blog post!

Have you ever wondered how a reverb filter works? I have... and here's what I came up with.

Reverb is the sound effect you commonly get when you make sound inside a room or building, as opposed to when you are outdoors. The stairwell in my old apartment building had an excellent reverb. Most live musicians hate reverb because it muddles the sound they're trying to create and can even throw them off while playing. On the other hand, reverb is very often used (and overused) on studio vocals because it also has the effect of smoothing out rough edges and imperfections in a recording.

We typically distinguish reverb from echo in that an echo is a single delayed "replay" of the original sound you made. The delay is also typically rather large (think yelling into a distant hill- or mountainside and hearing your HEY! come back a second or more later). In more detail, the two things that distinguish reverb from an echo are:

  1. The reverb inside a room or a hall has a much shorter delay than an echo. The speed of sound is roughly 340 meters/second, so if you're in the middle of a room that is 20 meters by 20 meters, the sound will come back to you (from one wall) after (20 / 2) / 340 = ~0.029 seconds, which is such a short duration of time that we can hardly notice it (by comparison, a 30 FPS video would display each frame for ~0.033 seconds).
  2. After bouncing off one wall, the sound reflects back and reflects off the other wall. It also reflects off the perpendicular walls and any and all objects that are in the room. Even more, the sound has to travel slightly longer to reach the corners of the room (~14 meters instead of 10). All these echoes themselves go on to combine and echo off all the other surfaces in the room until all the energy of the original sound has dissipated.

Intuitively, it should be possible to use multiple echoes at different delays to simulate reverb.

We can implement a single echo using a very simple ring buffer:

    class FeedbackBuffer {
    public:
        unsigned int nr_samples;
        int16_t *samples;

        unsigned int pos;

        FeedbackBuffer(unsigned int nr_samples):
            nr_samples(nr_samples),
            samples(new int16_t[nr_samples]()), /* () zero-initializes the delay line */
            pos(0)
        {
        }

        ~FeedbackBuffer()
        {
            delete[] samples;
        }

        int16_t get() const
        {
            return samples[pos];
        }

        void add(int16_t sample)
        {
            samples[pos] = sample;

            /* If we reach the end of the buffer, wrap around */
            if (++pos == nr_samples)
                pos = 0;
        }
    };

The constructor takes one argument: the number of samples in the buffer, which is exactly how much time we will delay the signal by; when we write a sample to the buffer using the add() function, it will come back after a delay of exactly nr_samples using the get() function. Easy, right?

Since this is an audio filter, we need to be able to read an input signal and write an output signal. For simplicity, I'm going to use stdin and stdout for this -- we will read up to 8192 samples (16 KiB) at a time using read(), process them, and then use write() to output the result. It will look something like this:

    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>


    int main(int argc, char *argv[])
    {
        while (true) {
            int16_t buf[8192];
            ssize_t in = read(STDIN_FILENO, buf, sizeof(buf));
            if (in == -1) {
                /* Error */
                return 1;
            }
            if (in == 0) {
                /* EOF */
                break;
            }

            for (unsigned int j = 0; j < in / sizeof(*buf); ++j) {
                /* TODO: Apply filter to each sample here */
            }

            write(STDOUT_FILENO, buf, in);
        }

        return 0;
    }

On Linux you can use e.g. 'arecord' to get samples from the microphone and 'aplay' to play samples on the speakers, and you can do the whole thing on the command line:

    $ arecord -t raw -c 1 -f s16 -r 44100 |\
        ./reverb | aplay -t raw -c 1 -f s16 -r 44100

(-c means 1 channel; -f s16 means "signed 16-bit" which corresponds to the int16_t type we've used for our buffers; -r 44100 means a sample rate of 44100 samples per second; and ./reverb is the name of our executable.)

So how do we use class FeedbackBuffer to generate the reverb effect?

Remember how I said that reverb is essentially many echoes? Let's add a few of them at the top of main():

    FeedbackBuffer fb0(1229);
    FeedbackBuffer fb1(1559);
    FeedbackBuffer fb2(1907);
    FeedbackBuffer fb3(4057);
    FeedbackBuffer fb4(8117);
    FeedbackBuffer fb5(8311);
    FeedbackBuffer fb6(9931);

The buffer sizes that I've chosen here are somewhat arbitrary (I played with a bunch of different combinations and this sounded okay to me). But I used this as a rough guideline: simulating the 20m-by-20m room at a sample rate of 44100 samples per second means we would need delays roughly on the order of 44100 * (20 / 340) ≈ 2594 samples.

Another thing to keep in mind is that we generally do not want our feedback buffers to be multiples of each other. The reason for this is that it creates a consonance between them and will cause certain frequencies to be amplified much more than others. As an example, if you count from 1 to 500 (and continue again from 1), and you have a friend who counts from 1 to 1000 (and continues again from 1), then you would start out 1-1, 2-2, 3-3, etc. up to 500-500, then you would go 1-501, 2-502, 3-503, etc. up to 500-1000. But then, as you both wrap around, you start at 1-1 again. And your friend will always be on 1 when you are on 1. This has everything to do with periodicity and -- in fact -- prime numbers! If you want to maximise the combined period of two counters, you have to make sure that they are coprime, i.e. that they don't share any common factors. The easiest way to achieve this is to only pick prime numbers to start with, so that's what I did for my feedback buffers above.

Having created the feedback buffers (which each represent one echo of the original sound), it's time to put them to use. The effect I want to create is not simply overlaying echoes at fixed intervals, but having the echoes bounce off and feed back into each other. The way we do this is by first combining them into the output signal... (since we have 8 signals to combine including the original one, I give each one a 1/8 weight)

    float x = .125 * buf[j];
    x += .125 * fb0.get();
    x += .125 * fb1.get();
    x += .125 * fb2.get();
    x += .125 * fb3.get();
    x += .125 * fb4.get();
    x += .125 * fb5.get();
    x += .125 * fb6.get();
    int16_t out = x;

...then feeding the result back into each of them:

    fb0.add(out);
    fb1.add(out);
    fb2.add(out);
    fb3.add(out);
    fb4.add(out);
    fb5.add(out);
    fb6.add(out);

And finally we also write the result back into the buffer. I found that the original signal loses some of its power, so I use a factor 4 gain to bring it roughly back to its original strength; this number is an arbitrary choice by me, I don't have any specific calculations to support it:

    buf[j] = 4 * out;

That's it! 88 lines of code is enough to write a very basic reverb filter from first principles. Be careful when you run it, though, even the smallest mistake could cause very loud and unpleasant sounds to be played.

If you play with different buffer sizes or a different number of feedback buffers, let me know if you discover anything interesting :-)

May 26, 2016 07:05 PM

May 25, 2016

Pete Zaitcev: Russian Joke

In a quick translation from Bash:

XXX: Still writing that profiler?
YYY: Naah, reading books now
XXX: Like what books?
YYY: TCP Illustrated, Understanding the Linux kernel, Linux kernel development
XXX: And I read "The Never-ending Path Of Hatred".
YYY: That's about Node.js, right?

May 25, 2016 07:15 PM

May 22, 2016

Pete Zaitcev: Dell, why u no VPNC

Yo, we heard you liked remote desktops, so we put remote desktop into a remote desktop, now you can remote desktop while remote desktop.

I remember how IBM simply put a VPNC interface in their BladeCenter. It was so nice. Unfortunately, vendors never want to be too nice to users, so their next release switched to a Java applet. Dell copied their approach for DRAC5. In theory, this should be future-proof, all hail WORA. In practice, it only worked with a specific version of Java, which was current when Dell shipped the R905 ten years ago. You know, back then Windows XP was new and hot.

Fortunately, by the magic of KVM, libvirt, and Qemu, it's possible to create a virtual machine, install Fedora 10 on it, and then run Firefox with the stupid Java applet. Also, Firefox and Java have to run in 32-bit mode.

When I did it for the first time, I ran Firefox through X11 redirection. That was quite inconvenient: I had to stop the Firefox running on the host desktop, because one cannot run 2 firefoxes painting to the same $DISPLAY. The reason that happens is, well, Mozilla Foundation is evil, basically. The remote Firefox finds the running Firefox through X11 properties and then some crapmagic happens and everything crashes and burns. So, it's much easier just to hook to the VM with Vinagre and run Firefox with DISPLAY=:0 in there.

Those old Fedoras were so nice, BTW. Funnily enough, that VM with 1 CPU and 1.5 GB starts quicker than the host laptop, which has the benefit of systemd and its ability to run tasks in parallel. Of course, the handling of WiFi in Fedora 20+ is light years ahead of nm-applet in Fedora 10. There was some less noticeable progress elsewhere as well. But at the same time, the bloat was phenomenal.

UPDATE: Java does not work. Running the JNLP simply fails after downloading the applets, without any error messages. To set the plugin type of "native", ssh to DRAC, then "racadm config -g cfgRacTuning -o cfgRacTunePluginType 0". No kidding.

May 22, 2016 03:46 AM

May 20, 2016

Matthew Garrett: Your project's RCS history affects ease of contribution (or: don't squash PRs)

Github recently introduced the option to squash commits on merge, and even before then several projects requested that contributors squash their commits after review but before merge. This is a terrible idea that makes it more difficult for people to contribute to projects.

I'm spending today working on reworking some code to integrate with a new feature that was just integrated into Kubernetes. The PR in question was absolutely fine, but just before it was merged the entire commit history was squashed down to a single commit at the request of the reviewer. This single commit contains type declarations, the functionality itself, the integration of that functionality into the scheduler, the client code and a large pile of autogenerated code.

I've got some familiarity with Kubernetes, but even then this commit is difficult for me to read. It doesn't tell a story. I can't see its growth. Looking at a single hunk of this diff doesn't tell me whether it's infrastructural or part of the integration. Given time I can (and have) figured it out, but it's an unnecessary waste of effort that could have gone towards something else. For someone who's less used to working on large projects, it'd be even worse. I'm paid to deal with this. For someone who isn't, the probability that they'll give up and do something else entirely is even greater.

I don't want to pick on Kubernetes here - the fact that this Github feature exists makes it clear that a lot of people feel that this kind of merge is a good idea. And there are certainly cases where squashing commits makes sense. Commits that add broken code and which are immediately followed by a series of "Make this work" commits also impair readability and distract from the narrative that your RCS history should present, and Github presents this feature as a way to get rid of them. But that ends up being a false dichotomy. A history that looks like "Commit", "Revert Commit", "Revert Revert Commit", "Fix broken revert", "Revert fix broken revert" is a bad history, as is a history that looks like "Add 20,000 line feature A", "Add 20,000 line feature B".

When you're crafting commits for merge, think about your commit history as a textbook. Start with the building blocks of your feature and make them one commit. Build your functionality on top of them in another. Tie that functionality into the core project and make another commit. Add client support. Add docs. Include your tests. Allow someone to follow the growth of your feature over time, with each commit being a chapter of that story. And never, ever, put autogenerated code in the same commit as an actual functional change.

People can't contribute to your project unless they can understand your code. Writing clear, well commented code is a big part of that. But so is showing the evolution of your features in an understandable way. Make sure your RCS history shows that, otherwise people will go and find another project that doesn't make them feel frustrated.

(Edit to add: Sarah Sharp wrote on the same topic a couple of years ago)

comment count unavailable comments

May 20, 2016 12:06 AM

May 17, 2016

Gustavo F. Padovan: Collabora contributions to Linux Kernel 4.6

Linux Kernel 4.6 was released this week, and a total of 9 Collabora engineers took part in its development, Collabora’s highest number of engineers contributing to a single Linux Kernel release yet. In total Collabora contributed 42 patches.

As part of Collabora’s continued commitment to further increase its participation to the Linux Kernel, Collabora is actively looking to expand its team of core software engineers. If you’d like to learn more, follow this link.

Here are some highlights of Collabora’s participation in Kernel 4.6:

Andrew Shadura fixed the number of buttons reported on the Pemount 6000 USB touchscreen controller, while Daniel Stone enabled BCM283x family devices in the ARM multi_v7_defconfig and Emilio López added module autoloading for a few sunxi devices.

Enric Balletbo i Serra added boot console output to AM335X(Sitara) and OMAP3-IGEP and fixed audio codec setup on AM335X using the right external clock. Martyn Welch added the USB device ID for the GE Healthcare cp210x serial device and renamed the reset reason of the Zodiac Watchdog.

Gustavo Padovan cleaned up the Android Sync Framework on the staging tree for further de-staging of the Sync File infrastructure, which will land in 4.7. Most of the work was removing interfaces that won’t be used in mainline. He also added vblank event support for atomic commits in the virtio DRM driver.

Peter Senna improved an error path and added some style fixes to the sisusbvga driver. Sjoerd Simons enabled wireless on Radxa Rock2 boards, fixed an issue with the brcmfmac sdio driver sometimes timing out with a false positive, and fixed some issues with serial output on the Renesas R-Car Porter board.

Tomeu Vizoso changed driver_match_device() to return errors and, in the case of -EPROBE_DEFER, queue the device for deferred probing. He also provided two fixes to the Rockchip DRM driver as part of his work on making intel-gpu-tools work on other platforms.

Following is a list of all patches submitted by Collabora for this kernel release:

Andrew Shadura (1):

Daniel Stone (1):

Emilio López (4):

Enric Balletbo i Serra (3):

Gustavo Padovan (17):

Martyn Welch (2):

Peter Senna Tschudin (4):

Sjoerd Simons (6):

Tomeu Vizoso (4):

May 17, 2016 06:15 PM

May 12, 2016

Matthew Garrett: Convenience, security and freedom - can we pick all three?

Moxie, the lead developer of the Signal secure communication application, recently blogged on the tradeoffs between providing a supportable federated service and providing a compelling application that gains significant adoption. There's a set of perfectly reasonable arguments around that that I don't want to rehash - regardless of feelings on the benefits of federation in general, there's certainly an increase in engineering cost in providing a stable intra-server protocol that still allows for addition of new features, and the person leading a project gets to make the decision about whether that's a valid tradeoff.

One voiced complaint about Signal on Android is the fact that it depends on the Google Play Services. These are a collection of proprietary functions for integrating with Google-provided services, and Signal depends on them to provide a good out of band notification protocol to allow Signal to be notified when new messages arrive, even if the phone is otherwise in a power saving state. At the time this decision was made, there were no terribly good alternatives for Android. Even now, nobody's really demonstrated a free implementation that supports several million clients and has no negative impact on battery life, so if your aim is to write a secure messaging client that will be adopted by as many people as possible, keeping this dependency is entirely rational.

On the other hand, there are users for whom the decision not to install a Google root of trust on their phone is also entirely rational. I have no especially good reason to believe that Google will ever want to do something inappropriate with my phone or data, but it's certainly possible that they'll be compelled to do so against their will. The set of people who will ever actually face this problem is probably small, but it's probably also the set of people who benefit most from Signal in the first place.

(Even ignoring the dependency on Play Services, people may not find the official client sufficient - it's very difficult to write a single piece of software that satisfies all users, whether that be down to accessibility requirements, OS support or whatever. Slack may be great, but there's still people who choose to use Hipchat)

This shouldn't be a problem. Signal is free software and anybody is free to modify it in any way they want to fit their needs, and as long as they don't break the protocol code in the process it'll carry on working with the existing Signal servers and allow communication with people who run the official client. Unfortunately, Moxie has indicated that he is not happy with forked versions of Signal using the official servers. Since Signal doesn't support federation, that means that users of forked versions will be unable to communicate with users of the official client.

This is awkward. Signal is deservedly popular. It provides strong security without being significantly more complicated than a traditional SMS client. In my social circle there's massively more users of Signal than any other security app. If I transition to a fork of Signal, I'm no longer able to securely communicate with them unless they also install the fork. If the aim is to make secure communication ubiquitous, that's kind of a problem.

Right now the choices I have for communicating with people I know are either convenient and secure but require non-free code (Signal), convenient and free but insecure (SMS) or secure and free but horribly inconvenient (gpg). Is there really no way for us to work as a community to develop something that's all three?

comment count unavailable comments

May 12, 2016 02:40 PM

May 11, 2016

Daniel Vetter: Neat drm/i915 Stuff for 4.7

The 4.6 release is almost out of the door, it's time to look at what's in store for 4.7.
Let's first look at the epic saga called atomic support. In 4.7 the atomic watermark update support for Ironlake through Broadwell from Matt Roper, Ville Syrjälä and others finally landed. This took about 3 attempts to get merged because there's lots of small little corner cases that caused regressions each time around, but it's finally done. And it's an absolutely key piece for atomic support, since Intel hardware does not support atomic updates of the watermark settings for the display fetch fifos. And if those values are wrong, tearing and other ugly things will result. We still need corresponding support for other platforms, but this is a really big step. But that's not the only atomic work: Maarten Lankhorst made the hardware state checker atomic, and there's been tons of smaller things all over to move the driver towards the shiny new.

Another big feature on the display side is color management, implemented by Lionel Landwerlin, with fixes from Maarten to make it fully atomic. Color management aims for more accurate reproduction of a well-defined color space on panels, using a de-gamma table, then a color matrix, and finally a gamma table.

For platform enabling the big thing is support for DSI panels on Broxton from Jani Nikula and Ramalingam C. One fallout from this effort is the cleaned up VBT parsing code, done by Jani. There's now a clean split between parsing the different VBT versions on all the various platforms, now neatly consolidated, and using that information in different places within the driver. Ville also hooked up upscaling/panel fitting for DSI panels on all platforms.

Looking more at driver internals, Ander Conselvan de Oliveira and Ville refactored the entire display PLL code on all platforms, with the goal of reusing it in the DP detection code for upfront link training. This is needed to detect the link configuration in certain situations like USB Type-C connectors. Shubhangi Shrivastava reworked the DP detection code itself, again to prep for these features. Still on pure display topics, Ville fixed lots of underrun issues to appease our CI on lots of platforms. Together with the atomic watermark updates this should shut up one of the largest sources of noise in our test results.

Moving on to power management work the big thing is lots of small fixes for the runtime PM support all over the place from Imre Deak and Ville, with a big focus on the Broxton platform. And while we talk features affecting the entire driver: Imre added fault injection to the driver load paths so that we can start to exercise all that code in an automated way.

Finally, looking at the render/GEM side of the driver, the short summary is that Tvrtko Ursulin and Chris Wilson worked the code all over the place: cleaned up and tuned forcewake handling code from Tvrtko, fixes for more userptr corner cases from Chris, a new notifier to handle vmap exhaustion and assorted polish in the related shrinker code, cleaned up and fixed handling of GPU reset corner cases, fixes for context-related hard hangs on Sandybridge and Ironlake, large-scale renaming of parameters and structures to realign old code with the newish execlist hardware mode; the list goes on. And finally a rather big piece, one which caused some trouble, is all the work to speed up the execlist code, with a big focus on reducing interrupt handling overhead. This was done by moving the expensive parts of execlist interrupt handling into a tasklet. Unfortunately that uncovered some bugs in our interrupt handling on Braswell, so Ville jumped in and fixed it all up, plus of course removed some cruft and applied some nice polish.

Other work in the GT area includes GPU hang fixes for Skylake GT3 and GT4 configurations from Mika Kuoppala. Mika also provided patches to improve the eDRAM handling on those same chips. Alex Dai and Dave Gordon kept working on making GuC ready for prime time, but it's not quite there yet. And Peter Antoine improved the MOCS support to work on all engines.

And of course there's been tons of smaller improvements, bugfixes, cleanups and refactorings all over the place, as usual.

May 11, 2016 03:02 PM

Michael Kerrisk (manpages): man-pages-4.06 is released

I've released man-pages-4.06. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 20 contributors. The release includes changes to just over 40 man pages. Among the more significant changes in man-pages-4.06 are the following:

May 11, 2016 06:46 AM

May 06, 2016

Pete Zaitcev: Dropbox lifts the kimono

Dropbox posted somewhat of a whitepaper about their exabyte storage system, which exceeds the largest Swift cluster by about 2 orders of magnitude. Here's a couple of fun quotes:

The Block Index is a giant sharded MySQL cluster, fronted by an RPC service layer, plus a lot of tooling for database operations and reliability. We’d originally planned on building a dedicated key-value store for this purpose but MySQL turned out to be more than capable.

Kinda like SQLite in Swift.

Cells are self-contained logical storage clusters that store around 50PB of raw data.

And they have dozens of those. Their cell has a master node BTW. Kinda like Ceph's PG, but unlike Swift.

RTWT

May 06, 2016 08:38 PM

May 05, 2016

LPC 2016: Tracing Microconference Accepted into 2016 Linux Plumbers Conference

After taking a break in 2015, Tracing is back at Plumbers this year! Tracing is heavily used throughout the Linux ecosystem, and provides an essential method for extracting information about the underlying code that is running on the system. Although tracing is simple in concept, effective usage and implementation can be quite involved.

Topics proposed for this year’s event include new features in the BPF compiler collection, perf, and ftrace; visualization frameworks; large-scale tracing and distributed debugging; always-on analytics and monitoring; do-it-yourself tracing tools; and, last but not least, a kernel-tracing wishlist.

We hope to see you there!

May 05, 2016 10:21 PM

May 04, 2016

Paul E. Mc Kenney: Tracing Microconference Accepted into 2016 Linux Plumbers Conference

After taking a break in 2015, Tracing is back at Plumbers this year! Tracing is heavily used throughout the Linux ecosystem, and provides an essential method for extracting information about the underlying code that is running on the system. Although tracing is simple in concept, effective usage and implementation can be quite involved.

Topics proposed for this year's event include new features in the BPF compiler collection, perf, and ftrace; visualization frameworks; large-scale tracing and distributed debugging; always-on analytics and monitoring; do-it-yourself tracing tools; and, last but not least, a kernel-tracing wishlist.

We hope to see you there!

May 04, 2016 03:18 PM

April 28, 2016

Pete Zaitcev: OpenStack Swift Proxy-FS by SwiftStack

SwiftStack's Joe Arnold and John Dickinson chose the Austin Summit and a low-key, #vBrownBag venue, to come out of closet with PROXY-FS (also spelled as ProxyFS), a tightly integrated addition to OpenStack Swift, which provides a POSIX-ish filesystem access to a Swift cluster.

Proxy-FS is basically a peer to a less known feature of Ceph Rados Gateway that permits accessing it over NFS. Both of them are fundamentally different from e.g. Swift-on-file in that the data is kept in Swift or Ceph, instead of a general filesystem.

The object layout is natural in that it takes advantage of SLO by creating a log-structured, manifested object. This way in-place updates are handled, including appends. Yes, you can create a manifest with a billion 1-byte objects just by invoking write(2). So, don't do that.
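For readers who haven't met SLO: a Swift static large object is created by uploading a JSON manifest that lists the segment objects. Below is a minimal sketch of what a log-structured file's manifest might look like under this scheme; the container names, sizes, and three-segment layout are invented for illustration and are not Proxy-FS's actual format.

```python
import json

# Hypothetical sketch of a Swift SLO manifest: a JSON array of segment
# descriptors, uploaded via a PUT with "?multipart-manifest=put".
# Paths and sizes below are invented; etag of None skips validation.
segments = [
    {"path": "/log_segments/file.log/0000", "etag": None, "size_bytes": 4096},
    {"path": "/log_segments/file.log/0001", "etag": None, "size_bytes": 4096},
    # An appended write simply becomes one more segment at the end.
    {"path": "/log_segments/file.log/0002", "etag": None, "size_bytes": 512},
]
manifest_body = json.dumps(segments)
print(manifest_body)
```

Under this scheme, an append through write(2) would amount to uploading the new bytes as one more segment object and re-PUTting the manifest, which is also why the billion-segment degenerate case is possible.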

In response to my question, Joe promised to open the source, although we don't know when.

Another question dealt with the performance expectations. The small I/O performance of Proxy-FS is not going to be great in comparison to a traditional NFS filer. One of its key features is relative transparency: there is no cache involved and every application request goes straight to Swift. This helps to adhere to the principle of the least surprise, as well as achieve scalability for which Swift is famous. There is no need for any upcalls/cross-calls from the Swift Proxy into Proxy-FS that invalidate the cache, because there's no cache. But it has to be understood that Proxy-FS, as well as NFS mode in RGW, are not intended to compete with Netapp.

Not directly, anyway. But what they could do is to disrupt, in Christensen's sense. His disruption examples were defined as technologies that are markedly inferior to incumbents, as well as dramatically cheaper. Swift and Ceph are both: the filesystem performance sucks balls and the price per terabyte is 1/10th of NetApp (this statement was not evaluated by the Food and Drug Administration). If new applications come about that make use of these properties... You know the script.

April 28, 2016 04:42 PM

Daniel Vetter: X.Org Foundation Election Results

Two questions were up for voting, 4 seats on the Board of Directors and approval of the amended By-Laws to join SPI.

Congratulations to our reelected and new board members Egbert Eich, Alex Deucher, Keith Packard and Bryce Harrington. Thanks a lot to Lucas Stach for running. And also big thanks to our outgoing board member Matt Dew, who stepped down for personal reasons.

On the bylaw changes and merging with SPI, 61 out of 65 active members voted, with 54 voting yes, 4 no and 3 abstained. Which means we're well past the 2/3rd quorum for bylaw changes, and everything's green now to proceed with the plan to join SPI!

April 28, 2016 06:37 AM

April 27, 2016

James Bottomley: Unprivileged Build Containers

A while ago, a goal I set myself was to be able to maintain my build and test environments for architecture emulation containers without having to do any of the tasks as root and without creating any suid binaries to do this.  One of the big problems here is that distributions get annoyed (and don’t run correctly) if root doesn’t own most of the files … for instance the installers all check to see that the file got installed with the correct ownership and permissions and fail if they don’t.  Debian has an interesting mechanism, called fakeroot, to get around this using a preload library intercepting the chmod and chown system calls, but it’s getting a bit hackish to try to extend this to work permanently for an emulation container.

The correct way to do this is with user namespaces so the rest of this post will show you how.  Before we get into how to use them, lets begin with the theory of how user namespaces actually work.

Theory of User Namespaces

A user namespace is the single namespace that can be created by an unprivileged user.  Its job is to map a set of interior (inside the user namespace) uids, gids and projids1 to a set of exterior (outside the user namespace) ones.

The way this works is that the root user namespace simply has a 1:1 identity mapping of all 2^32 identifiers, meaning it fully covers the space.  However, any new user namespace need only remap a subset of these.  Any id that is not mapped into the user namespace becomes inaccessible to that namespace.  This doesn’t mean completely inaccessible, it just means any resource owned or accessed by an unmapped id treats an attempted access (even from root in the namespace) as though it were completely unprivileged, so if the resource is readable by any id, it can still be read even in a user namespace where its owning id is unmapped.

User namespaces can also be nested but the nested namespace can only map ids that exist in its parent, so you can only reduce but not expand the id space by nesting.  The way the nested mapping works is that it remaps through all the parent namespaces, so what appears on the resource is still the original exterior ids.

User Namespaces also come with an owner (the uid/gid of the process that created the container).  The reason for this is that this owner is allowed to execute setuid/setgid to any id mapped into the namespace, so the owning uid/gid pair is the effective “root” of the container.  Note that this setuid/setgid property works on entry to the namespace even if uid 0 is not mapped inside the namespace, but won’t survive once the first setuid/setgid is issued.

The final piece of the puzzle is that every other namespace also has an owning user namespace, so while I cannot create a mount namespace as unprivileged user jejb, I can as remapped root inside my user namespace

jejb@jarvis:~> unshare --mount
unshare: unshare failed: Operation not permitted
jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# unshare --mount
root@jarvis:~#

And once created, I can always enter this mount namespace provided I’m also in my user namespace.

Setting up Unprivileged Namespaces

Any system user can actually create a user namespace.  However, a non-root (meaning not uid zero in the parent namespace) user cannot remap any exterior id except their own.  This means that, because a build container needs a range of ids, it’s not possible to set up the initial remapped namespace without the help of root.  However, once that is done, the user can pretty much do every other operation2

The way remap ranges are set up is via the uid_map, gid_map and projid_map files sitting inside the /proc/<pid> directory.  These files may only be written to once and never updated3

As an example, to set up a build container, I need a remapping for every id that would be created during installation.  Traditionally for Linux, these are ids 0-999.  I want to remap them down to something unprivileged, say 100,000 so my line entry for this is

0 100000 1000

However, I also want an identity mapping for my own id (currently I’m at uid 1000), so I can still use my home directory from within the build container.  This also means I can create the roots for the containers within my home directory.  Finally, the nobody user and nobody,nogroup groups also need to be mapped, so the final uid map entries look like

0 100000 1000
1000 1000 1
65534 101001 1

For the groups, it’s even more complex because on openSUSE, I’m a member of the users group (gid 100) which sits in the middle of the privileged 0-999 group range, so the gid_map entry I use is

0 100000 100
100 100 1
101 100100 899
65533 101000 2

Which is almost up to the kernel imposed limit of five separate lines.
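The translation those map lines describe is just a range lookup. As an illustrative sketch (pure illustration, not kernel code), using the gid_map extents above:

```python
# Illustration of how the kernel translates ids through map extents.
# Each extent is (interior_start, exterior_start, count), matching the
# "interior exterior count" lines written to gid_map in the post.
GID_MAP = [
    (0, 100000, 100),
    (100, 100, 1),
    (101, 100100, 899),
    (65533, 101000, 2),
]

def to_exterior(interior_id, extents):
    """Map an in-namespace id to the id seen outside (e.g. on disk).

    Unmapped ids return None; the kernel reports such ids inside the
    namespace as the overflow id (65534, nobody/nogroup).
    """
    for interior, exterior, count in extents:
        if interior <= interior_id < interior + count:
            return exterior + (interior_id - interior)
    return None

print(to_exterior(0, GID_MAP))      # interior root group -> 100000
print(to_exterior(100, GID_MAP))    # users group, identity-mapped -> 100
print(to_exterior(65534, GID_MAP))  # nogroup -> 101001
```

This also makes it clear why the users group (gid 100) needed its own one-line identity extent: the first extent would otherwise have carried it to 100100.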

Finally, here’s how to set this up and create a binding for the user namespace.  As myself (I’m uid 1000 user name jejb) I do

jejb@jarvis:~> unshare --user
nobody@jarvis:~> echo $$
20211
nobody@jarvis:~>

Note that I become nobody inside the container because currently the map files are unwritten so there are no mapped ids at all.  Now as root, I have to write the mapping files and bind the entry file to the namespace somewhere

jarvis:/home/jejb # echo 1|awk '{print "0 100000 1000\n1000 1000 1\n65534 101001 1"}' > /proc/20211/uid_map
jarvis:/home/jejb # echo 1|awk '{print "0 100000 100\n100 100 1\n101 100100 899\n65533 101000 2"}' > /proc/20211/gid_map
jarvis:/home/jejb # touch /tmp/userns
jarvis:/home/jejb # mount --bind /proc/20211/ns/user /tmp/userns

Now I can exit my user namespace because it’s permanently bound and the next time I enter it I become root inside the container (although with uid 100000 outside)

jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:~# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

Giving me a user namespace with sufficient mapped ids to create a build container.

Unprivileged Architecture Emulation Containers

Essentially, I can use the user namespace constructed above to bootstrap and enter the entire build container and its mount namespace, with the one proviso that I have to have a pre-created devices directory because I don’t possess the mknod capability as myself, so my container root also doesn’t possess it.  The way I get around this is to create the initial dev directory as root and then change the ownership to 100000.100000 (my unprivileged ids)

jejb@jarvis:~/containers/debian-amd64/dev> ls -l
total 0
lrwxrwxrwx 1 100000 100000 13 Feb 20 09:45 fd -> /proc/self/fd/
crw-rw-rw- 1 100000 100000 1, 7 Feb 20 09:45 full
crw-rw-rw- 1 100000 100000 1, 3 Feb 20 09:45 null
lrwxrwxrwx 1 100000 100000 8 Feb 20 09:45 ptmx -> pts/ptmx
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 pts/
crw-rw-rw- 1 100000 100000 1, 8 Feb 20 09:45 random
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 shm/
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdout -> /proc/self/fd/1
crw-rw-rw- 1 100000 100000 5, 0 Feb 20 09:45 tty
crw-rw-rw- 1 100000 100000 1, 9 Feb 20 09:45 urandom
crw-rw-rw- 1 100000 100000 1, 5 Feb 20 09:45 zero

This seems to be a sufficient dev skeleton to function with.  For completeness’ sake, I placed my bound user namespace into /run/build-container/userns and followed the original architecture container emulation post, with modified susebootstrap and build-container scripts.  The net result is that as myself I can now enter, administer and update the build and test architecture emulation container with no suid required

jejb@jarvis:~> nsenter --user=/run/build-container/userns --mount=/run/build-container/ppc64
root@jarvis:/# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:/# uname -m
ppc64
root@jarvis:/# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

The only final wrinkle is that root has to set up the user namespace on every boot, but that’s only because there’s no currently defined operating system way of doing this.

April 27, 2016 06:52 PM

April 26, 2016

LPC 2016: Checkpoint-Restore Microconference Accepted into 2016 Linux Plumbers Conference

This year will feature a four-fold deeper dive into checkpoint-restore technology, thanks to participation by people from a number of additional related projects! These are the OpenMPI message-passing library, Berkeley Lab Checkpoint/Restart (BLCR), and Distributed MultiThreaded CheckPointing (DMTCP) (not to be confused with TCP/IP), in addition to the Checkpoint/Restore in Userspace group that has participated in prior years.

Docker integration remains a hot topic, as is post-copy live migration, as well as testing/validation. As you might guess from the inclusion of people from BLCR and OpenMPI, checkpoint-restore for distributed workloads (rather than just single systems) is an area of interest.

Please join us for a timely and important discussion!

April 26, 2016 02:43 PM

April 25, 2016

Paul E. Mc Kenney: Checkpoint-Restore Microconference Accepted into 2016 Linux Plumbers Conference

This year will feature a four-fold deeper dive into checkpoint-restore technology, thanks to participation by people from a number of additional related projects! These are the OpenMPI message-passing library, Berkeley Lab Checkpoint/Restart (BLCR), and Distributed MultiThreaded CheckPointing (DMTCP) (not to be confused with TCP/IP), in addition to the Checkpoint/Restore in Userspace group that has participated in prior years.

Docker integration remains a hot topic, as is post-copy live migration, as well as testing/validation. As you might guess from the inclusion of people from BLCR and OpenMPI, checkpoint-restore for distributed workloads (rather than just single systems) is an area of interest.

Please join us for a timely and important discussion!

April 25, 2016 05:28 PM

April 24, 2016

LPC 2016: Testing & Fuzzing Microconference Accepted into 2016 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have made the Linux ecosystem much more robust than in the past, but there are still embarrassing bugs. Furthermore, million-year bugs will be happening many times per day across Linux’s huge installed base, so there is clearly a need for even more aggressive validation.

The Testing and Fuzzing Microconference aims to significantly increase the aggression level of Linux-kernel validation, with discussions on tools and test suites including kselftest, syzkaller, trinity, mutation testing, and the 0day Test Robot. The effectiveness of these tools will be attested to by any of their victims, but we must further raise our game as the installed base of Linux continues to increase.

One additional way of raising the level of testing aggression is to document the various ABIs in machine-readable format, thus lowering the barrier to entry for new projects. Who knows? Perhaps Linux testing will be driven by artificial-intelligence techniques!

Join us for an important and spirited discussion!

April 24, 2016 07:43 PM

Daniel Vetter: Should the X.org Foundation join SPI? Vote Now!



In case you missed it: Please vote now on https://members.x.org/login.php!

April 24, 2016 09:14 AM

April 22, 2016

LPC 2016: Device Tree Microconference Accepted into 2016 Linux Plumbers Conference

Device-tree discussions are probably not quite as spirited as in the past; however, device tree is still an active area. In particular, significant issues remain.

This microconference will cover the updated device-tree specification, debugging, bindings validation, and core-kernel code directions. In addition, there will be discussion of what does (and does not) go into the device tree, along with the device-creation and driver binding ordering swamp.

Join us for an important and spirited discussion!

April 22, 2016 01:33 PM

Matthew Garrett: Circumventing Ubuntu Snap confinement

Ubuntu 16.04 was released today, with one of the highlights being the new Snap package format. Snaps are intended to make it easier to distribute applications for Ubuntu - they include their dependencies rather than relying on the archive, they can be updated on a schedule that's separate from the distribution itself and they're confined by a strong security policy that makes it impossible for an app to steal your data.

At least, that's what Canonical assert. It's true in a sense - if you're using Snap packages on Mir (ie, Ubuntu mobile) then there's a genuine improvement in security. But if you're using X11 (ie, Ubuntu desktop) it's horribly, awfully misleading. Any Snap package you install is completely capable of copying all your private data to wherever it wants with very little difficulty.

The problem here is the X11 windowing system. X has no real concept of different levels of application trust. Any application can register to receive keystrokes from any other application. Any application can inject fake key events into the input stream. An application that is otherwise confined by strong security policies can simply type into another window. An application that has no access to any of your private data can wait until your session is idle, open an unconfined terminal and then use curl to send your data to a remote site. As long as Ubuntu desktop still uses X11, the Snap format provides you with very little meaningful security. Mir and Wayland both fix this, which is why Wayland is a prerequisite for the sandboxed xdg-app design.

I've produced a quick proof of concept of this. Grab XEvilTeddy from git, install Snapcraft (it's in 16.04), snapcraft snap, sudo snap install xevilteddy*.snap, /snap/bin/xevilteddy.xteddy . An adorable teddy bear! How cute. Now open Firefox and start typing, then check back in your terminal window. Oh no! All my secrets. Open another terminal window and give it focus. Oh no! An injected command that could instead have been a curl session that uploaded your private SSH keys to somewhere that's not going to respect your privacy.

The Snap format provides a lot of underlying technology that is a great step towards being able to protect systems against untrustworthy third-party applications, and once Ubuntu shifts to using Mir by default it'll be much better than the status quo. But right now the protections it provides are easily circumvented, and it's disingenuous to claim that it currently gives desktop users any real security.


April 22, 2016 01:51 AM

April 20, 2016

Paul E. Mc Kenney: Testing & Fuzzing Microconference Accepted into 2016 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have made the Linux ecosystem much more robust than in the past, but there are still embarrassing bugs. Furthermore, million-year bugs will be happening many times per day across Linux's huge installed base, so there is clearly a need for even more aggressive validation.

The Testing and Fuzzing Microconference aims to significantly increase the aggression level of Linux-kernel validation, with discussions on tools and test suites including kselftest, syzkaller, trinity, mutation testing, and the 0day Test Robot. The effectiveness of these tools will be attested to by any of their victims, but we must further raise our game as the installed base of Linux continues to increase.

One additional way of raising the level of testing aggression is to document the various ABIs in machine-readable format, thus lowering the barrier to entry for new projects. Who knows? Perhaps Linux testing will be driven by artificial-intelligence techniques!

Join us for an important and spirited discussion!

April 20, 2016 04:43 PM

Daniel Vetter: X.org Foundation Election - Vote Now!

It's election season in X.org land, and it matters: Besides new board seats we're also voting on bylaw changes and whether to join SPI or not.

Personally, and as the secretary of the board I'm very much in favour of joining SPI. It will allow us to offload all the boring bits of running a foundation, and those are also all the bits we tend to struggle with. And that would give the board more time to do things that actually matter and help the community. And all that for a really reasonable price - running our own legal entity isn't free, and not really worth it for our small budget mostly consisting of travel sponsoring and the occasional internship.

And bylaw changes need a qualified supermajority of all members, every vote counts and not voting essentially means voting no. Hence please vote, and please vote even when you don't want to join - this is our second attempt and I'd really like to see a clear verdict from our members, one way or the other.

Thanks.

Voting closes by Apr 26 23:59 UTC, but please don't cut it short, it's a computer that decides when it's over ...

April 20, 2016 07:24 AM

April 18, 2016

Paul E. Mc Kenney: Device Tree Microconference Accepted into 2016 Linux Plumbers Conference

Device-tree discussions are probably not quite as spirited as in the past; however, device tree is still an active area. In particular, significant issues remain.

This microconference will cover the updated device-tree specification, debugging, bindings validation, and core-kernel code directions. In addition, there will be discussion of what does (and does not) go into the device tree, along with the device-creation and driver binding ordering swamp.

Join us for an important and spirited discussion!

April 18, 2016 03:21 PM

Matthew Garrett: One more attempt at SATA power management

Around a year ago I wrote some patches in an attempt to improve power management on Haswell and Broadwell systems by configuring Serial ATA power management appropriately. I got a couple of reports of them triggering SATA errors for some users, couldn't reproduce them myself and so didn't have a lot of confidence in them. Time passed.

I've been working on power management stuff again this week, so it seemed like a good opportunity to revisit these. I've made a few changes and pushed a couple of trees - one against master and one against 4.5.

First, these probably only have relevance to users of mobile Intel parts in the U or S range (/proc/cpuinfo will tell you - you're looking for a four-digit number that starts with 4 (Haswell), 5 (Broadwell) or 6 (Skylake) and ends with U or S), and won't do anything unless you have SATA drives (including PCI-based SATA). To test them, first disable anything like TLP that might alter your SATA link power management policy. Then check powertop - you should only be getting to PC3 at best. Build a kernel with these patches and boot it. /sys/class/scsi_host/*/link_power_management_policy should read "firmware". Check powertop and see whether you're getting into deeper PC states. Now run your system for a while and check the kernel log for any SATA errors that you didn't see before.
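The sysfs check described above is easy to script; here is a small sketch that flags any SCSI host whose link power management policy is not "firmware". The check logic and function names are mine, not part of the patches.

```python
import glob

def hosts_not_firmware(policies):
    """Return the host names whose link PM policy is not 'firmware'."""
    return sorted(h for h, p in policies.items() if p != "firmware")

def read_policies():
    """Read link_power_management_policy for every SCSI host in sysfs."""
    policies = {}
    for path in glob.glob(
            "/sys/class/scsi_host/*/link_power_management_policy"):
        host = path.split("/")[-2]
        try:
            with open(path) as f:
                policies[host] = f.read().strip()
        except OSError:
            pass  # host went away or policy not readable; skip it
    return policies

if __name__ == "__main__":
    found = read_policies()
    if not found:
        print("no SATA hosts with a link PM policy found")
    for host in hosts_not_firmware(found):
        print("%s: policy is %r, expected 'firmware'" % (host, found[host]))
```

On a system without the patches the policy will typically read "max_performance" or "medium_power" instead, so a non-empty report is expected there.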

Let me know if you see SATA errors and are willing to help debug this, and leave a comment if you don't see any improvement in PC states.


April 18, 2016 02:15 AM

April 17, 2016

LPC 2016: PCI Microconference Accepted into 2016 Linux Plumbers Conference

Given that PCI was introduced more than two decades ago and that PCI Express was introduced more than ten years ago, one might think that the Linux plumbing already did everything possible to support PCI.

One would be quite wrong.

One issue with current PCI support is that resource allocation is handled on a per-architecture basis, leading to duplicate code, and, worse yet, duplicate bugs. This microconference will therefore look into possible consolidation of this code.

Another issue is integration of PCI’s message-signaled interrupts (MSI) into the kernel’s core interrupt code. Whilst the device tree bindings and related core code for MSI management have just been integrated, legacy code needs more work, including architecture-specific code and PCI legacy host controller drivers. In addition, attention should be paid to the ACPI bindings, to the related ACPI kernel core code, and to the MSI passthrough usage in virtualized environments.

Advances in Virtualization technologies, System I/O devices and their memory management through IOMMU components are driving features in the PCI Express specifications (Address Translation Services (ATS) and Page Request Interface (PRI)). This microconference will also foster debate on the best way to integrate these features in the kernel in a seamless way for different architectures.

Of course, no discussion of PCI would be complete without considering firmware issues, hardware quirks, power management, and the interplay between device tree and ACPI.

Join us for an important new-to-Plumbers PCI discussion!

April 17, 2016 04:01 PM

April 15, 2016

Pavel Machek: Python is a nice... trap

Python is a nice... language

Easy to program in. No need to recompile, you can hack in as you would in shell. Fun.
Let's see if I can predict the weather. Oh yes, let's hack something in Python. Hey, it works. But it is slow on a PC, and so slow on a phone that the future is already past by the time you predict it.
Then you try to improve the speed. That's a very, very bad idea. You can't optimize Python, but you can spend hours trying. You make a mental note to never ever use Python for computation again.
But Python should be fine for simple Gtk project, communicating over Dbus, right?
For bigger projects, Python has some support. Modules / object orientation really help there. Lack of compilation really hurts, but this is a fun language, right? Allowing you to write nice and clean code.
Refactoring is hard, because variables are not declared.
Python is a nice... trap
But then your project grows larger, and you realize it takes 30 seconds to start up. Because many modules need gtk, and importing gtk takes time. Heck... compiling _and_ running a C-based hello world is still faster than running the Python-based one.
Ok, C++ was tempting. Gtk Hello world takes 55 seconds to compile. My g++ does not support "auto", so the application starts with Glib::RefPtr<Gtk::Application> app = Gtk::Application::create(argc, argv, "org.gtkmm.example"). I think I'll stick with C.

Oh, and I already used my main SIM in N900/Debian... and I really used N900/Debian. I needed to tether a network, but not provide wifi hotspot. Done.

April 15, 2016 11:17 AM

Matthew Garrett: David MacKay

The first time I was paid to do software development came as something of a surprise to me. I was working as a sysadmin in a computational physics research group when a friend asked me if I'd be willing to talk to her PhD supervisor. I had nothing better to do, so said yes. And that was how I started the evening having dinner with David MacKay, and ended the evening better fed, a little drunker and having agreed in principle to be paid to write free software.

I'd been hired to work on Dasher, an information-efficient text entry system. It had been developed by one of David's students as a practical demonstration of arithmetic encoding after David had realised that presenting a visualisation of an effective compression algorithm allowed you to compose text without having to enter as much information into the system. At first this was merely a neat toy, but it soon became clear that the benefits of Dasher had a great deal of overlap with good accessibility software. It required much less precision of input, it made it easy to correct mistakes (you merely had to reverse direction in order to start zooming back out of the text you had entered) and it worked with a variety of input technologies from mice to eye tracking to breathing. My job was to take this codebase and turn it into a project that would be interesting to external developers.

In the year I worked with David, we turned Dasher from a research project into a well-integrated component of Gnome, improved its support for Windows, accepted code from an external contributor who ported it to OS X (using an OpenGL canvas!) and wrote ports for a range of handheld devices. We added code that allowed Dasher to directly control the UI of other applications, making it possible for people to drive word processors without having to leave Dasher. We taught Dasher to speak. We strove to avoid the mistakes present in so many other pieces of accessibility software, such as configuration that could only be managed by an (expensive!) external consultant. And we visited Dasher users and learned how they used it and what more they needed, then went back home and did what we could to provide that.

Working on Dasher was an incredible opportunity. I was involved in the development of exciting code. I spoke on it at multiple conferences. I became part of the Gnome community. I visited the USA for the first time. I entered people's homes and taught them how to use Dasher and experienced their joy as they realised that they could now communicate up to an order of magnitude more quickly. I wrote software that had a meaningful impact on the lives of other people.

Working with David was certainly not easy. Our weekly design meetings were, charitably, intense. He had an astonishing number of ideas, and my job was to figure out how to implement them while (a) not making the application overly complicated and (b) convincing David that it still did everything he wanted. One memorable meeting involved me gradually arguing him down from wanting five new checkboxes to agreeing that there were only two combinations that actually made sense (and hence a single checkbox) - and then admitting that this was broadly equivalent to an existing UI element, so we could just change the behaviour of that slightly without adding anything. I took the opportunity to delete an additional menu item in the process.

I was already aware of the importance of free software in terms of developers, but working with David made it clear to me how important it was to users as well. A community formed around Dasher, helping us improve it and allowing us to develop support for new use cases that made the difference between someone being able to type at two words per minute and being able to manage twenty. David saw that this collaborative development would be vital to creating something bigger than his original ideas, and it succeeded in ways he couldn't have hoped for.

I spent a year in the group and then went back to biology. David went on to channel his strong feelings about social responsibility into issues such as sustainable energy, writing a freely available book on the topic. He served as chief adviser to the UK Department of Energy and Climate Change for five years. And earlier this year he was awarded a knighthood for his services to scientific outreach.

David died yesterday. It's unlikely that I'll ever come close to what he accomplished, but he provided me with much of the inspiration to try to do so anyway. The world is already a less fascinating place without him.


April 15, 2016 06:26 AM

April 13, 2016

Matthew Garrett: Skylake's power management under Linux is dreadful and you shouldn't buy one until it's fixed

(Edit to add: this issue is restricted to the mobile SKUs. Desktop parts have very different power management behaviour)

Linux 4.5 seems to have got Intel's Skylake platform (ie, 6th-generation Core CPUs) to the point where graphics work pretty reliably, which is great progress (4.4 tended to lose all my windows every so often, especially over suspend/resume). I'm even running Wayland happily. Unfortunately one of the reasons I have a laptop is that I want to be able to do things like use it on battery, and power consumption's an important part of that. Skylake continues the trend from Haswell of moving to an SoC-type model where clock and power domains are shared between components that were previously entirely independent, and so you can't enter deep power saving states unless multiple components all have the correct power management configuration. On Haswell/Broadwell this manifested in the form of Serial ATA link power management being involved in preventing the package from going into deep power saving states - setting that up correctly resulted in a reduction in full-system power consumption of about 40%[1].

I've now got a Skylake platform with a nice shiny NVMe device, so Serial ATA policy isn't relevant (the platform doesn't even expose a SATA controller). The deepest power saving state I can get into is PC3, despite Skylake supporting PC8 - so I'm probably consuming about 40% more power than I should be. And nobody seems to know what needs to be done to fix this. I've found no public documentation on the power management dependencies on Skylake. Turning on everything in Powertop doesn't improve anything. My battery life is pretty poor and the system is pretty warm.

The best thing about this is the following statement from page 64 of the 6th Generation Intel® Processor Datasheet for U-Platforms:

Caution: Long term reliability cannot be assured unless all the Low-Power Idle States are enabled.

which is pretty concerning. Without support for states deeper than PC3, Linux is running in a configuration that Intel imply may trigger premature failure. That's obviously not good. Until this situation is improved, you probably shouldn't buy any Skylake systems if you're planning on running Linux.

[1] These patches never went upstream. Someone reported that they resulted in their SSD throwing errors and I couldn't find anybody with deeper levels of SATA experience who was interested in working on the problem. Intel's AHCI drivers for Windows do the right thing, but I couldn't find anybody at Intel who could get any information from their Windows driver team.

April 13, 2016 10:16 PM

Paul E. Mc Kenney: PCI Microconference Accepted into 2016 Linux Plumbers Conference

Given that PCI was introduced more than two decades ago and that PCI Express was introduced more than ten years ago, one might think that the Linux plumbing already did everything possible to support PCI.

One would be quite wrong.

One issue with current PCI support is that resource allocation is handled on a per-architecture basis, leading to duplicate code, and, worse yet, duplicate bugs. This microconference will therefore look into possible consolidation of this code.

Another issue is integration of PCI's message-signaled interrupts (MSI) into the kernel's core interrupt code. Whilst the device-tree bindings and related core code for MSI management have just been integrated, legacy code needs more work, including architecture-specific code and PCI legacy host-controller drivers. In addition, attention should be paid to the ACPI bindings, to the related ACPI kernel core code, and to MSI passthrough usage in virtualized environments.

Advances in virtualization technologies, system I/O devices, and their memory management through IOMMU components are driving new features in the PCI Express specifications, such as Address Translation Services (ATS) and Page Request Interface (PRI). This microconference will also foster debate on the best way to integrate these features into the kernel seamlessly across different architectures.

Of course, no discussion of PCI would be complete without considering firmware issues, hardware quirks, power management, and the interplay between device tree and ACPI.

Join us for an important new-to-Plumbers PCI discussion!

April 13, 2016 08:25 PM

April 12, 2016

LPC 2016: Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track Proposals for the 2016 edition of the Linux Plumbers Conference, which will be held in Santa Fe, NM, USA on November 2-4 in conjunction with the Linux Kernel Summit.

Refereed track presentations are 50 minutes in length and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with LinuxCon this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the refereed-track talks and the Microconferences, but there never is a truly free lunch.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration.

To submit a refereed-track talk proposal, follow the instructions at this website.

Submissions are due on or before Thursday 1 September, 2016 at 11:59PM Pacific Time. Since this is after the closure of early registration, speakers may register before this date and we’ll refund the registration for any selected presentation’s speaker, but for only one speaker per presentation.

Finally, we are still accepting Microconference submissions here.
Dates:

Submissions close: September 1, 2016
Speakers notified: September 22, 2016
Slides due: November 1, 2016

April 12, 2016 02:40 PM

LPC 2016: Live Kernel Patching Microconference Accepted into 2016 Linux Plumbers Conference

Live kernel patching was accepted into the Linux kernel in v4.0 in February 2015, so we can declare the 2014 LPC Live Kernel Patching Microconference to have been a roaring success! However, as was noted at the time, this is just the beginning of the real work. In short, the v4.0 work makes live kernel patching possible, but more work is required to make it more reliable and more routine.

Additional issues include stacktrace reliability, patch-safety criteria for kernel threads, thread consistency models, porting to non-x86 architectures, handling of loadable modules, compiler optimizations, userspace tooling, patching of data, automated regression testing, and patch-creation guidelines.

Join us for an important and spirited discussion!

April 12, 2016 01:01 AM

April 11, 2016

Paul E. Mc Kenney: 2016 Linux Plumbers Conference Call for Refereed-Track Proposals

Dates:

Submissions close: September 1, 2016
Speakers notified: September 22, 2016
Slides due: November 1, 2016

We are pleased to announce the Call for Refereed-Track Proposals for the 2016 edition of the Linux Plumbers Conference, which will be held in Santa Fe, NM, USA on November 2-4 in conjunction with the Linux Kernel Summit.

Refereed track presentations are 50 minutes in length and should focus on a specific aspect of the "plumbing" in the Linux system. Examples of Linux plumbing include core kernel subsystems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with LinuxCon this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the refereed-track talks and the Microconferences, but there never is a truly free lunch.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration.

To submit a refereed-track talk proposal, follow the instructions at this website.

Submissions are due on or before Thursday 1 September, 2016 at 11:59PM Pacific Time. Since this is after the closure of early registration, speakers may register before this date and we'll refund the registration for any selected presentation's speaker, but for only one speaker per presentation.

Finally, we are still accepting Microconference submissions here.

April 11, 2016 07:17 PM

Pavel Machek: pypy: surprisingly good

Using Python for weather forecasting was a mistake... but it came with some surprises. It looks like pypy is as fast as gcc -O0 on a simple correlation computation: https://gitlab.com/tui/tui/blob/master/nowcast/pypybench.py (and pypybench.c). gcc -O3 is twice as fast, plain python2 is 20x slower (taking 2000% of the time), and python3 is 28x slower. That's better than I expected from JIT technology.
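
For context, a plain-Python correlation loop of the kind such a benchmark exercises might look like this (a hypothetical sketch, not the actual pypybench.py; pure interpreter-bound float arithmetic is exactly where a JIT pays off):

```python
# Hypothetical sketch of an interpreter-bound correlation computation:
# plain lists and floats, no numpy, so the interpreter (or JIT) does
# all of the work.
def correlation(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sxx = syy = 0.0
    for x, y in zip(xs, ys):
        dx = x - mx
        dy = y - my
        sxy += dx * dy
        sxx += dx * dx
        syy += dy * dy
    return sxy / (sxx * syy) ** 0.5

print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # -> 1.0
```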

April 11, 2016 09:51 AM

Pavel Machek: Finally... power management on Nokia N900

After a long, long fight, it seems power management on the Nokia N900 works for me for the first time. The N900 is very picky about its configuration (select lockdep and you lose video; select something else and you get 50mA power consumption... not good). That was the last major piece... I hope. I should have a usable phone soon.

April 11, 2016 09:45 AM

Matthew Garrett: Making it easier to deploy TPMTOTP on non-EFI systems

I've been working on TPMTOTP a little this weekend. I merged a pull request that adds command-line argument handling, which includes the ability to choose the set of PCRs you want to seal to without rebuilding the tools, and also lets you print the base32 encoding of the secret rather than the qr code so you can import it into a wider range of devices. More importantly it also adds support for setting the expected PCR values on the command line rather than reading them out of the TPM, so you can now re-seal the secret against new values before rebooting.

I also wrote some new code myself. TPMTOTP is designed to be usable in the initramfs, allowing you to validate system state before typing in your passphrase. Unfortunately the initramfs itself is one of the things that's measured. So, you end up with something of a chicken-and-egg problem: TPMTOTP needs access to the secret, and the obvious thing to do is to put the secret in the initramfs. But the secret is sealed against the hash of the initramfs, and so you can't generate the secret until after the initramfs has been built. Modify the initramfs to insert the secret and you change the hash, so the secret is no longer released. Boo.

On EFI systems you can handle this by sticking the secret in an EFI variable (there's some special-casing in the code to deal with the additional metadata on the front of things you read out of efivarfs). But that's not terribly useful if you're not on an EFI system. Thankfully, there's a way around this. TPMs have a small quantity of nvram built into them, so we can stick the secret there. If you pass the -n argument to sealdata, that'll happen. The unseal apps will attempt to pull the secret out of nvram before falling back to looking for a file, so things should just magically work.

I think it's pretty feature complete now, other than TPM2 support? That's on my list.

April 11, 2016 05:59 AM

April 08, 2016

Rusty Russell: Bitcoin Generic Address Format Proposal

I’ve been implementing segregated witness support for c-lightning; it’s interesting that there’s no address format for the new form of addresses.  There’s a segregated-witness-inside-p2sh which uses the existing p2sh format, but if you want raw segregated witness (which is simply a “0” followed by a 20-byte or 32-byte hash), the only proposal is BIP142 which has been deferred.

If we’re going to have a new address format, I’d like to make the case for shifting away from bitcoin’s base58 (eg. 1At1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2):

  1. base58 is not trivial to parse.  I used the bignum library to do it, though you can open-code it as bitcoin-core does.
  2. base58 addresses are variable-length.  That makes webforms and software mildly harder, but also eliminates a simple sanity check.
  3. base58 addresses are hard to read over the phone.  Greg Maxwell points out that the upper and lower case mix is particularly annoying.
  4. The 4-byte SHA check is not guaranteed to catch the most common forms of error (transposed or single incorrect letters), though it's pretty good (a 1-in-4-billion chance of random errors passing).
  5. At around 34 letters, it’s fairly compact (36 for the BIP141 P2WPKH).

This is my proposal for a generic replacement (thanks to CodeShark for generalizing my previous proposal) which covers all possible future address types (as well as being usable for current ones):

  1. Prefix for type, followed by colon.  Currently “btc:” or “testnet:“.
  2. The full scriptPubkey using base 32 encoding as per http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt.
  3. At least 30 bits for crc64-ecma, up to a multiple of 5 to reach a letter boundary.  This covers the prefix (as ascii), plus the scriptPubKey.
  4. The final letter is the Damm algorithm check digit of the entire previous string, using this 32-way quasigroup. This protects against single-letter errors as well as single transpositions.

These addresses look like btc:ybndrfg8ejkmcpqxot1uwisza345h769ybndrrfg (41 digits for a P2WPKH) or btc:yybndrfg8ejkmcpqxot1uwisza345h769ybndrfg8ejkmcpqxot1uwisza34 (60 digits for a P2WSH) (note: neither of these has the correct CRC or check letter, I just made them up).  A classic P2PKH would be 45 digits, like btc:ybndrfg8ejkmcpqxot1uwisza345h769wiszybndrrfg, and a P2SH would be 42 digits.
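
The alphabet in those examples is Zimmermann's human-oriented base-32 alphabet. A minimal encoder for step 2 might look like this (sketch only: it handles just the base-32 step, not the CRC or the Damm check digit):

```python
# Human-oriented base-32 alphabet from the linked Zimmermann document;
# this sketch covers only step 2 (scriptPubkey bytes -> letters).
ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

def encode_base32(data: bytes) -> str:
    bits = "".join(format(b, "08b") for b in data)
    bits += "0" * (-len(bits) % 5)          # pad to a 5-bit boundary
    return "".join(ALPHABET[int(bits[i:i + 5], 2)]
                   for i in range(0, len(bits), 5))

# A 20-byte P2WPKH hash encodes to exactly 32 letters (160 / 5)
print(len(encode_base32(bytes(20))))  # -> 32
```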

While manually copying addresses is something which should be avoided, it does happen, and the cost of making them robust against common typographic errors is small.  The CRC is a good idea even for machine-based systems: it will let through less than 1 in a billion mistakes.  Distinguishing which blockchain is a nice catchall for mistakes, too.

We can, of course, bikeshed this forever, but I wanted to anchor the discussion with something I consider fairly sane.

April 08, 2016 01:50 AM

April 07, 2016

Andi Kleen: Overview of Last Branch Records (LBRs) on Intel CPUs

I wrote a two-part article on the theory and practice of Last Branch Records using Linux perf. They were published on LWN.

This includes the background (what a Last Branch Record is and why branch sampling is useful), what you can use it for in perf, such as hot-path profiling, frame-pointerless callgraphs, or automatic micro-benchmarks, and also other uses like continuous compiler feedback.

The articles are now freely available:

Part 1: Introduction of Last Branch Records
Part 2: Advanced uses of Last Branch Records

April 07, 2016 02:26 PM

April 06, 2016

Pete Zaitcev: Amateur contributors to OpenStack

John was venting about our complicated contribution process in OpenStack and threw out this off-hand remark:

I'm sure the fact that nearly 100% of @openstack contributors are paid to be so is completely unrelated. #eyeroll

While I share his frustration, one thing he may be missing is that OpenStack is generally useless to anyone who does not have thousands of computers dedicated to it. This is a significant barrier to entry for hobbyists, baked straight into the nature of OpenStack.

Exceptions that we see are basically people building little pseudo-clusters out of a dozen VMs. They do it with an aim of advancing their careers.

April 06, 2016 04:44 PM

April 05, 2016

Matthew Garrett: There's more than one way to exploit the commons

There's a piece of software called XScreenSaver. It attempts to fill two somewhat disparate roles:

  1. Locking your screen, so that nobody else can get at your session while you're away[1]
  2. Displaying a large and varied collection of screensavers

XScreenSaver does an excellent job of the second of these[2] and is pretty good at the first, which is to say that it only suffers from a disastrous security flaw once every few years and as such is certainly not appreciably worse than any other piece of software.

Debian ships an operating system that prides itself on stability. The Debian definition of stability is a very specific one - rather than referring to how often the software crashes or misbehaves, it refers to how often the software changes behaviour. Debian is very reluctant to upgrade software that is part of a stable release, to the extent that developers will attempt to backport individual security fixes to the version they shipped rather than upgrading to a release that contains all those security fixes but also adds a new feature. The argument here is that the new release may also introduce new bugs, and Debian's users desire stability (in the "things don't change" sense) more than new features. Backporting security fixes keeps them safe without compromising the reason they're running Debian in the first place.

This all makes plenty of sense at a theoretical level, but reality is sometimes less convenient. The first problem is that security bugs are typically also, well, bugs. They may make your software crash or misbehave in annoying but apparently harmless ways. And when you fix that bug you've also fixed a security bug, but the ability to determine whether a bug is a security bug or not is one that involves deep magic and a fanatical devotion to the cause so given the choice between maybe asking for a CVE and dealing with embargoes and all that crap when perhaps you've actually only fixed a bug that makes the letter "E" appear in places it shouldn't and not one that allows the complete destruction of your intergalactic invasion fleet means people will tend to err on the side of "Eh fuckit" and go drinking instead. So new versions of software will often fix security vulnerabilities without there being any indication that they do so[3], and running old versions probably means you have a bunch of security issues that nobody will ever do anything about.

But that's broadly a technical problem and one we can apply various metrics to, and if somebody wanted to spend enough time performing careful analysis of software we could have actual numbers to figure out whether the better security approach is to upgrade or to backport fixes. Conversations become boring once we introduce too many numbers, so let's ignore that problem and go onto the second, which is far more handwavy and social and so significantly more interesting.

The second problem is that upstream developers remain associated with the software shipped by Debian. Even though Debian includes a tool for reporting bugs against packages included in Debian, some users will ignore that and go straight to the upstream developers. Those upstream developers then have to spend at least 15 or so seconds telling the user that the bug they're seeing has been fixed for some time, and then figure out how to explain that no sorry they can't make Debian include a fixed version because that's not how things work. Worst case, the stable release of Debian ends up including a bug that makes software just basically not work at all and everybody who uses it assumes that the upstream author is brutally incompetent, and they end up quitting the software industry and I don't know running a nightclub or something.

From the Debian side of things, the straightforward solution is to make it more obvious that users should file bugs with Debian and not bother the upstream authors. This doesn't solve the problem of damaged reputation, and nor does it entirely solve the problem of users contacting upstream developers. If a bug is filed with Debian and doesn't get fixed in a timely manner, it's hardly surprising that users will end up going upstream. The Debian bugs list for XScreenSaver does not make terribly attractive reading.

So, coming back to the title for this entry. The most obvious failure of the commons is where a basically malicious actor consumes while giving nothing back, but if an actor with good intentions ends up consuming more than they contribute that may still be a problem. An upstream author releases a piece of software under a free license. Debian distributes this to users. Debian's policies result in the upstream author having to do more work. What does the upstream author get out of this exchange? In an ideal world, plenty. The author's software is made available to more people. A larger set of developers is willing to work on making improvements to the software. In a less ideal world, rather less. The author has to deal with bug mail about already fixed bugs. The author's reputation may be harmed by user exposure to said fixed bugs. The author may get less in the way of useful bug fixes or features because people are running old versions rather than fixing new ones. If the balance tips towards the latter, the author's decision to release their software under a free license has made their life more difficult.

Most discussions about Debian's policies entirely ignore the latter scenario, focusing more on the fact that the author chose to release their software under a free license to begin with. If the author is unwilling to handle the consequences of that, goes the argument, why did they do it in the first place? The unfortunate logical conclusion to that argument is that the author realises that they made a huge mistake and never does so again, and woo uh oops.

The irony here is that one of Debian's foundational documents, the Debian Free Software Guidelines, makes allowances for this. Section 4 allows for distribution of software in Debian even if the author insists that modified versions[4] are renamed. This allows for an author to make a choice - allow themselves to be associated with the Debian version of their work and increase (a) their userbase and (b) their support load, or try to distinguish what Debian ship from their identity. But that document was ratified in 1997 and people haven't really spent much time since then thinking about why it says what it does, and so this tradeoff is rarely considered.

Free software doesn't benefit from distributions antagonising their upstreams, even if said upstream is a cranky nightclub owner. Debian's users are Debian's highest priority, but those users are going to suffer if developers decide that not using free licenses improves their quality of life. Kneejerk reactions around specific instances aren't helpful, but now is probably a good time to start thinking about what value Debian brings to its upstream authors and how that can be increased. Failing to do so doesn't serve users, Debian itself or the free software community as a whole.

[1] The X server has no fundamental concept of a screen lock. This is implemented by an application asking that the X server send all keyboard and mouse input to it rather than to any other application, and then that application creating a window that fills the screen. Due to some hilarious design decisions, opening a pop-up menu in an application prevents any other application from being able to grab input and so it is impossible for the screensaver to activate if you open a menu and then walk away from your computer. This is merely the most obvious problem - there are others that are more subtle and more infuriating. The only fix in this case is to nuke the site from orbit.

[2] There's screenshots here. My favourites are the one that emulates the electrical characteristics of an old CRT in order to present a more realistic depiction of the output of an Apple 2, and the one that includes a complete 6502 emulator.

[3] And obviously new versions of software will often also introduce new security vulnerabilities without there being any indication that they do so, because who would ever put that in their changelog. But the less ethically challenged members of the security community are more likely to be looking at new versions of software than ones released three years ago, so you're probably still tending towards winning overall.

[4] There's a perfectly reasonable argument that all packages distributed by Debian are modified in some way

April 05, 2016 11:53 PM

Matthew Garrett: TPMs, event logs, fine-grained measurements and avoiding fragility in remote-attestation

Trusted Platform Modules are fairly unintelligent devices. They can do some crypto, but they don't have any ability to directly monitor the state of the system they're attached to. This is worked around by having each stage of the boot process "measure" state into registers (Platform Configuration Registers, or PCRs) in the TPM by taking the SHA1 of the next boot component and performing an extend operation. Extend works like this:

New PCR value = SHA1(current value||new hash)

ie, the TPM takes the current contents of the PCR (a 20-byte register), concatenates the new SHA1 to the end of that in order to obtain a 40-byte value, takes the SHA1 of this 40-byte value to obtain a 20-byte hash and sets the PCR value to this. This has a couple of interesting properties:

  1. The final PCR value depends on every value extended into it, and on the order in which those extends were performed.
  2. There's no way to set a PCR to an arbitrary value: the only way to reproduce a given PCR value is to replay exactly the same sequence of extend operations.

But how do we know what those operations were? We control the bootloader and the kernel and we know what extend operations they performed, so that much is easy. But the firmware itself will have performed some number of operations (the firmware itself is measured, as is the firmware configuration, and certain aspects of the boot process that aren't in our control may also be measured) and we may not be able to reconstruct those from scratch.

Thankfully we have more than just the final PCR data. The firmware provides an interface to log each extend operation, and you can read the event log in /sys/kernel/security/tpm0/binary_bios_measurements. You can pull information out of that log and use it to reconstruct the writes the firmware made. Merge those with the writes you performed and you should be able to reconstruct the final TPM state. Hurrah!

The problem is that a lot of what you want to measure into the TPM may vary between machines or change in response to configuration changes or system updates. If you measure every module that grub loads, and if grub changes the order that it loads modules in, you also need to update your calculations of the end result. Thankfully there's a way around this - rather than making policy decisions based on the final TPM value, just use the final TPM value to ensure that the log is valid. If you extract each hash value from the log and simulate an extend operation, you should end up with the same value as is present in the TPM. If so, you know that the log is valid. At that point you can examine individual log entries without having to care about the order that they occurred in, which makes writing your policy significantly easier.
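
A log-replay check along those lines can be sketched in a few lines (illustrative Python only; the real binary_bios_measurements format carries more per-event metadata):

```python
import hashlib

# TPM 1.2 extend: new PCR = SHA1(current PCR || measurement)
def extend(pcr: bytes, measurement: bytes) -> bytes:
    return hashlib.sha1(pcr + measurement).digest()

def replay(logged_hashes) -> bytes:
    pcr = b"\x00" * 20                # PCRs start zeroed at boot
    for h in logged_hashes:
        pcr = extend(pcr, h)
    return pcr

def log_is_valid(logged_hashes, pcr_from_tpm: bytes) -> bool:
    # If the simulated final value matches what the TPM reports, the
    # log is trustworthy and individual entries can then be examined
    # without caring about ordering.
    return replay(logged_hashes) == pcr_from_tpm
```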

But there's another source of fragility. Imagine that you're measuring every command executed by grub (as is the case in the CoreOS grub). You want to ensure that no inappropriate commands have been run (such as ones that would allow you to modify the loaded kernel after it's been measured), but you also want to permit certain variations - for instance, you might have a primary root filesystem and a fallback root filesystem, and you're ok with either being passed as a kernel argument. One approach would be to write two lines of policy, but there's an even more flexible approach. If the bootloader logs the entire command into the event log, when replaying the log we can verify that the event description hashes to the value that was passed to the TPM. If it does, rather than testing against an explicit hash value, we can examine the string itself. If the event description matches a regular expression provided by the policy then we're good.
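
That two-step check (authenticate the description against the logged hash, then match policy against the string) might be sketched as follows; the log-entry layout and the policy pattern here are hypothetical:

```python
import hashlib
import re

def entry_matches(description: str, logged_hash: bytes, pattern: str) -> bool:
    # Step 1: the description must hash to the value actually extended
    # into the TPM; otherwise the log could lie about what was run.
    if hashlib.sha1(description.encode()).digest() != logged_hash:
        return False
    # Step 2: with the description authenticated, policy can match the
    # string itself instead of a fixed hash value.
    return re.fullmatch(pattern, description) is not None

# One policy line accepts either root filesystem (hypothetical command)
policy = r"linux /vmlinuz root=/dev/sda[12] ro"
cmd = "linux /vmlinuz root=/dev/sda2 ro"
print(entry_matches(cmd, hashlib.sha1(cmd.encode()).digest(), policy))  # -> True
```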

This approach makes it possible to write TPM policies that are resistant to changes in ordering and permit fine-grained definition of acceptable values, and which can cleanly separate out local policy, generated policy values and values that are provided by the firmware. The split between machine-specific policy and OS policy allows for the static machine-specific policy to be merged with OS-provided policy, making remote attestation viable even over automated system upgrades.

We've integrated an implementation of this kind of policy into the TPM support code we'd like to integrate into Kubernetes, and CoreOS will soon be generating known-good hashes at image build time. The combination of these means that people using Distributed Trusted Computing under Tectonic will be able to validate the state of their systems with nothing more than a minimal machine-specific policy description.

The support code for all of this should also start making it into other distributions in the near future (the grub code is already in Fedora 24), so with luck we can define a cross-distribution policy format and make it straightforward to handle this in a consistent way even in heterogeneous operating system environments. Remote attestation is a powerful tool for ensuring that your systems are in a valid state, but the difficulty of policy management has been a significant factor in making it difficult for people to deploy in their data centres. Making it easier for people to shield themselves against low-level boot attacks is a big step forward in improving the security of distributed workloads and makes bare-metal hosting a much more viable proposition.

April 05, 2016 04:08 PM

April 01, 2016

Rusty Russell: BIP9: versionbits In a Nutshell

Hi, I was one of the authors/bikeshedders of BIP9, which Pieter Wuille recently refined (and implemented) into its final form.  The bitcoin core plan is to use BIP9 for activations from now on, so let’s look at how it works!

Some background:

So, let’s look at BIP68 & 112 (Sequence locks and OP_CHECKSEQUENCEVERIFY) which are being activated together:

There are also two alerts in the bitcoin core implementation:

Now, when could the OP_CSV soft forks activate? bitcoin-core will only start setting the bit in the first period after the start date, so somewhere between 1st and 15th of May[1], then will take another period to lock-in (even if 95% of miners are already upgraded), then another period to activate.  So early June would be the earliest possible date, but we’ll get two weeks notice for sure.

The Old Algorithm

For historical purposes, I’ll describe how the old soft-fork code worked.  It used version as a simple counter, eg. 3 or above meant BIP66, 4 or above meant BIP65 support.  Every block, it examined the last 1000 blocks to see if more than 75% had the new version.  If so, then the new softfork rules were enforced on new version blocks: old version blocks would still be accepted, and use the old rules.  If more than 95% had the new version, old version blocks would be rejected outright.
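
The counting rule can be sketched like this (simplified; real bitcoin-core examined the actual headers of the last 1000 blocks on the chain):

```python
# Simplified sketch of the pre-BIP9 soft-fork activation rule: count
# the versions of the last 1000 blocks and apply the 75%/95% thresholds.
def softfork_state(last_1000_versions, new_version):
    count = sum(1 for v in last_1000_versions if v >= new_version)
    if count > 950:
        return "enforced"   # old-version blocks rejected outright
    if count > 750:
        return "partial"    # new rules enforced on new-version blocks only
    return "inactive"

print(softfork_state([4] * 800 + [3] * 200, 4))  # -> partial
```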

I remember Gregory Maxwell and other core devs stayed up late several nights because BIP66 was almost activated, but not quite.  And as a miner there was no guarantee on how long before you had to upgrade: one smaller miner kept producing invalid blocks for weeks after the BIP66 soft fork.  Now you get two weeks’ notice (probably more if you’re watching the network).

Finally, this change allows for miners to reject a particular soft fork without rejecting them all.  If we’re going to see more contentious or competing proposals in the future, this kind of plumbing allows it.

Hope that answers all your questions!


 

[1] It would be legal for an implementation to start setting it on the very first block past the start date, though it’s easier to only think about version bits once every two weeks as bitcoin-core does.

April 01, 2016 01:28 AM

March 30, 2016

Pete Zaitcev: SwiftStack versus Swift

Erik Pounds posted an article on the official SwiftStack blog presenting a somewhat corporatist view of Swift and Ceph that comes down to this:

They are both productized by commercial companies so all enterprises can utilize them... Ceph via RedHat and Swift via SwiftStack.

This view is extremely reductionist along a couple of avenues.

First, it tries to sweep all of Swift under the SwiftStack umbrella, whereas in reality Swift derives a lot of strength from not being controlled by SwiftStack. But way to assuage the fears of monoentity control by employing the PTL, guys. Fortunately, in the real and more complex world, Red Hat pays me to work on Swift, as well as offering Swift as a product, and I do not find that my needs are in any way sabotaged by the PTL. Certainly our product focus differs; Red Hat's lifecycle management offering, OSPd, manages OpenStack first and Swift only as much as it's a part of OpenStack, whereas SwiftStack offer a Swift-specific product. Still, it's not like Swift equals SwiftStack. I think Rackspace continue to operate the largest Swift cluster in the world.

Second, Erik somehow neglects to notice that Ceph provides Swift API compatibility through a component known as Rados Gateway. It is an option, you know, although obviously it can never be a better Swift than Swift itself, or a better Amazon S3 than S3 itself.

March 30, 2016 07:36 PM

March 29, 2016

James Morris: Linux Security Summit 2016 – CFP Announced!

The 2016 Linux Security Summit (LSS) will be held in Toronto, Canada, on the 25th and 26th of August, co-located with LinuxCon North America.  See the full announcement.

The Call for Participation (CFP) is now open, and submissions must be made by June 10th.  As with recent years, the committee is looking for refereed presentations and discussion topics.

This year, there will be a $100 registration fee for LSS, and you do not need to be registered for LinuxCon to attend LSS.

There’s now also an event Twitter feed for updates and announcements.

If you’ve been doing any interesting development, or deployment, of Linux security systems, please consider submitting a proposal!

March 29, 2016 11:08 AM

March 28, 2016

Paul E. Mc Kenney: Live Kernel Patching Microconference Accepted into 2016 Linux Plumbers Conference

Live kernel patching was accepted into the v4.0 Linux kernel in February 2015, so we can declare the 2014 LPC Live Kernel Patching Microconference to have been a roaring success! However, as was noted at the time, this is just the beginning of the real work. In short, the v4.0 work makes live kernel patching possible, but more work is required to make it more reliable and more routine.

Additional issues include stacktrace reliability, patch-safety criteria for kernel threads, thread consistency models, porting to non-x86 architectures, handling of loadable modules, compiler optimizations, userspace tooling, patching of data, automated regression testing, and patch-creation guidelines.

Join us for an important and spirited discussion!

March 28, 2016 04:52 PM

March 25, 2016

LPC 2016: Containers Microconference Accepted into 2016 Linux Plumbers Conference

The level of Containers excitement has increased even further this year, with much interplay between Docker, Kubernetes, Rkt, CoreOS, Mesos, LXC, LXD, OpenVZ, systemd, and much else besides. This excitement has led to some interesting new use cases, including even the use of containers on Android.

Some of these use cases in turn require some interesting new changes to the Linux plumbing, including mounts in unprivileged containers, improvements to cgroups resource management, ever-present security concerns, and interoperability between various sets of tools.

Please join us for an important discussion on an important aspect of cloud computing!

March 25, 2016 05:51 PM