Kernel Planet

December 19, 2014

Dave Jones: Moving on from Red Hat.

After eleven and a half years, today is my final day at Red Hat.
I’ll write more about what comes next in the new year.

In the meantime, here’s a slightly edited version of a mail I sent internally yesterday.

In 2003, I got an email from Michael Johnson, about a secretive new thing Red Hat was working on called "Fedora". No-one was quite sure what it was going to be (some may argue we're still figuring it out), but he was pretty sure I'd want to be a part of it. "How'd you feel about taking care of _any_ kernel problems that come in for this thing?" he asked. I was terrified, but excited at the opportunities to learn a lot of stuff outside my usual areas of expertise.

With barely any real detail as to what I was signing up for, I jumped at the opportunity. Within my first few months, I had some concerns over whether or not I had made a good decision. Then Michael left for rPath, and I seriously started to have my doubts.

While everyone was figuring out what Fedora was going to be, I was thrown in at the deep end. "Here's Red Hat Linux 7, 8 and 9, you maintain the kernel for those now. Go". I remember looking at bugzilla, scrolling through page after page of bugs, thinking "This is going to be a nightmare". At the same time, RHEL 3 was really starting to take shape. I looked at what the guys working on RHEL were doing and thought "Well, this sucks, but those guys.. they _really_ have work to do". As much as I was buried alive in work, I relished every moment of it, learning as much as I could in what little spare time I had.

Then Fedora finally happened. For those not around back then, Fedora Core 1 was pretty much what Red Hat Linux 10 would have been from a kernel pov. A nasty hairball of patches that weren't going upstream (execshield! 4g4g! Tux! CIPE!) that even their authors had stopped maintaining, and a bunch of features backported from 2.5 to 2.4. I get the shakes when I think back to the horrors of maintaining that mess, but like the horrors of RHL before it, it was an amazing learning experience (mostly "what not to do").

But for all its warts, Fedora gained traction, and after Fedora 2 moved to a 2.6 kernel, things really started to take shape. As Fedora's community started to grow, things got even busier in bugzilla than RHL had ever been.

Then somehow I got talked into also being RHEL4 kernel maintainer for a while.
It turned out that juggling Fedora 3, Fedora 4, Rawhide, RHEL4 GA, and RHEL4 U1 means you don't get a lot of time to sleep. So after finding another sucker to deal with the RHEL work, I moved back to just doing Fedora work, and in another big turning point, we started to slowly grow out the Fedora kernel team.

Over the years that followed, the only thing that remained constant was the inflow of bugs. At any given time we had a thousand or so bugs open, with at best 3 people, at worst 1 person working on them. I'm incredibly proud of what we've managed to achieve with the Fedora kernel. More than just the base for RHEL, it changed the whole landscape of upstream kernel development.

Despite this progress though, I always felt we were on a treadmill making no real forward progress. That constant 1000 or so bugs kept nagging at me. As fast as we closed them out, a new batch would arrive.

In more recent years, we tried to split the workload within the team so we could do more proactive bug-finding before users even find them. My own 'trinity' project has found so many serious bugs (filesystem corruptors, root holes, vm corner cases, the list goes on) that it got to be almost a full time job just tracking everything.

I used to feel that leaving Red Hat wasn't something I could do. On a few occasions I actually turned down offers from potential employers, because "What about the Fedora kernel?". For the first time since the project began, I feel like I've left things in more than capable hands, and I'm sure things will continue to move in the right direction.

3 RHL's. 5 and a half RHEL's. 21 Fedoras. You don't even want to know how much hardware I've destroyed in the line of duty in this time. It's been uh, an experience.

So, after all this time, one thing I have learned, is that all this was definitely one of my better decisions. I hope that my next decision turns out to be an equally good one.

December 19, 2014 04:07 PM

December 16, 2014

Daniel Vetter: Neat drm/i915 stuff for 3.19

So kernel version 3.18 is out the door and it's time for our regular look at what's in the next merge window.
First, looking at new hardware, the big item is basic Skylake support. There are still a few small things missing, but mostly it's there now. This has been contributed by Damien, Satheeshakrishna and a lot of other folks. Looking at other platforms, there have also been a lot of changes for vlv/chv: improved backlight code, completely refactored interrupt handling to bring it in line with other platforms, and rewritten panel power sequencing code, all from Ville. Rodrigo contributed PSR support for vlv/chv together with a lot of other fixes for PSR. Unfortunately it's still not enabled by default again.

Moving on to Broadwell and the render side of things, Mika and Arun provided patches to improve the render workaround code and bring the set of workarounds up to date. execlist (the new command submission support on Gen8+) is also being polished with the addition of on-demand pinning of context objects with patches from Thomas Daniel and Oscar Mateo. Finally the RPS/render-turbo code has seen a lot of polish from Imre with a few fixes from Tom O'Rourke.

Otherwise not a lot of really big things happened on the GEM side: Just a few patches to fix issues in ppgtt (unfortunately still not enabled by default anywhere due to fun with context switches). And there's a bit of prep work and reorg all over for new stuff landing hopefully soon.

Looking at overall infrastructure changes the big thing certainly is the preparations for atomic display updates. The drm core/driver interface for atomic and all the helper library code to convert drivers has landed in 3.19, along with some conversions already. On the Intel side it's been just prep work under the hood thus far with patches from Ander to precompute display PLL state. The new code to use vblank evades for pageflips has also landed, which is needed for atomic plane updates. And prep patches from Gustavo Padovan started to split the low-level plane update functions into check and commit steps. Lots more patches from different people are in flight and some have been merged for 3.20 already.

Besides these driver internal changes for atomic there has been other work to improve the codebase: Imre reorganized our handlers for suspend, resume and thawing and freezing. Jani reworked the audio and eld code which is the gfx side of the puzzle needed to make audio over HDMI or DP work. Jesse provided patches to track infoframes more accurately, which is needed to correctly fastboot (i.e. without modesets if possible) on external screens.

For older machines Ville has spent a few spare cycles to make them more useful: GPU reset support for gen3/4 should mitigate some of the recent chromium crashes on mesa, and the modeset code on i830M might work correctly for the first time, ever.

And of course the usual pile of smaller fixes and improvements all over.

Not directly related to code or features is the start of documenting i915 driver internals: With this release we now have some of the interrupt handling, fifo underrun reporting, frontbuffer tracking and runtime pm support newly documented. And there's lots more in flight, so hopefully soonish this will be fairly useful.

December 16, 2014 09:03 PM

December 14, 2014

Eric Sandeen: Neptune RF water meter frequency hopping pattern

A couple of years ago, my water utility installed new remote-read water meters, Neptune e-Coder R900i, in every home in their service area.  As any casual reader of this blog knows, I’m a big fan of measuring and data when it comes to household resource consumption. I’ve got electricity pretty well covered, but water and gas are still lacking. I could install secondary meters with pulse counters, but that seems silly – I already have remote-read meters installed, it’s just that the data they send isn’t accessible to me. Let’s see if we can start to remedy that!

Reading the product literature for this meter, we learn a few things right off the bat:

Transmitter Specifications:

Great!  We know nothing about the signal, but we know where to start looking to find it.  Is there any cheap hardware which could listen in on these frequencies?  Why yes, there is!

nooelec dvb tuner [amzn]

The rtl-sdr project has repurposed Digital Video Broadcasting (DVB) receivers as general-purpose software-defined radios.  We can control frequency and gain, and listen in on signals within the tuned bandwidth.

For now, we’re just trying to find the signals; we don’t yet care what they look like, we just want to know where they are so we can start listening.  To this end we can use the rtl_power utility from the rtl-sdr suite, which listens in on a given bandwidth, slices it into smaller buckets, and gives us time-stamped signal strength for each frequency bucket.  I used keenerd’s tree on github, as it has many bleeding-edge fixes and features (he was also supremely helpful on the IRC channel, thanks!)

However, the device can’t listen in on the entire 910-920MHz spectrum at once; its maximum bandwidth is about 2MHz.  So to listen across all 10MHz of the published range would require sweeping, and we might miss the short signals the meter emits if we’re sweeping the wrong range at the time it sends the signal.  But that’s ok; we know that the periodicity of the signal is once every 14 seconds, and that it hops across 50 frequencies.  We don’t know how random the sequence is across the 50 frequencies, but let’s assume it’s simple, that it repeats every 14×50 seconds, or every 700 seconds.  So let’s try listening in to dedicated 1MHz bands at a time, for 700s each, and line up the results; essentially “binning” the signals we get into 50 14-second-aligned slots.


# 50 channels, each for 14s; 50*14 is 700, gather 1 cycle at each freq

for I in `seq 910 1 919`; do
    let J=$I+1
    rtl_power -g 0.9 -p -39.906 -f ${I}M:${J}M:5000 -c 20% -i 1s -e 700s -E rtl_power-${I}M-${J}M-5000-i1.csv
done

This tells rtl_power to use gain 0.9 (pretty low, the antenna was very near the meter, and I don’t want my neighbors’ meters interfering), calibrate the frequency, crop the gathered data by 20% (data at the edges of the bandwidth can be dodgy), record power every 1s, and stop after 700s.  We record 5000 bins in the range, fairly fine grained.  The -E switch tells rtl_power to record timestamps as seconds since epoch.  We do this for each of the 10 1MHz ranges between 910MHz and 920MHz in succession.
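Since -E records epoch seconds, each row can be mapped back to one of the 50 hop slots with a little arithmetic. This is only a minimal sketch: the helper name hop_slot and the reference start time t0 are hypothetical, and it assumes the pattern really does repeat every 50 × 14 = 700 seconds.

```shell
# Hypothetical helper: map an epoch timestamp to a hop slot (0-49).
# $1 = row timestamp (epoch seconds), $2 = reference start t0 (epoch seconds)
# Assumes the hop pattern repeats every 50 * 14 = 700 seconds.
hop_slot() {
    period=$((50 * 14))
    offset=$(( ($1 - $2) % period ))
    # The shell's % can return a negative remainder for timestamps before t0.
    if [ "$offset" -lt 0 ]; then
        offset=$((offset + period))
    fi
    echo $((offset / 14))
}
```

For example, `hop_slot 1418532014 1418532000` lands in slot 1, because that row is 14 seconds after the reference start.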

This gives us 10 CSV files, with one row per second, and 5000 time-stamped frequency columns.  Now what?  We’d like to know what it’s doing across the entire documented frequency range.

Assuming we were right about the sequence repeating every 700s, we can now just “horizontally concatenate” the frequency columns from each run.  It’s easy enough to use awk to strip out the timestamp data and retain only the frequency columns.  But how to concatenate them?  paste(1)!  Did you know about paste?  I did not.  Paste is awesome for this.

After simple awk scripts to gather only the frequency columns for each csv file, we combine them like so, using a comma for the delimiter as we combine the files:

paste -d , rtl_power-910M-911M-5000-i1.csv 911M-912M.csv 912M-913M.csv 913M-914M.csv 914M-915M.csv 915M-916M.csv 916M-917M.csv 917M-918M.csv 918M-919M.csv 919M-920M.csv > all.csv
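The column-stripping step isn't shown, so here is a minimal sketch of it using cut rather than awk. The helper name strip_meta is hypothetical, and the number of leading metadata columns is an assumption passed in as a parameter; check it against your own rtl_power output before relying on it.

```shell
# Hypothetical helper: drop the leading metadata columns from one rtl_power CSV.
# $1 = input file, $2 = number of leading columns to drop (an assumption;
# inspect your own CSV to find the right count).
strip_meta() {
    cut -d , -f "$(( $2 + 1 ))-" "$1"
}

# Usage sketch: strip each capture, then join the results column-wise.
# strip_meta rtl_power-911M-912M-5000-i1.csv 6 > 911M-912M.csv
# paste -d , 910M-911M.csv 911M-912M.csv > all.csv
```

Because paste joins files line by line, this only lines up if every capture has the same number of rows, which is what the fixed 700s run length is for.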

Frikkin’ trivial!  And now we’d like to visualize it.  Turns out keenerd has a great tool for this too;  Again, super easy: all.csv > all.png

and voila!

That’s the heatmap of power across frequencies and time; the yellow lines are the higher energy emissions during the signals, and the white text is my own annotation for time and frequency sequences.  Looks amazingly orderly, no?

I actually ran the capture for 3 cycles after this, and confirmed that the cycle repeats the above pattern ad infinitum.  We know that the time scale is once every 14s, but it’s hard to tell from the image above what the frequencies might be; we could count pixels, or look into the CSV files for frequency bins, but this DVB tuner isn’t super accurate… so, looking at my meter, its FCC ID is P2SNTGECR900DL.  And looking at the test report for that FCC filing, we find this:

Which matches just extremely well; the lower frequency is stated in the test report as 911.06 MHz, and the upper as 919.08 MHz.  It also states that the channel separation was measured at 131.58 kHz.  So, counting the peaks above, and counting up (17 peaks) or down (33 peaks) from the extremes, correlating to the sequence we measured, and sorting – here is the table of frequencies and sequences for the Neptune e-Coder R900i RF water meter.  Now that we know where to look, we can start investigating the signals.

Seq	Frequency (MHz)
1	911.06
2	915.92
3	913.32
4	917.24
5	915.26
6	918.55
7	912.19
8	916.58
9	914.45
10	917.90
11	911.19
12	916.05
13	913.45
14	917.37
15	915.40
16	918.69
17	912.32
18	916.71
19	914.59
20	918.03
21	911.32
22	916.19
23	914.06
24	917.50
25	915.53
26	918.82
27	912.45
28	916.84
29	914.87
30	918.16
31	911.45
32	916.32
33	914.19
34	917.63
35	915.66
36	918.95
37	913.06
38	916.97
39	915.00
40	918.29
41	912.06
42	916.45
43	914.32
44	917.76
45	915.79
46	919.08
47	913.19
48	917.11
49	915.13
50	918.42

December 14, 2014 04:46 AM

December 12, 2014

Pete Zaitcev: blitz2

You know how some people attach several monitors to one PC? I don't. I just have several PCs. But then I want copy-paste to work transparently (as transparently as possible). For several years I used blitz to copy clipboard. It works well enough, but once you have 3 computers, it gets somewhat cumbersome to type the hostname. Also, it always bothered me how it rides ssh authentication. I wanted something independent from ssh.

Behold blitz2. Instead of passing the clipboard to the host where it's needed directly, the clipboard is uploaded to an HTTP server. Seems more complex at first, but it's actually much better, because previously the PC where you copy had to authenticate to the PC where you paste. Now the authentication is symmetric. So, all clients are configured exactly the same, and all can upload and download the clipboard no matter who trusts what ssh keys.

December 12, 2014 08:55 PM

November 29, 2014

Dave Airlie: is this a protocol? displaylink3

I'm not sure

but if hd0;u]; means anything to anyone from displaylink, or is the first unencrypted bytes they send, then oops.

Looks like I have some work to do next week.

November 29, 2014 05:42 AM

November 20, 2014

Pavel Machek: fight with pulseaudio

On the Nokia N900, pulseaudio is needed for phone calls to work correctly. Unfortunately, that piece of software fights back.

pavel@n900:~$ pulseaudio --start
N: [pulseaudio] main.c: User-configured server at {d3b6d0d847a14a3390b6c41ef280dbac}unix:/run/user/1000/pulse/native, refusing to start/autospawn.

Ok, I'd really like to avoid complexity of users here. Let me try as root.

root@n900:/home/pavel# pulseaudio --start
W: [pulseaudio] main.c: This program is not intended to be run as root (unless --system is specified).
N: [pulseaudio] main.c: User-configured server at {d3b6d0d847a14a3390b6c41ef280dbac}unix:/run/user/1000/pulse/native, refusing to start/autospawn.

Ok, I don't need per-user sessions, this is a cellphone. Let's specify --system.

root@n900:/home/pavel# pulseaudio --start --system
E: [pulseaudio] main.c: --start not supported for system instances.

Yeah, ok.

root@n900:/home/pavel# pulseaudio --system
W: [pulseaudio] main.c: Running in system mode, but --disallow-exit not set!
W: [pulseaudio] main.c: Running in system mode, but --disallow-module-loading not set!
N: [pulseaudio] main.c: Running in system mode, forcibly disabling SHM mode!
N: [pulseaudio] main.c: Running in system mode, forcibly disabling exit idle time!
W: [pulseaudio] main.c: OK, so you are running PA in system mode. Please note that you most likely shouldn't be doing that.
W: [pulseaudio] main.c: If you do it nonetheless then it's your own fault if things don't work as expected.
W: [pulseaudio] main.c: Please read for an explanation why system mode is usually a bad idea.

Totally my fault that someone forgot to document this pile of code. Thanks for blaming me. I'd actually like to read what is wrong with that, except that the page referenced does not exist. :-(.

November 20, 2014 09:18 PM

Paul E. McKenney: Stupid RCU Tricks: rcutorture Catches an RCU Bug

My previous posting described an RCU bug that I might plausibly blame on falsehoods from firmware. The RCU bug in this post, alas, I can blame only on myself.

In retrospect, things were going altogether too smoothly while I was getting my RCU commits ready for the 3.19 merge window. That changed suddenly when my automated testing kicked out a “BUG: FAILURE, 1 instances”. This message indicates a grace-period failure, in other words that RCU failed to be RCU, which is of course really really really bad. Fortunately, it was still some weeks until the merge window, so there was some time for debugging and fixing.

Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug is difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours.

My first attempt at “something better” was to inspect the suspect patches. This effort did locate some needed fixes, but nothing that would explain the grace-period failures. My next attempt was to take a closer look at the dmesg logs of the two runs with failures, which produced much better results.

You see, rcutorture detects failures using the RCU API, for example, invoking synchronize_rcu() and waiting for it to return. This overestimates the grace-period duration, because RCU's grace-period primitives wait not only for a grace period to elapse, but also for the RCU core to notice their requests and also for RCU to inform them of the grace period's end, as shown in the following figure.


This means that a given RCU read-side critical section might overlap a grace period by a fair amount without rcutorture being any the wiser. However, some RCU implementations provide rcutorture access to the underlying grace-period counters, which in theory provide rcutorture with a much more precise view of each grace period's duration. These counters have long been recorded in the rcutorture output as “Reader Batch” counts as shown on the second line of the following (the first line is the full-API data):

rcu-torture: !!! Reader Pipe:  13341924415 88927 1 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0

This shows rcutorture output from a run containing a failure. On each line, the first two numbers correspond to legal RCU activity: An RCU read-side critical section might have been entirely contained within a single RCU grace period (first number) or it might have overlapped the boundary between an adjacent pair of grace periods (second number). However, a single RCU read-side critical section is most definitely not allowed to overlap three different grace periods, because that would mean that the middle grace period was by definition too short. And exactly that error occurred above, indicated by the exclamation marks and the “1” in the third place in the “Reader Pipe” line above.

If RCU was working perfectly, in theory, the output would instead look something like this, without the exclamation marks and without non-zero values in the third and subsequent positions:

rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 0 0 0 0 0 0 0 0 0

In practice, the “Reader Batch” counters were intended only for statistical use, and access to them is therefore unsynchronized, as indicated by the jagged grace-period-start and grace-period-end lines in the above figure. Furthermore, any attempt to synchronize them in rcutorture's RCU readers would incur so much overhead that rcutorture could not possibly do a good job of torturing RCU. This means that these counters can result in false positives, and false positives are not something that I want in my automated test suite. In other words, it is at least theoretically possible that we might legitimately see something like this from time to time:

rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0

We clearly need to know how bad this false-positive problem is. One way of estimating the level of false-positive badness is to scan my three months' worth of rcutorture test results. This scan showed that there were no suspicious Reader Batch counts in more than 1,300 hours of rcutorture testing, aside from the roughly one per hour from the TREE03 test, which happened to also be the only test to produce real failures. This suggested that I could use the Reader Batch counters as indications of a “near miss” to guide my debugging effort. The once-per-hour failure rate suggested a ten-hour test duration, which I was able to compress into two hours by running five concurrent tests.

Why ten hours?

Since TREE03 had been generating Reader Batch near misses all along, this was not a recent problem. Therefore, git bisect was more likely to be confused by unrelated ancient errors than to be of much help. I therefore instead bisected by configuration and by rcutorture parameters. This alternative bisection showed that the problem occurred only with normal grace periods (as opposed to expedited grace periods), only with CONFIG_RCU_BOOST=y, and only with concurrent CPU-hotplug operations. Of course, one possibility was that the bug was in rcutorture rather than RCU, but rcutorture was exonerated via priority-boost testing on a kernel built with CONFIG_RCU_BOOST=n, which showed neither failures nor Reader Batch near misses. This combination of test results points the finger of suspicion directly at the rcu_preempt_offline_tasks() function, which is the only part of RCU's CPU-hotplug code path that has a non-trivial dependency on CONFIG_RCU_BOOST.

One very nice thing about this combination is that it is very unlikely that people are encountering this problem in production. After all, for this problem to appear, you must be doing the following:

  1. Running a kernel with both CONFIG_PREEMPT=y and CONFIG_RCU_BOOST=y.
  2. Running on hardware containing more than 16 CPUs (assuming the default value of 16 for CONFIG_RCU_FANOUT_LEAF).
  3. Carrying out frequent CPU-hotplug operations, and so much so that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time.

That said, if this does describe your system and workload, you should consider applying this patch set. You should also consider modifying your workload.
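As a rough aid, the first two conditions can be checked mechanically. This is only a sketch: the helper name is hypothetical, and it says nothing about the third condition (frequent CPU hotplug), which depends on the workload rather than the configuration.

```shell
# Hypothetical helper: does a system meet the first two conditions above?
# $1 = kernel config file
# $2 = number of CPUs
# $3 = CONFIG_RCU_FANOUT_LEAF value (defaults to 16, as in the text)
rcu_boost_bug_possible() {
    leaf=${3:-16}
    grep -q '^CONFIG_PREEMPT=y$' "$1" &&
        grep -q '^CONFIG_RCU_BOOST=y$' "$1" &&
        [ "$2" -gt "$leaf" ]
}
```

On distributions that install the kernel config under /boot (an assumption, not universal), something like `rcu_boost_bug_possible "/boot/config-$(uname -r)" "$(nproc)" && echo "worth a closer look"` would do.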

Returning to the rcu_preempt_offline_tasks() function that the finger of suspicion now points to:

 1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
 2                                      struct rcu_node *rnp,
 3                                      struct rcu_data *rdp)
 4 {
 5   struct list_head *lp;
 6   struct list_head *lp_root;
 7   int retval = 0;
 8   struct rcu_node *rnp_root = rcu_get_root(rsp);
 9   struct task_struct *t;
10
11   if (rnp == rnp_root) {
12     WARN_ONCE(1, "Last CPU thought to be offlined?");
13     return 0;
14   }
15   WARN_ON_ONCE(rnp != rdp->mynode);
16   if (rcu_preempt_blocked_readers_cgp(rnp) && rnp->qsmask == 0)
17     retval |= RCU_OFL_TASKS_NORM_GP;
18   if (rcu_preempted_readers_exp(rnp))
19     retval |= RCU_OFL_TASKS_EXP_GP;
20   lp = &rnp->blkd_tasks;
21   lp_root = &rnp_root->blkd_tasks;
22   while (!list_empty(lp)) {
23     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
24     raw_spin_lock(&rnp_root->lock);
25     smp_mb__after_unlock_lock();
26     list_del(&t->rcu_node_entry);
27     t->rcu_blocked_node = rnp_root;
28     list_add(&t->rcu_node_entry, lp_root);
29     if (&t->rcu_node_entry == rnp->gp_tasks)
30       rnp_root->gp_tasks = rnp->gp_tasks;
31     if (&t->rcu_node_entry == rnp->exp_tasks)
32       rnp_root->exp_tasks = rnp->exp_tasks;
33 #ifdef CONFIG_RCU_BOOST
34     if (&t->rcu_node_entry == rnp->boost_tasks)
35       rnp_root->boost_tasks = rnp->boost_tasks;
36 #endif
37     raw_spin_unlock(&rnp_root->lock);
38   }
39   rnp->gp_tasks = NULL;
40   rnp->exp_tasks = NULL;
41 #ifdef CONFIG_RCU_BOOST
42   rnp->boost_tasks = NULL;
43   raw_spin_lock(&rnp_root->lock);
44   smp_mb__after_unlock_lock();
45   if (rnp_root->boost_tasks != NULL &&
46       rnp_root->boost_tasks != rnp_root->gp_tasks &&
47       rnp_root->boost_tasks != rnp_root->exp_tasks)
48     rnp_root->boost_tasks = rnp_root->gp_tasks;
49   raw_spin_unlock(&rnp_root->lock);
50 #endif
51   return retval;
52 }

First, we should of course check the code under #ifdef CONFIG_RCU_BOOST. The code starting at line 43 is quite suspicious, as it is not clear why it is safe to make these modifications after having dropped the rnp_root structure's ->lock on line 37. And removing lines 43-49 does in fact reduce the number of Reader Batch near misses by an order of magnitude, but sadly not to zero.

I was preparing to dig into this function to find the bug, but then I noticed that the loop spanning lines 22-38 is executed with interrupts disabled. Given that the number of iterations through this loop is limited only by the number of tasks in the system, this loop is horrendously awful for real-time response.

Or at least it is now—at the time I wrote that code, the received wisdom was “Never do CPU-hotplug operations on a system that runs realtime applications!” I therefore had invested zero effort into maintaining good realtime latencies during these operations. That expectation has changed because some recent real-time applications offline then online CPUs in order to clear off irrelevant processing that might otherwise degrade realtime latencies. Furthermore, the large CPU counts on current systems are an invitation to run multiple real-time applications on a single system, so that the CPU hotplug operations run as part of one application's startup might interfere with some other application. Therefore, this loop clearly needs to go. So I abandoned my debugging efforts and focused instead on getting rid of rcu_preempt_offline_tasks() entirely, along with all of its remaining bugs.

The point of the loop spanning lines 22-38 is to handle any tasks on the rnp structure's ->blkd_tasks list. This list accumulates tasks that block while in an RCU read-side critical section while running on a CPU associated with the leaf rcu_node structure pointed to by rnp. This blocking will normally be due to preemption, but could also be caused by -rt's blocking spinlocks. This function is called only when the last CPU associated with the leaf rcu_node structure is going offline, after which there will no longer be any online CPUs associated with this leaf rcu_node structure. This loop then moves those tasks to the root rcu_node structure's ->blkd_tasks list. Because there is always at least one CPU online somewhere in the system, there will always be at least one online CPU associated with the root rcu_node structure, which means that RCU will be guaranteed to take proper care of these tasks.

The obvious question at this point is “why not just leave the tasks on the leaf rcu_node structure?” After all, this clearly avoids the unbounded time spent moving tasks while interrupts are disabled. However, it also means that RCU's grace-period computation must change in order to account for blocked tasks associated with a CPU-free leaf rcu_node structure.

In the old approach, the leaf rcu_node structure in question was excluded from RCU's grace-period computations immediately. In the new approach, RCU will need to continue including that leaf rcu_node structure until the last task on its list exits its outermost RCU read-side critical section. At that point, the last task will remove itself from the list, leaving the list empty, thus allowing RCU to ignore the leaf rcu_node structure from that point forward.

This patch set makes this change by abstracting an rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu(), which can then be called from rcu_read_unlock_special() when the last task removes itself from the ->blkd_tasks list. This change allows the rcu_preempt_offline_tasks() function to be dispensed with entirely. With this patch set applied, TREE03 ran for 1,000 hours with neither failures nor near misses, which gives a high degree of confidence that this patch set made the bug less likely to appear. This, along with the decreased complexity and the removal of a source of realtime latency, is a definite plus!

I also modified the rcutorture scripts to pay attention to the “Reader Batch” near-miss output, giving a warning if one such near miss occurs in a run, and giving an error if two or more near misses occur, provided they occur at a rate of at least one near miss per three hours of test time. This should inform me should I make a similar mistake in the future.
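A check along those lines might look roughly like the sketch below. It is not the actual rcutorture script: the function names are hypothetical, and treating two or more near misses below the one-per-three-hours rate as a warning is my assumption. It relies on the “Reader Batch” line format shown earlier, where the last nine counters must all be zero in a clean run.

```shell
# Hypothetical helper: count near misses in an rcutorture console log.
# Takes the last "Reader Batch:" line and sums its last nine counters,
# i.e. everything after the two counters that represent legal activity.
near_misses() {
    grep 'Reader Batch:' "$1" | tail -n 1 |
        awk '{ for (i = NF - 8; i <= NF; i++) sum += $i } END { print sum + 0 }'
}

# Hypothetical helper: classify a run as ok, warning, or error.
# $1 = console log, $2 = test duration in whole hours
classify_run() {
    n=$(near_misses "$1")
    if [ "$n" -eq 0 ]; then
        echo ok
    elif [ "$n" -ge 2 ] && [ $((n * 3)) -ge "$2" ]; then
        # At least one near miss per three hours of test time.
        echo error
    else
        echo warning
    fi
}
```

With the failing sample output above saved to a file, `classify_run console.log 10` reports a warning, since that run shows a single near miss.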

Given that RCU priority boosting has been in the Linux kernel for many years and given that I regularly run rcutorture with CPU-hotplug testing enabled on CONFIG_RCU_BOOST=y kernels, it is only fair to ask why rcutorture did not locate this bug years ago. The likely reason is that I only recently added RCU callback flood testing, in which rcutorture registers 20,000 RCU callbacks, waits a jiffy, registers 20,000 more callbacks, waits another jiffy, then finally registers a third set of 20,000 callbacks. This causes call_rcu() to take evasive action, which has the effect of starting the RCU grace period more quickly, which in turn makes rcutorture more sensitive to too-short grace periods. This view is supported by the fact that I saw actual failures only on recent kernels whose rcutorture testing included callback-flood testing.

Another question is “exactly what was the bug?” I have a fairly good idea of what stupid mistake led to that bug, but given that I have completely eliminated the offending function, I am not all that motivated to chase it down. In fact, it seems much more productive to leave it as an open challenge for formal verification. So, if you have a favorite formal-verification tool, why not see what it can make of rcu_preempt_offline_tasks()?

Somewhat embarrassingly, slides 39-46 of this Collaboration Summit presentation feature my careful development of rcu_preempt_offline_tasks(). I suppose that I could hide behind the “More changes due to RCU priority boosting” on slide 46, but the fact remains that I simply should not have written this function in the first place, whether carefully before priority boosting or carelessly afterwards.

In short, it is not enough to correctly code a function. You must also code the correct function!

November 20, 2014 07:18 PM

November 17, 2014

Pavel Machek: gcc trying to be helpful... in pretty unhelpful way

gcc tried to help me with figuring out the pulseaudio-module-cmtspeech-n9xx compilation... It says:

/lib/x86_64-linux-gnu/ error adding symbols: DSO missing from command line

To decrypt it, you should understand that "DSO" is a library. So it wants you to add /lib/x86_64-linux-gnu/ to the command line you are using to compile. It took me a while to figure out...

November 17, 2014 08:45 PM

November 10, 2014

Eric Sandeen: Residential boiler monitoring via ModBus

Graph of boiler operation

We recently upgraded our boiler to a high-efficiency modulating/condensing Triangle Tube Prestige Trimax Solo, and in perusing the manual, I found that the boiler has a ModBus interface.  Woohoo, project!  The final result is live charting like you see above; for more details, read on!

Per the boiler docs, the ModBus interface is Modbus/RTU using RS-485 for the physical layer.  First thing, obviously, is to find an RS-485 adapter.  Fortunately, I found one super cheap (under $10) at Amazon, a “Kedsum USB to RS485 Converter”, and despite the cheap price and appearance, it seems to work just fine:

Cheapo USB Adapter at Amazon

The adapter only has A and B connections, and no GND, but it works fine.  I used a twisted pair from a CAT5 cable to connect to the A and B terminals on the boiler, plugged it into a spare Raspberry Pi, and was ready to test this thing.

Boiler ModBus Connections

Raspberry Pi w/ RS-485 Adapter

libmodbus makes communication simple; basically open the serial port and read addresses.  A little test program I wrote verified that things are working:

# ./tt-status -s /dev/ttyUSB0 
 DHW Mode
 Flame Present
 DHW Pump
Supply temp:		168 °F
Return temp:		145 °F
DHW Storage temp:	102 °F
Flue temp:		132 °F
Outdoor temp:		 35 °F
Flame Ionization:	 11 μA
Firing rate:		 25 %
Boiler Setpoint:	170 °F
CH1 Maximum Setpoint:	143 °F
DHW Setpoint:		125 °F
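
For a sense of what a library like libmodbus puts on the wire, here is a minimal Python sketch of how a Modbus/RTU "read holding registers" request is framed. The CRC is the standard CRC-16/MODBUS; the slave address, the register address, and even the choice of function code 0x03 are placeholders for illustration, not values taken from the boiler documentation.

```python
import struct

def crc16_modbus(data: bytes) -> int:
    """Standard CRC-16/MODBUS: init 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def read_holding_registers_frame(slave: int, start: int, count: int) -> bytes:
    """Build a function-0x03 request: address, function, start, count, CRC."""
    body = struct.pack(">BBHH", slave, 0x03, start, count)
    # The CRC goes on the wire low byte first, unlike the big-endian fields.
    return body + struct.pack("<H", crc16_modbus(body))

# Hypothetical slave address 1 and register 0; the real ones would come
# from the boiler manual.
frame = read_holding_registers_frame(slave=1, start=0, count=1)
print(frame.hex(" "))
```

The reply carries the same address and function code, a byte count, the register values, and its own CRC, which is what a status tool would decode into readings like the ones above.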

Ok!  So to get the pretty graph above, I obviously needed some data collection; for this I used collectd simply because it already had a ModBus plugin.  It did require some patching and hacking though; the non-bloggy details are on this page; the graph you see above was made by Visage.

Overall this has been very useful; the boiler controls are fairly involved, but primarily we want to get the outdoor reset curve tweaked so that we get nice long runtimes and low return temps, which keep the boiler in its most efficient condensing mode.  My installer hit the default buttons and left, as contractors often do.  Seeing how the boiler was working over time definitely helped me tweak things to improve performance, efficiency and comfort.  Primarily, I drastically lowered the high end of the outdoor reset curve, which defaulted to 170F for cast iron radiators, because we have much more radiation in this house than it needs at 170F output.  I currently have the high end set to 140F.  I may do another post on all that later… suffice it to say, doing a heat loss calculation, measuring radiation, and using that information to guide boiler setup is not rocket science, and you really need to do it for the sake of comfort and efficiency.

collectd, at least with the rrd output, doesn’t keep fine-grained data around for long.  The current setup is useful for seeing how things behaved in the past couple days, but I’ll probably start logging fine-grained data to a database at some point, and can start doing fun things like using the firing rate to estimate gas usage over the month, look at therms per heating degree days in near real time, etc.
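
The gas-usage idea can be sketched in a few lines of Python. This is only an illustration under two loud assumptions: that gas input scales linearly with the reported firing rate, and that the boiler's maximum input is known (the 100,000 BTU/h figure below is a placeholder, not a number from the boiler's documentation). The samples are made up.

```python
from datetime import datetime

BTU_PER_THERM = 100_000

def estimate_therms(samples, max_input_btu_per_hr):
    """Integrate firing rate over time (trapezoid rule) to estimate gas use.

    samples: list of (datetime, firing_rate_percent), sorted by time.
    Assumes gas input scales linearly with the reported firing rate.
    """
    btu = 0.0
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        mean_fraction = (r0 + r1) / 2.0 / 100.0
        btu += mean_fraction * max_input_btu_per_hr * hours
    return btu / BTU_PER_THERM

# Made-up samples; 100,000 BTU/h is a placeholder maximum input.
samples = [
    (datetime(2014, 11, 10, 6, 0), 25.0),
    (datetime(2014, 11, 10, 7, 0), 25.0),
    (datetime(2014, 11, 10, 8, 0), 75.0),
]
print(f"{estimate_therms(samples, 100_000):.3f} therms")
```

Dividing a period's total by the heating degree days for the same period would give the therms-per-degree-day figure mentioned above.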

Other boilers have ModBus interfaces as well; in particular Lochinvar and NTI both mention the capability, although I haven’t yet found documentation on what data is available.  If you have a boiler with ModBus and want to try this hack, drop me a line – I’d love to see how this looks in other homes, and I’d be glad to help you set up a collectd config file.  I may see about creating a Raspberry Pi image to make this all more or less work out of the box.

Again, take a look at the more detailed writeup for a bit more info, in particular the patches I used with the upstream tools, and if you try this at home & make it work, share what you find!

November 10, 2014 11:14 PM

November 09, 2014

Pavel Machek: Dialer for ofono?

I have stock Debian running on Nokia n900, with ofono stack (on 3.18-rc1 and nfsroot)... and would like some GUI dialer. There's none in ofono project. mer had some, so I went to ... but it gives me "service temporarily unavailable"... I was told to look at telepathy-ring, but that is not in Debian 7.7, and would have a lot of dependencies. Any ideas where to get sources of dialer from mer or what other software to use?

And... Is there a recommended camera application in Debian? I'd like to test that the drivers still work...

November 09, 2014 09:46 PM

November 07, 2014

Dave Airlie: more on Displaylink3 and HDCP encryption

okay another braindump (still nothing working).

The git repo mentioned in previous post has all the code I've hacked up so far.

I finished writing the HDCP protocol stages, and sending all the msgs and getting replies from the device.

So I've successfully reached a point where I've negotiated a HDCP session key with the device, and we are both happy about it. Unfortunately I've no idea what I'm meant to be encrypting to send to the device. The next packet the USB traces contain is 384 bytes of encrypted data.

Now HDCP v2 had a vulnerability in its key negotiation, and I've written code to try to use this fact. So I've taken a trace I made from Windows and extracted the necessary bits; using that, I've managed to derive the master key used in that trace, and subsequently the session key for it. I've then replayed the first encrypted packet from the trace to the device and got an encrypted response, the same as in the trace.

I've tried changing a bit in the session key, riv value and data I'm sending, and doing that causes the device not to reply with the answer. This to me implies that the device is using the HDCP cipher to encode the control channel. Now HDCP does say you should only do this for video streams, but maybe DisplayLink forgot to read that bit.

Now where does this leave me? In theory I should be able to replay the full trace (haven't had time yet) and I should see the same picture on screen as I did (though I can't remember what monitor/device I used, so I might have to retrace and restage my tests before then).

However I really need to decrypt the encrypted data in the trace, and from reading the HDCP spec the only values I need to feed the AES engine are ks ^ lc128, riv, streamctr, inputctr. I'm assuming streamctr and inputctr are 0 for the first packet (I could be wrong, maybe they use some wacky streamctr to avoid messing with hdcp), riv and ks I've captured. So lc128 is possibly the crux.
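
To make those inputs concrete, here is a Python sketch of how they are commonly described as combining into an AES-CTR key and counter block. The layout is one reading of the HDCP 2.x spec and should be treated as an assumption to check against it: key = ks XOR lc128, and counter block = (riv XOR streamCtr) || inputCtr, with a 64-bit riv, a 32-bit streamCtr and a 64-bit inputCtr. Every value below is a dummy; no real lc128 or captured data appears here, and the AES step itself is left out.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return bytes(x ^ y for x, y in zip(a, b))

def aes_ctr_inputs(ks: bytes, lc128: bytes, riv: bytes,
                   stream_ctr: int, input_ctr: int):
    """Return (key, counter_block) for the AES-CTR keystream.

    Assumed layout: key = ks XOR lc128 (16 bytes);
    block = (riv XOR streamCtr) || inputCtr, with the 32-bit streamCtr
    XORed into the low 32 bits of the 64-bit riv and a 64-bit inputCtr.
    """
    if len(riv) != 8:
        raise ValueError("riv must be 8 bytes")
    key = xor_bytes(ks, lc128)
    riv_mixed = int.from_bytes(riv, "big") ^ (stream_ctr & 0xFFFFFFFF)
    block = riv_mixed.to_bytes(8, "big") + input_ctr.to_bytes(8, "big")
    return key, block

# Dummy values only; the real lc128 is secret and not reproduced here.
ks = bytes(range(16))
lc128 = bytes([0xFF] * 16)
riv = bytes.fromhex("0102030405060708")
key, block = aes_ctr_inputs(ks, lc128, riv, stream_ctr=0, input_ctr=0)
print(key.hex(), block.hex())
```

With streamctr and inputctr both zero, the counter block is just riv followed by zeros, so in this reading the only unknown left in the cipher setup is lc128.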

Now what is lc128? It's a secret 128-bit value in the HDCP world given only to HDCP adopters. It's normally something you'd store in hw on the GPU etc as an input to the hw cipher. But in displaylink there is no GPU encrypting the data. Now it's possible that displaylink don't use the same lc128 as the HDCP people, unlikely but possible. Maybe they cipher their streams with their own lc128, and only use the official hdcp lc128 for actual HDCP streams.

I don't think lc128 has leaked, and I'm not sure what the consequences of it leaking would be, but hey, it's just a magic number, and if displaylink are using it as an input to their AES code, it must be in RAM at some point. Now I need to figure out ways to work that out. I'm not sure how long it would take to brute force a 128-bit key space, probably impossible.

At any point if someone from DisplayLink wants to talk, you know where to find me :-)

November 07, 2014 04:20 AM

November 03, 2014

Daniel Vetter: Atomic Modeset Support for KMS Drivers

So I've just reposted my atomic modeset helper series, and since the main goal of all that work was to ensure a smooth and simple transition for existing drivers to the promised atomic land it's time to elaborate a bit. The big problem is that the existing helper libraries and callbacks to driver backends don't really fit the new semantics, so some shuffling was required to avoid long-term pain. So if you are a driver writer and just interested in the details then read on for what needs to be done to support atomic modeset updates using these new helper libraries.

Phase 1: Reworking the Driver Backend Functions for Planes

The first phase is reworking the driver backend callbacks to fit the new world. There are two big mismatches between the new atomic semantics and legacy ioctl interfaces:

Both issues are addressed by adding new driver backend callbacks. Furthermore a few transitional helper functions are provided to implement the legacy entry points in terms of these new callbacks. That way the driver backend can be reworked without the additional hassle of needing to deal with all the atomic state object handling and check/commit semantics.

The first step is to rework the ->disable/update_plane hooks using the transitional helper implementations drm_plane_helper_update/disable. These need the following new driver callbacks:
With this it's also easy to implement universal plane support directly, instead of with the default implementation which doesn't allow the primary plane to be disabled. Universal planes are a requirement for atomic and need to be implemented in phase 1, but testing the primary plane support is also a good preparation for the next step:

The new crtc->mode_set_nofb callback must be implemented, which just updates the CRTC timings and data in the hardware without touching the primary plane state at all. The provided helper functions drm_helper_crtc_mode_set and drm_helper_crtc_mode_set_base then implement the callbacks required by the CRTC helpers in terms of the new ->mode_set_nofb callback and the above newly implemented plane helper callbacks.

Phase 2: Wire up the Atomic State Object Scaffolding

With the completion of phase 1 all the driver backend functions have been adapted to the new requirements of the atomic helper library. The goal of phase 2 is to get all the state object handling needed for atomic updates into place. There are three steps to that:

Phase 3: Rolling out Atomic Support

With the driver backend changes from phase 1 and the state handling changes from phase 2 everything is ready for the step-by-step rollout of atomic support. Presuming nothing was missed this just consists of wiring up the ->atomic_check and ->atomic_commit implementations from the atomic helper library. And then replacing all the legacy entry pointers with the corresponding functions from the atomic helper library to implement them in terms of atomic.

The recommended order is to start with planes first, then test the ->set_config functionality. Page flips and properties are best done later since they likely need some additional work:

Besides these two complications (which might require a bit of work depending upon the driver) this is all that's needed for full atomic modeset and pageflip support.

Follow-up Driver Cleanups

But there's of course quite a bit of cleanup work possible afterwards!

There are some big differences between the old CRTC helper modeset logic and the new one (using the same callbacks, but completely rewritten otherwise) in the atomic helper library:

These are all lessons learned from the i915 modeset rewrite. The only thing missing in the atomic helpers compared to i915 is the state readout and cross-checking support - everything else is there. But even that can be easily implemented by adding hardware state readout callbacks and using them in the various state reset functions (to reconstruct matching software state) and also to cross-check state.

The other big cleanup task is to stop using all the legacy state variables and switch all the driver backend code to only look at the state object structures. The two big examples here are crtc->mode and the plane->fb pointer.

So What Now?

With all that converting drivers should be simple and can be done with a series of not-too-invasive refactorings. But my patch series doesn't yet contain the actual atomic modeset ioctl. So what's left to be done in the drm core?

So still a few things to do, besides adding atomic support to all drivers.

Update: The explanation for how to implement state readout and cross checking was a bit confused, so I reworded that.

November 03, 2014 10:23 PM

Dave Jones: Thoughts on crashdumps.

Linux has what appears to be a useful feature that can be enabled to diagnose tricky kernel bugs. The feature is called kdump. A crashdump mechanism that uses kexec to switch to a different kernel, before writing out memory to disk, nfs, wherever. It’s a pretty neat idea.

Unfortunately, I have _never_ seen it working when I needed it.
I know it’s possible, because some of my co-workers swear by crashdumps for diagnosing tricky RHEL bugs. Every single RHEL release, someone invests the time to fix up a bunch of bugs and get it into a working state again. But because Fedora is constantly moving, it’s near constantly broken in some non-trivial way.

We even have a wiki page telling Fedora users how to enable it. In honesty every time in the past I’ve told a user to try it, I’ve thought to myself “yeah, that isn’t going to work”, and my record for being correct in that regard is pretty damn good. If after 15+[1] years of kernel debugging, _I_ can’t get this thing to work, what hope does the average end-user have ?

In a recent meeting at the office, one of my coworkers enthused about how “it’s so much better now, it just works”. So I thought I’d give it a try again the last few weeks. In that time, I have ended up with a total of zero crash dumps, and I-lost-count-how-many kdump bugs.

Why is it so fragile ? I don’t have a good answer. It tends to have the worst possible failure modes. It’s hard to diagnose bugs that either lock up the machine entirely, or instantly reboot it. When you’re trying to debug something, and then it turns out you need to debug the debugging mechanism, most people probably think “I don’t have time for this shit”, and try alternative avenues of debugging, adding “FIND OUT WHY KDUMP IS FUCKED AGAIN” somewhere near the bottom of their TODO list.

At one point I thought “Maybe I’m just unlucky with hardware choices”[2], but the problems seem to be universal across every machine I’ve tried it on.

No doubt it “works” for some people, in certain circumstances, but this kind of feature has to be reliable at least most of the time to make it even worth trying.

I wish this post had a happy ending where I unveiled some solution to this problem[3], but after needing to travel to a machine that wedged itself after it had crashed for the Nth time this weekend, I’m kind of over kdump.
Sometimes it’s easier just to say “Don’t even bother” and do something entirely different.

[1] Oh god what have I done with my life.
[2] There are no good choices when it comes to computer hardware.
[3] Coming in a future post: Why pstore is the solution to this, and why it’s also completely awful.

November 03, 2014 04:02 PM

November 02, 2014

Daniel Vetter: Neat drm/i915 stuff for 3.18

Since Dave Airlie moved the feature cut-off of the drm-next tree roughly one month ahead it is already time for our regular look at what's ahead. Even though the 3.17 features aren't even released yet.

On the modeset side of things we now have the final pieces for plane rotation support from Sonika Jindal and Ville. The DisplayPort code has also seen lots of improvements, with updated training values in preparation of the latest eDP standard (Sonika Jindal) and support for DP training pattern 3 (Ville). DSI panels now support burst mode (Shobhit) and hdmi conformance has been improved with some fixes from Clint Taylor.

For eDP panels we also have improved panel power sequencing code, mostly to fix issues on Cherryview, from Ville. Ville has also contributed fixes to the VDD handling code, which is used to temporarily enable panel power. And the backlight code learned to handle the bl_power setting so that the backlight can be turned off completely without upsetting the panel's power sequencing, contributed by Jani.

Chris Wilson has also been fairly busy on the modeset code: 3.18 includes his patches to cache EDIDs for a single probe call - unfortunately the full caching solution to keep the EDID around between multiple probe calls isn't merged yet. And pageflips have now improved error detection and recovery logic: In case something goes wrong we shouldn't end up stuck any longer waiting for a pageflip to complete that has been lost by either the hardware or the driver.

Moving on to platform specific work there's been lots of preparations for Skylake, most of it from Damien and Sonika. The actual initial platform enabling is delayed to 3.19 though. On the other end of the timeline Ville fixed up i830M modeset support on a rainy w/e in his vacation, and 3.18 now has all that code. And there have been a lot of Cherryview fixes all over.

Cherryview also gained support for power wells and hence runtime pm (Ville). And on the platform-agnostic feature side, a lot of the preparation for DRRS (dynamic refresh rate switching) is merged; hopefully the actual feature patches from Vandana Kannan will land in 3.19.

Moving on to the render side of the driver there's been a lot of patches to beat the full ppgtt support into shape. The context code has been cleaned up, lifetime handling for ppgtt address spaces is fixed and bad interactions with secure batches are now also rectified. Enabling full ppgtt missed the feature cutoff by a hair though, but it's already enabled for the following release.

Basic support for execlists command submission from Ben Widawsky, Oscar Mateo and Thomas Daniel was also merged. This is the fancy new way to submit commands available on Gen8 and subsequent platforms. It's not yet enabled by default, but since it's a requirement for a lot of cool new features keep an eye on what's going on here. There is also a lot of work going on underneath to enable all this new code in GEM, like preparing to switch away from sequence numbers to tracking gpu progress more abstractly using the driver's request structures.

And this time around there is also some cool stuff going on in the drm core worthy of a shout-out: The vblank handling code is massively revamped, hopefully plugging all the small races, inconsistencies and inefficiencies in that code. And thanks to David Herrmann it is finally possible to write a drm driver without the drm midlayer getting completely in the way of a proper driver load and unload sequence! Unfortunately i915 can't be converted right away since the legacy user modesetting code crucially relies on this midlayer functionality. But that's now well deprecated and hopefully can be removed in one of the next releases.

November 02, 2014 02:31 PM

October 31, 2014

Dave Airlie: a day with DisplayLink USB3 and HDCP

So for some reason I decided to look at the displaylink usb3 adaptors today. (no good news).

This blog post is so I don't forget all of this when I page it out. Notes, HDCP1.0 being broken doesn't matter to this, maybe HDCPv2.0 being a bit broken could be used, but I'm not sure how!

The displaylink USB3 protocol is based on HDCP protocol. I've traced the first few packets and it clearly looks like the host sends two packets


and the device sends back

at least.

AKE_Send_Cert contains a 522 byte certificate, containing a receiver id, public key, some misc bytes and a signature generated with the DCP LLC private key, that you have to verify.

So the HDCP v2.2 spec contains the DCP LLC public key, and I've written some code to verify the certificate using openssl, but it totally fails to work. This is probably due to me doing something stupid, or not understanding what I'm doing. If you are openssl knowledgeable and want to look, the hack fest is

It might be the DisplayLink devices use a different signing key than the DCP LLC one.

That repo contains some code to talk to the device (currently disabled) and do the initial sequence, along with an attempt to verify the cert.

Now once I get past this hurdle, the larger one seems to remain: the HDCP 2.0 spec has a global secret 128-bit value called LC128, that everyone who implements HDCP gets and hides somewhere. It's probably sitting in the displaylink driver in hex, but I'd hope they at least hide it better than that. It may also be supplied by the OS, Windows or OSX. (I've no clue yet). That value is used in the key negotiation.

Now it might be possible that Displaylink allow non-HDCP encrypted data to be sent to the device, in which case win if I can find out where/how to do that, or it might be the device requires HDCP and decrypts non-HDCP content before sending it over VGA/DVI. I've no ideas yet on that front either.

Ah well probably enough learning for today, I knew nothing about HDCP this morning, so I can't say it made my life any better learning about it :-P

October 31, 2014 06:25 AM

October 30, 2014

Matthew Garrett: Hacker News metrics (first rough approach)

I'm not a huge fan of Hacker News[1]. My impression continues to be that it ends up promoting stories that align with the Silicon Valley narrative of meritocracy, technology will fix everything, regulation is the cancer killing agile startups, and discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves.

But as a good data-driven person[2], wouldn't it be nice to have numbers rather than just handwaving? In the absence of a good public dataset, I scraped Hacker Slide to get just over two months of data in the form of hourly snapshots of stories, their age, their score and their position. I then applied a trivial test:

  1. If the story is younger than any other story
  2. and the story has a higher score than that other story
  3. and the story has a worse ranking than that other story
  4. and at least one of these two stories is on the front page
then the story is considered to have been penalised.
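
As a rough illustration, the four-part test above might be coded like this in Python. This is a sketch, not the script actually used for the analysis: the `Story` fields, the 30-story front page size, and the sample snapshot are all assumptions made for the example.

```python
from dataclasses import dataclass

FRONT_PAGE_SIZE = 30  # assumption: positions 1-30 count as the front page

@dataclass
class Story:
    title: str
    age_hours: float
    score: int
    position: int  # 1 is the top of the ranking

def is_penalised(story, snapshot):
    """Apply the four-part test against every other story in one snapshot."""
    for other in snapshot:
        if other is story:
            continue
        younger = story.age_hours < other.age_hours        # condition 1
        higher_score = story.score > other.score           # condition 2
        ranked_worse = story.position > other.position     # condition 3
        on_front_page = min(story.position, other.position) <= FRONT_PAGE_SIZE  # 4
        if younger and higher_score and ranked_worse and on_front_page:
            return True
    return False

# Invented snapshot for illustration.
snapshot = [
    Story("old, low score, ranked first", age_hours=5.0, score=40, position=1),
    Story("young, high score, ranked lower", age_hours=2.0, score=90, position=12),
    Story("young, low score", age_hours=1.0, score=10, position=20),
]
penalised = [s.title for s in snapshot if is_penalised(s, snapshot)]
print(penalised)
```

Only the second story is flagged here: it is younger and higher-scoring than the first yet ranked below it. Running the same check over each hourly snapshot and taking the union would give a per-story penalised flag.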

(note: "penalised" can have several meanings. It may be due to explicit flagging, or it may be due to an automated system deciding that the story is controversial or appears to be supported by a voting ring. There may be other reasons. I haven't attempted to separate them, because for my purposes it doesn't matter. The algorithm is discussed here.)

Now, ideally I'd classify my dataset based on manual analysis and classification of stories, but I'm lazy (see [2]) and so just tried some keyword analysis:

A few things to note:
  1. Lots of stories are penalised. Of the front page stories in my dataset, I count 3240 stories that have some kind of penalty applied, against 2848 that don't. The default seems to be that some kind of detection will kick in.
  2. Stories containing keywords that suggest they refer to issues around social justice appear more likely to be penalised than stories that refer to technical matters.
  3. There are other topics that are also disproportionately likely to be penalised. That's interesting, but not really relevant - I'm not necessarily arguing that social issues are penalised out of an active desire to make them go away, merely that the existing ranking system tends to result in it happening anyway.

This clearly isn't an especially rigorous analysis, and in future I hope to do a better job. But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues. An interesting next step would be to attempt to infer whether the reasons for the penalties are similar between different categories of penalised stories[3], but I'm not sure how practical that is with the publicly available data.

(Raw data is here, penalised stories are here, unpenalised stories are here)

[1] Moving to San Francisco has resulted in it making more sense, but really that just makes me even more depressed.
[2] Ha ha like fuck my PhD's in biology
[3] Perhaps stories about startups tend to get penalised because of voter ring detection from people trying to promote their startup, while stories about social issues tend to get penalised because of controversy detection?

October 30, 2014 06:18 PM

Matthew Garrett: Linux Container Security

First, read these slides. Done? Good.

(Edit: Just to clarify - these are not my slides. They're from a presentation Jerome Petazzoni gave at Linuxcon NA earlier this year)

Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment[1].

Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.

I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace[2].

So that seems like kind of a wash - there's maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.

I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input[3]. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there's tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).

But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:

These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.

[1] Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
[2] There's apparently some support for loading per-namespace Apparmor policies, but that means that the process is no longer confined by the sVirt policy
[3] To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.

October 30, 2014 01:11 AM

Matthew Garrett: On joining the FSF board

I joined the board of directors of the Free Software Foundation a couple of weeks ago. I've been travelling a bunch since then, so haven't really had time to write about it. But since I'm currently waiting for a test job to finish, why not?

It's impossible to overstate how important free software is. A movement that began with a quest to work around a faulty printer is now our greatest defence against a world full of hostile actors. Without the ability to examine software, we can have no real faith that we haven't been put at risk by backdoors introduced through incompetence or malice. Without the freedom to modify software, we have no chance of updating it to deal with the new challenges that we face on a daily basis. Without the freedom to pass that modified software on to others, we are unable to help people who don't have the technical skills to protect themselves.

Free software isn't sufficient for building a trustworthy computing environment, one that not merely protects the user but respects the user. But it is necessary for that, and that's why I continue to evangelise on its behalf at every opportunity.


Free software has a problem. It's natural to write software to satisfy our own needs, but in doing so we write software that doesn't provide as much benefit to people who have different needs. We need to listen to others, improve our knowledge of their requirements and ensure that they are in a position to benefit from the freedoms we espouse. And that means building diverse communities, communities that are inclusive regardless of people's race, gender, sexuality or economic background. Free software that ends up designed primarily to meet the needs of well-off white men is a failure. We do not improve the world by ignoring the majority of people in it. To do that, we need to listen to others. And to do that, we need to ensure that our community is accessible to everybody.

That's not the case right now. We are a community that is disproportionately male, disproportionately white, disproportionately rich. This is made strikingly obvious by looking at the composition of the FSF board, a body made up entirely of white men. In joining the board, I have perpetuated this. I do not bring new experiences. I do not bring an understanding of an entirely different set of problems. I do not serve as an inspiration to groups currently under-represented in our communities. I am, in short, a hypocrite.

So why did I do it? Why have I joined an organisation whose founder I publicly criticised for making sexist jokes in a conference presentation? I'm afraid that my answer may not seem convincing, but in the end it boils down to feeling that I can make more of a difference from within than from outside. I am now in a position to ensure that the board never forgets to consider diversity when making decisions. I am in a position to advocate for programs that build us stronger, more representative communities. I am in a position to take responsibility for our failings and try to do better in future.

People can justifiably conclude that I'm making excuses, and I can make no argument against that other than to ask to be judged by my actions. I hope to be able to look back at my time with the FSF and believe that I helped make a positive difference. But maybe this is hubris. Maybe I am just perpetuating the status quo. If so, I absolutely deserve criticism for my choices. We'll find out in a few years.

October 30, 2014 12:45 AM

October 27, 2014

Paul E. McKenney: Lies that firmware tells RCU

One of the complaints that real-time people have against some firmware is that it lies about its age, attempting to cover up cycle-stealing via SMIs by reprogramming the TSC. Some firmware goes farther and lies about the number of CPUs on the system, apparently on the grounds that more is better, regardless of how many of those alleged CPUs actually exist.

RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. On some systems, this resulted in hundreds of such kthreads on systems with only a few tens of CPUs. But RCU can choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!

Mostly solved, that is.

Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in _rcu_barrier(). So what is rcu_barrier()?

The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses call_rcu(), as described in this LWN article. It is important to note that rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.

However, it is possible to offload invocation of a given CPU's RCU callbacks to rcuo kthreads, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.

In fact, rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.

This approach works quite well in practice.

At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.

The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.
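The difference between the old and the fixed behaviour can be sketched with a toy model. This is plain Python rather than kernel code, and the names `Cpu`, `barrier_targets`, and `barrier_can_complete` are invented for illustration; only the rule itself (skip offloaded CPUs that have no rcuo kthread yet) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    # Hypothetical stand-in for per-CPU RCU state; not a kernel structure.
    offloaded: bool = False
    has_rcuo_kthread: bool = False  # set once the CPU has come online at least once
    callbacks: list = field(default_factory=list)


def barrier_targets(cpus, skip_never_online):
    """Return the indices of the CPUs on which the toy barrier posts a callback."""
    targets = []
    for index, cpu in enumerate(cpus):
        if cpu.offloaded:
            # Offloaded CPUs get a callback regardless of queue state or
            # online/offline status, unless the fix is enabled and the CPU
            # has never had an rcuo kthread spawned for it.
            if skip_never_online and not cpu.has_rcuo_kthread:
                continue
            targets.append(index)
        elif cpu.callbacks:
            # Non-offloaded CPUs only matter if they already have callbacks.
            targets.append(index)
    return targets


def barrier_can_complete(cpus, targets):
    """A callback on an offloaded CPU is only invoked if an rcuo kthread exists."""
    return all(
        (not cpus[i].offloaded) or cpus[i].has_rcuo_kthread for i in targets
    )
```

With one "mythical" offloaded CPU in the list, the old rule (`skip_never_online=False`) picks a target whose callback can never run, so the barrier cannot complete; the new rule leaves that CPU out.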

With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of for_each_possible_cpu(), so here is hoping! ;-)

October 27, 2014 10:09 PM

James Morris: Linux Security Summit 2014 Wrap-Up

The slides from the 2014 Linux Security Summit in August may be found linked at the schedule.

LWN covered both the James Bottomley keynote, and the SELinux on Android talk by Stephen Smalley.

We had an engaging and productive two days, with strong attendance throughout.  We’ll likely follow a similar format next year at LinuxCon.  I hope we can continue to expand the contributor base beyond mostly kernel developers.  We’re doing ok, but can certainly do better.  We’ll also look at finding a sponsor for food next year.

Thanks to those who contributed and attended, to the program committee, and of course, to the events crew at Linux Foundation, who do all of the heavy lifting logistics-wise.

See you next year!

October 27, 2014 12:56 PM

October 23, 2014

Dave Jones: Trinity and pages of random data.

Something trinity uses a lot is pages of random data. They get passed around to syscalls, ioctls, whatever. 5 years ago, before I’d even added multiple children to trinity, this was done using ‘page_rand’. A single page allocated on startup, that was passed around, and scribbled over by anyone who needed something to scribble over.

After the VM work I did earlier this year, where we recycle successful calls to mmap, and inherit them across children, quite a few places started passing around map structs instead. This was good, because it started shaking out the many many kernel bugs that we had lingering in huge page support.

It kind of sucked that we had two sets of routines for doing things like “get a page”, “dirty a page” etc which were fundamentally the same operations, except one set worked on a pointer, and one on a struct. It also sucked that the page_rand code was actually buggy in a number of ways, which showed up as overruns.

Over time, I’ve been trying to move all the code that used page_rand to using mappings instead. Today I finished that work, and ripped out the last vestiges of page_rand support. The only real remnant of the supporting code was some of the dirtying code. We used to have separate ‘dirty page_rand’ and ‘dirty an mmap’ routines. After today’s work, there’s now a single set of functions for mappings. There’s still a bunch more consolidation and cleanup to do, which I’ll get fixed up and merged over the next week.

The only feature that’s now missing is periodic dirtying of mappings. We did this every 100 syscalls for page_rand. Right now we only dirty mmap’s after a mmap() call succeeds, or on an mremap(). I plan on getting this done tomorrow.

The motivation for ripping out all this code, and unifying a lot of the support code is that a lot of code paths get simpler, and more importantly, the code in place now takes ‘len’ arguments, so we’re in a better position to make sure we’re not passing buffers that are too small when we do random syscalls.

In other news: while I was happy to report a few days ago that 3.18rc1 fixed up the btrfs bug that had been bothering me for a while, I’ve now managed to discover two new btrfs bugs [1], [2]. Grumble.

October 23, 2014 02:33 AM

October 19, 2014

Pete Zaitcev: Laptop bleg

I'm considering a laptop (actually two). Requirements:

Where it comes from is mostly my wife's Sony Vaio Z. I used to have a Z back in 2001 or so, when they were in 12" format. It was the best laptop ever, but unfortunately it succumbed to a DC-DC converter failure. The modern Z is not like that Z. The most super annoying problem is that the screws holding the battery failed in an interesting way: it is impossible to remove the battery now. Also, the contact between the battery and the motherboard is marginal. I managed to fix the problem by manufacturing a finely shaped wooden wedge that I drove into a gap and thus extended the life of that thing, but man, Sony, this is disappointing.

Unfortunately, I don't remember if it was Kota or Daisuke, but one of the Japanese guys at a recent Swift Hackathon in Boston had a Z of a similar vintage, and it looked impeccable. Maybe Sony figured that it's going to be the predominant mode of care that their wares receive, and so why not make the modern Z this much cheaper than the old, indestructible Z. But they still charge exorbitant prices.

Lenovo wins a special notice because I had a T400 for 3 years and swore never to deal with it ever again. The biggest problem is the keyboard layout, because I use my left pinky for the control key. I could live with their idiotic placement of Escape, but I refuse to deal with 3 years of physical pain again. Also, their famous quality seems to be slipping, as my mouse button broke within 3 years. Battery died, too. However, the T400 had a very good display, and I would like another like that, if possible.

October 19, 2014 04:29 PM

October 18, 2014

Pavel Machek: N900 nfs root

So you'd like to develop on Nokia N900... It has serial port, but with "interesting" connector. It has keyboard, but with "interesting" keyboard map, you mostly need full X to be useful... and it is too small for serious typing, anyway. You could put root filesystem on SD card, but that is disconnected when back cover is removed. And with back cover in place, you can't reset the machine.

Ok, so NFS. Insecure, tricky to set up, but actually makes the development usable. I started with commit 4f3e8d263^ (because that should have working usb networking according to mailing lists).. and with config from same page. Disadvantage is that video does not work with that configuration... but setting up system blind should not be that hard, right?

Assembling a minimal system with busybox from so I could run the second stage of debootstrap was tricky, and hacking into the resulting debian was not easy, either, but now I have telnet connections and things should only improve.

October 18, 2014 07:43 PM

Dave Jones: Trinity updates

Over a month ago, I posted about some pthreads work I was experimenting with in Trinity, and how that wasn’t really working out. After taking a short vacation, I came back with no real epiphanies, and decided to back-burner that work for now, and instead refocus on fixing up some other annoying problems that I’d stumbled across while doing that experimenting. Some of these problems were actually long-standing bugs in trinity. So that’s pretty much all I’ve been working on for the last month, and I’m now pretty happy with how long it runs for (providing you don’t hit a kernel bug first).

The primary motivation was to fix a problem where trinity’s internal data structures would get corrupted. After a series of debugging patches, I found a number of places where a child process would overrun a buffer it had allocated.

First up: the code that takes syscall arguments and renders them into a human-readable string. In some cases this would write huge strings past the end of the buffer. One example of this was the instance where trinity would generate a random pathname. It would sometimes generate complete garbage, which was fine until it came to printing it out. Fixed by deleting lots of code in the pathname generator. Stressing the negative dentry case was never that interesting anyway. After fixing up a few other cases in the argument generator I looked at the code that performs rendering to buffers. None of this code took length parameters, or took into account the remaining space in the buffers. A fairly quick rewrite took care of that.
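The idea of rendering with an explicit bound can be sketched as follows. This is a Python model of the technique, not trinity's C code: `render_bounded` is a hypothetical helper, and a Python `bytearray` cannot overrun the way a C buffer can, so the sketch only shows the bookkeeping (track the offset used so far, copy no more than the space that remains).

```python
def render_bounded(buf: bytearray, used: int, text: str) -> int:
    """Copy as much of `text` as still fits into `buf` at offset `used`.

    Returns the new offset. The buffer never grows: anything that does not
    fit in the remaining space is silently truncated.
    """
    data = text.encode("utf-8", errors="replace")
    room = max(len(buf) - used, 0)       # space left in the buffer
    chunk = data[:room]                  # truncate instead of overrunning
    buf[used:used + len(chunk)] = chunk  # same-length slice, size unchanged
    return used + len(chunk)
```

Each caller passes the running offset back in, so a garbage pathname of any length can cost at most the remainder of the buffer.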

After these bugs were fixed trinity would (on a good kernel) run for a really long time without incident. With longer runtimes, a few more obscure corner cases turned up.

There were 2-3 cases where the watchdog process would hang waiting for a condition that would never be met (due to losing track of how many running child processes there were). I’m still not happy that this can even occur but it is at least a little less likely to hang when it happens now. I’ll investigate the actual cause for this later.

Another fun watchdog bug: we keep track of the time stamp a child performed its last syscall at, and check to make sure 1 second later that it has increased by some small amount. To make sure we haven’t corrupted our own state, there’s also a sanity check that we haven’t jumped into the future. But we also have to compensate for the possibility that adjtimex was the random syscall we did. That takes a maximum offset of 2145. The code checked for that but forgot to also add the one second since the last time we checked.
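A minimal sketch of that sanity check, in Python rather than trinity's C. The function and constant names are made up for illustration, and the unit of the 2145 figure is assumed here to be seconds; the only things taken from the description above are the 2145 offset and the extra one second between checks.

```python
ADJTIMEX_MAX_OFFSET = 2145  # figure quoted above; unit assumed to be seconds
CHECK_INTERVAL = 1          # the watchdog looks again one second later


def timestamp_is_sane(previous: int, current: int, include_interval: bool = True) -> bool:
    """Return True if `current` has not jumped implausibly far past `previous`.

    With include_interval=False this models the buggy check, which allowed
    for the adjtimex offset but forgot the second that passes between checks.
    """
    slack = ADJTIMEX_MAX_OFFSET
    if include_interval:
        slack += CHECK_INTERVAL
    return current - previous <= slack
```

A child whose clock moved by the full adjtimex offset and then ran for the one-second interval lands exactly on the boundary that the buggy variant rejects.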

There’s been a bunch of small 1-2 fixes like this lately, but I’m sitting on a larger set of changes that I’ll start to trickle into git next week, which moves towards cleaning up the “create a random page to pass to syscalls” code, which has been another fun source of corruption bugs.

In kernel news: The only interesting bugs this week that Trinity has shown up, have been two ext4 bugs. Diagnosing those has pointed out some more enhancements that are needed to the post-mortem code in trinity. Once I’ve cleared the current backlog of patches, I’ll work on adding better tracking of fd’s in the logging code. In other news, the btrfs bug trinity hit in August is now fixed in 3.17+ git.

October 18, 2014 03:11 PM

October 16, 2014

Michael Kerrisk (manpages): man-pages-3.75 is released

I've released man-pages-3.75. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This is a quite small release. The most notable changes in man-pages-3.75 are the following:

October 16, 2014 08:47 AM

October 10, 2014

Grant Likely: is down

For anyone who has been using, the server is currently down and I don’t know when it will be back up. I’ve moved my Linux kernel tree over to The new tree can be found here:

The other trees will be back when returns to life.

October 10, 2014 01:53 PM

October 06, 2014

Valerie Aurora: Operating systems war story: How feminism helped me solve one of file systems’ oldest conundrums

A smiling woman with pink hair wearing a t shirt with the word "O_PONIES" in Courier font

Valerie Aurora in 2009 (keep reading to find out why my shirt says “O_PONIES”)
CC BY NC-SA Robert Kaye

Hi, my name is Valerie Aurora, and I am the inventor of a software feature that has prevented billions of unnecessary writes to hard drives, saving energy and making our computers faster. My invention is called “relative atime,” and this is the story of how my feminist approach to computing helped me invent it – and what you can do to support women in open source software. (If you’re already convinced we need more women in open source, here’s a link to donate now to the Ada Initiative’s 2014 fundraising drive. My operating systems war story will still be here when you’re finished!)

Donate now

First, a little background for those of you who don’t live and breathe UNIX file systems performance. Ingo Molnar once called the access time, or “atime” feature of UNIX file systems “perhaps the most stupid Unix design idea of all times.” That’s harsh but fair. See, every time you read a file on a UNIX operating system – which includes OS X, Linux, and Android[1] – it is supposed to update the file to record the last time it was read, or accessed. This is called the access time or atime. Cool, right? You can imagine why it’s helpful to know when was the last time anything read a particular file – you can tell if you have new mail, for example, or figure out which files you haven’t used in a while and can throw away.

The problem with the atime feature is that updating atime requires writing to the disk. So every read to a file creates a tiny disk write – and writes are expensive and slow. (SSDs don’t get rid of this problem; you still don’t want to do unnecessary writes and most of the world’s data is still on spinning disks.) Here’s what Ingo said about this in 2006: “Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_.”

So, atime is a terrible idea – why don’t we just turn it off? That’s what many people did, using the “noatime” option that many file systems provide. The problem was that many programs did need to know the atime of a file to work properly. So most Linux distributions shipped with atime on, and it was up to the user to remember to turn it off (if they could). It was a bad situation.

A cartoon of a woman driving a robot penguin

LinuxChix logo

In 2006, I was a Linux file systems developer and also an active member of LinuxChix, a group for women who used Linux. LinuxChix existed in part because it was impossible to have technical discussions about Linux on most mailing lists without people insulting and flaming you for asking the simplest questions – and it was ten times worse for people with feminine usernames. Tell a cautionary story about installing RAM correctly, and the response might be a sneering, “Oh, you didn’t let out the magic smoke, did you?” On LinuxChix, that kind of obnoxiousness wasn’t allowed (though we still got a lot of what is now called mansplaining.)

So when I advised several people in LinuxChix to turn off atime, a friend felt safe telling me that hey, performance on her laptop was better, but now Mutt, the email reader we both used, thought she always had new email. This is because in her configuration, Mutt would look at an email file and compare its atime with the file’s last written time to figure out if any new email had arrived since the last time it read the file.

Now, the typical answer to “Mutt doesn’t work with noatime” was “Switch to a slower directory-based method,” or “Use a file size hack that had bugs,” or any number of other unhelpful things. Mostly, people just wouldn’t bother reporting things that broke with noatime. But I was part of a culture – a feminist culture – in which I respected people like my friend and programmers that attempted to use fully defined, useful features of UNIX in order to implement features efficiently.

I decided to look at the problem from a human point of view. What my friend and the Mutt programmers really wanted to know was this: Has this file been written since the last time I read it? They didn’t particularly care about the exact time of the last read, they just wanted to know if it had been read before or after the last write. I had an idea: What if we only updated a file’s atime if it would change the answer to the question, “Has this file been read since the last time it was written?” I called it “relative atime.”[2]

The amazing thing is: it worked! Matthew Garrett (also a known feminist), Ingo Molnar, and Andrew Morton made some changes to the patch, including updating the atime if the current atime was more than 24 hours ago. Other than that, this incredibly simple algorithm worked well enough that in 2009, relative atime became the default in the mainline Linux kernel tree. Now, by default, people’s computers were fast and their programs worked.
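The rule is small enough to write down. What follows is a sketch in Python, not the kernel's C implementation: the function names are invented, timestamps are plain integers in seconds, and only the two conditions described above are modelled (the file was written since its last recorded read, or the recorded atime is at least 24 hours old). The real in-kernel check is believed to consider the inode change time as well, which is left out here.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_should_update(atime: int, mtime: int, now: int) -> bool:
    """Decide whether a read at time `now` should write a new atime."""
    if atime <= mtime:
        # The file was written after (or at) the last recorded read, so
        # recording this read changes the answer to "read since written?".
        return True
    if now - atime >= DAY:
        # The recorded atime is stale by a day or more: refresh it anyway.
        return True
    return False


def mailbox_has_new_mail(atime: int, mtime: int) -> bool:
    """The Mutt-style check described earlier: written since last read?"""
    return mtime > atime
```

For example, with mail delivered at t=100 and the mailbox last read at t=50, a read at t=110 updates atime and the new-mail check goes quiet; a second read at t=120 needs no disk write at all.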

I came up with this idea and the original patch in 2006, when the atime problem had been known for many years. Previous solutions had taken a very file-system-centric point of view, mainly along the lines of buffering up atime updates in memory and writing them out when we ran out of memory. What led me to a creative, simple, and extremely fast solution was being part of a feminist community in which people felt comfortable sharing their technical problems, wanted to help each other, and respected each other’s intelligence. Those are all feminist principles, and they make file systems development better.

I try to take that human-centered, feminist approach with other topics in file systems, including the great fsync()/rename() debate of 2009 (a.k.a “O_PONIES”) in which I argued that file systems developers should strive to make life easier for developers and users, not harder. As recently as 2013, a leading file systems developer was still arguing that file systems didn’t have to save file data reliably by mocking users for playing computer games.

I was working on another human-centered file system feature, union mounts, when I heard that a friend of mine had been groped at an open source conference for the third time in one year. While I loved my file systems work, I felt like stopping sexual harassment and assault of women in open source was more urgent, and that I was uniquely qualified to work on it. (I myself had been groped by another Linux storage developer.) So I quit my job as a Linux kernel developer and co-founded the Ada Initiative, whose mission is supporting women in open technology and culture. Unfortunately, as a result of my work, several more Linux storage developers came out publicly in favor of harassment and assault.

That’s one reason why I’m so excited that Ceph developer Sage Weil challenged the open storage community to raise $8192 for the Ada Initiative by Wednesday, Oct. 8 – and he’ll personally match that amount if we reach the goal! UPDATED: Sage and Mike Perez raised this to $16,384!!! The number of Linux file systems and storage developers who both donated to Sage’s challenge and wanted to be listed publicly as supporters is reminding me that the vast majority of the people I worked with in Linux want women to feel safe and comfortable in their community. I love file systems development, I love writing kernel code, and I miss working with and seeing my Linux friends. And as you can tell by the lack of something like union mounts in the mainline kernel 21 years after the first implementation, Linux file systems and storage does not have enough developers, and can’t afford to keep driving off women developers.

A woman sitting at a table explaining something with her hands

Me teaching the Ally Skills Workshop

The Ada Initiative is capable of changing this situation. In August 2014, I taught the first Ally Skills Workshop at a Linux Foundation-run conference, LinuxCon North America. The Ally Skills Workshop teaches men simple everyday ways to support women in their workplaces and communities, and teaching it is my favorite part of my work. I was happy to see several Linux file systems and storage developers at the workshop. I was still nervous about running into the developers who support harassment and assault, but seeing how excited people were after the Ally Skills Workshop made it all worthwhile.

If you’d like to see more people working on Linux storage and file systems, and especially more women, please join Sage Weil and more than 30 other open storage developers in supporting the Ada Initiative. Donate now:

Donate now

Edited to add 10/6/2014: Sage made his goal, hurray! And here’s my favorite comment from the HN thread about this story, the only one actually flagged into non-existence (plenty of other creepy misogyny elsewhere though):

Screen Shot 2014-10-05 at 10.18.37 PM

Also, I had no idea Lennart Poettering planned to post this detailed description of the abuse, harassment, and death threats he’s suffered as an open source developer.

We’re still raising money for Ada Initiative to fight this kind of harassment, so feel free to donate:

Donate now

[1] Yes, Android is Linux too, I’m just naming the brands that non-operating systems experts would recognize.

[2] “Relative atime” isn’t so bad, but the name of the option that you pass to the kernel, “relatime”, showed my usual infelicity with naming things as it looks like a misspelling of “realtime”.

Tagged: ada initiative, feminism, filesystems, kernel, linux

October 06, 2014 09:36 PM

Michael Kerrisk (manpages): man-pages-3.74 is released

I've released man-pages-3.74. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

Aside from various minor changes to many pages, the most notable changes in man-pages-3.74 are the following:

October 06, 2014 06:06 PM

October 02, 2014

Matthew Garrett: Actions have consequences (or: why I'm not fixing Intel's bugs any more)

A lot of the kernel work I've ended up doing has involved dealing with bugs on Intel-based systems - figuring out interactions between their hardware and firmware, reverse engineering features that they refuse to document, improving their power management support, handling platform integration stuff for their GPUs and so on. Some of this I've been paid for, but a bunch has been unpaid work in my spare time[1].

Recently, as part of the anti-women #GamerGate campaign[2], a set of awful humans convinced Intel to terminate an advertising campaign because the site hosting the campaign had dared to suggest that the sexism present throughout the gaming industry might be a problem. Despite being awful humans, it is absolutely their right to request that a company choose to spend its money in a different way. And despite it being a dreadful decision, Intel is obviously entitled to spend their money as they wish. But I'm also free to spend my unpaid spare time as I wish, and I no longer wish to spend it doing unpaid work to enable an abhorrently-behaving company to sell more hardware. I won't be working on any Intel-specific bugs. I won't be reverse engineering any Intel-based features[3]. If the backlight on your laptop with an Intel GPU doesn't work, the number of fucks I'll be giving will fail to register on even the most sensitive measuring device.

On the plus side, this is probably going to significantly reduce my gin consumption.

[1] In the spirit of full disclosure: in some cases this has resulted in me being sent laptops in order to figure stuff out, and I was not always asked to return those laptops. My current laptop was purchased by me.

[2] I appreciate that there are some people involved in this campaign who earnestly believe that they are working to improve the state of professional ethics in games media. That is a worthy goal! But you're allying yourself to a cause that disproportionately attacks women while ignoring almost every other conflict of interest in the industry. If this is what you care about, find a new way to do it - and perhaps deal with the rather more obvious cases involving giant corporations, rather than obsessing over indie developers.

For avoidance of doubt, any comments arguing this point will be replaced with the phrase "Fart fart fart".

[3] Except for the purposes of finding entertaining security bugs

October 02, 2014 11:27 PM

October 01, 2014

Matt Domsch: Spamfighting: updated opendmarc packages, handling DMARC p=reject

I took a few months off from dealing with my spam problems, choosing to stick my head in the sand. Probably not my wisest move…

In the interim, the opendmarc developers have been busy, releasing version 1.3.0, which also adds the nice feature of doing SPF checking internally. This lets me CLOSE WONTFIX the smf-spf and libspf2 packages from the Fedora review process and remove them from my system. “All code has bugs. Unmaintained code with bugs that you aren’t running can’t harm you.” New packages and the open Fedora review are available.

I’ve also had several complaints from friends, all users, who have been sending mail to myself and family In most cases, simply forwards the emails on to yet other mail provider – it’s providing a mail forwarding service for a vanity domain name. However, now that Yahoo and AOL are publishing DMARC p=reject rules, after forwarded the mail on to its ultimate home, those downstream servers were rejecting the messages (presumably on SPF grounds – isn’t a valid mail server for

My solution to this is a bit awkward, but will work for a while. Instead of forwarding mail from domains with DMARC p=reject or p=quarantine, I now store them and serve them up via POP3/IMAP to their ultimate destination. I’m using procmail to do the forwarding:

SENDER=`formail -c -x Return-Path`

# forward all mail except dmarc policy reject|quarantine.
:0 H
* ? formail -x'From:' | grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' | xargs opendmarc-check | egrep -s 'Domain policy: (reject|quarantine)'


This introduces quite a bit of latency (on the order of an hour) for mail delivery from my friends with addresses, but it keeps them from getting rejected due to their email provider’s lousy choice of policy.
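The decision that the recipe's condition line makes can be sketched in Python. This is an illustration only: the helper names are hypothetical, the address pattern is an ASCII approximation of the POSIX character class used in the recipe, and the sketch does not run opendmarc-check itself. It simply takes whatever text that tool printed and looks for the same "Domain policy:" line the recipe greps for.

```python
import re

# ASCII approximation of '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' from the recipe.
ADDRESS_RE = re.compile(r"[A-Za-z0-9+._-]*@[A-Za-z0-9+._-]*")
# Same test as the egrep at the end of the recipe's pipeline.
POLICY_RE = re.compile(r"Domain policy: (reject|quarantine)")


def extract_address(from_header: str):
    """Return the first address-like token in a From: header, or None."""
    match = ADDRESS_RE.search(from_header)
    return match.group(0) if match else None


def policy_blocks_forwarding(check_output: str) -> bool:
    """True if the checker's output reports a reject or quarantine policy."""
    return POLICY_RE.search(check_output) is not None
```

Mail whose sender domain trips `policy_blocks_forwarding` is the mail that gets stored for POP3/IMAP pickup rather than forwarded.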

Tim Draegen, the guy behind the excellent, is chairing a new IETF working group focusing on proper handling of “indirect email flows” such as mailing lists and vanity domain forwarding. I’m hoping to have time to get involved there. If you care, follow along on their mailing lists.

I’m choosing to ignore the fact that is getting spoofed 800k times a week (as reported by 8 mail providers and visualized nicely on, at least for now. I’m hoping the new working group can come up with a method to help solve this.

Do your friends use a mail service publishing DMARC p=reject? Has it caused problems for you? Let me know in the comments below.

October 01, 2014 08:45 PM

Eric Sandeen: Sage’s challenge to the open storage community – support the Ada Initiative!

The Ada Initiative supports women in open technology and culture – looking around my place of work and various conference halls I’ve visited, I think there’s little doubt that we’ve still got an old boy network running here, and I’d like to see that change.  I have daughters who may or may not get deep into tech culture, and I’m glad there are people like Val working to make it a better place for them if they do.

And she’s recently gotten a nice bit of potential help: Sage Weil of Ceph & Dreamhost fame has issued a challenge: Raise $8192 in 8 days, and he’ll match it.  They’re already on their way.  If you can help, please do.  Power-of-two donations encouraged, but not required.  :) Click the counter below to see their donation page.  Thanks!

Donate now

October 01, 2014 08:14 PM

September 26, 2014

Matt Domsch: [REPOST] Who am I?

I’ve started blogging again on the Dell TechCenter site, Enterprise Mobility Management section, along with the rest of my team.

Here’s the intro to my first post, “Who am I?”:

The existential question, asked by everyone and everything throughout their lifetimes – who am I? High school seniors choosing a college, college seniors considering grad school or entering the job market, adults in the midst of their mid-life crisis—the question comes far easier than the answer.

In the world of technology, who you are depends on the technology with which you are interacting. On Facebook, you are your quirky personal self, with pictures of your family and vacations you take. On LinkedIn, you are your professional self, sharing articles and achievements that are aligned with your career.

What about on the myriad devices you carry around? On the smartphone in my pocket, I have several personas—personal, business, gamer (my kids borrow my phone), constantly context-switching between them. In the not-too-distant past, people would carry two phones—one for personal use and one for work, keeping the personas separate via physical separation—two personas, two devices.

Read more…

September 26, 2014 03:47 PM

September 25, 2014

Andy Grover: A program calling command-line tools is the moral equivalent of web scraping.

I gave this talk at LPC 2012. It promotes the idea that layering programs on top of human-centric interfaces is a bad idea.

Download the PDF file .

The timing of this post with the announcement of the most recent bash vulnerability is not entirely coincidental.

September 25, 2014 06:48 PM

September 24, 2014

Ted Tso: “How Google Works” giveaway

The authors of “How Google Works” have given electronic versions of “How Google Works” to all Google employees. Since I had already purchased a copy via pre-order, to make life interesting, I’ve decided to give my Google Play coupon code to someone via an electronic lottery.  (Edit: the coupon code will only work for people with Google accounts in the US; if you live outside of the US, my apologies, but the coupon code will not work for you.)

I will be using the procedure documented by RFC-3797 to select someone from the list of people who have sent-email to, snapshotted at Noon US/Eastern on Friday, September 26th, 2014, using as inputs into the RFC-3797 algorithm: (1) the daily volume of GOOG on September 26th, 2014 as reported by, (2) the daily volume of GOOGL on September 26th, 2014 as reported by, and (3) the Massachusetts Powerball Lottery Numbers for September 27th, 2014. If any of these values are not available for any reason, the values for the next trading day or lottery draw will be used.
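
For the curious, a publicly verifiable draw of this kind can be sketched roughly as follows. This is a simplified illustration based on my reading of RFC 3797, not a conforming implementation, so check the RFC before relying on it. The entrant addresses and all input numbers below are invented placeholders, and the function names are my own.

```python
import hashlib


def key_string(sources):
    """Join each source's numbers as 'a.b.c./', sorted within a source."""
    return "".join(
        "".join(f"{n}." for n in sorted(values)) + "/" for values in sources
    )


def select(pool, sources, count=1):
    """Pick `count` entries from `pool`, determined only by `sources`."""
    if count > len(pool):
        raise ValueError("cannot select more entries than the pool holds")
    key = key_string(sources).encode("ascii")
    remaining = sorted(pool)  # a published, agreed ordering of the entrants
    chosen = []
    for i in range(count):
        index = i.to_bytes(2, "big")  # two-byte counter, network order
        digest = hashlib.md5(index + key + index).digest()
        # Treat the hash as one big integer and reduce it to a position.
        position = int.from_bytes(digest, "big") % len(remaining)
        chosen.append(remaining.pop(position))
    return chosen


if __name__ == "__main__":
    # Placeholder inputs: two trading volumes and one lottery draw.
    sources = [[1234567], [7654321], [5, 12, 23, 34, 45, 9]]
    entrants = ["alice@example.org", "bob@example.org", "carol@example.org"]
    print(key_string(sources))
    print(select(entrants, sources))
```

The point of the design is that the entrant list is frozen before the inputs exist, and anyone can rerun the computation afterwards and get the same winner.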

Have fun!

September 24, 2014 02:55 PM

Matthew Garrett: My free software will respect users or it will be bullshit

I had dinner with a friend this evening and ended up discussing the FSF's four freedoms. The fundamental premise of the discussion was that the freedoms guaranteed by free software are largely academic unless you fall into one of two categories - someone who is sufficiently skilled in the arts of software development to examine and modify software to meet their own needs, or someone who is sufficiently privileged[1] to be able to encourage developers to modify the software to meet their needs.

The problem is that most people don't fall into either of these categories, and so the benefits of free software are often largely theoretical to them. Concentrating on philosophical freedoms without considering whether these freedoms provide meaningful benefits to most users risks these freedoms being perceived as abstract ideals, divorced from the real world - nice to have, but fundamentally not important. How can we tie these freedoms to issues that affect users on a daily basis?

In the past the answer would probably have been along the lines of "Free software inherently respects users", but reality has pretty clearly disproven that. Unity is free software that is fundamentally designed to tie the user into services that provide financial benefit to Canonical, with user privacy as a secondary concern. Despite Android largely being free software, many users are left with phones that no longer receive security updates[2]. Textsecure is free software but the author requests that builds not be uploaded to third party app stores because there's no meaningful way for users to verify that the code has not been modified - and there's a direct incentive for hostile actors to modify the software in order to circumvent the security of messages sent via it.

We're left in an awkward situation. Free software is fundamental to providing user privacy. The ability for third parties to continue providing security updates is vital for ensuring user safety. But in the real world, we are failing to make this argument - the freedoms we provide are largely theoretical for most users. The nominal security and privacy benefits we provide frequently don't make it to the real world. If users do wish to take advantage of the four freedoms, they frequently do so at a potential cost of security and privacy. Our focus on the four freedoms may be coming at a cost to the pragmatic freedoms that our users desire - the freedom to be free of surveillance (be that government or corporate), the freedom to receive security updates without having to purchase new hardware on a regular basis, the freedom to choose to run free software without having to give up basic safety features.

That's why projects like the GNOME safety and privacy team are so important. This is an example of tying the four freedoms to real-world user benefits, demonstrating that free software can be written and managed in such a way that it actually makes life better for the average user. Designing code so that users are fundamentally in control of any privacy tradeoffs they make is critical to empowering users to make informed decisions. Committing to meaningful audits of all network transmissions to ensure they don't leak personal data is vital in demonstrating that developers fundamentally respect the rights of those users. Working on designing security measures that make it difficult for a user to be tricked into handing over access to private data is going to be a necessary precaution against hostile actors, and getting it wrong is going to ruin lives.

The four freedoms are only meaningful if they result in real-world benefits to the entire population, not a privileged minority. If your approach to releasing free software is merely to ensure that it has an approved license and throw it over the wall, you're doing it wrong. We need to design software from the ground up in such a way that those freedoms provide immediate and real benefits to our users. Anything else is a failure.

(title courtesy of My Feminism will be Intersectional or it will be Bullshit by Flavia Dzodan. While I'm less angry, I'm solidly convinced that free software that does nothing to respect or empower users is an absolute waste of time)

[1] Either in the sense of having enough money that you can simply pay, having enough background in the field that you can file meaningful bug reports or having enough followers on Twitter that simply complaining about something results in people fixing it for you

[2] The free software nature of Android often makes it possible for users to receive security updates from a third party, but this is not always the case. Free software makes this kind of support more likely, but it is in no way guaranteed.

September 24, 2014 06:59 AM

September 23, 2014

Grant Likely: Don’t fear ACPI on ARM

Before everyone freaks out about Matthew Garrett’s post regarding ACPI on ARM, here are a few things to keep in mind:

First, when we’re talking about Linux and ACPI on ARM, we’re talking about general purpose servers. In the general purpose server market, Linux is already the dominant OS, regardless of the CPU architecture. Servers are designed, built and sold to run Linux. It is already the situation that x86 server vendors build their ACPI tables to work with Linux. Supporting Linux on ARM servers is merely an extension of what vendors are already doing to support Linux on x86. Despite Matthew’s concern, I don’t think we’re entering new territory in this regard.

Second, many of us have bad memories of getting ACPI to work with Linux. However, it is worth remembering that most of our problems have been with machines where the vendor really doesn’t care about Linux – usually desktop or laptop PCs. It’s not surprising that we have problems with these machines since they’ve only been tested with Windows! Server vendors, on the other hand, have a vested interest in ensuring that Linux runs well on their hardware and so they regularly test with Linux. The negative lessons learned in the laptop and desktop markets don’t carry over to machines built to run Linux.

Third, the ACPI world has changed in the last 2 years. It used to be that the ACPI spec was governed in a closed process by 5 companies: HP, Intel, Microsoft, Phoenix, and Toshiba, with nary a Linux person to be seen. Last year ACPI governance was transferred to the UEFI Forum and we’ve got plenty of Linux engineers sitting at the table. In light of that, it is no longer true that ACPI only caters to the needs of Windows, and we have the ability to propose changes to the spec. In fact, if you look at the revision history in version 5.1 of the spec, you’ll find changes that were proposed by Linux engineers to make ARMv8 work.

That said, the issues raised by Matthew are important. There is a big question about how Linux should declare itself to the platform. Claiming to be compatible with “Windows 8” in the ACPI _OSI (Operating System Interface) method obviously isn’t appropriate on ARM. There is some talk about removing _OSI entirely on ARM since the way Linux uses it isn’t actually useful, and the _OSC (Operating System Capability) method has been proposed as a better way to declare what the OS supports. There is also a need to make sure vendors are testing with linux-next and mainline kernels so that we know when breakage happens and we can either do something about it, or work with vendors to fix their firmware.

Both of these are important issues and I think we need to propose solutions before merging ARM ACPI support into the kernel. Some of this work has already started: Linaro is running Canonical’s Firmware Test Suite (FWTS), the ACPI API tests, and the ACPI ASL tests on ARM, and we’re porting the Linux UEFI Verification (LUV) project which packages all the test suites into an easy to use distribution.

While I agree with Matthew that getting the interface between firmware and the OS right is hard, I do not see the nightmare scenario he is describing. It certainly hasn’t played out that way on x86 servers where Linux is already the preferred OS. Besides, I really cannot agree with the premise that Linux being the dominant OS is a bad thing! We have a lot more influence than we give ourselves credit for.

September 23, 2014 10:00 PM

September 21, 2014

Michael Kerrisk (manpages): man-pages-3.73 is released

I've released man-pages-3.73. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

The most notable changes in man-pages-3.73 are various new and modified pages describing namespaces in general, and user and PID namespaces in detail:

September 21, 2014 11:39 AM

September 16, 2014

Matthew Garrett: ACPI, kernels and contracts with firmware

ACPI is a complicated specification - the latest version is 980 pages long. But that's because it's trying to define something complicated: an entire interface for abstracting away hardware details and making it easier for an unmodified OS to boot diverse platforms.

Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.

This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.

Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.

In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.

This sounds horrifying, but it's actually important. The existence of well-defined[1] OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version[2] the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.

That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?

Our even grimmer meathook future welcomes ARM to the ACPI world.

Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.

So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.

Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:

1) Commit to never modifying the ACPI behaviour of Linux.
2) Expose an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bump that whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.

(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.
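
As a rough illustration of what option (2) implies, the sketch below models the kernel and the firmware each advertising a set of behaviour revisions and settling on the newest one they share. No such ACPI interface exists today; the revision numbers and names are hypothetical and only show why backward compatibility paths matter.

```python
def negotiate(os_supported, firmware_supported):
    """Return the newest behaviour revision both sides implement, or None."""
    common = set(os_supported) & set(firmware_supported)
    return max(common) if common else None


# Hypothetical revision numbers, invented for illustration only.
KERNEL_A = {1, 2}      # older kernel
KERNEL_B = {1, 2, 3}   # newer kernel that kept its compatibility paths
KERNEL_C = {3}         # newer kernel that dropped the old behaviour

FIRMWARE_OLD = {2}     # firmware that was only ever tested against revision 2

if __name__ == "__main__":
    print(negotiate(KERNEL_A, FIRMWARE_OLD))  # 2
    print(negotiate(KERNEL_B, FIRMWARE_OLD))  # 2
    print(negotiate(KERNEL_C, FIRMWARE_OLD))  # None
```

The last case is the failure mode: once a kernel stops carrying an old revision, every machine tested only against that revision is stranded.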

[1] Defined by implementation, not defined by specification
[2] Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.

September 16, 2014 10:51 PM

Andy Grover: Emacs and using multiple C code styles

I primarily work on Linux, so I put this in my Emacs config:

; Linux mode for C
(setq c-default-style
      '((c-mode . "linux") (other . "gnu")))

However, other projects like QEMU have their own style preferences. So here’s what I added to use a different style for that. First, I found the qemu C style defined here. Then, to only use this on some C code, we attach a hook that only overrides the default C style if the filename contains “qemu”, an imperfect but decent-enough test.

(defconst qemu-c-style
  '((indent-tabs-mode . nil)
    (c-basic-offset . 4)
    (tab-width . 8)
    (c-comment-only-line-offset . 0)
    (c-hanging-braces-alist . ((substatement-open before after)))
    (c-offsets-alist . ((statement-block-intro . +)
                        (substatement-open . 0)
                        (label . 0)
                        (statement-cont . +)
                        (innamespace . 0)
                        (inline-open . 0)))
    (c-hanging-braces-alist .
                            ((block-close . c-snug-do-while)
                             ;; structs have hanging braces on open
                             (class-open . (after))
                             ;; ditto if statements
                             (substatement-open . (after))
                             ;; and no auto newline at the end
                             (class-close))))
  "QEMU C Programming Style")

(c-add-style "qemu" qemu-c-style)

(defun maybe-qemu-style ()
  (when (and buffer-file-name
       (string-match "qemu" buffer-file-name))
    (c-set-style "qemu")))

(add-hook 'c-mode-hook 'maybe-qemu-style)

September 16, 2014 01:16 AM

September 12, 2014

Dave Jones: Trinity threading improvements and misc

Since my blogging tsunami almost a month ago, I’ve been pretty quiet. The reason being that I’ve been heads down working on some new features for trinity which have turned out to be a lot more involved than I initially anticipated.

Trinity does all of its work in child processes continually forked off from a main process. For a long time I’ve had “investigate using pthreads” as a TODO item, but after various conversations at kernel summit, I decided to bump the priority of that up a little, and spend some time looking at it. I initially guessed that it would take maybe a few weeks to have something usable, but after spending some time working on it, every time I make progress on one issue, it becomes apparent that there’s something else that is also going to need changing.

I’m taking a week off next week to clear my head and hopefully return to this work with fresh eyes, and make more progress, because so far it’s been mostly frustrating, and there may be an easier way to solve some of the problems I’ve been hitting. Sidenote: In the 15+ years I’ve been working on Linux, this is the first time I recall actually ever using pthreads in my own code. I can’t say I’ve been missing out.

Unrelated to that work, a month or so ago I came up with a band-aid fix for a problem where trinity would corrupt its own structures. That ‘fix’ turned out to break the post-mortem work I implemented a few months prior, so I’ve spent some time this week undoing that, and thinking about how I’m going to fix that properly. But before coming up with a fix, I needed to reproduce the problem reliably, and naturally now that I’ve added debug code to determine where the corruption is coming from, the bug has gone into hiding.

I need this vacation.

September 12, 2014 08:14 PM

September 07, 2014

Michael Kerrisk (manpages): man-pages-3.72 is released

I've released man-pages-3.72. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This is a small release; the more notable changes in man-pages-3.72 are the addition of three new pages by Peter Schiffer that document glibc commands used for memory profile and malloc tracing:

September 07, 2014 01:36 PM

September 06, 2014

Pavel Machek: Fraud attempt from DAD GmbH

Got snail mail from DAD GmbH, Postfach 11 35 68, 20435. I should update my business info (which I never gave to them) and by submitting updated info, they would charge me 500 euro (small notice so that you are likely to miss it). I hope they go to jail for this.

September 06, 2014 09:38 PM

September 04, 2014

James Morris: New GPG Key

Just an FYI, I lost my GPG key a few months back during an upgrade, and have created a new one.  This was signed by folk at LinuxCon/KS last month.

The new key ID / fingerprint is: D950053C / 8327 23D0 EF9D D46D 9AC9  C03C AD98 4BBF D950 053C

Please use this key and not the old one!

September 04, 2014 09:38 PM

September 03, 2014

Pavel Machek: Boot shell

Yesterday I got an electric shock. Yes, the device was supposed to be turned off by a remote-control outlet, but I was still stupid to play with it.

Have you ever played the "press any key to stop autoboot" game, followed by copying boot commands from your notes, because you wanted to keep the boot loader in its original (early project phases) or final (late project phases) configuration? Have you reached level 2, playing the autoboot game over the internet?

If so, you may want to take a look at boot shell (bs) from the Not Universal Test System project. In the ideal case, it knows how to turn the target off and on, break into autoboot, boot your target in development mode, and log in as root when userland is ready.

September 03, 2014 09:09 AM

August 29, 2014

Daniel Vetter: Review Training Slides

We currently have a large influx of new people contributing to i915 - for the curious just check the git logs. As part of ramping them up I've done a few trainings about upstream review, and a bunch of people I've talked with at KS in Chicago were interested in that, too. So I've cleaned up the slides a bit and dropped the very few references to Intel internal resources. No speaker notes or video recording, but I think this is useful all in itself. And of course if you have comments or see big gaps - feedback is very much welcome:

Upstream Review Training Slides

August 29, 2014 04:14 PM

August 21, 2014

Michael Kerrisk (manpages): man-pages-3.71 is released

I've released man-pages-3.71. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

As well as many smaller fixes to various pages, the more notable changes in man-pages-3.71 are the following:

August 21, 2014 01:27 PM

August 19, 2014

Rusty Russell: POLLOUT doesn’t mean write(2) won’t block: Part II

My previous discovery that poll() indicating an fd was writable didn’t mean write() wouldn’t block led to some interesting discussion on Google+.

It became clear that there is much confusion over read and write; eg. Linus thought read() was like write() whereas I thought (prior to my last post) that write() was like read(). Both wrong…

Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule.  Quoting Linus:

Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.

I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:

cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock

This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.

What’s worse is the Linux select() man page:

       A file descriptor is considered ready if it is possible to
       perform the corresponding I/O operation (e.g., read(2)) without
       blocking.
       ... those in writefds will be watched to see if a write will
       not block...

And poll():

		Writing now will not block.

Man page patches have been submitted…
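
The safe pattern that follows from all this can be sketched as below: put the descriptor in non-blocking mode before waiting for POLLOUT, and treat readiness as a hint rather than a promise. This is a minimal illustration of my own rather than code from the discussion; the helper name, the pipe and the payload size are arbitrary choices.

```python
import os
import select


def write_when_ready(fd, data, timeout_ms=1000):
    """Write what the kernel will accept now; never block inside write()."""
    os.set_blocking(fd, False)       # the crucial step: set O_NONBLOCK
    poller = select.poll()
    poller.register(fd, select.POLLOUT)
    if not poller.poll(timeout_ms):
        return 0                     # not writable within the timeout
    try:
        return os.write(fd, data)    # may well be a short write
    except BlockingIOError:
        return 0                     # readiness was only a hint


if __name__ == "__main__":
    r, w = os.pipe()
    payload = b"x" * (8 << 20)       # far larger than a default pipe buffer
    sent = write_when_ready(w, payload)
    print(sent, "of", len(payload), "bytes accepted")
    # Nobody is reading, so the pipe stays full: the next attempt gives up
    # after the timeout instead of hanging in write().
    print(write_when_ready(w, payload, timeout_ms=100))
    os.close(r)
    os.close(w)
```

With a blocking descriptor, the same first write would have hung until a reader drained the pipe, even though poll() had just reported it writable.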

August 19, 2014 01:57 PM

August 15, 2014

Dave Jones: A breakdown of Linux kernel networking related issues from Coverity scan

For the last of these breakdowns, I’ll focus on fifth place: networking.

Linux supports many different network protocols, so I spent quite a while splitting the net/ tree into per-protocol components. The result looks like this.

Net-802 8
Net-Bluetooth 15
Net-CAIF 9
Net-Core 11
Net-DCCP 5
Net-IRDA 17
Net-NFC 11
Net-SCTP 18
Net-SunRPC 21
Net-Wireless 9
Net-XFRM 6
Net-bridge 14
Net-ipv4 24
Net-ipv6 16
Net-mac80211 12
Net-sched 5
everything else 124

The networking code has gotten noticeably better over the last year. When I initially introduced these components they were all well into double figures. Now, even crap like DECNET has gotten better (both users will be very happy).

“Everything else” above is actually a screw-up on my part. For some reason around 50 or so netfilter issues haven’t been categorized into their appropriate component. The remaining ~70 are quite a mix, but nearly all are small numbers of issues spread across many components: things like 9p, atm, ax25, batman, can, ceph, l2tp, rds, rxrpc, tipc, vmwsock, and x25. The Lovecraftian protocols you only ever read about.

So networking is in pretty good shape considering just how much stuff it supports. While there are 24 issues in a common protocol like ipv4, they tend to be mostly benign things rather than OMG 24 WAYS THE NSA IS OWNING YOUR LINUX RIGHT NOW.

That’s the last of these breakdowns I’ll do for now. I’ll do this again maybe in six months to a year, if things are dramatically different, but I expect any changes to be minor and incremental rather than anything too surprising.

After I get back from kernel summit and recover from travelling, I’ll start a series of posts showing code examples of the top checkers.

August 15, 2014 09:37 PM

Dave Jones: Breakdown of Linux kernel wireless drivers in Coverity scan

In fourth place on the list of hottest areas of the kernel as seen by Coverity, is drivers/net/wireless.

rtlwifi 96
Atheros 74
brcm80211 67
mwifiex 33
b43 16
iwlwifi 15
everything else 65

I mentioned in my drivers/staging examination that the realtek wifi drivers stood out as especially problematic. Here we see the same situation. Larry Finger has been working on cleaning up this (and other drivers) for some time, but it apparently still has a long way to go.

It’s worth noting that “Atheros” here is actually a number of drivers (ar5523, ath10k, ath5k, ath6kl, ath9k, carl9170, wcn36xx, wil6210). I’ve not had time to break those down into smaller components yet, though a quick look shows that ath9k in particular accounts for a sizable portion of those 74 issues.

I was actually surprised at how low the iwlwifi and b43 counts were. I guess there’s something to be said for ubiquitous hardware.

What of all the ancient wireless drivers? The junky pcmcia/pccard drivers like orinoco and friends?
They’re in with those 65 “everything else” bugs, and account for fewer than 5-6 issues each. Considering their age, and lack of any real maintenance these days, they’re in surprisingly good shape.

Just for fun, here’s how the drivers above compare against the wireless drivers currently in staging.

rtl8821 102 (Staging)
rtlwifi 96
Atheros 74
brcm80211 67
rtl8188eu 42 (Staging)
mwifiex 33
rtl8712 22 (Staging)
rtl8192u 21 (Staging)
rtl8192e 17 (Staging)
b43 16
iwlwifi 15
everything else 65

August 15, 2014 09:12 PM

Dave Jones: A breakdown of Linux kernel filesystem issues in Coverity scans

The filesystem code shows up in the number two position of the list of hottest areas of the kernel. Like the previous post on drivers/scsi, this isn’t because “the filesystem code is terrible”, but more that Linux supports so many filesystems that the cumulative effect of issues present in all of them adds up to a figure that dominates the statistics.

The breakdown looks like this.

fs/*.c 77
9P 3
EXTn 36
GFS2 12
HFSPlus 4
NFS 24
OCFS2 35
Reiserfs 12
UDF 14
XFS 33

fs/*.c accounts for the VFS core, AIO, binfmt parsers, eventfd, epoll, timerfds, xattr code and a bunch of assorted miscellany. Little wonder it shows up so high; it’s around 62,000 LOC by itself. Of all the entries on the list, this is perhaps the most concerning area given it affects every filesystem.

A little more concerning perhaps is that btrfs is so high on the list. Btrfs is still seeing a lot of churn each release, so many of these issues come and go, but it seems to be holding roughly at the same rate of new incoming issues each release.

EXTn counts for ext2, ext3, and ext4 combined. Not too bad considering that’s around 74,000 LOC combined. (and another 15K LOC for jbd/jbd2)

The CIFS, NFS and OCFS2 filesystems stand out as potentially something that might be of concern, especially if those issues are over-the-wire trigger-able.

XFS has been improving over the past year. It was around 60-70 when I started doing regular scans, and continues to move downward each release, with few new issues getting added.

The remaining filesystems: not too shabby. Especially considering some of the niche ones don’t get a lot of attention.

August 15, 2014 03:40 PM

Dave Jones: A closer look at drivers/scsi Coverity scans.

drivers/scsi showed up in third place in the list of hottest areas of the kernel. Breaking it down into sub-components, it looks like this.

aic7xxx 15
be2iscsi 15
bfa 26
bnx2fc 6
csiostor 10
isci 11
lpfc 38
megaraid 10
mpt2sas 17
mpt3sas 15
pm8001 9
qla2xxx 42
qla4xxx 17
Everything else 152

All these components have been steadily improving over the last year. The obvious stand-out is “Everything else” that looks like it needs to be broken out into more components.
But drivers/scsi is one area of the kernel where we have a *lot* of legacy drivers, many of them 10-15 years old. (Remarkably, some of these are even still in regular use). Looking over the list of filenames matching the “Everything else” component, pretty much every driver that isn’t broken out into its own component is on the list. 3w-9xxx, NCR5380, aacraid, advansys, aic94xx, arcmsr, atp870, bnx2i, cxgbi, dc395x, dpt_i2o, eata, esas2, fdomain, fnic, gdth, hpsa, imm, ipr, ips, mvsas, mvumi, osst, pmcraid, qla1280, qlogicfas, stex, storvsc_drv, sym53x8xx, tmscsim.
None of these are particularly worse than the others, most averaging less than a half dozen issues each.

Ignoring the problems I currently have adding more components, it’s not particularly helpful to break it down further when the result is going to be components with a half dozen issues. It’s not that there’s a few awful drivers dragging down the average, it’s that there’s so many of them, and they all contribute a little bit of awful.

Something I’d like to component-ize, but can’t easily without crafting and maintaining ugly regexps, is the core scsi functionality and its libraries. The problem is that drivers/scsi/*.c includes both legacy drivers, and also scsi core functionality & library functions. I discussed potentially moving all the old drivers to a “legacy” or “vintage” sub-directory at LSF/MM earlier this year with James, but he didn’t seem overly enthusiastic. So it’s going to continue to be lumped in with “Everything else” for now.
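
To show why this gets ugly, here is a rough sketch of a path-to-component map of the kind described above. The component names and patterns are hypothetical and are not the actual Coverity configuration; the file list in the core rule is deliberately incomplete.

```python
import re

# Hypothetical component map; the first matching pattern wins.
COMPONENTS = [
    ("scsi-qla2xxx", re.compile(r"^drivers/scsi/qla2xxx/")),
    ("scsi-lpfc", re.compile(r"^drivers/scsi/lpfc/")),
    # Core code sits beside legacy drivers in the same directory, so it
    # has to be picked out by file name rather than by directory.
    ("scsi-core",
     re.compile(r"^drivers/scsi/(scsi_[a-z_]+|sd|sr|sg|hosts)\.[ch]$")),
    ("scsi-everything-else", re.compile(r"^drivers/scsi/")),
]


def component_for(path):
    """Return the name of the first component whose pattern matches."""
    for name, pattern in COMPONENTS:
        if pattern.search(path):
            return name
    return None


if __name__ == "__main__":
    for path in ("drivers/scsi/qla2xxx/qla_os.c",
                 "drivers/scsi/scsi_lib.c",
                 "drivers/scsi/advansys.c"):
        print(path, "->", component_for(path))
```

Every new core file means another edit to the hand-maintained alternation, whereas a “legacy” sub-directory would reduce the split to a single directory prefix.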

The difficulty with figuring out whether many of these issues are real concerns is that because they are hardware drivers, the scanner has no way of knowing what range of valid responses the HBA will return. So there are a number of issues which are of the form “This can’t actually happen, because if the HBA returned this, then we would have called this other function instead”.
Not a problem unique to SCSI, and something that’s seen across many different parts of the kernel.

And for those ancient 15-year-old drivers? It’s tough to find someone who either remembers how they work at a chip level, or cares enough to go back and revisit them.

A closer look at drivers/scsi Coverity scans. is a post from:

August 15, 2014 02:59 PM

Dave Jones: drivers/staging under the Coverity microscope.

In my previous post, I mentioned that drivers/staging took the top spot for number of issues in a component.

Here’s a ‘zoomed in’ look at the sub-components under drivers/staging.

bcm 103
comedi 45
iio 13
line6 7
lustre 133
media 10
rtl8188eu 42
rtl8192e 17
rtl8192u 21
rtl8712 22
rtl8821 102
rts5208 19
unisys 14
vt6655 47
vt6656 4
everything else in drivers/staging/ (40 other uncategorized drivers) 95

Some of the sub-components with < 10 issues are likely to have their categories removed soon. When they were initially added, the open issue counts were higher, but over time they’ve improved to the point where they could just be lumped in with “everything else”.

When Lustre was added back in 3.12, it caused a noticeable jump in new issues detected: the largest delta from any single addition since I’ve been doing regular scans. It’s continuing to make progress, with 20 or so issues being knocked out each release, and few new issues being introduced. Lustre doesn’t suffer overly from any one issue, but has a grab-bag of issues from the many checkers that Coverity has.
Amusingly, Lustre is the only part of the kernel that has Coverity annotations in the code.

Second on the list is the bcm Wimax driver. This has been around in staging for years, and has had a metric shitload of checkpatch type stylistic changes made to it, but relatively few actual functionality fixes. (confession: I was guilty of ~30 of those cleanups myself, but I couldn’t bear to look at the 1906-line bcm_char_ioctl function. Splitting that up did have a nice side-effect though). A lot of the issues in this driver are duplicates due to a problem in a macro being picked up as a new issue for every instance it gets used.

Something that sticks out in this list is the cluster of rtl* drivers. At time of writing there are seven drivers for various Realtek wireless chips, all of varying quality. Much of the code between these drivers is cut-and-pasted from previous drivers. It seems each time Realtek revs new silicon, they do another code-drop with a new driver. Worse yet, many of the fixes that went into the kernel variants don’t make it back to the driver they based their new work on. There have been numerous cases where a bug fixed in one driver has been reintroduced in a new variant months later. There’s a ton of work going on here, and a lot more needed.
Somewhat depressingly, even the not-in-staging rtlwifi driver that lives in drivers/net/wireless has ~100 issues. Many of them are the exact same issues as those in the staging drivers.

As bad as it seems, staging is serving its purpose for the most part, and things have gotten a lot quieter each merge window when the staging tree gets pulled. It’s only when it contains something new and huge like Lustre that it really shows up noticeably in the daily stats after each scan. The number of new issues being added is generally lower than the number being fixed. For the 3.17 pull for example, 67 new issues, 132 eliminated. (Note: Those numbers are kernel wide, not *just* staging, but staging made up the majority of the results change on that day).

Something that bothers me slightly is that a number of drivers have ‘escaped’ drivers/staging into the kernel proper, with a high number of open issues. That said, many of those escapees are no worse than drivers that were added 10+ years ago when standards were lower. More on that in a future post.

August 15, 2014 01:50 AM

Dave Jones: Linux kernel Coverity scan ‘hot’ areas.

One of the time-consuming parts of organizing the data generated by Coverity has been sorting it into categories (or components, as Coverity refers to them). A component is a wildcard (or exact filename) that matches a specific subsystem, driver, filesystem etc.

As the Linux kernel has thousands of drivers, it isn’t really practical to add a component per-driver, so I started by generalizing into subsystems, and from there, broke down the larger groupings into per-driver components, while still leaving an “everything else” catch-all for drivers within a subsystem that hadn’t been broken out.
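A rough sketch of that layering, for illustration only: the component names and glob syntax below are assumptions, not Coverity's actual configuration format. Per-driver patterns are tried first, and anything left over falls through to a per-subsystem catch-all.

```python
from fnmatch import fnmatch

# Hypothetical component table: most specific patterns first, subsystem
# catch-alls last. Names and globs are made up for illustration.
COMPONENTS = [
    ("scsi-lpfc", "drivers/scsi/lpfc/*"),
    ("scsi-qla2xxx", "drivers/scsi/qla2xxx/*"),
    ("scsi-everything-else", "drivers/scsi/*"),
    ("staging-lustre", "drivers/staging/lustre/*"),
    ("staging-everything-else", "drivers/staging/*"),
]


def component_for(path):
    """Return the name of the first component whose pattern matches path."""
    for name, pattern in COMPONENTS:
        if fnmatch(path, pattern):
            return name
    return "uncategorized"


for path in ("drivers/scsi/lpfc/lpfc_init.c",
             "drivers/scsi/aacraid/linit.c",
             "drivers/scsi/scsi_lib.c"):
    print(path, "->", component_for(path))
```

With a table like this, core files such as drivers/scsi/scsi_lib.c end up in the same catch-all as the drivers that were never broken out, which is why a catch-all count can look worse than any single driver in it.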

According to discussions I’ve had with Coverity, we are actually one of the more ‘heavy’ users of components, and we’ve hit a few scalability problems as we’ve added more and more of them, which has been another reason I’ve not broken things down more than the ~150 components we have so far. Also, if a component has fewer than 10 or so issues, it’s really not worth the effort of splitting up. (I may revise that cut-off even higher at some point just to keep things manageable).

Before the big reveal, some caveats:

Right now, the top ten ‘hot areas’ of the kernel (these include accumulated broken-out drivers), sorted by number of issues are:

drivers/staging 694
fs/ 465
drivers/scsi/ 382
drivers/net/wireless 366
net/ 324
drivers/ethernet/ 285
drivers/media/ 262
drivers/usb/ 140
drivers/infiniband/ 109
arch/x86/ 95
sound/ 89

It should come as no surprise really that the staging drivers take the number one spot. If something had beaten it, I think it would have highlighted a somewhat embarrassing problem in our development methods.

In the next posts, I’ll drill down into each of these categories, and figure out exactly why they’re at the top of the list.

For the impatient: once this series is over, I intend to show breakdowns of the various types of issues being detected, but it’s going to take me a while to get to (probably post kernel summit). There’s an absolute ton of data to dig through, and I’m trying to present as much of it in bite-sized chunks as possible, rather than dozens of pages of info.

August 15, 2014 01:34 AM

August 13, 2014

Dave Jones: The first year of Coverity Linux kernel scans.

Next week at kernel summit, I’m going to be speaking about the Coverity scans, and have come up with more material than I have time to present in the short slot, so I’ve decided to turn it into a series of blog posts in the hope of kickstarting some discussion ahead of time.

I started doing regular scans against the Linux kernel in July 2013. In that time, I’ve sent a bunch of patches, reported many bugs, and spent hours going through the database categorizing, diagnosing, and closing out issues where possible.

I’ve been doing at least one build per day during each merge window (except obviously on days when there haven’t been any commits), and at least one per -rc once the merge window closes.

A few people have asked me about the config file that I use for the builds.
It’s pretty much an ‘allmodconfig’, except where choices have to be made, I’ve tried to pick the general case that a distribution would select. For some of these, I will occasionally flip between them (for eg, SLAB/SLOB/SLUB, PREEMPT_NONE/PREEMPT_VOLUNTARY/PREEMPT) just for coverage. In total, currently 6955 CONFIG_ options are enabled, 117 disabled. (None by choice, they are all the deselected parts of multi-choice options).

The builds are x86-64 only. At this time, it’s the only architecture Coverity scan supports. I do have CONFIG_COMPILE_TEST set, so non-x86 drivers that can be built do get scanned. The architecture specific code in arch/ and drivers not covered under COMPILE_TEST are the only parts of the kernel we’re not covering.

Builds take about an hour on a 24-core Nehalem. The results are then uploaded to a server, which takes another 20 minutes. Then a script kicks something at Coverity to pick up the new tarball and scan it. This can take any number of hours. At best, around 5-6 hours, at worst I’ve seen it take as long as 12 hours. This hopefully answers why I don’t do even more builds, or builds of variant trees. (Although I’m still trying to figure out a way to scan linux-next while having it inherit the results of the issues already marked in Linus’ tree). Thankfully much of the build/upload/scan process is automated, so I can do other things while I wait for it to finish.
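To put those figures together, here is a back-of-the-envelope sketch. It assumes the three steps run strictly one after another with no overlap between cycles, which the post does not state.

```python
# Figures from the paragraph above, in minutes.
BUILD = 60
UPLOAD = 20
SCAN_BEST = 5 * 60    # lower end of the "5-6 hours" best case
SCAN_WORST = 12 * 60  # longest scan mentioned


def turnaround(scan_minutes):
    """Total minutes for one build + upload + scan, run back to back."""
    return BUILD + UPLOAD + scan_minutes


best = turnaround(SCAN_BEST)
worst = turnaround(SCAN_WORST)
# Full cycles that fit in 24 hours if nothing is pipelined (an assumption).
max_per_day = (24 * 60) // best

print(f"best {best / 60:.1f}h, worst {worst / 60:.1f}h, "
      f"at most {max_per_day} serial cycles per day")
```

Even in the best case that is over six hours per result, so a handful of scans a day is the ceiling before any variant trees are considered.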

Over the year, the overall defect density has been decreasing.

3.11 0.68
3.12 0.62
3.13 0.59
3.14 0.55
3.15 0.55
3.16 0.53

Moving in the right direction, though things have slowed a little the last few releases. At least in part due to my spending more time on Trinity than going through the Coverity backlog. The good news is that the incoming rate of new bugs each window has also slowed.

When new issues do get introduced, they are getting jumped on faster than before. Many developers have signed up for accounts and are looking over their subsystems each release, which is great. It means I have to spend less time sending email :)
Eventually I hope that Coverity implements a feature I asked for allowing each component to have a designated email address that new reports get sent to. With that in place, plus active triage on the backlog, a real dent could be made in the ~4700 outstanding issues.

Throughout the past year Coverity has made a number of improvements server-side, some at the behest of the scans, resulting in fewer false positives being found by some checkers. A good example of this was some additional heuristics being added to spot intentional ‘missing break in switch statement’ situations. I’ve also been in constant communication whenever an interesting bug was found upstream that Coverity didn’t detect, so over time, additional checkers should be added to catch more bugs.

How do we compare against other projects?
I picked a few at random.

FreeBSD 0.54 (~15m LOC) 14655 total, 6446 fixed, 8093 outstanding.
Firefox 0.70 (~5.4m LOC) 9008 total, 5066 fixed, 3786 outstanding.
Linux 0.53 (~9m LOC) 13337 total, 7202 fixed, 4761 outstanding.
Python 0.03 ! (~400k LOC) 1030 total, 895 fixed, 3 outstanding.

(LOC based on C preprocessor output)
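As a sanity check on how these numbers relate, the sketch below assumes defect density means outstanding defects per 1,000 lines of code, and uses the rounded LOC figures quoted above. Under that assumption the three large projects come out at the densities listed. Python is left out because, with only three outstanding issues, the rounded ~400k figure is too coarse to reproduce its 0.03.

```python
# (outstanding issues, approximate lines of code) as quoted above
PROJECTS = {
    "FreeBSD": (8093, 15_000_000),
    "Firefox": (3786, 5_400_000),
    "Linux": (4761, 9_000_000),
}


def defect_density(outstanding, loc):
    """Outstanding defects per 1,000 lines of code (assumed definition)."""
    return outstanding / (loc / 1000)


for name, (outstanding, loc) in PROJECTS.items():
    print(f"{name}: {defect_density(outstanding, loc):.2f}")
```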

FreeBSD’s defect density is pretty much the same as Linux right now, despite having a lot more code. I think they include all their userspace in their scans also, so it’s picked up gcc, sendmail, binutils etc etc.

The Python people have made a big effort to keep their defect density low (afaik, the lowest of all projects in scan). They did however have a lot fewer issues to begin with, and have a much smaller codebase. Firefox by comparison seems to have a lot of the same problems Linux has: a large corpus of pre-existing issues, and a large codebase (probably with few people with ‘global’ knowledge).

In my next post, I’ll go into some detail about where some of the more dense areas of the kernel are for Coverity issues. Much of it should be no surprise (old, unmaintained/neglected code etc), but there are a few interesting cases.

update : added FreeBSD statistics.
update 2 : (hi hackernews!) added blurb about coverity improvements.

August 13, 2014 07:07 PM

Paul E. McKenney: A practitioner at a formal-methods conference

I had the privilege of being asked to present on ordering, RCU, and validation at a joint meeting of the REORDER (Third International Workshop on Memory Consistency Models) and EC2 (7th International Workshop on Exploiting Concurrency Efficiently and Correctly) workshops.

Before my talk, Michael Tautschnig gave a presentation (based on this paper) on an interesting prototype tool (called “mole,” with the name chosen because the gestation period of a mole is about 42 days) that helps identify patterns of usage in large code bases. It is early days for this tool, but one could imagine it growing into something quite useful, perhaps answering questions such as “what are the different ways in which the Linux kernel uses reference counting?” He also took care to call out the many disadvantages of testing, which include not being able to test all paths, all possible races, all possible inputs, or all possible much of anything, at least not in finite time.

I started my talk with an overview of split counters, where each CPU (or task or whatever) updates its own counter, and the aggregate counter is read out by summing all the per-CPU counters. There was some concern expressed by members of the audience about the possibility of inconsistent results. For example, if one CPU adds five, another CPU adds seven, and a third CPU adds 11 to an initially zero counter, then two CPUs reading out the counter might see 12 and 18, respectively, which are inconsistent (they differ by six, and no CPU added six). To their credit, the attendees picked right up on a reasonable correctness criterion. The idea is that the aggregate counter's value varies with time, and that any given reader will be guaranteed to return a value between that of the counter when the reader started and that of the counter when the reader ended: Consistency is neither needed nor provided in a great number of situations.
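The 12-versus-18 scenario can be replayed deterministically. The sketch below is a single-threaded simulation with a hand-written interleaving, not real per-CPU code; the Reader class and its step method are hypothetical names used only for illustration. Each reader also records the true aggregate when it started and when it finished, so the correctness criterion can be checked directly.

```python
counters = [0, 0, 0]  # one counter per CPU, all initially zero


def add(cpu, amount):
    """Update path: each CPU only ever touches its own counter."""
    counters[cpu] += amount


class Reader:
    """Sums the per-CPU counters one at a time, so updates can interleave."""

    def __init__(self):
        self.total = 0
        self.next_cpu = 0
        self.start = None  # true aggregate when the read began
        self.end = None    # true aggregate when the read last advanced

    def step(self):
        if self.start is None:
            self.start = sum(counters)
        self.total += counters[self.next_cpu]
        self.next_cpu += 1
        self.end = sum(counters)


reader_a, reader_b = Reader(), Reader()

reader_b.step()        # B reads CPU 0's counter before any update: 0
add(0, 5)
add(1, 7)
for _ in range(3):
    reader_a.step()    # A reads 5 + 7 + 0
add(2, 11)
for _ in range(2):
    reader_b.step()    # B reads 7 + 11 on top of its earlier 0

print(reader_a.total, reader_b.total, sum(counters))  # 12 18 23
```

The two results disagree with each other, yet each lies between the aggregate at that reader's start and at its end (12 to 12 for the first reader, 0 to 23 for the second), which is all the criterion promises.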

I then gave my usual introduction to RCU, and of course it is always quite a bit of fun introducing RCU to people who have never encountered anything like it. There was quite a bit of skepticism initially, as well as a lot of questions and comments.

I then turned to validation, noting the promise of some formal-validation tooling. I ended by saying that although I agreed with the limitations of testing called out by the previous speaker, the fact remains that a number of people have devised tests that have found RCU bugs (thank you, Stephen, Dave, and Fengguang!), but no one has yet devised a hard-core formal-validation tool that has found any bugs in RCU. I also pointed out that this is definitely not because there are no bugs in RCU! (Yes, I have gotten rid of all day-one bugs in RCU, but only by having also gotten rid of all day-one code in RCU.) When asked if I meant bugs in RCU usage or in RCU itself, I replied “Either would be good.” Several people wrote down where to find RCU in the Linux kernel, so it will be interesting to see what they come up with. (Perhaps all too interesting!)

There were several talks on analyzing weakly ordered systems, but keep in mind that for these guys, even x86 is weakly ordered. After all, it allows prior stores to be reordered with later loads.

Another interesting talk was given by Kapil Vaswani on the topic of wait freedom. Recall that in a wait-free algorithm, every process is guaranteed to make some progress in a finite time, even in the presence of arbitrarily long delays for any given process. In contrast, in a lock-free algorithm, only at least one process is guaranteed to make some progress in a finite time, again, even in the presence of arbitrarily long delays for any given process. It is worth noting that neither of these guarantees is sufficient for real-time programs, which require a specified amount of progress (not merely some progress) in a bounded amount of time (not merely a finite amount of time). Wait-freedom and lock-freedom are nevertheless important forward-progress guarantees, and there are numerous other similar guarantees including obstruction freedom, deadlock freedom, starvation freedom, and many more besides.

It turns out that most software in production, even in real-time systems, is not wait-free, which has been a source of consternation for many researchers for quite some time. Kapil went on to describe how Alistarh et al. showed that, roughly speaking, given a non-hostile scheduler and crash-free execution, lock-free algorithms have wait-free behavior.

The interesting thing about this is that you can take it quite a bit farther, and those of you who know me well won't be surprised to learn that I did just that in a question to the speaker. If you have a non-hostile scheduler, crash-free execution, FIFO locks, bounded lock-hold times, no lock nesting, a finite number of processes, and so on, you can obtain the benefits of the more aggressive forward-progress guarantees. The general idea is that if you have at most N processes, and if the maximum lock-hold time is T, then you can wait at most (N-1)T time to acquire a given lock. (Those wishing greater rigor should read Bjoern B. Brandenburg's dissertation — Full disclosure: I was on Bjoern's committee.) In happy contrast to the authors of the paper mentioned in the previous paragraph, the speaker and audience seemed quite intrigued by this line of thought.
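The (N-1)T bound can be illustrated with a toy model. The sketch below assumes the worst case for the last arrival: all N processes queue for one FIFO lock at the same instant, there is no nesting, and no hold time exceeds T. The hold times are made-up numbers.

```python
def worst_case_wait(n_processes, max_hold):
    """Upper bound on the wait for a FIFO lock: everyone else goes first."""
    return (n_processes - 1) * max_hold


def simulate_fifo(hold_times):
    """Wait seen by each process when all queue at time 0 in list order."""
    waits, clock = [], 0
    for hold in hold_times:
        waits.append(clock)  # this process waits for all earlier holders
        clock += hold
    return waits


holds = [3, 5, 5, 2]  # hypothetical hold times, none above T = 5
waits = simulate_fifo(holds)
bound = worst_case_wait(len(holds), max(holds))

print("waits:", waits, "bound:", bound)
```

The last process in the queue waits for the three holders ahead of it, and because each of them is capped at T, its wait cannot exceed the bound.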

In all, it was an extremely interesting and thought-provoking time. With some luck, perhaps we will see some powerful software tools introduced by this group of researchers.

August 13, 2014 03:40 AM

August 10, 2014

Matthew Garrett: Birthplace

For tedious reasons, I will at this stage point out that I was born in Galway, Ireland.

August 10, 2014 11:44 PM

August 08, 2014

Dave Jones: Week of kernel bugs in review

With the 3.17 merge window opening up this week, it’s been kinda busy.
I also made a few enhancements to Trinity, so it found some bugs that have been there for a while.

In addition to this, I started pulling together a talk for kernel summit based on all the stuff that Coverity has been finding. I’ll eventually get around to turning those into blog posts too, as there’s a lot of material.

Productive week.

August 08, 2014 07:36 PM

Dave Jones: compiler sanitizers.

I only recently discovered the sanitizer libraries that both gcc and llvm support despite them being a few years old now. (libasan, liblsan, libtsan and my favorite libubsan for undefined behaviour detection). LLVM also has a -fsanitize=memory.

Building code with -fsanitize={address|leak|undefined} has turned up a number of hard to find issues in various userspace code I’ve written. (Unfortunately doing this on something like Trinity produces a lot of false positives, as it deliberately generates undefined behavior in many cases, like creating an mmap, never writing to it, and then passing it to something that reads it).

There’s also a variant of libasan for the kernel which looks interesting. I know that’s found a bunch of issues in concert with fuzzing via Trinity, and expect it’s something we’ll see more of if/when that functionality gets merged.

Today I was reading about the recent gcc meeting, and these slides by the sanitizer developers caught my attention. What I found of particular interest was the “MSan for Chromium” slide, where they mention they rebuilt ~40 libraries to link with the sanitizer.

I’ve been contemplating doing this for a subset of some userspace packages in Fedora that I care about for a while, but I’ve not had spare cycles to even look into it. I dogfood a lot of bleeding edge code on all my machines, and have been curious for some time to see what the fallout looks like from such a rebuild of various network facing daemons. I suspect with Chromium being more focused on the client side, there hasn’t been a huge amount of research into this for server side code. Looking at ASan’s found bugs wiki page, it does seem to support that hypothesis. I’m curious to see what would fall out from a rebuilt Apache, Bind, Sendmail, nginx, etc.
Hopefully the developers of all the network facing code we ship are just as curious.

There are obvious comparisons to valgrind, which doesn’t require rebuilding, but in my experience so far, the sanitizers have found a bunch of issues that valgrind didn’t (or got lost in the noise). Also, just like with fuzzers, different tools tend to find different bugs even if they have the same intent. I think there’s room for both approaches.

August 08, 2014 06:54 PM

August 07, 2014

Daniel Vetter: Neat stuff for 3.17

So with the 3.16 kernel out of the door it's time to look at what's queued up for the Intel graphics driver in 3.17.

This release features the universal plane support from Matt Roper, all enabled already by default. This is prep work for atomic modesetting and pageflipping support: For a while now we have supported additional (overlay) planes in the DRM core and the i915 driver, but there have always been two implicit planes directly attached to the CRTC: The primary plane used by the SetCrtc and PageFlip functions, and the optional cursor support. But with the atomic ioctl it's easier to handle these implicit planes as explicit planes like everything else, so Matt's patches split them away into separate real plane objects. This is a nice cleanup of the kms api in general since a lot of SoC hardware has unified plane hardware, where cursor, primary plane and any overlays are fully interchangeable. So we already expose this to userspace, if it sets the corresponding feature flag.

Another big feature on the display side is the improved PSR support, which is now enabled by default on Haswell and Broadwell. The tricky bit with PSR (and also with FBC), and the reason we didn't yet enable this by default, is correctly supporting legacy frontbuffer rendering (for example for X). The hardware provides a bit of support to do that, but it doesn't catch all possible frontbuffer rendering and has a lot of other limitations. To finally fix this for real we've added accurate frontbuffer tracking in software. This should finally allow us to enable a lot of display power saving features by default like PSR on Baytrail, FBC (on all platforms) and DRRS (dynamic refresh rate switching).

On actual platform display enabling we have lots of improvements all over: Baytrail MIPI DSI support has greatly stabilized, backlight and power sequencer fixes, mmio based flips to work around issues with stalls and hangs for blitter ring based flips and plenty of other work. The core drm pieces for plane rotation support have also landed, unfortunately the i915 parts didn't make the cut for 3.17.

Another big area, as usual, has been general power management improvements. We now support runtime PM for DPMS Off and not just when the output is completely disabled. This was fairly invasive work since our current modesetting code assumed that a DPMS Off/On cycle will not destroy register state, but that's exactly what runtime PM can do. On the plus side this reorganization greatly cleaned up the code base and prepared the driver for atomic modesetting, which requires the same kind of separation between state computation and actual hw state updating as this feature.

Jesse Barnes implemented S0ix support for system suspend/resume. Marketing has some crazy descriptions for this, but essentially this means that we use the same power saving knobs for system suspend as for runtime PM - the entire machine is still running, just at a very low power state. Long-term this should simplify our system suspend code a bit since we can just reuse all the code used to implement runtime PM.

Moving on to the render side of the gpu there have been again improvements to the rps code. Chris Wilson further tuned the rps boost logic, and Ville and Deepak implemented rps support for Cherrytrail.
Jesse contributed ppgtt support for Baytrail which will be a lot more interesting once we enable full ppgtt again (hopefully in 3.18).

For Broadwell, semaphore support from Ben and Rodrigo was merged, but it looks like we need to disable that again due to stability issues. Oscar Mateo also implemented a large pile of interrupt handling improvements which hopefully address the small races and bugs we've had in the past on some platforms. There are also a lot of refactoring patches to prepare for execlist support from Oscar. Execlists are the new way of submitting work to the gpu, first supported on Broadwell (but not yet mandatory). The key feature compared to legacy ringbuffer submission is that we'll finally be able to preempt gpu tasks.

And as usual there have been tons of bug fixes and improvements all over. Oh and: User mode setting has moved one step further on the path to deprecation and is now fully disabled. If no one complains about this we can finally rip out all that code in one of the next kernel releases.

August 07, 2014 03:36 PM

August 05, 2014

Dave Jones: Linux 3.16 coverity stats

date         rev        outstanding  fixed  defect density
Jun/8/2014   v3.15      4928         6397   0.55
Jun/16/2014  v3.16-rc1  4817         6651   0.53
Jun/23/2014  v3.16-rc2  4815         6653   0.53
Jun/29/2014  v3.16-rc3  4810         6659   0.53
Jul/6/2014   v3.16-rc4  4806         6661   0.53
Jul/14/2014  v3.16-rc5  4801         6663   0.53
Jul/21/2014  v3.16-rc6  4827         7022   0.53
Jul/28/2014  v3.16-rc7  4820         7022   0.53
Aug/4/2014   v3.16      4817         7023   0.53

The 3.16 cycle really started putting a dent in the backlog of older issues. Hundreds of older issues got fixed in -rc1.
There was a small bump at rc5 in new issues being detected, when Coverity upgraded to their 7.5.0 release.
Improvements in that upgrade also meant it closed out more issues than it found new (395 new: 409 eliminated).

Many of the new issues detected look to be real problems. 50 or so of them come from a new checker that looks for patterns like

if (condition)
        foo();
else
        foo();

In a lot of drivers however, it seems to be intentional, as these cases come with FIXME comments suggesting that the author doesn’t know what the right thing to do is in the ‘else’ case, or some functionality doesn’t work right yet, so it falls back to doing the same thing in both branches.

It’s now been a year since I first started doing regular builds in Coverity. In that time, the detected defect density has dropped from 0.68 to 0.53 today. We used to see upticks in new issues every time the merge window opened. Now, we’re seeing as many issues closed (or more) as we are seeing new. (As an example: day 1 of the 3.17 merge window yesterday featured 3638 new changes, including all the questionable code in drivers/staging/. Coverity picked up 67 new issues, but 132 got eliminated).

I’m hoping things continue to improve at this rate.

August 05, 2014 04:27 PM