Kernel Planet

March 27, 2015

LPC 2015: Registration for LPC 2015 Is Now Open

The 2015 Linux Plumbers Conference organizing committee is pleased to announce that registration for this year’s conference is now open. The conference will take place in Seattle, Washington, USA, August 19th through August 21st. Information on how to register can be found on the ATTEND page, along with registration prices and cutoff dates. As usual, contact us if you have questions.

March 27, 2015 11:04 PM

March 26, 2015

LPC 2015: Containers Microconference Accepted into 2015 Linux Plumbers Conference

Over the past year, the advent of Docker has further increased the level of excitement around containers. Additional points of interest include the LXC 1.1 release (which includes CRIU checkpoint/restore, in-container systemd support, and feature-set compatibility across systemd, sysvinit, and upstart), the recently announced merger of OpenVZ and Cloud Server, and progress in the kernel namespace and cgroups infrastructure.

The goal of this microconference is to get these diverse groups together to discuss the long-standing issue of container device isolation, live migration, security features such as user namespaces, and the dueling-systemd challenges stemming from running systemd both within containers and on the host OS (see “The Grumpy Editor’s guide to surviving the systemd debate” for other systemd-related topics).

Please join us for a timely and important discussion!

March 26, 2015 11:10 AM

March 25, 2015

Matthew Garrett: Python for remote reconfiguration of server firmware

One project I've worked on at Nebula is a Python module for remote configuration of server hardware. You can find it here, but there's a few caveats:

  1. It's not hugely well tested on a wide range of hardware
  2. The interface is not yet guaranteed to be stable
  3. You'll also need this module if you want to deal with IBM (well, Lenovo now) servers
  4. The IBM support is based on reverse engineering rather than documentation, so who really knows how good it is

There's documentation in the README, and I'm sorry for the API being kind of awful (it suffers rather heavily from me writing Python while knowing basically no Python). Still, it ought to work. I'm interested in hearing from anybody with problems, anybody who's interested in getting it on Pypi and anybody who's willing to add support for new HP systems.


March 25, 2015 11:51 PM

March 24, 2015

LPC 2015: Energy-Aware Scheduling and CPU Power Management Microconference Accepted into 2015 Linux Plumbers Conference

Energy efficiency has received considerable attention, as witnessed, for example, by the microconference at last year’s Plumbers. However, despite another year’s worth of vigorous effort, there is still quite a bit left to be desired in Linux’s power management, and in its energy-aware scheduling in particular, hence this year’s microconference.

This microconference will look at progress on frequency/performance scaling, thermal management, ACPI power management, device-tree representation of power-management features, energy-aware scheduling, management of power domains, integration of system-wide with runtime power management, and of course measurement techniques and tools.

We hope to see you there!

March 24, 2015 05:19 PM

LPC 2015: Checkpoint/Restart Microconference Accepted into 2015 Linux Plumbers Conference

Checkpoint/restart technology is the basis for live migration, in addition to its traditional use for taking snapshots of long-running jobs. This microconference will focus on the C/R project called CRIU and will bring together people from Canonical, CloudLinux, Georgia Institute of Technology, Google, Parallels, and Qualcomm to discuss CRIU integration with the various containers projects, its use on Android, and performance and testing issues, and, of course, to show some live demos. See the Checkpoint/Restart wiki for more information.

Please join us for a timely and important discussion!

 

March 24, 2015 05:15 PM

March 16, 2015

Dave Airlie: virgil3d local rendering test harness

So I've still been working on the virgil3d project along with part time help from Marc-Andre and Gerd at Red Hat, and we've been making steady progress. This post is about a test harness I just finished developing for adding and debugging GL features.

So one of the more annoying issues with working on virgil has been that while working on adding 3D renderer features or trying to track down a piglit failure, you generally have to run a full VM to do so. This adds a long round trip to your test/development cycle.

I'd always had the idea to do some sort of local system renderer, but there are some issues with calling GL from inside a GL driver. So my plan was to have a renderer process which loads the renderer library that qemu loads, and a mesa driver that hooks into the software rasterizer interfaces. So instead of running llvmpipe or softpipe, I have a virpipe gallium wrapper that wraps my virgl driver and the sw state tracker via a new vtest winsys layer for virgl.

So the virgl pipe driver sits on top of the new winsys layer, and the new winsys, instead of using the Linux kernel DRM APIs, just passes the commands over a UNIX socket to a remote server process.

The remote server process then uses EGL and the renderer library, forks a new copy for each incoming connection and dies off when the rendering is done.

The final rendered result has to be read back over the socket, and then the sw winsys is used to putimage the rendering onto the screen.
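
To make the transport concrete, here is a minimal sketch in C of the kind of fork-per-connection UNIX-socket server described above. It is not the actual vtest code or protocol; the socket path, the lack of any command framing, and the function boundaries are my own simplifications.

  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  int main(void)
  {
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      strcpy(addr.sun_path, "/tmp/.virgl_test");    /* hypothetical rendezvous path */
      unlink(addr.sun_path);
      if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) || listen(fd, 1))
          return 1;

      for (;;) {
          int client = accept(fd, NULL, NULL);

          if (client < 0)
              continue;
          if (fork() == 0) {
              /* Child: this is where EGL and the renderer library would be set
               * up, commands decoded from 'client', and the rendered result
               * written back for the sw winsys to display. */
              close(fd);
              close(client);
              _exit(0);
          }
          close(client);    /* parent keeps accepting new connections */
      }
  }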

So this system is probably going to be slower in raw speed terms, but for developing features or debugging failures it should provide an easier route without the overheads of the qemu process. I was pleasantly surprised that it only took two days to pull most of this test harness together, which was neat; I'd planned on it taking much longer!

The code lives in two halves.
http://cgit.freedesktop.org/~airlied/virglrenderer
http://cgit.freedesktop.org/~airlied/mesa virgl-mesa-driver

[updated: pushed into the main branches]

Also, the virglrenderer repo is standalone now; it has a bunch of unit tests in it that are run under valgrind, in an attempt to lock down some more corners of the API and test for possible ways to escape the host.

March 16, 2015 06:24 AM

March 13, 2015

Dave Jones: LSF/MM 2015 recap.

It’s been a long week.
Spent Monday/Tuesday at LSFMM. This year it was in Boston, which was convenient in that I didn’t have to travel anywhere, but less convenient in that I had to get up early and do a rush-hour commute to get to the conference location in time. At least the weather got considerably better this week compared to the frankly stupid amount of snow we’ve had over the last month.
LWN did their usual great write-up which covers everything that was talked about in a lot more detail than my feeble mind can remember.

A lot of things from last year’s event seem to still be getting a lot of discussion, SMR drives & persistent memory being the obvious stand-outs. There was lots of discussion surrounding various things related to huge pages (so much so that one session overran and replaced a slot I was supposed to share with Sasha, not that I complained; it was interesting stuff, and I learned a few new reasons to dislike the way we handle hugepages & forking), and I lost track of how many times the GFP_NOFAIL discussion came up.

In a passing comment in one session, one of the people Intel sent (Dave Hansen, iirc) mentioned that Intel are now shipping an 18-core/36-thread CPU. A bargain at just $4642. Especially when compared to this madness.

A few days before the event, I had been asked if I wanted to do a “how Akamai uses Linux” type talk at LSFMM, akin to what Chris Mason did about Facebook at last year’s event. I declined, given I’m still trying to figure that out myself. Perhaps another time.

Wednesday/Thursday, I attended Vault at the same location.
My take-aways:

I got asked “What are you doing at Akamai?” a lot. (Answer right now: trying to bring some coherence to our multiple test infrastructures.)
The second most popular question: “What are you going to do after that?” (Answer: unknown, but likely something more related to digging into networking problems rather than fighting shell scripts, perl and Makefiles.)

All that, plus a lot of hallway conversations, long lunches, and evening activities that went on possibly a little later than they should have, led to me almost losing my voice today.
Really good use of time though. I had fun, and it’s always good to catch up with various people.

LSF/MM 2015 recap. is a post from: codemonkey.org.uk

March 13, 2015 08:57 PM

March 12, 2015

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XV

So the Linux kernel now has a Documentation/CodeOfConflict file. As one of the people who provided an Acked-by for this file, I thought I should set down what went through my mind while reading it. Taking it one piece at a time:

The Linux kernel development effort is a very personal process compared to “traditional” ways of developing software. Your code and ideas behind it will be carefully reviewed, often resulting in critique and criticism. The review will almost always require improvements to the code before it can be included in the kernel. Know that this happens because everyone involved wants to see the best possible solution for the overall success of Linux. This development process has been proven to create the most robust operating system kernel ever, and we do not want to do anything to cause the quality of submission and eventual result to ever decrease.

In a perfect world, this would go without saying, give or take the “most robust” chest-beating. But I am probably not the only person to have noticed that the world is not always perfect. Sadly, it is probably necessary to remind some people that “job one” for the Linux kernel community is the health and well-being of the Linux kernel itself, and not their own pet project, whatever that might be.

On the other hand, I was also heartened by what does not appear in the above paragraph. There is no assertion that the Linux kernel community's processes are perfect, which is all to the good, because delusions of perfection all too often prevent progress in mature projects. In fact, in this imperfect world, there is nothing so good that it cannot be made better. On the other hand, there also is nothing so bad that it cannot be made worse, so random wholesale changes should be tested somewhere before being applied globally to a project as important as the Linux kernel. I was therefore quite happy to read the last part of this paragraph: “we do not want to do anything to cause the quality of submission and eventual result to ever decrease.”

If however, anyone feels personally abused, threatened, or otherwise uncomfortable due to this process, that is not acceptable.

That sentence is of course critically important, but must be interpreted carefully. For example, it is all too possible that someone might feel abused, threatened, and uncomfortable by the mere fact of a patch being rejected, even if that rejection was both civil and absolutely necessary for the continued robust operation of the Linux kernel. Or someone might claim to feel that way, if they felt that doing so would get their patch accepted. (If this sounds impossible to you, be thankful, but also please understand that the range of human behavior is extremely wide.) In addition, I certainly feel uncomfortable when someone points out a stupid mistake in one of my patches, but that discomfort is my problem, and furthermore encourages me to improve, which is a good thing. For but one example, this discomfort is exactly what motivated me to write the rcutorture test suite. Therefore, although I hope that we all know what is intended by the words “abused”, “threatened”, and “uncomfortable” in that sentence, the fact is that it will never be possible to fully codify the difference between constructive and destructive behavior.

Therefore, the resolution process is quite important:

If so, please contact the Linux Foundation's Technical Advisory Board at <tab@lists.linux-foundation.org>, or the individual members, and they will work to resolve the issue to the best of their ability. For more information on who is on the Technical Advisory Board and what their role is, please see:

http://www.linuxfoundation.org/programs/advisory-councils/tab

There can be no perfect resolution process, but this one seems to be squarely in the “good enough” category. The timeframes are long enough that people will not be rewarded by complaining to the LF TAB instead of fixing their patches. The composition of the LF TAB, although not perfect, is diverse, consisting of both men and women from multiple countries. The LF TAB appears to be able to manage the inevitable differences of opinion, based on the fact that not all members provided their Acked-by for this Code of Conflict. And finally, the LF TAB is an elected body that has oversight via the LF, so there are feedback mechanisms. Again, this is not perfect, but it is good enough that I am willing to overlook my concerns about the first sentence in the paragraph.

On to the final paragraph:

As a reviewer of code, please strive to keep things civil and focused on the technical issues involved. We are all humans, and frustrations can be high on both sides of the process. Try to keep in mind the immortal words of Bill and Ted, “Be excellent to each other.”

And once again, in a perfect world it would not be necessary to say this. Sadly, we are human beings rather than angels, and so it does appear to be necessary. Then again, if we were all angels, this would be a very boring world.

Or at least that is what I keep telling myself!

March 12, 2015 05:36 PM

Matthew Garrett: Vendors continue to break things

Getting on for seven years ago, I wrote an article on why the Linux kernel responds "False" to _OSI("Linux"). This week I discovered that vendors were making use of another behavioural difference between Linux and Windows to change the behaviour of their firmware and breaking things in the process.

The ACPI spec defines the _REV object as evaluating "to the revision of the ACPI Specification that the specified \_OS implements as a DWORD. Larger values are newer revisions of the ACPI specification", ie you reference _REV and you get back the version of the spec that the OS implements. Linux returns 5 for this, because Linux (broadly) implements ACPI 5.0, and Windows returns 2 because fuck you that's why[1].

(An aside: To be fair, Windows maybe has kind of an argument here because the spec explicitly says "The revision of the ACPI Specification that the specified \_OS implements" and all modern versions of Windows still claim to be Windows NT in \_OS and eh you can kind of make an argument that NT in the form of 2000 implemented ACPI 2.0 so handwave)

This would all be fine, except that firmware vendors appear to earnestly believe that they should ensure that their platforms work correctly with RHEL 5, even though RHEL 5 has no drivers for anything in their hardware, and so they are looking for ways to identify that they're on Linux so they can just randomly break various bits of functionality. I've now found two systems (an HP and a Dell) that check the value of _REV. The HP checks whether it's 3 or 5 and, if so, behaves like an old version of Windows and reports fewer backlight values and so on. The Dell checks whether it's 5 and, if so, leaves the sound hardware in a strange, partially configured state.

And so, as a result, I've posted this patch which sets _REV to 2 on X86 systems because every single more subtle alternative leaves things in a state where vendors can just find another way to break things.
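
As a purely illustrative sketch (hypothetical C, not the actual ACPICA change), the essence of that patch is simply to hand AML the value Windows reports instead of the revision Linux really implements, so there is nothing left for firmware to key off:

  #include <stdio.h>
  #include <stdint.h>

  #define ACPI_REVISION_IMPLEMENTED 5    /* Linux broadly implements ACPI 5.0 */

  /* Hypothetical helper: the value handed back when AML evaluates _REV. */
  static uint64_t acpi_rev_value(int pretend_to_be_windows)
  {
      return pretend_to_be_windows ? 2 : ACPI_REVISION_IMPLEMENTED;
  }

  int main(void)
  {
      printf("honest _REV: %llu, quirked _REV: %llu\n",
             (unsigned long long)acpi_rev_value(0),
             (unsigned long long)acpi_rev_value(1));
      return 0;
  }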

[1] Verified by hacking qemu's DSDT to make _REV calls at various points and dump the output to the debug console - I haven't found a single scenario where modern Windows returns something other than "2"


March 12, 2015 10:03 AM

March 10, 2015

Pavel Machek: Happy Easter from DRAM vendors

DRAM in about 50% of recent notebooks (and basically 50% of machines without ECC) is so broken that it is exploitable. You can get root from a normal user account (and more). But while everyone and their dog wrote about the Heartbleed and Bash bugs, the press has not noticed this one yet. I guess it is because the vulnerability does not yet have a logo?

I propose this one:

           +==.
           |   \                                                                                     
  +---+    +====+
 -+   +-     ||
  |DDR|      ||
 -+   +-     ||
  +---+      ||


Memory testing code is at github. Unfortunately, Google did not publish a list of known-bad models. If you run the test, can you post the results in the comments? My Thinkpad X60 is not vulnerable; my Intel Core 2 Duo E7400-based desktop (with useless DMI information, so I don't know who made the board) is vulnerable.
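
For the curious, the core of the technique is a tight loop that alternately reads two addresses mapping to different rows of the same DRAM bank, flushing them from the cache each time so that every access reaches DRAM; neighbouring rows are then checked for flipped bits. The sketch below is my own illustration (assuming x86 with SSE2), not the code from the repository above, and the "8MB apart" address pair is a stand-in for the careful row selection the real test does.

  #include <stdint.h>
  #include <stdlib.h>
  #include <emmintrin.h>    /* _mm_clflush() */

  /* Hammer two addresses that ideally map to different rows in one bank. */
  static void hammer_pair(volatile uint64_t *a, volatile uint64_t *b, long iters)
  {
      while (iters-- > 0) {
          (void)*a;                       /* read row A */
          (void)*b;                       /* read row B */
          _mm_clflush((const void *)a);   /* evict so the next read hits DRAM */
          _mm_clflush((const void *)b);
      }
  }

  int main(void)
  {
      size_t len = 64 << 20;              /* 64MB scratch buffer */
      uint8_t *buf = malloc(len);

      if (!buf)
          return 1;
      hammer_pair((volatile uint64_t *)buf,
                  (volatile uint64_t *)(buf + (8 << 20)), 1000000);
      /* The real test would now scan rows adjacent to A and B for bit flips. */
      free(buf);
      return 0;
  }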

March 10, 2015 09:30 PM

Pavel Machek: Random notes

Time: 1039 Safe: 101

Plane 'X' landed at the wrong airport.

You beat your previous score!

 #:  name      host      game                time  real time  planes safe
 -------------------------------------------------------------------------------
  1:  pavel     duo       default             1040      34:38   101



And some good news: Old thinkpad x60 can take 3GiB RAM, making machine usable a little longer.

gpsd can run as non-root, and seems to accept output from a named pipe. This is good, because it means using WiFi access points to provide position to clients such as foxtrotgps is easier. Code is in gitorious tui.

March 10, 2015 09:15 PM

March 09, 2015

Paul E. Mc Kenney: Verification Challenge 4: Tiny RCU

The first and second verification challenges were directed to people working on verification tools, and the third challenge was directed at developers. Perhaps you are thinking that it is high time that I stop picking on others and instead direct a challenge at myself. If so, this is the challenge you were looking for!

The challenge is to take the v3.19 Linux kernel code implementing Tiny RCU, unmodified, and use some formal-verification tool to prove that its grace periods are correctly implemented.

This requires a tool that can handle multiple threads. Yes, Tiny RCU runs only on a single CPU, but the proof will require at least two threads. The basic idea is to have one thread update a variable, wait for a grace period, then update a second variable, while another thread accesses both variables within an RCU read-side critical section, and a third parent thread verifies that this critical section did not span a grace period, like this:

 1 int x;
 2 int y;
 3 int r1;
 4 int r2;
 5
 6 void rcu_reader(void)
 7 {
 8   rcu_read_lock();
 9   r1 = x; 
10   r2 = y; 
11   rcu_read_unlock();
12 }
13
14 void *thread_update(void *arg)
15 {
16   x = 1; 
17   synchronize_rcu();
18   y = 1; 
19 }
20
21 . . .
22
23 assert(r2 == 0 || r1 == 1);


Of course, rcu_reader()'s RCU read-side critical section is not allowed to span thread_update()'s grace period, which is provided by synchronize_rcu(). Therefore, rcu_reader() must execute entirely before the end of the grace period (in which case r2 must be zero, keeping in mind C's default initialization to zero), or it must execute entirely after the beginning of the grace period (in which case r1 must be one).

There are a few technical problems to solve:


  1. The Tiny RCU code #includes numerous “interesting” files. I supplied empty files as needed and used “-I .” to focus the C preprocessor's attention on the current directory.
  2. Tiny RCU uses a number of equally interesting Linux-kernel primitives. I stubbed most of these out in fake.h, but copied a number of definitions from the Linux kernel, including IS_ENABLED, barrier(), and bool.
  3. Tiny RCU runs on a single CPU, so the two threads shown above must act as if this was the case. I used pthread_mutex_lock() to provide the needed mutual exclusion, keeping in mind that Tiny RCU is available only with CONFIG_PREEMPT=n. The thread that holds the lock is running on the sole CPU.
  4. The synchronize_rcu() function can block. I modeled this by having it drop the lock and then re-acquire it (see the sketch just after this list).
  5. The dyntick-idle subsystem assumes that the boot CPU is born non-idle, but in this case the system starts out idle. After a surprisingly long period of confusion, I handled this by having main() invoke rcu_idle_enter() before spawning the two threads. The confusion eventually proved beneficial, but more on that later.
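
As a minimal sketch of what item 4 might look like (names are illustrative, not taken from the actual fake.h scaffolding): holding the pthread_mutex_t stands in for "currently running on the sole CPU", so a blocking synchronize_rcu() is modeled as releasing it, letting the other thread run, and then taking it back.

  #include <pthread.h>

  /* Holding this mutex models "currently running on the single CPU". */
  static pthread_mutex_t cpu_mutex = PTHREAD_MUTEX_INITIALIZER;

  /* Sketch of the blocking point inside a modeled synchronize_rcu(): give up
   * the CPU so the other thread can run, then take it back. */
  static void model_blocking_point(void)
  {
      pthread_mutex_unlock(&cpu_mutex);
      pthread_mutex_lock(&cpu_mutex);
  }

  int main(void)
  {
      pthread_mutex_lock(&cpu_mutex);     /* "boot" onto the CPU */
      model_blocking_point();
      pthread_mutex_unlock(&cpu_mutex);
      return 0;
  }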


The first step is to get the code to build and run normally. You can omit this step if you want, but given that compilers usually generate better diagnostics than do the formal-verification tools, it is best to make full use of the compilers.

I first tried goto-cc, goto-instrument, and satabs [Slide 44 of PDF] and impara [Slide 52 of PDF], but both tools objected strenuously to my code. My copies of these two tools are a bit dated, so it is possible that these problems have since been fixed. However, I decided to download version 5 of cbmc, which is said to have gained multithreading support.

After converting my code to a logic expression with no fewer than 109,811 variables and 457,344 clauses, cbmc -I . -DRUN fake.c took a bit more than ten seconds to announce VERIFICATION SUCCESSFUL. But should I trust it? After all, I might have a bug in my scaffolding or there might be a bug in cbmc.

The usual way to check for this is to inject a bug and see if cbmc catches it. I chose to break up the RCU read-side critical section as follows:

 1 void rcu_reader(void)
 2 {
 3   rcu_read_lock();
 4   r1 = x; 
 5   rcu_read_unlock();
 6   cond_resched();
 7   rcu_read_lock();
 8   r2 = y; 
 9   rcu_read_unlock();
10 }


Why not remove thread_update()'s call to synchronize_rcu()? Take a look at Tiny RCU's implementation of synchronize_rcu() to see why not!

With this change enabled via #ifdef statements, “cbmc -I . -DRUN -DFORCE_FAILURE fake.c” took almost 20 seconds to find a counter-example in a logic expression with 185,627 variables and 815,691 clauses. Needless to say, I am glad that I didn't have to manipulate this logic expression by hand!

Because cbmc catches an injected bug and verifies the original code, we have some reason to hope that the VERIFICATION SUCCESSFUL was in fact legitimate. As far as I know, this is the first mechanical proof of the grace-period property of a Linux-kernel RCU implementation, though admittedly of a rather trivial implementation. On the other hand, a mechanical proof of some properties of the dyntick-idle counters came along for the ride, courtesy of the WARN_ON_ONCE() statements in the Linux-kernel source code. (Previously, researchers at Oxford mechanically validated the relationship between rcu_dereference() and rcu_assign_pointer(), taking the whole of Tree RCU as input, and researchers at MPI-SWS formally validated userspace RCU's grace-period guarantee—manually.)

As noted earlier, I had confused myself into thinking that cbmc did not handle pthread_mutex_lock(). I verified that cbmc handles the gcc atomic builtins, but it turns out to be impractical to build a lock for cbmc's use from atomics. The problem stems from the “b” for “bounded” in “cbmc”, which means cbmc cannot analyze the unbounded spin loops used in locking primitives.

However, cbmc does do the equivalent of a full state-space search, which means it will automatically model all possible combinations of lock-acquisition delays even in the absence of a spin loop. This suggests something like the following:

 1 if (__sync_fetch_and_add(&cpu_lock, 1))
 2   exit();


The idea is to exclude from consideration any executions where the lock cannot be immediately acquired, again relying on the fact that cbmc automatically models all possible combinations of delays that the spin loop might have otherwise produced, but without the need for an actual spin loop. This actually works, but my mis-modeling of dynticks fooled me into thinking that it did not. I therefore made lock-acquisition failure set a global variable and added this global variable to all assertions. When this failed, I had sufficient motivation to think, which caused me to find my dynticks mistake. Fixing this mistake fixed all three versions (locking, exit(), and flag).
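
The following is a minimal sketch of the flag approach just described; the names are mine rather than those in the actual fake.c scaffolding. The point is that executions in which the "lock" was contended are not pruned by calling exit(), but instead trivially satisfy every (suitably weakened) assertion.

  #include <assert.h>

  static int cpu_lock;       /* non-zero while some thread "owns the CPU" */
  static int lock_failed;    /* set when this execution should be disregarded */

  static void acquire_cpu(void)
  {
      if (__sync_fetch_and_add(&cpu_lock, 1))
          lock_failed = 1;   /* contended: this execution no longer matters */
  }

  static void release_cpu(void)
  {
      __sync_fetch_and_sub(&cpu_lock, 1);
  }

  /* The grace-period property, weakened so that contended executions pass. */
  static void check(int r1, int r2)
  {
      assert(lock_failed || r2 == 0 || r1 == 1);
  }

  int main(void)
  {
      acquire_cpu();
      check(0, 0);           /* trivial exercise of the scaffolding */
      release_cpu();
      return 0;
  }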

The exit() and flag approaches result in exactly the same number of variables and clauses, which turns out to be quite a bit fewer than the locking approach:

                                 exit()/flag                                     locking
  Verification                   69,050 variables, 287,548 clauses (output)      109,811 variables, 457,344 clauses (output)
  Verification Forced Failure    113,947 variables, 501,366 clauses (output)     185,627 variables, 815,691 clauses (output)


So locking increases the size of the logic expressions by quite a bit, but interestingly enough does not have much effect on verification time. Nevertheless, these three approaches show a few of the tricks that can be used to accomplish real work using formal verification.

The GPL-licensed source for the Tiny RCU validation may be found here. C-preprocessor macros select the various options, with -DRUN being necessary for both real runs and cbmc verification (as opposed to goto-cc or impara verification), -DCBMC forcing the atomic-and-flag substitute for locking, and -DFORCE_FAILURE forcing the failure case. For example, to run the failure case using the atomic-and-flag approach, use:

cbmc -I . -DRUN -DCBMC -DFORCE_FAILURE fake.c


Possible next steps include verifying dynticks and interrupts, dynticks and NMIs, and of course use of call_rcu() in place of synchronize_rcu(). If you try these out, please let me know how it goes!

March 09, 2015 06:42 PM

Lucas De Marchi: Taking maintainership of dolt

For those who don’t know, dolt is a wrapper and replacement for libtool on sane systems that don’t need it at all. It was created some years ago by Josh Triplett to overcome the slowness of libtool.

Nowadays libtool should be much faster, so the need for dolt should not be as great anymore. However, as can be seen here, that’s not true (at least in libtool > 2.4.2). Yeah, this seems to be a bug in libtool that should be fixed. However, I don’t like it being any more complicated than it should be.

After talking to Josh and to Luca Barbato (who maintains an updated version on GitHub), it seemed a good idea to revive dolt in its original repository. Since this file is supposed to be copied into each project’s m4 directory, there are several copies out there with different improvements. Now the upstream repository has an updated version that should work for any project — feel free to copy it into your repository and keep it synchronized with upstream. I don’t really expect much maintenance. Credit where it’s due: most of the missing commits came from the version maintained by Luca.

So, if you want to integrate it into your repository using autotools, it’s pretty simple: just follow the README.md file.

March 09, 2015 05:27 PM

March 06, 2015

Pavel Machek: Position privacy protection

Mozilla maintains an access point (AP) database at location.services.mozilla.com. Location using WiFi is cool: you don't need GPS hardware, you can get a fix more quickly and with less battery power in cities, and you can get a fix indoors.

Mozilla will return your position if you know the SSIDs of two nearby access points, using a web service. That has disadvantages: you need a working internet connection, the connection costs you time, power and money, and Mozilla now knows where you are.
The obvious solution would be to publish the AP database, but that has a downside: if you visit Anicka once and learn the SSID of her favourite access point, you could locate Anicka with a simple database query once she moves.

Solution:

    first = select the N numerically lowest (or most commonly seen) access points in the area
    second = all access points in the area
    for i in first:
        for j in second:
            at position sha1(i, j, salt?) in the database, store the GPS coordinates

If the probability of missing an access point when you are in the right area is P, the probability of not being able to tell your location is P^N. The database will grow approximately N times.
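
As a concrete illustration of the lookup side, here is a sketch in C of how a client might form the database key for one pair of access points. It assumes 6-byte AP identifiers (BSSIDs), OpenSSL's SHA1() for the hash (link with -lcrypto), and a 64-bit truncated key; all of those choices, like the scheme itself, are open to debate.

  #include <openssl/sha.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define BSSID_LEN 6
  #define MAX_SALT  32

  /* Key under which the GPS coordinates for the pair (i, j) would be stored.
   * Truncating the hash deliberately introduces collisions, as discussed
   * above. */
  static uint64_t pair_key(const uint8_t i[BSSID_LEN], const uint8_t j[BSSID_LEN],
                           const uint8_t *salt, size_t salt_len)
  {
      uint8_t buf[2 * BSSID_LEN + MAX_SALT];
      uint8_t digest[SHA_DIGEST_LENGTH];
      uint64_t key;
      size_t len = 0;

      if (salt_len > MAX_SALT)
          salt_len = MAX_SALT;
      memcpy(buf + len, i, BSSID_LEN);   len += BSSID_LEN;
      memcpy(buf + len, j, BSSID_LEN);   len += BSSID_LEN;
      memcpy(buf + len, salt, salt_len); len += salt_len;
      SHA1(buf, len, digest);
      memcpy(&key, digest, sizeof(key)); /* keep only 64 bits of the hash */
      return key;
  }

  int main(void)
  {
      const uint8_t ap_a[BSSID_LEN] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
      const uint8_t ap_b[BSSID_LEN] = { 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb };

      printf("key: %016llx\n", (unsigned long long)
             pair_key(ap_a, ap_b, (const uint8_t *)"salt", 4));
      return 0;
  }
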
Storing salt: it will make it harder to see differences between different versions of the database (good). But if we play some tricks with hash size to artificially introduce collisions, this may make them ineffective.
Problem: there are only 2^48 possible access points. Someone could brute-force the hash. Solution: store fewer bits of the hash to create collisions?
Problem: if you can guess that Anicka likes the South Pole, and suddenly a new access point appears in the area, you can guess her address. Comment: not a problem, since Anicka would have to take two access points to the South Pole? Or still a problem, since you don't need to know the address of the second AP to locate her?
Problem: if you know Anicka likes Mesto u Chciplyho psa, where no one ever moves and no one ever activates/deactivates APs, you can still locate her. Comment: is it a problem? Are there such places?

Any ideas? Does it work, or did I make a mistake somewhere? Is there a solution with lower overhead?

March 06, 2015 10:45 PM

Andy Grover: iSER target should work fine in RHEL 7.1

Contrary to what RHEL 7.1 release notes might say, RHEL 7.1 should be fine as an iSER target, and it should be fine to use iSER even during the discovery phase. There was significant late-breaking work by our storage partners to fix both of these issues.

Unfortunately, there were multiple Bugzilla entries for the same issues, and while some were properly closed, others were not, and the issues erroneously were mentioned in the release notes.

So, for the hordes out there eager to try an iSER target on RHEL 7.1 and who actually read the release notes — I hope you see this too and know it’s OK to give it a go :-)

March 06, 2015 12:47 AM

March 04, 2015

Matt Domsch: Dell’s Linux Engineering team is hiring

Dell’s Linux Engineering team, based in Austin, TX, is hiring a Senior Principal Engineer. This role is one I’ve previously held and enjoyed greatly – ensuring that Linux (all flavors) “just works” on all Dell PowerEdge servers. It is as much a relationship role (working closely with Dell and partner hardware teams, OS vendors and developers, internal teams, and the greater open source community) as it is technical (device driver work, OS kernel and userspace work). If you’re a “jack of all trades” in Linux and looking for a very senior technical role to continue the outstanding work that ranks Dell at the top of charts for Linux servers, we’d love to speak with you.

The formal job description is on Dell’s job site. If you’d like to speak with me personally about it, drop me a line.

March 04, 2015 03:31 PM

March 03, 2015

Dave Jones: Trinity 1.5 release.

As announced this morning, I decided that things had slowed down enough (to an almost complete standstill of late) that it was worth making a tarball release of Trinity, to wrap up everything that’s gone in over the last year.

The email linked above covers most of the major changes, but a lot of the change over the last year has actually been groundwork for those features. Things like..

As I mentioned in the announcement, I don’t see myself having a huge amount of time to work on Trinity for at least this year. I’ve had a number of people email me asking about the status of some feature. Hopefully this demarcation point will answer the question.

So, it’s not abandoned, it just won’t be seeing the volume of change it has over the last few years. I expect my personal involvement will be limited to merging patches, and updating the syscall lists when new syscalls get added.

Trinity used to be on roughly a six month release schedule. We’ll see if by the end of the year there’s enough input from other people to justify doing a 1.6 release.

I’m also hopeful that time spent working on other projects means I’ll come back to this at some point with fresh eyes. There are a number of features I wanted to implement that needed a lot more thought. Perhaps working on some other things for a while will give me the perspective necessary to realize those features.

Trinity 1.5 release. is a post from: codemonkey.org.uk

March 03, 2015 02:14 AM

March 02, 2015

Michael Kerrisk (manpages): man-pages-3.81 is released

I've released man-pages-3.81. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

The changes in man-pages-3.81 relate exclusively to the (glibc) thread-safety markings in various man pages. More than 400 patches, mainly by Ma Shimiao and Peng Haitao of Fujitsu brought the following changes:

By now, thanks mainly to the work of Peng Haitao and Ma Shimiao, nearly 400 of the (around 980) pages in man-pages carry thread-safety information.

In addition, a new attributes(7) man page, based on text supplied by Alexandre Oliva (who was responsible for adding thread-safety information to the GNU C Library manual) provides an overview of the thread-safety concepts documented in man-pages, and a description of the notation used in man-pages to describe the thread safety of functions. (Thanks also to Carlos O'Donell for helping us to obtain the permissions needed so that man-pages could recycle this text from the GNU C Library manual.)

March 02, 2015 04:02 PM

Paul E. Mc Kenney: Verification Challenge 3: cbmc

The first and second verification challenges were directed to people working on verification tools, but this one is instead directed at developers.

It turns out that there are a number of verification tools that have seen heavy use. For example, I have written several times about Promela and spin (here, here, and here), which I have used from time to time over the past 20 years. However, this tool requires that you translate your code to Promela, which is not conducive to use of Promela for regression tests.

For those of us working in the Linux kernel, it would be nice to have a verification tool that operated directly on C source code. And there are tools that do just that, for example, the C Bounded Model Checker (cbmc). This tool, which is included in a number of Linux distributions, converts a C-language input file into a (possibly quite large) logic expression. This expression is constructed so that if any combination of variables causes the logic expression to evaluate to true, then (and only then) one of the assertions can be triggered. This logic expression is then passed to a SAT solver, and if this SAT solver finds a solution, then there is a set of inputs that can trigger the assertion. The cbmc tool is also capable of checking for array-bounds errors and some classes of pointer misuse.

Current versions of cbmc can handle some useful tasks. For example, suppose it was necessary to reverse the sense of the if condition in the following code fragment from Linux-kernel RCU:

 1   if (rnp->exp_tasks != NULL ||
 2       (rnp->gp_tasks != NULL &&
 3        rnp->boost_tasks == NULL &&
 4        rnp->qsmask == 0 &&
 5        ULONG_CMP_GE(jiffies, rnp->boost_time))) {
 6     if (rnp->exp_tasks == NULL) 
 7       rnp->boost_tasks = rnp->gp_tasks;
 8     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
 9     t = rnp->boost_kthread_task;
10     if (t)   
11       rcu_wake_cond(t, rnp->boost_kthread_status);
12   } else {
13     rcu_initiate_boost_trace(rnp);
14     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
15   }


This is a simple application of De Morgan's law, but an error-prone one, particularly if carried out in a distracting environment. Of course, to test a validation tool, it is best to feed it buggy code to see if it detects those known bugs. And applying De Morgan's law in a distracting environment is an excellent way to create bugs, as you can see below:

 1   if (rnp->exp_tasks == NULL &&
 2       (rnp->gp_tasks == NULL ||
 3        rnp->boost_tasks != NULL ||
 4        rnp->qsmask != 0 &&
 5        ULONG_CMP_LT(jiffies, rnp->boost_time))) {
 6     rcu_initiate_boost_trace(rnp);
 7     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
 8   } else {
 9     if (rnp->exp_tasks == NULL) 
10       rnp->boost_tasks = rnp->gp_tasks;
11     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
12     t = rnp->boost_kthread_task;
13     if (t)   
14       rcu_wake_cond(t, rnp->boost_kthread_status);
15   }


Of course, a full exhaustive test is infeasible, but structured testing would result in a manageable number of test cases. However, we can use cbmc to do the equivalent of a full exhaustive test, despite the fact that the number of combinations is on the order of two raised to the power 1,000. The approach is to create task_struct and rcu_node structures that contain only those fields that are used by this code fragment, but that also contain flags that indicate which functions were called and what their arguments were. This allows us to wrapper both the old and the new versions of the code fragment in their respective functions, and call them in sequence on different instances of identically initialized task_struct and rcu_node structures. These two calls are followed by an assertion that checks that the return value and the corresponding fields of the structures are identical.

This approach results in checkiftrans-1.c (raw C code here). Lines 5-8 show the abbreviated task_struct structure and lines 13-22 show the abbreviated rcu_node structure. Lines 10, 11, 24, and 25 show the instances. Lines 27-31 record a call to rcu_wake_cond() and lines 33-36 record a call to rcu_initiate_boost_trace().

Lines 38-49 initialize a task_struct/rcu_node structure pair. The rather unconventional use of the argv[] array works because cbmc assumes that this array contains random numbers. The old if statement is wrappered by do_old_if() on lines 51-71, while the new if statement is wrappered by do_new_if() on lines 73-93. The assertion is in check() on lines 95-107, and finally the main program is on lines 109-118.

Running cbmc checkiftrans-1.c gives this output, which prominently features VERIFICATION FAILED at the end of the file. On lines 4, 5, 12 and 13 of the file are complaints that neither ULONG_CMP_GE() nor ULONG_CMP_LT() are defined. Lacking definitions for these two functions, cbmc seems to treat them as random-number generators, which could of course cause the two versions of the if statement to yield different results. This is easily fixed by adding the required definitions:

 1 #define ULONG_MAX         (~0UL)
 2 #define ULONG_CMP_GE(a, b)  (ULONG_MAX / 2 >= (a) - (b))
 3 #define ULONG_CMP_LT(a, b)  (ULONG_MAX / 2 < (a) - (b))


This results in checkiftrans-2.c (raw C code here). However, running cbmc checkiftrans-2.c gives this output, which still prominently features VERIFICATION FAILED at the end of the file. At least there are no longer any complaints about undefined functions!

It turns out that cbmc provides a counterexample in the form of a traceback. This traceback clearly shows that the two instances executed different code paths, and a closer examination of the two representations of the if statement shows that I forgot to convert one of the && operators to a ||—that is, the “rnp->qsmask != 0 &&” on line 84 should instead be “rnp->qsmask != 0 ||”. Making this change results in checkiftrans-3.c (raw C code here). The inverted if statement is now as follows:

 1   if (rnp->exp_tasks == NULL &&
 2       (rnp->gp_tasks == NULL ||
 3        rnp->boost_tasks != NULL ||
 4        rnp->qsmask != 0 ||
 5        ULONG_CMP_LT(jiffies, rnp->boost_time))) {
 6     rcu_initiate_boost_trace(rnp);
 7     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
 8   } else {
 9     if (rnp->exp_tasks == NULL) 
10       rnp->boost_tasks = rnp->gp_tasks;
11     /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
12     t = rnp->boost_kthread_task;
13     if (t)   
14       rcu_wake_cond(t, rnp->boost_kthread_status);
15   }


This time, running cbmc checkiftrans-3.c produces this output, which prominently features VERIFICATION SUCCESSFUL at the end of the file. Furthermore, this verification consumed only about 100 milliseconds on my aging laptop. And, even better, because it refused to verify the buggy version, we have at least some reason to believe it!

Of course, one can argue that doing such work carefully and in a quiet environment would eliminate the need for such verification, and 30 years ago I might have emphatically agreed with this argument. I have since learned that ideal work environments are not always as feasible as we might like to think, especially if there are small children (to say nothing of adult-sized children) in the vicinity. Besides which, human beings do make mistakes, even when working in ideal circumstances, and if we are to have reliable software, we need some way of catching these mistakes.

The canonical pattern for using cbmc in this way is as follows:

 1 retref = funcref(...);
 2 retnew = funcnew(...);
 3 assert(retref == retnew && ...);


The ... sequences represent any needed arguments to the calls and any needed comparisons of side effects within the assertion.
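
As a self-contained toy instance of this pattern (my own example, not taken from the RCU code), the harness below asks cbmc to confirm that a hand-applied De Morgan rewrite of a small condition agrees with the original for all inputs; nondet_int() is cbmc's usual convention for an unconstrained input, so this file is meant for "cbmc demorgan.c" rather than for normal compilation.

  #include <assert.h>

  int nondet_int(void);   /* no body: cbmc treats the return value as unconstrained */

  static int cond_ref(int a, int b)
  {
      return !(a == 0 || b != 0);
  }

  static int cond_new(int a, int b)
  {
      return a != 0 && b == 0;   /* De Morgan's law applied by hand */
  }

  int main(void)
  {
      int a = nondet_int();
      int b = nondet_int();
      int retref = cond_ref(a, b);
      int retnew = cond_new(a, b);

      assert(retref == retnew);
      return 0;
  }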

Of course, there are limitations:


  1. The “b” in cbmc stands for “bounded.” In particular, cbmc handles neither infinite loops nor infinite recursion. The --unwind and --depth arguments to cbmc allow you to control how much looping and recursion is analyzed. See the manual for more information.
  2. The SAT solvers used by cbmc have improved greatly over the past 25 years. In fact, where a 100-variable problem was at the edge of what could be handled in the 1990s, most ca-2015 solvers can handle more than a million variables. However, the NP-complete nature of SAT does occasionally make its presence known, for example, programs that reduce to a proof involving the pigeonhole principle are not handled well as of early 2015.
  3. Handling of concurrency is available in later versions of cbmc, but is not as mature as is the handling of single-threaded code.


All that aside, everything has its limitations, and cbmc's ease of use is quite impressive. I expect to continue to use it from time to time, and strongly recommend that you give it a try!

March 02, 2015 01:08 AM

February 27, 2015

Matthew Garrett: Actions have consequences (or: why I'm not fixing Intel's bugs any more)

A lot of the kernel work I've ended up doing has involved dealing with bugs on Intel-based systems - figuring out interactions between their hardware and firmware, reverse engineering features that they refuse to document, improving their power management support, handling platform integration stuff for their GPUs and so on. Some of this I've been paid for, but a bunch has been unpaid work in my spare time[1].

Recently, as part of the anti-women #GamerGate campaign[2], a set of awful humans convinced Intel to terminate an advertising campaign because the site hosting the campaign had dared to suggest that the sexism present throughout the gaming industry might be a problem. Despite being awful humans, it is absolutely their right to request that a company choose to spend its money in a different way. And despite it being a dreadful decision, Intel is obviously entitled to spend their money as they wish. But I'm also free to spend my unpaid spare time as I wish, and I no longer wish to spend it doing unpaid work to enable an abhorrently-behaving company to sell more hardware. I won't be working on any Intel-specific bugs. I won't be reverse engineering any Intel-based features[3]. If the backlight on your laptop with an Intel GPU doesn't work, the number of fucks I'll be giving will fail to register on even the most sensitive measuring device.

On the plus side, this is probably going to significantly reduce my gin consumption.

[1] In the spirit of full disclosure: in some cases this has resulted in me being sent laptops in order to figure stuff out, and I was not always asked to return those laptops. My current laptop was purchased by me.

[2] I appreciate that there are some people involved in this campaign who earnestly believe that they are working to improve the state of professional ethics in games media. That is a worthy goal! But you're allying yourself to a cause that disproportionately attacks women while ignoring almost every other conflict of interest in the industry. If this is what you care about, find a new way to do it - and perhaps deal with the rather more obvious cases involving giant corporations, rather than obsessing over indie developers.

For avoidance of doubt, any comments arguing this point will be replaced with the phrase "Fart fart fart".

[3] Except for the purposes of finding entertaining security bugs


February 27, 2015 12:28 AM

February 23, 2015

Dave Jones: backup solutions.

For the longest time, my backup solution has been a series of rsync scripts that have evolved over time into a crufty mess. Having become spoiled on my mac with time machine, I decided to look into something better that didn’t involve a huge time investment on my part.

The general consensus seemed to be that for ready-to-use home-NAS type devices, the way to go was either Synology or Drobo. You just stick in some disks, and set up NFS/Samba etc. with a bunch of mouse clicking. Perfect.

I had already decided I was going to roll with a 5-disk RAID6 setup, so I bit the bullet and laid down $1000 for a Synology 8-bay DS1815+. It came *triple* boxed, unlike the handful of 3TB HGST drives.
I chose the HGSTs after reading Backblaze’s report on failure rates across several manufacturers, and figured that after the RAID6 overhead, 8TB would be more than enough for a long time, even at the rate I accumulate flac and wav files. Also, worst case, I still had 3 spare bays I could expand into later if needed.

Installation was a breeze. The plastic drive caddies felt a little flimsy, but the drives were secure once in them, even if they did feel like they were going to snap as I flexed them to pop them into place. After putting in all the drives, I connected the four ethernet ports and powered it up.
After I connected to its web UI, it wanted to do a firmware update, like just about every internet-connected device does these days. It rebooted, and finally I could go about setting things up.

On first logging into the device over ssh, I think the first command I typed was uname. Seeing a 3.2 kernel surprised me a little. I got nervous thinking about how many VFS, EXT4, and MD bugfixes hadn’t made their way back to long-term stable, and got the creeps a little. I decided not to think too much about it, and to put faith in the Synology people doing backports (though I never got as far as looking into their kernel package).

The web UI is pretty slick, though it felt a little sluggish at times. I set up my RAID6 volume with a bunch of clicks, and then listened as all those disks started clattering away. After creation, it wanted to do an initial parity scan. I set it going, and went to bed. The next morning before going to work, I checked on it, and noticed it wasn’t even at 20% done. I left it going while I went into the office. I spent the night away from home, and so didn’t get back to it until another day later.

When I returned home, the volume was ready, but I noticed the device was noticeably hotter to the touch than I remembered. I figured it had been hammering the disks non-stop for 24 hours, so go figure, and that it would probably cool off a little as it idled. As the device was now ready for exporting, I set up an NFS export, and then spent some time fighting uid mappings, as you do. The device does have the ability to deal with LDAP and some other stuff that I’ve never had time to set up, so I did things the hard way. Once I had the export mounted, I started my first rsync from my existing backups.

While it was running, I remembered I had intended to set up bonding. A little bit of clicky-clicky later, it was done, and transfers started getting even faster. Very nice. I set up two bonds, with a pair of NICs in each. Given my desktop only has a dual NIC, that was good enough. Having a second 2xGigE bond, I figured, was nice in case I had multiple machines wanting to use it while I was doing a backup.

The backup was going to take a while, so I left it running.
A few hours later, I got back to it, and again, it was getting really hot. There are two pretty big fans in the back of the unit, and they were cranking out heat. Then things started getting really weird. I noticed that the rsync had hung. I ctrl-c’d it, and tried logging into the device as root. It took _minutes_ to get a command prompt. I typed top and waited. About two minutes later top started. Then it spontaneously rebooted.

When it came back up, I logged in, poked around the log files, and didn’t see anything out of the ordinary.
I restarted the rsync and let it go for a while. About 20 minutes later, I came back to check on it again, and found that the box had hung completely. The rsync was stalled, and I couldn’t ssh in. I rebooted the device, cursed a bit, and then decided to think about it for a while, so I never restarted the rsync. I clicked around in the interface to see if there was anything I could turn on or off that would perhaps give me some clue as to wtf was going on.
Then it rebooted spontaneously again.

It was about this time I was ready to throw the damn thing out the window. I bought this thing because I wanted a turn-key solution that ‘just worked’, and had quickly come to realize that with this device, when something went bad, I was pretty screwed. Sometimes “It runs Linux” just isn’t enough. For some people the Synology might be a great solution, but it wasn’t for me. Reading some of the Amazon reviews, it seems there were a few people complaining about their units overheating, which might explain the random reboots I saw. For a device I wanted to leave switched on 24/7 and never think about, something that overheats (especially when I’m not at home) really doesn’t give me feel-good vibes. Some of the other reviews on Amazon rave about the DS1815+. It may be that there was a bad batch and I got unlucky, but I felt burnt by the whole experience, and even if I had got a replacement, I don’t know that I could have trusted this thing with my data.

I ended up returning it to Amazon for a refund, and used the money to buy a motherboard, cpu, ram etc to build a dedicated backup computer. It might not have the fancy web ui, and it might mean I’ll still be using my crappy rsync scripts, but when things go wrong, I generally have a much better chance of fixing the problems.

Other surprises: at one point, I opened the unit up to install an extra 4GB of RAM (it comes with just 2GB by default), and noticed that it runs off a single 250W power supply, which seemed surprising to me. I thought disks used considerably more power during spin-up, but apparently they’re pretty low power these days.

So, two weeks of wasted time, frustration, and failed experiments. Hopefully by next week I’ll have my replacement solution all set up and can move on to more interesting things instead of fighting appliances.

backup solutions. is a post from: codemonkey.org.uk

February 23, 2015 02:04 AM

February 21, 2015

Michael Kerrisk (manpages): man-pages-3.80 is released

I've released man-pages-3.80. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from very many small fixes and improvements to various pages (by more than 30 contributors!), the most notable changes in man-pages-3.80 are the following:

February 21, 2015 12:50 PM

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XIV

Although junk mail, puppies, and patches often are unwelcome, there are exceptions. For example, if someone has been wanting a particular breed of dog for some time, that person might be willing to accept a puppy, even if that means giving it shots, housebreaking it, teaching it the difference between furniture and food, doing bottlefeeding, watching over it day and night, and even putting up with some sleepless nights.

Similarly, if a patch fixes a difficult and elusive bug, the maintainer might be willing to apply the patch by hand, fix build errors and warnings, fix a few bugs in the patch itself, run a full set of tests, fix any style problems, and even accept the risk that the bug might have unexpected side effects, some of which might result in some sleepless nights. This in fact is one of the reasons for the common advice given to open-source newbies: start by fixing bugs.

Other good advice for new contributors can be found here:


  1. Greg Kroah-Hartman's HOWTO do Linux kernel development – take 2 (2005)
  2. Jonathan Corbet's How to Participate in the Linux Community (2008)
  3. Greg Kroah-Hartman's Write and Submit your first Linux kernel Patch (2010)
  4. My How to make a positive difference in a FOSS project (2012)
  5. Daniel Lezcano's What do we mean by working upstream: A long-term contributor’s view


This list is mostly about contributing to the Linux kernel, but most other projects have similar pages giving good new-contributor advice.

February 21, 2015 04:51 AM

February 20, 2015

Pavel Machek: What you might want to know about uloz.to

1. captcha is not case-sensitive

2. you can get around the concurrent-downloads limit using an incognito window in Chromium. If you need more downloads, chromium --temp-profile does the trick, too.

What you might want to know about Debian stable

Somehow, Debian stable rules do not apply to the Chromium web browser: you won't get security updates for it. I'd say that Chromium is the most security-critical package on the system, so it is a strange decision to me. In any case, you want to uninstall chromium, or perhaps update to Debian testing.

February 20, 2015 09:56 AM

February 19, 2015

Matthew Garrett: It has been 0 days since the last significant security failure. It always will be.

So blah blah Superfish blah blah trivial MITM everything's broken.

Lenovo deserve criticism. The level of incompetence involved here is so staggering that it wouldn't be a gross injustice for the company to go under as a result[1]. But let's not pretend that this is some sort of isolated incident. As an industry, we don't care about user security. We will gladly ship products with known security failings and no plans to update them. We will produce devices that are locked down such that it's impossible for anybody else to fix our failures. We will hide behind vague denials, we will obfuscate the impact of flaws and we will deflect criticisms with announcements of new and shinier products that will make everything better.

It'd be wonderful to say that this is limited to the proprietary software industry. I would love to be able to argue that we respect users more in the free software world. But there are too many cases that demonstrate otherwise, even where we should have the opportunity to prove the benefits of open development. An obvious example is the smartphone market. Hardware vendors will frequently fail to provide timely security updates, and will cease to update devices entirely after a very short period of time. Fortunately there's a huge community of people willing to produce updated firmware. Phone manufacturer is never going to fix the latest OpenSSL flaw? As long as your phone can be unlocked, there's a reasonable chance that there's an updated version on the internet.

But this is let down by a kind of callous disregard for any deeper level of security. Almost every single third-party Android image is either unsigned or signed with the "test keys", a set of keys distributed with the Android source code. These keys are publicly available, and as such anybody can sign anything with them. If you configure your phone to allow you to install these images, anybody with physical access to your phone can replace your operating system. You've gained some level of security at the application level by giving up any real ability to trust your operating system.

This is symptomatic of our entire ecosystem. We're happy to tell people to disable security features in order to install third-party software. We're happy to tell people to download and build source code without providing any meaningful way to verify that it hasn't been tampered with. Install methods for popular utilities often still start "curl | sudo bash". This isn't good enough.

We can laugh at proprietary vendors engaging in dreadful security practices. We can feel smug about giving users the tools to choose their own level of security. But until we're actually making it straightforward for users to choose freedom without giving up security, we're not providing something meaningfully better - we're just providing the same shit sandwich on different bread.

[1] I don't see any way that they will, but it wouldn't upset me

comment count unavailable comments

February 19, 2015 07:43 PM

February 18, 2015

Pavel Machek: When you are drowning in regressions

..then the first step is to stop more regressions. That's the situation with the Nokia N900 kernel now: it has a lot of hardware, and there's kernel support for most of it, but userland support really is not mature enough. So I added a test script to tui/ofone, which allows testing of the battery, audio, LEDs, backlight, GPS, Bluetooth and more. It is called "tefone". The "ofone" script (with a GTK GUI) can be used to control the modem, place calls and read SMSes. You'll need a library to actually get voice calls with audio.

On a related note, my PC now resumes faster than my monitor turns on. I guess we are doing a good job. (Or maybe Fujitsu did not do such a good job.) Too bad resume broke completely in 3.20-rc0 on my Thinkpad.

February 18, 2015 10:50 AM

February 16, 2015

Matthew Garrett: Intel Boot Guard, Coreboot and user freedom

PC World wrote an article on how the use of Intel Boot Guard by PC manufacturers is making it impossible for end-users to install replacement firmware such as Coreboot on their hardware. It's easy to interpret this as Intel acting to restrict competition in the firmware market, but the reality is actually a little more subtle than that.

UEFI Secure Boot as a specification is still unbroken, which makes attacking the underlying firmware much more attractive. We've seen several presentations at security conferences lately that have demonstrated vulnerabilities that permit modification of the firmware itself. Once you can insert arbitrary code in the firmware, Secure Boot doesn't do a great deal to protect you - the firmware could be modified to boot unsigned code, or even to modify your signed bootloader such that it backdoors the kernel on the fly.

But that's not all. Someone with physical access to your system could reflash your system. Even if you're paranoid enough that you X-ray your machine after every border crossing and verify that no additional components have been inserted, modified firmware could still be grabbing your disk encryption passphrase and stashing it somewhere for later examination.

Intel Boot Guard is intended to protect against this scenario. When your CPU starts up, it reads some code out of flash and executes it. With Intel Boot Guard, the CPU verifies a signature on that code before executing it[1]. The hash of the public half of the signing key is flashed into fuses on the CPU. It is the system vendor that owns this key and chooses to flash it into the CPU, not Intel.

This has genuine security benefits. It's no longer possible for an attacker to simply modify or replace the firmware - they have to find some other way to trick it into executing arbitrary code, and over time these will be closed off. But in the process, the system vendor has prevented the user from being able to make an informed choice to replace their system firmware.

The usual argument here is that in an increasingly hostile environment, opt-in security isn't sufficient - it's the role of the vendor to ensure that users are as protected as possible by default, and in this case all that's sacrificed is the ability for a few hobbyists to replace their system firmware. But this is a false dichotomy - UEFI Secure Boot demonstrated that it was entirely possible to produce a security solution that provided security benefits and still gave the user ultimate control over the code that their machine would execute.

To an extent the market will provide solutions to this. Vendors such as Purism will sell modern hardware without enabling Boot Guard. However, many people will buy hardware without consideration of this feature and only later become aware of what they've given up. It should never be necessary for someone to spend more money to purchase new hardware in order to obtain the freedom to run their choice of software. A future where users are obliged to run proprietary code because they can't afford another laptop is a dystopian one.

Intel should be congratulated for taking steps to make it more difficult for attackers to compromise system firmware, but criticised for doing so in such a way that vendors are forced to choose between security and freedom. The ability to control the software that your system runs is fundamental to Free Software, and we must reject solutions that provide security at the expense of that ability. As an industry we should endeavour to identify solutions that provide both freedom and security and work with vendors to make those solutions available, and as a movement we should be doing a better job of articulating why this freedom is a fundamental part of users being able to place trust in their property.

[1] It's slightly more complicated than that in reality, but the specifics really aren't that interesting.

comment count unavailable comments

February 16, 2015 08:44 PM

February 15, 2015

Matt Domsch: s3cmd 1.5.2 – major update coming to Fedora and EPEL

As the new upstream maintainer for the popular s3cmd program, I have been collecting and making fixes all across the codebase for several months. In the last couple of weeks it has finally gotten stable enough to warrant publishing a formal release. Aside from bugfixes, its primary enhancement is adding support for the AWS Signature v4 method, which is required to create S3 buckets in the eu-central-1 (Frankfurt) region, and is a more secure request-signing method usable in all AWS S3 regions. Since the s3cmd v1.5.0 release, Python 2.7.9 (as shipped by Arch Linux, for example) added support for SSL certificate validation. Unfortunately, that validation broke for SSL wildcard certificates (e.g. *.s3.amazonaws.com). Ubuntu 14.04 has an intermediate flavor of this validation, which also broke s3cmd. A couple of quick fixes later, and v1.5.2 is published now.

I’ve updated the packages in Fedora 20, 21, and rawhide. The EPEL 6 and 7 epel-testing repositories have these as well. If you use s3cmd on RHEL/CentOS, please upgrade to the package in epel-testing and give it karma. Bug reports are welcome at https://github.com/s3tools/s3cmd.

February 15, 2015 05:05 PM

February 13, 2015

Rusty Russell: lguest Learns PCI

With the 1.0 virtio standard finalized by the committee (though minor non-material corrections and clarifications are still trickling in), Michael Tsirkin did the heavy lifting of writing the Linux drivers (based partly on an early prototype of mine).

But I wanted an independent implementation to test: both because OASIS insists on multiple implementations before standard ratification, and because I wanted to make sure the code which is about to go into the merge window works well.

Thus, I began the task of making lguest understand PCI.  Fortunately, the osdev wiki has an excellent introduction on how to talk PCI on an x86 machine.  It didn’t take me too long to get a successful PCI bus scan from the guest, and to start implementing the virtio parts.
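
For readers who have not seen it before, here is a minimal sketch of the legacy port-I/O method for reading PCI configuration space that the osdev wiki describes. This is purely illustrative and is not the actual lguest code; the helper names are made up:

/* Legacy PCI configuration-space read via the 0xCF8/0xCFC port pair. */
#include <stdint.h>

static inline void outl(uint16_t port, uint32_t val)
{
        asm volatile("outl %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint32_t inl(uint16_t port)
{
        uint32_t val;
        asm volatile("inl %1, %0" : "=a"(val) : "Nd"(port));
        return val;
}

static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
{
        uint32_t addr = (1u << 31)              /* enable bit */
                      | ((uint32_t)bus << 16)
                      | ((uint32_t)dev << 11)
                      | ((uint32_t)fn << 8)
                      | (off & 0xfc);           /* dword-aligned register */

        outl(0xCF8, addr);                      /* CONFIG_ADDRESS */
        return inl(0xCFC);                      /* CONFIG_DATA */
}

A bus scan then simply reads offset 0 for every bus/device/function and treats a vendor ID of 0xFFFF as "no device present".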

The final part (over which I procrastinated for a week) was to step through the spec and document all the requirements in the lguest comments.  I also added checks that the guest driver was behaving sufficiently well, and now it’s finally done.

It also resulted in a few minor patches, and some clarification patches for the spec.  No red flags, however, so I’m reasonably confident that 3.20 will have compliant 1.0 virtio support!

February 13, 2015 06:31 AM

James Morris: Save the Date — Linux Security Summit 2015, August 20-21, Seattle WA, USA

The Linux Security Summit for 2015 will be held across 20-21 August, in Seattle, WA, USA.  As with previous events, we’ll be co-located with LinuxCon.

Preliminary event details are available at the event site:

http://events.linuxfoundation.org/events/linux-security-summit

A CFP will be issued soon — stay tuned!

Thank you to Máirín Duffy, who created wonderful logos for the event.


February 13, 2015 03:26 AM

February 11, 2015

Daniel Vetter: Neat drm/i915 Stuff for 3.20

Linux 3.19 was just released and my usual overview of what the next merge window will bring is more than overdue. The big thing overall is certainly all the work around atomic display updates, but read on for what else all has been done.

Let's first start with all the driver internal rework to support atomic. The big thing with atomic is that it requires a clean split between the code that checks display updates and the code that commits a new display state to the hardware. The corollary of that is that any derived state that's computed in the validation code and needed in the commit code must be stored somewhere in the state object. Gustavo Padovan and Matt Roper have done all that work to support atomic plane updates. This is the code that's now in 3.20 as a tech preview. The big things missing for proper atomic plane updates are async commit support (which has already landed for 3.21) and support to adjust watermark settings on the fly. Patches for that from Ville have been around for a long time, but need to be rebased, reviewed and extended for recently added platforms.
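
As a rough illustration of that check/commit split, the plane hooks a driver fills in for the atomic helpers end up looking something like the sketch below. The foo_* names are invented for illustration; the hook and structure names come from the DRM plane helpers of this era, though the exact signatures have varied between kernel versions:

#include <drm/drm_plane_helper.h>

/* Check phase: validate the requested plane state and stash any derived
 * values in the state object.  No hardware access happens here, and this
 * is the only place that is allowed to fail. */
static int foo_plane_atomic_check(struct drm_plane *plane,
                                  struct drm_plane_state *state)
{
        return 0;
}

/* Commit phase: write the already-validated state to the hardware.
 * Failures are not allowed at this point. */
static void foo_plane_atomic_update(struct drm_plane *plane,
                                    struct drm_plane_state *old_state)
{
}

static const struct drm_plane_helper_funcs foo_plane_helper_funcs = {
        .atomic_check  = foo_plane_atomic_check,
        .atomic_update = foo_plane_atomic_update,
};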

On the modeset side Ander Conselvan de Oliveira has done a lot of the necessary work already. Unfortunately converting the modeset code is much more involved, mainly for two reasons: First, there's a lot more derived state that needs to be handled, and the existing code already has structures and code for this. Conceptually the code has been prepared for an atomic world since the big display rewrite and the introduction of CRTC configuration structures. But converting the i915 modeset code to the new DRM atomic structures and interfaces is still a lot of work. Most of these refactorings have landed in 3.20. The other hold-up is shared resources and the software state to handle them. This is mostly for handling display PLLs, but also other shared resources like the display FIFO buffers. Patches to handle this are still in flight.

Continuing with modeset work Jani Nikula has reworked the DSI support to use the shared DSI helpers from the DRM core. Jani also reworked the DSI code in preparation for dual-link DSI support, which Gaurav Singh implemented. Rodrigo Vivi and others provided a lot of patches to improve PSR support and enable it for Baytrail/Braswell. Unfortunately there are still issues with the automated testcase and so PSR unfortunately stays disabled by default for now. Rodrigo also wrote some nice DocBook documentation for FBC, another step towards fully documenting i915 driver internals.

Moving on to platform enabling there has been a lot of work from Ville on Cherryview: RPS/gpu turbo and pipe CRC support (used for automated display testing) are both improved. On Skylake almost all the basic enabling is merged now: PM patches, enabling mmio pageflips and fastboot support from Damien have landed. Tvrtko Ursulin also created the infrastructure for global GTT views. This will be used for some upcoming features on Skylake. And to simplify enabling future platforms Chris Wilson and Mika Kuoppala have completely rewritten the forcewake domains handling code.

Also really important for Skylake is that the execlist support for gen8+ command submission is finally in good enough shape to be used by default - on Skylake that's the only supported path, legacy ring submission has been deprecated. Around that feature and a few other closely related ones a lot of code was refactored: John Harrison implemented the conversion from raw sequence numbers to request objects for gpu progress tracking. This is also prep work for landing a gpu scheduler. Nick Hoath removed the special execlist request tracking structures, simplifying the code. The engine initialization code was also refactored for a cleaner split between software and hardware initialization, leading to more robust code for driver load and system resume. Dave Gordon has also reworked the code tracking and computing the ring free space. On top of that we've also enabled full ppgtt again, but for now only where execlists are available since there are still issues with the legacy ring-based pagetable loading.

For generic GEM work there's the really nice support for write-combine cpu memory mappings from Akash Goel and Chris Wilson. On Atom SoC platforms, which lack the giant LLC that bigger chips have, this gives a really fast way to upload textures. And even with LLC it's useful for uploading to scanout targets since those are always uncached. But like the special-purpose upload paths for LLC platforms the cpu mmap views do not detile textures, hence special-purpose fastpaths need to be written in mesa and other userspace to fully exploit this. In other GEM features the shadow batch copying code for the command parser has now also landed.

Finally there's the redesign from Imre Deak to use the component framework for the snd-hda/i915 interactions. Modern platforms need a lot of coordination between the graphics and sound driver sides for audio over HDMI, and thus far this was done with some ad-hoc dynamic symbol lookups. Which results in a lot of headaches to get the ordering correct for driver load or system suspend and resume. With the component framework this dependency is now explicit, which means we will be able to get rid of a few hacks. It's also much easier to extend for the future - new platforms tend to integrate different components even more.

February 11, 2015 11:20 PM

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XIII

True confession: I was once a serial junk mailer. Not mere email spam, but physical bulk-postage-rate flyers, advertising a series of non-technical conferences. It was of course for a good cause, and one of the most difficult things about that task was convincing that cause's leaders that this flyer was in fact junk mail. They firmly believed that anyone with even a scrap of compassion would of course read the flyer from beginning to end, feeling the full emotional impact of each and every lovingly crafted word. They reluctantly came around to my view, which was that we had at most 300 milliseconds to catch the recipient's attention, that being the amount of time that the recipient might glance at the flyer on its way into the trash. Or at least I think that they came around to my view. All I really know is that they stopped disputing the point.

But junk mail for worthy causes is not the only thing that can be less welcome than its sender might like.

For example, Jim Wasko noticed a sign at a daycare center that read: “If you are late picking up your child and have not called us in advance, we will give him/her an espresso and a puppy. Have a great day.”

Which goes to show that although puppies are cute and lovable, and although their mother no doubt went to a lot of trouble to bring them into this world, they are, just like junk mail, not universally welcome. And this should not be too surprising, given the questions that come to mind when contemplating a free puppy. Has it had its shots? Is it housebroken? Has it learned that furniture is not food? Has it been spayed/neutered? Is it able to eat normal dogfood, or does it still require bottlefeeding? Is it willing to entertain itself for long periods? And, last, but most definitely not least, is it willing to let you sleep through the night?

Nevertheless, people are often surprised and bitterly disappointed when their offers of free puppies are rejected.

Other people are just as surprised and disappointed when their offers of free patches are rejected. After all, they put a lot of work into their patches, and they might even get into trouble if the patch isn't eventually accepted.

But it turns out that patches are a lot like junk mail and puppies. They are greatly valued by those who produce them, but often viewed with great suspicion by the maintainers receiving them. You see, the thought of accepting a free patch also raises questions. Does the patch apply cleanly? Does it build without errors and warnings? Does it run at all? Does it pass regression tests? Has it been tested with the commonly used combination of configuration parameters? Does the patch have good code style? Is the patch maintainable? Does the patch provide a straightforward and robust solution to whatever problem it is trying to solve? In short, will this patch allow the maintainer to sleep through the night?

I am extremely fortunate in that most of the RCU patches that I receive are “good puppies.” However, not everyone is so lucky, and I occasionally hear from patch submitters whose patches were not well received. They often have a long list of reasons why their patches should have been accepted, including:


  1. I put a lot of work into that patch, so it should have been accepted! Unfortunately, hard work on your part does not guarantee a perception of value on the maintainer's part.
  2. The maintainer's job is to accept patches. Maybe not, your maintainer might well be an unpaid volunteer.
  3. But my maintainer is paid to maintain! True, but he is probably not being paid to do your job.
  4. I am not asking him to do my job, but rather his/her job, which is to accept patches! The maintainer's job is not to accept any and all patches, but instead to accept good patches that further the project's mission.
  5. I really don't like your attitude! I put a lot of work into making this be a very good patch! It should have been accepted! Really? Did you make sure it applied cleanly? Did you follow the project's coding conventions? Did you make sure that it passed regression tests? Did you test it on the full set of platforms supported by the project? Does it avoid problems discussed on the project's mailing list? Did you promptly update your patch based on any feedback you might have received? Is your code maintainable? Is your code aligned with the project's development directions? Do you have a good reputation with the community? Do you have a track record of supporting your submissions? In other words, will your patch allow the maintainer to sleep through the night?
  6. But I don't have time to do all that! Then the maintainer doesn't have time to accept your patch. And most especially doesn't have time to deal with all the problems that your patch is likely to cause.

As a recovering proprietary programmer, I can assure you that things work a bit differently in the open-source world, so some adjustment is required. But participation in an open-source project can be very rewarding and worthwhile!

February 11, 2015 07:40 PM

Michael Kerrisk (manpages): Linux/UNIX System Programming course scheduled for May 2015, Munich

I've scheduled a further 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 4-8 May 2015.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory); and network programming (sockets).
     
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via training@man7.org.

February 11, 2015 09:06 AM

February 10, 2015

James Bottomley: Anatomy of the UEFI Boot Sequence on the Intel Galileo

The Basics

UEFI boot officially has three phases (SEC, PEI and DXE).  However, the DXE phase is divided into DXEBoot and DXERuntime (the former is eliminated after the call to ExitBootServices()).  The jobs of each phase are

  1. SEC (SECurity phase). This contains all the CPU initialisation code from the cold boot entry point on.  Its job is to set the system up far enough to find, validate, install and run the PEI.
  2. PEI (Pre-Efi Initialization phase).  This configures the entire platform and then loads and boots the DXE.
  3. DXE (Driver eXecution Environment).  This is where the UEFI system loads drivers for configured devices, if necessary; mounts drives and finds and executes the boot code.  After control is transferred to the boot OS, the DXERuntime stays resident to handle any OS to UEFI calls.

How it works on Quark

This all sounds very simple (and very like the way an OS like Linux boots up).  However, there’s a very crucial difference: The platform really is completely unconfigured when SEC begins.  In particular it won’t have any main memory, so you begin in a read only environment until you can configure some memory.  C code can’t begin executing until you’ve at least found enough writable memory for a stack, so the SEC begins in hand crafted assembly until it can set up a stack.

On all x86 processors (including the Quark), power on begins execution in 16 bit mode at the ResetVector (0xfffffff0). As a helping hand, the default power on bus routing has the top 128KB of memory mapped into the top of SPI flash (read only, of course) via a PCI routing in the Legacy Bridge, meaning that the reset vector executes directly from the SPI Flash (this is actually very slow: SPI means Serial Peripheral Interface, so every byte of SPI flash has to be read serially into the instruction cache before it can be executed).

The hand crafted assembly clears the cache, transitions to Flat32 bit execution mode and sets up the necessary x86 descriptor tables.  It turns out that memory configuration on the Quark SoC is fairly involved and complex so, in order to spare the programmer from having to do this all in assembly, there’s a small (512kB) static ram chip that can be easily configured, so the last assembly job of the SEC is to configure the eSRAM (to a fixed address at 2GB), set the top as the stack, load the PEI into the base (by reconfiguring the SPI flash mapping to map the entire 8MB flash to the top of memory and then copying the firmware volume containing the PEI) and begin executing.

QuarkPlatform Build Oddities

Usually the PEI code is located by the standard Flash Volume code of UEFI and the build time PCDs (Platform Configuration Database entries) which use the values in the Flash Definition File to build the firmware.  However, the current Quark Platform package has a different style because it rips apart and rebuilds the flash volumes, so instead of using PCDs, it uses something it calls Master Flash Headers (MFHs) which are home grown for Quark.  These are a fixed area of the flash that can be read as a database giving the new volume layout (essentially duplicating what the PCDs would normally have done).  Additionally the Quark adds a non-standard signature header occupying 1k to each flash volume which serves two purposes: For the SECURE_LD case, it actually validates the volume, but for the three items in the firmware that don’t have flash headers (the kernel, the initrd and the grub config) it serves to give the lengths of each.

Laying out Flash Rom

This is a really big deal for most embedded systems because the amount of flash available is really limited.  The Galileo board is nice because it supplies 8MB of flash … which is huge in embedded terms.  All flash is divided into Flash Volumes1.  If you look at OVMF for instance, it builds its flash as four volumes: Three for the three SEC, PEI and DXE phases and one for the EFI variables.  In EdkII, flash files are built by the flash definition file (the one with a .fdf ending).  Usually some part of the flash is compressed and has to be inflated into memory (in OVMF this is PEI and DXE) and some are designed to be executed in place (usually SEC).  If you look at the Galileo layout, you see that it has a big SEC phase section (called BOOTROM_OVERRIDE) designed for the top 128KB of the flash, the usual variable area and then five additional sections, two for PEI and DXE and three recovery ones (and, of course, an additional payload section for the OS that boots from flash).

Embedded Recovery Sections

For embedded devices (and even normal computers) recovery in the face of flash failure (whether from component issues or misupdate of the flash) is really important, so the Galileo follows a two-stage fallback process.  The first stage is to detect a critical error signalled by the platform sticky bit, or recovery strap in the SEC and boot up to the fixed phase recovery which tries to locate a recovery capsule on the USB media. The other recovery is a simple copy of the PEI image for fallback in case the primary PEI image fails (by now you’ll have realised there are three separate but pretty much identical copies of PEI in the flash rom).  One of the first fixes that can be made to the Quark build is to consolidate all of these into a single build description.

Putting it all together: Implementing a compressed PEI Phase

One of the first things I discovered when trying to update the UEFI version to something more modern is that the size of the PEI phase overflows the allowed size of the firmware volume.  This means either redoing the flash layout or compressing the PEI image.  I chose the latter and this is the story of how it went.

The first problem is that debug prints don’t work in the SEC phase, which is where the changes are going to have to be.  This is a big stumbling block because without debugging, you never know where anything went wrong.  It turns out that UEFI nicely supports this via a special DebugLib that outputs to the serial console, but that the Galileo firmware build has this disabled by this line:

[LibraryClasses.IA32.SEC]
...
 DebugLib|MdePkg/Library/BaseDebugLibNull/BaseDebugLibNull.inf

The BaseDebugLibNull does pretty much what you expect: throws away all Debug messages.  When this is changed to something that outputs messages, the size of the PEI image explodes again, mainly because Stage1 has all the SEC phase code in it.  The fix here is to enable debugging only inside the QuarkResetVector SEC phase code.  You can do this in the .dsc file with

 QuarkPlatformPkg/Cpu/Sec/ResetVector/QuarkResetVector.inf {
   <LibraryClasses>
     DebugLib|MdePkg/Library/BaseDebugLibSerialPort/BaseDebugLibSerialPort.inf
 }

And now debugging works in the SEC phase!

It turns out that a compressed PEI is possible but somewhat more involved than I imagined, so that will be the subject of the next blog post.  For now I’ll round out with other oddities I discovered along the way.

Quark Platform SEC and PEI Oddities

On the current Quark build, the SEC phase is designed to be installed into the bootrom from 0xfffe 0000 to 0xffff ffff.  This contains a special copy of the reset vector (in theory it contains the PEI key validation for SECURE_LD, but in practice the verifiers are hard coded to return success).  The next oddity is that the stage1 image, which should really just be the PEI core, actually contains another boot-from-scratch SEC phase, except this one is based on the standard IA32 reset vector code plus a magic QuarkSecLib and then the PEI code.  This causes the stage1 bring up to be different as well, because usually the SEC code locates the PEI core in stage1 and loads, relocates and executes it starting from the entry point PeiCore().  However, Quark doesn’t do this at all.  It relies on the Firmware Volume generator populating the first ZeroVector (an area occupying the first 16 bytes of the Firmware Volume Header) with a magic entry (located in the ResetVector via the magic string ‘SPI Entry Point ‘ with the trailing space).  The SEC code indirects through the ZeroVector to this code and effectively re-initialises the stack and begins executing the new SEC code, which then locates the internal copy of the PEI core and jumps to it.

February 10, 2015 12:58 AM

February 06, 2015

LPC 2015: LPC 2015 Call for Microconference Proposals

The LPC 2015 Planning Committee has opened the submissions for Microconference Proposals. Please review the instructions on our Participate page carefully, because this year we have implemented a few new changes to the process. We look forward to your proposals!

As usual if you have any questions, please contact us.

February 06, 2015 10:21 PM

LPC 2015: LPC 2015 Call for Refereed Presentations

We are pleased to announce that the Call for Refereed Presentations for the Linux Plumbers Conference 2015 is now open. Please follow the instructions on our Participate page. Just like in the past, the Refereed Presentation track is shared with LinuxCon North America which will take place August 17 – 19 at the same location. In particular, the LPC specific refereed presentations will be scheduled on the overlap day, August 19, 2015. Mark your calendar!

If you have any questions, please contact the LPC planning committee.

February 06, 2015 10:13 PM

LPC 2015: Welcome to the 2015 Edition of the Linux Plumbers Conference!

The 2015 LPC planning committee is pleased to announce that the planning for this year’s edition of the Linux Plumbers Conference has started. The conference will take place in Seattle, Washington USA from Wednesday August 19 to Friday August 21, 2015, at the Sheraton Seattle Hotel.
We hope to see you there for a very productive set of discussions!

February 06, 2015 09:57 PM

February 02, 2015

Michael Kerrisk (manpages): man-pages-3.79 is released

I've released man-pages-3.79. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from many smaller fixes and improvements to various pages, the most notable changes in man-pages-3.79 are the following:

February 02, 2015 08:47 AM

February 01, 2015

Paul E. Mc Kenney: Parallel Programming: January 2015 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features a new chapter on SMP real-time programming, an updated formal-verification chapter, removal of a couple of the less-relevant appendices, several new cartoons (along with some refurbishing of old cartoons), and other updates, including contributions from Bill Pemberton, Borislav Petkov, Chris Rorvick, Namhyung Kim, Patrick Marlier, and Zygmunt Bazyli Krynicki.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

February 01, 2015 02:36 AM

January 29, 2015

Paul E. Mc Kenney: Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE

I suppose that I might as well get the “But what about Verification Challenge 1?” question out of the way to start with. You will find Verification Challenge 1 here.

Now, Verification Challenge 1 was about locating a known bug. Verification Challenge 2 takes a different approach: The goal is instead to find a bug if there is one, or to prove that there is no bug. This challenge involves the Linux kernel's NO_HZ_FULL_SYSIDLE functionality, which is supposed to determine whether or not all non-housekeeping CPUs are idle. The normal Linux-kernel review process located an unexpected bug (which was allegedly fixed), so it seemed worthwhile to apply some formal verification. Unfortunately, all of the tools that I tried failed. Not simply failed to verify, but failed to run correctly at all—though I have heard a rumor that one of the tools was fixed, and thus advanced to the “failed to verify” state, where “failed to verify” apparently meant that the tool consumed all available CPU and memory without deigning to express an opinion as to the correctness of the code.

So I fell back to 20-year-old habits and converted my C code to a Promela model and used spin to do a full-state-space verification. After some back and forth, this model did claim verification, and correctly refused to verify bug-injected perturbations of the model. Mathieu Desnoyers created a separate Promela model that made more deft use of temporal logic, and this model also claimed verification and refused to verify bug-injected perturbations. So maybe I can trust them. Or maybe not.

Unfortunately, regardless of whether or not I can trust these models, they are not appropriate for regression testing. How large a change to the C code requires a corresponding change to the Promela models? And how can I be sure that this corresponding change was actually the correct change? And so on: These kinds of questions just keep coming.

It would therefore be nice to be able to validate straight from the C code. So if you have a favorite verification tool, why not see what it can make of NO_HZ_FULL_SYSIDLE? The relevant fragments of the C code, along with both Promela models, can be found here. See the README file for a description of the files, and you know where to find me for any questions that you might have.

If you do give it a try, please let me know how it goes!

January 29, 2015 05:06 AM

January 28, 2015

Daniel Vetter: Update for Atomic Display Updates

Another kernel release is imminent and a lot of things happened since my last big blog post about atomic modeset. Read on for what new shiny things 3.20 will bring this area.

Atomic IOCTL and Properties


The big thing for sure is that the actual atomic IOCTL from Ville has finally landed for 3.20. That, together with all the work from Rob Clark to add all the new atomic properties to make it functional (there are no IOCTL structs for any standard values, everything is a property now), means userspace can finally start using atomic. Well, it's still hidden behind the experimental module option drm.atomic=1 but otherwise it's all there. There are a few big differences compared to earlier iterations:
Another missing piece that's now resolved is the DPMS support. On a high level this greatly reduces the complexity of the legacy DPMS settings to a simple per-CRTC boolean. Contemporary userspace really wants nothing more, and a simple on/off is all that current hardware can support anyway. Furthermore all the bookkeeping is done in the helpers, which call down into drivers to enable or disable entire display pipelines as needed. Which means that drivers can rip out tons of code for the legacy DPMS support by just wiring up drm_atomic_helper_connector_dpms.

New Driver Conversions and Backend Hooks


Another big thing for 3.20 is that driver support has moved forward greatly: Tegra has now most of the basic bits ready, MSM is already converted. Both still lack conversion and testing for DPMS since that landed very late though. There's a lot of prep work for exynos, but unfortunately nothing yet towards atomic support. And i915 is in the process of being converted to the new structures and will have very preliminary atomic plane updates support in 3.20, hidden behind a module option for now.

And all that work resulted in piles of little fixes. Like making sure that the legacy cursor IOCTLs don't suddenly stall a lot more, breaking old userspace. But we also added a bunch more hooks for driver backends to simplify driver code:
Finally driver conversions showed that vblank handling requirements imposed by the atomic helpers are a lot stricter than what userspace tends to cope with. i915 has recently reworked all its vblank handling and improved the core helpers with the addition of drm_crtc_vblank_off() and drm_crtc_vblank_on(). If these are used in the CRTC disable and enable hooks the vblank code will automatically reject vblank waits when the pipe is off. Which is the behaviour both the atomic helpers and the transitional helpers expect.

One complication is that on load the driver must ensure manually that the vblank state matches up with the hardware and atomic software state, with a manual call to these functions. In the simple case where drivers reset everything to off (which is what the reset implementations provided by the atomic helpers presume) this just means calling drm_crtc_vblank_off() somewhere in the driver load code. For drivers that read out the actual hardware state they need to call either _off() or _on() matching the actual display pipe status.
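
As a hedged sketch of what this looks like in a driver (the foo_* names are invented, and the header locations and exact hook names vary between kernel versions and helper flavours, so treat this as illustration rather than reference):

#include <drm/drmP.h>               /* declares drm_crtc_vblank_on()/off() in this era */
#include <drm/drm_crtc_helper.h>

static void foo_crtc_enable(struct drm_crtc *crtc)
{
        /* ... program the hardware and start scanning out ... */
        drm_crtc_vblank_on(crtc);     /* vblank waits become valid again */
}

static void foo_crtc_disable(struct drm_crtc *crtc)
{
        drm_crtc_vblank_off(crtc);    /* reject further vblank waits */
        /* ... shut the display pipe down ... */
}

static const struct drm_crtc_helper_funcs foo_crtc_helper_funcs = {
        .enable  = foo_crtc_enable,
        .disable = foo_crtc_disable,
};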

Future Work


Of course there are still a few things left to do before atomic display updates can be rolled out to the masses. And a few things that would be rather nice to have, too:
It promises to stay interesting!

January 28, 2015 05:18 PM

January 27, 2015

James Bottomley: Secure Boot on the Intel Galileo

The first thing to do is set up a build environment.  The Board support package that Intel supplies comes with a vast set of instructions and a three stage build process that uses the standard edk2 build to create firmware volumes, rips them apart again then re-lays them out using spi-flashtools to include the Arduino payload (grub, the linux kernel, initrd and grub configuration file), adds signed headers before creating a firmware flash file with no platform data and then as a final stage adds the platform data.  This is amazingly painful once you’ve done it a few times, so I wrote my own build script with just the essentials for building a debug firmware for my board (it also turns out there’s a lot of irrelevant stuff in quarkbuild.sh which I dumped).  I’m also a git person, not really an svn one, so I redid the Quark_EDKII tree as a git tree with full edk2 and FatPkg as git submodules and a single build script (build.sh) which pulls in all the necessary components and delivers a flashable firmware file.  I’ve linked the submodules to the standard tianocore github ones.  However, please note that the edk2 git pack is pretty huge and it’s in the nature of submodules to be volatile (you end up blowing them away and re-pulling the entire repo with git submodule update a lot, especially because the build script ends up patching the submodules) so you’ll want to clone your own edk2 tree and then add the submodule as a reference.  To do this, execute

git submodule init
git submodule update --reference <wherever you put your local edk2 tree> .module/edk2

before running build.sh; it will save a lot of re-cloning of the edk2 tree. You can also do this for FatPkg, but it’s tiny by comparison and not updated by the build scripts.

A note on the tree format: the Intel supplied Quark_EDKII is a bit of a mess: one of the requirements for the edk2 build system is that some files have to have dos line endings and some have to have unix.  Someone edited the Quark_EDKII less than carefully and the result is a lot of files where emacs shows splatterings of ^M, which annoys me incredibly so I’ve sanitised the files to be all dos or all unix line endings before importing.

A final note on the payload:  for the purposes of UEFI testing, it doesn’t matter and it’s another set of code to download if you want the Arduino payload, so the layout in my build just adds the EFI Shell as the payload for easy building.  The Arduino sections are commented out in the layout file, so you can build your own Arduino payload if you want (as long as you have the necessary binaries).  Also, I’m building for Galileo Kips Bay D … if you have a gen 2, you need to fix the platform-data.ini file.

An Aside about vga over serial consoles

If, like me, you’re interested in the secure boot aspect, you’re going to be spending a lot of time interacting with the VGA console over the serial line and you want it to look the part.  The default VGA PC console is very much stuck in an 80s timewarp.  As a result, the world has moved on and it hasn’t, leading to console menus that look like this

This image is directly from the Intel build docs and actually, this is Windows, which is also behind the times; if you fire this up from an xterm, the menu looks even worse.  To get VGA looking nice from an xterm, first you need to use a vga font (these have the line drawing characters in the upper 128 bytes), then you need to turn off UTF-8 (otherwise some of the upper 128 characters get seen as UTF-8 encodings), turn off the C1 control characters and set a keyboard mapping that sends what UEFI is expecting as F7.  This is what I end up with

xterm -fn vga -geometry 80x25 +u8 -kt sco -k8 -xrm "*backarrowKeyIsErase: false"

And don’t expect the -fn vga to work out of the box on your distro either … vga fonts went out with the ark, so you’ll probably have to install it manually.  With all of this done, the result is

And all the function keys work.

Back to our regularly scheduled programming: Secure Boot

The Quark build environment comes with a SECURE_LD mode which, at first sight, does look like secure boot.  However, what it builds is a cryptographic verification system for the firmware payloads (and disables the shell for good measure).  There is also an undocumented SECURE_BOOT build define instead; unfortunately this doesn’t even build (it craps out trying to add a firmware menu: the Quark is embedded and embedded systems don’t have firmware menus). Once this is fixed, the image will build but it won’t boot properly.  The first thing to discover is that ASSERT() fails silently on this platform.  Why? Well, if you turn on noisy assert, which prints the file and line of the failure, you’ll find that the addition of all these strings makes the firmware volumes too big to fit into their assigned spaces … Bummer.  However printing nothing is incredibly useless for debugging, so just add a simple print to let you know when an ASSERT() failure is hit.  It turns out there are a bunch of wrong asserts, including one put in deliberately by Intel to prevent secure boot from working because they haven’t tested it.  Once you get rid of these, it boots and mostly works.

Mostly because there’s one problem: authenticating by enrolled binary hashes mostly fails.  Why is this?  Well, it turns out to be a weakness in the Authenticode spec which UEFI secure boot relies on.  The spec is unclear on padding and alignment requirements (among a lot of other things; after all, why write a clear spec when there’s only going to be one implementation … ?).  In the old days, when we used CRCs, padding didn’t really matter because additional zeros didn’t affect the checksum, but in these days of cryptographic hashes, it does matter.  The problem is that the signature block appended to the EFI binary has an eight byte alignment.  If you add zeros at the end to achieve this, those zeros become part of the hash.  That means you have to hash IA32 binaries as if those padded zeros existed, otherwise the hash is different for signed and unsigned binaries.  X86-64 binaries seem to be mostly aligned, so this isn’t noticeable for them.  This problem is currently inherent in edk2, so it needs patching manually in the edk2 tree to get this working.
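
To make the alignment issue concrete, here is a hedged little illustration (not the edk2 code; the helper name is made up). The unsigned image length is rounded up to the eight-byte boundary at which the signature block will later be appended, and the zero bytes used to reach that boundary have to be fed into the hash as well, or the hash of the unsigned binary will not match the hash of the same binary once it has been signed:

#include <stddef.h>

/* Number of bytes that must be hashed for an image of image_len bytes:
 * the image itself plus the zero padding up to the 8-byte boundary at
 * which the Authenticode signature block is appended. */
static size_t authenticode_hash_len(size_t image_len)
{
        return (image_len + 7) & ~(size_t)7;
}

/* When computing the hash, feed the real image bytes followed by
 * (authenticode_hash_len(len) - len) zero bytes into the digest. */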

With this commit, the Quark build system finally builds a working secure boot image.  All you need to do is download the IA32 efitools (version 1.5.2 or newer — turns out I also didn’t compute hashes correctly) to an SD card and you’re ready to play.

January 27, 2015 03:20 AM

January 26, 2015

Dave Jones: So much to learn..

At lunch a few days ago, I was discussing with a coworker how right now I’m feeling a little overwhelmed. Not entirely surprising, I suppose, given the volume of new things I need to learn. But as I’m getting acclimated in my new job, it’s becoming clearer which things I either don’t know well enough or am completely unfamiliar with.

I’m only now realizing that not only has the scope of what I’m working on changed, but also how I work on things. During the ten years I worked on Fedora, because we were woefully understaffed and buried alive in bugs, there was never really time to spend a lot of time on an individual bug, unless a lot of users were hitting it. Because of this, the things that I ended up fixing were usually fairly small fixes. The occasional NULL pointer dereference. Perhaps a use after free. Maybe linked-list corruption. Pretty basic stuff that anyone with familiarity with any part of the kernel could figure out reasonably quickly. Do enough of this, and it becomes less about engineering, and more about pattern recognition.

Even a lot of the bugs that trinity finds don’t need terribly invasive fixes.
(Screwed-up error paths are probably the most common thing it picks up, because nothing else really ever tests them.)

The more complicated bugs, like a WARN_ON being hit? Those could require a lot more understanding, which could take considerable time to build up.

As Fedora kernels were so close to mainline, a lot of the time we could chase up the maintainer, pass the bug along, and move onto something else. The result of ten years of this way of working is that I have a passing understanding of how lots of parts of the kernel interact from a 10,000ft perspective, and perhaps some understanding of some nuances based on past interactions, but no deep architectural understanding of, for example, how an sk_buff traverses the various parts of the network stack. A warning deep in the TCP guts, caused by a packet that’s been through bonding, netfilter, etc.? I’m not your guy. At least not today.

Thankfully I’ve got some time to ramp up my learning on unfamiliar parts of the kernel, and get a better understanding of how things work on a deeper level. I suspect I’ll end up turning at least some of the stuff I learn into future posts here.

In the meantime, I’ve got a lot of reading to do.
I felt a little reassured at least when my coworker responded “yeah, me too”.

So much to learn.. is a post from: codemonkey.org.uk

January 26, 2015 03:09 PM

Pavel Machek: On limping mice and cool-looking ideas

I got two Logitech mice, and they worked rather nicely. But then the buttons started acting funny on the first, then the second, and I switched to a Genius one (probably with a lower-resolution sensor). I quickly found that I had two mice with working sensors, and one mouse with working buttons. Not good.
Now, I got a new Logitech M90.. this one has a working sensor, two working buttons, and three legs of the same length. Ouch. It wobbles on the table :-(.
It also shows why mouse-box is a cool-looking, but not quite cool, idea. A mouse is something that gets damaged over time and needs to be replaced... which is something you don't want to do with your computer. (Plus, your mouse will not be too comfortable to use when it has three cables, not designed for bending, connected to it.)

January 26, 2015 11:19 AM

January 25, 2015

James Bottomley: Adventures in Embedded UEFI with Intel Galileo

At one of the Intel Technology Days conferences a while ago, Intel gave us a gift of a Galileo board, which is based on the Quark SoC, just before the general announcement.  The promise of the Quark SoC was that it would be a fully open (down to the firmware) embedded system based on UEFI.  When the board first came out, though, the UEFI code was missing (promised for later), so I put it on a shelf and forgot about it.   Recently, the UEFI Security Subteam has been considering issues that impinge on embedded architectures (mostly arm) so having an actual working embedded development board could prove useful.  This is the first part of the story of trying to turn the Galileo into an embedded reference platform for UEFI.

The first problem with getting the Galileo working is that if you want to talk to the UEFI part it’s done over a serial interface, with a 3.5″ jack connection.  However, a quick trip to amazon solved that one.  Equipped with the serial interface, it’s now possible to start running UEFI binaries.   Just using the default firmware (with no secure boot) I began testing the efitools binaries.  Unfortunately, they were building up the size of the secure variables (my startup.nsh script does an append write to db) and eventually the thing hit an assert failure on entering the UEFI handoff.  This led to the discovery that the recovery straps on the board didn’t work, there was no way to clear the variable NVRAM and the only way to get control back was to use an external firmware flash tool.  So it’s off for an unexpected trip to uncharted territory unless I want the board to stay bricked.

The flash tool Intel recommends is the Dediprog SF 100.  These are a bit expensive (around US$350) and there’s no US supplier, meaning you have to order from abroad, wait ages for it to be delivered and deal with US customs on import, something I’ve done before and have no wish to repeat.  So, casting about for a better solution, I came up with the Bus Pirate.  It’s a fully open hardware component (including a google code repository for the firmware and the schematics) plus it’s only US$35 (that’s 10x cheaper than the dediprog), available from Amazon and works well with Linux (for full disclosure, Amazon was actually sold out when I bought mine, so I got it here instead).

The Bus Pirate comes as a bare circuit board (no case or cables), so you have to buy everything you might need to use it as extras.  I knew I’d need a ribbon cable with SPI plugs (the Galileo has an SPI connector for the dediprog), so I ordered one with the card.  The first surprise when the card arrived was that the USB connector is actually Mini B, not the now-standard Micro connector.  I’ve not seen one of those since I had an Android G1, but, after looking in vain for my old android one, Staples still has the cables.    The next problem is that, being open hardware, there are multiple manufacturers.  They all supply a nice multi coloured ribbon cable, but there are two possible orientations and both are produced.  Turns out I have a sparkfun cable which is the opposite way around from the colour values in the firmware (which is why the first attempt to talk to the chip didn’t work).  The Galileo has diode isolators so the SPI flash chip can be powered up and operated independently by the Bus Pirate;  accounting for cable orientation, and disconnecting the Galileo from all other external power, this now works.  Finally, there’s a nice Linux project, flashrom, which allows you to flash all manner of chips and it has a programmer mode for the Bus Pirate.  Great, except that the default USB serial speed is 115200 and at that baud rate, it takes about ten minutes just to read an 8MB SPI flash (flashrom will read, program and then verify, giving you about 25 mins each time you redo the firmware).  Speeding this up is easy: there’s an unapplied patch to increase the baud rate to 2Mbit and I wrote some code to flash only designated areas of the chip (note to self: send this upstream).  The result is available on the OpenSUSE build service.  The outcome is that I’m now able to build and reprogram the firmware in around a minute.

By now this is two weeks, a couple of hacks to a tool I didn’t know I’d need and around US$60 after I began the project, but at least I’m now an embedded programmer and have the scars to prove it.  Next up is getting Secure Boot actually working ….

January 25, 2015 12:44 AM

January 23, 2015

Michael Kerrisk (manpages): man-pages-3.78 is released

I've released man-pages-3.78. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from many smaller fixes and improvements to various pages, the most notable changes in man-pages-3.78 are the following:

January 23, 2015 08:28 AM

James Bottomley: efitools now working for both x86 and x86_64

Some people noticed a while ago that version 1.5.1 was building for ia32, but 1.5.2 is the release that’s tested and works (1.5.1 had problems with the hash computations).  The primary reason for this work is bringing up secure boot on the Intel Quark SoC board (I’ve got the Galileo Kips Bay D but the link is to the new Gen 2 board).  There’s a corresponding release of OVMF, with a fix for a DxeImageVerification problem that meant IA32 hashes weren’t getting computed properly.   I’ll post more on the Galileo and what I’ve been doing with it in a different post (tagged for embedded, since it’s mostly a story of embedded board programming).

With these tools, you can now test out IA32 images for both UEFI and the key manipulation and signing tools.

January 23, 2015 05:05 AM

January 21, 2015

Dave Jones: Thoughts on long-term stable kernels.

I remembered something that I found eye-opening while interviewing/phone-screening over the last few months.

The number of companies that base their systems (especially those that don’t actually distribute their kernels outside the company) on the long-term stable releases of Linux caught me by surprise. We dismissed the idea of basing Fedora on long-term stable releases after giving it a try circa Fedora 14; it was generally a disaster because the bugs being fixed didn’t match up much with the bugs our users were seeing. We found that we got more bugs we cared about fixed by sticking to the rolling model that Fedora is known for.

After discussing this with several potential employers, I now have a different perspective on this, and can see the appeal if you have a limited use-case, and only care about a small subset of hardware.

The general purpose “one size fits all” model that distribution kernels have to fit is a much bigger problem to solve, and with the feedback loop of “stable release -> bug -> report upstream -> fixed in mainline -> backport to next stable release” being so long, it’s not really a surprise that just having users be closer to bleeding edge gets a higher volume of bugs fixed (though at the cost of all the shiny new bugs that get introduced along the way) unless you have a RHEL-like small army of developers backporting fixes.

Finally, nearly everyone I talked to who uses long-term stable was also carrying some ‘extras’, such as updated drivers for hardware they cared about (in some cases, out of tree variants that aren’t even synced with linux-next yet).

It’s a complicated mess out there. “We run linux stable” doesn’t necessarily tell you the whole picture.

Thoughts on long-term stable kernels. is a post from: codemonkey.org.uk

January 21, 2015 04:31 AM

January 19, 2015

Daniel Vetter: LCA 2015: Botching Up IOCTLs

So I'm stuck somewhere at an airport, jetlagged, on my return trip from my very first LCA in Auckland - awesome conference, except for being on the wrong side of the globe: very broad range of speakers, awesome people all around, great organization and, since it's still a community conference, none of the marketing nonsense and sales-pitch keynotes.

I also gave a presentation about botching up ioctls. Compared to my blog post it has a bunch more details on technical issues and some overall comments on what's really important to avoid a v2 ioctl because v1 ended up being unsalvageable. Slides and video (courtesy of the LCA video team).

January 19, 2015 05:55 AM

January 15, 2015

Pete Zaitcev: UAT on RTL-SDR update

About a year ago, when I started playing with ADS-B over 1090ES, I noticed that small airplanes heavily favour UAT on 978 MHz, because it's cheaper. Thus, for the purposes of independent on-board traffic, it would be important to tap directly into UAT. If I mingle with airliners and their 1090ES, it's in controlled airspace anyway (yes, I know that most collisions happen in controlled airspace, but I'm grasping at excuses to play with UAT here, okay).

UAT poses a large challenge for RTL-SDR because of its relatively high data rate: 1.041667 Mbit/s. The RTL 2832U can only sample at 3.2 MS/s, and is only stable at 2.8 MS/s. Theoretically that should be enough, but everything I saw out there only works with 8 samples per bit, for weak signals. So, for initial experiments, I thought to try a trick: self-clocking. I set the sample rate to 2083334, and then do no clock recovery whatsoever. It hits where it hits, and if the sample points land well onto the bits, the packet is recovered; otherwise it's lost.
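
A minimal sketch of what that self-clocking trick amounts to, assuming a demodulator that already produces one hard decision per sample (the function and variable names are made up; this is not the actual code in the repo):

#include <stdint.h>
#include <stddef.h>

/* At 2083334 S/s there are exactly 2 samples per 1.041667 Mbit/s UAT bit,
 * so take every second demodulated sample as the bit decision and do no
 * clock recovery at all.  demod[] holds per-sample FSK decisions
 * (>0 means mark, <=0 means space). */
static size_t slice_bits(const int8_t *demod, size_t nsamples, uint8_t *bits)
{
        size_t i, n = 0;

        for (i = 0; i + 1 < nsamples; i += 2)
                bits[n++] = demod[i] > 0;

        return n;
}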

I threw together some code, ran it, and it didn't work. Only received white noise. Oh, well. I moved the repo into a dusty corner of Github and forgot about it.

Fast forward a year, and a gentleman by the name of Oliver Jowett noticed a big problem: the phase was computed incorrectly. I fixed that up and suddenly saw some bits recovered (as it happened, zeroes and ones were swapped, which Oliver had to correct again).

After that, things started to move forward. Having bits recovered allowed me to measure reception, and I found that the antenna I built for the 978 MHz band was much worse than the stock antenna for TV. Imagine my surprise and disappointment: all that soldering for nothing. I don't know where I screwed up, but some suggest that the computer and dongle produce RF noise that screws with the antenna, and that a length of coax helps despite the losses it incurs (/u/christ0ph liked that in particular).

This is bad.

This is good. Or better at least.

From here on, it's error recovery. Unfortunately, I have no clue what this means:

The FEC parity generation shall be based on a systematic RS 256-ary code with 8-bit code word symbols. FEC parity generation for each of the six blocks shall be a RS (92,72) code.

Quoted from Annex 10, Volume III, 12.4.4.2.1.
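
In standard Reed-Solomon terms that presumably means: blocks of 92 eight-bit symbols over GF(256), of which 72 are data and 20 are parity, so up to 10 corrupted symbols per block can be corrected. Here is a rough sketch of what the decode step could look like with the third-party reedsolo module, assuming the library defaults can stand in for the spec's field and generator parameters (they probably can't; the real polynomial is a detail I haven't dug out of the Annex yet):

    from reedsolo import RSCodec, ReedSolomonError

    N_TOTAL, N_DATA = 92, 72            # RS(92,72): 92 symbols per block, 72 data
    N_PARITY = N_TOTAL - N_DATA         # 20 parity symbols
    T_CORRECTABLE = N_PARITY // 2       # up to 10 bad symbols per block

    # Placeholder parameters: the UAT spec fixes the field polynomial and
    # generator, which may well differ from the reedsolo defaults used here.
    rsc = RSCodec(nsym=N_PARITY, nsize=N_TOTAL)

    def correct_block(block):
        # block: 92 received 8-bit symbols; returns the 72 data symbols,
        # or None when the block has more errors than the code can fix.
        try:
            out = rsc.decode(bytearray(block))
            # recent reedsolo versions return a tuple; older ones a bytearray
            return out[0] if isinstance(out, tuple) else out
        except ReedSolomonError:
            return None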

P.S. Observing Oliver's key involvement, the cynical may conclude that the whole premise of Open Source is that you write drek that does not work, upload it to Github, and someone fixes it up for you, free of charge. ESR wrote a whole book about it ("given enough eyeballs, all bugs are shallow")!

January 15, 2015 07:52 PM

January 12, 2015

Pavel Machek: Why avoid ACPI on ARM (and everywhere else, too)

Grant Likely published an article about ACPI and ARM at http://www.secretlab.ca/archives/151 . He acknowledges that systems with ACPI are harder to debug, but argues that because Microsoft says so, we have to use ACPI (basically).

I believe that making the wrong technical choice "because Microsoft says so" is a wrong thing to do.

Yes, ACPI gives more flexibility to hardware vendors. Imagine replacing block device drivers with interpreted bytecode coming from ROM. That is obviously bad, right? Why is it good for power management?

It is not.

Besides being harder to debug, there are more disadvantages:
* Size, speed and complexity disadvantage of a bytecode interpreter in the kernel.
* Many more drivers. Imagine a GPIO switch controlling rfkill (for example). In the device tree case, that's a few lines in the .dts specifying which GPIO the switch is on (see the sketch after this list).
In the ACPI case, each hardware vendor initially implements the rfkill switch in AML, each differently. After a few years, each vendor implements a (different) kernel<->AML interface for querying the rfkill state and toggling it in software. A few years after that, we implement kernel drivers for those AML interfaces, to properly integrate them into the kernel.
* Incompatibility. ARM servers will now be very different from other ARM systems.
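
For illustration, the device tree description of such a switch could be on the order of this (a hypothetical fragment; the exact compatible string and property names depend on whatever binding the driver defines):

    /* hypothetical .dts fragment: the entire description of the rfkill switch */
    wifi-rfkill {
            compatible = "rfkill-gpio";     /* illustrative binding name */
            label = "wifi";
            shutdown-gpios = <&gpio2 25 GPIO_ACTIVE_HIGH>;
    };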


Now, are there some arguments for ACPI? Yes -- it allows hw vendors to hack up half-working drivers without touching kernel sources. (Half-working: such drivers are not properly integrated with all the various subsystems). Grant claims that power management is somehow special, and that the requirement for real drivers is somehow ok for normal drivers (block, video), but not for power management. Now, getting a driver merged into the kernel does not take that long -- less than half a year if you know what you are doing. Plus, for power management, you can really just initialize the hardware in the bootloader (into a working but not optimal state). But basic drivers are likely to be merged fast, and then you'll just have to supply DT tables.
Avoid ACPI. It only makes things more complex and harder to debug.

January 12, 2015 05:00 PM

January 10, 2015

Michael Kerrisk (manpages): man-pages-3.77 is released

I've released man-pages-3.77. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

Aside from many smaller fixes and improvements to various pages, the most notable changes in man-pages-3.77 are the following:

January 10, 2015 04:17 PM

Grant Likely: Why ACPI on ARM?

Why are we doing ACPI on ARM? That question has been asked many times, but we haven’t yet had a good summary of the most important reasons for wanting ACPI on ARM. This article is an attempt to state the rationale clearly.

During an email conversation late last year, Catalin Marinas asked for a summary of exactly why we want ACPI on ARM, and Dong Wei replied with the following list:
> 1. Support multiple OSes, including Linux and Windows
> 2. Support device configurations
> 3. Support dynamic device configurations (hot add/removal)
> 4. Support hardware abstraction through control methods
> 5. Support power management
> 6. Support thermal management
> 7. Support RAS interfaces

The above list is certainly true in that all of them need to be supported. However, that list doesn’t give the rationale for choosing ACPI. We already have DT mechanisms for doing most of the above, and can certainly create new bindings for anything that is missing. So, if it isn’t an issue of functionality, then how does ACPI differ from DT and why is ACPI a better fit for general purpose ARM servers?

The difference is in the support model. To explain what I mean, I’m first going to expand on each of the items above and discuss the similarities and differences between ACPI and DT. Then, with that as the groundwork, I’ll discuss how ACPI is a better fit for the general purpose hardware support model.

Device Configurations

2. Support device configurations
3. Support dynamic device configurations (hot add/removal)

From day one, DT was about device configurations. There isn’t any significant difference between ACPI & DT here. In fact, the majority of ACPI tables are completely analogous to DT descriptions. With the exception of the DSDT and SSDT tables, most ACPI tables are merely flat data used to describe hardware.

DT platforms have also supported dynamic configuration and hotplug for years. There isn’t a lot here that differentiates ACPI from DT. The biggest difference is that dynamic changes to the ACPI namespace can be triggered by ACPI methods, whereas for DT, changes are received as messages from firmware and have been very much platform specific (e.g. IBM pSeries does this).

Power Management Model

4. Support hardware abstraction through control methods
5. Support power management
6. Support thermal management

Power, thermal, and clock management can all be dealt with as a group. ACPI defines a power management model (OSPM) that both the platform and the OS conform to. The OS implements the OSPM state machine, but the platform can provide state change behaviour in the form of bytecode methods. Methods can access hardware directly or hand off PM operations to a coprocessor. The OS really doesn’t have to care about the details as long as the platform obeys the rules of the OSPM model.

With DT, the kernel has device drivers for each and every component in the platform, and configures them using DT data. DT itself doesn’t have a PM model. Rather the PM model is an implementation detail of the kernel. Device drivers use DT data to decide how to handle PM state changes. We have clock, pinctrl, and regulator frameworks in the kernel for working out runtime PM. However, this only works when all the drivers and support code have been merged into the kernel. When the kernel’s PM model doesn’t work for new hardware, then we change the model. This works very well for mobile/embedded because the vendor controls the kernel. We can change things when we need to, but we also struggle with getting board support mainlined.

This difference has a big impact when it comes to OS support. Engineers from hardware vendors, Microsoft, and most vocally Red Hat have all told me bluntly that rebuilding the kernel doesn’t work for enterprise OS support. Their model is based around a fixed OS release that ideally boots out-of-the-box. It may still need additional device drivers for specific peripherals/features, but from a system view, the OS works. When additional drivers are provided separately, those drivers fit within the existing OSPM model for power management. This is where ACPI has a technical advantage over DT. The ACPI OSPM model and its bytecode give the HW vendors a level of abstraction under their control, not the kernel’s. When the hardware behaves differently from what the OS expects, the vendor is able to change the behaviour without changing the HW or patching the OS.

At this point you’d be right to point out that it is harder to get the whole system working correctly when behaviour is split between the kernel and the platform. The OS must trust that the platform doesn’t violate the OSPM model. All manner of bad things happen if it does. That is exactly why the DT model doesn’t encode behaviour: It is easier to make changes and fix bugs when everything is within the same code base. We don’t need a platform/kernel split when we can modify the kernel.

However, the enterprise folks don’t have that luxury. The platform/kernel split isn’t a design choice. It is a characteristic of the market. Hardware and OS vendors each have their own product timetables, and they don’t line up. The timeline for getting patches into the kernel and flowing through into OS releases puts OS support far downstream from the actual release of hardware. Hardware vendors simply cannot wait for OS support to come online to be able to release their products. They need to be able to work with available releases, and make their hardware behave in the way the OS expects. The advantage of ACPI OSPM is that it defines behaviour and limits what the hardware is allowed to do without involving the kernel.

What remains is sorting out how we make sure everything works. How do we make sure there is enough cross platform testing to ensure new hardware doesn’t ship broken and that new OS releases don’t break on old hardware? Those are the reasons why a UEFI/ACPI firmware summit is being organized, it’s why the UEFI forum holds plugfests 3 times a year, and it is why we’re working on FWTS and LuvOS.

Reliability, Availability & Serviceability (RAS)

7. Support RAS interfaces

This isn’t a question of whether or not DT can support RAS. Of course it can. Rather it is a matter of RAS bindings already existing for ACPI, including a usage model. We’ve barely begun to explore this on DT. This item doesn’t make ACPI technically superior to DT, but it certainly makes it more mature.

Multiplatform support

1. Support multiple OSes, including Linux and Windows

I’m tackling this item last because I think it is the most contentious for those of us in the Linux world. I wanted to get the other issues out of the way before addressing it.

The separation between hardware vendors and OS vendors in the server market is new for ARM. For the first time ARM hardware and OS release cycles are completely decoupled from each other, and neither are expected to have specific knowledge of the other (ie. the hardware vendor doesn’t control the choice of OS). ARM and their partners want to create an ecosystem of independent OSes and hardware platforms that don’t explicitly require the former to be ported to the latter.

Now, one could argue that Linux is driving the potential market for ARM servers, and therefore Linux is the only thing that matters, but hardware vendors don’t see it that way. For hardware vendors it is in their best interest to support as wide a choice of OSes as possible in order to catch the widest potential customer base. Even if the majority choose Linux, some will choose BSD, some will choose Windows, and some will choose something else. Whether or not we think this is foolish is beside the point; it isn’t something we have influence over.

During early ARM server planning meetings between ARM, its partners and other industry representatives (myself included) we discussed this exact point. Before us were two options, DT and ACPI. As one of the Linux people in the room, I advised that ACPI’s closed governance model was a show stopper for Linux and that DT is the working interface. Microsoft on the other hand made it abundantly clear that ACPI was the only interface that they would support. For their part, the hardware vendors stated the platform abstraction behaviour of ACPI is a hard requirement for their support model and that they would not close the door on either Linux or Windows.

However, the one thing that all of us could agree on was that supporting multiple interfaces doesn’t help anyone: It would require twice as much effort on defining bindings (once for Linux-DT and once for Windows-ACPI) and it would require firmware to describe everything twice. Eventually we reached the compromise to use ACPI, but on the condition of opening the governance process to give Linux engineers equal influence over the specification. The fact that we now have a much better seat at the ACPI table, for both ARM and x86, is a direct result of these early ARM server negotiations. We are no longer second class citizens in the ACPI world and are actually driving much of the recent development.

I know that this line of thought is more about market forces rather than a hard technical argument between ACPI and DT, but it is an equally significant one. Agreeing on a single way of doing things is important. The ARM server ecosystem is better for the agreement to use the same interface for all operating systems. This is what is meant by standards compliant. The standard is a codification of the mutually agreed interface. It provides confidence that all vendors are using the same rules for interoperability.

Summary

To summarize, here is the short form rationale for ACPI on ARM:

At the beginning of this article I made the statement that the difference is in the support model. For servers, responsibility for hardware behaviour cannot be purely the domain of the kernel, but rather is split between the platform and the kernel. ACPI frees the OS from needing to understand all the minute details of the hardware, so that the OS doesn’t need to be ported to each and every device individually. It allows the hardware vendors to take responsibility for PM behaviour without depending on an OS release cycle which is not under their control.

ACPI is also important because hardware and OS vendors have already worked out how to use it to support the general purpose ecosystem. The infrastructure is in place, the bindings are in place, and the process is in place. DT does exactly what we need it to when working with vertically integrated devices, but we don’t have good processes for supporting what the server vendors need. We could potentially get there with DT, but doing so doesn’t buy us anything. ACPI already does what the hardware vendors need, Microsoft won’t collaborate with us on DT, and the hardware vendors would still need to provide two completely separate firmware interfaces: one for Linux and one for Windows.

January 10, 2015 02:00 PM

January 09, 2015

Pavel Machek: fight with 2.6.28

...was not easy. 2.6.28 will not compile with a semi-recent toolchain (eldk-5.4), and it needs fixes even with an older toolchain (eldk-5.2). To add difficulty, it also needs fixes to work with a new make (!). Anyway, now it boots, including nfsroot. That should make userspace development possible. I still hope to get voice calls working, but that is not easy on 3.18/3.19 due to all the audio problems.

January 09, 2015 10:32 PM

January 08, 2015

Dave Jones: the new job reveal.

I let the cat out of the bag earlier this afternoon on Twitter. Next Monday is day one of my new job at Akamai. The scope of my new role is pretty wide, ranging from the usual kernel debugging type work, to helping stabilize production releases, proactively finding new bugs, misc QA work, development of some new tooling, and a whole bunch of other stuff I can’t talk about just yet. (And probably a whole slew of things I don’t even know about yet.)

It seemed almost serendipitous that I ended up here. Earlier this year, I read Fatal System Error, a book detailing how Prolexic got founded. It’s a fascinating story, beginning with DDoSes of offshore gambling sites and ending with Russian organised crime syndicates. I don’t know how much of it got embellished for the book, but it’s a good read all the same. Also, Amazon usually has used copies for 1 cent. Anyway, a month after I read that book, Akamai acquired Prolexic.

Shortly afterwards, I found myself reading another Akamai-related book, No Better Time, a biography of Danny Lewin, Akamai’s late co-founder. After reading it, I decided it was at least going to be worth interviewing there.

The job search led me to a few possibilities, and the final decision to go with Akamai wasn’t an easy one to make. The combination of interesting work, an “easy to commute to” office (I’ll be at the Kendall Square office in Cambridge, MA) and a small team that seemed easy to get along with (famous last words) is what decided it. (That, and all the dubstep.)

It’s going to be an interesting challenge ahead to switch from the mindset of “bug that affects a single computer, or a handful of users” to “something that could bring down a 200,000 node cluster”, but I think I’m up for it. One thing I am definitely looking forward to is only caring about contemporary hardware, and a limited number of platforms.

I apologize in advance for any unexpected internet outages. I swear it was like that when I found it.

the new job reveal. is a post from: codemonkey.org.uk

January 08, 2015 09:59 PM

Pete Zaitcev: Swift and balance

Swift is on the cusp of getting yet another intricate mechanism that regulates how partitions are placed: so-called "overload". But new users of Swift keep asking what weights even are, and now this? I am not entirely sure it's necessary, but here's a simple explanation of why we keep ending up with attempts at this kind of complexity (thanks to John Dickinson on IRC).

Suppose you have a system that spreads your replicas (of partitions) across (failure) zones. This works great as long as your zones are about the same size, usually a rack. But then one day you buy a new rack with 8TB drives, and suddenly the new zone is several times larger than the others. If you do not adjust anything, it ends up only a quarter full at best.

So, fine, we add "weights". Now a zone that has weight 100.0 gets 2 times more replicas than a zone with weight 50.0. This lets you fill zones better, but it must compromise your dispersion and thus durability. Suppose you only have 4 racks: three with 2TB drives and one with 8TB drives. Not an unreasonable size for a small cloud. So, you set weights to 25, 25, 25, 100. With a replication factor of 3, there's still a good probability (which I'm unable to calculate, although I feel it ought to be easy for someone better educated) that the bigger node will end up with 2 replicas for some partitions. Once that node goes down, you lose redundancy completely for those partitions.
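
To put a rough number on it, here is a back-of-the-envelope simulation under a deliberately naive model: each replica is placed independently at random, with probability proportional to zone weight, ignoring the ring builder's attempts to keep replicas in distinct zones. Treat the result as an upper bound on the badness rather than what Swift actually does.

    import random

    def simulate(weights, replicas=3, partitions=100000, seed=42):
        # Naive model: every replica of every partition independently picks
        # a zone with probability proportional to that zone's weight.  Count
        # the partitions that put 2 or more replicas into the biggest zone.
        rng = random.Random(seed)
        total = float(sum(weights))
        big = weights.index(max(weights))
        doubled = 0
        for _ in range(partitions):
            hits = 0
            for _ in range(replicas):
                x, acc = rng.uniform(0, total), 0.0
                for zone, w in enumerate(weights):
                    acc += w
                    if x <= acc:
                        if zone == big:
                            hits += 1
                        break
            if hits >= 2:
                doubled += 1
        return doubled / float(partitions)

    # three racks of 2TB drives and one of 8TB drives, replication factor 3
    print(simulate([25, 25, 25, 100]))    # comes out around 0.6 in this model

The real ring builder works much harder to keep replicas in distinct zones, so the true number is lower, but the tension is the same: with weights that lopsided, balance and dispersion start pulling against each other.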

In the small-cloud example above, if you care about your customers' data, you have to eat the imbalance and underutilization until you retire the 2TB drives [1].

<clayg> torgomatic: well if you have 6 failure domains in a tier but their sized 10000 10 10 10 10 10 - you're still sorta screwed

My suggestion would be to just ignore all the complexity we thoughtfully provided for the people with "screwed" clusters. Deploy and maintain your cluster in a way that makes placement and replication easy: have a good number of more or less uniform zones that are well aligned to natural failure domains. Everything else is a workaround -- even weights.

P.S. Kinda wondering how Ceph deals with these issues. It is more automagic when it decides what to store where, but surely there ought to be a good and bad way to add OSDs.

[1] Strictly speaking, other options exist. You can delegate to another tier by tying 2 small racks into a zone: yet another layer of Swift's complexity. Or, you could put the new 8TB drives on trays and stuff them into existing nodes. But considering those options muddies the waters.

UPDATE: See the changelog for better placement in Swift 2.2.2.

January 08, 2015 06:17 PM

Dave Jones: Some closure on a particularly nasty bug.

For the final three months of my tenure at Red Hat, I was chasing what was possibly the most frustrating bug I’d encountered since I started work there. I had been fixing up various bugs in Trinity over the tail end of summer, which meant that on a good kernel it would run and run for quite some time. I still hadn’t figured out exactly what the cause of a self-corruption was, but narrowed it down enough that I could come up with a workaround. With this workaround in place, every so often the kernel’s NMI watchdog would decide that a process had wedged, and eventually the box would grind to a halt. Just to make it doubly annoying, the hang would happen at seemingly random intervals. Sometimes I could trigger it within an hour, sometimes it would take 24 hours.

I spent a lot of time trying to narrow down exactly the circumstances that would trigger this, without much luck. A lot of time was wasted trying to bisect where this was introduced, based upon bad assumptions that earlier kernels were ‘good’. During all this, a lot of theories were being thrown around, and people started looking at multiple areas of the kernel. The watchdog code, the scheduler, FPU context saving, the page fault handling, and time management. Along the way several suspect areas were highlighted, some things got fixed/cleaned up, but ultimately, they didn’t solve the problem. (A number of other suspect areas of code were highlighted that don’t have commits yet).

Then, right down to the final week before I gave all my hardware back to Red Hat, Linus managed to reproduce similar symptoms, by scribbling directly to the HPET. He came up with a hack that at least made the kernel survive for him. When I tried the same patch, the machine ran for three days before I interrupted it. The longest it had ever run.

The question remains, what was scribbling over the HPET in my case? The /dev/hpet node doesn’t allow writing, even as root. You can mmap /dev/mem if you know the address of the HPET, and directly write to it, but..
1. That would be a root-only possibility, and..
2. Trinity blacklists /dev/mem, and never touches it.

The only two plausible scenarios I could think of were

So that’s where the story (mostly) ends. When I left Red Hat, I gave that (possibly flawed) machine back. Linus’ hacky workaround didn’t get committed, but he and John Stultz continue to go back and forth on hardening the clock management code in the face of screwed-up hardware, so maybe soon we’ll see something real get committed there.

It was an interesting (though downright annoying) bug that took a lot longer to get any kind of closure on than expected. Some things I learned from this experience:

Some closure on a particularly nasty bug. is a post from: codemonkey.org.uk

January 08, 2015 05:09 PM

January 07, 2015

Dave Jones: continuity of various projects

I’ve had a bunch of people emailing me asking how my new job will affect various things I’ve worked on over the last few years. For the most part, not massively, but there are some changes ahead.

So that’s about it. For at least the first few months of 2015, I expect to be absorbed in getting acclimated to my new job, so I won’t be as visible as usual, but I’m not going to disappear from the Linux community forever; at every place I interviewed, I made clear that wasn’t something I wanted to happen.

continuity of various projects is a post from: codemonkey.org.uk

January 07, 2015 03:08 PM

January 05, 2015

Pete Zaitcev: Going beyond ZFS by accident

Yesterday, CKS wrote an article that tries to redress the balance a little in the coverage of ZFS. Apparently, detractors of ZFS were pulling quotes from his operational gripes, so he came out forcefully with the observation that ZFS remains the only viable advanced filesystem (on Linux or not). CKS has no time for your btrfs bullshit.

The situation where weselovskys of the world hold the only viable advanced filesystem hostage and call everyone else "jackass" is very sad for Linux, but it may not be quite so dire, because it's possible that events are overtaking btrfs and ZFS. I am talking about the march of super-advanced, distributed filesystems downstream.

It all started with the move beyond POSIX, which, admittedly, seemed very silly at the time. The early DHT stuff was laughable, and I remember how they struggled for years to even enable writes. However, useful software has been developed since then.

The poster child of this is Sage's Ceph, which relies on plain old XFS for back-end storage, composes object storage out of nodes (called RADOS), and puts a POSIX layer on top for those who want it. It is in field use at Dreamhost. I can easily see someone using it where otherwise a ZFS-backed NFS/CIFS cluster would be deployed.

Another piece of software that I place in the same category is OpenStack Swift. I know, Swift is not competing with ZFS directly. The consistency of its meta layer is not sufficient to emulate POSIX anyway. However, you get all those built-in checksums and all that durability jazz that CKS wants. And Swift aims even further up in scale than Ceph, being in field use at Rackspace. So, what seems to be happening is that folks who really need to go large are willing at times to forsake even compatibility with POSIX, in part to get the benefits that ZFS provides to CKS. Mercado Libre is one well-hyped case of migration from a pile of NFS filers to a Swift cluster.

Now that these systems are established and have proven themselves, I see constant efforts to take them downscale. The original Swift 1.0 did not even work right if you had fewer than 3 nodes (strictly speaking, if you had fewer zones than the replication factor). This has since been fixed by so-called "as good as possible placement" around 1.13 Havana, so you can have a 1-node Swift easily. Ceph, similarly, would not consider PGs on the same node healthy, and it's a bit of a PITA even in Firefly. So yeah, there are issues, but we're working on them. And in the process, we're definitely coming for chunks of ZFS space.

January 05, 2015 06:25 PM

Pete Zaitcev: blitz2 for GNOME catastrophy

After putting blitz2 on my Nexus and poking at GUI buttons, I reckoned that it might be time to stop typing in a terminal like a caveman on Linux too (I'm joking, but only just). And the project rolled along smoothly for a while. What took me 2 months to accomplish on Android only took 2 days in GNOME. However, 2 lines of code from the end, it all came to an abrupt halt when I found out that it is impossible to access the clipboard from a JavaScript application. See GNOME bugs 579312, 712752.

January 05, 2015 04:27 PM