Kernel Planet

July 16, 2018

Pete Zaitcev: Finally a use for code 451

Saw today at a respectable news site, which does not even nag about adblock:

451

We recognise you are attempting to access this website from a country belonging to the European Economic Area (EEA) including the EU which enforces the General Data Protection Regulation (GDPR) and therefore cannot grant you access at this time. For any issues, e-mail us at xxxxxxxx@xxxxxx.com or call us at xxx-xxx-4000.

What a way to brighten one's day. The phone without a country code is a cherry on top.

P.S. The only fly in this ointment is, I wasn't accessing it from the GDPR area. It was a geolocation failure.

July 16, 2018 12:46 AM

July 15, 2018

James Bottomley: Measuring the Horizontal Attack Profile of Nabla Containers

One of the biggest problems with the current debate about Container vs Hypervisor security is that no-one has actually developed a way of measuring security, so the debate is all in qualitative terms (hypervisors “feel” more secure than containers because of the interface breadth) but no-one actually has done a quantitative comparison.  The purpose of this blog post is to move the debate forwards by suggesting a quantitative methodology for measuring the Horizontal Attack Profile (HAP).  For more details about Attack Profiles, see this blog post.  I don’t expect this will be the final word in the debate, but by describing how we did it I hope others can develop quantitative measurements as well.

We’ll begin by looking at the Nabla technology through the relatively uncontroversial metric of performance.  In most security debates, it’s acceptable that some performance is lost by securing the application.  As a rule of thumb, placing an application in a hypervisor loses anywhere between 10-30% of the native performance.  Our goal here is to show that, for a variety of web tasks, the Nabla containers mechanism has an acceptable performance penalty.

Performance Measurements

We took some standard benchmarks: redis-bench-set, redis-bench-get, python-tornado and node-express and in the latter two we loaded up the web servers with simple external transactional clients.  We then performed the same test for docker, gVisor, Kata Containers (as our benchmark for hypervisor containment) and nabla.  In all the figures, higher is better (meaning more throughput):

The red Docker measure is included to show the benchmark.  As expected, the Kata Containers measure is around 10-30% down on the docker one in each case because of the hypervisor penalty.  However, in each case the Nabla performance is the same or higher than the Kata one, showing we pay less performance overhead for our security.  A final note is that since the benchmarks are network ones, there’s somewhat of a penalty paid by userspace networking stacks (which nabla necessarily has) for plugging into docker network, so we show two values, one for the bridging plug in (nabla-containers) required to orchestrate nabla with kubernetes and one as a direct connection (nabla-raw) showing where the performance would be without the network penalty.

One final note is that, as expected, gVisor sucks because ptrace is a really inefficient way of connecting the syscalls to the sandbox.  However, it is more surprising that gVisor-kvm (where the sandbox connects to the system calls of the container using hypercalls instead) is also pretty lacking in performance.  I speculate this is likely because hypercalls exact their own penalty and hypervisors usually try to minimise them, which using them to replace system calls really doesn’t do.

HAP Measurement Methodology

The Quantitative approach to measuring the Horizontal Attack Profile (HAP) says that we take the bug density of the Linux Kernel code  and multiply it by the amount of unique code traversed by the running system after it has reached a steady state (meaning that it doesn’t appear to be traversing any new kernel paths). For the sake of this method, we assume the bug density to be uniform and thus the HAP is approximated by the amount of code traversed in the steady state.  Measuring this for a running system is another matter entirely, but, fortunately, the kernel has a mechanism called ftrace which can be used to provide a trace of all of the functions called by a given userspace process and thus gives a reasonable approximation of the number of lines of code traversed (note this is an approximation because we measure the total number of lines in the function taking no account of internal code flow, primarily because ftrace doesn’t give that much detail).  Additionally, this methodology works very well for containers where all of the control flow emanates from a well known group of processes via the system call information, but it works less well for hypervisors where, in addition to the direct hypercall interface, you also have to add traces from the back end daemons (like the kvm vhost kernel threads or dom0 in the case of Xen).
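The arithmetic behind this approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used for the measurements: the trace line layout matched by the regular expression, the helper names, and the per-function line counts are all made up for the example.

```python
import re

# Assumed shape of a function-tracer line:
#   "<task>-<pid>  [cpu]  <timestamp>: <callee> <-<caller>"
TRACE_LINE = re.compile(r":\s+([A-Za-z_][\w.]*)\s+<-")


def traced_functions(trace_text):
    """Return the set of unique callee names seen in the trace."""
    functions = set()
    for line in trace_text.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            functions.add(match.group(1))
    return functions


def hap_estimate(functions, lines_per_function, bug_density=1.0):
    """Bug density times the total lines of every unique function traversed.

    Whole functions are counted, with no account of internal control flow.
    Functions with no known size contribute nothing.
    """
    lines = sum(lines_per_function.get(name, 0) for name in functions)
    return bug_density * lines


# Hypothetical trace excerpt and made-up function sizes, for illustration only.
SAMPLE_TRACE = """\
 cat-1234  [000]   100.000001: ksys_read <-do_syscall_64
 cat-1234  [000]   100.000002: vfs_read <-ksys_read
 cat-1234  [000]   100.000003: vfs_read <-ksys_read
"""
SAMPLE_SIZES = {"ksys_read": 25, "vfs_read": 40}
```

Because the bug density is assumed uniform, it only scales the result, so comparing two systems comes down to comparing the summed line counts.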

HAP Results

The results are for the same set of tests as the performance ones except that this time we measure the amount of code traversed in the host kernel:

As stated in our methodology, the height of the bar should be directly proportional to the HAP where lower is obviously better.  On these results we can say that in all cases the Nabla runtime tender actually has a better HAP than the hypervisor contained Kata technology, meaning that we’ve achieved a container system with better HAP (i.e. more secure) than hypervisors.

Some of the other results in this set also bear discussing.  For instance the Docker result certainly isn’t 10x the Kata result as a naive analysis would suggest.  In fact, the containment provided by docker looks to be only marginally worse than that provided by the hypervisor.  Given all the hoopla about hypervisors being much more secure than containers this result looks surprising but you have to consider what’s going on: what we’re measuring in the docker case is the system call penetration of normal execution of the systems.  Clearly anything malicious could explode this result by exercising all sorts of system calls that the application doesn’t normally use.  However, this does show clearly that a docker container with a well crafted seccomp profile (which blocks unexpected system calls) provides roughly equivalent security to a hypervisor.

The other surprising result is that, in spite of their claims to reduce the exposure to Linux System Calls, gVisor actually is either equivalent to the docker use case or, for the python tornado test, significantly worse than the docker case.  This too is explicable in terms of what’s going on under the covers: gVisor tries to improve containment by rewriting the Linux system call interface in Go.  However, no-one has paid any attention to the number of system calls the Go runtime is actually using, which is what these results are really showing.  Thus, while gVisor doesn’t currently achieve any containment improvement on this methodology, it’s not impossible to write a future version of the Go runtime that is much less profligate in the way it uses system calls by developing a Secure Go using the same methodology we used to develop Nabla.

Conclusions

On both tests, Nabla is far and away the best containment technology for secure workloads given that it sacrifices the least performance over docker to achieve the containment and, on the published results, is 2x more secure even than using hypervisor based containment.

Hopefully these results show that it is perfectly possible to have containers that are more secure than hypervisors and lay to rest, finally, the arguments about which is the more secure technology.  The next step, of course, is establishing the full extent of exposure to a malicious application and to do that, some type of fuzz testing needs to be employed.  Unfortunately, right at the moment, gVisor is simply crashing when subjected to fuzz testing, so it needs to become more robust before realistic measurements can be taken.

July 15, 2018 05:54 AM

James Bottomley: A New Method of Containment: IBM Nabla Containers

In the previous post about Containers and Cloud Security, I noted that most of the tenants of a Cloud Service Provider (CSP) could safely not worry about the Horizontal Attack Profile (HAP) and leave the CSP to manage the risk.  However, there is a small category of jobs (mostly in the financial and allied industries) where the damage done by a Horizontal Breach of the container cannot be adequately compensated by contractual remedies.  For these cases, a team at IBM research has been looking at ways of reducing the HAP with a view to making containers more secure than hypervisors.  For the impatient, the full open source release of the Nabla Containers technology is here and here, but for the more patient, let me explain what we did and why.  We’ll have a follow on post about the measurement methodology for the HAP and how we proved better containment than even hypervisor solutions.

The essence of the quest is a sandbox that emulates the interface between the runtime and the kernel (usually dubbed the syscall interface) with as little code as possible and a very narrow interface into the kernel itself.

The Basics: Looking for Better Containment

The HAP attack worry with standard containers is shown on the left: that a malicious application can breach the containment wall and attack an innocent application.  This attack is thought to be facilitated by the breadth of the syscall interface in standard containers so the guiding star in developing Nabla Containers was a methodology for measuring the reduction in the HAP (and hence the improvement in containment), but the initial impetus came from the observation that unikernel systems are nicely modular in the libOS approach, can be used to emulate system calls and, thanks to rumprun, have a wide set of support for modern web friendly languages (like python, node.js and go) with a fairly thin glue layer.  Additionally they have a fairly narrow set of hypercalls that are actually used in practice (meaning they can be made more secure than conventional hypervisors).  Code coverage measurements of standard unikernel based kvm images confirmed that they did indeed use a far narrower interface.

Replacing the Hypervisor Interface

One of the main elements of the hypervisor interface is the transition from a less privileged guest kernel to a more privileged host one via hypercalls and vmexits.  These CPU mediated events are actually quite expensive, certainly a lot more expensive than a simple system call, which merely involves changing address space and privilege level.  It turns out that the unikernel based kvm interface is really only nine hypercalls, all of which are capable of being rewritten as syscalls, so the approach to running this new sandbox as a container is to do this rewrite and seccomp restrict the interface to being only what the rewritten unikernel runtime actually needs (meaning that the seccomp profile is now CSP enforced).  This vision, by the way, of a broad runtime above being mediated to a narrow interface is where the name Nabla comes from: The symbol for Nabla is an inverted triangle (∇) which is broad at the top and narrows to a point at the base.

Using this formulation means that the nabla runtime (or nabla tender) can be run as a single process within a standard container and the narrowness of the interface to the host kernel prevents most of the attacks that a malicious application would be able to perform.

DevOps and the ParaVirt conundrum

Back at the dawn of virtualization, there were arguments between Xen and VMware over whether a hypervisor should be fully virtual (capable of running any system supported by the virtual hardware description) or paravirtual (the system had to be modified to run on the virtualization system and thus would be incapable of running on physical hardware).  Today, thanks in a large part to CPU support for virtualization primitives, fully paravirtual systems have long since gone the way of the dodo and everyone nowadays expects any OS running on a hypervisor to be capable of running on physical hardware.  The death of paravirt also left the industry with an aversion to ever reviving it, which explains why most sandbox containment systems (gVisor, Kata) try to require no modifications to the image.

With DevOps, the requirement is that images be immutable and that to change an image you must take it through the full develop, build, test, deploy cycle.  This development centric view means that, provided there’s no impact to the images you use as the basis for your development, you can easily craft your final image to suit the deployment environment, which means a step like linking with the nabla tender is very easy.  Essentially, this comes down to whether you take the Dev (we can rebuild to suit the environment) or the Ops (the deployment environment needs to accept arbitrary images) view.  However, most solutions take the Ops view because of the anti-paravirt bias.  For the Nabla tender, we take the Dev view, which is borne out by the performance figures.

Conclusion

Like most sandbox models, the Nabla containers approach is an alternative to namespacing for containment, but it still requires cgroups for resource management.  The figures show that the containment HAP is actually better than that achieved with a hypervisor and the performance, while being marginally less than a namespaced container, is greater than that obtained by running a container inside a hypervisor.  Thus we conclude that for tenants who have a real need for HAP reduction, this is a viable technology.

July 15, 2018 05:54 AM

July 12, 2018

Pete Zaitcev: Guido van Rossum steps down

See a mailing list message:

I would like to remove myself entirely from the decision process. // I am not going to appoint a successor.

July 12, 2018 06:01 PM

June 29, 2018

Pete Zaitcev: The Proprietary Mind

Regarding the Huston missive, two quotes jumped out at me the most. The first is just beautiful:

It may be slightly more disconcerting to realise that your electronic wallet is on a device that is using a massive compilation of open source software of largely unknown origin [...]

Yeah, baby. This moldy canard is still operational.

The second is from the narrative of the smartphone revolution:

Apple’s iPhone, released in 2007, was a revolutionary device. [...] Apple’s early lead was rapidly emulated by Windows and Nokia with their own offerings. Google’s position was more as an active disruptor, using an open licensing framework for the Android platform [...]

Again, it's not like he's actually lying. He merely implies heavily that Nokia came next. I don't think the Nokia blunder even deserves a footnote, but to Huston, Google was too open. Google, Carl!

June 29, 2018 12:58 PM

June 26, 2018

James Morris: Linux Security Summit North America 2018: Schedule Published

The schedule for the Linux Security Summit North America (LSS-NA) 2018 is now published.

Highlights include:

and much more!

LSS-NA 2018 will be co-located with the Open Source Summit, and held over 27th-28th August, in Vancouver, Canada.  The attendance fee is $100 USD.  Register here.

See you there!

June 26, 2018 09:11 PM

June 25, 2018

Vegard Nossum: Compiler fuzzing, part 1

Much has been written about fuzzing compilers already, but there is not a lot that I could find about fuzzing compilers using more modern fuzzing techniques where coverage information is fed back into the fuzzer to find more bugs.

If you know me at all, you know I'll throw anything I can get my hands on at AFL. So I tried gcc. (And clang, and rustc -- but more about Rust in a later post.)

Levels of fuzzing


First let me summarise a post by John Regehr called Levels of Fuzzing, which my approach builds heavily on. Regehr presents a very important idea (which stems from earlier research/papers by others), namely that fuzzing can operate at different "levels". These levels correspond somewhat loosely to the different stages of compilation, i.e. lexing, parsing, type checking, code generation, and optimisation. In terms of fuzzing, the source code that you pass to the compiler has to "pass" one stage before it can enter the next; if you give the compiler a completely random binary file, it is unlikely to even get past the lexing stage, never mind to the point where the compiler is actually generating code. So it is in our interest (assuming we want to fuzz more than just the lexer) to generate test cases more intelligently than just using random binary data.

If we simply try to compile random data, we're not going to get very far.
 
In a "naïve" approach, we simply compile gcc with AFL instrumentation and run afl-fuzz on it as usual. If we give a reasonable corpus of existing C code, it is possible that the fuzzer will find something interesting by randomly mutating the test cases. But more likely than not, it is mostly going to end up with random garbage like what we see above, and never actually progress to more interesting stages of compilation. I did try this -- and the results were as expected. It takes a long time before the fuzzer hits anything interesting at all. Now, Sami Liedes did this with clang back in 2014 and obtained some impressive results ("34 distinct assertion failures in the first 11 hours"). So clearly it was possible to find bugs in this way. When I tried this myself for GCC, I did not find a single crash within a day or so of fuzzing. And looking at the queue of distinct testcases it had found, it was very clear that it was merely scratching the very outermost surface of the input handling in the compiler -- it was not able to produce a single program that would make it past the parsing stage.

AFL has a few built-in mutation strategies: bit flips, "byte flips", arithmetic on bytes, 2-bytes, and 4-bytes, insertion of common boundary values (like 0, 1, powers of 2, -1, etc.), insertions of and substitution by "dictionary strings" (basically user-provided lists of strings), along with random splicing of test cases. We can already sort of guess that most of these strategies will not be useful for C and C++ source code. Perhaps the "dictionary strings" is the most promising for source code as it allows you to insert keywords and snippets of code that have at least some chance of ending up as a valid program. For the other strategies, single bit flips can change variable names, but changing variable names is not that interesting unless you change one variable into another (which both have to exist, as otherwise you would hit a trivial "undeclared" error). They can also create expressions, but if you somehow managed to change a 'h' into a '(', source code with this mutation would always fail unless you also inserted a ')' somewhere else to balance the expression. Source code has a lot of these "correspondences" where changing one thing also requires changing another thing somewhere else in the program if you want it to still compile (even though you don't generate an equivalent program -- that's not what we're trying to do here). Variable uses match up with variable declarations. Parentheses, braces, and brackets must all match up (and in the right order too!).
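The 'h' to '(' case is a real single-bit mutation: the ASCII codes 0x68 and 0x28 differ only in the 0x40 bit. The short Python sketch below shows that flip on a tiny declaration and checks whether the result still has matching delimiters. The helper names `flip_bit` and `is_balanced` are made up for this illustration and are not part of AFL.

```python
def flip_bit(data, bit_index):
    """Flip one bit of a byte string, counting bits from the most significant."""
    out = bytearray(data)
    out[bit_index // 8] ^= 0x80 >> (bit_index % 8)
    return bytes(out)


def is_balanced(source):
    """Check that (), [] and {} all match up, and in the right order."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


original = b"int h;"
# 'h' is byte 4; bit 1 of that byte (MSB first) is the 0x40 bit.
mutated = flip_bit(original, 4 * 8 + 1)
```

Here `mutated` is `b"int (;"`, which `is_balanced` rejects. A second, coordinated change elsewhere in the file would be needed to get a program the parser accepts.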

These "correspondences" remind me a lot of CRCs and checksums in other file formats, and they give the fuzzer problems for the exact same reason: without extra code it's hard to overcome having to change the test case simultaneously in two or more places, never mind making the exact change that will preserve the relationship between these two values. It's a game of combinatorics; the more things we have to change at once and the more possibilities we have for those changes, the harder it will be to get that exact combination when you're working completely at random. For checksums the answer is easy, and there are two very good strategies: either you disable the checksum verification in the code you're fuzzing, or you write a small wrapper to "fix up" your test case so that the checksum always matches the data it protects (of course, after mutating an input you may not really know where in the file the checksum will be located anymore, but that's a different problem).
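The second strategy, a wrapper that fixes up the checksum after mutation, might look like the following Python sketch. The file format here (a payload followed by a 4-byte big-endian CRC32 trailer) is invented purely for the example, as are the function names.

```python
import struct
import zlib


def pack(payload):
    """Toy format: payload followed by a 4-byte big-endian CRC32 of it."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def verify(blob):
    """What the fuzzed parser would do: reject inputs with a bad checksum."""
    if len(blob) < 4:
        return False
    payload, trailer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack(">I", trailer)
    return stored == (zlib.crc32(payload) & 0xFFFFFFFF)


def fix_up(blob):
    """Wrapper step: recompute the trailer so it matches the mutated payload."""
    return pack(blob[:-4])
```

The fuzzer mutates freely, and `fix_up` runs on every test case before it reaches the target, so the checksum comparison never blocks progress. This only works because the toy format keeps the checksum in a fixed place, which is exactly the caveat noted above.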

For C and C++ source code it's not so obvious how to help the fuzzer overcome this. You can of course generate programs with a grammar (and some heuristics), which is what several C random code generators such as Csmith, ccg, and yarpgen do. This is in a sense on the completely opposite side of the spectrum when it comes to the levels of fuzzing. By generating programs that you know are completely valid (and correct, and free of undefined behaviour), you will breeze through the lexing, the parsing, and the type checking and target the code generation and optimization stages. This is what Regehr et al. did in "Taming compiler fuzzers", another very interesting read. (Their approach does not include instrumentation feedback, however, so it is more of a traditional black-box fuzzing approach than AFL, which is considered grey-box fuzzing.)

But if you use a C++ grammar to generate C++ programs, that will also exclude a lot of inputs that are not valid but nevertheless accepted by the compiler. This approach relies on our ability to express all programs that should be valid, but there may also be non-valid programs that crash the compiler. As an example, if our generator knows that you cannot add an integer to a function, or assign a value to a constant, then the code paths checking for those conditions in the compiler would never be exercised, despite the fact that those errors are more interesting than mere syntax errors. In other words, there is a whole range of "interesting" test cases which we will never be able to generate if we restrict ourselves only to those programs that are actually valid code.

Please note that I am not saying that one approach is better than the other! I believe we need all of them to successfully find bugs in all the areas of the compiler. By realising exactly what the limits of each method are, we can try to find other ways to fill the gaps.

Fuzzing with a loose grammar


So how can we fill the gap between the shallow syntax errors in the front end and the very deep end of the code generation in the back end? There are several things we can do.

The main feature of my solution is to use a "loose" grammar. As opposed to a "strict" grammar which would follow the C/C++ specs to the dot, the loose grammar only really has one type of symbol, and all the production rules in the grammar create this type of symbol. As a simple example, a traditional C grammar will not allow you to put a statement where an expression is expected, whereas the loose grammar has no restrictions on that. It does, however, take care that your parentheses and braces match up. My grammar file therefore looks something like this (also see the full grammar if you're curious!):
"[void] [f] []([]) { [] }"
"[]; []"
"{ [] }"
"[0] + [0]"
...
Here, anything between "[" and "]" (call it a placeholder) can be substituted by any other line from the grammar file. An evolution of a program could therefore plausibly look like this:
void f () { }           // using the "[void] [f] []([]) { [] }" rule
void f () { ; } // using the "[]; []" rule
void f () { 0 + 0; } // using the "[0] + [0]" rule
void f ({ }) { 0 + 0; } // using the "{ [] }" rule
...
Wait, what happened at the end there? That's not valid C. No -- but it could still be an interesting thing to try to pass to the compiler. We did have a placeholder where the arguments usually go, and according to the grammar we can put any of the other rules in there. This does quickly generate a lot of nonsensical programs that stop the compiler completely dead in its tracks at the parsing stage. We do have another trick to help things along, though...
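The placeholder substitution described above can be sketched in Python as follows. This is a toy reimplementation for illustration, not the prog-fuzz code: it uses only the four rules quoted above, and the function names and the way a placeholder is picked at random are assumptions.

```python
import random
import re

# The four rules quoted from the grammar file above.
RULES = [
    "[void] [f] []([]) { [] }",
    "[]; []",
    "{ [] }",
    "[0] + [0]",
]

# A placeholder is "[" + optional default text + "]", never nested.
PLACEHOLDER = re.compile(r"\[([^\[\]]*)\]")


def expand(program, rng):
    """Replace one randomly chosen placeholder with a randomly chosen rule."""
    slots = list(PLACEHOLDER.finditer(program))
    if not slots:
        return program
    slot = rng.choice(slots)
    return program[:slot.start()] + rng.choice(RULES) + program[slot.end():]


def render(program):
    """Turn the template into source text: each placeholder becomes its default."""
    return PLACEHOLDER.sub(lambda match: match.group(1), program)


def evolve(steps, rng):
    """Start from a single empty placeholder and apply `steps` expansions."""
    program = "[]"
    for _ in range(steps):
        program = expand(program, rng)
    return program
```

For example, `render(evolve(5, random.Random(0)))` yields some short program-like text. Since every rule is itself balanced, the parentheses and braces in the output always match up, even when the result is not valid C.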

AFL doesn't care at all whether what we pass it is accepted by the compiler or not; it doesn't distinguish between success and failure, only between graceful termination and crashes. However, all we have to do is teach the fuzzer about the difference between exit codes 0 and 1; a 0 means the program passed all of gcc's checks and actually resulted in an object file. Then we can discard all the test cases that result in an error, and keep a corpus of test cases which compile successfully. It's really a no-brainer, but makes such a big difference in what the fuzzer can generate/find.
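A minimal sketch of that filtering step in Python is shown below. The gcc command line is an assumed invocation (compile C from standard input and throw the object file away), and the function names are hypothetical. Neither is taken from prog-fuzz.

```python
import subprocess

# Assumed invocation: read C from stdin, compile only, discard the object file.
GCC_CMD = ["gcc", "-x", "c", "-c", "-o", "/dev/null", "-"]


def compiler_returncode(source, cmd=GCC_CMD, timeout=10):
    """Feed one test case (bytes) to the compiler and return its exit status."""
    proc = subprocess.run(
        cmd,
        input=source,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def classify(returncode):
    """0 means an object file was produced, 1 means an ordinary compile error."""
    if returncode == 0:
        return "accepted"
    if returncode == 1:
        return "rejected"
    # Anything else, including the negative values subprocess reports for
    # signals, is neither a clean success nor a clean diagnostic.
    return "other"


def keep_accepted(cases, returncode_of=compiler_returncode):
    """Keep only the test cases that compile successfully."""
    return [case for case in cases if classify(returncode_of(case)) == "accepted"]
```

Passing `returncode_of` as a parameter makes it easy to swap in a different compiler, or a stub when trying the filter out without gcc installed.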

Enter prog-fuzz


prog-fuzz output


If it's not clear by now, I'm not using afl-fuzz to drive the main fuzzing process for the techniques above. I decided it was easier to write a fuzzer from scratch, just reusing the AFL instrumentation and some of the setup code to collect the coverage information. Without the fork server, it's surprisingly little code, on the order of 15-20 lines of code! (I do have support for the fork server on a different branch and it's not THAT much harder to implement, but I simply haven't gotten around to it yet; and it also wasn't really needed to find a lot of bugs).
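To give a feel for why the driving loop itself can be so small, here is a generic coverage-feedback loop in Python. It is a sketch under heavy assumptions and not the prog-fuzz code: `target` stands in for running the instrumented compiler and reading back its coverage, and the toy target and mutator exist only so the example runs on its own.

```python
import random


def fuzz(target, mutate, seeds, iterations, rng):
    """Keep a mutated test case only if it reaches coverage not seen before."""
    corpus = list(seeds)
    seen = set()
    for case in corpus:
        seen |= target(case)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = target(candidate)
        if not coverage <= seen:  # at least one new coverage point
            seen |= coverage
            corpus.append(candidate)
    return corpus, seen


def toy_target(case):
    """Stand-in for an instrumented run: returns the 'coverage points' hit."""
    coverage = {"len%d" % min(len(case), 3)}
    if b"(" in case:
        coverage.add("has_paren")
    return coverage


def toy_mutate(case, rng):
    """Stand-in mutator: append one byte chosen from a tiny alphabet."""
    return case + bytes([rng.choice(b"a(")])
```

Calling `fuzz(toy_target, toy_mutate, [b""], 50, random.Random(0))` returns the grown corpus together with every coverage point reached. In a real setup the two stand-ins would be replaced by the grammar-based generator and a run of the instrumented compiler.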

You can find prog-fuzz on GitHub: https://github.com/vegard/prog-fuzz

The code is not particularly clean, it's a hacked-up fuzzer that gets the job done. I'll want to clean that up at some point, document all the steps to build gcc with AFL instrumentation, etc., and merge a proper fork server. I just want the code to be out there in case somebody else wants to have a poke around.

Results


From the end of February until some time in April I ran the fuzzer on and off and reported just over 100 distinct gcc bugs in total (32 of them fixed so far, by my count):
Now, there are a few things to be said about these bugs.

First, these bugs are mostly crashes: internal compiler errors ("ICEs"), assertion failures, and segfaults. Compiler crashes are usually not very high priority bugs -- especially when you are dealing with invalid programs. Most of the crashes would never occur "naturally" (i.e. as the result of a programmer trying to write some program). They represent very specific edge cases that may not be important at all in normal usage. So I am under no delusions about the relative importance of these bugs; a compiler crash is hardly a security risk.

However, I still think there is value in fuzzing compilers. Personally I find it very interesting that the same technique on rustc, the Rust compiler, only found 8 bugs in a couple of weeks of fuzzing, and not a single one of them was an actual segfault. I think it does say something about the nature of the code base, code quality, and the relative dangers of different programming languages, in case it was not clear already. In addition, compilers (and compiler writers) should have these fuzz testing techniques available to them, because it clearly finds bugs. Some of these bugs also point to underlying weaknesses or to general cases where something really could go wrong in a real program. In all, knowing about the bugs, even if they are relatively unimportant, will not hurt us.

Second, I should also note that I did have conversations with the gcc devs while fuzzing. I asked if I should open new bugs or attach more test cases to existing reports if I thought the area of the crash looked similar, even if it wasn't the exact same stack trace, etc., and they always told me to file a new report. In fact, I would like to praise the gcc developer community: I have never had such a pleasant bug-reporting experience. Within a day of reporting a new bug, somebody (usually Martin Liška or Marek Polacek) would run the test case and mark the bug as confirmed as well as bisect it using their huge library of precompiled gcc binaries to find the exact revision where the bug was introduced. This is something that I think all projects should strive to do -- the small feedback of having somebody acknowledge the bug is a huge encouragement to continue the process. Other gcc developers were also very active on IRC and answered almost all my questions, ranging from silly "Is this undefined behaviour?" to "Is this worth reporting?". In summary, I have nothing but praise for the gcc community.

I should also add that I played briefly with LLVM/clang, and prog-fuzz found 9 new bugs (2 of them fixed so far):
In addition to those, I also found a few other bugs that had already been reported by Sami Liedes back in 2014 which remain unfixed.

For rustc, I will write a more detailed blog post about how to set it up, as compiling rustc itself with AFL instrumentation is non-trivial and it makes more sense to detail those exact steps apart from this post.

What next?


I mentioned the efforts by Regehr et al. and Dmitry Babokin et al. on Csmith and yarpgen, respectively, as fuzzers that generate valid (UB-free) C/C++ programs for finding code generation bugs. I think there is work to be done here to find more code generation bugs; as far as I can tell, nobody has yet combined instrumentation feedback (grey-box fuzzing) with this kind of test case generator. Well, I tried to do it, but it requires a lot of effort to generate valid programs that are also interesting, and I stopped before finding any actual bugs. But I really think this is the future of compiler fuzzing, and I will outline the ideas that I think will have to go into it:
I don't have the time to continue working on this at the moment, but please do let me know if you would like to give it a try and I'll do my best to answer any questions about the code or the approach.

Acknowledgements


Thanks to John Regehr, Martin Liška, Marek Polacek, Jakub Jelinek, Richard Guenther, David Malcolm, Segher Boessenkool, and Martin Jambor for responding to my questions and bug reports!

Thanks to my employer, Oracle, for allowing me to do part of this fuzzing effort using company time and resources.

June 25, 2018 07:35 AM

June 22, 2018

Paul E. McKenney: Stupid RCU Tricks: Changes to -rcu Workflow

The -rcu tree also takes LKMM patches, and I have been handling these completely separately, with one branch for RCU and another for LKMM. But this can be a bit inconvenient, and more important, can delay my response to patches to (say) LKMM if I am doing (say) extended in-tree RCU testing. So it is time to try something a bit different.

My current thought is to continue to have separate LKMM and RCU branches (or more often, sets of branches) containing the commits to be offered up to the next merge window. The -rcu branch lkmm would flag the LKMM branch (or, more often, merge commit) and a new -rcu branch rcu would flag the RCU branch (or, again more often, merge commit). Then the lkmm and rcu merge commits would be merged, with new commits on top. These new commits would be intermixed RCU and LKMM commits.

The tip of the -rcu development effort (both LKMM and RCU) would be flagged with a new dev branch, with the old rcu/dev branch being retired. The rcu/next branch will continue to mark the commit to be pulled into the -next tree, and will point to the merge of the rcu and lkmm branches during the merge window.

I will create the next-merge-window branches sometime around -rc1 or -rc2, as I have in the past. I will send RFC patches to LKML shortly thereafter. I will send a pull request for the rcu branch around -rc5, and will send final patches from the lkmm branch at about that same time.

Should continue to be fun! :–)

June 22, 2018 09:17 PM

June 21, 2018

James Bottomley: Containers and Cloud Security

Introduction

The idea behind this blog post is to take a new look at how cloud security is measured and what its impact is on the various actors in the cloud ecosystem.  From the measurement point of view, we look at the vertical stack: all code that is traversed to provide a service all the way from input web request to database update to output response potentially contains bugs; the bug density is variable for the different components but the more code you traverse the higher your chance of exposure to exploitable vulnerabilities.  We’ll call this the Vertical Attack Profile (VAP) of the stack.  However, even this axis is too narrow because the primary actors are the cloud tenant and the cloud service provider (CSP).  In an IaaS cloud, part of the vertical profile belongs to the tenant (the guest kernel, guest OS and application) and part (the hypervisor and host OS) belongs to the CSP.  However, the CSP vertical has the additional problem that any exploit in this piece of the stack can be used to jump into either the host itself or any of the other tenant virtual machines running on the host.  We’ll call this exploit causing a failure of containment the Horizontal Attack Profile (HAP).  We should also note that any Horizontal Security failure is a potentially business destroying event for the CSP, so they care deeply about preventing them.  Conversely any exploit occurring in the VAP owned by the Tenant can be seen by the CSP as a tenant only problem and one which the Tenant is responsible for locating and fixing.  We correlate size of profile with attack risk, so the larger the profile the greater the probability of being exploited.

From the Tenant point of view, improving security can be done in one of two ways, the first (and mostly aspirational) is to improve the security and monitoring of the part of the Vertical the Tenant is responsible for and the second is to shift responsibility to the CSP, so make the CSP responsible for more of the Vertical.  Additionally, for most Tenants, a Horizontal failure mostly just means they lose trust in the CSP, unless the Tenant is trusting the CSP with sensitive data which can be exfiltrated by the Horizontal exploit.  In this latter case, the Tenant still cannot do anything to protect the CSP part of the Security Profile, so it’s mostly a contractual problem: SLAs and penalties for SLA failures.

Examples

To see how these interpretations apply to the various cloud environments, let’s look at some of the Cloud (and pre-Cloud) models:

Physical Infrastructure

The left hand diagram shows a standard IaaS rented physical system.  Since the Tenant rents the hardware it is shown as red indicating CSP ownership and the two Tenants are shown in green and yellow.  In this model, barring attacks from the actual hardware, the Tenant owns the entirety of the VAP.  The nice thing for the CSP is that hardware provides air gap security, so there is no HAP which means it is incredibly secure.

However, there is another (much older) model shown on the right, called the shared login model, where the Tenant only rents a login on the physical system.  In this model, only the application belongs to the Tenant, so the CSP is responsible for much of the VAP (the expanded red area).  Here the total VAP is the same, but the Tenant’s VAP is much smaller: the CSP is responsible for maintaining and securing everything apart from the application.  From the Tenant point of view this is a much more secure system since they’re responsible for much less of the security.  From the CSP point of view there is now a HAP because a tenant compromising the kernel can control the entire system and jump to other tenant processes.  This actually has the worst HAP of all the systems considered in this blog.

Hypervisor based Virtual Infrastructure

In this model, the total VAP is unquestionably larger (worse) than the physical system above because there’s simply more code to traverse (a guest and a host kernel).  However, from the Tenant’s point of view, the VAP should be identical to that of unshared physical hardware because the CSP owns all the additional parts.  However, there is the possibility that the Tenant may be compromised by vulnerabilities in the Virtual Hardware Emulation.  This can be a worry because an exploit here doesn’t lead to a Horizontal security problem, so the CSP is apt to pay less attention to vulnerabilities in the Virtual Hardware simply because each guest has its own copy (even though that copy is wholly under the control of the CSP).

The HAP is definitely larger (worse) than the physical host because of the shared code in the Host Kernel/Hypervisor, but it has often been argued that because this is so deep in the Vertical stack the chances of exploit are practically zero (although venom gave the lie to this hope: stack depth represents obscurity, not security).

However, there is another way of improving the VAP and that’s to reduce the number of vulnerabilities that can be hit.  One way that this can be done is to reduce the bug density (the argument for rewriting code in safer languages) but another is to restrict the amount of code which can be traversed by narrowing the interface (for example, see arguments in this hotcloud paper).  On this latter argument, the host kernel or hypervisor does have a much lower VAP than the guest kernel because the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the syscall interface).

The important takeaways here are firstly that simply transferring ownership of elements in the VAP doesn’t necessarily improve the Tenant VAP unless you have some assurance that the CSP is actively monitoring and fixing them.  Conversely, when the threat is great enough (Horizontal Exploit), you can trust to the natural preservation instincts of the CSP to ensure correct monitoring and remediation because a successful Horizontal attack can be a business destroying event for the CSP.

Container Based Virtual Infrastructure

The total VAP here is identical to that of physical infrastructure.  However, the Tenant component is much smaller (the kernel accounting for around 50% of all vulnerabilities).  It is this reduction in the Tenant VAP that makes containers so appealing: the CSP is now responsible for monitoring and remediating about half of the physical system VAP which is a great improvement for the Tenant.  Plus when the CSP remediates on the host, every container benefits at once, which is much better than having to crack open every virtual machine image to do it.  Best of all, the Tenant images don’t have to be modified to benefit from these fixes, simply running on an updated CSP host is enough.  However, the cost for this is that the HAP is the entire Linux kernel syscall interface, meaning the HAP is much larger than the hypervisor virtual infrastructure case because the latter benefits from interface narrowing to only the hypercalls (qualitatively, assuming the hypercall interface is ~30 calls and the syscall interface is ~300 calls, then the HAP is 10x larger in the container case than the hypervisor case); however, thanks to protections from the kernel namespace code, the HAP is less than the shared login server case.  Best of all, from the Tenant point of view, this entire HAP cost is borne by the CSP, which makes this an incredible deal: not only does the Tenant get a significant reduction in their VAP but the CSP is hugely motivated to keep on top of all vulnerabilities in their part of the VAP and remediate very fast because of the business implications of a successful horizontal attack.  The flip side of this is that a large number of the world’s CSPs are very unhappy about these potential risks and costs and actually try to shift responsibility (and risk) back to the Tenant by advocating nested virtualization solutions like running containers in hypervisors.  So remember, you’re only benefiting from the CSP motivation to actively maintain their share of the VAP if your CSP runs bare metal containers because otherwise they’ve quietly palmed the problem back off on you.
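The interface-width comparison above is plain arithmetic, and a minimal sketch makes its assumption explicit.  The function name is made up for illustration and the call counts are just the rough ~30 and ~300 figures quoted above; the model assumes every call contributes equally to the profile, which real interfaces certainly don’t.

```python
def relative_hap(shared_interface_calls: int, baseline_interface_calls: int) -> float:
    """Ratio of two Horizontal Attack Profiles, under the crude assumption
    that the HAP scales linearly with the number of calls exposed by the
    shared interface (hypothetical model, not a measurement)."""
    if baseline_interface_calls <= 0:
        raise ValueError("baseline interface must expose at least one call")
    return shared_interface_calls / baseline_interface_calls


SYSCALLS = 300    # rough size of the syscall interface, as assumed in the text
HYPERCALLS = 30   # rough size of the hypercall interface, as assumed in the text

print(relative_hap(SYSCALLS, HYPERCALLS))  # 10.0
```

Anything that narrows the shared interface shrinks the numerator, which is the point of the guarding techniques discussed later.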

Other Avenues for Controlling Attack Profiles

The assumption above was that defect density per component is roughly constant, so effectively the more code the more defects.  However, it is definitely true that different code bases have different defect densities, so one way of minimizing your VAP is to choose the code you rely on carefully and, of course, follow bug reduction techniques in the code you write.

Density Reduction

The simplest way of reducing defects is to find and fix the ones in the existing code base (while additionally being careful about introducing new ones).  This means it is important to know how actively defects are being searched for and how quickly they are being remediated.  In general, the greater the user base for the component, the larger the pool of defect searchers and the faster the speed of their remediation, which means that although the Linux Kernel is a big component in the VAP and HAP, a diligent patch routine is a reasonable line of defence because a fixed bug is not an exploitable bug.

Another way of reducing defect density is to write (or rewrite) the component in a language which is less prone to exploitable defects.  While this approach has many advocates, particularly among language partisans, it suffers from the defect decay issue: the idea that the maximum number of defects occurs in freshly minted code and the number goes down over time because the more time from release the more chance they’ve been found.  This means that a newly rewritten component, even in a shiny bug reducing language, can still contain more bugs than an older component written in a more exploitable language, simply because a significant number of bugs introduced on creation have been found in the latter.
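The defect decay argument can be sketched with a toy model.  Everything here is an assumption made for illustration: the exponential shape of the decay, the discovery rate, and the initial defect counts are all hypothetical numbers, not measurements of any real code base.

```python
import math


def remaining_defects(initial_defects: float, discovery_rate: float, years: float) -> float:
    """Toy model (an assumption, not from any study): defects are found and
    fixed at a constant fractional rate per year, so the number still
    latent decays exponentially with time since release."""
    return initial_defects * math.exp(-discovery_rate * years)


# Hypothetical numbers: the rewrite starts with far fewer defects,
# but the old component has had ten years of people finding them.
old_component = remaining_defects(initial_defects=100, discovery_rate=0.3, years=10)
new_rewrite = remaining_defects(initial_defects=40, discovery_rate=0.3, years=0)

print(old_component < new_rewrite)  # True
```

Under these made-up parameters the old component is left with about five latent defects against forty in the fresh rewrite; different parameters can of course reverse the outcome, which is why the argument needs real discovery data to settle.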

Code Reduction (Minimization Techniques)

It also stands to reason that, for a complex component, simply reducing the amount of code that is accessible to the upper components reduces the VAP because it directly reduces the number of defects.  However, reducing the amount of code isn’t as simple as it sounds: it can only really be done by components that are configurable and then only if you’re not using the actual features you eliminate.  Elimination may be done in two ways, either physically, by actually removing the code from the component or virtually by blocking access using a guard (see below).

Guarding and Sandboxing

Guarding is mostly used to do virtual code elimination by blocking access to certain code paths that the upper layers do not use.  For instance, seccomp in the Linux Kernel can be used to block access to system calls you know the application doesn’t use, meaning it also blocks any attempt to exploit code that would be in those system calls, thus reducing the VAP (and also reducing the HAP if the kernel is shared).

The deficiencies in the above are obvious: if the application needs to use a system call, you cannot block it although you can filter it, which leads to huge and ever more complex seccomp policies.  The solution to the problem of a system call the application has to use can sometimes be guarding emulation.  In this mode the guard code emulates all the effects of the system call without actually making the system call into the kernel.  This approach, often called sandboxing, is certainly effective at reducing the HAP since the guards usually run in their own address space which cannot be used to launch a horizontal attack.  However, the sandbox may or may not reduce the VAP depending on the bugs in the emulation code vs the bugs in the original.  One of the biggest potential disadvantages to watch out for with sandboxing is the fact that the address space the sandbox runs in is often that of the tenant, often meaning the CSP has quietly switched ownership of that component back to the tenant as well.

Conclusions

First and foremost: security is hard.  As a cloud Tenant, you really want to offload as much of it as possible to people who are much more motivated to actually do it than you are (i.e. the Cloud Service Provider).

The complete Vertical Attack Profile of a container bare metal system in the cloud is identical to a physical system and better than a Hypervisor based system; plus the tenant owned portion is roughly 50% of the total VAP meaning that Containers are by far the most secure virtualization technology available today from the Tenant perspective.

The increased Horizontal Attack profile that containers bring should all rightly belong to the Cloud Service Provider.  However, CSPs are apt to shirk this responsibility and try to find creative ways to shift responsibility back to the tenant including spreading misinformation about the container Attack profiles to try to make Tenants demand nested solutions.

Before you, as a Tenant, start worrying about the CSP owned Horizontal Attack Profile, make sure that contractual remedies (like SLAs or Reputational damage to the CSP) would be insufficient to cover the consequences of any data loss that might result from a containment breach.  Also remember that unless you, as the tenant, are under external compliance obligations like HIPAA or PCI, contractual remedies for a containment failure are likely sufficient and you should keep responsibility for the HAP where it belongs: with the CSP.

June 21, 2018 05:31 AM

June 19, 2018

Pete Zaitcev: Slasti py3

Got Slasti 2.1 released today, the main feature being support for Python 3. Some of the changes were somewhat... horrifying maybe? I tried to adhere to a general plan, where the whole of the application operates in unicode, and the UTF-8 data is encoded/decoded at the boundary. Unfortunately, in practice the boundary was rather leaky, so in several places I had to resort to isinstance(). I expected to always assign a type to all variables and fields, and then rigidly convert as needed. But WSGI had its own ideas.
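For readers who have not hit this, here is a minimal sketch of the "text inside, bytes at the boundary" plan under WSGI on Python 3. This is not Slasti's code: the function names are made up, and it relies only on the PEP 3333 rule that environ strings are native str objects holding bytes decoded as ISO-8859-1, so a UTF-8 path has to be round-tripped before the application sees it.

```python
def path_from_environ(environ: dict) -> str:
    """Recover the real text of PATH_INFO.

    PEP 3333 hands us a str whose code points are really bytes
    (ISO-8859-1), so re-encode to get the bytes back, then decode
    them as the UTF-8 they actually are."""
    raw = environ.get("PATH_INFO", "")
    return raw.encode("iso-8859-1").decode("utf-8")


def app(environ, start_response):
    # Inside the application everything is text (str).
    path = path_from_environ(environ)
    message = "You asked for " + path

    # At the outgoing boundary, encode exactly once.
    body = message.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The leak is visible even here: the str that WSGI supplies is not really text until it has been through the round trip, so a type check alone cannot tell a converted value from an unconverted one.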

Overall, the biggest source of issues was not the py3 model, but trying to make the code compatible. I'm not going to do that again if I can help it: either py2 or py3, but not both.

UPDATE: Looks like CKS agrees that compatible code is usually too hard. I'm glad the recommendation to avoid Python 3 entirely is no longer operational.

June 19, 2018 02:54 AM

June 18, 2018

James Morris: Linux Security BoF at Open Source Summit Japan

This is a reminder for folks attending OSS Japan this week that I’ll be leading a Linux Security BoF session on Wednesday at 6pm.

If you’ve been working on a Linux security project, feel welcome to discuss it with the group.  We will have a whiteboard and projector.   This is also a good opportunity to raise topics for discussion, and to ask questions about Linux security.

See you then!

June 18, 2018 08:26 AM

June 15, 2018

Pete Zaitcev: Fedora 28 and IPv6 Neighbor Discovery

Finally updated my laptop to F28 and ssh connections started hanging. They hang for 15-20 seconds, then get unstuck for a few seconds, then hang again, and so on, cycling. I thought it was a WiFi problem at first. But eventually I narrowed it down to IPv6 ND being busted.

A packet trace on the laptop shows that traffic flows until the laptop issues a neighbor solicitation. The router replies with an advertisement, which I presume is getting dropped. Traffic stops — although what's strange, tcpdump still captures outgoing packets that the laptop sends. In a few seconds, the router sends a neighbor solicitation, but the laptop never replies. Presumably, dropped as well. This continues until a router advertisement resets the cycle.

Stopping firewalld lets solicitations in and the traffic resumes, so obviously a rule is busted somewhere. The IPv6 ICMP appears allowed, but the ip6tables rules generated by firewalld are fairly opaque, so I cannot be sure. Ended up filing bug 1591867 for the time being and forcing ssh -4.

UPDATE: Looks like the problem is a "reverse path filter". Setting IPv6_rpfilter=no in /etc/firewalld/firewalld.conf fixes the issue (thanks to Victor for the tip). Here's an associated comment in the configuration file:

# Performs a reverse path filter test on a packet for IPv6. If a reply to the
# packet would be sent via the same interface that the packet arrived on, the
# packet will match and be accepted, otherwise dropped.
# The rp_filter for IPv4 is controlled using sysctl.

Indeed there's no such sysctl for v6. Obviously the problem is that packets with the source of fe80::/10 are mistakenly assumed to be martians and dropped. That's easy enough to fix, I hope. But it's fascinating that we have an alternative configuration method nowadays, only exposed by certain specialist tools. If I don't have firewalld installed, and want this setting changed, what then?
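Router advertisements, and much other Neighbor Discovery traffic, arrive with a link-local source address, so any reverse-path check has to recognise those rather than treat them as martians. As a small illustration only (this is neither firewalld's nor the kernel's logic, and the function name is made up), the standard link-local block fe80::/10 from RFC 4291 can be tested with Python's ipaddress module:

```python
import ipaddress

# Link-local unicast block defined by RFC 4291.
LINK_LOCAL = ipaddress.ip_network("fe80::/10")


def is_link_local_source(addr: str) -> bool:
    """True if addr falls inside the IPv6 link-local block."""
    return ipaddress.ip_address(addr) in LINK_LOCAL


print(is_link_local_source("fe80::1"))      # True
print(is_link_local_source("2001:db8::1"))  # False
```

A link-local address is only meaningful together with the interface it was seen on, which is exactly the piece of context a route lookup needs to get right.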

Remarkably, the problem was reported first in March (it's June now). This tells me that most likely the erroneous check itself is in the kernel somewhere, and firewalld is not at fault, which is why Erik isn't fixing it. He should've reassigned the bug to kernel, if so, but...

The commit cede24d1b21d68d84ac5a36c44f7d37daadcc258 looks like the fix. Unfortunately, it just missed the 4.17.

June 15, 2018 05:39 PM

June 14, 2018

Kees Cook: security things in Linux v4.17

Previously: v4.16.

Linux kernel v4.17 was released last week, and here are some of the security things I think are interesting:

Jailhouse hypervisor

Jan Kiszka landed Jailhouse hypervisor support, which uses static partitioning (i.e. no resource over-committing), where the root “cell” spawns new jails by shrinking its own CPU/memory/etc resources and hands them over to the new jail. There’s a nice write-up of the hypervisor on LWN from 2014.

Sparc ADI

Khalid Aziz landed the userspace support for Sparc Application Data Integrity (ADI or SSM: Silicon Secured Memory), which is the hardware memory coloring (tagging) feature in Sparc M7. I’d love to see this extended into the kernel itself, as it would kill linear overflows between allocations, since the base pointer being used is tagged to belong to only a certain allocation (sized to a multiple of cache lines). Any attempt to increment beyond, into memory with a different tag, raises an exception. Enrico Perla has some great write-ups on using ADI in allocators and a comparison of ADI to Intel’s MPX.

new kernel stacks cleared on fork

It was possible that old memory contents would live in a new process’s kernel stack. While normally not visible, “uninitialized” memory read flaws or read overflows could expose these contents (especially stuff “deeper” in the stack that may never get overwritten for the life of the process). To avoid this, I made sure that new stacks were always zeroed. Oddly, this “priming” of the cache appeared to actually improve performance, though it was mostly in the noise.

MAP_FIXED_NOREPLACE

As part of further defense in depth against attacks like Stack Clash, Michal Hocko created MAP_FIXED_NOREPLACE. The regular MAP_FIXED has a subtle behavior not normally noticed (but used by some, so it couldn’t just be fixed): it will replace any overlapping portion of a pre-existing mapping. This means the kernel would silently overlap the stack into mmap or text regions, since MAP_FIXED was being used to build a new process’s memory layout. Instead, MAP_FIXED_NOREPLACE has all the features of MAP_FIXED without the replacement behavior: it will fail if a pre-existing mapping overlaps with the newly requested one. The ELF loader has been switched to use MAP_FIXED_NOREPLACE, and it’s available to userspace too, for similar use-cases.
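To see the difference from userspace, here is a hedged sketch that calls libc's mmap() through ctypes and asks for a fixed mapping on top of an existing one. The flag value 0x100000 is the asm-generic one and is an assumption (a few architectures use a different number), the off_t argument is assumed to fit a C long (true on 64-bit Linux), and the function name is made up for the example.

```python
import ctypes
import errno
import mmap

# Assumption: asm-generic value; some architectures define it differently.
MAP_FIXED_NOREPLACE = 0x100000

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def try_noreplace() -> str:
    """Map one anonymous page, then request the same address again with
    MAP_FIXED_NOREPLACE and report what the kernel did."""
    size = mmap.PAGESIZE
    prot = mmap.PROT_READ | mmap.PROT_WRITE
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS

    first = libc.mmap(None, size, prot, flags, -1, 0)
    if first == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "initial mmap failed")
    try:
        second = libc.mmap(first, size, prot, flags | MAP_FIXED_NOREPLACE, -1, 0)
        if second == MAP_FAILED:
            err = ctypes.get_errno()
            if err == errno.EEXIST:
                return "refused"  # v4.17+ behaviour: existing mapping left alone
            return "error: " + errno.errorcode.get(err, str(err))
        # Kernels that don't know the flag treat the address as a hint
        # and place the new mapping somewhere else.
        libc.munmap(second, size)
        return "moved"
    finally:
        libc.munmap(first, size)


print(try_noreplace())
```

On a kernel with the feature the second call fails with EEXIST and the first mapping survives untouched, whereas plain MAP_FIXED would have silently replaced it.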

pin stack limit during exec

I used a big hammer and pinned the RLIMIT_STACK values during exec. There were multiple methods to change the limit (through at least setrlimit() and prlimit()), and there were multiple places the limit got used to make decisions, so it seemed best to just pin the values for the life of the exec so no games could get played with them. Too much assumed the value wasn’t changing, so better to make that assumption actually true. Hopefully this is the last of the fixes for these bad interactions between stack limits and memory layouts during exec (which have all been defensive measures against flaws like Stack Clash).
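The two userspace entry points mentioned above are both reachable from Python's resource module, which makes for a quick illustration. This sketch only shows the interfaces, not the kernel change; the function name is made up, and resource.prlimit() is Linux-only.

```python
import resource


def rewrite_stack_limit_unchanged():
    """Read RLIMIT_STACK, then write the same values back through both
    interfaces, returning what each one reported."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    # Path 1: setrlimit() on the calling process.
    resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    # Path 2: prlimit(); pid 0 means the calling process.  When new
    # limits are passed it returns the previous ones.
    previous = resource.prlimit(0, resource.RLIMIT_STACK, (soft, hard))

    return (soft, hard), previous


via_getrlimit, via_prlimit = rewrite_stack_limit_unchanged()
print(via_getrlimit == via_prlimit)  # True
```

Since prlimit() also accepts another process's pid, the limit could change at any point, which is why pinning it for the duration of exec is the simpler guarantee.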

Variable Length Array removals start

Following some discussion over Alexander Popov’s ongoing port of the stackleak GCC plugin, Linus declared that Variable Length Arrays (VLAs) should be eliminated from the kernel entirely. This is great because it kills several stack exhaustion attacks, including weird stuff like stepping over guard pages with giant stack allocations. However, with several hundred uses in the kernel, this wasn’t going to be an easy job. Thankfully, a whole bunch of people stepped up to help out: Gustavo A. R. Silva, Himanshu Jha, Joern Engel, Kyle Spiers, Laura Abbott, Lorenzo Bianconi, Nikolay Borisov, Salvatore Mesoraca, Stephen Kitt, Takashi Iwai, Tobin C. Harding, and Tycho Andersen. With Linus Torvalds and Martin Uecker, I also helped rewrite the max() macro to eliminate false positives seen by the -Wvla compiler option. Overall, about 1/3rd of the VLA instances were solved for v4.17, with many more coming for v4.18. I’m hoping we’ll have entirely eliminated VLAs by the time v4.19 ships.

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.18; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

June 14, 2018 11:23 PM

June 07, 2018

Pete Zaitcev: Fundamental knowledge

Colleagues working in space technologies discussed recently if fundamental education were necessary for a programmer, so just for a reference, here's a list of fundamental-ish areas I had trouble with in practice over a 30 year career.

Statistics. This should be obvious. Although in theory I'm educated in the topic, I always had difficulty with it, and barely passed my tests, decades ago.

Error correction. To be entirely honest, I blew this. Every time I had to do it, I ended up either using Phil Karn's library, or relying on Kevin Greenan's erasure coding package. I think the only time I implemented something that worked was the UAT.

The DSP on Inphase/Quadrature data. This one is really vexing. I ended up with some ridiculous ad-hoc code, even though it's very interesting. In my defense, there were some difficult performance constraints, so even if I knew the underlying math, there would be no way to apply it.

Other than the above, I don't feel like I was held back by any kind of fundamental background, most of all not in CS. About the only time it mattered was when an interviewer asked me to implement an R-B tree.

June 07, 2018 04:03 PM

Pavel Machek: Complex cameras coming to PCs

It seems PCs are getting complex cameras. Which is bad news for PCs, because existing libv4l2 will not work there, but good news for OMAP3, as there will be bigger pressure to fix stuff.

June 07, 2018 12:34 PM

June 05, 2018

Davidlohr Bueso: Linux v4.17: Performance Goodies

With Linux v4.17 now released, there are some interesting performance changes that went in and are worth looking at. As always, the term 'performance' can be vague in that some gains in one area can negatively affect another so take everything with a grain of salt.


sysvipc: introduce STAT_ANY commands

There was a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and getting stat info (such as via SHM_STAT shmctl command). The latter does permission checks for the object vs S_IRUGO. As such there can be cases where EACCES is returned via syscall but the info is displayed anyway in the procfs files. While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an oversight - so we are stuck with it.

Some applications need the procfs info (without root privileges), and reading it can be rather slow in comparison with a syscall -- up to 500x in some reported cases. For this, the new {SEM,SHM,MSG}_STAT_ANY commands have been introduced.
[Commit c21a6970ae72, a280d6dc77eb, 23c8cec8cf67]


kvm: x86 paravirtualization hints and KVM_HINTS_DEDICATED

When dealing with CPU virtualization, many in-kernel heuristics and optimizations revolve around the overcommitted scenario.  By introducing KVM_HINTS_DEDICATED, the hypervisor administrator can select this option when there are pinned 1:1 virtual to physical CPU scenarios; particularly reducing the paravirt overhead in locking and TLB flushing as the vCPU is most unlikely to get preempted. In these cases, native qspinlock may perform better than pvqspinlock as it disables paravirt spinlock slowpath optimizations. There is an older Xen equivalent available as a kernel parameter: xen_nopvspin.
[Commit b2798ba0b876, 34226b6b7098, 6beacf74c257]


sched: rework idle loop

Rework the idle loop in order to prevent CPUs from spending too much time in shallow idle states by making it stop the scheduler tick before putting the CPU into an idle state only if the idle duration predicted by the idle governor is long enough. It reduces idle power on some systems by 10% or more and may improve performance of workloads in which the idle loop overhead matters. This required the code to be reordered to invoke the idle governor before stopping the tick, among other things.
[Commit 0e7767687fda, 2aaf709a518d, ed98c3491998]


mm: pcpu pages optimizations around zone lock

Two optimizations around zone->lock in free_pcppages_bulk() that yield around a 5% performance improvement in page-fault benchmarks (will-it-scale in this case). The first reduces the scope of the lock when freeing a batch of pages from the pcp lists back to buddy. Considering the per-cpu semantics, the lock was unnecessarily held while pages are chosen from the pcpu page's migratetype list.

The second improvement adds a prefetch to the to-be-freed page's buddy outside of the lock in hope that accessing the buddy's page structure later with the lock held will be faster. Normally prefetching is frowned upon, particularly for microbenchmarks, however in this particular case the prefetched pointer will always be used.
[Commit 0a5f4e5b4562, 97334162e4d7]


mm: lockless list_lru_count_one()

During slab reclaim for a memcg, shrink_slab() iterates over all registered shrinkers in the system, trying to count and consume objects related to the cgroup. Under memory pressure, the operation became a bottleneck while trying to acquire the nlru->lock. By applying RCU to the data structure, the lookup can be done without taking the lock, which translates into the overall contention pretty much disappearing.
[Commit 0c7c1bed7e13]


memory hotplug optimizations

These optimizations reduce the number of times struct pages are traversed during a memory hotplug operation, from three to one. Among other benefits, the memory hotplug is made similar to the boot memory initialization path because it initializes struct pages only in one function. Finally, this improves memory hotplug performance because the cache is not being evicted several times, and it also reduces loop branching overhead.
[Commit d0dc12e86b31]


procfs: miscellaneous optimizations

Access to various files within procfs has been optimized by replacing calls to seq_printf() with lower cost alternatives. Changes show some performance benefits for ad-hoc microbenchmarks.


btrfs: relax barrier when unlocking an extent buffer

Serializing checks for active waitqueue requires a barrier as it can race with the waiter side. Such is the case with btrfs_tree_unlock(), which was abusing the barrier semantics on architectures where atomic operations are ordered, such as x86. A performance improvement is immediately noticeable by optimizing barrier usage while maintaining the necessary semantics.
[Commit 2e32ef87b074]


x86/pti: leave kernel text global for no PCID

From the patch: Global pages are bad for hardening because they potentially let an exploit read the kernel image via a Meltdown-style attack. But, global pages are good for performance because they reduce TLB misses when making user/kernel transitions, especially when PCIDs are not available, such as on older hardware, or where a hypervisor has disabled them for some reason.

This change implements a basic, sane policy: If PCIDs are available, only map a minimal amount of kernel text global. If no PCIDs, map all kernel text global. This translates into a considerable throughput increase on an lseek microbenchmark.
[Commit 8c06c7740d19]


lib/raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

This enhancement uses the vpermxor instruction to optimize the raid6 Q syndrome. This instruction was made available with POWER8, ISA version 2.07. It allows for both vperm and vxor instructions to be done in a single instruction. The benchmark results show a 35% speed increase over the best existing algorithm for powerpc (altivec).
[Commit 751ba79cc552]
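
For readers unfamiliar with what is being accelerated, here is a scalar sketch of the RAID6 P and Q syndromes. It is not the kernel's altivec code and says nothing about how vpermxor is used; it only shows the arithmetic being computed, with Q evaluated by Horner's rule in GF(2^8) using the polynomial 0x11D and generator 2. The function names are made up for the example.

```python
def gf_mul2(x: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D)."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def pq_syndromes(data_blocks):
    """Return (P, Q) for a list of equally sized data blocks.

    P is the plain XOR of all blocks.  Q is the sum of 2**i * D_i over
    the disks, computed by Horner's rule from the highest-numbered disk
    down: Q = D_0 ^ 2*(D_1 ^ 2*(D_2 ^ ...))."""
    length = len(data_blocks[0])
    if any(len(block) != length for block in data_blocks):
        raise ValueError("all data blocks must have the same length")
    p = bytearray(length)
    q = bytearray(length)
    for block in reversed(data_blocks):
        for i in range(length):
            p[i] ^= block[i]
            q[i] = gf_mul2(q[i]) ^ block[i]
    return bytes(p), bytes(q)


p, q = pq_syndromes([b"\x01", b"\x01", b"\x80"])
print(p.hex(), q.hex())  # 80 39
```

The inner loop does one multiply-by-2 and one XOR per byte per disk, which is the kind of work that vector permute and XOR instructions can do many bytes at a time.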

June 05, 2018 02:51 PM

June 04, 2018

James Bottomley: Why Microsoft is a good steward for GitHub

There seems to be a lot of hysteria going on in various communities that depend on GitHub for their project hosting around the Microsoft acquisition (just look in the comments here and here).  Obviously a lot of social media ink will be expended on this, so I’d just like to explain why as a committed open source developer, I think this will actually be a good thing.

Firstly, it’s very important to remember that git may be open source, but GitHub isn’t: none of the scripts that run the service have much published source code at all.  It may be a closed source hosting infrastructure that a lot of open source projects rely on but that doesn’t make it open source itself.  So why is GitHub not open source?  Well, it all goes back to the business model.  Notwithstanding fantastic market valuations there are lots of companies that play in the open source ecosystem, like GitHub, which struggle to find a sustainable business model (or even revenue).  This leads to a lot of closed/open type models like GitHub (the reason GitHub keeps the code closed is so they can sell it to other companies for internal source management) or Docker Enterprise.

Secondly, even if GitHub were fully open source, as I’ve argued in my essays about the GPL, to trust a corporate player in the ecosystem, you need to be able to understand fully its business motivation for being there and verify the business goals align with the community ones.  As long as the business motivation is transparent and aligned with the community, you know you can trust it.  However, most of the new supposedly “open source” companies don’t have clear business models at all, which means their business motivation is anything but transparent.  Paradoxically this means that most of the new corporate idols in the open source ecosystem are remarkably untrustworthy because their business model changes from week to week as they struggle to please their venture capitalist overlords.  There’s no way you can get the transparency necessary for open source trust if the company itself doesn’t know what its business model will be next week.

Finally, this means that companies with well established open source business models and motivations that don’t depend on the whims of VCs are much more trustworthy in open source in the long term.  Although it’s a fairly recent convert, Microsoft is now among these because it’s clearly visible how its conversion from desktop to cloud both requires open source and requires Microsoft to play nicely with open source.  The fact that it has a trust deficit from past actions is a bonus because from the corporate point of view it has to be extra vigilant in maintaining its open source credentials.  The clinching factor is that GitHub is now ancillary to Microsoft’s open source strategy, not its sole means of revenue, so lots of previous less community oriented decisions, like keeping the GitHub code closed source, can be revisited in time as Microsoft seeks to gain community trust.

For the record, I should point out that although I have a github account, I host all my code on kernel.org mostly because the GitHub workflow really annoys me, having spent a lot of time trying to deduce commit motivations in sparse git commit messages which then require delving into github issues and pull requests only to work out that most of the necessary details are in some private slack back channel well away from public view.  Regardless of who owns GitHub, I don’t see this workflow problem changing any time soon, so I’ll be sticking to my current hosting setup.

June 04, 2018 06:31 PM

May 30, 2018

Paul E. Mc Kenney: Call For Participation in 2018 Linux Plumbers Conference!

Refereed-track, microconference, and BoF proposals all welcome, see below!

Submissions close: September 2, 2018
Speakers notified: September 23, 2018
Slides due: November 9, 2018

Microconference slots often fill before the deadline (so don't wait to submit yours!) but BoF submissions can come late.

Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track Proposals for the 2018 edition of the Linux Plumbers Conference, which will be held in Vancouver, BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the "plumbing" in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with Open Source Summit this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also provides a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the refereed-track talks and the Microconferences, but we never claimed that the world was perfect.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration. We also encourage submitters to consider BoF sessions and the unconference.

To submit a refereed track talk proposal follow the instructions at this website.

Please note that we have a completely different submission system than last year, so please do not let your muscle memory take over.

Submissions are due on or before Friday September 2, 2018 at noon Mountain Time. Since this is after the closure of early registration, speakers may register before this date and we'll refund the registration for any selected presentation's speaker, but for only one speaker per presentation.

Call for Microconference Proposals

We are pleased to announce the Call for Microconferences for the 2018 edition of the Linux Plumbers Conference, which will be held in Vancouver, BC, Canada on November 13-15 in conjunction with the Linux Kernel Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see this website.

Again, please note that we have a completely different submission system than last year, so please do not let your muscle memory take over. In particular, unlike last year, there is no wiki. So instead of creating an entry for your microconference on a wiki, you submit it using the above URL.

Call for Birds of a Feather (BoF) Session Proposals

Last, but by no means least, we are also pleased to announce a call for BoF sessions. These are free-form get-togethers for people wishing to discuss a particular topic. As always, you only need to submit proposals for BoFs you want to hold on-site. In contrast, and again as always, informal BoFs may be held at local drinking establishments or in the “hallway track” at your convenience.

May 30, 2018 07:32 PM

May 29, 2018

Pete Zaitcev: Advogato.org is gone

The domain now redirects to the Wayback Machine. The last captured post is the official farewell message from June 22.

Personally, I would prefer that people hack on trust metrics rather than blockchain. But they did not agree, and personally I've done nothing to advance the field, so I don't have room to complain. And now the flagship open-source implementation is no more (well, of course Google still exists and so the trust metrics stay with us; possibly even get developed further).

May 29, 2018 08:14 PM

May 25, 2018

Gustavo F. Padovan: CFP for linuxdev-br conference extended until 7th of June

We already received some great talk proposals for this year’s event, but to bring in even more good content to our attendees we are extending the Call for Presentation until the 7th of June.

Linux Developer Conference Brazil – linuxdev-br for short – aims to be a meeting point for the worldwide Linux development community. We are looking for talks that deal with the most recent as well as the most relevant topics in FOSS development, including but not limited to kernel and drivers, bootloaders, networking and protocols, containers and virtualization, security, IoT, industry challenges and more. No matter what your background or level is, come share your views with the FOSS community at large.

Details on the topics accepted and how to submit can be found at the conference’s CFP page. Submit your talk now!

The post CFP for linuxdev-br conference extended until 7th of June appeared first on Gustavo Padovan.

May 25, 2018 01:19 AM

May 23, 2018

Pete Zaitcev: Jack Baruth on the agile development

As seen at a blog about cars:

Every software shop from Hyderabad to Cleveland now faithfully, and idiotically, replicates a cargo-cult version of the “standups” and “kanban methods” that were designed to work on a factory floor.

The “standups” are particularly miserable: Toyota’s version was best understood as a five-minute meeting where any potential issues in a given assembly-line department would be sorted out before the shift began, but under the corrupting influence of IBM, Accenture, and other “body shops,” the concept has degenerated into a 45-minute hellscape of offshore “engineers” mumbling a list of their miniature accomplishments out of a speakerphone while everybody else shifts from leg to leg and attempts not to fall asleep.

Shit, man. If even a pro racer turned autojourno can tell, we in software are past the point of ridiculous. That said, morning assembly is nothing new - it was a thing in the 1950s, long before Toyota. It even had native names: in Russia it was called "lineyka", in Japan it was "chōrei".

May 23, 2018 06:49 PM

May 17, 2018

Pete Zaitcev: Amazon AI plz

Not being a native speaker, I get amusing results sometimes when searching on Amazon. For example, "floor scoop" brings up mostly fancy dresses. Apparently, a scoop is a type of dress, which can be floor-length, and so on. The correct request is actually "dust pan". Today though, searching for "Peliton termite" ended with a bunch of bicycle saddles. Apparently, Amazon force-replaced it with "peloton", and I know of no syntax to force my spelling. I suspect that Peliton may have trouble selling their products at Amazon. This sort of thing is making me wary of Alexa. I don't see myself ever winning an argument with a robot who knows better, and is implemented in proprietary software that I cannot adjust.

UPDATE: The "plus prefix" works, e.g. "+peliton" (thanks to elisteran).

May 17, 2018 05:48 PM

May 09, 2018

Pete Zaitcev: The space-based ADS-B

Today, I want to build a satellite that receives ADS-B signals from airplanes over the open ocean, far away from land. With a decent receiver and a simple antenna, it should be possible on a gravity-stabilized cubesat. I know about terrestrial receivers picking signals 200..300 km out, surely with care one can do better. But I highly doubt that it's possible to finance such a toy — unless someone has already done that. I know that people somehow manage to finance AIS receivers, which are basically the same thing, only for ships. How do they do that?

UPDATE: Reportedly, hosted payloads by Aireon on Iridium NEXT satellites do ADS-B. The working altitude of the previous generation of Iridium was 780 km, the NEXT is probably the same.

May 09, 2018 03:27 AM

May 07, 2018

Davidlohr Bueso: Linux v4.16: Performance Goodies


Linux v4.16 was released a few weeks ago and continues the mitigation of meltdown and spectre bugs for x86-64, as well as for arm64 and IBM s390. While v4.16 is not the most exciting kernel version in terms of performance and scalability, the following is an unsorted and incomplete list of changes that went in which I have cherry-picked. As always, the term 'performance' can be vague in that some gains in one area can negatively affect another so take everything with a grain of salt.

sched: reduce migrations and spreading of load to multiple CPUs

The scheduler decisions are biased towards reducing latency of searches but tend to spread load across an entire socket, unnecessarily. At low CPU usage, this means the load on each individual CPU is low, which can be good, but cpufreq decides that utilization on individual CPUs is too low to increase the P-state, and overall throughput suffers.

When a cpufreq driver is completely under the control of the OS, it can be compensated for. For example, intel_pstate can decide to boost apparent cpu utilization if a task recently slept on a CPU for idle. However, if hardware-based cpufreq is in play (e.g. hardware P-states HWP) then very poor decisions can be made and the OS cannot do much about it. This only gets worse as HWP becomes more prevalent, sockets get larger and the p-state for individual cores can be controlled. Just setting the performance governor is not an answer given that plenty of people really do worry about power utilization and still want a reasonable balance between performance and power. Experiments show performance benefits for network benchmarks running on localhost (at ~10% on netperf RR for UDP and TCP, depending on the machine). Hackbench also has some small improvements with ~6-11%, depending on machine and thread count.
[Commit 89a55f56fd1c, 3b76c4a33959, 806486c377e3, 32e839dda3ba]


printk: new locking scheme

Problems around the kernel's printk() call aren't new and traditionally must overcome issues with the console lock. Considering that the kernel printing out to the console is a very generic operation which can be called from virtually anywhere at any time, relying on any sort of lock can cause deadlocks. Similarly, the call to printk() must proceed regardless of the availability of the console lock. As such, what would happen is that upon contention, the task buffers the output for the console lock owner to flush when it releases the lock.

On large multi-core systems this scheme can cause the console owner to pile up a lot of unbounded work before it can release the lock, triggering watchdog lockups. This was replaced with a new mechanism in which, upon contention, the task no longer hands its work to the console lock owner and returns, but instead stays around spinning until the lock is available. The heuristics imply a console owner and a waiter such that if multiple CPUs are generating output, the console lock will circulate between them, and none will end up printing output for too long.
[Commit dbdda842fe96]

idr tree optimizations

With the extensions and improvements of the ID allocation API, there is a performance enhancement for ID numbering schemes that don't start at 0, which, according to the patch, account for ~20% of all the kernel users. So by using the new idr functions with the _base() suffix, users can immediately benefit from avoiding unnecessary iterations in the underlying radix tree.
[Commit 6ce711f27500]

arm64: 52-bit physical address support

With ARMv8.2, the physical address space is extended from 48 to 52 bits, so up to 4 pebibytes (PiB) of physical memory can now be addressed.
[Commit fa2a8445b1d3, 193383043f14, 529c4b05a3cb, 787fd1d019b2]
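
As a quick sanity check of that figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name addressable_bytes is made up for the example and has nothing to do with the kernel patches themselves.

```python
# Illustrative arithmetic only; addressable_bytes is a hypothetical helper.
TIB = 2 ** 40  # one tebibyte in bytes
PIB = 2 ** 50  # one pebibyte in bytes


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses expressible with the given width."""
    return 1 << address_bits


old_limit = addressable_bytes(48)
new_limit = addressable_bytes(52)

print(old_limit // TIB, "TiB with 48-bit physical addresses")
print(new_limit // PIB, "PiB with 52-bit physical addresses")
print(new_limit // old_limit, "times larger")
```

Four extra address bits multiply the reachable range by 2^4 = 16, which is how 256 TiB becomes 4 PiB.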

May 07, 2018 05:53 PM

April 30, 2018

Michael Kerrisk (manpages): man-pages-4.16 is released

I've released man-pages-4.16. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 29 contributors. Somewhat more than 160 commits changed around 60 pages. A summary of the changes can be found here.

April 30, 2018 07:28 PM

April 29, 2018

Pavel Machek: Crazy complexity

It's crazy how complex PCs have become. ARMs are not exactly simple with the TrustZone and similar stuff, but.. this is crazy. If you thought you understood the x86 architecture... this is likely to prove you wrong. There's now a non-x86 cpu inside x86 that performs a lot of rather critical functions...
https://eprint.iacr.org/2016/086.pdf
...and shows that SGX indeed is evil.

April 29, 2018 08:52 PM

Pavel Machek: Microsoft sabotaging someone else's computers

My father got himself in a nice trap: he let his Lenovo notebook update to Windows 10. Hard to blame him, as the user interface was confusing on purpose. Now 2 out of 3 USB ports are non-functional (the USB 2 port works, the USB 3 ports don't), and there's no way to fix that. And apparently, Microsoft knew about the problem. Congratulations, Microsoft...

Ouch, and they are also sending people to jail for producing CDs necessary to use licenses they already sold. Microsoft still is evil.

April 29, 2018 08:50 PM

Pavel Machek: O2 attacking their own customers

Just because you are paying for internet service does not mean O2 will not try to replace web-pages with advertising. Ouch. Seems like everyone needs to use https, we need better network-neutrality laws, and probably also class-action lawsuits.

April 29, 2018 08:45 PM

Pavel Machek: Dark design patterns

Got Jolla installed. Ok, it looks cool. But already some unnice things can be seen. You _need_ a Jolla account to install apps. You need to agree to nasty legalese. You are asked for name and password, it looks like that's all, and then it wants to know real name, email address, birthday... Appstore looks cool... but does not list licenses for software being installed. Still better than Android. Miles away from Debian.

It also seems to require login separate from app store login to get the "really" interesting stuff. Unfortunately, I don't know how to get that one.
I'd quite like to get python/gtk to work on Jolla (or maybe Android). If someone knows how to do that, I'd like to know. But I guess running Maemo Leste is easier at the moment.

April 29, 2018 08:43 PM

Pavel Machek: Motorola Droid 4 is now usable

23.4.2018, around 12:34... I realized how unix ttys are sabotaging my attempts to send SMS.. and solved it. So now I have Motorola Droid 4, running 4.17-rc1 kernel, with voice calls working, SMSes, data connection, GPS working and have some basic GUIs to control the stuff. WIFI works. Screen locks, and keyboard map still could be improved. Battery life will probably not be great. But hey, it's a start -- I have GNU/Linux working on a cellphone. More precisely Maemo Leste, based on Devuan, based on Debian. Sure, some kernel patches are still needed, and there's a lot more work to do in userland... Today, Microsoft sold out last Windows Mobile phones. I guess that's just a coincidence.

April 29, 2018 08:40 PM

April 23, 2018

Pete Zaitcev: Azure Sphere

Oh Microsoft, you card:

[Azure Sphere OS] combines security innovations pioneered in Windows, a security monitor, and a custom Linux kernel [...]

Kinda like Oracle shipping "Unbreakable Linux". Still in the "embrace" phase.

April 23, 2018 06:37 PM

Daniel Vetter: Linux Kernel Maintainer Statistics

As part of preparing my last two talks at LCA on the kernel community, “Burning Down the Castle” and “Maintainers Don’t Scale”, I have looked into how the Kernel’s maintainer structure can be measured. One very interesting approach is looking at the pull request flows, for example done in the LWN article “How 4.4’s patches got to the mainline”. Note that in the linux kernel process, pull requests are only used to submit development from entire subsystems, not individual contributions. What I’m trying to work out here isn’t so much the overall patch flow, but focusing on how maintainers work, and how that’s different in different subsystems.

Methodology

In my presentations I claimed that the kernel community is suffering from too steep hierarchies. And worse, the people in power don’t bother to apply the same rules to themselves as anyone else, especially around purported quality enforcement tools like code reviews.

For our purposes a contributor is someone who submits a patch to a mailing list, but needs a maintainer to apply it for them, to get the patch merged. A maintainer on the other hand can directly apply a patch to a subsystem tree, and will then send pull requests up the maintainer hierarchy until the patch lands in Linus’ tree. This is relatively easy to measure accurately in git: If the recorded patch author and committer match, it’s a maintainer self-commit, if they don’t match it’s a contributor commit.

There’s a few annoying special cases to handle:

Also note that this is a property of each commit - the same person can be both a maintainer and a contributor, depending upon how each of their patches gets merged.

The ratio of maintainer self-commits compared to overall commits then gives us a crude, but fairly useful metric to measure how steeply the kernel community overall is organized.
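
To make the author/committer rule concrete, here is a minimal sketch in Python. It is not the script used for this article: the Commit type, the example addresses and the choice to compare identities case-insensitively are assumptions made for illustration, and it handles none of the special cases.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical stand-in for the author/committer fields git records."""
    author: str
    committer: str


def is_maintainer_self_commit(commit: Commit) -> bool:
    # Rule from the text: author and committer match means a maintainer
    # self-commit, otherwise it is a contributor commit.
    # Case-insensitive comparison is an assumption of this sketch.
    return commit.author.strip().lower() == commit.committer.strip().lower()


def self_commit_ratio(commits: list) -> float:
    """Share of commits whose author also committed them."""
    if not commits:
        return 0.0
    own = sum(1 for c in commits if is_maintainer_self_commit(c))
    return own / len(commits)


# Made-up example history: two self-commits out of four commits.
history = [
    Commit("alice@example.org", "alice@example.org"),
    Commit("bob@example.org", "alice@example.org"),
    Commit("carol@example.org", "alice@example.org"),
    Commit("alice@example.org", "alice@example.org"),
]
print(f"maintainer self-commit ratio: {self_commit_ratio(history):.0%}")
```

Because the rule is evaluated per commit, the same person shows up as a maintainer for some patches and as a contributor for others, exactly as described above.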

Measuring review is much harder. For contributor commits review is not recorded consistently. Many maintainers forgo adding an explicit Reviewed-by tag since they’re adding their own Signed-off-by tag anyway. And since that’s required for all contributor commits, it’s impossible to tell whether a patch has seen formal review before merging. A reasonable assumption though is that maintainers actually look at stuff before applying. For a minimal definition of review, “a second person looked at the patch before merging and deemed the patch a good idea” we can assume that merged contributor patches have a review ratio of 100%. Whether that’s a full formal review or not can unfortunately not be measured with the available data.

A different story is maintainer self-commits - if there is no tag indicating review by someone else, then either it didn’t happen, or the maintainer felt it’s not important enough work to justify the minimal effort to record it. Either way, a patch where the git author and committer match, and which sports no review tags in the commit message, strongly suggests it has indeed seen none.

An objection would be that these patches get reviewed by the next maintainer up, when the pull request gets merged. But there’s well over a thousand such patches each kernel release, and most of the pull requests containing them go directly to Linus in the 2 week long merge window, when the over 10k feature patches of each kernel release land in the mainline branch. It is unrealistic to assume that Linus carefully reviews hundreds of patches himself in just those 2 weeks, while getting hammered by pull requests all around. Similar considerations apply at a subsystem level.

For counting reviews I looked at anything that indicates some kind of patch review, even very informal ones, to stay consistent with the implied oversight the maintainer’s Signed-off-by line provides for merged contributor patches. I therefore included both Reviewed-by and Acked-by tags, including a plethora of misspelled and combined versions of the same.
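
A rough idea of what such tag matching could look like, as a sketch only: the regular expression below is an assumption that covers case variations of Reviewed-by and Acked-by, not the full set of misspelled and combined tags handled for this article, and the helper names are hypothetical.

```python
import re

# Matches "Reviewed-by:" or "Acked-by:" at the start of a line, in any case.
# Misspelled or combined tags are NOT covered by this simplified pattern.
REVIEW_TAG = re.compile(
    r"^\s*(?:reviewed|acked)-by\s*:\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def reviewers(message: str) -> list:
    """Return the identities named in review-like tags of a commit message."""
    return [match.group(1) for match in REVIEW_TAG.finditer(message)]


def reviewed_by_someone_else(message: str, author: str) -> bool:
    # A tag naming the author does not count as review by a second person.
    return any(r.lower() != author.lower() for r in reviewers(message))


# Made-up commit messages for illustration.
reviewed = (
    "drm/example: fix a thing\n\n"
    "Signed-off-by: Alice <alice@example.org>\n"
    "Reviewed-by: Bob <bob@example.org>\n"
)
unreviewed = (
    "drm/example: fix another thing\n\n"
    "Signed-off-by: Alice <alice@example.org>\n"
)
author = "Alice <alice@example.org>"
print(reviewed_by_someone_else(reviewed, author))
print(reviewed_by_someone_else(unreviewed, author))
```

Signed-off-by is deliberately not treated as a review tag here, since a maintainer self-commit always carries the maintainer's own sign-off.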

The scripts also keep track of how pull requests percolate up the hierarchy, which allows filtering on a per-subsystem level. Commits in topic branches are accounted to the subsystem that first lands in Linus’ tree. That’s fairly arbitrary, but simplest to implement.

Last few years of GPU subsystem history

Since I’ve pitched the GPU subsystem against the kernel at large in my recent talks, let’s first look at what things look like in graphics:

Fig. 1: GPU total commits, maintainer self-commits and reviewed maintainer self-commits
Fig. 2: GPU percentage maintainer self-commits and reviewed maintainer self-commits

In absolute numbers it’s clear that graphics has grown tremendously over the past few years, much faster than the kernel at large. Depending upon the metric you pick, the GPU subsystem has grown from being 3% of the kernel to about 10%, and is now trading spots for 2nd largest subsystem with arm-soc and staging (depending on who’s got a big pull for that release).

Maintainer commits keep up with GPU subsystem growth

The relative numbers tell a different story. First, commit rights and the fairly big roll out of group maintainership we’ve done in the past 2 years aren’t extreme by historical graphics subsystem standards. We’ve always had around 30-40% maintainer self-commits. There’s a bit of a downward trend in the years leading towards v4.4, due to the massive growth of the i915 driver, and our failure to add more maintainers and committers for a few releases. Adding lots more committers and creating bigger maintainer groups from v4.5 on forward, first for the i915 driver, then to cope with the influx of new small drivers, brought us back to the historical trend line.

There’s another dip happening in the last few kernels, due to AMD bringing in a big new team of contributors to upstream. The dip was even more pronounced in v4.15, the release in which the entirely rewritten DC display driver for AMD GPUs landed. The AMD team is already using a committer model for their staging and internal trees, but not (yet) committing directly to their upstream branch. There’s a few process holdups, mostly around the CI flow, that need to be fixed first. As soon as that’s done I expect this recent dip to be over again.

In short, even when facing big growth like the GPU subsystem has, it’s very much doable to keep training new maintainers to keep up with the increased demand.

Review of maintainer self-commits established in the GPU subsystem

Looking at relative changes in how consistently maintainer self-commits are reviewed, there’s a clear growth from mostly no review to 80+% of all maintainer self-commits having seen some formal oversight. We didn’t just keep up with the growth, but scaled faster and managed to make review a standard practice. Most of the drivers, and all the core code, are now consistently reviewed. Even for tiny drivers with small to single person teams we’ve managed to pull this off, through combining them into larger teams run with a group maintainership model.

Last few years of kernel w/o GPU history

Fig. 3: kernel w/o GPU maintainer self-commits and reviewed maintainer self-commits
Fig. 4: kernel w/o GPU percentage maintainer self-commits and reviewed maintainer self-commits

Kernel w/o graphics is an entirely different story. Overall, review is much less a thing that happens, with only about 30% of all maintainer self-commits having any indication of oversight. The low ratio of maintainer self-commits is why I removed the total commit number from the absolute graph - it would have dwarfed the much more interesting data on self-commits and reviewed self-commits. The positive thing is that there’s at least a consistent, if very small upward trend in maintainer self-commit reviews, both in absolute and relative numbers. But it’s very slow, and will likely take decades until there’s no longer a double standard on review between contributors and maintainers.

Maintainers are not keeping up with the kernel growth overall

Much more worrying is the trend on maintainer self-commits. Both in absolute, and much more in relative numbers, there’s a clear downward trend, going from around 25% to below 15%. This indicates that the kernel community fails to mentor and train new maintainers at a pace sufficient to keep up with growth. Current maintainers are ever more overloaded, leaving ever less time for them to write patches of their own and get them merged.

Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged and not contributing anything of their own. The kernel community imploding under its own bureaucratic weight being the likely outcome of that.

This is a huge contrast to the “everything is getting better, bigger, and the kernel community is very healthy” fanfare touted at keynotes and the yearly kernel report. In my opinion, the kernel community does not look at all like a healthy community that is coping well with its growth, even when ignoring all the issues around conduct that I’ve raised.

It is also a huge contrast to what we’ve experienced in the GPU subsystem since aggressively rolling out group maintainership starting with the v4.5 release; by spreading the bureaucratic side of applying patches over many more people, maintainers have much more time to create their own patches and get them merged. More crucially, experienced maintainers can focus their limited review bandwidth on the big architectural design questions since they won’t get bogged down in the minutiae of every single simple patch.

4.16 by subsystem

Let’s zoom into how this all looks at a subsystem level, looking at just the recently released 4.16 kernel.

Most subsystems have unsustainable maintainer ratios

Trying to come up with a reasonable list of subsystems that have high maintainer commit ratios is tricky; some rather substantial pull requests are essentially just maintainers submitting their own work, giving them an easy 100% score. But of course that’s just an outlier in the larger scope of the kernel overall having a maintainer self-commit ratio of just 15%. To get a more interesting list of subsystems we need to look at only those with a group of regular contributors and more than just 1 maintainer. A fairly arbitrary cut-off of 200 commits or more in total seems to get us there, yielding the following top ten list:

subsystem             total commits   maintainer self-commits   maintainer ratio
GPU                   1683            614                       36%
KVM                   257             91                        35%
arm-soc               885             259                       29%
linux-media           422             111                       26%
tip (x86, core, …)    792             125                       16%
linux-pm              201             31                        15%
staging               650             61                        9%
linux-block           249             20                        8%
sound                 351             26                        7%
powerpc               235             16                        7%
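
For readers who want to replay the numbers, the ratio column can be recomputed from the two commit columns. The following Python sketch copies the figures from the table above; the variable and function names are made up for the example, and it is not the tooling used to produce the table.

```python
CUTOFF = 200  # minimum total commits, the "fairly arbitrary" cut-off from the text

# (subsystem, total commits, maintainer self-commits), copied from the table.
rows = [
    ("GPU", 1683, 614),
    ("KVM", 257, 91),
    ("arm-soc", 885, 259),
    ("linux-media", 422, 111),
    ("tip (x86, core, …)", 792, 125),
    ("linux-pm", 201, 31),
    ("staging", 650, 61),
    ("linux-block", 249, 20),
    ("sound", 351, 26),
    ("powerpc", 235, 16),
]


def maintainer_ratio(total: int, self_commits: int) -> float:
    """Maintainer self-commits as a fraction of all commits."""
    return self_commits / total


# Keep subsystems at or above the cut-off, highest ratio first.
ranked = sorted(
    (row for row in rows if row[1] >= CUTOFF),
    key=lambda row: maintainer_ratio(row[1], row[2]),
    reverse=True,
)

for name, total, own in ranked:
    print(f"{name:<20} {total:>5} {own:>4} {maintainer_ratio(total, own):>4.0%}")
```

Note how close linux-pm sits to the cut-off: with 201 commits it only just makes the list.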

In short, there are very few places where it’s easier to become a maintainer than the already rather low figure of roughly 15% that the kernel scores overall. Outside of these few subsystems, the only realistic way is to create a new subsystem, somehow get it merged, and become its maintainer. In most subsystems being a maintainer is an elite status, and the historical trends suggest it will only become more so. If this trend isn’t reversed, then maintainer overload will get a lot worse in the coming years.

Of course subsystem maintainers are expected to spend more time reviewing and managing other people’s contributions. When looking at individual maintainers it would be natural to expect a slow decline in their own contributions in patch form, and hence a decline in self-commits. But below them a new set of maintainers should grow and receive mentoring, and those more junior maintainers would focus more on their own work. That sustainable maintainer pipeline seems to not be present in many kernel subsystems, drawing a bleak future for them.

Much more interesting are the review statistics, split up by subsystem. Again we need a cut-off for noise and outliers. The big outliers here are all the pull requests and trees that have seen zero review, not even any Acked-by tags. As long as we only look at positive examples we don’t need to worry about those. A rather low cut-off of at least 10 maintainer self-commits takes care of other random noise:

subsystem      total commits   maintainer self-commits   maintainer review ratio
f2fs           72              12                        100%
XFS            105             78                        100%
arm64          166             23                        91%
GPU            1683            614                       83%
linux-mtd      99              12                        75%
KVM            257             91                        74%
linux-pm       201             31                        71%
pci            145             37                        65%
remoteproc     19              14                        64%
clk            139             14                        64%
dma-mapping    63              60                        60%

Yes, XFS and f2fs have their shit together. More interesting is how wide the spread in the filesystem code is; there’s a bunch of substantial fs pulls with a review ratio of flat out zero. Not even a single Acked-by. XFS on the other hand insists on full formal review of everything - I spot checked the history a bit. f2fs is a bit of an outlier with 4.16, barely getting above the cut-off. Usually it has fewer patches and would have been excluded.

Everyone not in the top ten taken together has a review ratio of 27%.

Review double standards in many big subsystems

Looking at the big subsystems with multiple maintainers and huge groups of contributors - I picked 500 patches as the cut-off - there’s some really low review ratios: Staging has 7%, networking 9% and tip scores 10%. Only arm-soc is close to the top ten, with 50%, at the 14th position.

Staging having no standard is kinda the point, but the other core subsystems eschewing review is rather worrisome. More than 9 out of 10 maintainer self-commits merged into these core subsystems do not carry any indication that anyone else ever looked at the patch and deemed it a good idea. The only other subsystem with more than 500 commits is the GPU subsystem, at 4th position with an 83% review ratio.

Compared to maintainers overall the review situation is looking a lot less bleak. There’s a sizeable group of subsystems who at least try to make this work, by having similar review criteria for maintainer self-commits as for normal contributors. This is also supported by the rather slow, but steady overall increase of reviews when looking at historical trends.

But there’s clearly other subsystems where review only seems to be a gauntlet inflicted on normal contributors, entirely optional for maintainers themselves. Contributors cannot avoid review, because they can’t commit their own patches. When maintainers outright ignore review for most of their patches this creates a clear double standard between maintainers and mere contributors.

One year ago I wrote “Review, not Rocket Science” on how to roll out review in your subsystem. Looking at this data here I can close with an even shorter version:

What would Dave Chinner do?

Thanks a lot to Daniel Stone, Dave Chinner, Eric Anholt, Geoffrey Huntley, Luce Carter and Sean Paul for reading and commenting on drafts of this article.

April 23, 2018 12:00 AM

April 20, 2018

Kees Cook: UEFI booting and RAID1

I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.

With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running EFI/BOOT/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.

So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.

The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:

    # mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1
    mdadm: Note: this array has metadata at the start and may not be suitable as a boot device. If you plan to store '/boot' on this device please ensure that your boot-loader understands md/v1.x metadata, or use --metadata=0.90

Reading from the mdadm man page:

    -e, --metadata=
        ...
        1, 1.0, 1.1, 1.2 default
            Use the new version-1 format superblock. This has fewer restrictions. It can easily be moved between hosts with different endian-ness, and a recovery operation can be checkpointed and restarted. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred 1.x format). "default" is equivalent to "1.2".

First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:

# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...
# file -s /dev/sda1
/dev/sda1: Linux Software RAID version 1.2 ...

So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1
...
# mkfs.fat -F32 /dev/md0
# dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd
00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac    FAT32   ...w|.
# file -s /dev/sda1
/dev/sda1: ... FAT (32 bit)
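The two checks done by hand above (is there an md superblock 4K in, and is a FAT32 boot sector visible at the very start) can be sketched in Python. This is a minimal sketch under stated assumptions: the md magic value is read off the hexdump above (bytes fc 4e 2b a9, little-endian), the FAT32 type string is taken to sit at offset 82 of the boot sector (consistent with the dd skip=80 output), and the helper name `describe_member` is hypothetical. The type string is only a hint, not a full filesystem probe.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC     # stored on disk as fc 4e 2b a9 (little-endian)
MD_V1_2_OFFSET = 4096        # version 1.2 superblock sits 4K from the start
FAT32_LABEL_OFFSET = 82      # file-system type string in a FAT32 boot sector


def describe_member(head: bytes) -> dict:
    """Report what is visible at the start of a RAID member partition.

    `head` holds the first bytes of the partition (at least 4100 bytes
    are needed for the md check). Hypothetical helper, illustration only.
    """
    label = head[FAT32_LABEL_OFFSET:FAT32_LABEL_OFFSET + 8]
    fat32_visible = label == b"FAT32   "
    md_1_2 = False
    if len(head) >= MD_V1_2_OFFSET + 4:
        (magic,) = struct.unpack_from("<I", head, MD_V1_2_OFFSET)
        md_1_2 = magic == MD_SB_MAGIC
    return {"fat32_visible": fat32_visible, "md_1_2_superblock": md_1_2}


# Synthetic stand-ins for the two layouts shown above.
with_1_2 = bytearray(8192)
struct.pack_into("<I", with_1_2, MD_V1_2_OFFSET, MD_SB_MAGIC)
with_1_0 = bytearray(512)
with_1_0[FAT32_LABEL_OFFSET:FAT32_LABEL_OFFSET + 8] = b"FAT32   "

print(describe_member(bytes(with_1_2)))
print(describe_member(bytes(with_1_0)))
```

On a real system the input would be the first few KiB read from the member partition (for example /dev/sda1), which needs root.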

Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.

However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, it returns nothing, and tries to run efibootmgr with an empty -d argument:

Installing for x86_64-efi platform.
efibootmgr: option requires an argument -- 'd'
...
grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted.
Failed: grub-install --target=x86_64-efi
WARNING: Bootloader is not properly installed, system may not be bootable

Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64

So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).
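One way to see whether such an external write has happened is to compare the member partitions directly before assembling the array. The following is a minimal Python sketch, not something from my actual setup: the function name `first_difference` is made up, and it assumes the caller passes a `length` that covers only the data area, since with 1.0 metadata each member carries its own superblock near the end and those are expected to differ.

```python
import os
import tempfile


def first_difference(path_a, path_b, length, chunk=1 << 20):
    """Return the offset of the first differing byte within the first
    `length` bytes of two files or devices, or None if they match."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        offset = 0
        while offset < length:
            want = min(chunk, length - offset)
            da, db = a.read(want), b.read(want)
            if da != db:
                # Narrow the mismatch down to a single byte offset.
                for i, (x, y) in enumerate(zip(da, db)):
                    if x != y:
                        return offset + i
                return offset + min(len(da), len(db))  # one side ended early
            if not da:
                break  # both inputs ended before `length`
            offset += len(da)
    return None


# Demonstration on two small temporary files standing in for the members.
tmpdir = tempfile.mkdtemp()
member_a = os.path.join(tmpdir, "member_a.img")
member_b = os.path.join(tmpdir, "member_b.img")
with open(member_a, "wb") as f:
    f.write(b"x" * 100)
with open(member_b, "wb") as f:
    f.write(b"x" * 50 + b"y" + b"x" * 49)  # simulated external write
print(first_difference(member_a, member_b, 100))
```

A mismatch only says that the copies diverged, not which one is "right", which is why picking a preferred member and resyncing from it is the interesting part.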

To deal with this “external write” situation, I see some solutions:

Since mdadm has the “--update=resync” assembly option, I can actually do the latter option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:

ARRAY <ignore> metadata=1.0 UUID=123...

(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:

/dev/md100 /boot/efi vfat noauto,defaults 0 0

And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:

[Unit]
Description=Resync /boot/efi RAID
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync
ExecStart=/bin/mount /boot/efi
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target

(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /etc/mdadm/mdadm.conf.)

If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!

And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:

# truncate --size 512M /boot/efi.img
# losetup -f --show /boot/efi.img
/dev/loop0
# mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1

And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

April 20, 2018 12:34 AM

April 16, 2018

Pete Zaitcev: Suddenly Liferea tonight

Liferea irritated me for many years with a strange behavior when dragging a subscription. You mouse down on the feed, it becomes selected — so far so good. Then you drag it somewhere — possibly far off screen, making the view scroll — then drop it. Drops fine, updates the DB, model, and the view fine. But! The selection then jumps to a completely random feed somewhere.

Well, it's not actually random. What happens instead, the GtkTreeView implements DnD by removing a row, then re-inserting it. When a selected row is removed, obviously the selection has to disappear, but instead it's set to the next row after the removed one. I suppose I may be uniquely vulnerable to this because I have 300+ feeds and I drag them around all the time. If Liferea weren't kind enough to remember the preferred order, this would not matter so much.

I meant to fix this for a long time, but somehow some wrong information got stuck in my head: I thought that Liferea was written in C++, so it took years to gather the motivation. Imagine my surprise when I found plain old C. I spent a good chunk of Sunday figuring out GTK's tree view thingie, but in the end it was quite simple.

April 16, 2018 03:09 PM

April 13, 2018

Kees Cook: security things in Linux v4.16

Previously: v4.15.

Linux kernel v4.16 was released last week. I really should write these posts in advance, otherwise I get distracted by the merge window. Regardless, here are some of the security things I think are interesting:

KPTI on arm64

Will Deacon, Catalin Marinas, and several other folks brought Kernel Page Table Isolation (via CONFIG_UNMAP_KERNEL_AT_EL0) to arm64. While most ARMv8+ CPUs were not vulnerable to the primary Meltdown flaw, the Cortex-A75 does need KPTI to be safe from memory content leaks. It’s worth noting, though, that KPTI does protect other ARMv8+ CPU models from having privileged register contents exposed. So, whatever your threat model, it’s very nice to have this clean isolation between kernel and userspace page tables for all ARMv8+ CPUs.

hardened usercopy whitelisting

While whole-object bounds checking was implemented in CONFIG_HARDENED_USERCOPY already, David Windsor and I finished another part of the porting work of grsecurity’s PAX_USERCOPY protection: usercopy whitelisting. This further tightens the scope of slab allocations that can be copied to/from userspace. Now, instead of allowing all objects in slab memory to be copied, only the whitelisted areas (where a subsystem has specifically marked the memory region allowed) can be copied. For example, only the auxv array out of the larger mm_struct.
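The whitelist check itself is conceptually a window test inside each slab object. Here is a toy Python model of it, a sketch rather than the kernel's implementation: the field names mirror the useroffset/usersize arguments of kmem_cache_create_usercopy(), while the class, the helper name, and the sizes used in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlabCache:
    """Toy model of a slab cache with a usercopy whitelist window."""
    name: str
    object_size: int
    useroffset: int   # start of the whitelisted region inside each object
    usersize: int     # length of the whitelisted region (0 = nothing allowed)


def usercopy_allowed(cache: SlabCache, offset: int, n: int) -> bool:
    """Allow copying n bytes at `offset` only if the whole range stays
    inside the whitelisted window of the object."""
    if n < 0 or offset < 0:
        return False
    return (offset >= cache.useroffset
            and offset + n <= cache.useroffset + cache.usersize)


# Hypothetical numbers: a 1024-byte object with a 368-byte window at 320.
example = SlabCache("example_cache", object_size=1024, useroffset=320, usersize=368)
print(usercopy_allowed(example, 320, 368))   # exactly the window
print(usercopy_allowed(example, 0, 1024))    # whole object
print(usercopy_allowed(example, 600, 100))   # runs past the end of the window
```

A cache created without a window (usersize of 0) rejects every non-empty copy in this model, which is the point: being in slab memory is no longer enough to be copyable.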

As mentioned in the first commit from the series, this reduces the scope of slab memory that could be copied out of the kernel in the face of a bug to under 15%. As can be seen, one area of remaining work is the kmalloc regions. Those are regularly used for copying things in and out of userspace, but they’re also used for small simple allocations that aren’t meant to be exposed to userspace. Working to separate these kmalloc users needs some careful auditing.

Total Slab Memory:   48074720
Usercopyable Memory:  6367532  13.2%

task_struct          0.2%     4480/1630720
RAW                  0.3%     300/96000
RAWv6                2.1%     1408/64768
ext4_inode_cache     3.0%     269760/8740224
dentry              11.1%     585984/5273856
mm_struct           29.1%     54912/188448
kmalloc-8          100.0%     24576/24576
kmalloc-16         100.0%     28672/28672
kmalloc-32         100.0%     81920/81920
kmalloc-192        100.0%     96768/96768
kmalloc-128        100.0%     143360/143360
names_cache        100.0%     163840/163840
kmalloc-64         100.0%     167936/167936
kmalloc-256        100.0%     339968/339968
kmalloc-512        100.0%     350720/350720
kmalloc-96         100.0%     455616/455616
kmalloc-8192       100.0%     655360/655360
kmalloc-1024       100.0%     812032/812032
kmalloc-4096       100.0%     819200/819200
kmalloc-2048       100.0%     1310720/1310720

This series took quite a while to land (you can see David’s original patch date as back in June of last year). Partly this was due to having to spend a lot of time researching the code paths so that each whitelist could be explained for commit logs, partly due to making various adjustments from maintainer feedback, and partly due to the short merge window in v4.15 (when it was originally proposed for merging) combined with some last-minute glitches that made Linus nervous. After baking in linux-next for almost two full development cycles, it finally landed. (Though be sure to disable CONFIG_HARDENED_USERCOPY_FALLBACK to gain enforcement of the whitelists — by default it only warns and falls back to the full-object checking.)

automatic stack-protector

While the stack-protector features of the kernel have existed for quite some time, they have never been enabled by default. This was mainly due to needing to evaluate compiler support for the feature, and Kconfig didn’t have a way to check the compiler features before offering CONFIG_* options. As a defense technology, the stack protector is pretty mature. Having it on by default would have greatly reduced the impact of things like the BlueBorne attack (CVE-2017-1000251), as fewer systems would have lacked the defense.

After spending quite a bit of time fighting with ancient compiler versions (*cough*GCC 4.4.4*cough*), I landed CONFIG_CC_STACKPROTECTOR_AUTO, which is default on, and tries to use the stack protector if it is available. The implementation of the solution, however, did not please Linus, though he allowed it to be merged. In the future, Kconfig will gain the knowledge to make better decisions which lets the kernel expose the availability of (the now default) stack protector directly in Kconfig, rather than depending on rather ugly Makefile hacks.

execute-only memory for PowerPC

Similar to the Protection Keys (pkeys) hardware support that landed in v4.6 for x86, Ram Pai landed pkeys support for Power7/8/9. This should expand the scope of what’s possible in the dynamic loader to avoid having arbitrary read flaws allow an exploit to read out all of executable memory in order to find ROP gadgets.

That’s it for now; let me know if you think I should add anything! The v4.17 merge window is open. :)

Edit: added details on ARM register leaks, thanks to Daniel Micay.

Edit: added section on protection keys for POWER, thanks to Florian Weimer.

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

April 13, 2018 12:04 AM

April 11, 2018

James Morris: Linux Security Summit North America 2018 CFP Announced

The CFP for the 2018 Linux Security Summit North America (LSS-NA) is announced.

LSS will be held this year as two separate events, one in North America
(LSS-NA), and one in Europe (LSS-EU), to facilitate broader participation in
Linux Security development. Note that this CFP is for LSS-NA; a separate CFP
will be announced for LSS-EU in May. We encourage everyone to attend both
events.

LSS-NA 2018 will be held in Vancouver, Canada, co-located with the Open Source Summit.

The CFP closes on June 3rd and the event runs from 27th-28th August.

To make a CFP submission, click here.

April 11, 2018 11:29 PM

April 10, 2018

Linux Plumbers Conference: Welcome to the 2018 LPC blog

Planning for the 2018 Linux Plumbers Conference is well underway at this point. The planning committee will be posting various informational blurbs here, including information on hotels, microconference acceptance, evening events, scheduling, and so on. Next up will be a “call for proposals” that should appear soon.

LPC will be held at the Sheraton Vancouver Wall Center in Vancouver, British Columbia, Canada, November 13-15, colocated with the Linux Kernel Summit.

April 10, 2018 04:40 PM

April 06, 2018

Pete Zaitcev: With Blockchain Technology

Recently it became common to see mockery of startup founders who add "blockchain" to something, then sell it to gullible VCs and reap the green harvest. Apparently it has become quite a thing. But now they went a step further.

The other day I was watching some anime at Crunchyroll, when a commercial came up. It pitched a fantasy sports site "with blockchain technology" and smart contracts. The remarkable part about it is, it wasn't aimed at investors. It was a consumer advertisement. Its creators apparently expect members of the public — who play fantasy sports, no less — to know that blockchain exists and think about it in positive terms.

April 06, 2018 05:19 AM

April 05, 2018

Matthew Garrett: Linux kernel lockdown and UEFI Secure Boot

David Howells recently published the latest version of his kernel lockdown patchset. This is intended to strengthen the boundary between root and the kernel by imposing additional restrictions that prevent root from modifying the kernel at runtime. It's not the first feature of this sort - /dev/mem no longer allows you to overwrite arbitrary kernel memory, and you can configure the kernel so only signed modules can be loaded. But the present state of things is that these security features can be easily circumvented (by using kexec to modify the kernel security policy, for instance).

Why do you want lockdown? If you've got a setup where you know that your system is booting a trustworthy kernel (you're running a system that does cryptographic verification of its boot chain, or you built and installed the kernel yourself, for instance) then you can trust the kernel to keep secrets safe from even root. But if root is able to modify the running kernel, that guarantee goes away. As a result, it makes sense to extend the security policy from the boot environment up to the running kernel - it's really just an extension of configuring the kernel to require signed modules.

The patchset itself isn't hugely conceptually controversial, although there's disagreement over the precise form of certain restrictions. But one patch has, because it associates whether or not lockdown is enabled with whether or not UEFI Secure Boot is enabled. There's some backstory that's important here.

Most kernel features get turned on or off by either build-time configuration or by passing arguments to the kernel at boot time. There's two ways that this patchset allows a bootloader to tell the kernel to enable lockdown mode - it can either pass the lockdown argument on the kernel command line, or it can set the secure_boot flag in the bootparams structure that's passed to the kernel. If you're running in an environment where you're able to verify the kernel before booting it (either through cryptographic validation of the kernel, or knowing that there's a secret tied to the TPM that will prevent the system booting if the kernel's been tampered with), you can turn on lockdown.

There's a catch on UEFI systems, though - you can build the kernel so that it looks like an EFI executable, and then run it directly from the firmware. The firmware doesn't know about Linux, so can't populate the bootparam structure, and there's no mechanism to enforce command lines so we can't rely on that either. The controversial patch simply adds a kernel configuration option that automatically enables lockdown when UEFI secure boot is enabled and otherwise leaves it up to the user to choose whether or not to turn it on.
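The resulting policy can be summarised in a few lines. This is only a sketch of the decision as described above, written in Python for brevity: the function and parameter names are invented, the command line is matched as a bare "lockdown" word, and none of it is taken from the patchset's actual code.

```python
def lockdown_enabled(cmdline: str,
                     bootparams_secure_boot: bool,
                     efi_secure_boot: bool,
                     tie_to_efi_secure_boot: bool) -> bool:
    """Model of when lockdown ends up enabled at boot.

    cmdline                 kernel command line as passed by the bootloader
    bootparams_secure_boot  secure_boot flag set by a Linux-aware bootloader
    efi_secure_boot         firmware reports UEFI Secure Boot as enabled
    tie_to_efi_secure_boot  the build-time option from the controversial patch
    """
    if "lockdown" in cmdline.split():
        return True
    if bootparams_secure_boot:
        return True
    # Kernel run directly from the firmware: no bootloader populated
    # bootparams, so only the configuration option can turn lockdown on.
    return tie_to_efi_secure_boot and efi_secure_boot


print(lockdown_enabled("root=/dev/sda2 ro", False, True, True))
print(lockdown_enabled("root=/dev/sda2 ro", False, True, False))
print(lockdown_enabled("root=/dev/sda2 ro lockdown", False, False, False))
```

The middle case is the one the argument is about: secure boot is on, nothing told the kernel, and without the configuration option lockdown stays off.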

Why do we want lockdown enabled when booting via UEFI secure boot? UEFI secure boot is designed to prevent the booting of any bootloaders that the owner of the system doesn't consider trustworthy[1]. But a bootloader is only software - the only thing that distinguishes it from, say, Firefox is that Firefox is running in user mode and has no direct access to the hardware. The kernel does have direct access to the hardware, and so there's no meaningful distinction between what grub can do and what the kernel can do. If you can run arbitrary code in the kernel then you can use the kernel to boot anything you want, which defeats the point of UEFI Secure Boot. Linux distributions don't want their kernels to be used as part of an attack chain against other distributions or operating systems, so they enable lockdown (or equivalent functionality) for kernels booted this way.

So why not enable it everywhere? There's a couple of reasons. The first is that some of the features may break things people need - for instance, some strange embedded apps communicate with PCI devices by mmap()ing resources directly from sysfs[2]. This is blocked by lockdown, which would break them. Distributions would then have to ship an additional kernel that had lockdown disabled (it's not possible to just have a command line argument that disables it, because an attacker could simply pass that), and users would have to disable secure boot to boot that anyway. It's easier to just tie the two together.

The second is that it presents a promise of security that isn't really there if your system didn't verify the kernel. If an attacker can replace your bootloader or kernel then the ability to modify your kernel at runtime is less interesting - they can just wait for the next reboot. Appearing to give users safety assurances that are much less strong than they seem to be isn't good for keeping users safe.

So, what about people whose work is impacted by lockdown? Right now there's two ways to get stuff blocked by lockdown unblocked: either disable secure boot[3] (which will disable it until you enable secure boot again) or press alt-sysrq-x (which will disable it until the next boot). Discussion has suggested that having an additional secure variable that disables lockdown without disabling secure boot validation might be helpful, and it's not difficult to implement that so it'll probably happen.

Overall: the patchset isn't controversial, just the way it's integrated with UEFI secure boot. The reason it's integrated with UEFI secure boot is because that's the policy most distributions want, since the alternative is to enable it everywhere even when it doesn't provide real benefits but does provide additional support overhead. You can use it even if you're not using UEFI secure boot. We should have just called it securelevel.

[1] Of course, if the owner of a system isn't allowed to make that determination themselves, the same technology is restricting the freedom of the user. This is abhorrent, and sadly it's the default situation in many devices outside the PC ecosystem - most of them not using UEFI. But almost any security solution that aims to prevent malicious software from running can also be used to prevent any software from running, and the problem here is the people unwilling to provide that policy to users rather than the security features.
[2] This is how X.org used to work until the advent of kernel modesetting
[3] If your vendor doesn't provide a firmware option for this, run sudo mokutil --disable-validation

April 05, 2018 01:07 AM

April 04, 2018

Pete Zaitcev: Jim Whitehurst on OpenStack in 2018

Remarks of our CEO, as captured in an interview by TechCrunch:

The other major open-source project Red Hat is betting on is OpenStack. That may come as a bit of a surprise, given that popular opinion in the last year or so has shifted against the massive project that wants to give enterprises an open source on-premise alternative to AWS and other cloud providers. “There was a sense among big enterprise tech companies that OpenStack was going to be their savior from Amazon,” Whitehurst said. “But even OpenStack, flawlessly executed, put you where Amazon was five years ago. If you’re Cisco or HP or any of those big OEMs, you’ll say that OpenStack was a disappointment. But from our view as a software company, we are seeing good traction.”

He's over-simplifying things for the constraints of an interview: the last sentence needs unpacking. Why do you think that "traction" happens? Because OpenStack gives its users something that Amazon does not. For example, Swift isn't trying to match features of S3. Attempting to do that would cause the exact lag he's referring to. Instead, Swift works to solve the problem of people who want to own their own data in general. So, it's mostly about the implementation: how to make it scalable, inexpensive, etc. And, of course, keeping it open source, preserving users' freedom to modify. This is why you often see people installing a truncated OpenStack that only has Swift. I'm sure this applies to other parts of OpenStack, in particular the SDN/NFV.

April 04, 2018 04:44 PM

April 03, 2018

Paul E. Mc Kenney: A Linux-kernel memory model!

A big “thank you” to all my partners in LKMM crime, most especially to Jade, Luc, Andrea, and Alan! Jade presented our paper (slides, supplementary material) at ASPLOS, which was well-received. A number of people asked how they could learn more about LKMM, which is what much of this blog post is about.

Approaches to learning LKMM include:


  1. Read the documentation, starting with explanation.txt. This documentation replaces most of the older LWN series.
  2. Go through Ted Cooper's coursework for Portland State University's CS510 Advanced Topics in Concurrency class, taught by Jon Walpole.
  3. Those interested in the history of LKMM might wish to look at my 2017 linux.conf.au presentation (video).
  4. Play with the actual model.

The first three options are straightforward, but playing with the model requires some installation. However, playing with the model is probably key to gaining a full understanding of LKMM, so this installation step is well worth the effort.

Installation instructions may be found here (see the “REQUIREMENTS” section). The ocaml language is a prerequisite, which is fortunately included in many Linux distros. If you choose to install ocaml from source (for example, because you need a more recent version), do yourself a favor and read the instructions completely before starting the build process! Otherwise, you will find yourself learning of the convenient one-step build process only after carrying out the laborious five-step process, which can be a bit frustrating.

Of course, if you come across better methods to quickly, easily, and thoroughly learn LKMM, please do not keep them a secret!

Those wanting a few rules of thumb safely approximating LKMM should look at slide 96 (PDF page 78) of the aforementioned linux.conf.au presentation. Please note that material earlier in the presentation is required to make sense of the three rules of thumb.

We also got some excellent questions during Jade's ASPLOS talk, mainly from the renowned and irrepressible Sarita Adve:

There were of course a great many other excellent presentations at ASPLOS, but that is a topic for another post!

April 03, 2018 06:35 PM

April 02, 2018

Pete Zaitcev: Wayland versus Glib in Liferea on F27

I decided to build Liferea over the weekend, and the build crashes at the introspection phase.

Apparently, GTK+ programs are set up to introspect themselves: basically the binary can look at its own types or whatnot, then output the result. I'm not quite clear what the purpose of that is, the online docs imply that it's for API documentation mostly. Anyhow, the build runs the liferea binary itself, with arguments that make it run the introspection, then this happens:

(gdb) where
#0  0x00007fa90a2a93b0 in wl_list_insert_list ()
    at /lib64/libwayland-server.so.0
#1  0x00007fa90a2a4e6f in wl_priv_signal_emit ()
    at /lib64/libwayland-server.so.0
#2  0x00007fa90a2a5477 in wl_display_destroy ()
    at /lib64/libwayland-server.so.0
#3  0x00007fa916d163d9 in \
  WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() () at \
  /lib64/libwebkit2gtk-4.0.so.37
#4  0x00007fa916d163e9 in \
  WebCore::PlatformDisplayWayland::~PlatformDisplayWayland() () at \
  /lib64/libwebkit2gtk-4.0.so.37
#5  0x00007fa91100cb58 in __run_exit_handlers () at /lib64/libc.so.6
#6  0x00007fa91100cbaa in  () at /lib64/libc.so.6
#7  0x00007fa911e9d367 in  () at /lib64/libgirepository-1.0.so.1
#8  0x00007fa91197d188 in parse_arg.isra () at /lib64/libglib-2.0.so.0
#9  0x00007fa91197d8ca in parse_long_option () at /lib64/libglib-2.0.so.0
#10 0x00007fa91197f2d6 in g_option_context_parse () at \
  /lib64/libglib-2.0.so.0
#11 0x00007fa91197fd84 in g_option_context_parse_strv ()
    at /lib64/libglib-2.0.so.0
#12 0x00007fa912164558 in g_application_real_local_command_line ()
    at /lib64/libgio-2.0.so.0
#13 0x00007fa912164bf6 in g_application_run () at /lib64/libgio-2.0.so.0
#14 0x000000000041b9ff in main (argc=2, argv=0x7fff2e1203d8) at main.c:77

As far as I can tell, despite being asked only to do the introspection, Liferea (unknowingly, through GTK+) pokes Wayland, which sets exit handlers. However, Wayland is never used (introspection, duh), and not initialized completely, so when its exit handlers run, it crashes.

Well, now what?

I suppose the cleanest approach might be to modify Glib so it avoids provoking Wayland when merely introspecting. But honestly I have no clue about desktop apps and do not know where to even start looking.

UPDATE: Much thanks to Branko Grubic, who pointed me to a bug in WebKit. Currently building with this as a workaround:

--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -82,6 +82,7 @@ INTROSPECTION_GIRS = Liferea-3.0.gir
 
 Liferea-3.0.gir: liferea$(EXEEXT)
 INTROSPECTION_SCANNER_ARGS = -I$(top_srcdir)/src --warn-all -......
+INTROSPECTION_SCANNER_ENV = WEBKIT_DISABLE_COMPOSITING_MODE=1
 Liferea_3_0_gir_NAMESPACE = Liferea
 Liferea_3_0_gir_VERSION = 3.0
 Liferea_3_0_gir_PROGRAM = $(builddir)/liferea$(EXEEXT)

April 02, 2018 05:59 PM

March 20, 2018

Davidlohr Bueso: Linux v4.15: Performance Goodies

With the Meltdown and Spectre fiascos, performance isn't a very hot topic at the moment. In fact, with Linux v4.15 released, it is one of the rare times I've seen security win over performance in such a one sided way. Normally security features are tucked away under a kernel config option nobody really uses. Of course the software fixes are also backported in one way or another, so this isn't really specific to the latest kernel release.

All this said, v4.15 came out with a few performance enhancements across subsystems. The following is an unsorted and incomplete list of changes that went in. Note that the term 'performance' can be vague in that some gains in one area can negatively affect another, so take everything with a grain of salt and reach your own conclusions.

epoll: scale nested calls

Nested epolls are necessary to allow semantics where a file descriptor in the epoll interested-list is also an epoll instance. Such calls are not all that common, but some real world applications suffered severe performance issues because the implementation relied on global spinlocks, acquired throughout the callbacks in the epoll state machine. By removing them, we can speed up adding fds to the instance as well as polling, such that epoll_wait() can improve by 100x, scaling linearly when increasing amounts of cores block on an event.
[Commit 57a173bdf5ba,  37b5e5212a44]
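For readers who have never seen a nested epoll, here is a minimal Python sketch of the pattern using the standard select.epoll wrapper. It only shows the semantics (an epoll instance registered inside another one), not the kernel change or its performance; the function name `nested_epoll_ready` is made up, and the code returns None on platforms without epoll.

```python
import os
import select


def nested_epoll_ready(timeout: float = 1.0):
    """Return True if an event on a pipe is visible through a nested epoll.

    Returns None where epoll is unavailable (it is Linux-specific).
    """
    if not hasattr(select, "epoll"):
        return None
    r, w = os.pipe()
    inner = select.epoll()
    outer = select.epoll()
    try:
        inner.register(r, select.EPOLLIN)               # pipe -> inner
        outer.register(inner.fileno(), select.EPOLLIN)  # inner -> outer (nesting)
        os.write(w, b"x")                               # make the pipe readable
        events = outer.poll(timeout)
        return any(fd == inner.fileno() and (mask & select.EPOLLIN)
                   for fd, mask in events)
    finally:
        outer.close()
        inner.close()
        os.close(r)
        os.close(w)


print(nested_epoll_ready())
```

The outer instance reports the inner epoll's file descriptor as readable once the inner one has a pending event, which is the path that used to go through the global locks.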


pvspinlock: hybrid fairness paravirt semantics

Locking under virtual environments can be tricky, balancing performance and fairness while avoiding artifacts such as starvation and lock holder/waiter preemption. The current paravirtual queued spinlocks, while free from starvation, can perform less optimally than an unfair lock in guests with CPU over-commitment. With Linux v4.15, guest spinlocks now combine the best of both worlds, with an unfair and a queued mode. The idea is to extend, upon contention, the lock stealing attempt in the slowpath (unfair mode) as long as there are queued MCS waiters present, hence improving performance while avoiding starvation. Kernel build experiments show that as a VM becomes more and more over-committed, the ratio of locks acquired in unfair mode increases.
[Commit 11752adb68a3]


mm,x86: avoid saving/restoring interrupts state in gup

When x86 was converted to use the generic get_user_pages_fast() call, a performance regression was introduced at a microbenchmark level. The generic gup function attempts to walk the page tables without acquiring any locks, such as the mmap semaphore. In order to do this, interrupts must be disabled, which is where the arch-specific and generic flavors differed. The latter must save and restore the current interrupt state, introducing extra overhead when compared to a simple local_irq_enable/disable().
[Commit 5b65c4677a57]



ipc: scale INFO commands

Any syscall used to get info from sysvipc (such as semctl(IPC_INFO) or shmctl(SHM_INFO)) requires internally computing the last ipc identifier. For cases with large amounts of keys, this operation alone can consume a large amount of cycles as it is looked up on demand, in O(N). In order to make this information available in constant time, we keep track of it whenever a new identifier is added.
[Commit 15df03c87983]
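The bookkeeping idea can be sketched in a few lines of Python. This is a toy model, not the kernel's data structure: the class name `IdTable` is invented, and the handling of removal (rescanning only when the current maximum goes away) is an assumption of this sketch rather than something stated above.

```python
class IdTable:
    """Toy identifier table that keeps the largest id available in O(1)."""

    def __init__(self):
        self._ids = set()
        self.max_id = -1          # -1 means "no identifiers in use"

    def add(self, ident: int) -> None:
        self._ids.add(ident)
        if ident > self.max_id:   # constant-time update on insertion
            self.max_id = ident

    def remove(self, ident: int) -> None:
        self._ids.remove(ident)
        if ident == self.max_id:
            # Assumption: only removing the current maximum needs a rescan.
            self.max_id = max(self._ids, default=-1)


table = IdTable()
for ident in (3, 7, 5):
    table.add(ident)
print(table.max_id)
table.remove(7)
print(table.max_id)
```

An INFO-style query then just reads `max_id` instead of walking every identifier.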



ext4:  improve smp scalability for inode generation

The superblock's inode generation number was previously sequentially increased (from a randomly initialized value) and protected by a spinlock, making the usage pattern quite primitive and not very friendly to workloads that are generating files/inodes concurrently. The inode generation path was optimized to remove the lock altogether and simply rely on prandom_u32() such that a fast/seeded pseudo random-number algorithm is used for computing the i_generation.
[Commit 232530680290]
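The difference between the two schemes is easy to show in miniature. The following Python sketch contrasts them; it is an illustration of the shape of the change only. The names are invented, Python's random module stands in for prandom_u32(), and nothing here says anything about real-world scalability.

```python
import random
import threading


class LockedGeneration:
    """Old scheme: sequential counter from a random start, behind a lock."""

    def __init__(self, seed=None):
        self._lock = threading.Lock()
        self._next = random.Random(seed).getrandbits(32)

    def next(self) -> int:
        with self._lock:          # every inode creation serialises here
            value = self._next
            self._next = (self._next + 1) & 0xFFFFFFFF
            return value


def random_generation(rng: random.Random) -> int:
    """New scheme: draw a 32-bit pseudo-random value, no shared counter."""
    return rng.getrandbits(32)


old = LockedGeneration(seed=1)
first, second = old.next(), old.next()
print(second == (first + 1) & 0xFFFFFFFF)
print(0 <= random_generation(random.Random(2)) < 2**32)
```

With the counter gone there is no shared state left to protect on this path, which is what lets concurrent file creation scale.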

March 20, 2018 05:37 PM

March 15, 2018

Pete Zaitcev: The more you tighten your grip

Seen at the webpage for RancherOS:

Everything in RancherOS is a Docker container. We accomplish this by launching two instances of Docker. One is what we call System Docker, the first process on the system. All other system services, like ntpd, syslog, and console, are running in Docker containers. System Docker replaces traditional init systems like systemd, and can be used to launch additional system services.

March 15, 2018 10:33 PM

March 13, 2018

Pete Zaitcev: You Are Not Uber: Only Uber Are Uber

Remember how FAA shut down the business of NavWorx, with heavy monetary and loss-of-use consequences for its customers? Imagine receiving a letter from U.S. Government telling you that your car is not compatible with roads, and therefore you are prohibited from continuing to drive it. Someone sure forgot that the power to regulate is the power to destroy. This week, we have this report by IEEE Spectrum:

IEEE Spectrum can reveal that the SpaceBees are almost certainly the first spacecraft from a Silicon Valley startup called Swarm Technologies, currently still in stealth mode. Swarm was founded in 2016 by one engineer who developed a spacecraft concept for Google and another who sold his previous company to Apple. The SpaceBees were built as technology demonstrators for a new space-based Internet of Things communications network.

The only problem is, the Federal Communications Commission (FCC) had dismissed Swarm’s application for its experimental satellites a month earlier, on safety grounds.

On Wednesday, the FCC sent Swarm a letter revoking its authorization for a follow-up mission with four more satellites, due to launch next month. A pending application for a large market trial of Swarm’s system with two Fortune 100 companies could also be in jeopardy.

Swarm Technologies, based in Menlo Park, Calif., is the brainchild of two talented young aerospace engineers. Sara Spangelo, its CEO, is a Canadian who worked at NASA’s Jet Propulsion Laboratory, before moving to Google in 2016. Spangelo’s astronaut candidate profile at the Canadian Space Agency says that while at Google, she led a team developing a spacecraft concept for its moonshot X division, including both technical and market analyses.

Swarm CFO Benjamin Longmier has an equally impressive resume. In 2015, he sold his near-space balloon company Aether Industries to Apple, before taking a teaching post at the University of Michigan. He is also co-founder of Apollo Fusion, a company producing an innovative electric propulsion system for satellites.

Although a leading supplier in its market, NavWorx was a bit player at the government level. Not that many people have small private airplanes anymore. But Swarm operates at a different level, and may be able to grease enough palms in Washington, D.C., to survive this debacle. Or, they may reconstitute as a notionally new company, then claim a clean start. Again, unlike NavWorx, there's no installed base.

March 13, 2018 03:45 PM

March 11, 2018

Greg Kroah-Hartman: My affidavit in the Geniatech vs. McHardy case

As many people know, last week there was a court hearing in the Geniatech vs. McHardy case. This was a case brought claiming a license violation of the Linux kernel in Geniatech devices in the German court of OLG Cologne.

Harald Welte has written up a wonderful summary of the hearing; I strongly recommend that everyone go read that first.

In Harald’s summary, he refers to an affidavit that I provided to the court. Because the case was withdrawn by McHardy, my affidavit was not entered into the public record. I had always assumed that my affidavit would be made public, and since I have had a number of people ask me about what it contained, I figured it was good to just publish it for everyone to be able to see it.

There are some minor edits from what was exactly submitted to the court, such as the side-by-side German translation of the English text, and some reformatting around some footnotes in the text, because I don’t know how to do that directly here, and they really were not all that relevant for anyone who reads this blog. Exhibit A is also not reproduced as it’s just a huge list of all of the kernel releases in which I felt that there was no evidence of any contribution by Patrick McHardy.

AFFIDAVIT

I, the undersigned, Greg Kroah-Hartman,
declare in lieu of an oath and in the
knowledge that a wrong declaration in
lieu of an oath is punishable, to be
submitted before the Court:

I. With regard to me personally:

1. I have been an active contributor to
   the Linux Kernel since 1999.

2. Since February 1, 2012 I have been a
   Linux Foundation Fellow.  I am currently
   one of five Linux Foundation Fellows
   devoted to full time maintenance and
   advancement of Linux. In particular, I am
   the current Linux stable Kernel maintainer
   and manage the stable Kernel releases. I
   am also the maintainer for a variety of
   different subsystems that include USB,
   staging, driver core, tty, and sysfs,
   among others.

3. I have been a member of the Linux
   Technical Advisory Board since 2005.

4. I have authored two books on Linux Kernel
   development including Linux Kernel in a
   Nutshell (2006) and Linux Device Drivers
   (co-authored Third Edition in 2009.)

5. I have been a contributing editor to Linux
   Journal from 2003 - 2006.

6. I am a co-author of every Linux Kernel
   Development Report. The first report was
   based on my Ottawa Linux Symposium keynote
   in 2006, and the report has been published
   every few years since then. I have been
   one of the co-author on all of them. This
   report includes a periodic in-depth
   analysis of who is currently contributing
   to Linux. Because of this work, I have an
   in-depth knowledge of the various records
   of contributions that have been maintained
   over the course of the Linux Kernel
   project.

   For many years, Linus Torvalds compiled a
   list of contributors to the Linux kernel
   with each release. There are also usenet
   and email records of contributions made
   prior to 2005. In April of 2005, Linus
   Torvalds created a program now known as
   “Git” which is a version control system
   for tracking changes in computer files and
   coordinating work on those files among
   multiple people. Every Git directory on
   every computer contains an accurate
   repository with complete history and full
   version tracking abilities.  Every Git
   directory captures the identity of
   contributors.  Development of the Linux
   kernel has been tracked and managed using
   Git since April of 2005.

   One of the findings in the report is that
   since the 2.6.11 release in 2005, a total
   of 15,637 developers have contributed to
   the Linux Kernel.

7. I have been an advisor on the Cregit
   project and compared its results to other
   methods that have been used to identify
   contributors and contributions to the
   Linux Kernel, such as a tool known as “git
   blame” that is used by developers to
   identify contributions to a git repository
   such as the repositories used by the Linux
   Kernel project.

8. I have been shown documents related to
   court actions by Patrick McHardy to
   enforce copyright claims regarding the
   Linux Kernel. I have heard many people
   familiar with the court actions discuss
   the cases and the threats of injunction
   McHardy leverages to obtain financial
   settlements. I have not otherwise been
   involved in any of the previous court
   actions.

II. With regard to the facts:

1. The Linux Kernel project started in 1991
   with a release of code authored entirely
   by Linus Torvalds (who is also currently a
   Linux Foundation Fellow).  Since that time
   there have been a variety of ways in which
   contributions and contributors to the
   Linux Kernel have been tracked and
   identified. I am familiar with these
   records.

2. The first record of any contribution
   explicitly attributed to Patrick McHardy
   to the Linux kernel is April 23, 2002.
   McHardy’s last contribution to the Linux
   Kernel was made on November 24, 2015.

3. The Linux Kernel 2.5.12 was released by
   Linus Torvalds on April 30, 2002.

4. After review of the relevant records, I
   conclude that there is no evidence in the
   records that the Kernel community relies
   upon to identify contributions and
   contributors that Patrick McHardy made any
   code contributions to versions of the
   Linux Kernel earlier than 2.4.18 and
   2.5.12. Attached as Exhibit A is a list of
   Kernel releases which have no evidence in
   the relevant records of any contribution
   by Patrick McHardy.

March 11, 2018 01:51 AM

March 07, 2018

Dave Airlie (blogspot): radv - Vulkan 1.1 conformant on launch day

Vulkan 1.1 was officially released today, and thanks to a big effort by Bas and a lot of shared work from the Intel anv developers, radv is a launch day conformant implementation.

https://www.khronos.org/conformance/adopters/conformant-products#submission_308

is a link to the conformance results. This is also radv's first time to be officially conformant on Vega GPUs. 

https://patchwork.freedesktop.org/series/39535/
is the patch series, it requires a bunch of common anv patches to land first. This stuff should all be landing in Mesa shortly or most likely already will have by the time you read this.

In order to advertise 1.1 you need at least a 4.15 Linux kernel.

Thanks to all involved in making this happen, including the behind the scenes effort to allow radv to participate in the launch day!

March 07, 2018 07:13 PM

March 04, 2018

Pete Zaitcev: MITM in Ireland

I'm just back from OpenStack PTG (Project Technical Gathering) in Dublin, Ireland and while I was there, Firefox reported wrong TLS certificates for some obscure websites, although not others. Example: zaitcev.us retains old certificate, as does wrk.ru. But sealion.club goes bad. I presume that Irish authorities and/or ISPs deemed it proper to MITM these sites. The question is, why such a strange choice of targets?

The sealion.club is a free speech and discussion site, named, as much as I can tell, after an old (possibly classic or memetic) Wondermark cartoon. Maybe the Irish just hate the free speech.

Or, they do not MITM sites that have TLS settings that are too easy to break... and Gmail.

March 04, 2018 07:14 AM

February 21, 2018

Paul E. Mc Kenney: Exit Libris

I have only so many bookshelves, and I have not yet bought into ereaders, so from time to time books must leave. Here is the current batch:



It is a bit sad to abandon some old friends, but such is life with physical books!

February 21, 2018 05:06 AM

February 16, 2018

Pete Zaitcev: ARM servers apparently exist at last

Check out what I found at Pogo Linux (h/t Bryan Lunduke):

ARM R150-T62
2 x Cavium® ThunderX™ 48 Core ARM processors
16 x DDR4 DIMM slots
3 x 40GbE QSFP+ LAN ports
4 x 10GbE SFP+ LAN ports
4 x 3.5” hot-swappable HDD/SSD bays
650W 80 PLUS Platinum redundant PSU
$5,638.82

The prices are ridiculous, but at least it's a server with CentOS.

February 16, 2018 06:42 AM

Dave Airlie (blogspot): virgl caps - oops I messed.up

When I designed virgl I added a capability system to pass some info about the host GL to the guest driver along the lines of gallium caps. The design was that at the virtio GPU level you have a number of capsets, each of which has a max version and max size.

The virgl capset is capset 1 with max version 1 and size 308 bytes.

Until now we've happily been using version 1 at 308 bytes. Recently we decided we wanted to have a v2 at 380 bytes, and the world fell apart.

It turned out there is a bug in the guest kernel driver: it asks the host for a list of capsets and allows guest userspace to retrieve from it. The guest userspace has its own copy of the struct.

The flow is:
Guest mesa driver gives kernel a caps struct to fill out for capset 1.
Kernel driver asks the host over virtio for latest capset 1 info, max size, version.
Host gives it the max_size, version for capset 1.
Kernel driver asks host to fill out malloced memory of the max_size with the caps struct.
Kernel driver copies the returned caps struct to userspace, using the size of the returned host struct.

The bug is the last line: it uses the size of the returned host struct, which ends up corrupting the guest in the scenario where the host has a capset 1 v2, size 380, but the guest is still running old userspace which understands capset v1, size 308.

The 380 bytes get memcpy'd over the 308-byte struct and boom.
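To make the overrun concrete, here is a minimal simulation of that last step. It is only a sketch: guest memory is modelled as a Python bytearray, the 16-byte canary stands in for whatever happens to live after the guest's struct, and the function name is hypothetical rather than taken from the kernel driver.

```python
OLD_CAPS_SIZE = 308    # capset 1 v1: what old guest userspace allocates
NEW_CAPS_SIZE = 380    # capset 1 v2: what a newer host reports
CANARY = b"\xAA" * 16  # stand-in for unrelated data after the guest struct


def buggy_copy_to_user(guest_memory, host_caps):
    """Copy using the host's size, ignoring how big the guest buffer is."""
    guest_memory[0:len(host_caps)] = host_caps


# Guest memory layout: 308-byte caps struct, then the canary, then padding.
guest_memory = bytearray(OLD_CAPS_SIZE) + bytearray(CANARY) + bytearray(100)
host_caps = bytes([0x11]) * NEW_CAPS_SIZE

buggy_copy_to_user(guest_memory, host_caps)

overrun = NEW_CAPS_SIZE - OLD_CAPS_SIZE
canary_now = guest_memory[OLD_CAPS_SIZE:OLD_CAPS_SIZE + len(CANARY)]
clobbered = canary_now != CANARY
print(f"overrun: {overrun} bytes, neighbouring data clobbered: {clobbered}")
```

The copy writes 72 bytes past the end of the struct the guest actually allocated, which is the corruption described above.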

Now we can fix the kernel to not do this, but we can't upgrade every kernel in an existing VM. So if we allow the virglrenderer process to expose a v2 all older sw will explode unless it is also upgraded which isn't really something you want in a VM world.

I came up with some virglrenderer workarounds, but due to another bug where qemu doesn't reset virglrenderer when it should, there was no way to make it reliable, and things like kexec old kernel from new kernel would blow up.

I decided in the end to bite the bullet and just make capset 2 be a repaired one. Unfortunately this needs patches in all 4 components before it can be used.

1) virglrenderer needs to expose capset 2 with the new version/size to qemu.
2) qemu needs to allow the virtio-gpu to transfer capset 2 as a virgl capset to the host.
3) The kernel in the guest needs fixing to make sure we copy the minimum of the host caps and the guest caps into the guest userspace driver, then it needs to provide a way that guest userspace knows the fixed version is in place.
4) The guest userspace needs to check if the guest kernel has the fix, and then query capset 2 first, and fallback to querying capset 1.

After talking to a few other devs in virgl land, they pointed out we could probably just never add a new version of capset 2, and grow the struct endlessly.

The guest driver would fill out the struct it wants to use with its copy of default minimum values.
It would then call the kernel ioctl to copy over the host caps. The kernel ioctl would copy the minimum size of the host caps and the guest caps.

In this case if the host has a 400 byte capset 2, and the guest still only has 380 byte capset 2, the new fields from the host won't get copied into the guest struct and it will be fine.

If the guest has the 400 byte capset 2, but the host only has the 380 byte capset 2, the guest would preinit the extra 20 bytes with its default values (0 or whatever) and the kernel would only copy 380 bytes into the start of the 400 bytes and leave the extra bytes alone.
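Both scenarios can be sketched in a few lines. This is an illustration of the min-size copy only, not the actual ioctl: the function name, the byte values and the use of bytearrays are all assumptions made for the example.

```python
def fill_caps(guest_size, host_caps, default=0):
    """Guest pre-fills its struct with defaults, then the 'kernel' copies
    only min(host size, guest size) bytes over the start of it."""
    guest_caps = bytearray([default]) * guest_size
    n = min(len(host_caps), guest_size)
    guest_caps[:n] = host_caps[:n]
    return guest_caps


# Newer host (400 bytes), older guest (380 bytes): extra host fields are dropped.
old_guest = fill_caps(380, bytes([0x11]) * 400)

# Older host (380 bytes), newer guest (400 bytes): the tail keeps its defaults.
new_guest = fill_caps(400, bytes([0x11]) * 380)

print(len(old_guest), len(new_guest), new_guest[380:] == bytes(20))
```

Because the copy length is bounded by the guest's own struct size, neither side can write past the other's buffer, and the struct can keep growing without a new capset version.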

Now I just have to go write the patches and confirm it all.

Thanks to Stephane at google for creating the patch that showed how broken it was, and to others in the virgl community who noticed how badly it broke old guests! Now to go write the patches...

February 16, 2018 12:11 AM

February 14, 2018

Pete Zaitcev: More system administration in the age of SystemD

I'm tinkering with OpenStack TripleO in a simulated environment. It uses a dedicated non-privileged user, "stack", which can do things such as list VMs with "virsh list". So, yesterday I stopped the undercloud VM, and went to sleep. Today, I want to restart it... but virsh says:

error: failed to connect to the hypervisor
error: Cannot create user runtime directory '/run/user/1000/libvirt': Permission denied

What seems to happen is that when one logs into the stack@ user over ssh, systemd-logind mounts that /run/user/UID thing, but if I log as zaitcev@ and then do "su - stack", this fails to occur.

I have no idea what to do about this. It's probably trivial for someone more knowledgeable to throw the right pam_systemd line into /etc/pam.d/su. But su-l includes system-auth, which invokes pam_systemd.so, and yet... Oh well.
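For anyone comparing the two login paths, a small diagnostic along these lines can show whether the runtime directory is there at all. It is a sketch that only inspects state: it assumes the /run/user/UID layout seen in the error message above, and the function name is made up for the example.

```python
import os


def runtime_dir_status(uid=None):
    """Report whether /run/user/<uid> exists and who owns it."""
    uid = os.getuid() if uid is None else uid
    path = f"/run/user/{uid}"
    status = {
        "path": path,
        "exists": os.path.isdir(path),
        "owned_by_uid": False,
        # What the session environment claims, if anything.
        "xdg_runtime_dir": os.environ.get("XDG_RUNTIME_DIR"),
    }
    if status["exists"]:
        status["owned_by_uid"] = os.stat(path).st_uid == uid
    return status


print(runtime_dir_status())
```

Running it once after an ssh login and once after "su - stack" should make the difference between the two sessions visible, without changing anything on the system.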

February 14, 2018 11:23 PM

February 06, 2018

Eric Sandeen: LEAF battery replacement update

New LEAF battery

Just a quick note here – the LEAF battery did finally go under warranty on Sept 24, 2017, and I got it replaced with minimal hassle, back in great shape, on October 3.  The LeafSPY stats on the new battery actually dropped fairly quickly after I got it, which was worrisome, but now (in the very cold weather) it’s holding steady at about 97% state of health, with 62.3Ahr and 90.35Hx.

The stats when it finally dropped the 9th bar were:

Miles: 40623
Ahr: 43.51
Hx: 45.25

I’ve definitely needed that fresh capacity for this harsh winter, it’s been fine, but frigid mornings still show the Guess-o-Meter at as low as 50-60 miles at times.

February 06, 2018 08:25 PM

February 05, 2018

Kees Cook: security things in Linux v4.15

Previously: v4.14.

Linux kernel v4.15 was released last week, and there’s a bunch of security things I think are interesting:

Kernel Page Table Isolation
PTI has already gotten plenty of reporting, but to summarize, it is mainly to protect against CPU cache timing side-channel attacks that can expose kernel memory contents to userspace (CVE-2017-5754, the speculative execution “rogue data cache load” or “Meltdown” flaw).

Even for just x86_64 (as CONFIG_PAGE_TABLE_ISOLATION), this was a giant amount of work, and tons of people helped with it over several months. PowerPC also had mitigations land, and arm64 (as CONFIG_UNMAP_KERNEL_AT_EL0) will have PTI in v4.16 (though only the Cortex-A75 is vulnerable). For anyone with really old hardware, x86_32 is under development, too.

An additional benefit of the x86_64 PTI is that since there are now two copies of the page tables, the kernel-mode copy of the userspace mappings can be marked entirely non-executable, which means pre-SMEP hardware now gains SMEP emulation. Kernel exploits that try to jump into userspace memory to continue running malicious code are dead (even if the attacker manages to turn SMEP off first). With some more work, SMAP emulation could also be introduced (to stop even just reading malicious userspace memory), which would close the door on these common attack vectors. It’s worth noting that arm64 has had the equivalent (PAN emulation) since v4.10.

retpoline
In addition to the PTI work above, the retpoline kernel mitigations for CVE-2017-5715 (“branch target injection” or “Spectre variant 2”) started landing. (Note that to gain full retpoline support, you’ll need a patched compiler, as appearing in gcc 7.3/8+, and currently queued for release in clang.)

This work continues to evolve, and clean-ups are continuing into v4.16. Also in v4.16 we’ll start to see mitigations for the other speculative execution variant (i.e. CVE-2017-5753, “bounds check bypass” or “Spectre variant 1”).

x86 fast refcount_t overflow protection
In v4.13 the CONFIG_REFCOUNT_FULL code was added to stop many types of reference counting flaws (with a tiny performance loss). In v4.14 the infrastructure for a fast overflow-only refcount_t protection on x86 (based on grsecurity’s PAX_REFCOUNT) landed, but it was disabled at the last minute due to a bug that was finally fixed in v4.15. Since it was a tiny change, the fast refcount_t protection was backported and enabled for the Longterm maintenance kernel in v4.14.5. Conversions from atomic_t to refcount_t have also continued, and are now above 168, with a handful remaining.
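As a rough illustration of why overflow protection matters, the following compares a counter that wraps with one that saturates. It is a simplified model and not the kernel's implementation: the choice of INT_MAX as the pinned value and both function names are assumptions for the sketch.

```python
INT_MAX = 2**31 - 1


def wrapping_inc(value):
    """Model a signed 32-bit counter that silently wraps on overflow."""
    value = (value + 1) & 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value


def saturating_inc(value):
    """Model an overflow-protected counter: once at INT_MAX it stays pinned."""
    return value if value >= INT_MAX else value + 1


print(wrapping_inc(INT_MAX))    # wraps around to a negative value
print(saturating_inc(INT_MAX))  # stays at INT_MAX
```

A counter that wraps can eventually pass through zero while references are still held, which is how an overflow turns into a use-after-free; a pinned counter leaks the object instead, which is the safer failure.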

%p hashing
One of the many sources of kernel information exposures has been the use of the %p format string specifier. The strings end up in all kinds of places (dmesg, /sys files, /proc files, etc), and usage is scattered throughout the kernel, which had made it a very hard exposure to fix. Earlier efforts like kptr_restrict’s %pK didn’t really work since it was opt-in. While a few recent attempts (by William C Roberts, Greg KH, and others) had been made to provide toggles for %p to act like %pK, Linus finally stepped in and declared that %p should be used so rarely that it shouldn’t be used at all, and Tobin Harding took on the task of finding the right path forward, which resulted in %p output getting hashed with a per-boot secret. The result is that simple debugging continues to work (two reports of the same hash value can confirm the same address without saying what the address actually is) but it frustrates an attacker’s ability to use such information exposures as building blocks for exploits.
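The idea of hashing with a per-boot secret can be sketched as follows. This is only a model: keyed BLAKE2b from Python's hashlib is used as a stand-in hash, the 32-bit output width is an arbitrary choice, and the secret is regenerated per process rather than per boot.

```python
import hashlib
import os

# Stand-in for the per-boot secret; here it changes on every process start.
BOOT_SECRET = os.urandom(16)


def hashed_pointer(address):
    """Return a 32-bit keyed hash of an address, hiding the real value."""
    digest = hashlib.blake2b(
        address.to_bytes(8, "little"), key=BOOT_SECRET, digest_size=4
    ).digest()
    return int.from_bytes(digest, "little")


addr = 0xFFFF888012345678  # made-up kernel-looking address
print(f"{hashed_pointer(addr):08x}")
print(hashed_pointer(addr) == hashed_pointer(addr))  # same address, same hash
```

Within one run the same address always maps to the same value, so two log lines can still be correlated, but without the secret the output says nothing useful about where the object lives.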

For developers needing an unhashed %p, %px was introduced but, as Linus cautioned, either your %p remains useful when hashed, your %p was never actually useful to begin with and should be removed, or you need to strongly justify using %px with sane permissions.

It remains to be seen if we’ve just kicked the information exposure can down the road and in 5 years we’ll be fighting with %px and %lx, but hopefully the attitudes about such exposures will have changed enough to better guide developers and their code.

struct timer_list refactoring
The kernel’s timer (struct timer_list) infrastructure is, unsurprisingly, used to create callbacks that execute after a certain amount of time. They are one of the more fundamental pieces of the kernel, and as such have existed for a very long time, with over 1000 call sites. Improvements to the API have been made over time, but old ways of doing things have stuck around. Modern callbacks in the kernel take an argument pointing to the structure associated with the callback, so that a callback has context for which instance of the callback has been triggered. The timer callbacks didn’t, and took an unsigned long that was cast back to whatever arbitrary context the code setting up the timer wanted to associate with the callback, and this variable was stored in struct timer_list along with the function pointer for the callback. This creates an opportunity for an attacker looking to exploit a memory corruption vulnerability (e.g. heap overflow), where they’re able to overwrite not only the function pointer, but also the argument, as stored in memory. This elevates the attack into a weak ROP, and has been used as the basis for disabling SMEP in modern exploits (see retire_blk_timer). To remove this weakness in the kernel’s design, I refactored the timer callback API and all its callers, for a whopping:

1128 files changed, 4834 insertions(+), 5926 deletions(-)

Another benefit of the refactoring is that once the kernel starts getting built by compilers with Control Flow Integrity support, timer callbacks won’t be lumped together with all the other functions that take a single unsigned long argument. (In other words, some CFI implementations wouldn’t have caught the kind of attack described above since the attacker’s target function still matched its original prototype.)

That’s it for now; please let me know if I missed anything. The v4.16 merge window is now open!

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

February 05, 2018 11:45 PM

Greg Kroah-Hartman: Linux Kernel Release Model

Note

This post is based on a whitepaper I wrote at the beginning of 2016 to be used to help many different companies understand the Linux kernel release model and encourage them to start taking the LTS stable updates more often. I then used it as a basis of a presentation I gave at the Linux Recipes conference in September 2017 which can be seen here.

With the recent craziness of Meltdown and Spectre, I’ve seen lots of things written about how Linux is released and how we handle security patches that are totally incorrect, so I figured it is time to dust off the text, update it in a few places, and publish this here for everyone to benefit from.

I would like to thank the reviewers who helped shape the original whitepaper, which has helped many companies understand that they need to stop “cherry picking” random patches into their device kernels. Without their help, this post would be a total mess. All problems and mistakes in here are, of course, all mine. If you notice any, or have any questions about this, please let me know.

Overview

This post describes how the Linux kernel development model works, what a long term supported kernel is, how the kernel developers approach security bugs, and why all systems that use Linux should be using all of the stable releases and not attempting to pick and choose random patches.

Linux Kernel development model

The Linux kernel is the largest collaborative software project ever. In 2017, over 4,300 different developers from over 530 different companies contributed to the project. There were 5 different releases in 2017, with each release containing between 12,000 and 14,500 different changes. On average, 8.5 changes are accepted into the Linux kernel every hour, every hour of the day. A non-scientific study (i.e. Greg’s mailbox) shows that each change needs to be submitted 2-3 times before it is accepted into the kernel source tree due to the rigorous review and testing process that all kernel changes are put through, so the engineering effort happening is much larger than the 8 changes per hour.

At the end of 2017 the size of the Linux kernel was just over 61 thousand files consisting of 25 million lines of code, build scripts, and documentation (kernel release 4.14). The Linux kernel contains the code for all of the different chip architectures and hardware drivers that it supports. Because of this, an individual system only runs a fraction of the whole codebase. An average laptop uses around 2 million lines of kernel code from 5 thousand files to function properly, while the Pixel phone uses 3.2 million lines of kernel code from 6 thousand files due to the increased complexity of a SoC.

Kernel release model

With the release of the 2.6 kernel in December of 2003, the kernel developer community switched from the previous model of having a separate development and stable kernel branch, and moved to a “stable only” branch model. A new release happened every 2 to 3 months, and that release was declared “stable” and recommended for all users to run. This change in development model was due to the very long release cycle prior to the 2.6 kernel (almost 3 years), and the struggle to maintain two different branches of the codebase at the same time.

The numbering of the kernel releases started out being 2.6.x, where x was an incrementing number that changed on every release. The value of the number has no meaning, other than it is newer than the previous kernel release. In July 2011, Linus Torvalds changed the version number to 3.x after the 2.6.39 kernel was released. This was done because the higher numbers were starting to cause confusion among users, and because Greg Kroah-Hartman, the stable kernel maintainer, was getting tired of the large numbers and bribed Linus with a fine bottle of Japanese whisky.

The change to the 3.x numbering series did not mean anything other than a change of the major release number, and this happened again in April 2015 with the movement from the 3.19 release to the 4.0 release number. It is not remembered if any whisky exchanged hands when this happened. At the current kernel release rate, the number will change to 5.x sometime in 2018.

Stable kernel releases

The Linux kernel stable release model started in 2005, when the existing development model of the kernel (a new release every 2-3 months) was determined to not be meeting the needs of most users. Users wanted bugfixes that were made during those 2-3 months, and the Linux distributions were getting tired of trying to keep their kernels up to date without any feedback from the kernel community. Trying to keep individual kernels secure and with the latest bugfixes was a large and confusing effort by lots of different individuals.

Because of this, the stable kernel releases were started. These releases are based directly on Linus’s releases, and are released every week or so, depending on various external factors (time of year, available patches, maintainer workload, etc.)

The numbering of the stable releases starts with the number of the kernel release, and an additional number is added to the end of it.

For example, the 4.9 kernel is released by Linus, and then the stable kernel releases based on this kernel are numbered 4.9.1, 4.9.2, 4.9.3, and so on. This sequence is usually shortened with the number “4.9.y” when referring to a stable kernel release tree. Each stable kernel release tree is maintained by a single kernel developer, who is responsible for picking the needed patches for the release, and doing the review/release process. Where these changes are found is described below.
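The numbering scheme is simple enough to express in code. The helpers below are a sketch with hypothetical names, and they only handle the two-component 3.x/4.x style of release number, not the older 2.6.x.y form.

```python
def stable_tree(version):
    """Map a release such as '4.9.3' (or '4.9') to its tree name '4.9.y'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}.y"


def newer(a, b):
    """True if version a is newer than b, comparing components as numbers."""
    return tuple(map(int, a.split("."))) > tuple(map(int, b.split(".")))


print(stable_tree("4.9.3"))        # 4.9.y
print(newer("4.14.2", "4.9.80"))   # True: 14 > 9 when compared as numbers
```

Comparing the components as integers matters: a plain string comparison would sort 4.9 after 4.14.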

Stable kernels are maintained for as long as the current development cycle is happening. After Linus releases a new kernel, the previous stable kernel release tree is stopped and users must move to the newer released kernel.

Long-Term Stable kernels

After a year of this new stable release process, it was determined that many different users of Linux wanted a kernel to be supported for longer than just a few months. Because of this, the Long Term Supported (LTS) kernel release came about. The first LTS kernel was 2.6.16, released in 2006. Since then, a new LTS kernel has been picked once a year. That kernel will be maintained by the kernel community for at least 2 years. See the next section for how a kernel is chosen to be a LTS release.

Currently the LTS kernels are the 4.4.y, 4.9.y, and 4.14.y releases, and a new kernel is released on average, once a week. Along with these three kernel releases, a few older kernels are still being maintained by some kernel developers at a slower release cycle due to the needs of some users and distributions.

Information about all long-term stable kernels, who is in charge of them, and how long they will be maintained, can be found on the kernel.org release page.

LTS kernel releases average 9-10 patches accepted per day, while the normal stable kernel releases contain 10-15 patches per day. The number of patches fluctuates per release given the current time of the corresponding development kernel release, and other external variables. The older a LTS kernel is, the fewer patches are applicable to it, because many recent bugfixes are not relevant to older kernels. However, the older a kernel is, the harder it is to backport the changes that are needed to be applied, due to the changes in the codebase. So while there might be a lower number of overall patches being applied, the effort involved in maintaining a LTS kernel is greater than maintaining the normal stable kernel.

Choosing the LTS kernel

The method of picking which kernel the LTS release will be, and who will maintain it, has changed over the years from a semi-random method, to something that is hopefully more reliable.

Originally it was merely based on what kernel the stable maintainer’s employer was using for their product (2.6.16.y and 2.6.27.y) in order to make the effort of maintaining that kernel easier. Other distribution maintainers saw the benefit of this model and got together and colluded to get their companies to all release a product based on the same kernel version without realizing it (2.6.32.y). After that was very successful, and allowed developers to share work across companies, those companies decided to not do that anymore, so future LTS kernels were picked on an individual distribution’s needs and maintained by different developers (3.0.y, 3.2.y, 3.12.y, 3.16.y, and 3.18.y) creating more work and confusion for everyone involved.

This ad-hoc method of catering to only specific Linux distributions was not beneficial to the millions of devices that used Linux in an embedded system and were not based on a traditional Linux distribution. Because of this, Greg Kroah-Hartman decided that the choice of the LTS kernel needed to change to a method in which companies can plan on using the LTS kernel in their products. The rule became “one kernel will be picked each year, and will be maintained for two years.” With that rule, the 3.4.y, 3.10.y, and 3.14.y kernels were picked.

Due to a large number of different LTS kernels being released all in the same year, causing lots of confusion for vendors and users, the rule of no new LTS kernels being based on an individual distribution’s needs was created. This was agreed upon at the annual Linux kernel summit and started with the 4.1.y LTS choice.

During this process, the LTS kernel would only be announced after the release happened, making it hard for companies to plan ahead of time what to use in their new product, causing lots of guessing and misinformation to be spread around. This was done on purpose as previously, when companies and kernel developers knew ahead of time what the next LTS kernel was going to be, they relaxed their normal stringent review process and allowed lots of untested code to be merged (2.6.32.y). The fallout of that mess took many months to unwind and stabilize the kernel to a proper level.

The kernel community discussed this issue at its annual meeting and decided to mark the 4.4.y kernel as a LTS kernel release, much to the surprise of everyone involved, with the goal that the next LTS kernel would be planned ahead of time to be based on the last kernel release of 2016 in order to provide enough time for companies to release products based on it in the next holiday season (2017). This is how the 4.9.y and 4.14.y kernels were picked as the LTS kernel releases.

This process seems to have worked out well, without many problems being reported against the 4.9.y tree, despite it containing over 16,000 changes, making it the largest kernel to ever be released.

Future LTS kernels should be planned based on this release cycle (the last kernel of the year). This should allow SoC vendors to plan ahead on their development cycle to not release new chipsets based on older, and soon to be obsolete, LTS kernel versions.

Stable kernel patch rules

The rules for what can be added to a stable kernel release have remained almost identical for the past 12 years. The full list of the rules for patches to be accepted into a stable kernel release can be found in the Documentation/process/stable_kernel_rules.rst kernel file and are summarized here. A stable kernel change:

The last rule, “a change must be in Linus’s tree”, prevents the kernel community from losing fixes. The community never wants a fix to go into a stable kernel release that is not already in Linus’s tree so that anyone who upgrades should never see a regression. This prevents many problems that other projects who maintain a stable and development branch can have.

Kernel Updates

The Linux kernel community has promised its userbase that no upgrade will ever break anything that is currently working in a previous release. That promise was made in 2007 at the annual Kernel developer summit in Cambridge, England, and still holds true today. Regressions do happen, but those are the highest priority bugs and are either quickly fixed, or the change that caused the regression is quickly reverted from the Linux kernel tree.

This promise holds true for both the incremental stable kernel updates, as well as the larger “major” updates that happen every three months.

The kernel community can only make this promise for the code that is merged into the Linux kernel tree. Any code that is merged into a device’s kernel that is not in the kernel.org releases is unknown and interactions with it can never be planned for, or even considered. Devices based on Linux that have large patchsets can have major issues when updating to newer kernels, because of the huge number of changes between each release. SoC patchsets are especially known to have issues with updating to newer kernels due to their large size and heavy modification of architecture specific, and sometimes core, kernel code.

Most SoC vendors do want to get their code merged upstream before their chips are released, but the reality of project-planning cycles and ultimately the business priorities of these companies prevent them from dedicating sufficient resources to the task. This, combined with the historical difficulty of pushing updates to embedded devices, results in almost all of them being stuck on a specific kernel release for the entire lifespan of the device.

Because of the large out-of-tree patchsets, most SoC vendors are starting to standardize on using the LTS releases for their devices. This allows devices to receive bug and security updates directly from the Linux kernel community, without having to rely on the SoC vendor’s backporting efforts, which traditionally are very slow to respond to problems.

It is encouraging to see that the Android project has standardized on the LTS kernels as a “minimum kernel version requirement”. Hopefully that will allow the SoC vendors to continue to update their device kernels in order to provide more secure devices for their users.

Security

When doing kernel releases, the Linux kernel community almost never declares specific changes as “security fixes”. This is because it is difficult to determine, at the time a bugfix is created, whether it is also a security fix. Also, many bugfixes are only determined to be security related after much time has passed, so to keep users from getting a false sense of security by not taking patches, the kernel community strongly recommends always taking all bugfixes that are released.

Linus summarized the reasoning behind this behavior in an email to the Linux Kernel mailing list in 2008:

On Wed, 16 Jul 2008, pageexec@freemail.hu wrote:
>
> you should check out the last few -stable releases then and see how
> the announcement doesn't ever mention the word 'security' while fixing
> security bugs

Umm. What part of "they are just normal bugs" did you have issues with?

I expressly told you that security bugs should not be marked as such,
because bugs are bugs.

> in other words, it's all the more reason to have the commit say it's
> fixing a security issue.

No.

> > I'm just saying that why mark things, when the marking have no meaning?
> > People who believe in them are just _wrong_.
>
> what is wrong in particular?

You have two cases:

 - people think the marking is somehow trustworthy.

   People are WRONG, and are misled by the partial markings, thinking that
   unmarked bugfixes are "less important". They aren't.

 - People don't think it matters

   People are right, and the marking is pointless.

In either case it's just stupid to mark them. I don't want to do it,
because I don't want to perpetuate the myth of "security fixes" as a
separate thing from "plain regular bug fixes".

They're all fixes. They're all important. As are new features, for that
matter.

> when you know that you're about to commit a patch that fixes a security
> bug, why is it wrong to say so in the commit?

It's pointless and wrong because it makes people think that other bugs
aren't potential security fixes.

What was unclear about that?

    Linus

This email can be found here, and the whole thread is recommended reading for anyone who is curious about this topic.

When security problems are reported to the kernel community, they are fixed as soon as possible and pushed out publicly to the development tree and the stable releases. As described above, the changes are almost never described as a “security fix”, but rather look like any other bugfix for the kernel. This is done to allow affected parties the ability to update their systems before the reporter of the problem announces it.

Linus describes this method of development in the same email thread:

On Wed, 16 Jul 2008, pageexec@freemail.hu wrote:
>
> we went through this and you yourself said that security bugs are *not*
> treated as normal bugs because you do omit relevant information from such
> commits

Actually, we disagree on one fundamental thing. We disagree on
that single word: "relevant".

I do not think it's helpful _or_ relevant to explicitly point out how to
tigger a bug. It's very helpful and relevant when we're trying to chase
the bug down, but once it is fixed, it becomes irrelevant.

You think that explicitly pointing something out as a security issue is
really important, so you think it's always "relevant". And I take mostly
the opposite view. I think pointing it out is actually likely to be
counter-productive.

For example, the way I prefer to work is to have people send me and the
kernel list a patch for a fix, and then in the very next email send (in
private) an example exploit of the problem to the security mailing list
(and that one goes to the private security list just because we don't want
all the people at universities rushing in to test it). THAT is how things
should work.

Should I document the exploit in the commit message? Hell no. It's
private for a reason, even if it's real information. It was real
information for the developers to explain why a patch is needed, but once
explained, it shouldn't be spread around unnecessarily.

    Linus

Full details of how security bugs can be reported to the kernel community in order to get them resolved and fixed as soon as possible can be found in the kernel file Documentation/admin-guide/security-bugs.rst.

Because security bugs are not announced to the public by the kernel team, CVE numbers for Linux kernel-related issues are usually released weeks, months, and sometimes years after the fix was merged into the stable and development branches, if at all.

Keeping a secure system

When deploying a device that uses Linux, it is strongly recommended that all LTS kernel updates be taken by the manufacturer and pushed out to their users after proper testing shows the update works well. As was described above, it is not wise to try to pick and choose various patches from the LTS releases because:

Note, this author has audited many SoC kernel trees that attempt to cherry-pick random patches from the upstream LTS releases. In every case, severe security fixes have been ignored and not applied.

As proof of this, I demoed at the Kernel Recipes talk referenced above how trivial it was to crash all of the latest flagship Android phones on the market with a tiny userspace program. The fix for this issue had been released 6 months earlier in the LTS kernel that the devices were based on; however, none of the devices had upgraded or fixed their kernels for this problem. As of this writing (5 months later) only two devices have fixed their kernel and are no longer vulnerable to that specific bug.
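A rough way to see how far a device has drifted from its LTS series is to compare point-release numbers. This is a minimal sketch under stated assumptions: the version strings are hypothetical, the function names are invented for illustration, and a count of missed point releases says nothing about which fixes they contain.

```python
def parse_stable_version(version):
    """Split 'X.Y.Z' into ((X, Y), Z); a bare 'X.Y' counts as point release 0.

    Any vendor suffix after the first '-' (for example '4.9.30-vendor') is
    ignored.
    """
    parts = [int(p) for p in version.split("-")[0].split(".")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError(f"unexpected kernel version: {version!r}")
    return (parts[0], parts[1]), parts[2]


def point_releases_behind(device_version, latest_lts_version):
    """Count how many stable point releases the device kernel is missing."""
    device_series, device_point = parse_stable_version(device_version)
    lts_series, lts_point = parse_stable_version(latest_lts_version)
    if device_series != lts_series:
        raise ValueError("versions belong to different LTS series")
    return max(0, lts_point - device_point)


# Hypothetical device kernel versus a hypothetical latest LTS point release.
print(point_releases_behind("4.9.30-vendor", "4.9.80"))
```

A device that takes every LTS update would report 0 here; a large number suggests many released bugfixes never reached its users.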

February 05, 2018 05:13 PM

February 04, 2018

Pete Zaitcev: Farewell Nexus 7, Hello Huawei M3

Flying a photoshoot of the Carlson, I stuffed my Nexus 7 under my thighs and cracked the screen. In my defense, I had done it several times before, because I hate leaving it on the cockpit floor. I had to fly uncoordinated for the photoshoot, which causes anything that's not fixed in place to slide around, and I'm paranoid about interference with the controls. Anyway, the cracked screen caused a significant dead zone where touch didn't register anymore, and that made the tablet useless. I had to replace it.

In the years since I had the Nexus (apparently since 2014), the industry stopped making good 7-inch tablets. Well, you can still buy $100 tablets in that size. But because the Garmin Pilot was getting spec-hungry recently, I had no choice but to step up. Sad, really. Naturally, I'm having trouble fitting the M3 into pockets where Nexus lived comfortably before. {It's a full-size iPad in the picture, not a Mini.}

The most annoying problem that I encountered was Chrome not liking the SSL certificate of www.zaitcev.us. It bails with ERR_SSL_SERVER_CERT_BAD_FORMAT. I have my own fake CA, so I install my CA certificate on clients and I sign my hosts. I accept the consequences and inconvenience. The annoyance arises because Chrome does not say what it dislikes about the certificate. Firefox works fine with it, as do other applications (like IMAP clients). Chrome in the Nexus worked fine. A cursory web search suggests that Chrome may want alternative names keyed with "DNS.1" instead of "DNS". Dunno what it means or whether it is true.

UPDATE: "Top FBI, CIA, and NSA officials all agree: Stay away from Huawei phones"

February 04, 2018 05:17 AM

February 02, 2018

Michael Kerrisk (manpages): man-pages-4.15 is released

I've released man-pages-4.15. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 26 contributors. Just over 200 commits changed around 75 pages. In addition, 3 new manual pages were added.

Among the more significant changes in man-pages-4.15 are the following:

February 02, 2018 03:21 PM

Daniel Vetter: LCA Sydney: Burning Down the Castle

I’ve done a talk about the kernel community. It’s a hot take, but with the feedback I’ve received thus far I think it was spot on, and started a lot of uncomfortable, but necessary discussion. I don’t think it’s time yet to give up on this project, even if it will take years.

Without further ado: the recording of my talk “Burning Down the Castle” is on YouTube. For those who prefer reading, LWN has you covered with “Too many lords, not enough stewards”. I think Jake Edge and Jon Corbet have done an excellent job in capturing my talk in a balanced fashion. I have also uploaded my slides.

Further Discussion

For understanding abuse dynamics I can’t recommend “Why Does He Do That?: Inside the Minds of Angry and Controlling Men” by Lundy Bancroft enough. All the examples are derived from a few decades of working with abusers in personal relationships, but the patterns and archetypes that Lundy Bancroft extracts transfer extremely well to any other kind of relationship, whether that’s work, family or open source communities.

There are endless stellar talks about building better communities. I’d like to highlight just two: “Life is better with Rust’s community automation” by Emily Dunham and “Have It Your Way: Maximizing Drive-Thru Contribution” by VM Brasseur. For learning more there are lots of great community tracks at various conferences, and also dedicated events, often run as unconferences: Community Leadership Summit (including its various offshoots) and maintainerati are two I’ve attended and learned a lot from.

Finally there’s the fun of trying to change a huge existing organization with lots of inertia. “Leading Change” by John Kotter has some good insights and frameworks to approach this challenge.

Despite what it might look like I’m not quitting kernel hacking nor the X.org community, and I’m happy to discuss my talk over mail and in upcoming hallway tracks.

February 02, 2018 12:00 AM

January 23, 2018

Pete Zaitcev: 400 gigabits, every second

For many years I have kept waiting for RJ-45 to fail to keep pace with the gigabits, and it always catches up. But maybe not anymore. Here's what the connector looks like for QSFP-DD, a standard module connector for 400GbE:

Two rows, baby, same as on USB3.

These speeds are mostly used between leaf and spine switches, but I'm sure we'll see them in the upstream routers, too.

January 23, 2018 07:43 PM

January 22, 2018

James Morris: LCA 2018 Kernel Miniconf – SELinux Namespacing Slides

I gave a short talk on SELinux namespacing today at the Linux.conf.au Kernel Miniconf in Sydney — the slides from the talk are here: http://namei.org/presentations/selinux_namespacing_lca2018.pdf

This is a work in progress to which I’ve been contributing, following on from initial discussions at Linux Plumbers 2017.

In brief, there’s a growing need to be able to provide SELinux confinement within containers: typically, SELinux appears disabled within a container on Fedora-based systems, as a workaround for a lack of container support.  Underlying this is a requirement to provide per-namespace SELinux instances,  where each container has its own SELinux policy and private kernel SELinux APIs.
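To make "SELinux appears disabled within a container" concrete, here is a simplified sketch of how a process might probe its view of SELinux through selinuxfs. It assumes the conventional mount point `/sys/fs/selinux` and its `enforce` file; the function name is hypothetical, and real detection in libselinux is more involved than this.

```python
import os


def selinux_status(selinuxfs="/sys/fs/selinux"):
    """Report 'disabled', 'permissive' or 'enforcing' as seen by this process."""
    enforce_path = os.path.join(selinuxfs, "enforce")
    try:
        with open(enforce_path) as f:
            enforcing = f.read().strip() == "1"
    except OSError:
        # No readable selinuxfs from this process: SELinux looks disabled here.
        return "disabled"
    return "enforcing" if enforcing else "permissive"


print(selinux_status())
```

Run on a host and inside a container on the same machine, a probe like this can give different answers, which is the gap per-namespace SELinux instances aim to close.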

A prototype for SELinux namespacing was developed by Stephen Smalley, who released the code via https://github.com/stephensmalley/selinux-kernel/tree/selinuxns.  There were and still are many TODO items.  I’ve since been working on providing namespacing support to on-disk inode labels, which are represented by security xattrs.  See the v0.2 patch post for more details.
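For readers unfamiliar with on-disk inode labels, the sketch below reads the `security.selinux` extended attribute that stores a file's label. It only illustrates the existing, non-namespaced attribute and says nothing about how the prototype namespaces it. The function name is hypothetical; `os.getxattr` is available on Linux only, so the sketch falls back to `None` elsewhere.

```python
import os


def read_selinux_label(path, getxattr=getattr(os, "getxattr", None)):
    """Return the SELinux label stored on `path`, or None if unavailable."""
    if getxattr is None:
        return None  # platform without os.getxattr (non-Linux)
    try:
        raw = getxattr(path, "security.selinux")
    except OSError:
        return None  # no label set, or filesystem without xattr support
    # The stored value is usually NUL-terminated; strip that before decoding.
    return raw.rstrip(b"\x00").decode("utf-8", errors="replace")


print(read_selinux_label("/etc/passwd"))
```

On a system without SELinux labels this prints `None`; on a labeled filesystem it prints the file's security context string.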

Much of this work will be of interest to other LSMs such as Smack, and many architectural and technical issues remain to be solved.  For those interested in this work, please see the slides, which include a couple of overflow pages detailing some known but as yet unsolved issues (supplied by Stephen Smalley).

I anticipate discussions on this and related topics (LSM stacking, core namespaces) later in the year at Plumbers and the Linux Security Summit(s), at least.

The session was live streamed — I gather a standalone video will be available soon!

ETA: the video is up! See:

January 22, 2018 08:38 AM