Kernel Planet

July 29, 2015

Pete Zaitcev: Conference submission and voting

Generally I feel that I do not do any work that's important enough to present at conferences. My previous presentation was at OLS back in 2005, concerning usbmon. The usbmon is something a guy learning C would program: it's a circular buffer into which the kernel drops tracing events; Wireshark pulls them out. Hardly conference material, but at the time I thought it was supremely important to proselytize the basic techniques of always-on tracing, because it would improve the quality and the ease of debugging of the kernel overall. I really wanted the FireWire guys to adopt a similar tracing scheme, because it was hell on a stick debugging juju with just printk(). Needless to say, that was a miserable failure, as was FireWire itself. I don't think anyone who came to listen to my presentation in Ottawa received their money's worth.

Or did they? Recently an epiphany occurred to me. I really should not even think about whether anyone is interested. That is the conference organizers' job, not mine! As a result, I sent a proposal to OpenStack Tokyo, entitled "The Plot to Destroy OpenStack Swift Using C++: Enhancements of Swift API Compatibility in Ceph RADOS Gateway". It's basically a compendium of practical issues that occur when running Swift apps on top of Ceph RGW and what we do to help people do that.

Things are a little different from 10 years ago, because attendees can now vote on the submissions. This sounds democratic. I went through all the submissions on the storage track and voted on them according to my preference. It took a very long time, and I suspect that I was crowdsourced by the organizers in the best traditions of Web 2.0. I wonder if they'll even read the abstracts. :-)

July 29, 2015 04:23 PM

July 28, 2015

Matthew Garrett: Your Ubuntu-based container image is probably a copyright violation

Update: A Canonical employee responded here, but doesn't appear to actually contradict anything I say below.

I wrote about Canonical's Ubuntu IP policy here, primarily in terms of its broader impact, but I also mentioned a few specific cases. People seem to have picked up on the case of container images (especially Docker ones), so here's an unambiguous statement:

If you generate a container image that is not a 100% unmodified version of Ubuntu (ie, you have not removed or added anything), Canonical insist that you must ask them for permission to distribute it. The only alternative is to rebuild every binary package you wish to ship[1], removing all trademarks in the process. As I mentioned in my original post, the IP policy does not merely require you to remove trademarks that would cause infringement, it requires you to remove all trademarks - a strict reading would require you to remove every instance of the word "ubuntu" from the packages.

If you want to contact Canonical to request permission, you can do so here. Or you could just derive from Debian instead.

[1] Other than ones whose license explicitly grants permission to redistribute binaries and which do not permit any additional restrictions to be imposed upon the license grants - so any GPLed material is fine


July 28, 2015 08:06 PM

LPC 2015: Microconference schedule now available

The Linux Plumbers Conference starts in less than three weeks and so the schedule for Microconferences is now available!  Looking forward to seeing you all there!

July 28, 2015 07:40 PM

July 27, 2015

Andi Kleen: Energy efficient servers book review

Energy Efficient Servers – Blueprints for Data Center Optimization by Gough/Steiner/Sanders is a new book on power tuning for servers, recently published by Apress. I got my copy a few weeks ago, read it, and it is great.

Disclaimer: I contributed a few pages to the book, but have no financial interest in its success.

As you probably already know, power efficiency is very important for modern computing. It matters for mobile devices to extend battery time, and it matters for desktops and servers to avoid exceeding thermal/power capacity and to lower energy costs.

Modern chips cannot run all their transistors at full speed at the same time due to the dark silicon problem. This results in the somewhat paradoxical situation that power management is needed, even if energy costs don’t matter, just to give the best performance (such as the highest Turbo frequencies).

Power management in modern systems is quite complex, with many different moving parts (hardware, operating systems, drivers, firmware, embedded micro-controllers) working together to be as efficient as possible. I’m not aware of any good overview of all of this.

There is some lore around — for example you may have heard of race to idle, that is running as fast as possible to go idle again — but nothing really that puts it all into a larger context. BTW race-to-idle is not always a good idea, as the book explains.

The new book makes an attempt to explain all of this together for Intel servers (the basic concepts are similar on other systems and also on client systems).

It starts with a (short) introduction to the underlying physical principles and then moves on to the basic CPU and platform power management techniques, such as frequency scaling, idle states, and thermal management. It has a discussion of modern memory subsystems and describes the trade-offs between different DIMM configurations. It describes the power management differences between larger servers and micro servers. And there is an overview of thermal management and power supply, such as energy efficient power supplies and voltage regulators.

Then it moves on to an overview of the software involved in power management, including firmware, rack-level power management software, and operating systems. Then there is an extensive chapter on how to instrument and measure power management.

Finally (and perhaps most valuable) the book lays out a systematic power tuning methodology, starting with measurements and then concrete steps to optimize existing workloads for the best power efficiency.

The book is written not as an academic textbook, but for people who solve concrete problems on shipping systems. It is quite readable, explaining any complicated concepts. You can clearly tell the authors have deep knowledge of the topic. While the details are aimed at Intel servers, I would expect the book to be useful even to people working on clients or on other architectures.

One possible issue with the book is that it may be too specific to today’s systems. We’ll see how well it ages to future systems. But right now, as it just came out, it is very up-to-date and a good guide. It has some descriptions of data center design (such as efficient cooling), but these parts are quite short and are clearly not the main focus.

The ebook version is currently available as a free download, both from the publisher after registration and from Amazon as a free Kindle edition; it is also available as a reasonably priced paperback.

July 27, 2015 06:14 AM

July 24, 2015

James Morris: Linux Security Summit 2015 Update: Free Registration

In previous years, attending the Linux Security Summit (LSS) has required full registration as a LinuxCon attendee.  This year, LSS has been upgraded to a hosted event.  I didn’t realize that this meant that LSS registration was available entirely standalone.  To quote an email thread:

If you are only planning on attending the Linux Security Summit, there is no need to register for LinuxCon North America. That being said, you will not have access to any of the booths, keynotes, breakout sessions, or breaks that come with the LinuxCon North America registration. You will only have access to the Linux Security Summit.

Thus, if you wish to attend only LSS, then you may register for that alone, at no cost.

There may be a number of people who registered for LinuxCon but who only wanted to attend LSS.   In that case, please contact the program committee at lss-pc_AT_lists.linuxfoundation.org.

Apologies for any confusion.

July 24, 2015 03:46 AM

July 23, 2015

Michael Kerrisk (manpages): man-pages-4.01 is released

I've released man-pages-4.01. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, and comments from nearly 50 contributors. As well as a large number of minor fixes to over 100 man pages, the more significant changes in man-pages-4.01 include the following:

July 23, 2015 06:03 PM

July 20, 2015

Pete Zaitcev: Fedora 22 killed IPv6 and I'm fine

I upgraded Fedora on my home router to F22 and immediately IPv6 disappeared on the internal network. The problem is that radvd started throwing its usual "no linklocal address configured on ethmain.5" (although the message is only visible with "IgnoreIfMissing off;"), which leads to "interface ethmain.5 does not exist or is not set up properly". With the default IgnoreIfMissing, radvd continues running but refuses to work, quietly. Needless to say, the interface has a perfectly valid link-local address, same as it had in F21 before the upgrade.
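For reference, the relevant radvd.conf stanza looks something like this (a sketch: the interface name is the one from my setup above, the prefix is a placeholder, and "IgnoreIfMissing off;" is what makes radvd complain loudly instead of failing quietly):

    interface ethmain.5
    {
        AdvSendAdvert on;
        # "off" surfaces the linklocal error instead of hiding it:
        IgnoreIfMissing off;
        prefix 2001:db8::/64    # placeholder prefix
        {
            AdvOnLink on;
            AdvAutonomous on;
        };
    };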

There used to be a time when I took a problem like this as an affront to the idea of IPv6 superiority and the reputation of Fedora as a platform for roll-your-own home routers. Now though, I don't give a rat's tail for IPv6. Let Comcast and Google care and pay someone to care. Okay, I lied. I cared enough to file bug 1244428, but I'm not rushing to build from SRPMs, reinstall old versions, and such.

July 20, 2015 04:18 PM

Mel Gorman: Continual testing of mainline kernels

It is not widely known that the SUSE Performance team runs continual testing of mainline kernels and collects data on machines that would otherwise be idle. Testing is a potential topic for Kernel Summit 2015, so now seems like a good time to introduce Marvin. Marvin is a system that continually runs performance-related tests and is named after another robot doomed to repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux Enterprise kernels for performance regressions, but it is also configured to run tests against mainline releases. There are four primary components of Marvin that are of interest.

The first component is the test client, which is a copy of MMTests. The use of MMTests ensures that the tests can be independently replicated and the methodology examined. The second component is Bob, a builder that monitors git trees for new kernels to test, builds a kernel when it's released and schedules it to be tested. In practice this monitors the SLE kernel tree continually and checks the mainline git tree once a month for new releases. Bob only builds and queues released kernels and ignores -rc kernels in mainline. The reason for this is simple -- time. The full battery of tests can take up to a month to complete in some cases and it's impractical to do that on every -rc release. There are times when a small subset of tests will be checked for a pre-release kernel, but only when someone on the performance team is checking a specific series of patches and it's urgent to get the results quickly. When tests complete, it's Bob that generates the report. The third component is Marvin itself, which runs on the server; one instance exists per test machine. It checks the queue, prepares the test machine and executes tests when the machine is ready. The final component is a configuration manager that is responsible for reserving machines for exclusive use, managing power, managing serial consoles and deploying distributions automatically. The inventory management does not have a specific name as it differs depending on where Marvin is set up.

There are two installations of Marvin -- one that runs in my house and a second that runs within SUSE -- and they have slightly different configurations. Technically Marvin supports testing on different distributions, but only openSUSE and SLE are deployed. SLE kernels are tested on the corresponding SLE distribution. The Marvin instance in my house tests kernels 3.0 up to 3.12 on openSUSE 13.1, and kernels 3.12 up to current mainline on openSUSE 13.2. In the SUSE instance, SLE 11 SP3 is used as the distribution for testing kernels 3.0 up to 3.12, and openSUSE 13.2 is used for 3.12 and later kernels. The kernel configuration used corresponds to the distribution. The raw results are not publicly available, but the reports are generated on private servers and mirrored once a week to the following locations:


Dashboard for kernels 3.0 to 3.12 on openSUSE 13.1 running on home machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on home machines

Dashboard for kernels 3.0 to 3.12 on SLE 11 SP3 running on SUSE machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on SUSE machines


The dashboard is intended to be a very high-level view detailing whether there are regressions in comparison to a baseline. For example, in the first report linked above, the baseline is always going to be a 3.0-based kernel. It needs a human to establish whether a regression is real or an acceptable trade-off. The top section makes a guess as to where the biggest regressions might be, but it's not perfect, so double check. Each test that was conducted is then listed. The name of the test corresponds to an MMTests configuration file in configs/, with extensions naming the filesystem used where applicable. The remaining columns are machines, each showing a number which represents a performance delta: 1 means there is no difference, and 1.02 would mean there is a 2% difference. The colour indicates whether it is a performance regression or gain: green is good, red is bad, gray or white is neutral. The dashboard automatically guesses whether a result is significant, which is why 0.98 might be a 2% performance regression in one test (red) and in the noise for another.

It is important to note that the dashboard figure is a very rough estimate that often collapses multiple values into a single number. There is no substitute for reading the detailed report and making an assessment. It is also important to note that Marvin is not up to date and some machines have not started testing 4.1. It is known that the reports are very ugly, but making them prettier has yet to climb up the list of priorities. Where possible we are instead picking a regression and doing something about it rather than making HTML pages look pretty.

The obvious question is what has been done with this data. When Marvin was first assembled, the intent was to identify and fix regressions between 2.6.32 (yes, really) and 3.12. This is one of the reasons why 3.12-stable contains so many performance-related fixes. When a regression is found, there is generally one of three outcomes. The first is that it gets fixed, obviously. The second is that it is identified as an apparent, but not real, regression. Usually this means the old kernel was buggy in a manner that happened to benefit a particular benchmark. Tiobench is an excellent example: on old kernels there was a bug that preserved old pages and reclaimed new pages in certain circumstances. For most workloads this is terrible, but in tiobench it means that parts of the file were cached and the IO appeared to complete faster -- but it was a lie. The third possible outcome is that it's slower, but it's a trade-off to win somewhere else and the trade-off is acceptable. Some scheduler regressions fall under this heading, where a context-switch micro-benchmark might be hurt but it's because the scheduler is making an intelligent placement decision.

The focus on 3.12 is also why Marvin is not widely advertised within the community. It is rare that mainline developers are concerned with performance in -stable kernels unless the most recent kernel is also discussed. In some cases the most recent kernel may have the same regression, but it is common to discover there is simply a different mix of problems in a recent kernel. Each problem must be identified and addressed in turn, and time is spent on that instead of adding volume to LKML. Advertising the existence of Marvin was also postponed because some of the tests or reporting were buggy, and each time I wanted to fix the problem first. There are very few known problems now, but it takes a surprising amount of time to address everything that crops up when running tests across large numbers of machines. There are still issues lurking in there, but if a particular issue is important to you then let me know and I'll see if it can be examined faster.

An obvious question is how this compares to other performance-based automated testing such as Intel's 0-day kernel test infrastructure. The answer is that they are complementary. The 0-day infrastructure tests every commit to quickly identify both performance gains and regressions. Its tests are short-lived by necessity and are invaluable at quickly catching some classes of problems. The tests run by Marvin are much longer-lived and there is only overlap in a small number of places. The two systems are simply looking for different problems. Back in 2012 I was tempted to try integrating parts of what became Marvin with 0-day, but ultimately it was unnecessary and there is value in both. The other system worth looking at is the results reported by the Phoronix Test Suite. In that case, it's relatively rare that the data needed to debug a problem is included in the reports, which complicates matters. In a few cases I examined in detail, I had problems with the testing methodology. As MMTests already supported much of what I was looking for, there was no benefit to discarding it and starting again with Phoronix and addressing any perceived problems there. Finally, on the site that reports those results, there is a frequent emphasis on graphics performance or the relative performance between different hardware configurations. It is relatively rare that this is the type of comparison my team is interested in.

The next obvious question is how recent releases are performing. At this time I do not want to make a general statement, as I have not examined all the data in sufficient detail and am currently developing a series aimed at one of the problems. When I work on mainline patches, it's usually with reference to a problem I picked out after browsing through reports, targeting a particular subsystem area or responding to a bug report. I'm not applying a systematic process to identify all regressions at this point, and it's still a manual process to determine whether a reported regression is real, apparent or a trade-off. When a real regression is found, Marvin can optionally conduct an automated bisection, but that process is usually "invisible" and is only reported indirectly in a changelog if the regression gets fixed.

So what's next? The first item is that more attention will be paid to recent kernels, checking whether regressions have been introduced since 3.12 that need addressing. The second is identifying any bottlenecks in mainline that are not regressions but still should be addressed. The last, of course, is coverage. The first generation of Marvin focused on some common workloads and for a long time it was very useful. The number of problems it is finding is now declining, so other workloads will be added over time. Each time a new configuration is added, Marvin will go back through all the old kernels and collect data. This is probably not a task that will ever finish. There will always be some new issue, be it due to a hardware change, a new class of workload as the usage of computers evolves, or a modification that fixed one problem and introduced another. Fun times!

July 20, 2015 02:55 PM

July 15, 2015

Matthew Garrett: Canonical's Ubuntu IP policy is garbage

(In order to avoid any ambiguity here, this is a personal opinion. The Free Software Foundation's opinion on this matter is here)

Canonical have a legal policy surrounding reuse of Intellectual Property they own in Ubuntu, and you can find it here. It's recently been modified to handle concerns raised by various people including the Free Software Foundation[1], who have some further opinions on the matter here. The net outcome is that Canonical made it explicit that if the license a piece of software is under explicitly says you can do something, you can do that even if the Ubuntu IP policy would otherwise forbid it.

Unfortunately, "Canonical have made it explicit that they're not attempting to violate the GPL" is about the nicest thing you can say about this. The most troubling statement is Any redistribution of modified versions of Ubuntu must be approved, certified or provided by Canonical if you are going to associate it with the Trademarks. Otherwise you must remove and replace the Trademarks and will need to recompile the source code to create your own binaries.. The apparent aim here is to avoid situations where people take Ubuntu, modify it and continue to pass it off as Ubuntu. But it reaches far further than that. Cases where this may apply include (but are not limited to):


In each of these cases, a strict reading of the policy indicates that you are distributing a modified version of Ubuntu and therefore must either get it approved by Canonical or remove the trademarks and rebuild everything. The strange thing is that this doesn't limit itself to rebuilding packages that include Canonical's trademarks - there's a requirement that you rebuild all binaries.

Now obviously this is good engineering practice in a whole bunch of ways, but it's a huge pain in the ass. And to make things worse, Canonical won't clarify what they consider to be use of their trademarks. Many Ubuntu packages rebuilt from Debian include the word "ubuntu" in their version string. Many Ubuntu packages will contain the word "ubuntu" in maintainer email addresses. Many Ubuntu packages include references to Ubuntu (for instance, documentation might say "This configuration file is located under /etc/default in Debian and Ubuntu"). And many Ubuntu packages will include the compiler version string, which will include the word "ubuntu". Realistically, there's no risk of confusion by using the trademarks in this way, and as a consequence there would be no infringement under trademark law. But Canonical aren't using trademark law here. Canonical assert that they hold copyright over binaries that they have built from source, and require that for you to have permission to redistribute these binaries under copyright law you must remove the trademarks. This means that it doesn't matter whether your use of the trademarks would be infringing or not - you're required to remove them, because fuck you that's why.

This is a huge overreach. It's hostile to free software, in that it makes it significantly more difficult to produce derivative works of Ubuntu and doesn't benefit the community in the process. It's hostile to our understanding of IP law, in that it claims that the mechanical process of turning source code into binaries creates an independently copyrightable work. And in some cases it may make it impossible to create derivative works that interoperate with Ubuntu due to applications making assumptions about the presence of strings.

It'd be easy to write this off as an over-the-top misinterpretation of the policy if it hadn't been confirmed by the Ubuntu Community Manager that any binaries shipped by Ubuntu under licenses that don't grant an explicit right to redistribute the binaries can't be redistributed without permission or rebuilding. When I asked for clarification from Canonical over a year ago, I got no response[2]. Perhaps Canonical don't want to force you to remove every single use of the word Ubuntu from derivative works, but their policy is written such that the natural reading is that they do, and they've refused every single opportunity they've been given to clarify the point.

So, we're left with a policy that makes it hugely impractical to redistribute modified versions of Ubuntu unless Canonical approve of it. That's not freedom, and it's certainly not Ubuntu. If Canonical are serious about participating in the free software community then they need to demonstrate their willingness to continue improving this policy to bring it closer to our goals. Failure to do so will give a strong indication of their priorities.

[1] While I'm a member of the FSF's board of directors, I'm not involved in the majority of the FSF's day to day activities and was not part of this process
[2] Nebula's OS was a mixture of binary packages we pulled straight from Ubuntu and packages we rebuilt, so we were obviously pretty interested in what the answer was


July 15, 2015 07:20 PM

July 12, 2015

Dave Jones: Future development of Trinity.

It’s been an odd few weeks regarding Trinity-based things.

First an email from a higher-up at my former employer asking (paraphrased)..

"That thing we asked you to stop working on when you worked here, any chance now you've left you'll implement these features."

I’m still trying to get my head around the thought process that led to that being a reasonable thing to ask. I’ve made the occasional commit over the last six months, but it’s mostly been code motion, clean-up work, and things like syscall table updates. New feature development came to a halt long ago.

It’s no coincidence that the number of bugs being found with Trinity has dropped off sharply since the beginning of the year, and I don’t think it’s because the Linux kernel suddenly got lots better. Rather, it’s due to the lack of real ongoing development to “try something else” when some approaches dry up. Sadly we now live in a world where these days it’s easier to get paid to run someone else’s fuzzer than it is to develop one.

Then earlier this week, came the revelation that the only people prepared to fund that kind of new feature development are pretty much the worst people.

Apparently Hacking Team modified Trinity to fuzz ioctl() on Android, which yielded some results. I’ve done no analysis on whether those crashes are exploitable/fixed/only relevant to Android etc. (Frankly, I’m past caring). I’m not convinced their approach is particularly sound even if it was finding results Trinity wasn’t, so it looks unlikely there are even ideas to borrow here. (We all already knew that ioctl was ripe with bugs and had practically zero coverage testing.)

It bothers me that my work was used as a foundation for their hack-job. Then again, maybe if I hadn’t released Trinity, they’d have based it on iknowthis, or some other less useful fuzzer. None of this should really surprise me. I’ve known for some time that there are some “security” people who have their own modifications they have no intention of sending my way. Thanks to the way people who release 0-days are revered in this circus, there’s no incentive for people to share their modifications if it means someone else might beat them to finding their precious bugs.

It’s unfortunate that this project has attracted so many awful people. When I began it, the motivation had nothing to do with security. Back in 2010 we were inundated with weird oopses that we couldn’t reproduce, many of them triggered by JVMs. I came up with the idea that maybe a fuzzer could create a realistic enough workload to tickle some of those same bugs. Turned out I was right, and so began a series of huge-page and other VM-related bug fixes.

In the five years that I’ve made Trinity available, I’ve received notable contributions from perhaps a half dozen people. In return I’ve made my changes available before I’d even given them runtime myself.

It’s a project everyone wants to take from, but no-one wants to give back to.

And that’s why for the foreseeable future, I’m unlikely to make public any further feature work I do on it.
I’m done enabling assholes.

Future development of Trinity. is a post from: codemonkey.org.uk

July 12, 2015 09:37 PM

July 10, 2015

Andi Kleen: Speeding up less

Often, performance analysis or debugging boils down to staring at long text trace files with the less text viewer. Yes, you can do a lot of analysis with custom scripts, but at some point it’s usually necessary to also look at the raw data.

The first annoyance in less when opening a large file is the time it takes to count lines (less counts lines at the beginning to show you the current position as a percentage). The line counting has an easy workaround: hit Ctrl-C, or use less -n to disable the percentage display. But it would still be better if that wasn’t needed.

Nicolai Haenle sped up the process by about 20x in his less repository.

One thing that always bothered me is that searching in less is so slow. If you’re browsing a file of tens to hundreds of MB, it can easily take minutes to search for a string. When browsing log and trace files, searching over long distances is often very important.

And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.

Some profiling with perf shows that most of the search time is spent converting each line. Less internally cleans up the line: it converts it to canonical case, removes backspace bold, and makes some other changes. The conversion loop processes each character in an inefficient way. Most of the time this conversion is not needed at all, so I replaced it with a quick check for whether the line contains any backspaces, using the optimized strchr() from the standard C library. For case conversion, the string search functions (either regular expression or fixed string search) can handle case-insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (when the search string contains no regular expression meta characters) can also be done using the optimized C library functions.
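The gist of the backspace change, as a sketch (hypothetical function names, not the actual less source):

    #include <string.h>

    /* Rare path: collapse "X\bX" overstrike sequences in place. */
    static void strip_backspaces(char *line)
    {
            char *src = line, *dst = line;

            while (*src) {
                    if (src[1] == '\b' && src[2]) {
                            *dst++ = src[2];        /* keep the final glyph */
                            src += 3;
                    } else {
                            *dst++ = *src++;
                    }
            }
            *dst = '\0';
    }

    /* Fast path: most lines contain no backspace at all, so a single
     * optimized strchr() scan replaces the per-character loop. */
    static void clean_line(char *line)
    {
            if (strchr(line, '\b') == NULL)
                    return;                 /* common case: nothing to do */
            strip_backspaces(line);
    }

Case handling works the same way: instead of rewriting the line to canonical case, pass something like REG_ICASE to regcomp() and let the matcher do it.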

The resulting less version searches ~85% faster on my benchmarks. I tried to submit the patch to the less maintainer, but it was ignored unfortunately. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.

One side effect of the patch is that less now defaults to case-sensitive searches. The original less had a feature (or bug) of defaulting to case-insensitive search even without the -i option. To get case-insensitive searches now, “less -i” needs to be used.

[Edit: Fix typos]

July 10, 2015 08:26 PM

Pavel Machek: Front USB connectors are evil

NFSroot over USB on the n900 was only giving me 300KiB/sec... and I thought that was normal. Then I plugged the cable into the back USB port (not the front one) and... speed went from 300KiB/sec to 2.5MiB/sec. Not bad for an old cellphone.

Someone must be joking?
root@n900:/sys/devices/platform/68000000.ocp/480ab000.usb_otg_hs/musb-hdrc.0.auto# cat vbus
Vbus off, timeout 1100 msec

It looks like my n900 was bitten by the famous "all calls disabled" problem (example solution). Prague Brmlab helped a lot, and baking the n900 for 15 minutes at 250°C seems to have fixed the problem... for a week. Now it looks like it is slowly creeping back.

July 10, 2015 10:11 AM

July 09, 2015

Andi Kleen: toplev tutorial and manual

toplev, part of pmu-tools, is a tool to determine the CPU bottleneck of workloads. Now, finally, there is a tutorial and manual available for toplev.

July 09, 2015 05:57 PM

Andi Kleen: Adding Processor Trace support to Linux

I published an article at LWN: Adding processor trace to Linux. It describes the Linux perf support for the Intel Processor Trace feature on Intel Broadwell and other CPUs. Processor Trace allows fine grained tracing of program control flow.

July 09, 2015 05:51 PM

July 08, 2015

Rusty Russell: The Megatransaction: Why Does It Take 25 Seconds?

Last night f2pool mined a 1MB block containing a single 1MB transaction.  This scooped up some of the spam which has been going to various weakly-passworded “brainwallets”, gaining them 0.5569 bitcoins (on top of the normal 25 BTC subsidy).  You can see the megatransaction on blockchain.info.

It was widely reported to take about 25 seconds for bitcoin core to process this block: this is far worse than my “2 seconds per MB” result in my last post, which was considered a pretty bad case.  Let’s look at why.

How Signatures Are Verified

The algorithm to check a transaction input (of this form) looks like this:

  1. Strip the other inputs from the transaction.
  2. Replace the input script we’re checking with the script of the output it’s trying to spend.
  3. Hash the resulting transaction with SHA256, then hash the result with SHA256 again.
  4. Check the signature correctly signed that hash result.

Now, for a transaction with 5570 inputs, we have to do this 5570 times.  And the bitcoin core code does this by making a copy of the transaction each time, and using the marshalling code to hash that; it’s not a huge surprise that we end up spending 20 seconds on it.

How Fast Could Bitcoin Core Be If Optimized?

Once we strip the inputs, the result is only about 6k long; hashing 6k 5570 times takes about 265 milliseconds (on my modern i3 laptop).  We have to do some work to change the transaction each time, but we should end up under half a second without any major backflips.
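That estimate is easy to check with a few lines of C (a sketch using OpenSSL’s SHA256, built with -lcrypto; it measures only the raw double hashing, none of the per-input transaction rewriting):

    #include <openssl/sha.h>
    #include <stdio.h>
    #include <time.h>

    /* Bitcoin's signature hash is SHA256 applied twice. */
    static void hash256(const unsigned char *p, size_t len,
                        unsigned char out[SHA256_DIGEST_LENGTH])
    {
            unsigned char tmp[SHA256_DIGEST_LENGTH];
            SHA256(p, len, tmp);
            SHA256(tmp, sizeof(tmp), out);
    }

    int main(void)
    {
            unsigned char stripped_tx[6 * 1024] = { 0 }; /* ~6k stripped tx */
            unsigned char digest[SHA256_DIGEST_LENGTH];
            struct timespec a, b;

            clock_gettime(CLOCK_MONOTONIC, &a);
            for (int i = 0; i < 5570; i++)               /* once per input */
                    hash256(stripped_tx, sizeof(stripped_tx), digest);
            clock_gettime(CLOCK_MONOTONIC, &b);

            printf("%.0f msec\n", (b.tv_sec - a.tv_sec) * 1e3
                                  + (b.tv_nsec - a.tv_nsec) / 1e6);
            return 0;
    }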

Problem solved?  Not quite….

This Block Isn’t The Worst Case (For An Optimized Implementation)

As I said above, the amount we have to hash is about 6k; if a transaction has larger outputs, that number changes. We can fit in fewer inputs though. A simple simulation shows the worst case for a 1MB transaction has 3300 inputs and 406000 bytes of output(s): simply doing the hashing for the input signatures takes about 10.9 seconds. That’s only about two or three times faster than the naive bitcoind implementation.

This problem would be far worse if blocks were 8MB: an 8MB transaction with 22,500 inputs and 3.95MB of outputs takes over 11 minutes to hash. If you can mine one of those, you can keep competitors off your heels forever, and own the bitcoin network… Well, probably not. But there’d be a lot of emergency patching, forking and screaming…

Short Term Steps

An optimized implementation in bitcoind is a good idea anyway, and there are three obvious paths:

  1. Optimize the signature hash path to avoid the copy, and hash in place as much as possible.
  2. Use the Intel and ARM optimized SHA256 routines, which increase SHA256 speed by about 80%.
  3. Parallelize the input checking for large numbers of inputs.

Longer Term Steps

A soft fork could introduce an OP_CHECKSIG2, which hashes the transaction in a different order.  In particular, it should hash the input script replacement at the end, so the “midstate” of the hash can be trivially reused.  This doesn’t entirely eliminate the problem, since the sighash flags can require other permutations of the transaction; these would have to be carefully explored (or only allowed with OP_CHECKSIG).
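To make the midstate reuse concrete, here is a sketch using OpenSSL’s incremental SHA256 interface (illustrative only, not a concrete OP_CHECKSIG2 design): the invariant prefix of the transaction is hashed once, and only the per-input tail is hashed from a copied context.

    #include <openssl/sha.h>

    /* Hash the shared transaction prefix once; per input, clone the
     * context, hash only that input's tail, then do the second SHA256. */
    static void sighash_all(const unsigned char *prefix, size_t prefix_len,
                            const unsigned char **tail, const size_t *tail_len,
                            size_t n_inputs,
                            unsigned char (*out)[SHA256_DIGEST_LENGTH])
    {
            SHA256_CTX mid;

            SHA256_Init(&mid);
            SHA256_Update(&mid, prefix, prefix_len);      /* done once */

            for (size_t i = 0; i < n_inputs; i++) {
                    SHA256_CTX ctx = mid;                 /* reuse midstate */
                    unsigned char first[SHA256_DIGEST_LENGTH];

                    SHA256_Update(&ctx, tail[i], tail_len[i]);
                    SHA256_Final(first, &ctx);
                    SHA256(first, sizeof(first), out[i]); /* second pass */
            }
    }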

This soft fork could also place limits on how big an OP_CHECKSIG-using transaction could be.

Such a change will take a while: there are other things which would be nice to change for OP_CHECKSIG2, such as new sighash flags for the Lightning Network, and removing the silly DER encoding of signatures.

July 08, 2015 03:09 AM

July 07, 2015

James Morris: Linux Security Summit 2015 Schedule Published

The schedule for the 2015 Linux Security Summit is now published!

The refereed talks are:

There will be several discussion sessions:

Also featured are brief updates on kernel security subsystems, including SELinux, Smack, AppArmor, Integrity, Capabilities, and Seccomp.

The keynote speaker will be Konstantin Ryabitsev, sysadmin for kernel.org.  Check out his Reddit AMA!

See the schedule for full details, and any updates.

This year’s summit will take place on the 20th and 21st of August, in Seattle, USA, as a LinuxCon co-located event.  As such, all Linux Security Summit attendees must be registered for LinuxCon. Attendees are welcome to attend the Weds 19th August reception.

Hope to see you there!

July 07, 2015 03:04 PM

July 06, 2015

Rusty Russell: Bitcoin Core CPU Usage With Larger Blocks

Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).

The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear.  (A 17MB block took 171 milliseconds).

Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block.  At least it’s scaling linearly, but it’s just slow.

So, 16 Seconds Per 8MB Block?

I did some digging.  Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking[1]…

Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction.  I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB.  On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).

So, 4 Seconds Per 8MB Block?

But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.

So, 1.6 Seconds Per 8MB Block?

I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s libsecp256k1, and an 8MB block.

Let’s Say 1.1 Seconds for an 8MB Block

This is with some assumptions about parallelism, and remember this is on my laptop, which has a fairly low-end CPU. While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.


 

[1] I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.

So I added a print with a sleep, so I could run perf. Then I disabled optimization, so I’d get understandable backtraces with perf. Then I rebuilt perf (which is part of the kernel source package) because Ubuntu’s perf doesn’t demangle C++ symbols. (Are we having fun yet?) I even hacked up a small program to help run perf on just that part of bitcoind. Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.

July 06, 2015 09:58 PM

Matthew Garrett: Anti Evil Maid 2 Turbo Edition

The Evil Maid attack has been discussed for some time - in short, it's the idea that most security mechanisms on your laptop can be subverted if an attacker is able to gain physical access to your system (for instance, by pretending to be the maid in a hotel). Most disk encryption systems will fall prey to the attacker replacing the initial boot code of your system with something that records and then exfiltrates your decryption passphrase the next time you type it, at which point the attacker can simply steal your laptop the next day and get hold of all your data.

There are a couple of ways to protect against this, and they both involve the TPM. Trusted Platform Modules are small cryptographic devices on the system motherboard[1]. They have a bunch of Platform Configuration Registers (PCRs) that are cleared on power cycle but otherwise have slightly strange write semantics - attempting to write a new value to a PCR will append the new value to the existing value, take the SHA-1 of that and then store this SHA-1 in the register. During a normal boot, each stage of the boot process will take a SHA-1 of the next stage of the boot process and push that into the TPM, a process called "measurement". Each component is measured into a separate PCR - PCR0 contains the SHA-1 of the firmware itself, PCR1 contains the SHA-1 of the firmware configuration, PCR2 contains the SHA-1 of any option ROMs, PCR5 contains the SHA-1 of the bootloader and so on.
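Those write semantics can be modelled in a few lines (a model, assuming OpenSSL's SHA1; the real TPM does this internally):

    #include <openssl/sha.h>
    #include <string.h>

    /* A PCR is never set, only extended: new = SHA1(old || measurement). */
    static void pcr_extend(unsigned char pcr[SHA_DIGEST_LENGTH],
                           const unsigned char measurement[SHA_DIGEST_LENGTH])
    {
            unsigned char buf[2 * SHA_DIGEST_LENGTH];

            memcpy(buf, pcr, SHA_DIGEST_LENGTH);
            memcpy(buf + SHA_DIGEST_LENGTH, measurement, SHA_DIGEST_LENGTH);
            SHA1(buf, sizeof(buf), pcr);    /* result replaces the PCR */
    }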

If any component is modified, the previous component will come up with a different measurement and the PCR value will be different. Because you can't directly modify PCR values[2], this modified code will only be able to set the PCR back to the "correct" value if it's able to generate a sequence of writes that will hash back to that value. SHA-1 isn't yet sufficiently broken for that to be practical, so we can probably ignore that. The neat bit here is that you can then use the TPM to encrypt small quantities of data[3] and ask it to only decrypt that data if the PCR values match. If you change the PCR values (by modifying the firmware, bootloader, kernel and so on), the TPM will refuse to decrypt the material.

Bitlocker uses this to encrypt the disk encryption key with the TPM. If the boot process has been tampered with, the TPM will refuse to hand over the key and your disk remains encrypted. This is an effective technical mechanism for protecting against people taking images of your hard drive, but it does have one fairly significant issue - in the default mode, your disk is decrypted automatically. You can add a password, but the obvious attack is then to modify the boot process such that a fake password prompt is presented and the malware exfiltrates the data. The TPM won't hand over the secret, so the malware flashes up a message saying that the system must be rebooted in order to finish installing updates, removes itself and leaves anyone except the most paranoid of users with the impression that nothing bad just happened. It's an improvement over the state of the art, but it's not a perfect one.

Joanna Rutkowska came up with the idea of Anti Evil Maid. This can take two slightly different forms. In both, a secret phrase is generated and encrypted with the TPM. In the first form, this is then stored on a USB stick. If the user suspects that their system has been tampered with, they boot from the USB stick. If the PCR values are good, the secret will be successfully decrypted and printed on the screen. The user verifies that the secret phrase is correct and reboots, satisfied that their system hasn't been tampered with. The downside to this approach is that most boots will not perform this verification, and so you rely on the user being able to make a reasonable judgement about whether it's necessary on a specific boot.

The second approach is to do this on every boot. The obvious problem here is that in this case an attacker simply boots your system, copies down the secret, modifies your system and simply prints the correct secret. To avoid this, the TPM can have a password set. If the user fails to enter the correct password, the TPM will refuse to decrypt the data. This can be attacked in a similar way to Bitlocker, but can be avoided with sufficient training: if the system reboots without the user seeing the secret, the user must assume that their system has been compromised and that an attacker now has a copy of their TPM password.

This isn't entirely great from a usability perspective. I think I've come up with something slightly nicer, and certainly more Web 2.0[4]. Anti Evil Maid relies on having a static secret because expecting a user to remember a dynamic one is pretty unreasonable. But most security conscious people rely on dynamic secret generation daily - it's the basis of most two factor authentication systems. TOTP is an algorithm that takes a seed, the time of day and some reasonably clever calculations and comes up with (usually) a six digit number. The secret is known by the device that you're authenticating against, and also by some other device that you possess (typically a phone). You type in the value that your phone gives you, the remote site confirms that it's the value it expected and you've just proven that you possess the secret. Because the secret depends on the time of day, someone copying that value won't be able to use it later.

But instead of using your phone to identify yourself to a remote computer, we can use the same technique to ensure that your computer possesses the same secret as your phone. If the PCR states are valid, the computer will be able to decrypt the TOTP secret and calculate the current value. This can then be printed on the screen and the user can compare it against their phone. If the values match, the PCR values are valid. If not, the system has been compromised. Because the value changes over time, merely booting your computer gives your attacker nothing - printing an old value won't fool the user[5]. This allows verification to be a normal part of every boot, without forcing the user to type in an additional password.
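The TOTP computation itself is the standard RFC 6238 one. Here's a minimal sketch, assuming OpenSSL (link with -lcrypto); the Anti Evil Maid part, sealing the secret to the TPM PCRs so that only an unmodified boot chain can compute this, is not shown:

    #include <openssl/hmac.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* RFC 6238: 6 digits derived from HMAC-SHA1(secret, time / 30). */
    static unsigned totp(const unsigned char *secret, size_t secret_len,
                         time_t now)
    {
            uint64_t counter = (uint64_t)(now / 30);
            unsigned char msg[8], mac[EVP_MAX_MD_SIZE];
            unsigned int mac_len;

            for (int i = 7; i >= 0; i--) {  /* big-endian counter */
                    msg[i] = counter & 0xff;
                    counter >>= 8;
            }
            HMAC(EVP_sha1(), secret, (int)secret_len,
                 msg, sizeof(msg), mac, &mac_len);

            unsigned off = mac[mac_len - 1] & 0x0f;  /* dynamic truncation */
            uint32_t code = ((uint32_t)(mac[off] & 0x7f) << 24)
                          | ((uint32_t)mac[off + 1] << 16)
                          | ((uint32_t)mac[off + 2] << 8)
                          |  (uint32_t)mac[off + 3];
            return code % 1000000;
    }

    int main(void)
    {
            const unsigned char secret[] = "12345678901234567890"; /* RFC test key */
            printf("%06u\n", totp(secret, sizeof(secret) - 1, time(NULL)));
            return 0;
    }

If the machine and the phone agree on the secret and the clock, they print the same six digits.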

I've written a prototype implementation of this and uploaded it here. Do pay attention to the list of limitations - without a bootloader that measures your kernel and initrd, you're still open to compromise. Adding TPM support to grub is on my list of things to do. There are also various potential issues like an attacker being able to use external DMA-capable devices to obtain the secret, especially since most Linux distributions still ship kernels that don't enable the IOMMU by default. And, of course, if your firmware is inherently untrustworthy there's multiple ways it can subvert this all. So treat this very much like a research project rather than something you can depend on right now. There's a fair amount of work to do to turn this into a meaningful improvement in security.

[1] I wrote about them in more detail here, including a discussion of whether they can be used for general purpose DRM (answer: not really)

[2] In theory, anyway. In practice, TPMs are embedded devices running their own firmware, so who knows what bugs they're hiding.

[3] On the order of 128 bytes or so. If you want to encrypt larger things with a TPM, the usual way to do it is to generate an AES key, encrypt your material with that and then encrypt the AES key with the TPM.

[4] Is that even a thing these days? What do we say instead?

[5] Assuming that the user is sufficiently diligent in checking the value, anyway


July 06, 2015 05:39 PM

Matthew Garrett: Internet abuse culture is a tech industry problem

After Jesse Frazelle blogged about the online abuse she receives, a common reaction in various forums[1] was "This isn't a tech industry problem - this is what being on the internet is like"[2]. And yes, they're right. Abuse of women on the internet isn't limited to people in the tech industry. But the severity of a problem is a product of two separate factors: its prevalence and what impact it has on people.

Much of the modern tech industry relies on our ability to work with people outside our company. It relies on us interacting with a broader community of contributors, people from a range of backgrounds, people who may be upstream on a project we use, people who may be employed by competitors, people who may be spending their spare time on this. It means listening to your users, hearing their concerns, responding to their feedback. And, distressingly, there's significant overlap between that wider community and the people engaging in the abuse. This abuse is often partly technical in nature. It demonstrates understanding of the subject matter. Sometimes it can be directly tied back to people actively involved in related fields. It's from people who might be at conferences you attend. It's from people who are participating in your mailing lists. It's from people who are reading your blog and using the advice you give in their daily jobs. The abuse is coming from inside the industry.

Cutting yourself off from that community impairs your ability to do work. It restricts meeting people who can help you fix problems that you might not be able to fix yourself. It results in you missing career opportunities. Much of the work being done to combat online abuse relies on protecting the victim, giving them the tools to cut themselves off from the flow of abuse. But that risks restricting their ability to engage in the way they need to to do their job. It means missing meaningful feedback. It means passing up speaking opportunities. It means losing out on the community building that goes on at in-person events, the career progression that arises as a result. People are forced to choose between putting up with abuse or compromising their career.

The abuse that women receive on the internet is unacceptable in every case, but we can't ignore the effects of it on our industry simply because it happens elsewhere. The development model we've created over the past couple of decades is just too vulnerable to this kind of disruption, and if we do nothing about it we'll allow a large number of valuable members to be driven away. We owe it to them to make things better.

[1] Including Hacker News, which then decided to flag the story off the front page because masculinity is fragile

[2] Another common reaction was "But men get abused as well", which I'm not even going to dignify with a response


July 06, 2015 05:37 PM

July 03, 2015

Rusty Russell: Wrapper for running perf on part of a program.

Linux’s perf competes with early git for the title of least-friendly Linux tool. Because it’s tied to kernel versions, and the interface changes fairly randomly, you can never figure out how to use the version you need to use (hint: always use -g).

But when it works, it’s very useful.  Recently I wanted to figure out where bitcoind was spending its time processing a block; because I’m a cool kid, I didn’t use gprof, I used perf.  The problem is that I only want information on that part of bitcoind.  To start with, I put a sleep(30) and a big printf in the source, but that got old fast.

Thus, I wrote “perfme.c”. Compile it (requires some trivial CCAN headers) and link perfme-start and perfme-stop to the binary. By default it runs/stops perf record on its parent, but an optional pid argument can be used for other things (eg. if your program is calling it via system(), the shell will be the parent).
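The mechanism is roughly this (a simplified sketch of the idea, not the actual perfme.c):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* perfme-start (sketch): attach "perf record" to our parent process,
     * or to an explicit pid given as argv[1]. perfme-stop would later
     * send perf a SIGINT, on which perf record writes out perf.data. */
    int main(int argc, char *argv[])
    {
            pid_t target = argc > 1 ? (pid_t)atoi(argv[1]) : getppid();
            char pidbuf[32];
            pid_t perf;

            snprintf(pidbuf, sizeof(pidbuf), "%d", (int)target);
            perf = fork();
            if (perf == 0) {
                    execlp("perf", "perf", "record", "-g", "-p", pidbuf,
                           (char *)NULL);
                    _exit(1);               /* exec failed */
            }
            printf("%d\n", (int)perf);      /* stash for perfme-stop */
            return 0;
    }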

July 03, 2015 03:19 AM

June 29, 2015

Paul E. Mc Kenney: In Practice, What is the C Language, Really?

The official definition of the C Language is the standard, but the standard doesn't actually compile any programs. One can argue that the actual implementations are the real definition of the C Language, although further thought along this line usually results in a much greater appreciation of the benefits of having standards. Nevertheless, the implementations usually win any conflicts with the standard, at least in the short term.

Another interesting source of definitions is the opinions of the developers who actually write C. And both the standards bodies and the various implementations do take these opinions into account at least some of the time. Differences of opinion within the standards bodies are sometimes settled by surveying existing usage, and implementations sometimes provide facilities outside the standard based on user requests. For example, relatively few compiler warnings are actually mandated by the standard.

Although one can argue that the standard is the end-all and be-all definition of the C Language, the fact remains that if none of the implementers provide a facility called out by the standard, the implementers win. Similarly, if nobody uses a facility that is called out by the standard, the users win—even if that facility is provided by each and every implementation. Of course, things get more interesting if the users want something not guaranteed by the standard.

Therefore, it is worth knowing what users expect, even if only to adjust their expectations, as John Regehr has done for a number of topics, perhaps most notably signed integer overflow. Some researchers have been taking a more proactive stance, with one example being Peter Sewell's group at the University of Cambridge. This group has put together a survey on padding bytes, pointer arithmetic, and unions. The survey is quite realistic, with “that would be crazy” being a valid answer to a number of the questions.
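To give a flavour of the kind of question involved (my own example, not one lifted from the survey): is memcmp() a meaningful way to compare structs, given padding?

    #include <string.h>

    struct s {
            char c;         /* typically followed by 3 padding bytes */
            int i;
    };

    static int same(struct s a, struct s b)
    {
            /* This also compares the padding bytes, whose contents are
             * unspecified, so the standard gives no guarantee that this
             * agrees with member-by-member equality. */
            return memcmp(&a, &b, sizeof(struct s)) == 0;
    }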

So, if you think you know a thing or two about C's handling of padding bytes, pointer arithmetic, and unions, take the survey!

June 29, 2015 04:46 PM

June 26, 2015

Daniel Vetter: Neat drm/i915 stuff for 4.2

The 4.1 kernel release is still a few weeks off, and hence it's a bit early to talk about 4.2. But the drm subsystem feature cut-off has already passed and I'm going on vacation for 2 weeks, so here we go.

First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around, the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state, to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.2.

There has also been other feature work going on on the modeset side: Ville cleaned up and fixed the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence; maybe this will finally fix the remaining "DP port stuck" issues we still seem to have.

Looking at newer platforms, the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms, SKL now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which was enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for atom chips has been shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label in the i915 driver.

There are also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forwarding requests and handing back results. Mika Kahola has optimized the DP link training: the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work; unfortunately it's still not yet enabled by default. And finally there have been lots of cleanups and improvements under the hood all over, as usual.

A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release.

Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and ever since then paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual environments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.

June 26, 2015 08:58 AM

June 25, 2015

Rusty Russell: Hashing Speed: SHA256 vs Murmur3

So I did some IBLT research (as posted to bitcoin-dev ) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them to create a 16-bit index offset.  Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blopockalypse.

For txid48, we hash an 8 byte seed with the 32-byte txid; I ignored the 8 byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT; implementation in CCAN):

  1. Bitcoin’s SHA256: 527.7+/-0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
  3. Intel’s asm rorx: 314.1+/-0.3 nsec
  4. Intel’s asm SSE4: 337.5+/-0.5 nsec
  5. Intel’s asm RORx-x8ms: 458.6+/-2.2 nsec
  6. Intel’s asm AVX: 336.1+/-0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT.  This is too slow (though it’s fairly trivially parallelizable).  However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.
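For concreteness, here’s a minimal sketch of how the two short hashes could be derived from a single 128-bit murmur3 output. The MurmurHash3_x64_128() signature matches Austin Appleby’s public-domain reference implementation, but the seed placement and bit selection here are my assumptions, not necessarily what the final IBLT code does:

#include <stdint.h>
#include <string.h>

/* Reference implementation from smhasher; link it in separately. */
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

/* Derive a 48-bit short txid and a 16-bit index offset from one
 * murmur3 pass over seed||txid (the layout is an assumption). */
static void short_ids(const uint8_t seed[8], const uint8_t txid[32],
                      uint64_t *txid48, uint16_t *index16)
{
        uint8_t buf[40];
        uint64_t out[2];

        memcpy(buf, seed, 8);
        memcpy(buf + 8, txid, 32);
        MurmurHash3_x64_128(buf, sizeof(buf), 0, out);

        *txid48 = out[0] & 0xFFFFFFFFFFFFULL;   /* low 48 bits */
        *index16 = (uint16_t)out[1];            /* independent 16 bits */
}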

June 25, 2015 07:51 AM

June 24, 2015

Matthew Garrett: Python for remote reconfiguration of server firmware

One project I've worked on at Nebula is a Python module for remote configuration of server hardware. You can find it here, but there's a few caveats:

  1. It's not hugely well tested on a wide range of hardware
  2. The interface is not yet guaranteed to be stable
  3. You'll also need this module if you want to deal with IBM (well, Lenovo now) servers
  4. The IBM support is based on reverse engineering rather than documentation, so who really knows how good it is

There's documentation in the README, and I'm sorry for the API being kind of awful (it suffers rather heavily from me writing Python while knowing basically no Python). Still, it ought to work. I'm interested in hearing from anybody with problems, anybody who's interested in getting it on Pypi and anybody who's willing to add support for new HP systems.

(Edited to update URL after Nebula went out of business and stopped paying for github)

comment count unavailable comments

June 24, 2015 11:56 PM

June 19, 2015

Rusty Russell: Mining on a Home DSL connection: latency for 1MB and 8MB blocks

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 Digital Ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions.

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well).  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).
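To spell out the naive arithmetic (my own back-of-envelope sketch, not a benchmark): with all peers written to simultaneously, each one effectively gets 1/8th of the uplink, so everyone finishes around the 64 second mark rather than one peer finishing every 8 seconds:

#include <stdio.h>

int main(void)
{
        double block_bits = 1000000 * 8;        /* 1MB block */
        double uplink_bps = 1000000;            /* 1Mbit/s up */
        int peers = 8;

        printf("one peer alone:  %.0f sec\n", block_bits / uplink_bps);
        printf("%d peers at once: %.0f sec each\n",
               peers, peers * block_bits / uplink_bps);
        return 0;
}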

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7

Conclusion

Using the rough formula of 1-exp(-t/600), where t is the propagation time in seconds and 600 is the average number of seconds between blocks, I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.
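As a sanity check, the formula is easy to evaluate directly with the first-node propagation times measured above (this reproduces the quoted figures to within rounding):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double t_1mb = 66.8, t_8mb = 501.7;     /* first-node times, sec */

        /* Chance someone else finds a block while ours propagates. */
        printf("1MB: %.1f%%\n", 100 * (1 - exp(-t_1mb / 600)));
        printf("8MB: %.1f%%\n", 100 * (1 - exp(-t_8mb / 600)));
        return 0;
}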

Workarounds

Fixes

 

June 19, 2015 02:37 AM

June 16, 2015

Michael Kerrisk (manpages): Linux/UNIX System Programming course scheduled for September 2015, Munich

I've scheduled a further 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 14-18 September 2015.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory); and network programming (sockets).
     
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via training@man7.org.

June 16, 2015 02:24 PM

June 11, 2015

Pete Zaitcev: Rich and comments

Rich Jones posted an article about being banned by Boing-Boing, supposedly for bringing attention to their use of affiliate links (the practice that Gamergate groups criticized as well — and scored a regulatory win against). Meanwhile, all my comments at Rich's blog are blackholed, which is quite ironic. Generally, I am not into this "blog comment" thing. Ani-nouto never had any comments and is doing great that way. But some people like comments, so I leave them as necessary.

June 11, 2015 05:38 PM

June 10, 2015

LPC 2015: General Registration is now Closed

We’re pleased to announce that thanks to overwhelming support, interest in Linux Plumbers Conference has exceeded expectations.  The downside is that the conference is now officially full. Originally we were going to post a warning today, but a sudden surge of corporate registrations caught us off guard, so we had to close immediately.

If you still haven’t registered but would like to participate, contact us. We are running a waiting list on a first-come, first-served basis, but with priority given to people who have accepted microconference topics. You could also try to use one of the sponsor tickets, if your employer can provide one to you.

We look forward to seeing you in Seattle for a memorable conference.

June 10, 2015 02:14 AM

June 09, 2015

LPC 2015: Graphics Microconference Accepted into 2015 Linux Plumbers Conference

Although the Year of the Linux Desktop has yet to arrive, a surprising number of Linux users nevertheless need graphics support. This is because there have been a number of years of the Linux smartphone, the Linux television, the Linux digital sign/display/billboard, the Linux automobile, and more. This microconference will cover a number of topics including atomic modesetting in KMS, buffer allocation, verified-secure graphics pipelines, fencing and synchronisation, Wayland, and more.

For more information on this important topic, see the wiki page.

June 09, 2015 03:55 PM

June 05, 2015

James Morris: Hiring Subsystem Maintainers

The regular LWN kernel development stats have been posted here for version 4.1 (if you really don’t have a subscription, email me for a free link).  In this, Jon Corbet notes:

over 60% of the changes going into this kernel passed through the hands of developers working for just five companies. This concentration reflects a simple fact: while many companies are willing to support developers working on specific tasks, the number of companies supporting subsystem maintainers is far smaller. Subsystem maintainership is also, increasingly, not a job for volunteer developers.

As most folks reading this would know, I lead the mainline Linux Kernel team at Oracle.  We do have several people on the team who work in leadership roles in the kernel community (myself included), and what I’d like to make clear is that we are actively looking to support more such folk.

If you’re a subsystem maintainer (or acting in a comparable leadership role), please always feel free to contact me directly via email to discuss employment possibilities.  You can also contact Oracle kernel folk who may be presenting or attending Linux conferences.

June 05, 2015 05:34 AM

June 04, 2015

LPC 2015: Thermal Microconference Accepted into 2015 Linux Plumbers Conference

In stark contrast with decades past, thermal issues in computer systems now mean much more than fans and heat sinks, and this microconference looks at some of the things that are now handled in software. The topics include the thermal framework, handling of temperature sensors, and different approaches to handling overtemperature conditions, ranging up to and including closed-loop control. Of course, software that is not tested can be assumed not to work, and the best way to ensure that testing happens when needed is to automate it, so automated testing of thermal subsystems is also on the agenda. Coordination with userspace is useful in order to determine how best to shed computational load, as is coordination among multiple cooling devices.

For more information on this important new-to-Plumbers topic, see the wiki page.

June 04, 2015 04:04 PM

June 03, 2015

LPC 2015: File and Storage Systems Microconference Accepted into 2015 Linux Plumbers Conference

Despite having been in production use for many decades, file and storage systems are very active areas, and have retained the ability to provide many new-technology surprises. This year’s edition of the File and Storage Systems Microconference will look at improved error reporting, filesystem-level encryption in traditional filesystems, SMR drives, online fsck, persistent memory, smart block-layer support in traditional filesystems, better interoperability of and support for NFS and Samba, userspace filesystem innovations, and much else besides.

For more information on this topic, see the wiki page.

June 03, 2015 04:33 PM

Rusty Russell: What Transactions Get Crowded Out If Blocks Fill?

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

Sending 2 Or More Sub-$1 Outputs

Let’s zoom in to the interesting part first, since there’s very little difference before block 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

Since March 2013…

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

Blocksizes Without Small Output Transactions

This eliminates far more transactions.  We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.
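To restate the two classification rules in code (the transaction representation and names here are mine, purely for illustration):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tx {
        size_t num_outputs;
        uint64_t *output_satoshis;
};

/* Conservative rule: two or more outputs under the threshold,
 * so "payment + change" are both small. */
static bool multiple_small_outputs(const struct tx *tx, uint64_t thresh)
{
        size_t i, n = 0;

        for (i = 0; i < tx->num_outputs; i++)
                if (tx->output_satoshis[i] < thresh)
                        n++;
        return n >= 2;
}

/* Aggressive rule: any output under the threshold (this also catches
 * tiny change, which overstates things a little). */
static bool any_small_output(const struct tx *tx, uint64_t thresh)
{
        size_t i;

        for (i = 0; i < tx->num_outputs; i++)
                if (tx->output_satoshis[i] < thresh)
                        return true;
        return false;
}

/* $1 at the $200/BTC price used throughout = 500000 satoshi. */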

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on SatoshiDice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R of bitcointalk.org suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for.  I wish it were different, but we’re going to need other things in the mean time.

I hope this provided useful facts, whatever your opinions.

June 03, 2015 03:57 AM

Rusty Russell: Current Blocksize, by graphs.

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)

Woah! We hit 1M blocks in a month! PAAAANIC!

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

Oh, we’re only halfway there….

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

What! We’re already over 1M blocks?? Maths, you lied to me!

Clearer Graphs: 1 week Moving Average

Actual Weekly Running Average Blocksize

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.
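(For the curious, the smoothing in these graphs is just a trailing mean over a fixed number of blocks: 36 blocks for roughly 6 hours, about 1008 for a week. A minimal sketch, with my own function name:)

#include <stddef.h>

/* Average block size over a trailing window; sizes[] holds one
 * entry per block, in bytes, and i is the current block. */
static double window_avg(const double *sizes, size_t i, size_t window)
{
        size_t j, start = (i + 1 >= window) ? i + 1 - window : 0;
        double sum = 0;

        for (j = start; j <= i; j++)
                sum += sizes[j];
        return sum / (double)(i - start + 1);
}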

June 03, 2015 02:34 AM

June 02, 2015

LPC 2015: Boot, Init, and Config Microconference Accepted into 2015 Linux Plumbers Conference

The combination of security issues, kernel tinification, and a renewed concern about fast boot has intensified focus on system boot, initialization, and configuration, so much so that there is now a Linux Plumbers Conference Microconference focused on these topics.

In addition to secure boot and minimizing size and bloat, this microconference will delve into a number of topics related to boot speed. These topics include tuning systemd for embedded systems, optimizing and/or delaying memory initialization, deferring initcall-based initialization, introducing parallelism and multicore earlier in boot, speeding up early-boot I/O, pre-loading known configurations, speeding up installation, and of course improved timing and tracing analysis earlier in system startup. In short, the fast-boot work has definitely moved into the sub-second realm. A final bonus topic is better configuring for cloud- and container-based workloads.

For more information on this topic, see the wiki page.

June 02, 2015 03:58 PM

June 01, 2015

LPC 2015: Persistent Memory Microconference Accepted into 2015 Linux Plumbers Conference

The topic of persistent memory is back to the future for those of us old enough to have used core memory, but today’s persistent memory boasts densities, speeds, latencies, and capacities that are well beyond the scope even of science fiction back in the core-memory era.

However, with extreme densities, speeds, latencies, and capacities come interesting technical challenges. This microconference will therefore cover the “struct page” problem, performance hotspots in both kernel and userspace I/O fastpaths, managing access mechanisms such as DAX, providing atomic sector updates, and more.

For more information on this topic, see the wiki page.

We hope to see you there!

June 01, 2015 07:28 PM

Rusty Russell: Block size: rate of internet speed growth since 2008?

I’ve been trying not to follow the Great Blocksize Debate raging on reddit.  However, the lack of any concrete numbers has kind of irked me, so let me add one for now.

If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008.  Google led me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t).  So I used the latter, of course:

Akamai’s Average Connection Speed Chart Q4/07 to Q4/14

I tried to pick a range of countries, and here are the results:

Country       % Growth Over 7 Years   Per Annum
Australia     348%                    19.5%
Brazil        349%                    19.5%
China         481%                    25.2%
Philippines   258%                    14.5%
UK            333%                    18.8%
US            304%                    17.2%

 

Countries which had the best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).

If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and 15% thereafter. Or if you’re less conservative, 3.5MB next year, and 17% thereafter.
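Compounded out, those two options look like this (a toy projection, nothing more):

#include <stdio.h>

int main(void)
{
        double conservative = 3.0, optimistic = 3.5;    /* MB in 2016 */
        int year;

        for (year = 2016; year <= 2026; year++) {
                printf("%d: %5.1f MB (15%%)  %5.1f MB (17%%)\n",
                       year, conservative, optimistic);
                conservative *= 1.15;
                optimistic *= 1.17;
        }
        return 0;
}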

June 01, 2015 01:20 AM

May 30, 2015

James Bottomley: DNSSEC, DANE and the failure of X.509

As a few people have noticed, I’m a bit of an internet control freak: In an age of central “cloud based” services, I run pretty much my own everything (blog, mail server, DNS, OpenID, web page etc.).  That doesn’t make me anti-cloud; I just believe in federation instead of centralisation.  In particular, I believe in owning my own content and obeying my own rules rather than those of $BIGCLOUDPROVIDER.

In the modern world, this is perfectly possible: I rent a co-lo site and I have a DNS delegation so I can run and tune my own services how I wish.  I need a secure web server for a few things (OpenID, an email portal, secure log in for my blog etc) but I’ve always used a self-signed certificate.  Why?  Well, having to buy one from a self-appointed X.509 root of trust always really annoyed me.  Firstly because they do very little for the money; secondly because it means effectively giving my security to some self-appointed entity; and thirdly, as all the compromises and misuse attest, the X.509 root of trust model is fundamentally broken.

In the ordinary course of events, none of this would affect me.  However, recently, curl, which is used as the basis of most OpenID implementations, took to verifying X.509 certificate chains, meaning that OpenID simply stopped working for ID providers with self-signed certificates and at a stroke I was locked out of quite a few internet sites.  The only solution is to give up on OpenID or swallow my pride and get a chained X.509 certificate.  Fortunately startssl will issue free certificates and the Linux Foundation is also getting into the game, so the first objection is overcome but not the other two.

So, what’s the answer?  As a supporter of cloud federation, I really like the monkeysphere approach which links ssl certificate verification directly to the user’s personal pgp web of trust.  Unfortunately, that also means that monkeysphere suffers from all the usual web of trust problems, the biggest being that it’s pretty much inaccessible to non-techies who just don’t understand why they should invest time in building up their own trust contacts.  That’s not to say that the web of trust can’t be made accessible in a simple fashion to everyone and indeed google is working on a project along these lines; however, today the reality is that today it isn’t.

Enter DANE.  At its most basic, DANE is a protocol that links certificate verification to the DNS.  It means that because anyone who owns a domain must have a DNS entry somewhere and the ability to modify it, they can directly link their certificate verification to this ability.  To my mind, this represents a nice compromise between making the system simple for end users and the full federation of the web of trust.  The implementation of DANE relies on DNSSEC (which is a royal pain to set up and get right, but I’ll do another blog post about that).  This means that effectively DANE has the same operational model as X.509, because DNSSEC is a hierarchically rooted trust model.  It also means that the delegation record to your domain is managed by your registrar and could be compromised if your registrar is.  However, as long as you trust the DNSSEC root and your registrar, the ability to generate trusted certificates for your domain is delegated to you.  So how is this different from X.509?  Surely abusive registrars could cause similar problems as abusive or negligent X.509 roots?  That’s true, but an abusive registrar can only affect their own domain and delegates; they can’t compromise everyone else (unlike X.509), so to give an example of recent origin: the Chinese registrar could falsify the .cn domain, but wouldn’t be able to issue false certificates for the .com one.  The other reason for hope is that DNSSEC is at the root of the scheme to protect the DNS infrastructure itself from attack.  This makes the functioning and administration of DNSSEC a critical task for ICANN itself, so it’s a fair bet to assume that any abuse by a registrar won’t just result in a light slap on the wrist and a vague threat to delist their certificate in some browsers, but will have ICANN threatening to revoke their accreditation and with it, their entire business model as a domain registrar.

In many ways, the foregoing directly links the business model of the registrars to making DNSSEC work correctly for you.  In theory, the same is true of the X.509 CA roots of trust, of course, but there’s no one sitting at the top making sure they behave (and the fabric of the internet isn’t dependent on securing this behaviour, even if there were).

Details of DANE

In spite of the statements above, DANE is designed to complement X.509 as well as replace it.  DANE has four separate certificate verification styles, two of which integrate with X.509 and solve specific threats in its model (the actual solution is called pinning, a way of protecting yourself from the proliferation of X.509 CAs, all of whom could issue certificates for your site):

Mode 3 is most commonly used to specify an exact certificate outside of the X.509 chain.  Mode 2 can be useful, but the site must have access to an external certificate store (using the DNS CERT records) or the TLSA record must specify the full certificate for it to work.
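To make that concrete, a mode 3 TLSA record pinning an exact certificate looks something like this in a zone file (hypothetical name and digest, following RFC 6698; selector 0 means the full certificate, matching type 1 means SHA-256):

_443._tcp.www.example.com. IN TLSA 3 0 1 8cb0fc6c527506a053f4f14c8464bebbd6dede2738d11468dd953d7d6a3021f1

; the digest is just the SHA-256 of the DER-encoded certificate, e.g.:
; openssl x509 -in cert.pem -outform DER | openssl dgst -sha256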

Who Supports DANE?

This is the big problem:  For certificates distributed via DANE to be useful, there must be support for them in browsers.  For Mozilla, there is the DANE validator extension but, in spite of several attempts, nothing actually built into the browser certificate verifier itself.  The most complete patch set is from the DNSSEC people, and there’s also a Mozilla-inspired one about how they will add it one day, but right at the moment it isn’t a priority.  The Chromium browser has had a similar attitude.  The conspiracy theorists are quick to point out that this is because the browser companies derive considerable revenue from the CA system, which is in itself a multi-billion dollar industry and thus there’s active lobbying against anything that would dilute the power, and hence perceived value, of the CA roots.  There is some evidence for this position in that Google recognises that certificate pinning, which DANE supports, can protect against recent fake google certificate attacks, but, while supporting DNSSEC (at least for validation, the google DNS doesn’t secure itself via DNSSEC), they steadfastly ignore DANE certificate pinning and instead have a private arrangement with Mozilla.

I learned long ago: never to ascribe to malice (or conspiracy) what can be easily explained by incompetence (or political problems).  In this case, the world was split long ago into using openssl for security (in spite of the problematic licence) or using nss (the Netscape Security Services).  Mozilla, of course, uses the latter but every implementation of DANE for mozilla (including the patches in the bugzilla) uses openssl.   I actually have an experimental build of mozilla with DANE, but incorporating the two separate SSL systems is a real pain.  I think it’s safe to say that until someone comes up with an nss-based DANE verifier, the DANE patches for mozilla still aren’t yet up to the starting blocks, and thus conspiracy allegations are somewhat premature.  Unfortunately, the same explanation applies to chromium: for better or worse, it’s currently using nss for certificate validation as well.

May 30, 2015 07:10 PM

May 29, 2015

Pete Zaitcev: Cool hardware in Vancouver

There wasn't much, but more than in Atlanta. The most "pro" looking kit was presented by NEC: basically a bladeserver, but the "blades" are SBCs, each of them accompanied by a dedicated drive card. I can see downsides of this design, but very cute.

Unfortunately, they only offer CPU cards based on Atom. No ARM or anything.

The only other interesting booth belonged to StackVelocity, a subsidiary of JB Circuits that does custom design.

I'm sorry to say, their wares looked decidedly pedestrian, which is to be expected: their sales point is low cost, and stuff of that nature underpins the modern datacenter. One curious thing, however, is the variety of flash cards they offer. Basically Fusion-IO on a budget. One was particularly tricky, having 2 layers. At first I even thought it could have flash chips mounted sideways, but nope, the science of low-cost computing is not there yet.

P.S. NEC also sell the same chassis with CPU cards instead of drive cards under the index "DX1000".

May 29, 2015 06:38 PM

Pete Zaitcev: Semi-hard numbers from Rackspace

Previously in hard numbers: China, Wikimedia, Amazon S3. Rackspace previously reported in creiht's preso 18 months ago. This time, scotty went public at the Vancouver (Liberty) summit with the following:

> 50 billion objects
> 100 PB data (sanitized number, but way higher than 85 PB)
= 6 global clusters
3:1 PUT:GET ratio
10k+ requests/second

The number of objects is roughly 40 times less than in Amazon S3.

May 29, 2015 06:11 PM

LPC 2015: Performance and Scalability Microconference Accepted into 2015 Linux Plumbers Conference

Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been well over ten years since the “free lunch” of exponential CPU-clock frequency increases came to an abrupt end. This microconference will therefore look at futex scaling, address-space scaling, improvements to queued spinlocks, additional lockless algorithms, userspace per-CPU critical sections, and much else besides.

For more information on this topic, please see the wiki page.

May 29, 2015 03:09 AM

May 27, 2015

LPC 2015: Networking Microconference Accepted into 2015 Linux Plumbers Conference

In the past, this has been called the Network Virtualization Microconference, but this year’s instance is branching out to include IPv6 and Security.

Network-virtualization topics include networking for multi-tenant container clusters, reducing network-namespace load on the system, intelligent processing at the edge of the data center, programmable datapath (as in fun with eBPF, OVS, nft, and much else besides), hardware support, and protocol development.

IPv6 topics include performance (can it catch up to IPv4?), consolidation of common IPv4/IPv6 functionality, solving the IPv6 datacenter addressing problem, and providing network virtualization without encapsulation using logical IPv6 overlays.

Security topics include scalable networking security policies for containers, securing applications in multi-host data centers, encryption of overlays, and hardware support.

It appears that coordinated work in all three of these areas is required to make good progress. For more information, see the wiki page.

May 27, 2015 05:29 PM

Matthew Garrett: This is not the UEFI backdoor you are looking for

This is currently the top story on the Linux subreddit. It links to this Tweet which demonstrates using a System Management Mode backdoor to perform privilege escalation under Linux. This is not a story.

But first, some background. System Management Mode (SMM) is a feature in most x86 processors since the 386SL back in 1990. It allows for certain events to cause the CPU to stop executing the OS, jump to an area of hidden RAM and execute code there instead, and then hand off back to the OS without the OS knowing what just happened. This allows you to do things like hardware emulation (SMM is used to make USB keyboards look like PS/2 keyboards before the OS loads a USB driver), fan control (SMM will run even if the OS has crashed and lets you avoid the cost of an additional chip to turn the fan on and off) or even more complicated power management (some server vendors use SMM to read performance counters in the CPU and adjust the memory and CPU clocks without the OS interfering).

In summary, SMM is a way to run a bunch of non-free code that probably does a worse job than your OS does in most cases, but is occasionally helpful (it's how your laptop prevents random userspace from overwriting your firmware, for instance). And since the RAM that contains the SMM code is hidden from the OS, there's no way to audit what it does. Unsurprisingly, it's an interesting vector to insert malware into - you could configure it so that a process can trigger SMM and then have the resulting SMM code find that process's credentials structure and change it so it's running as root.

And that's what Dmytro has done - he's written code that sits in that hidden area of RAM and can be triggered to modify the state of the running OS. But he's modified his own firmware in order to do that, which isn't something that's possible without finding an existing vulnerability in either the OS or (or more recently, and) the firmware. It's an excellent demonstration that what we knew to be theoretically possible is practically possible, but it's not evidence of such a backdoor being widely deployed.

What would that evidence look like? It's more difficult to analyse binary code than source, but it would still be possible to trace firmware to observe everything that's dropped into the SMM RAM area and pull it apart. Sufficiently subtle backdoors would still be hard to find, but enough effort would probably uncover them. A PC motherboard vendor managed to leave the source code to their firmware on an open FTP server and copies leaked into the wild - if there's a ubiquitous backdoor, we'd expect to see it there.

But still, the fact that system firmware is almost entirely closed is still a problem in engendering trust - the means to inspect large quantities of binary code for vulnerabilities is still beyond the vast majority of skilled developers, let alone the average user. Free firmware such as Coreboot gets part way to solving this but still doesn't solve the case of the pre-flashed firmware being backdoored and then installing the backdoor into any new firmware you flash.

This specific case may be based on a misunderstanding of Dmytro's work, but figuring out ways to make it easier for users to trust that their firmware is tamper free is going to be increasingly important over the next few years. I have some ideas in that area and I hope to have them working in the near future.

comment count unavailable comments

May 27, 2015 06:38 AM

May 26, 2015

LPC 2015: Extending the Early Bird Rate through June 5th 2015

We have decided to extend the Early Bird rate through June 5th 2015, to match LinuxCon North America’s dates and to allow more time for people to register, after authors notifications and schedule announcements. The regular rate will now start on June 6th.

May 26, 2015 07:32 PM

May 20, 2015

Pavel Machek: Alcatel Pixi 3.5

Available in the Czech Republic, too, 98 grams, and pretty cheap. On my Nokia n900, the GSM parts died, and hacking the cellphone you are using is a bad idea... So... what about the Pixi? Underpowered hardware, but still more powerful than the n900. Does Firefox OS support wifi tethering by default? Is it reasonably easy to hack? (I guess "apt-get install python" would be too much to ask, but..) Other candidates are Jolla/Sailfish and Ubuntu Phone.

May 20, 2015 05:03 PM

May 19, 2015

Paul E. Mc Kenney: Dagstuhl Seminar: Compositional Verification Methods for Next-Generation Concurrency

Some time ago, I figured out that there are more than a billion instances of the Linux kernel in use, and this in turn led to the realization that a million-year RCU bug is happening about three times a day across the installed base. This realization has caused me to focus more heavily on RCU validation, which has uncovered a number of interesting bugs. I have also dabbled a bit in formal verification, which has not yet found a bug. However, formal verification might be getting there, and might some day be a useful addition to RCU's regression testing. I was therefore quite happy to be invited to this Dagstuhl Seminar. In what follows, I summarize a few of the presentations. See here for the rest of the presentations.

Viktor Vafeiadis presented his analysis of the C11 memory model, including some “interesting” consequences of data races, where a data race is defined as a situation involving multiple concurrent accesses to a non-atomic variable, at least one of which is a write. One such consequence involves a theoretically desirable “strengthening” property. For example, this property would mean that multiplexing two threads onto a single underlying thread would not introduce new behaviors. However, with C11, the undefined-behavior consequences of data races can actually cause new behaviors to appear with fewer threads, for example, see Slide 7. This suggests the option of doing away with the undefined behavior, which is exactly the option that LLVM has taken. However, this approach requires some care, as can be seen on Slide 19. Nevertheless, this approach seems promising. One important takeaway from this talk is that if you are worried about weak ordering, you need to pay careful attention to reining in the compiler's optimizations. If you are unconvinced, take a look at this! Jean Pichon-Pharabod, Kyndylan Nienhuis, and Mike Dodds presented on other aspects of the C11 memory model.

Martin T. Vechev apparently felt that the C11 memory model was too tame, and therefore focused on event-driven applications, specifically javascript running on Android. This presentation included some entertaining concurrency bugs and their effects on the browser's display. Martin also discussed formalizing javascript's memory model.

Hongjin Liang showed that ticket locks can provide starvation freedom given a minimally fair scheduler. This provides a proof point for Björn B. Brandenburg's dissertation, which analyzed the larger question of real-time response from lock-based code. It should also provide a helpful corrective to people who still believe that non-blocking synchronization is required.

Joseph Tassarotti presented a formal proof of the quiescent-state based reclamation (QSBR) variant of userspace RCU. In contrast to previous proofs, this proof did not rely on sequential consistency, but instead leveraged a release-acquire memory model. It is of course good to see researchers focusing their tools on RCU! That said, when a researcher asked me privately whether I felt that the proof incorporated realistic assumptions, I of course could not resist saying that since they didn't find any bugs, the assumptions clearly must have been unrealistic.

My first presentation covered what would be needed for me to be able to use formal verification as part of Linux-kernel RCU's regression testing. As shown on slide 34, these are:


  1. Either automatic translation or no translation required. After all, if I attempt to manually translate Linux-kernel RCU to some special-purpose language every release, human error will make its presence known.
  2. Correctly handle environment, including the memory model, which in turn includes compiler optimizations.
  3. Reasonable CPU and memory overhead. If these overheads are excessive, RCU is better served by simple stress testing.
  4. Map to source code lines containing the bug. After all, I already know that there are bugs—I need to know where they are.
  5. Modest input outside of source code under test. The sad fact is that a full specification of RCU would be at least as large as the implementation, and also at least as buggy.
  6. Find relevant bugs. To see why this is important, imagine that some tool finds 100 different million-year bugs and I fix them all. Because roughly one of six fixes introduces a bug, and because that bug is likely to reproduce in far less than a million years, this process has likely greatly reduced the robustness of the Linux kernel.


I was not surprised to get some “frank and honest” feedback, but I was quite surprised (but not at all displeased) to learn that some of the feedback was of the form “we want to see more C code.” After some discussion, I provided just that.

May 19, 2015 07:30 PM

May 18, 2015

LPC 2015: Extending the Earlybird deadline to 29 May

Somewhere along the way, the deadline for notifications to Authors of the Shared LinuxCon/Plumbers track got pushed out by a week to 25 May.  In the light of that, we’re extending the deadline for Earlybird registration to Friday 29 May to allow anyone who doesn’t get a talk accepted but who still wishes to attend Plumbers to take advantage of the Earlybird registration rate.

May 18, 2015 06:37 AM

May 11, 2015

Pavel Machek: More SSD fun

http://www.techspot.com/article/997-samsung-ssd-read-performance-degradation/

This sheds some light on how tricky multi-level NAND drives are.

May 11, 2015 01:17 PM

Pavel Machek: SSD temperature sensitivity.

http://www.ibtimes.co.uk/ssds-lose-data-if-left-without-power-just-7-days-1500402

If you store SSDs at a higher temperature than they operate at, bad things will happen... Like failure in less than a week. "Enterprise" SSDs are more sensitive to this (I always thought that "enterprise" was a code word for "expensive", but apparently it has other implications, too).

Oh, and those N900 modem problems... it seems it was not the battery. Moving the SIM card to a different phone to track it down...

May 11, 2015 08:56 AM

May 09, 2015

Pavel Machek: Good use for old, 80GB 3.5" hard drive

If it does not work, open it and try to repair it.

If it works, and you are tired of killing working drives...

tar czvf - /data/$1 | aespipe > /mnt/$1.tgz.aes

...fill your hard drive with data you'd like to keep, and bury it in the woods on a moonless night.
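Getting the data back out is the reverse pipe (assuming the same aespipe defaults and passphrase):

aespipe -d < /mnt/$1.tgz.aes | tar xzv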

On an unrelated note... it seems the Nokia N900 does not have as many capacitors as it should. If your battery is too old, it will still be good enough to power most functions, but not the GSM/SIM card parts, resulting in network errors, no calls possible, etc. The problem mysteriously goes away with a newer battery...

May 09, 2015 05:47 PM

May 08, 2015

Daniel Vetter: GFX Kernel Upstreaming Requirements

Upstreaming requirements for the DRM subsystem are a bit special since Dave Airlie requires a full-blown open-source implementation as a demonstration vehicle for any new interfaces added. I figured it's better to clear this up once instead of dealing with the fallout from surprises and made a few slides for a training session. Dave reviewed and acked them, hence this should be the up-to-date rules - the old mails and blogs back from when some ARM SoC vendors tried to push drm drivers for blob userspace to upstream are a bit outdated.

Anyway, here are the slides for my gfx kernel upstreaming requirements training.

May 08, 2015 01:21 PM

James Morris: Linux Security Summit 2015 CFP

The CFP for the 2015 Linux Security Summit (LSS) is now open: see here.

Proposals are due by June 5th, and accepted speaker notifications will go out by June 12th.

LSS 2015 will be held over 20-21 August, in Seattle, WA, USA.

Last year’s event went really well, and we’ll follow a similar format over two days again this year.  We’re co-located again with LinuxCon, and a host of other events including Linux Plumbers, CloudOpen, KVM Forum, and ContainerCon.  We’ve been upgraded to an LF managed event this year, which means we’ll get food.

All LSS attendees, including speakers, must be registered attendees of LinuxCon.   The first round of early registration ends May 29th.

We’d like to cast our net as wide as possible in terms of presentations, so please share this info with anyone you know who’s been doing interesting Linux security development or implementation work recently.

May 08, 2015 11:01 AM

May 07, 2015

Michael Kerrisk (manpages): man-pages-4.00 is released

Version numbers for the current man-pages release had been getting uncomfortably high, so that I'd been thinking about bumping to a new major version for a while, and now that the Linux kernel has just done that, it seems an opportune moment to do likewise. So, here we have it: man-pages-4.00, my 166th man-pages release.

The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, and comments from over 50 contributors. As well as a large number of minor fixes to around 90 man pages, the more significant changes in man-pages-4.00 include the following:

May 07, 2015 10:53 AM

May 06, 2015

Pete Zaitcev: How Mitchell Baker made me to divorce

Well, nearly did. Deleting history in Firefox 37 is very slow and the UI locks up while you do that. "Very slow" means an operation that takes 13 minutes (not exaggerating - it's reproducible). The UI lock-up means a non-dismissable context menu floating over everything; Firefox itself being, of course, entirely unresponsive. See the screencap.

The screencap is from Linux where I confirmed the problem, but the story started on Windows, where my wife tried to tidy up a bit. So, when Firefox locked up, she killed it, and repeated the process a few times. And what else would you do? We are not talking about hanging for seconds - it literally was many minutes. Firefox did not pop up a dialog with "Please wait, deleting 108,534 objects with separate SQLite transactions", a progress gauge, and a "Cancel" button. Instead, it pretended to lock up.

Interestingly enough, remember when Firefox had a default to keep the history for a week? This mode is gone now - FF keeps the history potentially forever. Instead, it offers a technical limit: 108,534 entries are saved in the "Places" database at the most, in order to prevent SQLite from eating all your storage. Now I understand why my brown "visited" links never go back to blue anymore.

The problem is, there's no alternative. I tried to use Midori as my main browser for a month or two in early 2014, but it was a horrible crash city. I had no choice but to give up and go back to Firefox and its case of Featuritis Obesum.

May 06, 2015 08:10 PM

May 05, 2015

Dave Jones: Thoughts on a feedback loop for Trinity.

With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.

It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy. We would need to store the parameters from the last run, and extrapolate out a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.

The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested”. I’ve done some experiments with code coverage in the past. Explorations of the GCOV feature in the kernel didn’t really get very far however for a few reasons (primarily that it really slowed things down too much, and also I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time investment for starting large new projects was limited).

After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and trying to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this. Basically counting the number of instructions executed between system call enter/exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point. I compiled it and ran it a few times.

$ ./a.out 
Measuring instruction count for this printf
Used 3212 instructions
$ ./a.out 
Measuring instruction count for this printf
Used 3214 instructions

Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace however. What happens if we tell it to measure the kernel, and measure say.. getpid().

$ ./a.out 
Used 9283 instructions
$ ./a.out 
Used 9367 instructions

Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince, and asked if he had insight as to why the counters weren’t deterministic across different runs. He had actually written a paper on the subject. Turns out we’re also getting event counts here for page faults, hardware interrupts, timers, etc.
x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.
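For reference, the measurement being discussed is easy to reproduce; here’s a trimmed-down sketch along the lines of Vince’s example (error handling elided, and getpid() invoked via syscall() so glibc caching can’t skip the kernel entry):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr pe;
        long long count;
        int fd;

        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.disabled = 1;
        pe.exclude_user = 1;    /* count kernel instructions only */

        /* No glibc wrapper for perf_event_open; measure this task. */
        fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        syscall(SYS_getpid);    /* the syscall under test */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        read(fd, &count, sizeof(count));
        printf("Used %lld instructions\n", count);
        return 0;
}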

That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.

update:
It occurred to me after posting this that measuring instructions isn’t going to work regardless of the amount of precision the counters offer. Consider a syscall that operates on vma’s for example. Over the lifetime of a process, the number of executed instructions of a call to such a syscall will vary even with the same input parameters, as the lengths of various linked lists that have to be walked will change. Number of instructions, or number of branches taken/untaken etc just isn’t a good match for this idea. Approximating “have we been here before” isn’t really achievable with this approach afaics, so I’m starting to think something like the initial gcov idea is the only way this could be done.

Thoughts on a feedback loop for Trinity. is a post from: codemonkey.org.uk

May 05, 2015 05:41 PM

LPC 2015: Deadline for Refereed Talks Is May 12

The deadline for submission of the refereed talks is now Tuesday, May 12, 2105. The Authors Notification date has been moved to May 26th. Get your proposals in! See details on the Participate page.

May 05, 2015 05:38 PM

May 04, 2015

Dave Jones: kernel code coverage brain dump.

Someone at work recently asked me about code coverage tooling for the kernel. I played with this a little last year. At the time I was trying to figure out just how much of certain syscalls trinity was exercising. I ended up being a little disappointed at the level of post-processing tools to deal with the information presented, and added some things to my TODO list to find some time to hack up something, which quickly bubbled its way to the bottom.

As I did a write-up based on past experiences with this stuff, I figured I’d share.

gcov/gprof
requires kernel built with
CONFIG_GCOV_KERNEL=y
GCOV_PROFILE_ALL=y
GCOV_FORMAT_AUTODETECT=y
Note: Setting GCOV_PROFILE_ALL incurs some performance penalty, so any resulting kernel built with this option should _never_ be used for any kind of performance tests.
I can’t emphasize this enough; it’s miserably slow. Disk operations that took minutes for me now took hours. As an example:

Before:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.409712 s, 1.3 GB/s
0.00user 0.40system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 2980maxresident)k
136inputs+1024000outputs (1major+340minor)pagefaults 0swaps

After:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.17212 s, 84.9 MB/s
0.00user 7.17system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2940maxresident)k
0inputs+1024000outputs (0major+338minor)pagefaults 0swaps

From under half a second to over seven seconds. Ugh.

If we *didn’t* set GCOV_PROFILE_ALL, we’d have to recompile just the files we cared about with the relevant gcc profiling switches. It’s kind of a pain.

For all this to work, gcov expects to see a source tree, with:

After booting the kernel, a subtree appears in debugfs at /sys/kernel/debug/gcov/
These directories mirror the kernel source tree, but instead of source files, now contain files that can be fed to the gcov tool. There will be a .gcda file, and a .gcno symlink back to the source tree (with complete path). Ie, the mm directory under /sys/kernel/debug/gcov (at the full path of the build tree) for example contains (among others..)

-rw------- 1 root root 0 Mar 24 11:46 readahead.gcda
lrwxrwxrwx 1 root root 0 Mar 24 11:46 readahead.gcno -> /home/davej/build/linux-dj/mm/readahead.gcno

It is likely the symlink will be broken on the test machine, because the path doesn’t exist, unless you nfs mount the source code from the built kernel for eg.

I hacked up the script below, which may or may not be useful for anyone else (honestly, it’s way easier to just use nfs).
Run it from within a kernel source tree, and it will populate the source tree with the relevant gcda files, and generate the .gcov output file.

  
#!/bin/sh
# gen-gcov-data.sh
# Copy the gcov data for one source file out of debugfs into the
# source tree, then run gcov on it to generate the .gcov report.

# foo/bar.c -> foo/bar.o; skip files that were never built.
obj=$(echo "$1" | sed 's/\.c$/\.o/')
if [ ! -f "$obj" ]; then
  exit
fi

pwd=$(pwd)
dirname=$(dirname "$1")
gcovfn=$(echo "$(basename "$1")" | sed 's/\.c$/\.gcda/')
if [ -f /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn ]; then
  cp /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn $dirname
  gcov -f -r -o $1 $obj

  # gcov writes the report to the current directory; move it
  # to live alongside the source file.
  if [ -f $(basename $1).gcov ]; then
    mv $(basename $1).gcov $dirname
  fi
else
  echo "no gcov data for /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn"
fi

Take that script, and run it like so..

$ cd kernel-source-tree
$ find . -type f -name "*.c" -exec gen-gcov-data.sh "{}" \;

Running for eg, gen-gcov-data.sh mm/mmap.c will cause gcov to spit out a mmap.c.gcov file (in the current directory) that has coverage information that looks like..

 
   135684:  269:static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
        -:  270:{
   135684:  271:        struct vm_area_struct *next = vma->vm_next;
        -:  272:
   135684:  273:        might_sleep();
   135686:  274:        if (vma->vm_ops && vma->vm_ops->close)
     5080:  275:                vma->vm_ops->close(vma);
   135686:  276:        if (vma->vm_file)
    90302:  277:                fput(vma->vm_file);
        -:  278:        mpol_put(vma_policy(vma));
   135686:  279:        kmem_cache_free(vm_area_cachep, vma);
   135686:  280:        return next;
        -:  281:}

The numbers on the left are the number of times that line of code was executed.
Lines beginning with ‘-‘ have no coverage information for whatever reason.
If a line was never executed (an untaken branch, for example), it gets prefixed with ‘#####’, like so..

 
  4815374:  391:                if (vma->vm_start < pend) {
     #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
         -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }

There are some cases that need a little more digging to explain. eg:

    88105:  237:static void __remove_shared_vm_struct(struct vm_area_struct *vma,
        -:  238:                struct file *file, struct address_space *mapping)
        -:  239:{
    88105:  240:        if (vma->vm_flags & VM_DENYWRITE)
    15108:  241:                atomic_inc(&file_inode(file)->i_writecount);
    88105:  242:        if (vma->vm_flags & VM_SHARED)
        -:  243:                mapping_unmap_writable(mapping);
        -:  244:
        -:  245:        flush_dcache_mmap_lock(mapping);
    88105:  246:        vma_interval_tree_remove(vma, &mapping->i_mmap);
        -:  247:        flush_dcache_mmap_unlock(mapping);
    88104:  248:}

In this example, lines 245 & 247 have no hitcount, even though there’s no way they could have been skipped.
If we look at the definition of flush_dcache_mmap_(un)lock, we see..
#define flush_dcache_mmap_lock(mapping) do { } while (0)
So the compiler never emitted any code, and hence, it gets treated the same way as the blank lines.

There is a /sys/kernel/debug/gcov/reset file that can be written to in order to reset the counters before each test, if desired.

Additional thoughts

kernel code coverage brain dump. is a post from: codemonkey.org.uk

May 04, 2015 02:54 PM

May 03, 2015

LPC 2015: Device Tree Tools, Validation, and Troubleshooting Microconference Accepted into 2015 Linux Plumbers Conference

There have been more than a few spirited discussions on the topic of device trees (described here) over the past few years, and we can probably expect a few more at this year’s Device Tree microconference. The main focus is on programs, scripts, techniques, and core support to help create correct device trees, validate existing device trees, and support troubleshooting of incorrect device trees, drivers, and subsystems. Within that area of focus, topics span the range from inspection to verification/validation to bindings to documentation. This microconference will also examine the impact of overlays, including boot-time and runtime updates to device trees.

May 03, 2015 07:27 AM

May 01, 2015

Dave Jones: Trinity socket improvements

I’ve been wanting to get back to working on the networking related code in trinity for a long time. I recently carved out some time in the evenings to make a start on some of the lower hanging fruit.

Something that bugged me for a while is that we create a bunch of sockets on startup, and then when we call, eg, setsockopt() on one of those sockets, the socket options we pass stand a good chance of not matching the protocol the socket was created for. This isn’t always a bad thing; eg, one of the oldest kernel bugs trinity found was found by setting TCP options on a non-TCP socket. But doing this the majority of the time is wasteful, as we’ll just get -EINVAL back most of the time.
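
As a standalone illustration of the mismatch case (a sketch, not trinity code), here is a TCP-level option meeting a UDP socket:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int one = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);        /* a UDP socket */

        /* TCP_NODELAY is meaningless on UDP; the kernel rejects it.
         * (-ENOPROTOOPT here; plenty of other mismatches yield -EINVAL.) */
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
                printf("setsockopt: %s\n", strerror(errno));

        close(fd);
        return 0;
}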

We actually have the necessary information in trinity to know what kind of socket we are dealing with; it is recorded in a socketinfo struct.

struct socket_triplet {
        unsigned int family;            /* eg, AF_INET */
        unsigned int type;              /* eg, SOCK_STREAM */
        unsigned int protocol;          /* eg, IPPROTO_TCP */
};

struct socketinfo {
        struct socket_triplet triplet;  /* how this socket was created */
        int fd;                         /* the socket itself */
};

We just had it at the wrong level of abstraction. setsockopt only ever saw a file descriptor. We could have searched through the fd arrays looking for the socketinfo that matched, but that seemed like a lame solution. So I changed the various networking syscalls to take an ARG_SOCKETINFO instead of an ARG_FD. As a side-effect, we now actually pass sockets to those syscalls more often than, say, a perf fd, or an epoll fd, or ..

There is still a small chance we pass some crazy fd, just to cover the crazy cases, though those cases don’t tend to trip things up much any more.

After passing down the triplet, it was a simple case of annotating the structures containing the various setsockopt function pointers to indicate which family they belong to. AF_INET was the only complication: it needed special casing due to the multiple protocols for which we have setsockopt() functions. Creating a second table, keyed on the protocol instead of the family, was enough for the matching code.
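
A rough sketch of that shape (standalone, with made-up names; trinity’s actual tables and generator functions differ):

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

/* Stand-ins for the per-family / per-protocol sockopt generators. */
static void tcp_sockopts(void)  { puts("pick a TCP sockopt");  }
static void udp_sockopts(void)  { puts("pick a UDP sockopt");  }
static void unix_sockopts(void) { puts("pick a UNIX sockopt"); }

struct sockopt_table {
        unsigned int id;        /* family, or protocol for AF_INET */
        void (*gen)(void);
};

static const struct sockopt_table family_table[] = {
        { AF_UNIX, unix_sockopts },
        /* ... one entry per supported family ... */
};

static const struct sockopt_table inet_table[] = {
        { IPPROTO_TCP, tcp_sockopts },
        { IPPROTO_UDP, udp_sockopts },
        /* ... */
};

static void pick_sockopt(const struct socket_triplet *t)
{
        const struct sockopt_table *tbl = family_table;
        size_t n = sizeof(family_table) / sizeof(family_table[0]);
        unsigned int key = t->family;
        size_t i;

        /* AF_INET is the special case: match on the protocol instead. */
        if (t->family == AF_INET) {
                tbl = inet_table;
                n = sizeof(inet_table) / sizeof(inet_table[0]);
                key = t->protocol;
        }

        for (i = 0; i < n; i++) {
                if (tbl[i].id == key) {
                        tbl[i].gen();
                        return;
                }
        }
        puts("no match: fall back to something random");
}

int main(void)
{
        struct socket_triplet t = { AF_INET, SOCK_STREAM, IPPROTO_TCP };

        pick_sockopt(&t);       /* prints "pick a TCP sockopt" */
        return 0;
}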

There are still a ton of improvements I want to make to this code, but it’s going to take a while, so it’s good when some mostly trivial changes like the above come together quickly.

Trinity socket improvements is a post from: codemonkey.org.uk

May 01, 2015 04:10 PM

April 30, 2015

Rusty Russell: Some bitcoin mempool data: first look

Previously I discussed the use of IBLTs (on the pettycoin blog).  Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.

Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block.  This gives us some data to chew on when tuning any fast block sync mechanism; here are my first impressions from looking at the data (which is available on github).

These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.

The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).

The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as blockchain.info does, this is 732k.  Thanks sipa!].  Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).
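
Here’s a toy sketch of that first-seen idea (field names and the 750k soft limit are illustrative assumptions, and it’s untested against the logged data):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct mempool_tx {
        time_t first_seen;      /* when this node first saw the tx */
        size_t size;            /* serialized size in bytes */
};

static int by_first_seen(const void *a, const void *b)
{
        const struct mempool_tx *ta = a, *tb = b;

        return (ta->first_seen > tb->first_seen) -
               (ta->first_seen < tb->first_seen);
}

/* Guess a block's contents: oldest transactions first, until the
 * soft limit fills up. (A real selector would skip an oversized tx
 * and keep scanning; this toy just stops.) Returns the tx count. */
static size_t predict_block(struct mempool_tx *txs, size_t ntxs)
{
        const size_t soft_limit = 750000;       /* assumed default */
        size_t used = 0, i;

        qsort(txs, ntxs, sizeof(txs[0]), by_first_seen);
        for (i = 0; i < ntxs; i++) {
                if (used + txs[i].size > soft_limit)
                        break;
                used += txs[i].size;
        }
        return i;
}

int main(void)
{
        struct mempool_tx txs[] = {
                { 1003, 300 }, { 1001, 250 }, { 1002, 400 },
        };

        printf("predicted txs: %zu\n", predict_block(txs, 3));
        return 0;
}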

[Figure: Transactions in Mempool and in Blocks: Australia (poor connection)]

[Figure: Transactions in Mempool and in Blocks: Singapore]

[Figure: Transactions in Mempool and in Blocks: San Francisco]

[Figure: Transactions in Mempool and in Blocks: San Francisco (using Relay Network)]

April 30, 2015 12:26 PM