Kernel Planet

October 09, 2015

Paul E. Mc Kenney: Deep Blue vs. Watson Revisited

Some years back, I speculated on the importance of IBM's Watson. Much has happened since then: Watson won Jeopardy, has been applied to medical applications, and has been made available to numerous business partners to enable them to produce Watson-based offerings. In short, it is long past time for a follow-up.

However, The Economist beat me to the punch in their October 3rd print edition. I doubt that I can improve on their article, so I will confine myself to taking the fair-use liberty of quoting their last sentence:

If it [Watson] can pull that off, a truly disturbing possibility looms: that the next TV show featuring Watson might be “America's Got Talent”.

October 09, 2015 02:31 AM

October 08, 2015

Matthew Garrett: Going my own way

Reaction to Sarah's post about leaving the kernel community was a mixture of terrible and touching, but it's still one of those things that almost certainly won't end up making any kind of significant difference. Linus has made it pretty clear that he's fine with the way he behaves, and nobody's going to depose him. That's unfortunate, because earlier today I was sitting in a presentation at Linuxcon and remembering how much I love the technical side of kernel development. "Remembering" is a deliberate choice of word - it's been increasingly difficult to remember that, because instead I remember having to deal with interminable arguments over the naming of an interface because Linus has an undying hatred of BSD securelevel, or having my name forever associated with the deepthroating of Microsoft because Linus couldn't be bothered asking questions about the reasoning behind a design before trashing it.

In the end it's a mixture of just being tired of dealing with the crap associated with Linux development and realising that by continuing to put up with it I'm tacitly encouraging its continuation, but I can't be bothered any more. And, thanks to the magic of free software, it turns out that I can avoid putting up with the bullshit in the kernel community and get to work on the things I'm interested in doing. So here's a kernel tree with patches that implement a BSD-style securelevel interface. Over time it'll pick up some of the power management code I'm still working on, and we'll see where it goes from there. But, until there's a significant shift in community norms on LKML, I'll only be there when I'm being paid to be there. And that's improved my mood immeasurably.

(Edited to add a context link for the "deepthroating of Microsoft" reference)

October 08, 2015 09:22 AM

James Bottomley: Respect and the Linux Kernel Mailing Lists

I recently noticed that Sarah Sharp resigned publicly from the kernel giving a failure to impose a mandatory code of conduct as the reason and citing interaction problems, mainly on the mailing lists.  The net result of this posting, as all these comments demonstrate, is to imply directly that nothing has ever changed.  This implication is incredibly annoying, firstly because it is actually untrue, secondly because it does more to discourage participation than the behaviour that is being complained about and finally because it totally disrespects and ignores the efforts of hundreds of people who, over the last decade or so, have been striving to improve all interactions around Linux … a rather nice irony given that “respect” is listed as one of the issues for the resignation.  I’d just like to remind everyone of the history of these efforts and what the record shows they’ve achieved.

The issue of respect on the Mailing lists goes way back to the beginnings of Linux itself, but after the foundation of the OSDL (precursor to the Linux Foundation) Technical Advisory Board (TAB), one of its first issues from OSDL member companies was the imbalance between Asian and European/American contributions to the kernel.  The problems were partly to do with Management culture and partly because the lack of respect on the various mailing lists was directly counter to the culture of respect in a lot of Asian countries and disproportionately discouraged contributions from that region.  The TAB largely works behind the scenes, but some aspects of the effort filtered into the public domain as can be seen with a session on developer relations at the 2007 kernel summit (and, in fact, at a lot of other kernel summits since then).  Progress was gradual, and influenced by a large number of people, but the climate did improve.  I have to confess that I don’t follow LKML (not because of the flame war issues, simply because it’s too much of a firehose); however, the lists I do participate in (linux-scsi, linux-ide, linux-mm, linux-fsdevel, linux-efi, linux-arch, linux-parisc) haven’t seen any flagrantly disrespectful and personally insulting posts for several years now.  Indeed, when an individual came along who could almost have been flame bait for this with serial efforts to get incorrect and badly thought out patches into the kernel (I won’t give cites here to avoid stigmatising individuals) they met with a large reserve of patience and respectful and helpful advice before finally being banned from the lists for being incorrigible … no insults or flames at all.

Although I’d love to take credit for some of this, I’ve got to say that I think the biggest influencer towards civility is actually the “professionalisation”  of Linux: Employers pay people to work on Linux but the statements of those people become identified with their employers (no matter how many disclaimers they have) … in many ways, Open Source engineers are the new corporate spokespeople.  All employers bear this in mind when they hire and they certainly look over the mailing lists to see how people behave.  The net result is really that the only people who can afford to be rude or abusive are those who don’t think they have much chance of a long term career in Linux.

So, by and large, I'm proud of the achievements we've made in civility and the way we have improved over the years.  Are we perfect? By no means (but then perfection in such a large community isn't a realistic goal).  However, we have passed our stress test: that an individual with bad patches to several mailing lists was met with courtesy and helpful advice, in spite of serially repeating the behaviour.

In conclusion, I’d just like to note that even the thread that gave rise to Sarah’s desire to pursue a code of conduct is now over two years old and try as they might, no-one’s managed to come up with a more recent example and no-one has actually invoked the voluntary code of conflict, which was the compromise for not having a mandatory code of conduct.  If it were me, I’d actually take that as a sign of success …

October 08, 2015 03:47 AM

October 05, 2015

Pete Zaitcev: Pics Up

On a whim, I posted this week's pictures to the Авиабаза forums. Anglophones are welcome to the pictures at least.

October 05, 2015 07:31 PM

Davidlohr Bueso: acquire/release semantics in the kernel

With the need for better scaling on increasingly larger multi-core systems, we've continued to extend our CPU barriers in the kernel. Two important variants to prevent CPU reordering for lock-free shared memory synchronization are pairs of load/acquire and store/release barriers, also known as LOCK/UNLOCK barriers. These enable threads to cooperate with each other.

Multiple, yet pretty much equivalent, definitions of acquire/release semantics can be found all over the internet, but I like the version from the infamous 'Documentation/memory-barriers.txt' file for three reasons: (i) it is clear and concise; (ii) it explicitly warns that these are the minimum guarantees, and that nothing should be assumed about the reordering of loads and stores before the acquire or after the release, respectively; and (iii) it stresses the need for pairing, and thus portability:
 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory operations after the ACQUIRE operation will appear to happen after the ACQUIRE operation with respect to the other components of the system. ACQUIRE operations include LOCK operations and smp_load_acquire() operations.

     Memory operations that occur before an ACQUIRE operation may appear to happen after it completes.

     An ACQUIRE operation should almost always be paired with a RELEASE operation.

 (6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that all memory operations before the RELEASE operation will appear to happen before the RELEASE operation with respect to the other components of the system. RELEASE operations include UNLOCK operations and smp_store_release() operations.

     Memory operations that occur after a RELEASE operation may appear to happen before it completes.

     The use of ACQUIRE and RELEASE operations generally precludes the need for other sorts of memory barrier (but note the exceptions mentioned in the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE pair is -not- guaranteed to act as a full memory barrier.  However, after an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible.  In other words, within a given variable's critical section, all accesses of all previous critical sections for that variable are guaranteed to have completed.

     This means that ACQUIRE acts as a minimal "acquire" operation and RELEASE acts as a minimal "release" operation.

[Figure: Thread B's ACQUIRE pairs with Thread A's RELEASE. Copyright (C) IBM.]

In lock-speak, all this means is that nothing leaks out of the critical region protected by the primitive in question. A thread attempting to take a lock pairs its ACQUIRE load (for instance via an RmW such as cmpxchg) with the last RELEASE store performed by another thread that is concurrently releasing the lock (for example, setting the lock counter back to 0).
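
To make the pairing concrete, here is a deliberately minimal toy lock -- a sketch built on standard kernel primitives, not the kernel's real spinlock code. The full-barrier cmpxchg() taking the lock provides (at least) the ACQUIRE half, and the smp_store_release() dropping it is the RELEASE it pairs with:

/* Toy sketch only; assumes the usual kernel headers (<linux/atomic.h> and friends). */
static int toy_lock;  /* 0 == unlocked, 1 == locked */

static void toy_lock_acquire(void)
{
  /*
   * The RmW that takes the lock pairs with the RELEASE store in
   * toy_lock_release(): once the 0 -> 1 transition succeeds, everything
   * the previous holder did inside its critical section is visible.
   */
  while (cmpxchg(&toy_lock, 0, 1) != 0)
    cpu_relax();
}

static void toy_lock_release(void)
{
  /*
   * RELEASE: all loads and stores performed while holding the lock are
   * ordered before the store that publishes the unlocked state.
   */
  smp_store_release(&toy_lock, 0);
}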

For v4.2, Will Deacon introduced more relaxed extensions of the traditional atomic operations (including RmWs), which allow finer-grained control over what used to be full-barrier semantics on both sides of the instruction. This covers just about all atomic functions that return a value to the caller, i.e. atomic_*_return(). Weakly ordered architectures can then take advantage of them -- currently only arm64 does, but efforts for PPC are under way.
      - *_relaxed: No ordering guarantees. This is similar to what we have already for the non-return atomics (e.g. atomic_add).
      - *_acquire: ACQUIRE semantics, similar to smp_load_acquire.
      - *_release: RELEASE semantics, similar to smp_store_release.
So we now have goodies such as atomic_cmpxchg_acquire() or atomic_add_return_relaxed(). Most recently, aiming for v4.4, I've ported all our locks to make use of these optimizations, which can save almost half the barriers in the kernel's locking code -- especially nice under low or moderate contention, where the fastpaths are exercised. There are plenty of other examples of real-world code making use of acquire/release semantics, mostly via smp_load_acquire()/smp_store_release(); other primitives also use these semantics as common building blocks (as esoteric as they can get, e.g. RCU).
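
As a purely hypothetical illustration (a sketch, not code from the actual series), a toy trylock only needs ACQUIRE ordering on the way in and RELEASE on the way out, while bookkeeping done next to it may need no ordering at all:

/* Sketch of how the v4.2 {_relaxed,_acquire,_release} variants trim barriers. */
static atomic_t toy_owner = ATOMIC_INIT(0);  /* 0 == free, 1 == held */
static atomic_t toy_stats = ATOMIC_INIT(0);  /* how often the lock was taken */

static bool toy_trylock(void)
{
  /* ACQUIRE only: no barrier is needed before the RmW succeeds. */
  if (atomic_cmpxchg_acquire(&toy_owner, 0, 1) != 0)
    return false;

  /* Bookkeeping with no ordering requirements at all. */
  atomic_add_return_relaxed(1, &toy_stats);
  return true;
}

static void toy_unlock(void)
{
  /* RELEASE only: pairs with the ACQUIRE in the next successful trylock. */
  atomic_cmpxchg_release(&toy_owner, 1, 0);
}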

October 05, 2015 06:54 AM

September 24, 2015

Eric Sandeen: No, XFS won’t steal your money

So, the Inquirer runs a story by Chris Merriman today, titled “GreenDispenser malware threatens to take all your dosh from Linux ATMs” which includes this breathless little gem:

GreenDispenser targets the XFS file system, a popular standard for ATMs, originally designed for IRIX but now widely used in Linux. ATMs that use Windows XP Embedded, which is still supported, are not thought to be at risk.

Of course, I found this interesting, and a bit odd.  Could the XFS filesystem possibly be at fault here?  And is the “large and lots” filesystem really used in ATMs?  Let’s see what Proofpoint, the security firm that discovered it, has to say about the subject:

Specifically, GreenDispenser like its predecessors interacts with the XFS middleware [4], which is widely adopted by various ATM vendors.

That handy link & footnote leads us to Wikipedia, which explains that “XFS middleware” refers to CEN/XFS, which is not in any way related to the XFS filesystem, or Linux, and is in fact Microsoft specific:

CEN/XFS or XFS (eXtensions for Financial Services) provides a client-server architecture for financial applications on the Microsoft Windows platform.

Nice job, Inquirer!  Nice job, Chris Merriman!

(As Jeff points out in the comments, The Inquirer has updated the article as of Sep 25, removing references to Linux and the XFS filesystem.)

September 24, 2015 06:49 PM

Matthew Garrett: Filling in the holes in Linux boot chain measurement, and the TPM measurement log

When I wrote about TPM attestation via 2FA, I mentioned that you needed a bootloader that actually performed measurement. I've now written some patches for Shim and Grub that do so.

The Shim code does a couple of things. The obvious one is to measure the second-stage bootloader into PCR 9. The perhaps less expected one is to measure the contents of the MokList and MokSBState UEFI variables into PCR 14. This means that if you're happy simply running a system with your own set of signing keys and just want to ensure that your secure boot configuration hasn't been compromised, you can simply seal to PCR 7 (which will contain the UEFI Secure Boot state as defined by the UEFI spec) and PCR 14 (which will contain the additional state used by Shim) and ignore all the others.

The grub code is a little more complicated because there are more ways to get it to execute code. Right now I've gone for a fairly extreme implementation. On BIOS systems, the grub stage 1 and 2 will be measured into PCR 9[1]. That's the only BIOS-specific part of things. From then on, any grub modules that are loaded will also be measured into PCR 9. The full kernel image will be measured into PCR 10, and the full initramfs will be measured into PCR 11. The command line passed to the kernel is in PCR 12. Finally, each command executed by grub (including those in the config file) is measured into PCR 13.

That's quite a lot of measurement, and there are probably fairly reasonable circumstances under which you won't want to pay attention to all of those PCRs. But you've probably also noticed that several different things may be measured into the same PCR, and that makes it more difficult to figure out what's going on. Thankfully, the spec designers have a solution to this in the form of the TPM measurement log.

Rather than merely extending a PCR with a new hash, software can extend the measurement log at the same time. This is stored outside the TPM and so isn't directly cryptographically protected. In the simplest form, it contains a hash and some form of description of the event associated with that hash. If you replay those hashes you should end up with the same value that's in the TPM, so for attestation purposes you can perform that verification and then merely check that specific log values you care about are correct. This makes it possible to have a system perform an attestation to a remote server that contains a full list of the grub commands that it ran and for that server to make its attestation decision based on a subset of those.
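
To sketch what that replay looks like for SHA-1 PCRs (assuming, for illustration, a log format that simply carries the extended digest per event, and using OpenSSL for the hashing; the real TCG event log carries more fields than this):

/* Replay a measurement log: start from a zeroed PCR and fold in each logged
 * digest with PCR = SHA1(PCR || digest); if the final value matches what the
 * TPM reports, the individual log entries can be trusted. */
#include <openssl/sha.h>
#include <string.h>

#define PCR_LEN SHA_DIGEST_LENGTH  /* 20 bytes for SHA-1 */

struct log_event {
  unsigned char digest[PCR_LEN];   /* hash that was extended */
  /* the human-readable event description lives alongside, unhashed */
};

int verify_log(const struct log_event *events, size_t n,
               const unsigned char expected_pcr[PCR_LEN])
{
  unsigned char pcr[PCR_LEN] = { 0 };  /* PCRs start out all zeroes */
  unsigned char buf[2 * PCR_LEN];
  size_t i;

  for (i = 0; i < n; i++) {
    memcpy(buf, pcr, PCR_LEN);
    memcpy(buf + PCR_LEN, events[i].digest, PCR_LEN);
    SHA1(buf, sizeof(buf), pcr);  /* extend: new = SHA1(old || digest) */
  }
  return memcmp(pcr, expected_pcr, PCR_LEN) == 0;
}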

No promises as yet about PCR allocation being final or these patches ever going anywhere in their current form, but it seems reasonable to get them out there so people can play. Let me know if you end up using them!

[1] The code for this is derived from the old Trusted Grub patchset, by way of Sirrix AG's Trusted Grub 2 tree.

September 24, 2015 01:21 AM

September 20, 2015

Matthew Garrett: The Internet of Incompatible Things

I have an Amazon Echo. I also have a LIFX Smart Bulb. The Echo can integrate with Philips Hue devices, letting you control your lights by voice. It has no integration with LIFX. Worse, the Echo developer program is fairly limited - while the device's built in code supports communicating with devices on your local network, the third party developer interface only allows you to make calls to remote sites[1]. It seemed like I was going to have to put up with either controlling my bedroom light by phone or actually getting out of bed to hit the switch.

Then I found this article describing the implementation of a bridge between the Echo and Belkin Wemo switches, cunningly called Fauxmo. The Echo already supports controlling Wemo switches, and the code in question simply implements enough of the Wemo API to convince the Echo that there's a bunch of Wemo switches on your network. When the Echo sends a command to them asking them to turn on or off, the code executes an arbitrary callback that integrates with whatever API you want.

This seemed like a good starting point. There's a free implementation of the LIFX bulb API called Lazylights, and with a quick bit of hacking I could use the Echo to turn my bulb on or off. But the Echo's Hue support also allows dimming of lights, and that seemed like a nice feature to have. Tcpdump showed that asking the Echo to look for Hue devices resulted in similar UPnP discovery requests to it looking for Wemo devices, so extending the Fauxmo code seemed plausible. I signed up for the Philips developer program and then discovered that the terms and conditions explicitly forbade using any information on their site to implement any kind of Hue-compatible endpoint. So that was out. Thankfully enough people have written their own Hue code at various points that I could figure out enough of the protocol by searching Github instead, and now I have a branch of Fauxmo that supports searching for LIFX bulbs and presenting them as Hues[2].

Running this on a machine on my local network is enough to keep the Echo happy, and I can now dim my bedroom light in addition to turning it on or off. But it demonstrates a somewhat awkward situation. Right now vendors have no real incentive to offer any kind of compatibility with each other. Instead they're all trying to define their own ecosystems with their own incompatible protocols with the aim of forcing users to continue buying from them. Worse, they attempt to restrict developers from implementing any kind of compatibility layers. The inevitable outcome is going to be either stacks of discarded devices speaking abandoned protocols or a cottage industry of developers writing bridge code and trying to avoid DMCA takedowns.

The dystopian future we're heading towards isn't Gibsonian giant megacorporations engaging in physical warfare, it's one where buying a new toaster means replacing all your lightbulbs or discovering that the code making your home alarm system work is now considered a copyright infringement. Is there a market where I can invest in IP lawyers?

[1] It also requires an additional phrase at the beginning of a request to indicate which third party app you want your query to go to, so it's much more clumsy to make those requests compared to using a built-in app.
[2] I only have one bulb, so as yet I haven't added any support for groups.

September 20, 2015 09:22 PM

September 18, 2015

Daniel Vetter: XDC 2015: Atomic Modesetting for Drivers

I've done a talk at XDC 2015 about atomic modesetting with a focus for driver writers. Most of the talk is an overview of how an atomic modeset looks and how to implement the different parts in a driver backend. Anyway, for all those who missed it, there's a video and slides.

September 18, 2015 03:27 PM

September 11, 2015

Pete Zaitcev: TLS Security In Firefox 40

What do people at Mozilla think is going to happen when I need to access a website and Firefox says that TLS parameters are insecure and thus I cannot? I'm going to use Chrome, that's what. Or maybe even a hacked Midori, where I can adjust build-time parameters of gcr.

That company went way downhill when they kicked Eich out.

September 11, 2015 06:33 PM

September 07, 2015

Daniel Vetter: Neat drm/i915 stuff for 4.3

Kernel 4.2 is already released and the 4.3 merge window is in full swing, so it's time to look at what's in it for the intel graphics driver.

Biggest thing for sure is that Skylake is finally out of preliminary support and enabled by default. The reason for the long hold-up was some ABI fumble - the hardware exposes the topmost plane both through the new universal plane registers and the legacy cursor registers, and because we simply carried the legacy plane code around in the driver we ended up exposing both. This wasn't a big thing to take care of, but somehow it dragged on forever.

The other big thing is that legacy modesets are now done with the new atomic modesetting code driver-internally. Atomic support in i915.ko isn't fully ready for prime time yet, but this is definitely a big step forward. Besides atomic there are also other cross-platform improvements in the modeset code: Ville fixed up the 12bpc support for HDMI, which is now used by default if the screen supports it. Mika Kahola and Ville also implemented dynamic adjustment of the cdclk, which is the main clock source for display engines on intel graphics. And there's a big difference in the clock speeds needed between e.g. a 4k screen and a 720p TV.

Continuing with power saving features, Rodrigo again spent a lot of time fixing up PSR (panel self refresh), and Paulo did the same by writing patches to improve FBC (framebuffer compression). We have some really solid testcases by now; unfortunately, neither feature is ready to be enabled by default yet. PSR especially is still plagued by screen freezes on some random systems. There have also been some fixes to DRRS (dynamic refresh rate switching) from Ramalingam. DRRS is already enabled by default, where supported. And finally there are some improvements to make the frontbuffer rendering tracking more accurate, which is used by all three of these display power saving features.

And of course there's also tons of improvements to platform code. Display PLL code for Skylake and Valleyview/Cherryview was tuned by Damien and Ville respectively. There's been tons of work on Broxton and DSI support by Imre, Gaurav and others.

Moving on to the rendering side, the big change is how tracking of rendering tasks is handled. In the past the driver just used raw sequence numbers emitted by the hardware, but for cross-driver synchronization and reordering tasks with an eventual gpu scheduler more abstraction is needed. A big step is converting over to the i915 request structure completely, done by John Harrison. The next step will be to switch the internal implementation for i915 requests to the cross-driver fences, but that's for future kernels. As a follow-up cleanup John also removed the OLR, which stands for outstanding lazy request. It was a neat little trick implemented years ago to simplify handling error recovery, but it caused tons of pain with subtle bugs. Making requests more explicit in the driver finally allowed us to remove this trick.

There's also been a pile of platform related features: MOCS programming for Skylake/Broxton (which is used for caching control). Resource streamer support from Abdiel, which is used to offload some of the buffer object tracking for shaders from the cpu to the gpu. And the command parser on Haswell was extended to support atomic instructions in shaders. And finally for Skylake Mika Kuoppala added code to avoid resetting the gpu - in certain cases the hardware would hard-hang the entire system trying to execute the reset. And a dead gpu is still better than a dead system.

September 07, 2015 09:40 AM

September 04, 2015

Andy Grover: RHEL 7.2 has an updated kernel target

As mentioned in the beta release notes, the kernel in RHEL 7.2 contains a rebased LIO kernel target, to the equivalent of the Linux 4.0.stable series.

This is a big update. LIO has improved greatly since 3.10. It has added support for SCSI features that enable VMWare VAAI support, as well as data integrity (DIF), and significant iSER work, for those of you using Infiniband. (SRP is also supported, as well as iSCSI and FCoE, of course.)

Note that we still do not ship support for the Fibre Channel qla2xxx fabric. It still seems to be something storage vendors and integrators want, more than a feature our customers are telling us they want in RHEL.

(On a side note, Infiniband hardware is pretty affordable these days! For all you datacenter hobbyists who have a rack in the garage, I might suggest a cheap previous-gen IB setup and either SRP or iSER as the way to go and still get really high IOPs.)

Users of RHEL 7’s SCSI target should find RHEL 7.2 to be a very nice upgrade. Please try the beta out and report any issues you find of course, but it’s looking really good so far.

September 04, 2015 09:50 PM

Pavel Machek: Wifi fun and misc..

(And apology for the SSD entry some time back. Apparently yes, they can fail to retain data after less than a week... at the very end of their lifetime.)

In the last weeks, I learned that transferring real-time data over WiFi is way more fun than I thought. And that it is possible to communicate from inside a (closed) microwave oven using 2.4GHz WiFi. I don't know about you, but it scares me a little.

N900 and not everything is a file

Pocket Computer. We had pocket computers before ... the Sharp Zaurus line was a prominent example. They had keyboards and resistive touchscreens... A resistive touchscreen with a stylus is accurate enough to serve as a mouse replacement. Unfortunately, such machines are slowly going extinct. Sure, we have quad-core Full-HD smartphones these days... but they lack keyboards (making ssh from them impossible), they lack an accurate pointing device, and they are really phones, not small computers. The N900 can almost be used as a pocket computer...

New Mer is "broken beyond repair" for n900.. as it uses qt5.  qt4 works well (well... little slow) on n900, but qt5 needs stable egl
drivers. Ok, so that was another nice-looking trap. I'm starting to think that text-only user interface is right thing to do on n900 at
this point.
Baking the n900 for 15 minutes at 250C seems to have fixed the "no sim card" problem... for a week. It now seems a bit flaky, but definitely better than before the baking. Thanks to everyone at Czech BrmLab!
To back up the mmc card on the N900, I'd like to do rsync root@maemo:/dev/mmcblk1 mmcblk1.img ... but that does not work, as rsync is too clever and refuses to transfer the contents of special files. Is there a trick I'm missing?

On the n900 front... it has 256MiB RAM and an 800x480 screen. What web browser would you recommend for that? I tried links2, but its support is not good enough for properly working pages... which I'd kind of like.

Linus, please reconsider -rc0

Hmm. There's a big difference between 4.1 (expected to be a pretty stable kernel) and 4.2-rc0 (which is probably going to be as unstable as it gets). Unfortunately, Linus does not change the Makefile before merging, so it is quite tricky to tell whether
Linux amd 4.1.0 #25 SMP Wed Jul 1 11:20:22 CEST 2015 x86_64 GNU/Linux
is the expected-to-be-stable 4.1, or the expected-to-be-very-unstable 4.2-rc0...

It's tempting to name your branches simply "v4.1", "v3.11". Don't. When the -rc's are done, Linus will create a "v4.1" tag, and you'll have fun figuring out what went wrong in your git.

Google play bloatware

I got a very cheap LG Optimus Chic.. and Android did improve since the G1 days. It's still Google's spying empire, but.. at least it is fluid and mostly works.
Not sure what "Google Play services" are good for, but taking 50MB of internal flash is not funny.. and when moved to SD card, the SD card tends to disconnect. "Google Play Store" still works without them. "My Tracks" need them, but 60MB of flash is not reasonable price to pay for GPX recording. "Pubtran" got removed, too. MHDdroid has strange interface, but perhaps it will not need that much storage.
Do you know a way to search Czech public transport without Android and without a desktop browser or Opera Mini? The mobile version just leads to the "full" version.

And ...dear Android, a "force close" dialog is the last thing I want to see after hearing the ringtone. If you could at least add the number to the call log...

Feeling cheated

Wed Jul  1 01:59:58 CEST 2015
Wed Jul  1 01:59:59 CEST 2015
Wed Jul  1 02:00:00 CEST 2015
Wed Jul  1 02:00:01 CEST 2015
Wed Jul  1 02:00:02 CEST 2015
Wed Jul  1 02:00:03 CEST 2015
Different power supply for X60

The Thinkpad X60 is marked as 20V, 3.25A. I wonder if using a 19V, 2.63A power supply is a good idea. The power brick is way smaller, and 65W seems to be a little high for a small notebook.
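
For what it's worth, the arithmetic behind that comparison (rated figures from the bricks, not measured output): 20 V x 3.25 A = 65 W for the original supply versus 19 V x 2.63 A = ~50 W for the smaller one.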

September 04, 2015 10:04 AM

September 03, 2015

Gustavo F. Padovan: Linux Kernel Engineer opportunity at Collabora!

Collabora is a software consultancy specialising in bringing companies and the open source software community together, and it is currently looking for a Core Software Engineer to work on the Linux kernel and/or all the plumbing around it. In this role the engineer will be part of a worldwide team that works with our clients to solve their Linux kernel and low-level stack technical problems.

Collabora is well-known for its strong relationship with upstream development, so an important part of this role is making significant contributions to upstream projects.

Visit our jobs page, or talk to me and I'll put you in contact with our Hiring Team!

September 03, 2015 08:44 PM

Paul E. Mc Kenney: Stupid RCU Tricks: Hand-over-hand traversal of linked list using SRCU

Suppose that a very long linked list was to be protected with SRCU. Let's also make the presumably unreasonable assumption that this list is so long that we don't want to stay in a single SRCU read-side critical section for the whole traversal.

So why not try hand-over-hand SRCU protection, as shown in the following code fragment?

struct foo {
  struct list_head list;
  ...
};

LIST_HEAD(mylist);
struct srcu_struct mysrcu;

void process(void)
{
  int i1, i2;
  struct foo *p;

  i1 = srcu_read_lock(&mysrcu);
  list_for_each_entry_rcu(p, &mylist, list) {
    do_something_with(p);
    i2 = srcu_read_lock(&mysrcu);
    srcu_read_unlock(&mysrcu, i1);
    i1 = i2;
  }
  srcu_read_unlock(&mysrcu, i1);
}

The trick is that on each pass through the loop, we enter a new SRCU read-side critical section, then exit the old one. That way the entire traversal is protected by SRCU, but each SRCU read-side critical section is quite short, covering traversal of but a single element of the list.

As is customary with SRCU, the list is manipulated using list_add_rcu(), list_del_rcu(), and friends.
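
For completeness, here is a minimal sketch of the matching updater, assuming a hypothetical my_lock spinlock serializing updates and that elements are freed with kfree():

void remove_foo(struct foo *p)
{
  spin_lock(&my_lock);    /* hypothetical lock serializing updaters */
  list_del_rcu(&p->list);
  spin_unlock(&my_lock);

  /* Wait for pre-existing SRCU readers, including any hand-over-hand
   * traversal currently referencing p, before it is safe to free. */
  synchronize_srcu(&mysrcu);
  kfree(p);
}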

What are the advantages and disadvantages of this hand-over-hand SRCU list traversal?

September 03, 2015 05:20 AM

August 31, 2015

Matthew Garrett: Working with the kernel keyring

The Linux kernel keyring is effectively a mechanism to allow shoving blobs of data into the kernel and then setting access controls on them. It's convenient for a couple of reasons: the first is that these blobs are available to the kernel itself (so it can use them for things like NFSv4 authentication or module signing keys), and the second is that once they're locked down there's no way for even root to modify them.

But there's a corner case that can be somewhat confusing here, and it's one that I managed to crash into multiple times when I was implementing some code that works with this. Keys can be "possessed" by a process, and have permissions that are granted to the possessor orthogonally to any permissions granted to the user or group that owns the key. This is important because it allows for the creation of keyrings that are only visible to specific processes - if my userspace keyring manager is using the kernel keyring as a backing store for decrypted material, I don't want any arbitrary process running as me to be able to obtain those keys[1]. As described in keyrings(7), keyrings exist at the session, process and thread levels of granularity.

This is absolutely fine in the normal case, but gets confusing when you start using sudo. sudo by default doesn't create a new login session - when you're working with sudo, you're still working with key possession that's tied to the original user. This makes sense when you consider that you often want applications you run with sudo to have access to the keys that you own, but it becomes a pain when you're trying to work with keys that need to be accessible to a user no matter whether that user owns the login session or not.

I spent a while talking to David Howells about this and he explained the easiest way to handle this. If you do something like the following:
$ sudo keyctl add user testkey testdata @u
a new key will be created and added to UID 0's user keyring (indicated by @u). This is possible because the keyring defaults to 0x3f3f0000 permissions, giving both the possessor and the user read/write access to the keyring. But if you then try to do something like:
$ sudo keyctl setperm 678913344 0x3f3f0000
where 678913344 is the ID of the key we created in the previous command, you'll get permission denied. This is because the default permissions on a key are 0x3f010000, meaning that the possessor has permission to do anything to the key but the user only has permission to view its attributes. The cause of this confusion is that although we have permission to write to UID 0's keyring (because the permissions are 0x3f3f0000), we don't possess it - the only permissions we have for this key are the user ones, and the default state for user permissions on new keys only gives us permission to view the attributes, not change them.
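
For reference, those two masks decompose into the keyutils.h constants as follows (the macro names here are just for illustration; possessor bits occupy the top byte of the mask, user bits the byte below it):

#include <keyutils.h>

/* Default permissions on a newly created key: the possessor can do anything,
 * but the owning user can only view the key's attributes. */
#define NEW_KEY_PERMS     (KEY_POS_ALL | KEY_USR_VIEW)  /* 0x3f010000 */

/* What we want to end up with: full access for possessor and user alike. */
#define WANTED_KEY_PERMS  (KEY_POS_ALL | KEY_USR_ALL)   /* 0x3f3f0000 */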

But! There's a way around this. If we instead do:
$ sudo keyctl add user testkey testdata @s
then the key is added to the current session keyring (@s). Because the session keyring belongs to us, we possess any keys within it and so we have permission to modify the permissions further. We can then do:
$ sudo keyctl setperm 678913344 0x3f3f0000
and it works. Hurrah! Except that if we log in as root, we'll be part of another session and won't be able to see that key. Boo. So, after setting the permissions, we should:
$ sudo keyctl link 678913344 @u
which ties it to UID 0's user keyring. Someone who logs in as root will then be able to see the key, as will any processes running as root via sudo. But we probably also want to remove it from the unprivileged user's session keyring, because that's readable/writable by the unprivileged user - they'd be able to revoke the key from underneath us!
$ sudo keyctl unlink 678913344 @s
will achieve this, and now the key is configured appropriately - UID 0 can read, modify and delete the key, other users can't.
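
The same dance can be done programmatically. Here's a minimal sketch using libkeyutils (build with -lkeyutils and run the binary via sudo), mirroring the keyctl commands above:

#include <keyutils.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  /* Create the key in the session keyring so that we possess it. */
  key_serial_t key = add_key("user", "testkey", "testdata", 8,
                             KEY_SPEC_SESSION_KEYRING);
  if (key < 0) {
    perror("add_key");
    return EXIT_FAILURE;
  }

  /* We possess the key, so we're allowed to change its permissions. */
  if (keyctl_setperm(key, 0x3f3f0000) < 0)
    perror("keyctl_setperm");

  /* Make it visible to root's other sessions via the user keyring... */
  if (keyctl_link(key, KEY_SPEC_USER_KEYRING) < 0)
    perror("keyctl_link");

  /* ...and drop it from the unprivileged user's session keyring. */
  if (keyctl_unlink(key, KEY_SPEC_SESSION_KEYRING) < 0)
    perror("keyctl_unlink");

  printf("key id: %d\n", key);
  return 0;
}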

This is part of our ongoing work at CoreOS to make rkt more secure. Moving the signing keys into the kernel is the first step towards rkt no longer having to trust the local writable filesystem[2]. Once keys have been enrolled the keyring can be locked down - rkt will then refuse to run any images unless they're signed with one of these keys, and even root will be unable to alter them.

[1] (obviously it should also be impossible to ptrace() my userspace keyring manager)
[2] Part of our Secure Boot work has been the integration of dm-verity into CoreOS. Once deployed this will mean that the /usr partition is cryptographically verified by the kernel at runtime, making it impossible for anybody to modify it underneath the kernel. / remains writable in order to permit local configuration and to act as a data store, and right now rkt stores its trusted keys there.

August 31, 2015 05:18 PM

August 26, 2015

James Morris: Linux Security Summit 2015 – Wrapup, slides

The slides for all of the presentations at last week’s Linux Security Summit are now available at the schedule page.

Thanks to all of those who participated, and to all the events folk at Linux Foundation, who handle the logistics for us each year, so we can focus on the event itself.

As with the previous year, we followed a two-day format, with most of the refereed presentations on the first day, with more of a developer focus on the second day.  We had good attendance, and also this year had participants from a wider field than the more typical kernel security developer group.  We hope to continue expanding the scope of participation next year, as it’s a good opportunity for people from different areas of security, and FOSS, to get together and learn from each other.  This was the first year, for example, that we had a presentation on Incident Response, thanks to Sean Gillespie who presented on GRR, a live remote forensics tool initially developed at Google.

The keynote by sysadmin Konstantin Ryabitsev was another highlight, one of the best talks I've seen at any conference.

Overall, it seems the adoption of Linux kernel security features is increasing rapidly, especially via mobile devices and IoT, where we now have billions of Linux deployments out there, connected to everything else.  It’s interesting to see SELinux increasingly play a role here, on the Android platform, in protecting user privacy, as highlighted in Jeffrey Vander Stoep’s presentation on whitelisting ioctls.  Apparently, some major corporate app vendors, who were not named, have been secretly tracking users via hardware MAC addresses, obtained via ioctl.

We’re also seeing a lot of deployment activity around platform Integrity, including TPMs, secure boot and other integrity management schemes.  It’s gratifying to see the work our community has been doing in the kernel security/ tree being used in so many different ways to help solve large scale security and privacy problems.  Many of us have been working for 10 years or more on our various projects  — it seems to take about that long for a major security feature to mature.

One area, though, that I feel we need significantly more work, is in kernel self-protection, to harden the kernel against coding flaws from being exploited.  I’m hoping that we can find ways to work with the security research community on incorporating more hardening into the mainline kernel.  I’ve proposed this as a topic for the upcoming Kernel Summit, as we need buy-in from core kernel developers.  I hope we’ll have topics to cover on this, then, at next year’s LSS.

We overlapped with Linux Plumbers, so LWN was not able to provide any coverage of the summit.  Paul Moore, however, has published an excellent write-up on his blog. Thanks, Paul!

The committee would appreciate feedback on the event, so we can make it even better for next year.  We may be contacted via email per the contact info at the bottom of the event page.

August 26, 2015 07:09 PM

August 25, 2015

Gustavo F. Padovan: Collabora contributions to Linux Kernel 4.2

A total of 63 patches were contributed upstream by Collabora engineers as part of our current projects.

In the ARM multi_v7_defconfig we added support for Exynos Chromebooks; all options that had a tristate Kconfig entry were enabled as modules. After this change it was found that a few drivers weren't working properly when built as modules, so that was fixed. This work was done by Javier Martinez.

Javier also added multi-EC support, as newer Chromebooks have more than one Embedded Controller in the system.

Tomeu Vizoso added EMC (External Memory Controller) support to the Tegra124 platform.

On the DRM side, initial support for Atomic Modesetting was added to Exynos devices by Gustavo Padovan. The Atomic Modesetting interface allows all screen updates, such as mode changes, pageflips and setting planes/cursors, to happen in the same IOCTL, so everything can be updated atomically. More on that can be found in Daniel Vetter's post. Another contribution to Atomic Modesetting, from Daniel Stone, was the addition of the CRTC state mode property; it is through this property that userspace configures the modeset that will be applied via an Atomic Modesetting ioctl.

Following is a list of all patches submitted by Collabora for this kernel release:

Daniel Stone (17):

Gustavo Padovan (17):

Javier Martinez Canillas (19):

Tomeu Vizoso (11):

August 25, 2015 01:47 PM

August 24, 2015

Davidlohr Bueso: LPC 2015: Performance and Scalability MC

This year I had the privilege of leading the Performance and Scalability micro-conference for Linux Plumbers. The goals and motivation behind organizing this track were threefold. First, present relevant work-in-progress ideas that can improve performance in core kernel subsystems and need some face-to-face discussion -- as such, this requires previous debate on lkml. Second, learn about real bottlenecks and issues people are running into. And finally, get to know more of the relevant academic (experimental) work going on in both the kernel and system-level userland. As such, the sessions were grouped as follows:

(i) Fast Bounded-Concurrency Hash Tables. Samy Bahra introduced a novel non-blocking multi-reader/single-writer hash table with strong forward-progress guarantees for TSO. Because the common-case fastpath does not incur barriers or atomic operations, this technique allows nearly perfect scaling. While his work is done in userspace, he sees potential for it in the kernel, such as in the networking subsystem. In such situations, RCU (readers being the common case) might also be used.

(ii) Improving Transactional Memory Performance with Queued Locking. While transactional memory works nicely in conflict-free setups, it ends up requiring common serialization otherwise. An option is to retry; however, when the number of threads executing in the critical region is larger than the number of completed threads, you can get pileups. Tim Chen presented a solution based on applying a sort of 'aperture' and using MCS-based principles for fair queuing, where entry into the critical region can be regulated based on metrics such as the number of threads in the critical region and the abort rate.

(iii) How to Apply Mutation Testing to RCU. Iftekhar Ahmed from OSU summarized his research on overcoming the limitations of mutation testing in order to identify problems in RCU. As usual, working with Paul McKenney, they have been able to identify a number of mutants, making use of rcutorture for specific periods of time. They generated ~3300 mutants from RCU, and rcutorture is doing a good job identifying them. It would be interesting to see this applied along with fuzz testing, which has already uncovered several bugs in RCU in the past.

[Photo: Scaling track -- LPC'15, Seattle.]

(iv) Unfair Queued Spinlocks and Transactional Locks. Waiman Long has been working on extending spinlocks and applying them to solve issues with transactional memory. He presented experiments based on rwlocks and a transactional spinlock (a new primitive) for transactional (reader) and non-transactional (writer) executions. This talk nicely complemented Tim Chen's previous presentation. He also touched on qspinlock performance in virtualized environments and the challenges currently out there. As we already have code for this, it was much easier to discuss face to face. Consensus in the room was that kernel developers are not against improving pv spinlocks, but it was made clear that we will not accept a third primitive.

(v) Do Virtual Machines Really Scale? Sanidhya Kashyap from GA Tech showed us the state of scalability in the cloud, where there is a clear trend of services hitting poor scalability after certain degrees of contention/core count. These are LHP (lock-holder preemption) issues, and vmexits/enters cause performance problems at high vcpu counts. He introduced oticket, backed by performing multiple wakeups at once when granting the lock. There was good feedback, along with suggestions for overcoming some of the presented issues with the approach. This was an extra-short, BoF-like presentation, but there was quite a bit of interest, and the appropriate people were in the room.

Overall I would say that all three objectives were met and the quality of the sessions was high, thus meeting all expectations (if not, please email me with feedback ;-). In fact, there were some highly interesting and relevant presentations that, due to time constraints, had to be left out.

August 24, 2015 09:05 PM

August 19, 2015

Matt Domsch: Dell Desktop / Notebook Linux Engineering position available

Come help Dell ensure Linux “just works!” on Dell notebooks, desktops, and devices! The Dell Client Linux Engineering team has an opening for a Senior Software Engineer. This team works closely with the Linux community, device manufacturers, and Dell engineering teams to provide the best Linux experience across the entire client product line.

Visit the Dell Jobs site to apply. If you’re a friend of mine and are interested, drop me a line and I’ll make sure you get in front of the hiring manager quickly!

August 19, 2015 09:31 PM

LPC 2015: Bird-of-a-Feather Sessions

We have a great slate of bird-of-a-feather (BoF) sessions on Thursday evening! However, there are still a few BoF slots left, so proposals are still welcome here. First come, first served!

August 19, 2015 09:22 PM

August 18, 2015

Matthew Garrett: Canonical's deliberately obfuscated IP policy

I bumped into Mark Shuttleworth today at Linuxcon and we had a brief conversation about Canonical's IP policy. The short summary:

The even shorter summary: Canonical won't clarify their IP policy because they believe they can make more money if they don't.

Why do I keep talking about this? Because Canonical are deliberately making it difficult to create derivative works, and that's one of the core tenets of the definition of free software. Their IP policy is fundamentally incompatible with our community norms, and that's something we should care about rather than ignoring.

August 18, 2015 07:02 PM

August 17, 2015

Andi Kleen: Announcing simple-pt — A simple Processor Trace implementation

Modern Intel Core CPUs (5th and 6th generation) have an Intel Processor Trace (PT) feature to trace branch execution with low overhead. This is useful for performance analysis and debugging.

simple-pt is a simple standalone driver and decoder tool to implement PT on Linux.

Starting with Linux 4.1, the kernel already has an integrated PT implementation in perf. simple-pt is an alternative implementation. It has many disadvantages compared to the perf PT implementation, such as:
- needs to run as root
- no long term tracing or sampling with interrupts
- no support for interactive debugging (use gdb 7.10 on perf for that)
- no support for histograms
- somewhat experimental
- not as well supported as perf

On the positive side simple-pt is:
- simple
- standalone. No kernel changes needed. Could be ported to older kernels or other operating systems
- easy to modify and experiment with
- more ftrace like decoding tool
- support for kprobes based triggers
- modular “unix style” design with simple tools that do only one thing each
- BSD licensed

Example output:

        % sptcmd  -c tcall taskset -c 0 ./tcall
        cpu   0 offset 1027688,  1003 KB, writing to ptout.0
        Wrote sideband to ptout.sideband
        % sptdecode --sideband ptout.sideband --pt ptout.0 | less
        frequency 32
        0        [+0]     [+   1] _dl_aux_init+436
                          [+   6] __libc_start_main+455 -> _dl_discover_osversion
                          [+  13] __libc_start_main+446 -> main
                          [+   9]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1

Available from

August 17, 2015 04:27 AM

August 16, 2015

Daniel Vetter: Atomic Modesetting Design Overview

After a few years of development the atomic display update IOCTL for drm drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road, with a lot of drivers already converted over to atomic, even more in progress, and the atomic helper libraries and support code in the drm subsystem sufficiently polished. But what's really been missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented like they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design and the exact semantics of atomic modesetting updates. Happy Reading!

August 16, 2015 01:52 PM

August 15, 2015

Rusty Russell: Broadband Speeds, New Data

Thanks to edmundedgar on reddit I have some more accurate data to update my previous bandwidth growth estimation post: OFCOM UK, who released their November 2014 report on average broadband speeds.  Whereas Akamai numbers could be lowered by the increase in mobile connections, this directly measures actual broadband speeds.

Extracting the figures gives:

  1. Average download speed in November 2008 was 3.6Mbit
  2. Average download speed in November 2014 was 22.8Mbit
  3. Average upload speed in November 2014 was 2.9Mbit
  4. Average upload speed in November 2008 to April 2009 was 0.43Mbit/s

So in 6 years, downloads went up by 6.333 times, and uploads went up by 6.75 times.  That’s an annual increase of 36% for downloads and 37% for uploads; that’s good, as it implies we can use download speed factor increases as a proxy for upload speed increases (as upload speed is just as important for a peer-to-peer network).
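
(For reference, those annual figures are compound growth rates over the six-year span; for downloads, (22.8 / 3.6)^(1/6) - 1 = 6.333^(1/6) - 1 = ~0.36, i.e. 36% per year.)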

This compares with my previous post’s Akamai’s UK numbers of 3.526Mbit in Q4 2008 and 10.874Mbit in Q4 2014: only a factor of 3.08 (26% per annum).  Given how close Akamai’s numbers were to OFCOM’s in November 2008 (a year after the iPhone UK release, but probably too early for mobile to have significant effect), it’s reasonable to assume that mobile plays a large part of this difference.

If we assume Akamai’s numbers reflected real broadband rates prior to November 2008, we can also use it to extend the OFCOM data back a year: this is important since there was almost no bandwidth growth according to Akamai from Q4 2007 to Q4 2008: ignoring that period gives a rosier picture than my last post, and smells of cherrypicking data.

So, let’s say the UK went from 3.265Mbit in Q4 2007 (Akamai numbers) to 22.8Mbit in Q4 2014 (OFCOM numbers).  That’s a factor of 6.98, or 32% increase per annum for the UK. If we assume that the US Akamai data is under-representing Q4 2014 speeds by the same factor (6.333 / 3.08 = 2.056) as the UK data, that implies the US went from 3.644Mbit in Q4 2007 to 11.061 * 2.056 = 22.74Mbit in Q4 2014, giving a factor of 6.24, or 30% increase per annum for the US.

As stated previously, China is now where the US and UK were 7 years ago, suggesting they’re a reasonable model for future growth for that region.  Thus I revise my bandwidth estimates; instead of 17% per annum this suggests 30% per annum as a reasonable growth rate.

August 15, 2015 04:54 AM

August 14, 2015

Pete Zaitcev: Tablet Uber Alles Or Is It

Given the trouble with modern laptops, I'm seriously thinking about whether I should make the jump to a gigantic tablet with a keyboard. You run "make" on a VM. Not enough RAM? Order more in the cloud! The idea was planted in my mind by that jerk Atwood, who penned an article claiming the death of the PC. And a month ago I saw someone at a Python meetup using Canopy. It kinda worked, actually. I expect Github Atom to be even better.

Unfortunately, there are problems in 3 broad categories still.

First, the hotspot Internet connectivity sucks. It is plain unreliable. VPN, ssh, and IRC are often blocked; it's necessary to remember "Connectivity Through Anything" lessons and techniques. When it works, it's often slow. These problems extend to venues such as Intel's Executive Briefing Center. If "executives" eating their awesome snacks cannot obtain decent WiFi, what hope do I have? I do not have cellphone data, but I hear bitching about it.

Second, the usual questions about privacy and security apply. Non-proprietary tablets suck immensely, from what I heard.

Third, tablets top out at 10..11 inch. Sorry, but that is not enough to kill laptops while laptops continue to be made. Certainly, Atwood made an argument that as tablets absorb users, PC makers will stop. The day the last one quits, we'll have to use the least shitty tablet regardless of size. But today is not that day.

UPDATE: 3 weeks after this post, Apple unveiled a 12.9" (2732 x 2048) iPad Pro, with a keyboard as a factory option.

August 14, 2015 09:38 PM

Pete Zaitcev: User-facing hardware

New business trip, new hardware pictures.

It's been almost a year, and I'm still looking for a decent laptop, same criteria. I saw a couple of guys using the Lenovo X1 Carbon, which looks good. Most importantly, the left Ctrl key now extends to its proper position. Almost a winner, but unfortunately, there are issues. Apparently, the screen on the X1 does not sit flat against the main frame when it's closed, so a bundle of clothing pressing in the middle between the hinges is capable of making a nasty crack in the plastic. Not acceptable for what is a $1,400 laptop even with Amazon's "discount" of $900. Way to go, Lenovo. Almost had me this time.

Meanwhile, a $500 Dell Vostro continues to soldier on. It's showing its age: building Ceph with "make -j${N}" requires more RAM than it has for any reasonable N, and dialog windows have started to outgrow its screen (notably, some of the GNOME preferences). I still need a laptop, but can't find a suitable one. The Lenovo X1 tops out at 8GB, which was another strike against it.

I was a little sad when Google stopped making the Nexus 7. I have the 2013 version and it is quite good. In the same meeting, I bumped into a guy with a projected update to the Nexus 7 that became orphaned when Google pulled the plug. ASUS continued to build them and market them as the "MemoPad 7". However, taking a page from the Microsoft playbook with their "Surface" and "Surface Pro", ASUS sells "MemoPad 7" versions ranging from a worthless piece of junk with 1024x600 to actual Nexus 7 replacements with 1920x1200. Allegedly, the battery life and speed are much improved by using Intel's embedded Atom core. Some of the ARM-optimized apps may not work (an example is some kind of music-editing thing for podcasters).

August 14, 2015 09:18 PM

August 13, 2015

Dave Jones: The case of the mysterious disappearing I211

Day one of unemployed life saw me finally getting around to the first of several hardware related maintenance items that I’ve been putting off until I’ve had the time.

I got a lot of life out of my desktop machine that I had been using since 2007. Earlier this year, I decided it was long overdue an upgrade, and ended up building a ridiculously over-specced machine in the hopes it too would last me a while. After some research, I ended up with a 6-core Haswell-E i7-5820K, and a frankly ridiculously over-featured motherboard.
Once I had delved through the absurd number of BIOS options to convince it that I *really* didn’t want to overclock my CPU or my RAM, or anything else, it was very stable.

It has exceeded all my expectations. In the time it took my old desktop to build one kernel, I can build kernel .deb’s for every machine I own, and still have time spare. It’s an absolute beast.

One of the features that sold me on this board was the two onboard ethernet ports. I had been wanting to do a bunch of networking experiments, and the possibility of using bonding, without having to screw around with add-in cards was appealing.

So I was a little irked one evening after updating its BIOS, to notice that the bond only had one interface active. After some investigation, I noticed that the PCI ID of one of the onboard NICs had changed.

What was once

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

Was now

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
08:00.0 Ethernet controller: Intel Corporation Device 1532 (rev 03)

My I211 had changed its PCI ID, and the e1000 driver wouldn’t bind to this new device.

At first I thought “Cool, some kind of NIC firmware update”, and assumed that e1000 hadn’t been updated yet to support this new feature. Googling for “i211 1532” told a much sadder story however.

If you read the spec update for the i211, you find this interesting table:

I211 Device ID Code                            Vendor ID   Device ID   Revision ID
WGI211AT (not programmed / factory default)    0x8086      0x1532      0x3
WGI211AT (programmed)                          0x8086      0x1539      0x3

Uh, not cool. Somehow the BIOS update procedure had wiped the NVRAM on the NIC.

A long protracted conversation with ASUS support followed, including such gems as “I understand you’re seeing blue screens” and “Have you tried removing the DIMMs, rubbing the contacts with an eraser and replacing them”. Eventually I think they got to the end of their script, and agreed to RMA the board. Somewhat annoying, given there’s probably a tool somewhere that can rewrite the flash, but Intel only seems to make that available to integrators, not end-users, and the ASUS representatives denied all knowledge.

It was gone for about two weeks, and finally returned yesterday. Its PCI ID is 0x1539 again, and it has its old MAC address once more. (I’m now hesitant to ever upgrade the BIOS on this machine again). So what happened ? Anyone’s guess, but this isn’t the first time I’ve seen this happen. We had a bunch of these NICs at Akamai too that occasionally had the same thing happen to them.

The whole thing is reminiscent of a painful old bug where ftrace would corrupt the e1000e ROM. Hopefully Linux isn’t to blame this time.

So, long story short: If you see an i211 with a PCI ID of 1532, you’re looking at an RMA.
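
If you just want a quick way to check whether one of your own boards has been hit, lspci can filter on the vendor:device pair directly. A minimal sketch, using the IDs from the spec update table above:

# a healthy, programmed I211 reports device ID 1539
lspci -nn -d 8086:1539
# if this one returns anything, the NVRAM has been wiped back to factory default
lspci -nn -d 8086:1532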


August 13, 2015 04:09 PM

LPC 2015: LPC closing party on the water

The Linux Plumbers Conference will have its closing party on Friday, August 21 at the Palisade Restaurant at Elliott Bay Marina on the waters of Puget Sound. Buses will be leaving from the Sheraton around 18:00 for the 15-minute journey to the restaurant. The evening will start with a champagne and seafood tower reception in the courtyard overlooking the marina. It will include shrimp, lobster, and oysters, along with plenty of vegetarian choices. There will be a buffet inside the restaurant after that with entrees including sushi (with vegetarian selections), salmon, risotto, and steak. All of that will be followed by dessert and coffee. There will be local wine and beer selections and all of the food will be locally sourced as much as possible.

It should make for a fabulous evening, with great views (perhaps even of Mount Rainier), and excellent company. We look forward to seeing you on Friday.

August 13, 2015 02:13 PM

August 12, 2015

LPC 2015: How to find Room WSCC 3AB

This year, because of space constraints, the single Wednesday Microconference track (for LLVM in the morning and the Development Tools Tutorial in the afternoon, see schedule for details) is happening offsite at the Washington State Convention Centre (WSCC).  Breakfast will still be at the Sheraton, but the rest of the Microconference will happen over at WSCC in room 3AB.  To get there from the Sheraton, take the escalators down to the ground floor, exit at the corner of Pike and 6th Avenue. Turn right on to Pike, cross 7th Avenue, continue a little way up Pike then turn right into the Convention Centre. Room 3AB is up the escalators on the third floor.



Wireless in the Washington State Convention Centre is on a different system.  Unfortunately it’s portal based, not WEP key based, so we can’t hide the difference.  The portal details are

SSID: Exhibitor Internet

Which is an open network, then browse to any page and enter the login at the prompt:
USER ID: linux
Password: seattle

August 12, 2015 01:34 AM

August 11, 2015

Dave Jones: Moving on from Akamai.

Today was my last day at Akamai. It's been brief (just over seven months), but things weren't really working out for me there for a number of reasons. As I've mentioned to a number of people who have known about my decision for a while, it's not that it's a bad place to work; it just never felt like a good fit for me, and I came to realize that I've spent most of this last year in denial about just how unhappy I was, in the hope that "things would get better".

There are a lot of smart people working there, working on really difficult problems, but a lot of those problems just don’t align with my interests, especially when they don’t always involve contributing code back upstream. [clarification: There is some upstream work going on there, just not as much as I’d like].

Add to this my disdain for some of the proprietary tooling that's prevalent there, and it was becoming clear it was not a matter of "if", but "when" I was going to leave. As an example: I joked to co-workers a few months ago, "next time I'm looking for a job, the first question I ask is 'do you use perforce?'". Only it wasn't really a joke; I was dead serious. User-hostile software has no place in my life.
Even little things like “let’s use git” translating to “let’s license Atlassian stash” rather than “run a git-daemon somewhere” started getting me down.

The final project I worked on there was a continuous rebase strategy for the kernel, moving away from perforce to git. It’s a move in the right direction, but ultimately, not the sort of work that gets me excited, and it’s going to be a multi-year project before it starts really bearing fruit. Given how perforce is ingrained in so many of Akamai’s systems, it would also have been extremely unlikely I’d have been able to purge all knowledge of ever having used it.

It also started to bother me that many of the kernel changes we made had no chance of ever even being submitted, let alone accepted upstream. (In part because many of them are very specific to Akamai's CDN — you won't find any of the trickery employed there described in a Richard Stevens book, and they're unlikely to ever become official RFCs due to the competitive edge they gain from those changes).
There are exceptions to all of this, and the kernel team is trying to do a better job there with upstreaming most of the newer changes, but many of the older legacy patches are under-documented, and/or understood well by few people, with the original authors no longer around, making it a frustrating exercise to get up to speed; especially when you’re trying to learn what the upstream code is doing at the same time.

Someone with less experience dealing exclusively with open-source for most of their career would probably find many of my reasons for leaving trivial. Those same people would probably find Akamai a great place to work. There are a lot of opportunities there if you have a higher tolerance for such things than I did. It was eye-opening recently, mentoring some of the interns there. Optimism. The unjaded outlook that comes with youth. Not getting bent out of shape at crappy tooling because they don’t know different. It made me realize I wasn’t going to ever be like this here.

On a particularly bad day a few weeks back, a recruiter reached out to me, to find out if I was interested in a second chance at an offer I received last time I was looking for a new job. It worked. Enduring an unhappy situation in the hopes things will get better isn’t a great strategy when there are other options.

So, I start at Facebook in September.

I have no delusions that things are going to be perfect there, but at least from the outside right now, the grass looks greener. I feel bad walking away from problems unfinished, but going home miserable or angry or with some other negative emotion every day was really starting to take its toll. It's not a healthy way to live.

When I was interviewing last December, I read Being Geek to death, so it’s fitting that I’ve picked it up again recently. One paragraph in particular jumps out at me.

My single worst gig was one where I got everything I wanted out of the offer letter, but in my exuberance for being highly valued, I totally forgot that my gut read on the gig was "meh". Ninety days later, I couldn't care less that I got a 15% raise and a sign-on bonus. I couldn't stand the mundanity of the daily work, and I happily resigned a few months later, taking both a pay cut and returning my sign-on bonus for the opportunity to work at Netscape.

Anachronisms and minor details aside, that paragraph played through my head this afternoon as I wrote the check to pay back the remainder of my sign-on bonus. I wasn’t quite thinking “meh”, but I knew I was making compromises on what I really valued from day one.

Walking away from unvested RSUs, giving up this month's paycheck, and writing that check stings a little, but when I did my exit interview this morning, I knew that I, too, was "happily resigning" for a great opportunity.

I’m feeling uncharacteristically optimistic right now. Hopefully it’ll last.

I’ll be in Seattle next week, but due to complications with my registration being transferred to another Akamai employee, I won’t actually be at the Linux plumbers conf. If you’re also going to be there and want to catch up, drop me a mail, or <ahem> hit me up on facebook.


August 11, 2015 10:33 PM

Pete Zaitcev: git submodule

It's a familiar sign to anyone dealing with a project that includes submodules: you run "make" and see something like this:

rgw/ In member function ‘virtual int RGWMongooseFrontend::run()’:
rgw/ error: ‘struct mg_callbacks’ has no member named ‘log_access’
cb.log_access = rgw_civetweb_log_access_callback;

Ah, yes. Submodule civetweb is obviously out of date. Type "git submodule init; git submodule update" and... nothing happens. The goddamn submodules are stuck.

At this point, running "git diff origin" produces an output like:

--- a/ceph-object-corpus
+++ b/ceph-object-corpus
@@ -1 +1 @@
-Subproject commit 20351c6bae6dd4802936a5a9fd76e41b8ce2bad0
+Subproject commit bb3cee6b85b93210af5fb2c65a33f3000e341a11

So yeah, obviously you fetched the right thing from the origin, but you cannot merge or rebase no matter what. You may spend a good part of a hackathon reading man pages for git subcommands, all for naught.

Fortunately, the stuck submodules can be worked around, by looking at the "git diff origin" above, then doing this:

git update-index --replace --cacheinfo 160000,20351c6bae6dd4802936a5a9fd76e41b8ce2bad0,ceph-object-corpus

You get the idea: force the right commit from the origin into the local index. This allows "git submodule update" to clone and checkout the right thing and you're off to the races. The fixups in the index will stick out in "git status", so create an empty commit to get rid of them (but only after "git submodule update").
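
Putting it all together, the recovery sequence looks roughly like this (a sketch; the hash and submodule path are the ones from the diff above and will obviously differ in your tree):

git diff origin                    # note the "-Subproject commit <hash>" line for the stuck submodule
git update-index --replace --cacheinfo 160000,20351c6bae6dd4802936a5a9fd76e41b8ce2bad0,ceph-object-corpus
git submodule update               # now clones and checks out the right commit
git status                         # the forced index entries show up as staged changes
git commit -m "Pin submodules to the commits origin expects"   # commit them so they stop sticking out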

When you're done, you might want to kick in the nuts whoever chose to use submodules in your project.

P.S. "git --version" yields "git version 2.4.3".

P.P.S. You can verify what you have in the index by running "git ls-files -s ceph-object-corpus" (or src/civetweb). The mode must be 160000 and the hash should match the upstream. Note that "git diff origin" continues to display a disparity until you've run "git submodule update".

August 11, 2015 03:08 AM

Pete Zaitcev: the future is here


10005 zaitcev   20   0  809920 755384  13220 R  99.7 12.5   0:20.47 cc1plus
 9894 zaitcev   20   0 1946748 1.806g  15800 R  99.3 31.4   1:46.60 cc1plus
 9956 zaitcev   20   0 1652076 1.524g  15832 R  99.0 26.5   1:30.64 cc1plus
   72 root      20   0       0      0      0 S   4.0  0.0   0:04.60 kswapd0
 9957 zaitcev   20   0   56648  43536   1436 S   2.7  0.7   0:00.49 as
 9895 zaitcev   20   0   79480  66368   1480 S   2.0  1.1   0:00.89 as
 2870 zaitcev   20   0 1989524 533104 160868 S   1.3  8.9  60:28.10 firefox
 2035 zaitcev   20   0 2018216 166872  20028 S   0.7  2.8  16:50.66 gnome-sh

That's right, boys and girls, a compiler with a bigger resident size than Firefox. Three times bigger.

August 11, 2015 02:12 AM

August 10, 2015

Lucas De Marchi: “Throw away” linux images in seconds

Generating a new rootfs from scratch in order to test changes to early parts of the software stack or just to have a pristine environment is something I needed several times in the past.

Since I use Archlinux on my desktop, something that I like is to have a similar environment in the target test rootfs. I decided to re-use and improve a script from Kay Sievers to create an installer that can be booted as a VM, as a container, or on bare metal. Originally it was a script to bootstrap a Fedora image, and I think that with some small changes that would still be possible.

$ time sudo -l ~/vm/test.img
real 0m31.238s
user 0m22.277s
sys 0m2.473s

30 seconds later I have a complete pristine image that can be used as a VM with qemu, as a container with systemd-nspawn or just copied to a pendrive/sdcard to boot for example a Minnow Board Max.


$ sudo systemd-nspawn -b -i ~/vm/test.img


$ sudo kvm-that ~/vm/test.img

Note: ‘kvm-that’ is also a script available in the same repository so I don’t have to type all the options to qemu.

In order to boot another computer or a board like Minnow Board Max just dd the image to a usb disk or sdcard. You can also generate the image directly to the final destination:

$ sudo -l /dev/mmcblk0
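
For the dd route mentioned above, something along these lines does the job (a sketch; /dev/sdX is a placeholder for your usb disk or sdcard, so triple-check the device name before running it):

sudo dd if=~/vm/test.img of=/dev/sdX bs=4M
sync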

The script also has some nice options to make it easy to customize the final image. One thing that I often do is provide an overlay directory with configuration files for wpa_supplicant. This way I can already access my WiFi networks in the target image.
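
As a rough illustration, an overlay for that case could look like the sketch below. The directory layout and file name are my assumptions (adjust them to whatever the script and your wpa_supplicant setup expect), and the SSID and passphrase are placeholders:

mkdir -p overlay/etc/wpa_supplicant
cat > overlay/etc/wpa_supplicant/wpa_supplicant-wlan0.conf <<EOF
network={
    ssid="home-network"
    psk="not-my-real-passphrase"
}
EOF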

If you always need certain packages you can use the  example debug-tools hook that is executed before the image is finalized. By mixing hooks like that and the overlay directory mentioned above it’s possible to add your local repository to pacman.conf and install packages not available in Archlinux. Or packages that you’d like to maintain on your own. In my use cases with Minnow Board Max I maintain my own kernel with configurations suited to run ardupilot on it.

August 10, 2015 10:44 AM

August 08, 2015

Michael Kerrisk (manpages): man-pages-4.02 is released

I've released man-pages-4.02. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, and comments from around 15 contributors. As well as a large number of minor fixes to nearly 400 man pages, the more significant changes in man-pages-4.02 include the following:

August 08, 2015 10:10 PM

Matthew Garrett: Difficult social problems are still difficult problems

After less than a week of complaints, the TODO group have decided to pause development of their code of conduct. This seems to have been triggered by the public response to the changes I talked about here, which TODO appear to have been completely unprepared for.

While disappointing in a bunch of ways, this is probably the correct decision. TODO stumbled into this space with a poor understanding of the problems that they were trying to solve. Nikki Murray pointed out that the initial draft lacked several of the key components that help ensure less privileged groups can feel that their concerns are taken seriously. This was mostly rectified last week, but nobody involved appeared to be willing to stand behind those changes in a convincing way. This wasn't helped by almost all of this appearing to land on Github's plate, with the rest of the TODO group largely missing in action[1]. Where were Google in this? Yahoo? Facebook? Left facing an angry mob with nobody willing to make explicit statements of support, it's unsurprising that Github would try to back away from the situation.

But that doesn't remove their blame for being in the situation in the first place. The statement claims "We are consulting with stakeholders, community leaders, and legal professionals", which is great. It's also far too late. If an industry body wrote a new kernel from scratch and deployed it without any external review, then discovered that it didn't work and only then consulted any of the existing experts in the field, we'd never take them seriously again. But when an industry body turns up with a new social policy, fucks up spectacularly and then goes back to consult experts, it's expected that we give them a pass.

Why? Because we don't perceive social problems as difficult problems, and we assume that anybody can solve them by simply sitting down and talking for a few hours. When we find out that we've screwed up we throw our hands in the air and admit that this is all more difficult than we imagined, and we give up. We ignore the lessons that people have learned in the past. We ignore the existing work that's been done in the field. We ignore the people who work full time on helping solve these problems.

We wouldn't let an industry body with no experience of engineering build a bridge. We need to accept that social problems are outside our realm of expertise and defer to the people who are experts.

[1] The repository history shows the majority of substantive changes were from Github, with the initial work appearing to be mostly from Twitter.


August 08, 2015 08:09 PM

August 05, 2015

Andi Kleen: Generating Flame graphs with Processor Trace

Everybody loves Flame Graphs. Here is how to generate one with Processor Trace.

Processor Trace makes it possible to build very exact histograms of a program's run time. Normal sampling has shadow effects, which can hide some details. Processor Trace records every branch, so it can be much more accurate than normal sampling.

You need an Intel Broadwell or Skylake CPU, running a 4.1 or later Linux kernel where perf supports PT.
You can verify that the kernel supports PT with:

ls /sys/devices/intel_pt

You need perf user tools built from
(this should soon be fixed when the user tools code is merged into Linux mainline)

Build perf with PT support

# set up https_proxy as needed
git clone
cd linux-perf/tools/perf

Copy the resulting perf binary to where you want to run it

Get the flamegraph code

git clone

Collect data from the workload. It's best not to collect overly long traces, as they take much longer to process and may need too much disk space.

perf record -e intel_pt// workload (or -a sleep 1 to collect 1s globally)

Decode the data. This may take quite some time

perf script --itrace=i100usg | /path/to/FlameGraph/ | > workload.folded

The i100us means the trace decoder samples an instruction every 100us. This can be made more accurate (down to 1ns), at the cost of longer decoding time. The ‘g’ tells the decoder to add callgraphs.

Then generate the Flamegraph with

/path/to/FlameGraph/ workload.folded > workload.svg

Then view the resulting SVG in an SVG viewer, such as Google Chrome

google-chrome workload.svg

It is possible to click around.
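
For reference, the end-to-end flow from the steps above looks roughly like this. The stackcollapse-perf.pl and flamegraph.pl names are the standard helper scripts shipped in the FlameGraph repository cloned earlier; treat the exact paths as placeholders for your setup:

perf record -e intel_pt// -a sleep 1
perf script --itrace=i100usg | /path/to/FlameGraph/stackcollapse-perf.pl > workload.folded
/path/to/FlameGraph/flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg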

Here’s a larger svg example from a gcc build (2.5MB). May need chrome or firefox to view.

In principle the trace also has support for more information not in normal sampling, such as determining the exact run time of individual functions from the trace. This is unfortunately not (yet?) supported by the Flame Graph tools.

August 05, 2015 11:13 PM

August 04, 2015

Matthew Garrett: Reverse this

The TODO group is an industry body that appears to be trying to define community best practices or something. I don't really know what their backstory is and whether they're trying to do meaningful work or just provide a fig leaf of respectability to organisations that dislike being criticised for doing nothing to improve the state of online communities but don't want to have to actually do anything, and their initial work on codes of conduct was, perhaps, suboptimal. But they do appear to be trying to improve things - this commit added a set of inappropriate behaviours, and also clarified that reverseisms were not actionable behaviour.

At which point Reddit lost its shit, because Reddit is garbage. And now the repository is a mess of white men attempting to explain how any policy that could allow them to be criticised is the real racism.

Fuck that shit.

Being a cis white man who's a native English speaker from a fairly well-off background, I'm pretty familiar with privilege. Spending my teenage years as an atheist of Irish Catholic upbringing in a Protestant school in a region of Northern Ireland that made parts of the bible belt look socially progressive, I'm also pretty familiar with the idea that said privilege doesn't shield me from everything bad in life. Having privilege isn't a guarantee that my life will be better, in the same way that avoiding smoking doesn't mean I won't die of lung cancer. But there's an association in both cases, one that's strong enough to alter the statistical likelihood in meaningful ways.

And that inherently affects discussions about race or gender or sexuality. The probability that I've been subject to systematic discrimination because of these traits is vanishingly small. In the communities this policy is intended to cover, I'm the default. It's very difficult for any minority to exercise power over me. "You're white, you wouldn't understand" isn't fundamentally about my colour, it's about the fact that my colour means I haven't been subject to society trying to make my life more difficult at every opportunity. A community that considers saying that to be racist is a community that will never change the default, a community that will never be able to empower people who didn't grow up with that privilege. A code of conduct that makes it clear that "reverse racism" isn't grounds for complaint makes it clear that certain conversations are legitimate and helps ensure we have the framework we need to gradually change that default, and as such is better than one that doesn't.

(comments disabled because I don't trust any of you)


August 04, 2015 09:59 PM

Rusty Russell: The Bitcoin Blocksize: A Summary

There’s a significant debate going on at the moment in the Bitcoin world; there’s a great deal of information and misinformation, and it’s hard to find a cogent summary in one place.  This post is my attempt, though I already know that it will cause me even more trouble than that time I foolishly entitled a post “If you didn’t run code written by assholes, your machine wouldn’t boot”.

The Technical Background: 1MB Block Limit

The bitcoin protocol is powered by miners, who gather transactions into blocks, producing a block every 10 minutes (but it varies a lot).  They get a 25 bitcoin subsidy for this, plus whatever fees are paid by those transactions.  This subsidy halves every 4 years: in about 12 months it will drop to 12.5.

Full nodes on the network check transactions and blocks, and relay them to others.  There are also lightweight nodes which simply listen for transactions which affect them, and trust that blocks from miners are generally OK.

A normal transaction is 250 bytes, and there's a hard-coded 1 megabyte limit on the block size. This limit was introduced years ago as a quick way of avoiding a miner flooding the young network, though the original code could only produce 200kb blocks, and the reference code still defaults to a 750kb limit.

In the last few months there have been increasing runs of full blocks, causing backlogs for a few hours.  More recently, someone deliberately flooded the network with normal-fee transactions for several days; any transactions paying less fees than those had to wait for hours to be processed.

There are 5 people who have commit access to the bitcoin reference implementation (aka. “bitcoin-core”), and they vary significantly in their concerns on the issue.

The Bitcoin Users’ Perspective

From the bitcoin users' perspective, blocks should be infinite, and fees zero or minimal.  This is the basic position of respected (but non-bitcoin-core) developer Mike Hearn, and it has support from bitcoin-core ex-lead Gavin Andresen.  They work on the wallet and end-user side of bitcoin, and they see the issue as the most urgent.  In an excellent post arguing why growth is so important, Mike raises the following points, which I've paraphrased:

  1. Currencies have network effects. A currency that has few users is simply not competitive with currencies that have many.
  2. A decentralised currency that the vast majority can’t use doesn’t change the amount of centralisation in the world. Most people will still end up using banks, with all the normal problems.
  3. Growth is a part of the social contract. It always has been.
  4. Businesses will only continue to invest in bitcoin and build infrastructure if they are assured that the market will grow significantly.
  5. Bitcoin needs users, lots of them, for its political survival. There are many people out there who would like to see digital cash disappear, or be regulated out of existence.

At this point, it’s worth mentioning another bitcoin-core developer: Jeff Garzik.  He believes that the bitcoin userbase has been promised that transactions will continue to be almost free.  When a request to change the default mining limit from 750kb to 1M was closed by the bitcoin lead developer Wladimir van der Laan as unimportant, Jeff saw this as a symbolic moment:

Disappointing. New #Bitcoin Core policy: stealth fee increases Zero plan to communicate this to BTC users :(

— Jeff Garzik (@jgarzik) July 21, 2015

What Happens If We Don’t Increase Soon?

Mike Hearn has a fairly apocalyptic view of what would happen if blocks fill.  That was certainly looking likely when the post was written, but due to episodes where the blocks were full for days, wallet designers are (finally) starting to estimate fees for timely processing (miners process larger fee transactions first).  Some wallets and services didn’t even have a way to change the setting, leaving users stranded during high-volume events.

It now seems that the bursts of full blocks will arrive with increasing frequency; proposals are fairly mature now to allow users to post-increase fees if required, which (if all goes well) could make for a fairly smooth transition from the current “fees are tiny and optional” mode of operation to a “there will be a small fee”.

But even if this rosy scenario is true, this avoids the bigger question of how high fees can become before bitcoin becomes useless.  1c?  5c?  20c? $1?

So What Are The Problems With Increasing The Blocksize?

In a word, the problem is miners.  As mining has transitioned from a geek pastime, to a semi-hobbyist activity, and then to large operations with cheap access to power, it has become more concentrated.

The only difference between bitcoin and previous cryptocurrencies is that instead of a centralized “broker” to ensure honesty, bitcoin uses an open competition of miners. Given bitcoin's endurance, it's fair to count this as a vital property of bitcoin.  Mining centralization is the long-term concern of another bitcoin-core developer (and my coworker at Blockstream), Gregory Maxwell.

Control over half the block-producing power and you control who can use bitcoin and cheat anyone not using a full node themselves.  Control over 2/3, and you can force a rule change on the rest of the network by stalling it until enough people give in.  Central control is also a single point to shut the network down; that lets others apply legal or extra-legal pressure to restrict the network.

What Drives Centralization?

Bitcoin mining is more efficient at scale. That was to be expected[7]. However, the concentration has come much faster than expected because of the invention of mining pools.  These pools tell miners what to mine, in return for a small (or in some cases, zero) share of profits.  They save on setup costs, they're easy to use, and miners get more regular payouts.  This has caused bitcoin to reel from one centralization crisis to another over the last few years; the number of full nodes has declined precipitously by some measures[5] and continues to fall[6].

Consider the plight of a miner whose network is further away from most other miners.  They find out about new blocks later, and their blocks get built on later.  Both these effects cause them to create blocks which the network ignores, called orphans.  Some orphans are the inevitable consequence of miners racing for the same prize, but the orphan problem is not symmetrical.  Being well connected to the other miners helps, but there’s a second effect: if you discover the previous block, you’ve a head-start on the next one.  This means a pool which has 20% of the hashing power doesn’t have to worry about delays at all 20% of the time.

If the orphan rate is very low (say, 0.1%), the effect can be ignored.  But as it climbs, the pressure to join a pool (the largest pool) becomes economically irresistible, until only one pool remains.

Larger Blocks Are Driving Up Orphan Rates

Large blocks take longer to propagate, increasing the rate of orphans.  This has been happening as blocks increase.  Blocks with no transactions at all are smallest, and so propagate fastest: they still get a 25 bitcoin subsidy, though they don’t help bitcoin users much.

Many people assumed that miners wouldn’t overly centralize, lest they cause a clear decentralization failure and drive the bitcoin price into the ground.  That assumption has proven weak in the face of climbing orphan rates.

And miners have been behaving very badly.  Mining pools orchestrate attacks on each other with surprising regularity; DDOS and block withholding attacks are both well documented[1][2].  A large mining pool used their power to double spend and steal thousands of bitcoin from a gambling service[3].  When it was noticed, they blamed a rogue employee.  No money was returned, nor any legal action taken.  It was hoped that miners would leave for another pool as they approached majority share, but that didn’t happen.

If large blocks can be used as a weapon by larger miners against small ones[8], it’s expected that they will be.

More recently (and quite by accident) it was discovered that over half the mining power aren’t verifying transactions in blocks they build upon[4].  They did this in order to reduce orphans, and one large pool is still doing so.  This is a problem because lightweight bitcoin clients work by assuming anything in the longest chain of blocks is good; this was how the original bitcoin paper anticipated that most users would interact with the system.

The Third Side Of The Debate: Long Term Network Funding

Before I summarize, it’s worth mentioning the debate beyond the current debate: long term network support.  The minting of new coins decreases with time; the plan of record (as suggested in the original paper) is that total transaction fees will rise to replace the current mining subsidy.  The schedule of this is unknown and generally this transition has not happened: free transactions still work.

The block subsidy as I write this is about $7000.  If nothing else changes, miners would want $3500 in fees in 12 months when the block subsidy halves, or about $2 per transaction.  That won’t happen; miners will simply lose half their income.  (Perhaps eventually they form a cartel to enforce a minimum fee, causing another centralization crisis? I don’t know.)

It’s natural for users to try to defer the transition as long as possible, and the practice in bitcoin-core has been to aggressively reduce the default fees as the bitcoin price rises.  Core developers Gregory Maxwell and Pieter Wuille feel that signal was a mistake; that fees will have to rise eventually and users should not be lulled into thinking otherwise.

Mike Hearn in particular has been holding out the promise that it may not be necessary.  On this he is not widely supported: the hope is that some users would offer to pay more so that other users can continue to pay less.

It’s worth noting that some bitcoin businesses rely on the current very low fees and don’t want to change; I suspect this adds bitterness and vitriol to many online debates.


The bitcoin-core developers who deal with users most feel that bitcoin needs to expand quickly or die, that letting fees emerge now will kill expansion, and that the infrastructure will improve over time if it has to.

Other bitcoin-core developers feel that bitcoin’s infrastructure is dangerously creaking, that fees need to emerge anyway, and that if there is a real emergency a blocksize change could be rolled out within a few weeks.

At least until this is resolved, don’t count on future bitcoin fees being insignificant, nor promise others that bitcoin has “free transactions”.

[1] “Bitcoin Mining Pools Targeted in Wave of DDOS Attacks” Coinbase 2015

[2] “Block Withholding Attacks – Recent Research” N T Courtois 2014

[3] “GHash.IO and double-spending against BetCoin Dice” mmtech et. al 2013

[4] “Questions about the July 4th BIP66 fork”

[5] “350,000 full nodes to 6,000 in two years…” P Todd 2015

[6] “Reachable nodes during the last 365 days.”

[7] “Re: Scalability and transaction rate” Satoshi 2010

[8] “[Bitcoin-development] Mining centralization pressure from non-uniform propagation speed” Pieter Wuille 2015

August 04, 2015 02:32 AM

August 03, 2015

LPC 2015: Thursday night reception for LPC

Thanks to the generous sponsorship of Intel, the Linux Plumbers Conference is pleased to announce that there will be an additional social event this year. On Thursday August 20th, we will be gathering at the Seattle Rock Bottom Brewery—just a short walk from the conference venue and hotel—for drinks and dinner in a relaxed setting. The evening’s event will be showcasing local beers, wines, and spirits, but some of the more standard items (like single-malt scotches and cocktails) will also be available.

Since there will be various BoFs and extended microconferences going later in the evening on Thursday, the event has been structured to accommodate that. The event will not have a buffet and will, instead, provide food made to order. It will run until midnight and dinner orders can be placed up until 23:30, so folks can show up any time and still get the food of their choice, hot and fresh. That said, if we all order right at the 18:00 start, the waiting time may get long. So, if you aren’t working late, a walk around Seattle (perhaps after popping in for a drink) would work well to put some space around the food orders. The Rock Bottom is a large venue with lots of tables for discussions and the like, so continuing a conversation there, rather than at the venue, will work out well.

We look forward to seeing everyone at the Rock Bottom on Thursday!

August 03, 2015 10:33 PM

August 02, 2015

Pete Zaitcev: DNF - Debugging Not Finished

It's 100% like CKS said:

[root@kvm-rei zaitcev]# dnf check-update openstack-swift
Last metadata expiration check performed 0:10:02 ago on Sun Aug  2 18:42:13 2015.

openstack-swift.noarch                   2.3.0-2.fc23                    rawhide
[root@kvm-rei zaitcev]# dnf update openstack-swift
Last metadata expiration check performed 0:10:07 ago on Sun Aug  2 18:42:13 2015.
Dependencies resolved.
Nothing to do.
[root@kvm-rei zaitcev]# rpm -q openstack-swift
[root@kvm-rei zaitcev]# 

Still searching for a good way to get it unstuck.
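
For anyone hitting the same thing, a few things that may be worth trying (a sketch; I can't promise any of them dislodge this particular case):

# throw away the cached metadata and force a refresh
dnf clean all
dnf --refresh update openstack-swift
# or ask dnf to reconcile the installed version with whatever the repos currently carry
dnf distro-sync openstack-swift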

August 02, 2015 11:12 PM

July 29, 2015

Pete Zaitcev: Conference submission and voting

Generally I feel that I do not do any work that's important enough to present at conferences. My previous presentation was at OLS back in 2005, concerning usbmon. The usbmon is something a guy learning C would program: it's a circular buffer into which the kernel drops tracing events; Wireshark pulls them out. Hardly conference material, but at the time I thought it was supremely important to proselytize the basic techniques of always-on tracing, because it would improve the quality and the ease of debugging of the kernel overall. I really wanted the FireWire guys to adopt a similar tracing scheme, because it was a hell on a stick debugging juju with just printk(). Needless to say, that was a miserable failure, as was FireWire itself. I don't think anyone who came to listen to my presentation in Ottawa received their money's worth.

Or did they? Recently an epiphany occurred to me. I really should not even think about whether anyone is interested. That is the conference organizers' job, not mine! As a result, I sent a proposal to OpenStack Tokyo, entitled "The Plot to Destroy OpenStack Swift Using C++: Enhancements of Swift API Compatibility in Ceph RADOS Gateway". It's basically a compendium of practical issues that occur when running Swift apps on top of Ceph RGW and what we do to help people do that.

Things are a little different from 10 years ago, because attendees can vote on the submissions. This sounds democratic. I went through all the submissions on the storage track and voted on them according to my preference. It took a very long time, and I suspect that I was crowdsourced by the organizers in the best traditions of Web 2.0. I wonder if they'll even read the abstracts. :-)

July 29, 2015 04:23 PM

July 28, 2015

Matthew Garrett: Your Ubuntu-based container image is probably a copyright violation

Update: A Canonical employee responded here, but doesn't appear to actually contradict anything I say below.

I wrote about Canonical's Ubuntu IP policy here, primarily in terms of its broader impact, but I also mentioned a few specific cases. People seem to have picked up on the case of container images (especially Docker ones), so here's an unambiguous statement:

If you generate a container image that is not a 100% unmodified version of Ubuntu (ie, you have not removed or added anything), Canonical insist that you must ask them for permission to distribute it. The only alternative is to rebuild every binary package you wish to ship[1], removing all trademarks in the process. As I mentioned in my original post, the IP policy does not merely require you to remove trademarks that would cause infringement, it requires you to remove all trademarks - a strict reading would require you to remove every instance of the word "ubuntu" from the packages.

If you want to contact Canonical to request permission, you can do so here. Or you could just derive from Debian instead.

[1] Other than ones whose license explicitly grants permission to redistribute binaries and which do not permit any additional restrictions to be imposed upon the license grants - so any GPLed material is fine


July 28, 2015 08:06 PM

LPC 2015: Microconference schedule now available

The Linux Plumbers Conference starts in less than three weeks and so the schedule for Microconferences is now available!  Looking forward to seeing you all there!

July 28, 2015 07:40 PM

July 27, 2015

Andi Kleen: Energy efficient servers book review

Energy Efficient Servers – Blueprints for Data Center Optimization, by Gough/Steiner/Sanders, is a new book on power tuning for servers that was recently published by Apress. I got my copy a few weeks ago and read it, and it is great.

Disclaimer: I contributed a few pages to the book, but have no financial interest in its success.

As you probably already know power efficiency is very important for modern computing. It matters to mobile devices to extend battery time, it matters to desktops and servers to avoid exceeding the thermal/power capacity and lower energy costs.

Modern chips cannot run all their transistors at full speed at the same time due to the dark silicon problem. This results in the somewhat paradoxical situation that power management is needed, even if energy costs don’t matter, just to give the best performance (such as the highest Turbo frequencies)

Power management in modern systems is quite complex, with many different moving parts, hardware, operating systems, drivers, firmware, embedded micro-controllers working together to be as efficient as possible. I’m not aware of any good overview of all of this.

There is some lore around — for example you may have heard of race to idle, that is running as fast as possible to go idle again — but nothing really that puts it all into a larger context. BTW race-to-idle is not always a good idea, as the book explains.

The new book makes an attempt to explain all of this together for Intel servers (the basic concepts are similar on other systems and also on client systems).

It starts with a (short) introduction of the underlying physical principles and then moves on to the basic CPU and platform power management techniques, such as frequency scaling and idle state and thermal management. It has a discussion on modern memory subsystems and describes the trade-offs between different DIMM configurations. It describes the power management differences between larger servers and micro servers. And there is an overview of thermal management and power supply, such as energy efficient power supplies and voltage regulators.

Then it moves on to an overview of the software involved in power management, including firmware, rack level power management software, and operating systems. Then there is an extensive chapter on how to instrument and measure power management.

Finally (and perhaps most valuable) the book lays out a systematic power tuning methodology, starting with measurements and then concrete steps to optimize existing workloads for the best power efficiency.

The book is written not as an academic text book, but intended for people who solve concrete problems on shipping systems. It is quite readable, explaining any complicated concepts. You can clearly tell the authors have deep knowledge on the topic. While the details are intended for Intel servers, I would expect the book to be useful even to people working on clients or also other architectures.

One possible issue with the book is that it may be too specific for today's systems. We'll see how well it ages with future systems. But right now, as it just came out, it is very up-to-date and a good guide. It has some descriptions of data center design (such as efficient cooling), but these parts are quite short and are clearly not the main focus.

The ebook version is currently available as a free download, either from the publisher after registration or from Amazon as a free Kindle edition; there is also a reasonably priced paperback.

July 27, 2015 06:14 AM

July 24, 2015

James Morris: Linux Security Summit 2015 Update: Free Registration

In previous years, attending the Linux Security Summit (LSS) has required full registration as a LinuxCon attendee.  This year, LSS has been upgraded to a hosted event.  I didn’t realize that this meant that LSS registration was available entirely standalone.  To quote an email thread:

If you are only planning on attending The Linux Security Summit, there is no need to register for LinuxCon North America. That being said, you will not have access to any of the booths, keynotes, breakout sessions, or breaks that come with the LinuxCon North America registration. You will only have access to The Linux Security Summit.

Thus, if you wish to attend only LSS, then you may register for that alone, at no cost.

There may be a number of people who registered for LinuxCon but who only wanted to attend LSS.   In that case, please contact the program committee at

Apologies for any confusion.

July 24, 2015 03:46 AM

July 23, 2015

Michael Kerrisk (manpages): man-pages-4.01 is released

I've released man-pages-4.01. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, and comments from nearly 50 contributors. As well as a large number of minor fixes to over 100 man pages, the more significant changes in man-pages-4.01 include the following:

July 23, 2015 06:03 PM

July 20, 2015

Pete Zaitcev: Fedora 22 killed IPv6 and I'm fine

I upgraded Fedora on my home router to F22 and immediately IPv6 disappeared on the internal network. The problem is that radvd started throwing its usual "no linklocal address configured on ethmain.5" (although the message is only visible with "IgnoreIfMissing off;"), which leads to "interface ethmain.5 does not exist or is not set up properly". With the default IgnoreIfMissing, radvd continues running but refuses to work, quietly. Needless to say, the interface has a perfectly valid link-local address, same as it had in F21 before the upgrade.

There used to be a time when I took a problem like this as an affront to the idea of IPv6 superiority and the reputation of Fedora as a platform for a roll-your-own home router. Now though, I don't give a rat's tail for IPv6. Let Comcast and Google care and pay someone to care. Okay, I lied. I cared enough to file bug 1244428, but I'm not rushing to build from SRPMs, reinstall old versions, and such.

(Frankly, if we just engage Lennart's attention for an hour, he'll incorporate a perfectly serviceable radvd function into systemd-networkd. Of course, one would need journalctl to see any messages from it, but since it is certain to work, nobody would actually attempt that. The bug report would go unanswered like radvd bugs today, but again, there would not be any bugs.)

UPDATE: The root cause turned out to be an incorrect link-local address after all. I presumed that the RFC meant the whole fe80::/10 prefix to be used, so each interface had a different address within the node; ergo, an fe80:0:0:1::1/64 address. As it turns out, I may have confused the Link-Local address with the Site-Local address. RFC 4291 specifies an fe80::/64 prefix, and in F22 radvd started enforcing it. Note that, apparently, the lower part has to be unique across the link.
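
For the record, checking and fixing this by hand is straightforward (a sketch, assuming the ethmain.5 VLAN interface from above; the addresses are placeholders):

# radvd now only accepts a link-local address inside fe80::/64
ip -6 addr show dev ethmain.5 scope link
# drop the non-conforming address and add one that radvd is happy with
ip addr del fe80:0:0:1::1/64 dev ethmain.5
ip addr add fe80::1/64 dev ethmain.5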

July 20, 2015 04:18 PM

Mel Gorman: Continual testing of mainline kernels

It is not widely known that the SUSE Performance team runs continual testing of mainline kernels and collects data on machines that would be otherwise idle. Testing is a potential topic for Kernel Summit 2015, so now seems like as good a time as any to introduce Marvin. Marvin is a system that continually runs performance-related tests and is named after another robot doomed to repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux Enterprise kernels for performance regressions but it is also configured to run tests against mainline releases. There are four primary components of Marvin that are of interest.

The first component is the test client, which is a copy of MMTests. The use of MMTests ensures that the tests can be independently replicated and the methodology examined. The second component is Bob, which is a builder that monitors git trees for new kernels to test, builds the kernel when it's released and schedules it to be tested. In practice this monitors the SLE kernel tree continually and checks the mainline git tree once a month for new releases. Bob only builds and queues released kernels and ignores -rc kernels in mainline. The reason for this is simple -- time. The full battery of tests can take up to a month to complete in some cases and it's impractical to do that on every -rc release. There are times when a small subset of tests will be checked for a pre-release kernel but only when someone on the performance team is checking a specific series of patches and it's urgent to get the results quickly. When tests complete, it's Bob that generates the report. The third component is Marvin, which runs on the server; one instance exists per test machine. It checks the queue, prepares the test machine and executes tests when the machine is ready. The final component is a configuration manager that is responsible for reserving machines for exclusive use, managing power, managing serial consoles and deploying distributions automatically. The inventory management does not have a specific name as it's different depending on where Marvin is set up.
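
For anyone who wants to replicate a result independently, driving MMTests by hand looks roughly like this (a sketch; the repository location and script name reflect the current MMTests layout and may change, and the configuration file name is a placeholder for one of the files under configs/):

git clone https://github.com/gormanm/mmtests.git
cd mmtests
# run a single configuration from configs/ and tag the results with a run name
./run-mmtests.sh --config configs/config-workload-of-interest my-test-run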

There are two installations of Marvin -- one that runs in my house and a second that runs within SUSE -- and they have slightly different configurations. Technically Marvin supports testing on different distributions but only openSUSE and SLE are deployed. SLE kernels are tested on the corresponding SLE distribution. The Marvin instance in my house tests kernels 3.0 up to 3.12 on openSUSE 13.1 and then kernels 3.12 up to current mainline on openSUSE 13.2. In the SUSE instance, SLE 11 SP3 is used as the distribution for testing kernels 3.0 up to 3.12 and openSUSE 13.2 is used for 3.12 and later kernels. The kernel configuration used corresponds to the distribution. The raw results are not publicly available but the reports are generated on private servers and mirrored once a week to the following locations:

Dashboard for kernels 3.0 to 3.12 on openSUSE 13.1 running on home machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on home machines

Dashboard for kernels 3.0 to 3.12 on SLE 11 SP3 running on SUSE machines

Dashboard for kernels 3.12 to current on openSUSE 13.2 running on SUSE machines

The dashboard is intended to be a very high-level view detailing whether there are regressions or not in comparison to a baseline. For example, in the first report linked above, the baseline is always going to be a 3.0-based kernel. It needs a human to establish if the regression is real or if it's an acceptable trade-off. The top section makes a guess as to where the biggest regressions might be but it's not perfect so double check. Each test that was conducted is then listed. The name of the test corresponds to an MMTests configuration file in configs/, with extensions naming the filesystem used if that is applicable. The columns are then machines, with a number that represents a performance delta. 1 means there is no difference. 1.02 would mean there is a 2% difference and the colour indicates whether it is a performance regression or gain. Green is good, red is bad, gray or white is neutral. It will automatically guess if the result is significant, which is why 0.98 might be a 2% performance regression in one test (red) and in the noise for another.

It is important to note that the dashboard figure is a very rough estimate that often condenses multiple values into a single number. There is no substitute for reading the detailed report and making an assessment. It is also important to note that Marvin is not up to date and some machines have not started testing 4.1. It is known that the reports are very ugly but making them prettier has yet to climb up the list of priorities. Where possible we are picking a regression and doing something about it instead of making HTML pages look pretty.

The obvious question is what has been done with this data. When Marvin was first assembled, the intent was to identify and fix regressions between 2.6.32 (yes, really) and 3.12. This is one of the reasons why 3.12-stable contains so many performance related fixes. When a regression was found, there was generally one of three outcomes. The first is that it gets fixed, obviously. The second is that it is identified as an apparent, but not real, regression. Usually this means the old kernel was buggy in a manner that happened to benefit a particular benchmark. Tiobench is an excellent example. On old kernels there was a bug that preserved old pages and reclaimed new pages in certain circumstances. For most workloads, this is terrible but in tiobench it means that parts of the file were cached and the IO appeared to complete faster but it was a lie. The third possible outcome is that it's slower but it's a tradeoff to win somewhere else and the tradeoff is acceptable. Some scheduler regressions fall under this heading, where a context-switch micro-benchmark might be hurt but it's because the scheduler is making an intelligent placement decision.

The focus on 3.12 is also why Marvin is not widely advertised within the community. It is rare that mainline developers are concerned with performance in -stable kernels unless the most recent kernel is also discussed. In some cases the most recent kernel may have the same regression but it is common to discover there is simply a different mix of problems in a recent kernel. Each problem must be identified and addressed in turn and time is spent on that instead of adding volume to LKML. Advertising the existence of Marvin was also postponed because some of the tests or reporting were buggy and each time I wanted to fix the problem first. There are very few that are known to be problematic now but it takes a surprising amount of time to address all problems that crop up when running tests across large numbers of machines. There are still issues lurking in there but if a particular issue is important to you then let me know and I'll see if it can be examined faster.

An obvious question is how this compares to other performance-based automated testing such as Intel's 0-day kernel test infrastructure. The answer is that they are complementary. The 0-day infrastructure tests every commit to quickly identify both performance gains and regressions. The tests are short-lived by necessity and are invaluable at quickly catching some classes of problems. The tests run by Marvin are much longer-lived and there is only overlap in a small number of places. The two systems are simply looking for different problems. Hence, in 2012 I was tempted to try integrating parts of what became Marvin with 0-day but ultimately it was unnecessary and there is value in both. The other system worth looking at is the results reported on Phoronix Test Suite. In that case, it's relatively rare that the data needed to debug a problem is included in the reports which complicates matters. In a few cases I examined in detail I had problems with the testing methodology. As MMTests already supported large amounts of what I was looking for there was no benefit to discarding it and starting again with Phoronix and addressing any perceived problems there. Finally, on the site that reports the results, there is a frequent emphasis there on graphics performance or the relative performance between different hardware configurations. It is relatively rare that this is the type of comparison my team is interested in.

The next obvious question is: how are recent releases performing? At this time I do not want to make a general statement as I have not examined all the data in sufficient detail and am currently developing a series aimed at one of the problems. When I work on mainline patches, it's usually with reference to the problem I picked out after browsing through reports, targeting a particular subsystem area or in response to a bug report. I'm not applying a systematic process to identify all regressions at this point and it's still considered a manual process to determine if a reported regression is real, apparent or a tradeoff. When a real regression is found then Marvin can optionally conduct an automated bisection but that process is usually "invisible" and is only reported indirectly in a changelog if the regression gets fixed.

So what's next? The first is that more attention is going to be paid to recent kernels, checking if regressions that need addressing were introduced since 3.12. The second is identifying any bottlenecks that exist in mainline that are not regressions but still should be addressed. The last, of course, is coverage. The first generation of Marvin focused on some common workloads and for a long time it was very useful. The number of problems it is finding is now declining so other workloads will be added over time. Each time a new configuration is added, Marvin will go back through all the old kernels and collect data. This is probably not a task that will ever finish. There will always be some new issue, be it due to a hardware change, a new class of workload as the usage of computers evolves, or a modification that fixed one problem and introduced another. Fun times!

July 20, 2015 02:55 PM

July 15, 2015

Matthew Garrett: Canonical's Ubuntu IP policy is garbage

(In order to avoid any ambiguity here, this is a personal opinion. The Free Software Foundation's opinion on this matter is here)

Canonical have a legal policy surrounding reuse of Intellectual Property they own in Ubuntu, and you can find it here. It's recently been modified to handle concerns raised by various people including the Free Software Foundation[1], who have some further opinions on the matter here. The net outcome is that Canonical made it explicit that if the license a piece of software is under explicitly says you can do something, you can do that even if the Ubuntu IP policy would otherwise forbid it.

Unfortunately, "Canonical have made it explicit that they're not attempting to violate the GPL" is about the nicest thing you can say about this. The most troubling statement is "Any redistribution of modified versions of Ubuntu must be approved, certified or provided by Canonical if you are going to associate it with the Trademarks. Otherwise you must remove and replace the Trademarks and will need to recompile the source code to create your own binaries." The apparent aim here is to avoid situations where people take Ubuntu, modify it and continue to pass it off as Ubuntu. But it reaches far further than that. Cases where this may apply include (but are not limited to):

In each of these cases, a strict reading of the policy indicates that you are distributing a modified version of Ubuntu and therefore must either get it approved by Canonical or remove the trademarks and rebuild everything. The strange thing is that this doesn't limit itself to rebuilding packages that include Canonical's trademarks - there's a requirement that you rebuild all binaries.

Now obviously this is good engineering practice in a whole bunch of ways, but it's a huge pain in the ass. And to make things worse, Canonical won't clarify what they consider to be use of their trademarks. Many Ubuntu packages rebuilt from Debian include the word "ubuntu" in their version string. Many Ubuntu packages will contain the word "ubuntu" in maintainer email addresses. Many Ubuntu packages include references to Ubuntu (for instance, documentation might say "This configuration file is located under /etc/default in Debian and Ubuntu"). And many Ubuntu packages will include the compiler version string, which will include the word "ubuntu". Realistically, there's no risk of confusion by using the trademarks in this way, and as a consequence there would be no infringement under trademark law. But Canonical aren't using trademark law here. Canonical assert that they hold copyright over binaries that they have built from source, and require that for you to have permission to redistribute these binaries under copyright law you must remove the trademarks. This means that it doesn't matter whether your use of the trademarks would be infringing or not - you're required to remove them, because fuck you that's why.

This is a huge overreach. It's hostile to free software, in that it makes it significantly more difficult to produce derivative works of Ubuntu and doesn't benefit the community in the process. It's hostile to our understanding of IP law, in that it claims that the mechanical process of turning source code into binaries creates an independently copyrightable work. And in some cases it may make it impossible to create derivative works that interoperate with Ubuntu due to applications making assumptions about the presence of strings.

It'd be easy to write this off as an over-the-top misinterpretation of the policy if it hadn't been confirmed by the Ubuntu Community Manager that any binaries shipped by Ubuntu under licenses that don't grant an explicit right to redistribute the binaries can't be redistributed without permission or rebuilding. When I asked for clarification from Canonical over a year ago, I got no response[2]. Perhaps Canonical don't want to force you to remove every single use of the word Ubuntu from derivative works, but their policy is written such that the natural reading is that they do, and they've refused every single opportunity they've been given to clarify the point.

So, we're left with a policy that makes it hugely impractical to redistribute modified versions of Ubuntu unless Canonical approve of it. That's not freedom, and it's certainly not Ubuntu. If Canonical are serious about participating in the free software community then they need to demonstrate their willingness to continue improving this policy to bring it closer to our goals. Failure to do so will give a strong indication of their priorities.

[1] While I'm a member of the FSF's board of directors, I'm not involved in the majority of the FSF's day to day activities and was not part of this process
[2] Nebula's OS was a mixture of binary packages we pulled straight from Ubuntu and packages we rebuilt, so we were obviously pretty interested in what the answer was

July 15, 2015 07:20 PM

July 12, 2015

Dave Jones: Future development of Trinity.

It’s been an odd few weeks regarding Trinity based things.

First an email from a higher-up at my former employer asking (paraphrased)..

"That thing we asked you to stop working on when you worked here, any chance now you've left you'll implement these features."

I’m still trying to get my head around the thought process that led to that being a reasonable thing to ask. I’ve made the occasional commit over the last six months, but it’s mostly been code motion, clean-up work, and things like syscall table updates. New feature development came to a halt long ago.

It’s no coincidence that the number of bugs found with Trinity has dropped off sharply since the beginning of the year, and I don’t think it’s because the Linux kernel suddenly got a lot better. Rather, it’s due to the lack of real ongoing development to “try something else” when some approaches dry up. Sadly, we now live in a world where it’s easier to get paid to run someone else’s fuzzer than it is to develop one.

Then earlier this week, came the revelation that the only people prepared to fund that kind of new feature development are pretty much the worst people.

Apparently Hacking Team modified Trinity to fuzz ioctl() on Android, which yielded some results. I’ve done no analysis on whether those crashes are exploitable/fixed/only relevant to Android etc. (Frankly, I’m past caring.) I’m not convinced their approach is particularly sound even if it was finding results Trinity wasn’t, so it looks unlikely there are even ideas to borrow here. (We all already knew that ioctl was ripe with bugs and had practically zero coverage testing.)

It bothers me that my work was used as a foundation for their hack-job. Then again, maybe if I hadn’t released Trinity, they’d have based it on iknowthis, or some other less useful fuzzer. None of this should really surprise me. I’ve known for some time that there are some “security” people who have their own modifications they have no intention of sending my way. Thanks to the way that people who release 0-days are revered in this circus, there’s no incentive for people to share their modifications if it means that someone else might beat them to finding their precious bugs.

It’s unfortunate that this project has attracted so many awful people. When I began it, the motivation had nothing to do with security. Back in 2010 we were inundated in weird oopses that we couldn’t reproduce, many times triggered by jvm’s. I came up with the idea that maybe a fuzzer could create a realistic enough workload to tickle some of those same bugs. Turned out I was right, and so began a series of huge page and other VM related bug fixes.

In the five years that I’ve made Trinity available, I’ve received notable contributions from perhaps a half dozen people. In return I’ve made my changes available before I’d even given them runtime myself.

It’s a project everyone wants to take from, but no-one wants to give back to.

And that’s why for the foreseeable future, I’m unlikely to make public any further feature work I do on it.
I’m done enabling assholes.

July 12, 2015 09:37 PM

July 10, 2015

Andi Kleen: Speeding up less

Often, performance analysis or debugging boils down to staring at long text trace files in the less text viewer. Yes, you can do a lot of analysis with custom scripts, but at some point it’s usually necessary to look at the raw data too.

The first annoyance in less when opening a large file is the time it takes to count lines (less counts the lines up front so that it can show your current position as a percentage). The line counting has an easy workaround: hit Ctrl-C, or use less -n to disable the percentage display. But it would still be better if that weren’t needed.

Nicolai Haenle sped up the line counting by about 20x in his less repository.

One thing that always bothered me is that searching in less is so slow. If you’re browsing a file of tens to hundreds of MB, it can easily take minutes to search for a string. When browsing log and trace files, searching over longer distances is often very important.

And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.

Some profiling with perf shows that most of the search time is spent converting each line. Internally, less cleans up the line, converts it to canonical case, removes backspace bold and makes some other changes. The conversion loop processes each character in an inefficient way. Most of the time this work is not needed, so I replaced it with a quick check for whether the line contains any backspaces at all, using the optimized strchr() from the standard C library. For case conversion, the string search functions (either the regular expression or the fixed string search) can handle case-insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (used when the search string contains no regular expression metacharacters) can also be done with the optimized C library functions.
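
The shape of the fast path can be sketched in a few lines of C. This is only an illustration of the idea, not the actual less patch: the function name, the sample trace line and the flag handling are made up, and the real code has to deal with many more cases (regular expressions, multi-byte characters and so on).

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>

  /* Fast path: if the line contains no backspaces, none of the
   * per-character clean-up is needed, so the raw line can be handed
   * straight to the optimized C library search routines. */
  static const char *search_line(const char *line, const char *pattern,
                                 int ignore_case)
  {
      if (strchr(line, '\b') != NULL) {
          /* Backspace bold/underline present: the rare case where the
           * original convert-every-character path is still needed
           * (omitted in this sketch). */
          return NULL;
      }
      /* Case handling is delegated to the search function itself rather
       * than case-converting the whole line first.  strcasestr() is a
       * GNU extension, hence _GNU_SOURCE above. */
      return ignore_case ? strcasestr(line, pattern) : strstr(line, pattern);
  }

  int main(void)
  {
      const char *line = "1234.5678 sched_switch: prev_comm=less";
      printf("%s\n", search_line(line, "SCHED_SWITCH", 1) ? "match" : "no match");
      return 0;
  }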

The resulting less version searches ~85% faster in my benchmarks. I tried to submit the patch to the less maintainer, but unfortunately it was ignored. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.

One side effect of the patch is that less now defaults to case-sensitive searches. The original less had a feature (or bug) of defaulting to case-insensitive search even without the -i option. To get case-insensitive searches now, “less -i” needs to be used.

[Edit: Fix typos]

July 10, 2015 08:26 PM

Pavel Machek: Front USB connectors are evil

NFSroot over USB on the n900 was only giving me 300KiB/sec... and I thought that was normal. Then I plugged the cable into the back USB port (not the front one) and... the speed went from 300KiB/sec to 2.5MiB/sec. Not bad for an old cellphone.

Someone must be joking?
root@n900:/sys/devices/platform/68000000.ocp/480ab000.usb_otg_hs# cat vbus
Vbus off, timeout 1100 msec

It looks like my n900 was bitten by the famous "all calls disabled" problem (example solution). Prague Brmlab helped a lot, and baking the n900 for 15 minutes at 250°C seems to have fixed the problem... for a week. Now it looks like it is slowly creeping back.

July 10, 2015 10:11 AM

July 09, 2015

Andi Kleen: toplev tutorial and manual

toplev, part of pmu-tools, is a tool to determine the CPU bottleneck of workloads. Now there is finally a tutorial and manual available for toplev.

July 09, 2015 05:57 PM

Andi Kleen: Adding Processor Trace support to Linux

I published an article at LWN: Adding processor trace to Linux. It describes the Linux perf support for the Intel Processor Trace feature on Intel Broadwell and other CPUs. Processor Trace allows fine grained tracing of program control flow.

July 09, 2015 05:51 PM

July 08, 2015

Rusty Russell: The Megatransaction: Why Does It Take 25 Seconds?

Last night f2pool mined a 1MB block containing a single 1MB transaction.  This scooped up some of the spam which has been going to various weakly-passworded “brainwallets”, gaining them 0.5569 bitcoins (on top of the normal 25 BTC subsidy).  You can see the megatransaction on

It was widely reported to take about 25 seconds for bitcoin core to process this block: this is far worse than my “2 seconds per MB” result in my last post, which was considered a pretty bad case.  Let’s look at why.

How Signatures Are Verified

The algorithm to check a transaction input (of this form) looks like this:

  1. Strip the other inputs from the transaction.
  2. Replace the input script we’re checking with the script of the output it’s trying to spend.
  3. Hash the resulting transaction with SHA256, then hash the result with SHA256 again.
  4. Check the signature correctly signed that hash result.

Now, for a transaction with 5570 inputs, we have to do this 5570 times.  And the bitcoin core code does this by making a copy of the transaction each time, and using the marshalling code to hash that; it’s not a huge surprise that we end up spending 20 seconds on it.

How Fast Could Bitcoin Core Be If Optimized?

Once we strip the inputs, the result is only about 6k long; hashing 6k 5570 times takes about 265 milliseconds (on my modern i3 laptop).  We have to do some work to change the transaction each time, but we should end up under half a second without any major backflips.
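
As a rough sanity check of that figure, here is a stand-alone micro-benchmark (not bitcoind code) that double-SHA256s a 6k stand-in buffer 5570 times using OpenSSL. The buffer contents and the build line are arbitrary, and the real verification also has to rewrite part of the transaction before each hash, so treat the output as a ballpark number only.

  /* Rough micro-benchmark of the raw hashing cost: double-SHA256 a 6k
   * stand-in buffer 5570 times, once per input.  Build with something
   * like: gcc -O2 hashbench.c -lcrypto */
  #include <openssl/sha.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  int main(void)
  {
      unsigned char tx[6 * 1024];      /* stripped-transaction stand-in */
      unsigned char d1[SHA256_DIGEST_LENGTH], d2[SHA256_DIGEST_LENGTH];
      struct timespec t0, t1;

      memset(tx, 0xab, sizeof(tx));
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < 5570; i++) {
          tx[0] = i & 0xff;            /* pretend the checked input changed */
          SHA256(tx, sizeof(tx), d1);  /* first SHA256 pass */
          SHA256(d1, sizeof(d1), d2);  /* second pass over the digest */
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e6;
      printf("5570 double-SHA256s of 6k: %.1f ms\n", ms);
      return 0;
  }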

Problem solved?  Not quite….

This Block Isn’t The Worst Case (For An Optimized Implementation)

As I said above, the amount we have to hash is about 6k; if a transaction has larger outputs, that number changes.  We can fit in fewer inputs though.  A simple simulation shows the worst case for a 1MB transaction has 3300 inputs and 406,000 bytes of outputs: simply doing the hashing for the input signatures takes about 10.9 seconds.  That’s only about two or three times faster than the naive bitcoind implementation.

This problem is far worse if blocks were 8MB: an 8MB transaction with 22,500 inputs and 3.95MB of outputs takes over 11 minutes to hash.  If you can mine one of those, you can keep competitors off your heels forever, and own the bitcoin network… Well, probably not.  But there’d be a lot of emergency patching, forking and screaming…

Short Term Steps

An optimized implementation in bitcoind is a good idea anyway, and there are three obvious paths:

  1. Optimize the signature hash path to avoid the copy, and hash in place as much as possible.
  2. Use the Intel and ARM optimized SHA256 routines, which increase SHA256 speed by about 80%.
  3. Parallelize the input checking for large numbers of inputs.

Longer Term Steps

A soft fork could introduce an OP_CHECKSIG2, which hashes the transaction in a different order.  In particular, it should hash the input script replacement at the end, so the “midstate” of the hash can be trivially reused.  This doesn’t entirely eliminate the problem, since the sighash flags can require other permutations of the transaction; these would have to be carefully explored (or only allowed with OP_CHECKSIG).
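
The midstate trick is easy to sketch with a general-purpose hash API. The example below uses OpenSSL and is purely illustrative: it is not a proposal for how OP_CHECKSIG2 would serialize transactions, it just shows why hashing the common bytes once and cloning the context per input is so much cheaper than hashing the whole thing from scratch each time.

  /* Hash the bytes that are common to every input once, then clone the
   * context ("midstate") and finish it per input. */
  #include <openssl/sha.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      unsigned char common[6 * 1024];  /* transaction bytes shared by all inputs */
      unsigned char per_input[256];    /* bytes that differ for each input       */
      unsigned char digest[SHA256_DIGEST_LENGTH];
      SHA256_CTX midstate;

      memset(common, 0x11, sizeof(common));
      SHA256_Init(&midstate);
      SHA256_Update(&midstate, common, sizeof(common));   /* done exactly once */

      for (int i = 0; i < 5570; i++) {
          SHA256_CTX ctx = midstate;                      /* cheap struct copy */
          memset(per_input, i & 0xff, sizeof(per_input));
          SHA256_Update(&ctx, per_input, sizeof(per_input));
          SHA256_Final(digest, &ctx);
          /* second SHA256 pass and the actual signature check omitted */
      }
      printf("last digest starts with %02x\n", digest[0]);
      return 0;
  }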

This soft fork could also place limits on how big an OP_CHECKSIG-using transaction could be.

Such a change will take a while: there are other things which would be nice to change for OP_CHECKSIG2, such as new sighash flags for the Lightning Network, and removing the silly DER encoding of signatures.

July 08, 2015 03:09 AM

July 07, 2015

James Morris: Linux Security Summit 2015 Schedule Published

The schedule for the 2015 Linux Security Summit is now published!

The refereed talks are:

There will be several discussion sessions:

Also featured are brief updates on kernel security subsystems, including SELinux, Smack, AppArmor, Integrity, Capabilities, and Seccomp.

The keynote speaker will be Konstantin Ryabitsev, sysadmin for kernel.org. Check out his Reddit AMA!

See the schedule for full details, and any updates.

This year’s summit will take place on the 20th and 21st of August, in Seattle, USA, as a LinuxCon co-located event.  As such, all Linux Security Summit attendees must be registered for LinuxCon. Attendees are welcome to attend the Weds 19th August reception.

Hope to see you there!

July 07, 2015 03:04 PM

July 06, 2015

Rusty Russell: Bitcoin Core CPU Usage With Larger Blocks

Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).

The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear.  (A 17MB block took 171 milliseconds).

Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block.  At least it’s scaling linearly, but it’s just slow.

So, 16 Seconds Per 8MB Block?

I did some digging.  Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking[1]…

Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction.  I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB.  On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).

So, 4 Seconds Per 8MB Block?

But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.

So, 1.6 Seconds Per 8MB Block?

I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s libsecp256k1, and an 8MB block.

Let’s Say 1.1 Seconds for an 8MB Block

This is with some assumptions about parallelism; and remember this is on my laptop which has a fairly low-end CPU.  While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.


[1] I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.

So I added a print with a sleep, so I could run perf.  Then I disabled optimization, so I’d get understandable backtraces with perf.  Then I rebuilt perf (which is part of the kernel source package) because Ubuntu’s perf doesn’t demangle C++ symbols.  (Are we having fun yet?)  I even hacked up a small program to help run perf on just that part of bitcoind.  Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.

July 06, 2015 09:58 PM

Matthew Garrett: Anti Evil Maid 2 Turbo Edition

The Evil Maid attack has been discussed for some time - in short, it's the idea that most security mechanisms on your laptop can be subverted if an attacker is able to gain physical access to your system (for instance, by pretending to be the maid in a hotel). Most disk encryption systems will fall prey to the attacker replacing the initial boot code of your system with something that records and then exfiltrates your decryption passphrase the next time you type it, at which point the attacker can simply steal your laptop the next day and get hold of all your data.

There are a couple of ways to protect against this, and they both involve the TPM. Trusted Platform Modules are small cryptographic devices on the system motherboard[1]. They have a bunch of Platform Configuration Registers (PCRs) that are cleared on power cycle but otherwise have slightly strange write semantics - attempting to write a new value to a PCR will append the new value to the existing value, take the SHA-1 of that and then store this SHA-1 in the register. During a normal boot, each stage of the boot process will take a SHA-1 of the next stage of the boot process and push that into the TPM, a process called "measurement". Each component is measured into a separate PCR - PCR0 contains the SHA-1 of the firmware itself, PCR1 contains the SHA-1 of the firmware configuration, PCR2 contains the SHA-1 of any option ROMs, PCR5 contains the SHA-1 of the bootloader and so on.
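
In pseudo-C, the extend operation is just a hash over the concatenation of the old register value and the new measurement. The sketch below (using OpenSSL's SHA-1 and a made-up measurement) only illustrates the arithmetic; the real operation happens inside the TPM, not in host code.

  /* A PCR extend is: new_pcr = SHA1(old_pcr || measurement). */
  #include <openssl/sha.h>
  #include <stdio.h>
  #include <string.h>

  static void pcr_extend(unsigned char pcr[SHA_DIGEST_LENGTH],
                         const unsigned char measurement[SHA_DIGEST_LENGTH])
  {
      unsigned char buf[2 * SHA_DIGEST_LENGTH];

      memcpy(buf, pcr, SHA_DIGEST_LENGTH);
      memcpy(buf + SHA_DIGEST_LENGTH, measurement, SHA_DIGEST_LENGTH);
      SHA1(buf, sizeof(buf), pcr);
  }

  int main(void)
  {
      unsigned char pcr[SHA_DIGEST_LENGTH] = { 0 };  /* PCRs start out zeroed */
      unsigned char measurement[SHA_DIGEST_LENGTH];

      /* In a real boot the measurement is the SHA-1 of the next boot stage. */
      SHA1((const unsigned char *)"next-stage-image", 16, measurement);
      pcr_extend(pcr, measurement);

      for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
          printf("%02x", pcr[i]);
      printf("\n");
      return 0;
  }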

If any component is modified, the previous component will come up with a different measurement and the PCR value will be different. Because you can't directly modify PCR values[2], this modified code will only be able to set the PCR back to the "correct" value if it's able to generate a sequence of writes that will hash back to that value. SHA-1 isn't yet sufficiently broken for that to be practical, so we can probably ignore that. The neat bit here is that you can then use the TPM to encrypt small quantities of data[3] and ask it to only decrypt that data if the PCR values match. If you change the PCR values (by modifying the firmware, bootloader, kernel and so on), the TPM will refuse to decrypt the material.

Bitlocker uses this to encrypt the disk encryption key with the TPM. If the boot process has been tampered with, the TPM will refuse to hand over the key and your disk remains encrypted. This is an effective technical mechanism for protecting against people taking images of your hard drive, but it does have one fairly significant issue - in the default mode, your disk is decrypted automatically. You can add a password, but the obvious attack is then to modify the boot process such that a fake password prompt is presented and the malware exfiltrates the data. The TPM won't hand over the secret, so the malware flashes up a message saying that the system must be rebooted in order to finish installing updates, removes itself and leaves anyone except the most paranoid of users with the impression that nothing bad just happened. It's an improvement over the state of the art, but it's not a perfect one.

Joanna Rutkowska came up with the idea of Anti Evil Maid. This can take two slightly different forms. In both, a secret phrase is generated and encrypted with the TPM. In the first form, this is then stored on a USB stick. If the user suspects that their system has been tampered with, they boot from the USB stick. If the PCR values are good, the secret will be successfully decrypted and printed on the screen. The user verifies that the secret phrase is correct and reboots, satisfied that their system hasn't been tampered with. The downside to this approach is that most boots will not perform this verification, and so you rely on the user being able to make a reasonable judgement about whether it's necessary on a specific boot.

The second approach is to do this on every boot. The obvious problem here is that in this case an attacker can simply boot your system, copy down the secret, modify your system and then print the correct secret. To avoid this, the TPM can have a password set. If the user fails to enter the correct password, the TPM will refuse to decrypt the data. This can be attacked in a similar way to Bitlocker, but can be avoided with sufficient training: if the system reboots without the user seeing the secret, the user must assume that their system has been compromised and that an attacker now has a copy of their TPM password.

This isn't entirely great from a usability perspective. I think I've come up with something slightly nicer, and certainly more Web 2.0[4]. Anti Evil Maid relies on having a static secret because expecting a user to remember a dynamic one is pretty unreasonable. But most security conscious people rely on dynamic secret generation daily - it's the basis of most two factor authentication systems. TOTP is an algorithm that takes a seed, the time of day and some reasonably clever calculations and comes up with (usually) a six digit number. The secret is known by the device that you're authenticating against, and also by some other device that you possess (typically a phone). You type in the value that your phone gives you, the remote site confirms that it's the value it expected and you've just proven that you possess the secret. Because the secret depends on the time of day, someone copying that value won't be able to use it later.
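
For reference, a minimal TOTP implementation is only a screenful of C. The sketch below follows RFC 6238 with 30-second steps and six digits, using OpenSSL's HMAC; the secret here is the RFC test-vector key, whereas in an Anti Evil Maid setup the seed would be the secret unsealed from the TPM.

  /* Minimal TOTP (RFC 6238): HMAC-SHA1 over the 30-second counter,
   * dynamic truncation, six digits.  Build with: gcc totp.c -lcrypto */
  #include <openssl/evp.h>
  #include <openssl/hmac.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static unsigned int totp(const unsigned char *secret, int secret_len, time_t now)
  {
      uint64_t counter = (uint64_t)now / 30;          /* 30-second time step */
      unsigned char msg[8], mac[EVP_MAX_MD_SIZE];
      unsigned int mac_len;

      for (int i = 7; i >= 0; i--) {                  /* big-endian counter */
          msg[i] = counter & 0xff;
          counter >>= 8;
      }
      HMAC(EVP_sha1(), secret, secret_len, msg, sizeof(msg), mac, &mac_len);

      unsigned int offset = mac[mac_len - 1] & 0x0f;  /* dynamic truncation */
      uint32_t bin = ((mac[offset] & 0x7f) << 24) | (mac[offset + 1] << 16) |
                     (mac[offset + 2] << 8) | mac[offset + 3];
      return bin % 1000000;                           /* six decimal digits */
  }

  int main(void)
  {
      /* The RFC 6238 test-vector key; a real setup would use the secret
       * unsealed from the TPM as the seed. */
      const unsigned char secret[] = "12345678901234567890";

      printf("%06u\n", totp(secret, sizeof(secret) - 1, time(NULL)));
      return 0;
  }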

But instead of using your phone to identify yourself to a remote computer, we can use the same technique to ensure that your computer possesses the same secret as your phone. If the PCR states are valid, the computer will be able to decrypt the TOTP secret and calculate the current value. This can then be printed on the screen and the user can compare it against their phone. If the values match, the PCR values are valid. If not, the system has been compromised. Because the value changes over time, merely booting your computer gives your attacker nothing - printing an old value won't fool the user[5]. This allows verification to be a normal part of every boot, without forcing the user to type in an additional password.

I've written a prototype implementation of this and uploaded it here. Do pay attention to the list of limitations - without a bootloader that measures your kernel and initrd, you're still open to compromise. Adding TPM support to grub is on my list of things to do. There are also various potential issues like an attacker being able to use external DMA-capable devices to obtain the secret, especially since most Linux distributions still ship kernels that don't enable the IOMMU by default. And, of course, if your firmware is inherently untrustworthy there's multiple ways it can subvert this all. So treat this very much like a research project rather than something you can depend on right now. There's a fair amount of work to do to turn this into a meaningful improvement in security.

[1] I wrote about them in more detail here, including a discussion of whether they can be used for general purpose DRM (answer: not really)

[2] In theory, anyway. In practice, TPMs are embedded devices running their own firmware, so who knows what bugs they're hiding.

[3] On the order of 128 bytes or so. If you want to encrypt larger things with a TPM, the usual way to do it is to generate an AES key, encrypt your material with that and then encrypt the AES key with the TPM.

[4] Is that even a thing these days? What do we say instead?

[5] Assuming that the user is sufficiently diligent in checking the value, anyway

July 06, 2015 05:39 PM

Matthew Garrett: Internet abuse culture is a tech industry problem

After Jesse Frazelle blogged about the online abuse she receives, a common reaction in various forums[1] was "This isn't a tech industry problem - this is what being on the internet is like"[2]. And yes, they're right. Abuse of women on the internet isn't limited to people in the tech industry. But the severity of a problem is a product of two separate factors: its prevalence and what impact it has on people.

Much of the modern tech industry relies on our ability to work with people outside our company. It relies on us interacting with a broader community of contributors, people from a range of backgrounds, people who may be upstream on a project we use, people who may be employed by competitors, people who may be spending their spare time on this. It means listening to your users, hearing their concerns, responding to their feedback. And, distressingly, there's significant overlap between that wider community and the people engaging in the abuse. This abuse is often partly technical in nature. It demonstrates understanding of the subject matter. Sometimes it can be directly tied back to people actively involved in related fields. It's from people who might be at conferences you attend. It's from people who are participating in your mailing lists. It's from people who are reading your blog and using the advice you give in their daily jobs. The abuse is coming from inside the industry.

Cutting yourself off from that community impairs your ability to do work. It restricts meeting people who can help you fix problems that you might not be able to fix yourself. It results in you missing career opportunities. Much of the work being done to combat online abuse relies on protecting the victim, giving them the tools to cut themselves off from the flow of abuse. But that risks restricting their ability to engage in the way they need to in order to do their job. It means missing meaningful feedback. It means passing up speaking opportunities. It means losing out on the community building that goes on at in-person events, the career progression that arises as a result. People are forced to choose between putting up with abuse or compromising their career.

The abuse that women receive on the internet is unacceptable in every case, but we can't ignore the effects of it on our industry simply because it happens elsewhere. The development model we've created over the past couple of decades is just too vulnerable to this kind of disruption, and if we do nothing about it we'll allow a large number of valuable members to be driven away. We owe it to them to make things better.

[1] Including Hacker News, which then decided to flag the story off the front page because masculinity is fragile

[2] Another common reaction was "But men get abused as well", which I'm not even going to dignify with a response

July 06, 2015 05:37 PM