Kernel Planet

May 24, 2017

Pete Zaitcev: Community Meeting

<notmyname> first, the idea of having a regular meeting in addition to this one for people in different timezones
<cschwede_> +2!
<notmyname> specifically, mahatic and pavel/onovy/seznam. but of course we've all seen various chinese contributors too
<notmyname> but the point is that it's a place to bring up stuff that those in the other time zones are working on
<mattoliverau> Cool
<notmyname> I think it's a terrific idea
<tdasilva> i bet the guys working on tape would like that too
<notmyname> my goal is to find a time for it that is so horrible for US timezones that it will be obvious that not everyone needs to be there
<zaitcev> Yeah, if only there was a way to send a message... like a mail... to a list of people. And then it could be stored on a computer somewhere, ready to be read in any timezone recepient is in.
<notmyname> zaitcev: crazytown!
<mattoliverau> zaitcev: now your just talkin crazy

May 24, 2017 09:52 PM

May 23, 2017

Michael Kerrisk (manpages): Linux Shared Libraries course, Munich, Germany, 20 July 2017

I've scheduled a public instance of my "Building and Using Shared Libraries on Linux" course to take place in Munich, Germany on 20 July 2017.  This one-day course provides a thorough introduction to building and using shared libraries. covering topics such as: the basics of creating, installing, and using shared libraries; shared library versioning and naming conventions; the role of the dynamic linker; run-time symbol resolution; controlling symbol visibility; symbol versioning; preloading shared libraries; and dynamically loaded libraries (dlopen). The course format is a mixture of theory and practical.

The course is aimed at programmers who create and use shared libraries. Systems administrators who are managing and troubleshooting applications that use shared libraries will also find the course useful.

You can find out more about the course (such as expected background and course pricing) at http://man7.org/training/shlib/ and see a detailed course outline at
http://man7.org/training/shlib/shlib_course_outline.html.

May 23, 2017 02:14 PM

Michael Kerrisk (manpages): Cgroups/namespaces/seccomp/capabilities course

There are still some places available on my "Linux Security and Isolation APIs" that will take place in Munich, Germany on 17-19 July 2017.  This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to implement privileged applications and build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

You can find out more about the course (such as expected background and course pricing) at
http://man7.org/training/sec_isol_apis/
and see a detailed course outline at
http://man7.org/training/sec_isol_apis/sec_isol_apis_course_outline.html

May 23, 2017 02:01 PM

May 18, 2017

Linux Plumbers Conference: Linux Kernel Memory Model Workshop Accepted into Linux Plumbers Conference

A good understanding of the Linux kernel memory model is essential for a great many kernel-hacking and code-review tasks.  Unfortunately, the current documentation (memory-barriers.txt) has been said to frighten small children, so this workshop’s goal is to demystify this memory model, including hands-on demos of the tools, help installing/running the tools, and help constructing appropriate litmus tests.  These tools should go a long way toward the ultimate goal of automating the process of using memory models to frighten small children.

For more information, please see this microconference’s wiki page.  For those who like getting a head start, this page also includes information on downloading and installing the tools, the memory model, and thousands of pre-existing litmus tests.  (Collect the whole set!!!) We also welcome experience reports from early adopters of these tools.

We hope to see you there!

May 18, 2017 05:04 PM

May 15, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/05/14

Audio: http://traffic.libsyn.com/jcm/20170514.mp3

In this week’s catchup mega-issue: Linux 4.12-rc1 (including a full summary of the 4.12 merge window), Linux 4.11 final is released, saving TLB flushes, various ongoing development, and a bunch of announcements.

Editorial Note

This podcast is a free service that I provide to the community in my spare time. It takes many, many hours to prepare and produce a single episode, much more during the merge window. This means that when I have major events (such as Red Hat Summit followed by OpenStack Summit) it will be delayed, as was the case this last week week. Over the coming months, I hope to automate the production in order to reduce the overhead but there will be some weeks where I need to skip a show. I am however covering the whole 4.12 merge window regardless. So while I would usually have just moved on, the circumstance warrants a mega-length catchup episode. I hope you’re still awake by the end.

Linux 4.12-rc1

Linus Torvalds announced Linux 4.12-rc1, “one day early, because I don’t like last-minute pull requests during the merge window anyway, and tomorrow is mother’s day [in the US], so I may end up being roped into various happenings”. He also noted “Besides, this has actually been a pretty large merge window, so despite there technically being time for one more day of pulls, I actually do have enough changes already. So there.” In his announcement, he says things look smooth so far, but calls those also “Famous last words”. Finally, he calls out the “odd” diffstat which is dominated by the AMD Vega10 headers. As was noted in the pull requests, unlike certain other graphics companies, AMD actually provides nice automatically generated headers and other information about their graphics chipsets, which is why the Vega10 update is plentiful.

Later in the day yesterday, following the 4.12-rc1 announcement, Guenter Roeck posted “watchdog updates for v4.12”, and Jon Mason posted “NTB bug fixes for vv4.12”, along with an apologies for tardiness.

Linux 4.11

Linus Torvalds announced Linux 4.11 noting that the extra week due a (rare-ish) “rc8” (Release Candidate 8) meant that he had felt “much happier releasing a final 4.11 now”. As usual, Linux Kernel Newbies has a writeup of 4.11, here: https://kernelnewbies.org/Linux_4.11

Announcements

Greg K-H (Kroah-Hartman) announced Linux 4.4.68, 4.9.28, 4.10.16, and 4.11.1. He later sent “Bad signatures on recent stable updates” in which he noted that “The stable kernels I just released have had signatures due to a mixup using pixz in the new kernel.org backend. It will be fixed soon…”, which were later corrected. He would like to hear from anyone still seeing problems.

Greg also announced (separately) Linux 3.18.52. While Jiri Slaby announced Linux 3.12.74.

Stephen Hemminger announced iproute2 4.11 matching the new kernel release.

Michael Kerrisk announced map-pages-4.11.

Steven Rostedt announced trace-cmd 2.6.1.

Steven also announced Linux 4.4.66-rt79, 3.18.51-rt57, and 3.12.73-rt98 (preempt-rt) kernels.

Con Kolivas posted an updated version of his (renamed) “MuQSS CPU scheduler” [renamed from the BFS – Brain F*** Scheduler] in Linux 4.11-ck1.

Karel Zak announced util-linux v2.30-rc1, which includes a fix to libblkid that “has been fixed to extract LABEL= and UUID= from UDF rather than ISO9660 header on hybrid CDROM/DVD media. This change[] makes UDF media on Linux user-space more compatible with another operation systems.” but he calls it out since it could also introduce regressions for some other users.

Junio C Hamano announced Git version 2.13.0. Separately, he released maintenance versions of “Git v2.12.3 and others” which include fixes for
“a recently disclosed problem with “git shell”, which may allow a user who comes over SSH to run an interactive pager by causing it to spawn “git upload-pack –help” (CVE-2017-8386).”

Jan Kiszka announced version 0.7 of the Jailhouse hypervisor, which includes various debug and regular console driver updates and gcov debug statistics.

Bartosz Golaszewski announced libgpiod v0.2: “The most prominent new feature is the test suite working together with the gpio-mockup module”.

Christoph Hellwig notes that the Open OSD [an in-kernel OSD – Object-Based Storage Device] SCSI initiator library for Linux seems to be dead. He does this by posting a patch to the MAINTAINERS file “update OSD entries” in which he removes the (now defunct) open-osd.org website, and the bouncing email address for Benny Halevy. Benny appeared and ACKed.

In a similar vain, Ben Hutchings pondered aloud about the “Future of liblockdep”, which apparently “hasn’t been buildable since (I think) Linux
4.6”. Sasha Levin said things would be cleaned up promptly. And they were, with a pull request soon following with fixes for Linux 4.12.

Masahiro Yamada posted an RFC patch entitled “Increase Minimal GNU Make version for Linux Kernel from 3.80 to 3.81” in which he essentially noted that the kernel hadn’t actually worked with 3.80 (which is 15 years old!) in a very long time, but instead actually really needs 3.81 (which was itself released in 2006). It was apparently “broken” 3 years ago, but nobody noticed. Neither Greg K-H (Kroah-Hartman) nor Linus seemed to lose any sleep over this, with Linus saying “you make a strong case of “it hasn’t worked for a while already and nobody even noticed””.

Paolo Bonzini posted “CFP: KVM Forum 2017” announcing that the KVM Forum will be held October 25-27 at the Hilton in Prague, CZ, and that all submissions for proposed topics must be made by midnight June 15.

Thomas Gleixner announced “[CFP] RT-Summit – Call for Presentations” noting that the Real-Time Summit 2017 is being organized by the Linux Foundation Real-Time Linux (RTL) collaborative project in cooperation with OSADL/RTLWS and will be held also in Prague on October 21st. The cutoff for submissions is July 14th via rt-cfp@linutronix.de.

4.12 Merge Window

In his 4.11 announcement, Linus reminded us that the release of 4.11 meant that “the merge window [for kernel 4.12] is obviously open. I already have two pull request[s] for 4.12 in my inbox, I expect that overnight I’ll get a lot more.” He wasn’t disappointed. The flood gates well and truly opened. And they continued going for the entire two week (less one day) period. Let’s dive into what has been posted so far for 4.12 during the (now closed) merge window.

Stephen Rothwell [linux-next pre-merge development kernel tree maintainer] noted in a head’s up that Linus was going to see a “Large new drm driver” [drm – Direct Rendering Manager, not the “digital rights” technology]. Dave Airlie (the drm maintainer) had a reply but Stephen said everything was just fine and he was simply seeking to avoid surprising Linus (again). Once the pull came in, and Linus had pulled it, he quickly followed up to note that he was getting a lot of warnings about “Atomic update on pipe (A) took”. Daniel Vetter followed up to say that “We [Intel] did improve evasion a lot to the point that it didn’t show up in our CI machines anymore, so we felt we could risk enabling this everywhere. But of course it pops up all over the place as soon as drm-next hits mainline”.

4.12 git Pulls for existing subsystems

Hans-Christian Noren Egtvedt posted “AVR32 change for 4.12 – architecture removal” in which he removes AVR32 and “clean away the most obvious architecture related parts”. He posted followups to pick off more leftovers.

Ingo Molnar posted “RCU changes for 4.12” which includes “Parallelize SRCU callback handling”, performance improvements, documentation updates, and various other fixes. Linus pulled it. But then “after looking at it, ended up un-pulling it again”. He posted a rant about a new header file (linux/rcu_segcblist.h) which was a “header file from hell”, saying “I see absolutely no point in taking a heade file of several hundred lines of code”, along with more venting about the use of too much inline code (code that is always expanded in-place rather than called as a function – leading to a larger footprint sometimes). Finally, Linus said “The RCU code needs to start showing some good taste”. Sir Paul McKenney, the one and only author of RCU followed up swiftly, apologizing for the transgression in attempting to model “the various *list*.h header files”, proposing a fix, which Linus liked. Ingo Molnar implemented the suggestions, in “srcu: Debloat the <linux/rcu_segcblist.h> head”, which Paul provided a minor fix against for the case of !SMP (non-multi-processor kernel) builds.

Ingo Molnar also posted “EFI changes for 4.12” including fixes to the BGRT ACPI table (used for boottime graphics information) to allow it to be shared between x86 and ARM systems, an update to the arm64 boot protocol, improvements to the EFI stub’s command line parsing, and support for randomizing the virtual mapping of UEFI runtime services on arm64. The latter means that the function pointers for UEFI Runtime Services callbacks will be placed into random virtual address locations during the call to ExitBootServices which sets up the new mappings – it’s a good way to look for problems with platforms containing broken firmware that doesn’t correctly handle the change in location of runtime service calls.

Ingo Molnar also posted “x86/process changes for 4.12” which includes a new ARCH_[GET|SET]_CPUID prctl (process control) ABI extension that a running process can use in order to determine whether it has access to call the CPUID instruction directly. This is to support a userspace debugger known as “rr” that would like to trap and emulate calls to “CPUID” which are otherwise normally unprivileged on x86 systems.

Separately, Ingo posted “x86 fixes”, which includes “mostly misc fixes” for such things as “two boot crash fixes”, etc.

Ingo Molnar also posted “perf changes for 4.12” which includes updates to K and uprobes, making their trampolines (the codepaths jumped through when executing the probe sequence) read-only while they are used, changing UPROBES_EVENTS to be default yes in the Kconfig (since distros do this), and various other fixes. He also includes support for AMD IOMMU events, and new events for Intel Goldmont CPUs. The perf tooling itself gets many more updates, including PERF_RECORD_NAMESPACES, which allows the kernel to record information “required to associate samples to namespaces”.

Separately, Ingo posted “perf fixes”, which includes “mostly tooling updates”.

Ingo Molnar also posted “RAS changes for v4.12” which includes a “correct Errors Collector” kernel feature that will gather statistics aout correctable errors affecting physical memory pages. Once a certain watermark is reached, pages generating many correctable errors will be permanently offlined [this is useful both for DDR and NV-DIMMs]. Finally, he deprecates the existing /dev/mcelog driver and includes cleanups for MCE (Machine Check Exception) errors during kexec on x86 (which we covered in previous editions of this podcast).

Ingo Molnar also posted “x86/asm changes for v4.12”, which includes various fixes, among which are cleanups to stack trace unwinding.

Ingo Molanr also posted “x86/cpu changes for v4.12”, which includes support for “an extension of the Intel RDT code to extend it with Intel Memory Bandwidth Allocation CPU support: MBA allows bandwidth allocation between cores, while CBM (already upstream) allows CPU cache partitioning”. Effectively, Intel incorporate changes to their memory controller’s hardware scheduling algorithms as part of RDT. These allow the DDR interface to manage bandwidth for specific cores, which will almost certainly include both explict data operations, as well as separate algorithms for prefetching and speculative fetching of instructions and data. [This author has spent many hours reading about memory controller scheduling over the past year]

Ingo Molnar also posted “x86/debug changes for v4.12”, which includes support for the USB3 “debug port” based early console. As we have mentioned previously, USB3 includes a built-in “debug port” which no longer requires a special dongle to connect a remote machine for debug. It’s common in Windows kernel development to use a debug port, and since USB3 includes baseline support with the need for additional hardware, serial over USB3 is likely to become more common when developing for Linux – especially with the demise of DB9 headers on systems or even IDC10 headers on motherboards internally (to say nothing of laptop systems). As a reminder, with debug ports, usually only one USB port will support debug mode. I
guess my old USB debug port dongle can go in the pile of obsolete gear.

Ingo Molnar also posted “x86/platform changes for v4.12” which includes “continued SGI UV4 hardware-enablement changes, plus there’s also new Bluetooth support for the Intel Edison [a low cost IoT board] platform”.

Ingo Molnar also posted “x86/vdso changes for v4.12” which includes support for a “hyper-V TSC page” which is what it sounds like – a special shared page made available to guests under Microsoft’s Hyper-V hypervisor and providing a fast means to enumerate the current time. This is plumbed into the kernel’s vDSO mechanism (Virtual Dynamic Shared Objects look a bit like software libraries that are automatically linked against every running program when it launches) to allow fast clock reading.

Ingo Molnar also posted “x86/mm changes for v4.12”, which includes yet more work toward Intel 5-level paging among many other updates.

Separately Ingo posted a single “core kernel fix” to “increase stackprotector canary randomness on 64-bit kernels with very little cost”.

Thomas Gleixner posted “irq updates for 4.12”, which include a new driver for a MediaTek SoC, ACPI support for ITS (Interrupt Translation Services) when using a GICv3 on ARM systems, support for shared nested
interrupts, and “the usual pile of fixes and updates all over t[h]e place”.

Thomas Gleixner also posted “timer updates for 4.12” that include more reworking of year 2038 support (the infamous wrap of the Unix epoch), a “massive rework of the arm architected timer”, and various other work.

Separately, Ingo Molnar followed up with “timer fix” including “A single ARM Juno clocksource driver fix”.

Corey Minyard posted “4.12 for IPMI” including a watchdog fix. He “switched over to github at Stephen Rothwell’s [linux-next maintainer] request”.

Jonathan Corbet posted “Docs for 4.12” which includes “a new guide for user-space API documents” along with many other updates. Anil Nair noted “Missing File REPORTING-BUGS in Linux Kernel” which suggests that the Debian kernel package tools need to be taught about the recent changes in the kernel’s documentation layout. Separately, Jonathan replied to a thread entitled “Find more sane first words we have to say about Linux” noting that the kernel’s documentation files might not be the first place that someone completely new to Linux is going to go looking for information: “So I don’t doubt we could put something better there, but can we think for a moment about who the audience is here? If you’re “completely new to Linux”, will you really start by jumping into the kernel source tree?” The guy should do kernel standup in addition to LWN. It’d be hilarious.

Later, Jon posted “A few small documentation updates” which “Connect the newly RST-formatted documentation to the rest; this had to wait until the input pull was done. There’s also a few small fixes that wandered in”.

Tejun Heo posted “libata changes for 4.12-rc1” which includes “removal of SCT WRITE SAME support, which never worked properly”. SCT stands for “SMART [Self Monitoring And Reporting Technology – an error management mechanism common in contemporary disks] Command Transport”. The “write same” part means to set the drive content to a specific pattern (e.g. to zero it out) in cases that TRIM is not available. One wonders if that is also a feature used during destruction, though apparently the only (NSA) trusted way to destroy disks today is shredding and burning after zeroing.

Tejun Heo also posted “workqueue changes for v4.12-rc1”, which includes “One trivial patch to use setup_deferrable_timer() instead of open-coding the initialization”.

Tejun Heo also posted “cgroup changes for v4.12-rc1”, which includes a “second stab at fixing the long-standard race condition in the mount path and suppression of spurious warning from cgroup_get”.

Rafael J. Wysocki posted “Power management updates for v4.12-rc1, part 1” which includes many updates to the cpufreq subsystem and “to the intel_pstate driver in particular”. Its sysfs interface has apparently also been reworked to be more consistent with general expectations. He adds “Apart from that, the AnalyzeSuspend utility for system suspend profiling gets a companion called AnalyzeBoot for the analogous profiling of system boot and they both go into one place”.

Separately, he posted “Power management updates for v4.12-rc1, part 2”, which “add new CPU IDs [Intel Gemini Lake] to a couple of drivers [intel_idle and intel_rapl – Running Average Power Limit], fix a possible NULL pointer deference in the cpuidle core, update DT [DeviceTree]-related things in the generic power domains framwork and finally update the suspend/resume infrastructure to improve the handling of wakeups from suspend-to-idle”.

Rafael J. Wysocki also posted “ACPI updates for v4.12-rc1, part 1”, which includes a new Operation Region driver for the Intel CHT [Cherry Trail] Whiskey Cove PMIC [Power Management Integrated Circuit], and new sysfs entries for CPPC [Collaborative Processor Performance Control], which is a much more fine grained means for OS and firmware to coordinate on power management and CPU frequency/performance state transitions.

Separately, he posted “ACPI updates for v4.12-rc1, part 2”, which “update the ACPICA [ACPI – Advanced Configuration and Power Interface – Component Architecture, the cross-Operating System reference code]” to “add a few minor fixes and improvements”, and also “update ACPI SoC drivers with new device IDs, platform-related information and similar, fix the register information in xpower PMIC [Power Management IC] driver, introduce a concept of “always present” devices to the ACPI device enumeration code and use it to fix a problem with one platform [INT0002, Intel Cherry Trail], and fix a system resume issue related to power resources”.

Separately, Benjamin Tissories posted a patch reverting some ACPI laptop lid logic that had been introduced in Linux 4.10 but was breaking laptops from booting with the lid closed (a feature folks especially in QE use).

Rafael J. Wysocki also posted “Generic device properties framework updates for v4.12-rc1”, which includes various updates to the ACPI _DSD [Device Properties] method call to recognize “ports and endpoints”.

Shaohua Li posted “MD update for 4.12” which includes support for the “Partial Parity Log” feature present on the Intel IMSM RAID array, and a rewrite of the underlying MD bio (the basic storage IO concept used in Linux) handling. He notes “Now MD doesn’t directly access bio bvec, bi_phys_segments and uses modern bio API for bio split”.

Ulf Hansson posted “MMC for v[.]4.12” which includes many driver updates as well as refactoring of the code to “prepare for eMMC CMDQ and blkmq”. This is the planned transition to blkmq (block-multiqueue) for such storage devices. Previously it had stalled due to the performance hit when trying to use a multi-queue approach on legacy and contemporary non-mq devices.

Linus Walleij posted “pin control bulk changes for v4.12” in which he notes that “The extra week before the merge window actually resulted in some of the type of fixes that usually arrive after the merge window already starting to trickle in from eager developers using -next, I’m impressed”. He’s also impressed with the new “Samsung subsystem maintainer” (Krzysztof). Of the many changes, he says “The most pleasing to see is Julia Cartwright[‘]s work to audit the irqchip-providing drivers for realtime locking compliance. It’s one of those “I should really get around to looking into that” things that have been on my TODO list since forever”.

Linus Walliej also posted “Bulk GPIO changes for v4.12”, which has “Nothing really exciting goes on here this time, the most exciting for me is the same as for pin control: realtime is advancing thanks [t]o Julia Cartwright”.

Petr Mladek posted “printk for 4.12” which includes a fix for the “situation when early console is not deregistered because the preferred one matches a wrong entry. It caused messages to appear twice”.

Jiri Kosina posted “HID for 4.12” which includes various fixes, amongst them being an inversion of the HID_QUIRK_NO_INIT_REPORTS to the opposite due to the fact that it is appearently easier to whitelist working devices.

Jiri Kosina also posted “livepatching for 4.12” which includes a new “per-task consistency model” that is “being added for architectures that support reliable stack dumping”, which apparently “extends the nature of the types of patches than can be applied by live patching”.

Lee Jones posted “Backlight for 4.12” which includes various fixes.

Lee Jones also posted “MFD for v4.12” which includes some new drivers, new device support, and various new functionality and fixes.

Juergen Gross posted “xen: fixes and features for 4.12” which includes support for building the kernel with Xen enabled but without enabling paravirtualization, a new 9pfs xen frontend driver(!), and support for EFI “reset_sytem” (needed for ARMv8 Dom0 host to reboot), among various other fixes and cleanups.

Alex Williamson posted “VFIO updates for v4.12-rc1”.

Joerg Roedel posted “IOMMU Updates for Linux v4.12”, which includes “Some code optimizations for the Intel VT-d driver, code to “switch off a previously enabled Intel IOMMU” (presumably in order to place it into bypass mode for performance or other reasons?), “ACPI/IORT updates and fixes” (which enables full support for the ACPI IORT on 64-bit ARM).

Dmitry Torokhov posted “Input updates for v.4.11-rc0” which includes a documentation converstion to ReST (RST, the new kernel doc format), an update to the venerable Synaptics “PS/2” driver to be aware of companion “SMBus” devices and various other miscellaneous fixes.

Darren Hart posted “platform-drivers-x86 for 4.12-1” which includes “a significantly larger and more complex set of changes than those of prior merge windows”. These include “several changes with dependencies on other subsytems which we felt were best managed through merges of immutable branches”.

James Bottomley posted “first round of SCSI updates for the 4.11+ merge window”, which includes many driver updates, but also comes with a warning to Linus that “The major thing you should be aware of is that there’s a clash between a char dev change in the char-misc tree (adding the new cdev_device_add) and the make checking the return value of scsi_device_get() mandatory”. Linus and Greg would later clarify what cdev_device_add does in response to Greg’s request to pull “Char/Misc driver patches for 4.12-rc1”.

David Miller posted “Networking” which includes many fixes.

David also posted “Sparc”, which includes a “bug fix for handling exceptions during bzero on some sparc64 cpus”.

David also posted “IDE”, which includes “two small cleanups”.

Greg K-H (Kroah-Hartman) posted “USB driver patches for 4.12-rc1”, which includes “Lots of good stuff here, after many many many attempts, the kernel finally has a working typeC interface, many thanks to Heikki and Guenter and others who have taken the time to get this merged. It wasn’t an easy path for them at all.” It will be interesting to test that out!

Greg K-H also posted “Driver core patches for 4.12-rc1”, which is “very tiny” this time around and consists mostly of documentation fixes, etc.

Greg K-H also posted “Char/Misc driver patches for 4.12-rc1” which features “lots of new drivers” including Google firmware drivers, FPGA drivers, etc. This lead to a reaction from Linus about how the tree conflicted with James Bottomley’s tree (which he had already pulled, “as per James’ suggestion”, and a back and forth between James and Greg about how to better handle such a conflict next time, and Linus noting that he prefers to fix merge conflicts himself but “*also* really really prefer the two sides of the conflict having been more aware of the clash” and providing him with a head’s up in the pull.

Greg K-H also posted “Staging/IIO driver fixes for 4.12-rc1”, which adds “about 350k new lines of crap^Wcode, mostly all in a big dump of media drivers from Intel”. He notes that the Android low memory killer driver has finally been deleted “much to the celebration of the -mm developers”.

Greg K-H also posted “TTY patches for 4.12-rc1”, which wasn’t big.

Dan Williams posted “libnvdimm for 4.12” which includes “Region media error reporting [a generic interface more friendly to use with multiple namespaces]”, a new “struct dax_device” to allow drivers to have their own custom direct access operations, and various other updates. Dan also posted “libnvdimm: band aid btt vs clear posion locking”, a patch which “continues the 4.11 status quo of disabling of error clearing from the BTT [Block Translation Table] I/O path” and notes that “A solution for tracking and handling media errors natively in the BTT is needed”. The BTT or Block Translation Table is a mechanism used by NV-DIMMs to handle “torn sectors” (partially complete writes) in hardware during error or power failure. As the “btt.txt” in the kernel documentation notes, NV-DIMMs do not have the same atomicity guarantees as regular flash drives do. Flash drives have internal logic and store enough energy in capacitors to complete outstanding writes during a power failure (rotational drives have similar for flushing their memory based caches and syncing remap block state) but NV-DIMMs are designed differently. Thus the BTT provides a level of indirection that is used to provide for atomic sector semantics.

Separately, Dan posted “libnvdimm fixes for 4.12-rc1” which includes “incremental fixes and a small feature addition relative to the libnvdimm 4.12 pull request”. Gert had “noticed hat tinyconfig was bloated by BLOCK selecting DAX [Direct Acess Execution]”, while “Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace”.

Dave Airlie posted “drm tegra for 4.12-rc1” containing additional updates due because he missed a pull from Thierry Reding for NVidia Tegra patches. He also followed up with a “drm document code of conduct” patch that describes a code of conduct for graphics written by freedesktop.org.

Stafford Horne posted “Initramfs fix for 4.12-rc1” containing a fix “for an issue that has caused 4.11 to not boot on OpenRISC”.

Catalin Marinas posted “arm64 updates for 4.12” including kdump support, “ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex numbers and weaker release consistency [memory ordering]”, and support for platform (non-enumerated buses) MSI support when using ACPI, among other patches. He also removes support for ASID-tagged VIVT [Virtually Indexed, Virtually Tagged] caches since “no ARMv8 implementation using it and deprecated in the architecture” [caches are PIPT – Physically Indexed, Physically Tagged – except that an implementation might do VIPT or otherwise internally using various CAM optimizations].

Catalin later posted “arm64 2nd set of updates for 4.12”, which include “Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is enabled”.

Olof Johansson posted “ARM: SoC contents for 4.12 merge window”. In his pull request, Olof notes that “It’s been a relatively quiet release cycle here. Patch count is about the usual (818 commits, which includes merges).”
He goes on to add, “Besides dts [DeviceTree files], the mach-gemini cleanup by Linus Walleij is the only platform that pops up on its own”. He called out the separate post for the TEE [Trusted Execution Environment] subsystem. Olof also removed Alexandre Courbot and Stephen Warren from NVidia Tega maintainership, and added Jon Hunter in their place.

Rob Herring posted “DeviceTree for 4.12”, which includes updates to the Device Tree Compiler (dtc), and more DeviceTree overlay unit tests, among various other changes.

Darrick J. Wong posted “xfs: updates for 4.12”, which includes the “big new feature for this release” of a “new space mapping ioctl that we’ve been discussing since LSF2016 [Linux Storage and Filesystem conference]”.

Max Filippov posted “Xtensa improvements for 4.12”.

Ted Ts’o posted “ext4 updates for 4.12”, which adds “GETFSMAP support” (discussed previously in this podcast) among other new features.

Ted also posted “fscrypt updates for 4.12” which has “only bug fixes”.

Paul Moore posted “Audit patches for 4.12” which includes 14 patches that “span the full range of fixes, new featuresm and internal cleanups”. These include a move to 64-bit timestamps, converting refcounts to the new refcount_t type from atomic_t, and so on.

Wolfram Sang posted “i2c for 4.12”.

Mark Brown posted “regulator updates for 4.12”, which includes “Quite a lot going on with the regulator API for this release, much more in the core than in the drivers for a change”. This includes “Fixes for voltage change propagation through dumb power switches, a notification when regulators are enabled, a new settling time property for regulators where the time taken to move to a new voltage is not related to the size of the change”, etc.

Mark also posted “SPI updates for 4.12”, which includs “quite a lot of small
driver specific fixes and enhancements”.

Jessica Yu posted “module updates for 4.12”, containing minor fixes.

Mauro Carvalho Chehab posted “media updates” including mostly driver updates and the removal of “two staging LIRC drivers for obscure hardware”. He also posted a 5 part patch series entitled “Conver more books to ReST”, which converted three kernel DocBook format documentation file sets to RST, the new format being used for kernel documentation (on the kernel-doc mailing list, and maintained by Jonathan Corbet of LWN): librs, mtdnand, and sh. He noted that “After this series, there will be just one DocBook pending conversion: ” lsm (Linux Security Modules)”. He also notes that the existing LSM documentation is very out of date and no longer describes the current API.

Michael Ellerman posted “Please pull powerpc/linux.git powerpc-4.12-1 tag”, which includes suppot for “Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap() [this seems very similar to the 56-bit la57 feature from Intel]”. It also includes “TLB flushing optimisations for the radix MMU on Power9” and “Support for CAPI cards on Power9, using the “Coherent Accelerator Interface Architecture 2.0″ [which definitely sounds like juicy reading]”.

Separately Michael Ellerman posted “Please pull powerpc-linux.git powerpc-4.12-2 tag” which includes “rework the Linux page table geometry to lower memory usage on 64-bit Book3S (IBM chips) using the Hash MMU [IBM uses a special inverse page tables “reverse lookup” hashing format]”.

Eric W. Biederman posted “namespace related changes for v4.12-rc1”, which includes a “set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn’t quite make the merge window”.

Takashi Iwai posted “sound updates for 4.12-rc1”, noting that it was “a relatively calm development cycle, no scaring changes are seen”.

Steven Rostedt posted “tracing: Updates for v4.12” which includes “Pretty much a full rewrite of the process of function probes”. He followed up with “Three more updates for 4.12” that contained “three simple changes”.

Martin Schwidefsky posted “s390 patches for 4.12 merge window” which includes improvements to VFIO support on mainframe(!) [this author was recently amazed to see there are also DPDK ports for s390x], a new true random number generator, perf counters for the new z13 CPU, and many others besides.

Geert Uytterhoeven posted “m68k updates for 4.12” with a couple fixes.

Jacek Anaszewski posted “LED updates for 4.12” with various fixes.

Kees Cook posted “usercopy updates for v4.12-rc1” with a couple fixes.

Kees also posted “pstore updates for v4.12-rc1”, which included “large
internal refactoring along with several smaller fixes”.

James Morris posted “Security subsystem updates for v4.12”.

Sebastian Reichel posted “hsi changes for hsi-4.12”.

Sebastian also posted “power-supply changes for 4.12”, which includes a couple of new drivers and various fixes.

Separately, Sebastian poted “power-supply changes for 4.12 (part 2), which includes some new drivers and some fixes.

Paolo Bonzini posted “First batch of KVM changes for 4.12 merge window” which includes kexec/kdump support on 32-bit ARM, support for a userspace virtual interrupt controller to handle the “weird” Raspberry Pi 3, in-kernel acceleration for VFIO on POWER, nested EPT support for accessed and dirty bits on x86, and many other fixes and improvements besides.

Separately Paolo posted “Second round of KVM changes for 4.12”, which include various ARM (32 and 64-bit) cleanups, support for PPC [POWER] XIVE (eXternal Interrupt Virtualization Engine), and “x86: nVMX improvements, including emulated page modification logging (PML) which brings nice performance improvements [under nested virtualization] on some workloads”.

Ilya Dryomov posted “Ceph updates for 4.12-rc1”, which include “support for disabling automatic rbd [resilent block device] exclusive lock transfers” and “the long awaited -ENOSPC [no space] handling series”. The latter finally handles out of space situations by aborting with -ENOSPC rather than “having them [writers] block indefinitely”.

Miklos Szeredi posted “fuse updates for 4.12”, which “contains support for pid namespaces from Seth and refcount_t work from Elena”.

Miklos also posted “overlayfs update for 4.12”, which includes “making st_dev/st_ino on the overlay behave like a normal filesystem”. “Currently this only wokrs if all layers are on the same filesystem, but future work will move the general case towards more sane behavior”.

Bjorn Helgaas posted “PCI changes for v4.12” which includes a framework for supporting PCIe devices in Endpoint mode from Kishon Vjiay Abraham, fixes for using non-posted PCI config space on ARM from Lorenzo Pieralisi, allowing slots below PCI-to-PCIe “reverse bridges”, a bunch of quirks, and many other fixes and enhancements.

Jaegeuk Kim posted “f2fs for 4.12-rc1”, which “focused on enhancing performance with regards to block allocation, GC [Garbage Collection], and discard/in-place-update IO controls”.

Shuah Khan posted “Kselftest update for 4.12-rc1” with a few fixes.

Richard Weinberg posted “UML changes for v4.12-rc1” which includes “No new stuff, just fixes” to the “User Mode Linux” architecture in 4.12. Separately, Masami Hiramatsu posted an RFC patch entitled “Output messages to stderr and support quiet option” intended to “fix[] some boot time printf output to stderr by adding os_info() and os_warn(). The information-level messages via os_info() are suppressed when “quiet” kernel option is specified”.

Richard also postd “UBI/UBIFS updates for 4.12-rc1”, which “contains updates for both UBI and UBIFS”. It has a new CONFIG_UBIFS_FS_SECURITY option, among “minor improvements” and “random fixes”.

Thierry Reding posted “pwm: Changes for v4.12-rc1”, which amongst other things includes “a new driver for the PWM controller found on MediaTek SoCs”.

Vinod Koul posted “dmaengine updates” which includes “a smaller update consisting of support for TI DA8xx dma controller” among others.

Chris Mason posted “Btrfs” which “Has fixes and cleanups” as well as “The biggest functional fixes [being] between btrfs raid5/6 and scrub”.

Trond Myklebust posted “Please pull NFS client fixes for 4.12”, which includes various fixes, and new features (such as “Remove the v3-only data server limitation on pNFS/flexfiles”).

J. Bruce Fields posted “nfsd changes for 4.12”, which includes various RDMA updates from Chuck Lever.

Stephen Boyd posted “clk changes for v4.12”. Of the changes, the “biggest things are the TI clk driver rework to lay the groundwork for clkctrl support in the next merge window and the AmLogic audio/graphics clk support”.

Alexandre Belloni posted “RTC [Real Time Clock] for 4.12”, which uses a new GPG subkey that he also let Linus know about at the same time.

Nicholas A. Bellinger posted “target updates for v4.12-rc1”, which was “a lot more calm than previously expected. It’s primarily fixes in various areas, with most of the new functionality centering around TCMU [TCM – Linux iSCSI Target Support in Userspace] backend work with Xiubo Li has been driving”.

Zhang Rui posted “Thermal management updates for v4.12-rc1”, which includes a number of fixes, as well as some new drivers, and a new interface in “thermal devfreq_cooling code so that the driver can provide more precise data regarding actual power to the thermal governor every time the power budget is calculated”.

4.12 git pulls for new subsystems and features

David Howells posted “Hardware module parameter annotation for secure boot” in which he requested that Linus pull in new “kmod” macros (the same name is used for the userspace module tooling, but in this case refers to the in-kernel kernel module infrastructure of the same name). The new macros add annotations to “module_param” of the new form “module_param_hw” with a “hwtype” such as “ioport” or “iomem”, and so forth. These are used by the kernel to prevent those parameters from being used under a UEFI Secure Boot situation in which the kernel is “locked down” (to prevent someone from loading a signed kernel image and then compromising it to circumvent the secure boot mechanism).

Arnd Bergmann sent a special pull request to Linus Torvalds for “TEE driver infrastructure and OP-TEE drivers”, which “introduces a generic TEE [Trusted Execution Environment] framework in the kernel, to handle trusted environ[ments] (security coprocessor or software implementations such as OP-TEE/TrustZone)”. He sent the pull separately from the other arm-soc pull specifically to call it out, and to make sure everyone knew that this was finally headed upstream, but he noted it would probably be maintained through the arm-soc kernel tree. He included a lengthy defense of why now was the right time to merge TEE support into upstream Linux.

Saving TLB flushes on Intel x86 Architecture

Andy Lutomirski posted an RFC patch series entitled “x86 TLB flush cleanups, moving toward PCID support”. Modern (non-legacy) architectures implement a per-process context identifier that can be used in order to tag VMA (Virtual Memory Area) translations that end up in the TLB (Translation Lookaside Buffer) caches within the microprocessor core. The processor’s hardware (or in some mostly embedded cases, software) (page table) “walkers” will navigate the page tables for a process and populate the TLBs (except in the embedded software case, such as on certain PowerPC and MIPS processors, in which the kernel contains special assembly routines to perform this in software). On legacy architectures, the TLB is fairly simple, containing a simple virtual address to physical (or intermediate, in the case of virtualization) address. But on more sophisticated architectures, the TLB includes address space identification information that allows the TLB to distinguish between hits to the same virtual address that are from two different processes (known as tasks from within the kernel). Using additional tagging in the TLB avoids the traditional need to invalidate the entire TLB on process context switch.

Modern architectures, such as AArch64, have implemented context tagging support in their architecture code for some time, and now x86 is finally set to follow, enabling a feature that has actually been present in x86 for some time (but was not wired up), thanks to Andy’s work on PCID (Process Context IDentifier) support. In his patch series, Andy notes that as he has been “polishing [his] PCID code, a major problem [he’s] encountered is that there are too many x86 TLB flushing code paths and that they have too many inconsequential differences”. This patch series aims to “clean up the mess”. Now if x86 finally gains hardware broadcast TLB invalidations it will also be able to remove the wasted IPIs (Inter-Processor-Interrupts) that it implements to cause remote processors to invalidate TLB entries, too. Linus liked Andy’s initial work, but said he is “always a bit nervous about TLB changes like this just because any potential bugs tend to be really really hard to see and catch”. Those of us who have debugged nasty TLB issues on other architectures would be inclined to agree with him.

Ongoing Development

Laurent Dufour posted version 3 of a patch series entitled “Speculative page faults”. This is a contemporary development inspired by Peter Zijstra’s earlier work, which was based upon ideas of still others. The whole concept dates back to at least 2009 and generally involves removing the traditional locking constraints of updates to VMAs (Virtual Memory Areas) used by Linux tasks (processes) to represent the memory of running programs. Essentially, a “speculative fault” means “not holding mmap_sem” (a semaphore guarding a tasks’ current memory map). Laurent (and Peter) make VMA lookups lockless, and perform updats speculatively, using a seqlock to detect a change to the underlying VMA during the fault. “Once we’ve obtained the page and are ready to update the PTE, we validate if the state we started the fault with is still valid, if not, we’ll fail the fault with VM_FAULT_RETRY, otherwise we update the PTE and we’re done”. Earlier testing showed very significant performance upside to this work due to the reduced lock contention.

Aaron Lu posted “smp: do not send IPI if call_single_queue not empty”. The Linux kernel (and most others) uses a construct known as an IPI – or Inter-Processor-Interrupt – a form of software generated interrupt that a processor will send to one or more others when it needs them to perform some housekeeping work on the kernel’s behalf. Usually, this is to handle such things as a TLB shootdown (invalidating a virtual address translation in a remote processor due to a virtual address space being removed), especially on less sophisticated legacy architectures that do not feature invalidation of TLBs through hardware broadcast, though there are many other uses for IPIs. Aaron’s patch realizes, effectively, that if a remote processor is already going to process a queue of CSD (call_single_data) function calls it has been asked to via IPI then there is no need to send another IPI and generate additional interrupts – the queue will be drained of this entry as well as existing entries by the IPI management code.

Romain Perier posted version 8 of “Replace PCI pool by DMA pool API” which realizes that the current PCI pool API uses “simple macro functions direct expanded to the appropriate dma pool functions”, so it simply replaces them with a direct use of the corresponding DMA pool API instead.

Sandhya Bankar posted “vfs: Convert file allocation code to use the IDR”. This replaces existing filesystem code that allocates file descriptors using a
custom allocator with Matthew (Willy) Wilcox’s idr (ID Radix) tree allocator.

Serge E. Hallyn posted a resend of version 2 of a patch series entitled “Introduce v3 namespaced file capabilities”. We covered this last time.

Heinrich Schuchardt posted “arm64: Always provide “model name” in /proc/cpuinfo”, which was quickly shot down (for the moment).

Christian König posted verision 5 of his “Resizeable PCI BAR support” patch series. We have featured this in a previous episode of the podcast.

Prakash Sangappa posted “hugetlbfs ‘noautofill’ mount option” which aims to allow (optionally) for hugetlbfs pseudo-filesystems to be mounted with an option which will not automatically populate holes in files with zeros during a page fault when the file is accessed though the mapped address. This is intended to benefit applications such as Oracle databases, which make heavy use of such mechanisms but don’t take kindly to the kernel having side effects that change on-disk files even if only zero fill. Dave Hansen pushed back against this change saying that it was “further specializing hugetlbfs” and that Oracle should be using userfaultfd or “an madvise() option that disallows backing allocations”. Prakash replied that they had considered those but with a database there are such a large number of single threaded processes that “The concern with using userfaultfs is the overhead of setup and having an additional thread per process”.

Sameer Goel posted “arm64: Add translation functions for /dev/mem read/write” which “Port architecture specific xlate [translate] and unxlate [untranslate] functions for /dev/mem read/write. This sets up the mapping for a valid physical address if a kernel direct mapping is not alread present”. Depending upon the ARM platform, access to a bad address in /dev/mem could result in a synchronous exception in the core, or a System Error (SError) generated by a system memory controller interface. In either case, it is handled as a fatal error where the same is not true on x86. While access to /dev/mem is restricted, increasingly being deprecated, and has other semantics to prevent its used on 64-bit ARM systems, it still exists and is used. In this case, to read the ACPI FPDT table which provides performance pointer records. Nevertheless, both Will Deacon and Leif Lindholm objected to the reasoning given here, saying that the kernel should instead be taught how to parse this table and expose its information via /sys rather than having userspace tools go poking in /dev/mem to try to read from the table directly.

Minchan Kim posted “vmscan: scan pages until it f[inds] eligible pages” in which he notes that “There are premature OOM [Out Of Memory killer invocations] happening. Although there are ton of free swap and anonymous LRU list of eligible zones, OOM happened. With investigation, skipping page of isolate_lru_pages makes reclaim void because it returns zero nr_taken easily so LRU shrinking is effectively nothing and just increases priority aggressively. Finally, OOM happens”.

Julius Werner posted version 3 of his “Memconsole changes for new coreboot format” which teaches the Google firmware driver for their memconsole to deal with the newer type of persistent ring buffer console they introduced.

Olliver Schinagl and Jamie Iles had a back and forth about the latter’s work on “glue-code” (generic handling code) for the DW (DesignWare) 8250 (a type of serial port interface made popular by PC) IP block as used in many different designs. Depending upon how the block is configured, it can behave differently, and there was some discussion about how to handle that. In particular the location of the UART_USR register.

Xiao Guangrong posted “KVM: MMU: fast write protect” which “introduces a[n] extremely fast way to write protec all the guest memory. Comparing with the ordinary algorthim which write protects last level sptes [the page table entries used by the guest] based on the rmap [the “reverse” map, the means that Linux uses to encode page table information within the kernel] one by one, it just simply updates the generation number to ask all vCPUs to reload its root page table, particularly it can be done out of mmu-lock”. The idea was apparently originally proposed by Avi (Kivity). Paolo Bonzini thought “This is clever” and wondered “how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches”. Xiao thought it could be used to speed up those patches after merging, too [Paolo noted that he aims to merge these early in 4.13 development].

Bogdan Mirea posted version 2 of”Add “Preserve Boot Time Support””, which follows up on a previous discussion about retaining “Boot Time Preservation between Bootloader and Linux Kernel. It is based on the idea that the Bootloader (or any other early firmware) will start the HW Timer and Linux Kernel will count the time starting with the cycles elapsed since timer start”. By “Bootloader” he means “firmware” to those who live in x86-land.

Igor Stoppa posted “post-init-read-only protection for data allocated dynamically” which aims to provide a mechanism for dynamically allocated data which is similar to the “__read_only” special linker section that certain annotated (using special GCC directives) code will be placed into. That works great for read-only data (which is protected by the MMU locking down the corresponding region early in boot). His “wish” is to start with the “policy DB of SE Linux and the LSM Hooks, but eventually I would like to extend the protection also to other subsystems, in a way that can merged into mainline.” His patch includes an analysis of how he feels he can be as “little invasive as possible”, noting that “In most, if not all, the cases that could be enhanced, the code will be calling kmalloc/vmalloc, including GFP_KERNEL [Get Free Pages of Kernel Type Memory] as the desired type of memory”. Consequently, he says, “I suspect/hope that the various maintainer[s] won’t object too much if my changes are limited to replacing GFP_KERNEL with some other macro, for example what I previously called GFP_LOCKABLE”. Michal Hocko had some feedback, largely along the lines of a “master toggle” (tha would allow protection to be disabled for small periods in order to make changes to “read only” data) was largely pointless – due to it re-exposing the data. Instead, he wanted to see the protection being done at the kmem_cache_create time by adding a “SLAB_SEAL” parameter that would later be enabled on a per kmem_cache basis using “kmem_cache_seal(cache)” or a similar mechanism.

Bharat Bhushan posted “ARM64/PCI: Allow userspace to mmap PCI resources”, which Lorenzo Pieralisi noted was already implemented by another patch.

A lengthy, and “spirited” discussion took place between Timur Tabi and the various maintainers of the 64-bit ARM Architecture and SoC platform trees over the desire for the maintainers to have changes to “defconfigs” for the architecture go through a special “arm@kernel.org” alias. Except that after they had told Timur to use that, they objected to him posting a patch informing others of this alias in the kernel documentation. Instead, as Timur put it “without a MAINTAINERS entry, how would anyone know to CC: that address? I posted 3 versions of my defconfig patchset before someone told me that I had to send it to arm@kernel.org.” The discussion thread is entitled “MAINTAINERS: add arm@kernel.org as the list for arm64 defconfig changes”.

Xunlei Pang posted version 3 of his “x86/mm/ident_map: Add PUD level 1GB page support” which helps “kernel_ident_mapping_init” to create a single and very large identitiy page mapping in order to reduce TLB (Translation Lookaside Buffer – the caches that store virtual to physical memory lookups performed by hardware) pressure on an architecture that is currently using many 2MB (PMD – Page Middle Directory) level pages for this process.

Anju T Sudhakar posted version 8 of “IMC Instrumentation Support”, which provides support for POWER9’s “In-Memory-Collection” or IMC infrastructure, which “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level.”

Greg K-H (Kroah-Hartman) posted an RFC patch entitled “add more new kernel pointer filter options” which “implemnt[s] some new restrictions when printing out kernel pointers, as well as the ability to whitelist kernel pointers where needed.”

Kees Cook posted “x86/refcount: Implement fast refcount overflow protection”, which seeks to upstream a “modified version of the x86 PAX_REFCOUNT defense from PaX/grsecurity. This speeds up the refcount_t API by duplicating the existing atomic_t implementation with a single instruction added to detect if the refcount has wrapped past INT_MAC (or below 0) resuling in a negative value, where the handler then restores the refcount_t to INT_MAX”.

David Howlls posted an RFC patch entitled “VFS: Introduce superblock configuration context” which is a “set of patches to create a superblock configuration contenxt prior to setting up a new mount, populating it with the parsed options/binary data, creating the superblock and then effecting the mount. This allows namespaces and other information to be conveyed through the mount procedure. It also allows extra error information”.

The Google Chromebook team let folks know that they were (rarely, like one in a million) seeing “Threads stuck in zap_pid_ns_processes()”. Guenter Roeck noted that the “Problem is that if the main task [which has children that are being ptraced] doesn’t exit, it [the child] hangs forever. Chrome OS (where we see the problem in the field, and the application is chrome) is configured to reboot on hung tasks – if a task is hung for 120 seconds on those systems, it tends to be in a bad shape. This makes it a quite severe problem for us”. He asked “Are there other conditions besides ptrace where a task isn’t reaped?”. Reaping refers to the behavior in which tasks, when they exit are reparented to the init task, which “reaps” them (cleans up and makes sure the state that exit with is seen), except under ptrace in this case where the parent task spawning the children “was outside of the pid namespace and was choosing not to reap the child”. Various proposals as to how to deal with this in the namespace code were discussed.

Mahesh Bandewar posted “kmod: don’t load module unless req process has CAP_SYS_MODULE” which notes that “A process inside random user-ns [a user namespace] should not load a module, which is currently possible”. He shows how a user namespace can be created that causes the kernel to load a module upon access to a file node indirectly. This could be a security risk if this approach were used to cause a host kernel to load a vulnerable but otherwise not loaded kernel driver through the privileged permissions in the namespace.

Marc Zyngier posted “irqdomain: Improve irq_domain_mapping facility” in which he “Update[s] IRQ-domain.txt to document irq_domain_mapping” among otherwise seeking to make it easier to access and understand this kernel feature.

Jens Axboe accepted a patch from Ulf Hansson adding Paolo Valente as a MAINTAINER of the BFQ I/O scheduler.

Cyrille Pitchen updated the git repos for the SPI NOR subsystem, which is “now hosted on MTD repos, spi-nor/next is on l2-mtd and spi-nor/fixes will be on linux-mtd”.

Alexandre Courbot posted “MAINTAINERS: remove self from GPIO maintainers”.

The folks at Codeaurora posted a lengthy analysis of the Linux kernel scheduler and specific problems with load_balance that will be covered next time around, along with work by Peter Zijlstra on the “cgroup/PELT overhaul (again).

Finally, Paul McKenney previously posted “Make SRCU be once again optional”, after having noted that the need to build it in by default (caused by other recent changes in header files) increased the kernel by 2K. Nico(las) Pitre was happy to hear this, saying “If every maintainer finds a way to (optionally) reduce the size of the code they maintain by 2K then we’ll get a much smaller kernel pretty soon”.

May 15, 2017 04:07 AM

May 13, 2017

Linux Plumbers Conference: Today is the very last day for Plumbers refereed track submissions

The submission site

https://linuxplumbersconf.org/2017/ocw/events/LPC2017TALKS/proposals

Will close at midnight pacific tonight

 

May 13, 2017 03:48 PM

May 09, 2017

Matthew Garrett: Intel AMT on wireless networks

More details about Intel's AMT vulnerablity have been released - it's about the worst case scenario, in that it's a total authentication bypass that appears to exist independent of whether the AMT is being used in Small Business or Enterprise modes (more background in my previous post here). One thing I claimed was that even though this was pretty bad it probably wasn't super bad, since Shodan indicated that there were only a small number of thousand machines on the public internet and accessible via AMT. Most deployments were probably behind corporate firewalls, which meant that it was plausibly a vector for spreading within a company but probably wasn't a likely initial vector.

I've since done some more playing and come to the conclusion that it's rather worse than that. AMT actually supports being accessed over wireless networks. Enabling this is a separate option - if you simply provision AMT it won't be accessible over wireless by default, you need to perform additional configuration (although this is as simple as logging into the web UI and turning on the option). Once enabled, there are two cases:

  1. The system is not running an operating system, or the operating system has not taken control of the wireless hardware. In this case AMT will attempt to join any network that it's been explicitly told about. Note that in default configuration, joining a wireless network from the OS is not sufficient for AMT to know about it - there needs to be explicit synchronisation of the network credentials to AMT. Intel provide a wireless manager that does this, but the stock behaviour in Windows (even after you've installed the AMT support drivers) is not to do this.
  2. The system is running an operating system that has taken control of the wireless hardware. In this state, AMT is no longer able to drive the wireless hardware directly and counts on OS support to pass packets on. Under Linux, Intel's wireless drivers do not appear to implement this feature. Under Windows, they do. This does not require any application level support, and uninstalling LMS will not disable this functionality. This also appears to happen at the driver level, which means it bypasses the Windows firewall.
Case 2 is the scary one. If you have a laptop that supports AMT, and if AMT has been provisioned, and if AMT has had wireless support turned on, and if you're running Windows, then connecting your laptop to a public wireless network means that AMT is accessible to anyone else on that network[1]. If it hasn't received a firmware update, they'll be able to do so without needing any valid credentials.

If you're a corporate IT department, and if you have AMT enabled over wifi, turn it off. Now.

[1] Assuming that the network doesn't block client to client traffic, of course

comment count unavailable comments

May 09, 2017 08:18 PM

Matthew Garrett: Intel's remote AMT vulnerablity

Intel just announced a vulnerability in their Active Management Technology stack. Here's what we know so far.

Background

Intel chipsets for some years have included a Management Engine, a small microprocessor that runs independently of the main CPU and operating system. Various pieces of software run on the ME, ranging from code to handle media DRM to an implementation of a TPM. AMT is another piece of software running on the ME, albeit one that takes advantage of a wide range of ME features.

Active Management Technology

AMT is intended to provide IT departments with a means to manage client systems. When AMT is enabled, any packets sent to the machine's wired network port on port 16992 or 16993 will be redirected to the ME and passed on to AMT - the OS never sees these packets. AMT provides a web UI that allows you to do things like reboot a machine, provide remote install media or even (if the OS is configured appropriately) get a remote console. Access to AMT requires a password - the implication of this vulnerability is that that password can be bypassed.

Remote management

AMT has two types of remote console: emulated serial and full graphical. The emulated serial console requires only that the operating system run a console on that serial port, while the graphical environment requires drivers on the OS side requires that the OS set a compatible video mode but is also otherwise OS-independent[2]. However, an attacker who enables emulated serial support may be able to use that to configure grub to enable serial console. Remote graphical console seems to be problematic under Linux but some people claim to have it working, so an attacker would be able to interact with your graphical console as if you were physically present. Yes, this is terrifying.

Remote media

AMT supports providing an ISO remotely. In older versions of AMT (before 11.0) this was in the form of an emulated IDE controller. In 11.0 and later, this takes the form of an emulated USB device. The nice thing about the latter is that any image provided that way will probably be automounted if there's a logged in user, which probably means it's possible to use a malformed filesystem to get arbitrary code execution in the kernel. Fun!

The other part of the remote media is that systems will happily boot off it. An attacker can reboot a system into their own OS and examine drive contents at their leisure. This doesn't let them bypass disk encryption in a straightforward way[1], so you should probably enable that.

How bad is this

That depends. Unless you've explicitly enabled AMT at any point, you're probably fine. The drivers that allow local users to provision the system would require administrative rights to install, so as long as you don't have them installed then the only local users who can do anything are the ones who are admins anyway. If you do have it enabled, though…

How do I know if I have it enabled?

Yeah this is way more annoying than it should be. First of all, does your system even support AMT? AMT requires a few things:

1) A supported CPU
2) A supported chipset
3) Supported network hardware
4) The ME firmware to contain the AMT firmware

Merely having a "vPRO" CPU and chipset isn't sufficient - your system vendor also needs to have licensed the AMT code. Under Linux, if lspci doesn't show a communication controller with "MEI" or "HECI" in the description, AMT isn't running and you're safe. If it does show an MEI controller, that still doesn't mean you're vulnerable - AMT may still not be provisioned. If you reboot you should see a brief firmware splash mentioning the ME. Hitting ctrl+p at this point should get you into a menu which should let you disable AMT.

How about over Wifi?

Turning on AMT doesn't automatically turn it on for wifi. AMT will also only connect itself to networks it's been explicitly told about. Where things get more confusing is that once the OS is running, responsibility for wifi is switched from the ME to the OS and it forwards packets to AMT. I haven't been able to find good documentation on whether having AMT enabled for wifi results in the OS forwarding packets to AMT on all wifi networks or only ones that are explicitly configured.

What do we not know?

We have zero information about the vulnerability, other than that it allows unauthenticated access to AMT. One big thing that's not clear at the moment is whether this affects all AMT setups, setups that are in Small Business Mode, or setups that are in Enterprise Mode. If the latter, the impact on individual end-users will be basically zero - Enterprise Mode involves a bunch of effort to configure and nobody's doing that for their home systems. If it affects all systems, or just systems in Small Business Mode, things are likely to be worse.
We now know that the vulnerability exists in all configurations.

What should I do?

Make sure AMT is disabled. If it's your own computer, you should then have nothing else to worry about. If you're a Windows admin with untrusted users, you should also disable or uninstall LMS by following these instructions.

Does this mean every Intel system built since 2008 can be taken over by hackers?

No. Most Intel systems don't ship with AMT. Most Intel systems with AMT don't have it turned on.

Does this allow persistent compromise of the system?

Not in any novel way. An attacker could disable Secure Boot and install a backdoored bootloader, just as they could with physical access.

But isn't the ME a giant backdoor with arbitrary access to RAM?

Yes, but there's no indication that this vulnerability allows execution of arbitrary code on the ME - it looks like it's just (ha ha) an authentication bypass for AMT.

Is this a big deal anyway?

Yes. Fixing this requires a system firmware update in order to provide new ME firmware (including an updated copy of the AMT code). Many of the affected machines are no longer receiving firmware updates from their manufacturers, and so will probably never get a fix. Anyone who ever enables AMT on one of these devices will be vulnerable. That's ignoring the fact that firmware updates are rarely flagged as security critical (they don't generally come via Windows update), so even when updates are made available, users probably won't know about them or install them.

Avoiding this kind of thing in future

Users ought to have full control over what's running on their systems, including the ME. If a vendor is no longer providing updates then it should at least be possible for a sufficiently desperate user to pay someone else to do a firmware build with the appropriate fixes. Leaving firmware updates at the whims of hardware manufacturers who will only support systems for a fraction of their useful lifespan is inevitably going to end badly.

How certain are you about any of this?

Not hugely - the quality of public documentation on AMT isn't wonderful, and while I've spent some time playing with it (and related technologies) I'm not an expert. If anything above seems inaccurate, let me know and I'll fix it.

[1] Eh well. They could reboot into their own OS, modify your initramfs (because that's not signed even if you're using UEFI Secure Boot) such that it writes a copy of your disk passphrase to /boot before unlocking it, wait for you to type in your passphrase, reboot again and gain access. Sealing the encryption key to the TPM would avoid this.

[2] Updated after this comment - I thought I'd fixed this before publishing but left that claim in by accident.

(Updated to add the section on wifi)

(Updated to typo replace LSM with LMS)

(Updated to indicate that the vulnerability affects all configurations)

comment count unavailable comments

May 09, 2017 08:04 PM

May 07, 2017

Linux Plumbers Conference: Submission deadline for Linux Plumbers Conference refereed track proposals extended by a week

The deadline for submitting refereed track proposals for the 2017 Linux Plumbers Conference has been extended until 13 May at 11:59PM Pacific Time.  The refereed track will have 50-minute presentations on a specific aspect of Linux “plumbing” (e.g. core libraries, media creation/playback, display managers, init systems, kernel APIs/ABIs, etc.) that are chosen by the LPC committee to be given during all three days of the conference.

May 07, 2017 09:26 PM

May 04, 2017

Linux Plumbers Conference: Android/Mobile Microconference Accepted into Linux Plumbers Conference

Android continues to find interesting new applications and problems to
solve, both within and outside the mobile arena.  Mainlining continues
to be an area of focus, as do a number of areas of core Android
functionality, including the kernel.  Other areas where there is ongoing
work include eBPF, Lowmemory alternatives, the Android emulator, and
SDCardFS.

For the latest details, please see this microconference’s wiki page

http://wiki.linuxplumbersconf.org/2017:android_mobile

We hope to see you there!

May 04, 2017 07:25 PM

Dave Airlie: how much of the conformance test suite does radv pass now?

Test run totals:
Passed: 109293/150992 (72.4%)
Failed: 0/150992 (0.0%)
Not supported: 41697/150992 (27.6%)
Warnings: 2/150992 (0.0%)

This is effectively a pass. The Not Supported stuff isn't missing features as uneducated people are quick to spout, it's more stuff the hardware doesn't support or is pointless to expose on the hardware. (lots of image formats).

This is the results from the Vulkan CTS 1.0.2 branch, against mesa master with one patch (a workaround for some InternalErrors that CTS throws up).

Do not call the driver conformant as that is against the Khronos rules as we haven't paid or filed for approval, but the driver does now effectively pass the latest conformance test suite. I'll update on things if that changes.

Thanks again to everyone involved.

May 04, 2017 04:22 AM

May 03, 2017

Pete Zaitcev: AWS price reductions

Amazon announced another price cut today and anaylists are all about "there's more margin in VMs", "Microsoft Azure finds it tough to compete", and so on.

Being in storage I'm obviously biased, but I think the price cuts in S3 and Glacier are more consequential. All that data is like savings account for Amazon. Any time they need cash, they can just increase prices right back, and what are you going to do? Data is not VMs, you cannot just download it all overnight, you cannot migrate off the service.

Obviously things are not that simple, and they are trying to bind code to AWS in ways that yield similar inertia to the data. As Adrian Cockroft tweeted, "OpenStack not closing the gap". Cut him a slack though, he was at NetFlix, and it was 4 years ago. One way or the other, data's inertia is natural and code's inertia is artificial — it's something engineers can fix.

May 03, 2017 08:05 PM

Michael Kerrisk (manpages): man-pages-4.11 is released

I've released man-pages-4.11. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from over 30 contributors. It includes more than 300 commits changing over 100 pages. The changes include the addition of 5 pages, significant rewriting of 1 other page, and enhancements to many other pages.

Among the more significant changes in man-pages-4.11 are the following:

May 03, 2017 07:59 PM

Michael Kerrisk (manpages): man-pages-4.09 is released

I've released man-pages-4.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 44 contributors. This is one of the more substantial releases in recent times, with more than 500 commits changing around 190 pages. The changes include the addition of eight new pages and significant enhancements or rewrites to many existing pages.

Among the more significant changes in man-pages-4.09 are the following:

In addition to the above, substantial changes were also made to the close(2), getpriority(2), nice(2), timer_create(2), timerfd_create(2), random(4), and proc(5) pages.

May 03, 2017 07:42 PM

Michael Kerrisk (manpages): man-pages-4.10 is released

I've released man-pages-4.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from over 40 contributors. This release sees a large number of changes: over 600 commits changing around 160 pages. The changes include the addition of 11 pages, significant rewrites of 3 other pages, and enhancements to many other pages.

Among the more significant changes in man-pages-4.10 are the following:

May 03, 2017 07:41 PM

May 02, 2017

Kees Cook: security things in Linux v4.11

Previously: v4.10.

Here’s a quick summary of some of the interesting security things in this week’s v4.11 release of the Linux kernel:

refcount_t infrastructure

Building on the efforts of Elena Reshetova, Hans Liljestrand, and David Windsor to port PaX’s PAX_REFCOUNT protection, Peter Zijlstra implemented a new kernel API for reference counting with the addition of the refcount_t type. Until now, all reference counters were implemented in the kernel using the atomic_t type, but it has a wide and general-purpose API that offers no reasonable way to provide protection against reference counter overflow vulnerabilities. With a dedicated type, a specialized API can be designed so that reference counting can be sanity-checked and provide a way to block overflows. With 2016 alone seeing at least a couple public exploitable reference counting vulnerabilities (e.g. CVE-2016-0728, CVE-2016-4558), this is going to be a welcome addition to the kernel. The arduous task of converting all the atomic_t reference counters to refcount_t will continue for a while to come.

CONFIG_DEBUG_RODATA renamed to CONFIG_STRICT_KERNEL_RWX

Laura Abbott landed changes to rename the kernel memory protection feature. The protection hadn’t been “debug” for over a decade, and it covers all kernel memory sections, not just “rodata”. Getting it consolidated under the top-level arch Kconfig file also brings some sanity to what was a per-architecture config, and signals that this is a fundamental kernel protection needed to be enabled on all architectures.

read-only usermodehelper

A common way attackers use to escape confinement is by rewriting the user-mode helper sysctls (e.g. /proc/sys/kernel/modprobe) to run something of their choosing in the init namespace. To reduce attack surface within the kernel, Greg KH introduced CONFIG_STATIC_USERMODEHELPER, which switches all user-mode helper binaries to a single read-only path (which defaults to /sbin/usermode-helper). Userspace will need to support this with a new helper tool that can demultiplex the kernel request to a set of known binaries.

seccomp coredumps

Mike Frysinger noticed that it wasn’t possible to get coredumps out of processes killed by seccomp, which could make debugging frustrating, especially for automated crash dump analysis tools. In keeping with the existing documentation for SIGSYS, which says a coredump should be generated, he added support to dump core on seccomp SECCOMP_RET_KILL results.

structleak plugin

Ported from PaX, I landed the structleak plugin which enforces that any structure containing a __user annotation is fully initialized to 0 so that stack content exposures of these kinds of structures are entirely eliminated from the kernel. This was originally designed to stop a specific vulnerability, and will now continue to block similar exposures.

ASLR entropy sysctl on MIPS
Matt Redfearn implemented the ASLR entropy sysctl for MIPS, letting userspace choose to crank up the entropy used for memory layouts.

NX brk on powerpc

Denys Vlasenko fixed a long standing bug where the kernel made assumptions about ELF memory layouts and defaulted the the brk section on powerpc to be executable. Now it’s not, and that’ll keep process heap from being abused.

That’s it for now; please let me know if I missed anything. The v4.12 merge window is open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

May 02, 2017 09:17 PM

May 01, 2017

Matthew Garrett: Looking at the Netgear Arlo home IP camera

Another in the series of looking at the security of IoT type objects. This time I've gone for the Arlo network connected cameras produced by Netgear, specifically the stock Arlo base system with a single camera. The base station is based on a Broadcom 5358 SoC with an 802.11n radio, along with a single Broadcom gigabit ethernet interface. Other than it only having a single ethernet port, this looks pretty much like a standard Netgear router. There's a convenient unpopulated header on the board that turns out to be a serial console, so getting a shell is only a few minutes work.

Normal setup is straight forward. You plug the base station into a router, wait for all the lights to come on and then you visit arlo.netgear.com and follow the setup instructions - by this point the base station has connected to Netgear's cloud service and you're just associating it to your account. Security here is straightforward: you need to be coming from the same IP address as the Arlo. For most home users with NAT this works fine. I sat frustrated as it repeatedly failed to find any devices, before finally moving everything behind a backup router (my main network isn't NATted) for initial setup. Once you and the Arlo are on the same IP address, the site shows you the base station's serial number for confirmation and then you attach it to your account. Next step is adding cameras. Each base station is broadcasting an 802.11 network on the 2.4GHz spectrum. You connect a camera by pressing the sync button on the base station and then the sync button on the camera. The camera associates with the base station via WPS and now you're up and running.

This is the point where I get bored and stop following instructions, but if you're using a desktop browser (rather than using the mobile app) you appear to need Flash in order to actually see any of the camera footage. Bleah.

But back to the device itself. The first thing I traced was the initial device association. What I found was that once the device is associated with an account, it can't be attached to another account. This is good - I can't simply request that devices be rebound to my account from someone else's. Further, while the serial number is displayed to the user to disambiguate between devices, it doesn't seem to be what's used internally. Tracing the logon traffic from the base station shows it sending a long random device ID along with an authentication token. If you perform a factory reset, these values are regenerated. The device to account mapping seems to be based on this random device ID, which means that once the device is reset and bound to another account there's no way for the initial account owner to regain access (other than resetting it again and binding it back to their account). This is far better than many devices I've looked at.

Performing a factory reset also changes the WPA PSK for the camera network. Newsky Security discovered that doing so originally reset it to 12345678, which is, uh, suboptimal? That's been fixed in newer firmware, along with their discovery that the original random password choice was not terribly random.

All communication from the base station to the cloud seems to be over SSL, and everything validates certificates properly. This also seems to be true for client communication with the cloud service - camera footage is streamed back over port 443 as well.

Most of the functionality of the base station is provided by two daemons, xagent and vzdaemon. xagent appears to be responsible for registering the device with the cloud service, while vzdaemon handles the camera side of things (including motion detection). All of this is running as root, so in the event of any kind of vulnerability the entire platform is owned. For such a single purpose device this isn't really a big deal (the only sensitive data it has is the camera feed - if someone has access to that then root doesn't really buy them anything else). They're statically linked and stripped so I couldn't be bothered spending any significant amount of time digging into them. In any case, they don't expose any remotely accessible ports and only connect to services with verified SSL certificates. They're probably not a big risk.

Other than the dependence on Flash, there's nothing immediately concerning here. What is a little worrying is a family of daemons running on the device and listening to various high numbered UDP ports. These appear to be provided by Broadcom and a standard part of all their router platforms - they're intended for handling various bits of wireless authentication. It's not clear why they're listening on 0.0.0.0 rather than 127.0.0.1, and it's not obvious whether they're vulnerable (they mostly appear to receive packets from the driver itself, process them and then stick packets back into the kernel so who knows what's actually going on), but since you can't set one of these devices up in the first place without it being behind a NAT gateway it's unlikely to be of real concern to most users. On the other hand, the same daemons seem to be present on several Broadcom-based router platforms where they may end up being visible to the outside world. That's probably investigation for another day, though.

Overall: pretty solid, frustrating to set up if your network doesn't match their expectations, wouldn't have grave concerns over having it on an appropriately firewalled network.

(Edited to replace a mistaken reference to WDS with WPS)

comment count unavailable comments

May 01, 2017 06:17 PM

April 27, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/27

Audiohttp://traffic.libsyn.com/jcm/20170427.mp3

In this week’s edition: Linux 4.11-rc8, updating kernel.org cross compilers, Intel 5-level paging, v3 namespaced file capabilities, and ongoing development.

Editorial Notes

Apologies for the delay to this week’s podcast. I got flu around the time I was preparing last week’s podcast, limped along to the weekend, and then had to stay in bed for a long time. On the other hand, it let me play with a bunch of new SDRs [HackRF, RTL-SDR, and friends, for the curious) on Sunday when I skipped the 5K I was supposed to run 🙂

I would also like to note my thanks for the first 10,000 downloads of the new series of this podcast. It’s a work in progress. I am going to make (positive!) changes over the coming months, including a web interface that will track all LKML posts and allow for community-directed collaboration on creating this (and hopefully other) podcasts. I will include automatic patch tracking (showing when patches have landed in upstream trees, and so on), info on post authors, and allow you to edit personal bios, links, etc. And employer info. After some discussions around the best way to handle author employer attribution (to make sure everyone is treated fairly), I’ve decide to take a little time away from including employer names until I have a populated database of mappings. Jon Corbet from LWN has something similar already, which I believe is also stored in git, but there’s more to be done here (thanks to Alex and others for the G+ feedback and discussion on this).

Linux 4.11-rc8

Linus Torvalds announced Linux 4.11-rc8, saying “So originally I was just planning on releasing the final 4.11 today, but while we didn’t have a *lot* of changes the last week, we had a couple of really annoying ones, so I’m doing another rc release instead”. As he also notes, “The most noticeable of the issues is that we’ve quirked off some NVMe power management that apparently causes problems on some machines. It’s not entirely clear what caused the issue (it wasn’t just limited to some NVMe hardware, but also particular platforms), but let’s test it”.

With the release of Linux 4.11-rc8 comes that impending moment of both elation and dread that is a final kernel. It’ll be great to see 4.11 out there. It’s an awesome kernel, with lots of new features, and it will be well summarized in kernelnewbies and elsewhere. But upon its release comes the opening of the merge window for 4.12. Tracking that was exciting for 4.11. Hopefully it doesn’t finish me off trying to do that for 4.12 😉

Geert Utterhoeven posted “Build regressions/improvements in v4.11-rc8”, in which he noted that (compared with v.4.10), an addition build error and several hundred more warnings were recently added to the kernel. The error he points to is in the AVR32 architecture when applying a relocation in the linker, probably due to an unsupported offset.

Announcements

Greg K-H (Kroah-Hartman) announced Linux 4.4.64, 4.9.25, and 4.10.13

Junio C Hamano announced Git v2.13.0-rc1

Alex Williams posted “Generic DMA-capable streaming device driver looking for home” in which he describes some generic features of his device (the ability to “carry generic data to/from userspace”) and inquired as to where it should live in the kernel. It could do with some followup.

Updating kernel.org cross compilers

Andre Przywara inquired as to the state of the kernel.org cross compilers. This was a project, initiated by Tony Breeds and located on kernel.org, to maintain current Intel x86 Architecture builds of cross compiler toolchains for various architecture targets (a cross compiler is one that runs on one architecture, targeting another, which is incidentally different from a “Canadian cross” compiler – look it up if you’re ever bored or want to bootstrap compilers for fun). It was a great project, but like so many others one day (three years ago) there were no more updates. That is something Andre would like to see changed. He posted, noting that many people still use the compilers on kernel.org (including yours truly, in a pinch) and that “The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile”.

Andre used build scripts from Segher Bossenkool to build binutils (the GNU assembler) 2.28 and GCC (the GNU Compiler Collection) 6.3.0. With some tweaks, he was able to build for “all architectures except arc, m68k, tilegx and tilepro”. He wondered “what the process is to get these [the compilers linked from the kernel website] updated?”. It seems like he is keen to clean this up, which is to be commended and encouraged. And hopefully (since he works for ARM) that will eventually also include cross compiler targets for x86 that run on ARMv8 server systems.

Intel 5-level paging

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he provides an “updated version the fourth and the last bunch of [] patches that brings initial 5-level paging enabling.” This is in support of Intel’s “la57” feature of future microprocessors that allows them to exceed the traditional 48-bit “Canonical Addressing” in order to address up to 56-bits of Virtual Address space (a big benefit to those who want to map large non-volatile storage devices and accelerators into virtual memory). His latest patch series includes a fix for a “KASLR [Kernel Address Space Layout Randomization”] bug due to rewriting [] startup_64() in C”.

Separately, John Paul Adrian Glaubitz inquired about Kirill’s patch series, saying, “I recently read the LWN article on your and your colleagues work to add five-level page table support for x86 to the Linux kernel. Since this extends the address space beyond 48-bits, as you know, it will cause potential headaches with Javascript engines which use tagged pointers. On SPARC, the virtual address space already extends to 52 bits and we are running into these very issues with Javascript engines on SPARC”.

He goes on to discuss passing the “hint” parameter to mmap() “in order to tell the kernel not to allocate memory beyond the 48 bits address space. Unfortunately, on Linux this will only work when the area pointed to by “hint” is unallocated which means one cannot simply use a hardcoded “hint” to mitigate this problem”. What he means here is that the mmap call to map a virtual memory area into a userspace process allows an application to specify where it would like that mapping to occur, but Linux isn’t required to respect this. Contemporary Linux implements “MAP_FIXED” as an option to mmap, which will either map a region where requested or explicitly fail (as Andy Lutomirski pointed out). This is different from a legacy behavior where Linux used to take a hint and might just not respect placement (as Andi Kleen alluded to in followup).

This whole discussion is actually the reason that Kirill had (thoughtfully) already included a feature bit setting in his patches that allows an application to effectively override the existing kernel logic and always allocate below 48 bits (preserving as close to existing behavior as possible on a per application basis while allowing a larger VA elsewhere). The thread resulted in this being pointed out, but it’s a timely reminder of the problems faced as the pressure continues upon architectures to grow their VA (Virtual Address) space size.

Often, efforts at growing virtual memory address spaces run up against uses of the higher order bits that were never sanctioned but are in widespread use. Many people strongly dislike pointer tagging of this kind (your author included), but it is not going away. It is great that Kirill’s patches have a form of solution that can be used for the time being by applications that want to retain a smaller address space, but that’s framed in the context of legacy support, not to enable runtimes to continue to use high order bits forevermore.

Introduce v3 namespaced file capabilities

Serge E. Hallyn posted “Introduce v3 namespaced file capabilities”. Linux includes a comprehensive capability mechanism that allows applications to limit what privileged operations may be performed by them. In the “good old days” when Unix hacker beards were more likely than today’s scruffy look, root was root and nobody really cared about remote compromise because they were still fighting having to have login passwords at all. But in today’s wonderful world of awesome, in which anything not bolted down is often not long for this world, “root” can mean very little. The traditionally privileged users can be extremely restricted by security policy frameworks, such as SELinux, but even more fundamentally can be subject to restrictions imposed by the growth in use of “capabilities”.

A classic example of a capability is CAP_NET_RAW, which the “ping” utility needs in order to create a raw socket. Traditionally, such utilities were created on Unix and Linux filesystems as “setuid root”, which means that they had the “s” bit set in their permissions to “run as root” when they were executed by regular users. This allowed the utility to operate, but it also allowed any user who could trick the utility into providing a shell conveniently gain a root login. Many security exploits over the years later and we have filesystem capabilities which allow binaries to exist on disk, tagged with just those extra capabilities they require to get the job done, through the filesystem “xattr” extended attributes. “ping” has CAP_NET_RAW, so it can create raw sockets, but it doesn’t need to run as root, so it isn’t market as “setuid root” on modern distros.

Fast forward still further into the modern era of containers and namespaces, and things get more complex. As Serge notes in his patch, “Root in a non-initial user ns [namespace] cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host”. However, as he also notes, “supporting file capabilities in a user namespace is very desirable. Not doing so means that and programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must known how to drop partial capabilities [which is a mess to get right], and do so only if setuid-root”. This is, of course, far from desirable.

In the patch series, Serge “builds a vfs_ns_cap_data struct by appending a uid_t [user ID] rootid to struct vfs_cap_data. This is the absolute uid_d (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns [the global default]) of the root id in whosr namespace the file capabilities may take effect”. He then rewrites xattrs within the namespace for unprivileged “root” users with the appropriate notion of capabilities for that environment (in a “v3” xattr that is transparently converted to/from the conventional “v2” security.capability xattr), in accordance with capabilities that have been granted to the namespace from outside by a CAP_SETFCAP. This allows capability use without undermining host system security and seems like a nice solution.

Ongoing Development

Ashish Kalra posted “Fix BSS corruption/overwrite issue in early x86 kernel setup”. The BSS (Block Started by Symbol) is the longstanding name used to refer to statically allocated (and pre-zeroed) variables that have memory set aside at compile time. It’s a common feature of almost every ELF (Executable and Linking Format) Linux binary you will come across, the kernel not being much different. Linux also uses stacks for small runtime allocations by having a page (or several) of memory that contains a pointer which descends (it’s actually called a “fully descending” type of stack) in address as more (small) items are allocated within it. At boot time, the kernel typically expects the bootloader will have setup a stack that can be used for very early code, but Linux is willing to handle its own setup if the bootloader isn’t sophisticated enough to handle this. The latter code isn’t well exercised and it turns out doesn’t reserve quite enough space, which causes the stack to descend (run into) the BSS segment, resulting in corruption. Ashish fixes this by increasing the fallback stack allocation size from 512 to 1024 bytes in arch/x86/boot/boot.h.

Vladimir Murzin posted “ARM: Fix dma_alloc_coherent()” and friends for NOMMU”, noting “It seem that addition of cache support for M-class CPUs uncovered [a] latent bug in DMA usage. NOMMU memory model has been treated as being always consistent; however, for R/M [Real Time and Microcontroller] classes [of ARM cores] memory can be covered by MPU [Memory Protection Unit] which in turn might configure RAM as Normal i.e. bufferable and cacheable. It breaks dma_alloc_coherent() and friends, since data can stuck in caches”.

Andrew Pinski posted “arm64/vdso: Rewrite gettimeofday into C”, which improves performance by up to 32% when compared to the existing in-kernel implementation on a Cavium ThunderX system (because there are division operations that the compiler can optimize). On their next generation, it apparently improves performance by 18% while also benefitting other ARM platforms that were tested. This is a significant improvement since that function is often called by userspace applications many times per second.

Baoquan He posted “x86/KASLR: Use old ident map page table if physical randomization failed”. Dave Young discovered a problem with the physical memory map setup of kexec/kdump kernels when KASLR (Kernel Address Space Layout Randomization) is enabled. KASLR does what it says on the tin. It applies a level of randomization to the placement of (most) physical pages of the kernel such that it is harder for an attacker to guess where in memory the kernel is located. This reduces the ability for “off the shelf” buffer overflow/ROP/similar attacks to leverage known kernel layout. But when the kernel kexec’s into a kdump kernel upon a crash, it’s loading a second kernel while attempting to leave physical memory not allocated to the crash kernel alone (so that it can be dumped). This can lead to KASLR allocation failures in the crash kernel, which (until this patch) would result in the crash kernel not correctly setting up an identity mapping for the original (older) kernel, resulting in immediately resetting the machine. With the patch, the crash kernel will fallback to the original kernel’s identity mapping page tables when KASLR setup fails.

On a separate, but related, note, Xunlei Pang posted “x86_64/kexec: Use PUD level 1GB page for identity mapping if available” which seeks to change how the kexec identity mapping is established, favoring a new top-level 1GB PUD (Page Upper Directory) allocation for the identity mappings needed prior to booting into the new kernel. This can save considerable memory (128MB “On one 32TB machine”…) vs using the current approach of many 2MB PTEs (Page Table Entries) for the region. Rather than many PTEs, an effective huge page can be mapped. PTEs are grouped into “directories” in memory that the microprocessor’s walker engines can navigate when handling a “page fault” (the process of loading the TLB – Translation Lookaside Buffer – and microTLB caches). Middle Directories are collections of PTEs, and these are then grouped into even larger collections at upper levels, depending upon nesting depth. For more about how paging works, see Mel Gorman’s “Linux Memory Management”, a classic text that is still very much relevant for the fundamentals.

Janakarajan Natarajan posted “Prevent timer value 0 for MWAITX” which limits the kernel from providing a value of zero to the privileged x86 “MWAITX” instruction. MWAIT (Memory Wait) is a series of instructions on contemporary x86 systems that allows the kernel to temporarily block execution (in place of a spinloop, or other solution) until a memory location has been updated. Then, various trickery at the micro-architectural level (a dedicated engine in the core that snoops for updates to that memory address) will handle resuming execution later. This is intended for use in waiting relatively small amounts of time in an energy efficient and high performance (low wakeup time) manner. The instruction accepts a timeout period after which a wakeup will happen regardless, but it can also accept a zero parameter. Zero is supposed to mean “never timeout” (i.e. always wait for the memory update). It turns out that existing Linux kernels do use zero on some occasions, incorrectly, and that this isn’t noticed on older microprocessors due to other events eventually triggering a wakeup regardless. On the new AMD Zen core, which behaves correctly, MWAITX may never wake up with a zero parameter, and this was causing NMI soft lockup warnings. The patch corrects Linux to do the right thing, removing the zero option.

Paul E. McKenney posted “Make SRCU be built by default”. SRCU (Sleepable) RCU (Read Copy Update) is an optional feature of the Linux kernel that provides an implementation of RCU which can sleep. Conventionally, RCU had spinlock semantics (it could not sleep). By definition, its purpose was to provide a cunning lockless update mechanism for data structures, relying upon the passage of a “grace period” defined by every processor having gone into the scheduler once (a gross simplification of RCU). But under some circumstances (for example, in a Real Time kernel) there is a need for a sleepable (and pre-emptable, but that’s another issue) RCU. And so SRCU was created more than 8 years ago. It has a companion in “Tiny SRCU” for embedded systems. A “surprisingly common case” exists now where parts of the kernel are including srcu.h so Paul’s patch builds it by default.

Laurent Dufour posted “BUG raised when onlining HWPoisoned page” in which he noted that the (being onlined) page “has already the mem_cgroup field set” (this is shown in the stack trace he posts with “page dumped because: page still charged to cgroup”). He cleans this up by clearing the mem_cgroup when a page is poisoned. His second patch skips poisoned pages altogether when performing a memory block onlining operation.

Laurent also posted an RFC (Request For Comment) patch series entitled “Replace mmap_sem by a range lock” which “implements the first step of the attempt to replace the mmap_sem by a range lock”. We will summarize this patch series in more detail the next time it is posted upstream.

Christian König posted version 4 of his “Resizable PCI BAR support” patches. PCI (and its derivatives, such as PCI Express) use BARs (Base Address Registers) to convey regions of the host physical memory map that the device will use to map in its memory. BARs themselves are just registers, but the memory they refer to must be linearly placed into the physical map (or interim IOVA map in the case that the BAR is within a virtual machine). Fitting large, multi GB windows can be a challenge, sometimes resulting in failure, but many devices can also manage with smaller memory windows. Christian’s patches attempt to provide for the best of both by adding support for a contemporary feature of PCI (Express) that allows devices with such an ability to convey a minimal BAR size and then increase the allocation if that is available. His changes since version 3 include “Fail if any BAR is still in use…”.

Ying Huang posted version 10 of his “THP swap: Delay splitting THP during swapping out” which allows for swapping of Transparent Huge Pages directly. We have previously covered iterations of this patch series. The latest changes are minimal, suggesting this is close to being merged.

Jérôme Glisse posted version 21 of his “Heterogeneous Memory Management” (HMM) patch series. This is very similar to the version we covered last week. As a reminder, HMM provides an API through which the kernel can manage devices that want to share memory with a host processing environment in a more seamless fashion, using shared address spaces and regular pointers. His latest version changes the concept of “device unaddressable” memory to “device private” (MEMORY_DEVICE_PRIVATE vs MEMORY_DEVICE_PUBLIC) memory, following the feedback from Dan Nellans that devices are changing over time such that “memory may not remain CPU-unaddressable in the future” and that, even though this would likely result in subsequent changes to HMM, it was worthwhile starting out with nomenclature correctly referring to memory that is considered private to a device and will not be managed by HMM.

Intel’s test Robot noticed a 12.8% performance improvement in one of their scalability benchmarks when running with a recent linux-next tree containing Al Viro’s “amd64: get rid of zeroing” patch. This is patch of his larger “uccess unification” patch series that aims to simply and cleanup the process of copying data to/from kernel and userspace. In particular, when asking the kernel to copy data from one userspace virtual address to another, there is no need to apply the level of data zeroing that typically applies to buffers the kernel copies (for security purposes – preventing leakage of extra data beyond structures returned from kernel calls, as an example). When both source and destination are already in userspace, there is no security issue, but there was a performance degregation that Viro had noticed and fixed.

Julien Grall posted “Xen: Implement EFI reset_system callback”, which provides a means to correctly reboot and power off Dom0 host Xen Hypervisors when running on EFI systems for which reset_system is used by reference (ARM).

 

April 27, 2017 03:50 PM

April 26, 2017

Michael Kerrisk (manpages): Linux Security and Isolation APIs course in Munich (17-19 July 2017)

I've scheduled the first public instance of my "Linux Security and Isolation APIs" course to take place in Munich, Germany on 17-19 July 2017. (I've already run the course a few times very successfully in non-public settings.) This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

You can find out more about the course (such as expected background and course pricing) at
http://man7.org/training/sec_isol_apis/
and see a detailed course outline at
http://man7.org/training/sec_isol_apis/sec_isol_apis_course_outline.html

April 26, 2017 07:38 PM

April 23, 2017

Pete Zaitcev: SDSC Petabyte scale Swift cluster

It was almost two years since last Swift numbers, but here are a few numbers from San Diego (whole presentation is available on Github:

> 5+ PB data
> 42 servers and 1000+ disks
> 3, 4, 6 TB SAS drives for objects, SATA SSD drives for Account/Container
> 10 GbE network

The 5 PB size is about quarter scale of the largest known Swift cluster. V.impressive. The 100 PB installation that RAX run consists of 6 federated clusters. Number of objects and request rate are unknown. 1000/42 comes comes to about 25-30 disks per server, but they mention 45-disk JBODs later, with plans to move to 90-disk JBODs. Nodes of large clusters continue getting fatter.

The cluster is in operation since 2011 (started with the Diablo release). They still use Pound for load-balancing.

April 23, 2017 03:04 AM

April 20, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/19

Audiohttp://traffic.libsyn.com/jcm/20170419.mp3

[ Apologies for the delay – I have been a little sick for the past day or so and was out on Monday volunteering at the Boston Marathon, so my evenings have been in scarse supply to get this week’s issue completed ]

In this week’s edition: Linus Torvalds announces Linux 4.11-rc7, a kernel security update bonanza, the end of Kconfig maintenance, automatic NUMA balancing, movable memory, a bug in synchronize_rcu_tasks, and ongoing development. The Linux 4.12 merge window should open before next week.

Linus Torvalds announced Linux 4.11-rc7, noting that “You all know the drill by now. We’re in the late rc phase, and this may be the last rc if nothing surprising happens”. He also pointed out how things had been calm, and then, “as usual Friday happened”, leading to a number of reverts for “things that didn’t work out and aren’t worth trying to fix at this point”. In anticipation of the imminent opening of the 4.12 merge window (period of time during which disruptive changes are allowed) Linux Weekly News posted their usual excellent summary of the 4.11 development cycle. If you want to support quality Linux journalism, you should subscribe to LWN today.

Ted (Theodore) Ts’o posted “[REGRESSION] 4.11-rc: systemd doesn’t see most devices” in which he noted that “[t]here is a frustrating regression in 4.11 that I’ve been trying to track down. The symptoms are that a large number of systemd devices don’t show up.” (which was affecting the encrypted device mapper target backing his filesystem). He had a back and forth with Greg K-H (Kroah Hartman) about it with Greg suggesting Ted watch with udevadm and Ted pointing out that this happens at boot and is hard to trace. Ted’s final comment was interesting: “I’d do more debugging, but there’s a lot of magic these days in the kernel to udev/systemd communications that I’m quite ignorant about. Is this a good place I can learn more about how this all works, other than diving into the udev and systemd sources?”. Indeed. In somewhat interesting timing, Enric Balletbo i Serra later posted a 5 part patch series entitled “dm: boot a mapped device without an initramfs”.

Rafael J. Wysocki posted some late breaking 4.11-rc7 fixes for ACPI, including one patch reverting a “recent ACPICA commit [to the ACPI – Advanced Configuration and Power Interface – Component Architecture aka reference code upon which the kernel’s runtime interpretor is based] targeted at catching firmware bugs” that did do so, but also caused “functional problems”.

Announcements

Jiri Slaby announced Linux 3.12.73.

Greg KH (Kroah-Hartman) announced Linux 3.18.49, 3.19.49 4.4.62, 4.9.23, and 4.10.11. As he noted in his review posting prior to announcing the latest 3.18 kernel, 3.18 was indeed “dead and forgotten and left to rot on the side of the road” but “unfortunately, there’s a few million or so devices out there in the wild that still rely on this kernel”. Important security fixes are included in all of these updates. Greg doesn’t commit to bring 3.18 out of retirement for very long, but he does note that Google is assisting a little for the moment to make sure 3.18 based devices get some updates.

Steven Rostedt announced “Real Time” (preempt-rt) kernels 3.2.88-rt126 (“just an update to the new stable 3.2.88 version”), 3.12.72-rt97, and 4.4.60-rt73. Separately, Paul E. McKenney noted “A Hannes Weisbach of TU Dresden published this master thesis on quasi-real-time scheduling:
http://os.inf.tu-dresden.de/papers_ps/weisbach-master.pdf

Rafael J. Wysocki announced a CFP (Call For Papers) targeting the upcoming LPC (Linux Plumbers Conference) Power Management and Energy-Awareness microconference “Call for topics”. Registration for LPC just opened.

Yann E. MORIN posted “MAINTAINERS: relinquish kconfig” in which he apologized for not having enough time to maintain Kconfig with “I’ve been almost entirely absent, which totally sucks, and there is no excuse for my behavior and for not having relinquished this earlier”. With such harsh friends as yourself, who needs enemies? Joking aside, this is sad news, since Kconfig is the core infrastructure used to configure the kernel. It wasn’t long before someone else (Randy Dunlap) posted a patch for Kconfig that no longer has a maintainer (Randy’s patch implements a sort method for config options)

[as an aside, as usual, I have pinged folks who might be looking for an opportunity to encourage them to consider stepping up to take this on].

Automatic NUMA balancing, movable memory, and more!

Mel Gorman posted “mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa”. Modern Linux kernels include a feature known as automatic numa balancing which relies upon marking regions of virtual memory as inaccessible via their page table entries (PTEs) and set a special prot_numa protection hinting bit. The idea is that a later “NUMA hinting fault” on access to the page will allow the Operating System to determine whether it should migrate the page to another NUMA node. Pages are simply small granular units of system memory that are managed by the kernel in setting up translations from virtual to physical memory. When an access to a virtual address occurs, hardware (or, on some architectures, special software) “walkers” navigate the “page tables” pointed to by a special system register. The walker will traverse various “directories” formed from collections of pages in a hierarchical fashion intended to require less space to store page tables than if entries were required for every possible virtual address in a 32 or 64-bit space.

Contemporary microprocessors also support multiple page (granule) sizes, with a fundamental size (commonly 4K or 64K) being supplemented by the ability for larger pages (aka “hugepages”) to be used for very large regions of contiguous virtual memory at less overhead. Common sizes of huge pages are 2MB, 4MB, 512M, and even multi-GB, with “contiguous hint bits” on some modern architectures allowing for even greater flexibility in the footprint of page table and TLB (Translation Lookaside Buffer) entries by only requiring physical entries for a fraction of a contiguous region. On Intel x86 Architecture, huge pages are implemented using the Page Size Extensions (PSE), which allows for a PMD (Page Middle Directory) to be replaced by an entry that effectively allocates the entire range to a single page entry. When a hardware walker sees this, a single TLB entry can be used for an entire range of a few MB instead of many 4K entries.

A bug known as a “race condition” exist(ed) in the automatic NUMA hinting code in which change_pmd_range would perform a number of checks without a lock being held to protect against a concurrent race againt a parallel protection updated (which does happen under a lock) that would clear the PMD and fill it with a prot_numa entry. Mel adds a new pmd_none_or_trans_huge_or_clear_bad function that correctly handles this rare corner case sequence, and documents it (in mm/mprotect.c). Michal Hocko responded with “you will probably win the_longer_function_name_contest but I do not have [a] much better suggestion”.

Speaking of Michal Hocko, he posted version 2 of a patch series entitled “mm: make movable onlining suck less” in which he described the current status quo of “Movable onlining” as “a real hack with many downsides”. Linux divides memory into regions describing zones with names like ZONE_NORMAL (for regular system memory) and ZONE_MOVABLE (for memory the contents of which is entirely pages that don’t contain unmovable system data, firmware data, or for other reasons cannot be trivially moved/offlined/etc.).

The existing implementation has a number of constraints around which pages can be onlined. In particular, around the relative placement of the memory being onlined vs the ZONE_NORMAL memory. This, Michal described as “mainly reintroduction of lowmem/highmem issues we used to have on 32b systems – but it is the only way to make the memory hotremove more reliable which is something that people are asking for”. His patch series aims to make “the onlining semantic more usable [especially when driven by udev]…it allows to online memory movable as long as it doesn’t clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap”. He noted that he had discussed this patch series with Jérôme Glisse (author of the HMM – Heterogenous Memory Management – patches) which were to be rebased on top of this patch series. Michal said he would assist with resolving any conflicts.

Igor Mammedov (Red Hat) noted that he had “given [the movable onlining] series some dumb testing” and had found three issues with it, which he described fully. In summary, these were “unable to online memblock as NORMAL adjacent to onlined MOVABLE”, “dimm1 assigned to node 1 on qemu CLI memblock is onlined as movable by default”, and “removable flag flipped to non-removable state”. Michal wasn’t initially able to reproduce the second issue (because he didn’t have ACPI_HOTPLUG_MEMORY enabled in his kernel) but was then able to followup noting that it was similar to another bug he had already fixed. Jérôme subsequently followed up with an updated HMM patchset as well.

Joonsoo Kim (LGE) posted version 7 of a patch series entitled “Introduce ZONE_CMA” in which he reworks the CMA (Contiguous Memory Allocator) used by Linux to manage large regions of physcially contiguous memory that must be allocated (for device DMA buffers in cases where scatter gather DMA or an IOMMU are not available for managed translations). In the existing CMA implementation, physically contiguous pages are reserved at boot time, but they operate much as reserved memory that happens to fall within ZONE_NORMAL (but with a special “migratetype”, MIGRATE_CMA), and will not generally be used by the system for regular memory allocations unless there are no movable freepages available. In other words, only as a last possible resort.

This means that on a system with 1024MB of memory, kswapd “is mostly woke[n] up when roughly 512MB free memory is left”. The new patches instead create a distinct ZONE_CMA which has some special properties intended to address utilization issues with the existing implementation. As he notes, he had a lengthy discussion with Mel Gorman after the LSF/MM 2016 conference last year, in which Mel stated “I’m not going to outright NAK your series but I won’t ACK it either”. A lot of further discussion is anticipated. Michal Hocko might have summarized it best with, “the cover letter didn’t really help me to understand the basic concepts to have a good starting point before diving into the implementation details [to review the patches]”. Joonsoo followup up with an even longer set of answers to Michal.

A bug in synchronize_rcu_tasks()

Paul E. McKenney posted “There is a Tasks RCU stall warning” in which he noted that he and Steven Rostedt were seeing a stall that didn’t report until it had waited 10 minutes (and recommended that Steven try setting the kernel rcupdate.rcu_task_stall_timeout boot parameter). RCU (Read Copy Update) is a clever mechanism used by Linux (under a GPL license from IBM, who own a patent on the underlying technology) to perform lockless updates to certain types of data structure, by tracking versions of the structure and freeing the older version once references to it have reached an RCU quiescent state (defined by each CPU in the system having scheduled synchronize_rcu once).

Steven noted that for the issue under discussion there was a thread that “never goes to sleep, but will call cond_resched() periodically [a function that is intended to possibly call into the scheduler if there is work to be done there]”. On the RT (Real Time, “preempt-rt”) kernel, Steven noted that cond_resched() is a nop and that the code he had been working on should have made a call directly to the schedule() function. Which lead to him suggesting he had “found a bug in synchronize_rcu_tasks()” in the case that a task frequently calls schedule() but never actually performs a context switch. In that case, per Paul’s subsequent patch, the kernel is patched to specially handle calls to schedule() not due to regular preemption.

Ongoing Development

Anshuman Khandual posted “mm/madvise: Clean up MADV_SOFT_OFFLINE and MADV_HWPOISON” noting that “madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead.” The madvise infrastructure allows for coordination between kernel and userspace about how the latter intends to use regions of its virtual memory address space. Using this interface, it is possible for applications to provide hints as to their future usage patterns, relinquish memory that they no longer require, inject errors, and much more. This is particularly useful to KVM virtual machines, which appear as regular processes and can use madvise() to control their “RAM”.

Sricharan R (Codeaurora) posted version 11 of a patch series entitled “IOMMU probe deferral support”, which “calls the dma ops configuration for the devices at a generic place so that it works for all busses”.

Kishon Vijay Abraham sent a pull request to Greg K-H (Kroah Hartman) for Linux 4.12 that included individual patches in addition to the pull itself. This resulted in an interesting side discussion between Kishon and Lee Jones (Linaro) about how this was “a strange practice” Lee hadn’t seen before.

Thomas Garnier (Google) posted version 7 of a patch series entitled “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. Once again, he cites how this would have preemptively mitagated a Google Project Zero security bug.

Christopher Bostic posted version 6 of a patch series enabling support for the “Flexible Support Interface” (FSI) high fan out bus on IBM POWER systems.

Dan Williams (Intel) posted “x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions” in which he says “Before we rework the “pmem api” to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache.”

Leo Yan (Linaro) posted an RFC (Request For Comments) patch series entitled “coresight: support dump ETB RAM” which enables support for the Embedded Trace Buffer (ETB) on-chip storage of trace data. This is a small buffer (usually 2KB to 8KB) containing profiling data used for postmortem debug.

Thierry Escande posted “Google VPD sysfs driver”, which provides support for “accessing Google Vital Product Data (VPD) through the sysfs”.

Alex(ander) Graf posted version 6 of “kvm: better MWAIT emulation for guests”, which provides new capability information to user space in order for it to inform a KVM guest of the availability of native MWAIT instruction support. MWAIT allows a (guest) kernel to wake up a remote (v)CPU without an IPI – InterProcessor Interrupt – and the associated vmexit that would then occur to schedule the remote vCPU for execution. The availability of MWAIT is deliberately not provided in the normal CPUID bitmap since “most people will want to benefit from sleeping vCPUs to allow for over commit” (in other words with MWAIT support, one can arrange to keep virtual CPUs runnable for longer and this might impact the latency of hosting many tenants on the same machine).

David Woodhouse posted version 2 of his patch series entitled “PCI resource mmap cleanup” which “pursues my previous patch set all the way to its logical conclusion”, killing off “the legacy arch-provided pci_mmap_page_range() completely, along with its vile ‘address converted by pci_resource_ro_user()’ API and the various bugs and other strange behavior that various architectures had”. He noted that to “accommodate the ARM64 maintainers’ desire *not* to support [the legacy] mmap through /proc/bus/pci I have separated HAVE_PCI_MMAP from the sysfs implementation”. This had previously been called out since older versions of DPDK were looking for the legacy API and failing as a result on newer ARM server platforms.

Darren Hart posted an RFC (Request For Comments) patch series entitled “WMI Enhancements” that seeks to clean up the “parallel efforts involving the Windows Management Instrumentation (WMI) and dependent/related drivers”. He wanted to have a “round of discussion among those of you that have been invovled in this space before we decide on a direction”. The proposed direction is to “convert[] wmi into a platform device and a proper bus, providing devices for dependent drivers to bind to, and a mechanism for sibling devices to communicate with each other”. In particular, it includes a capability to expose WMI devices directly to userspace, which resulted in some pushback (from Pali Rohár) and a suggestion that some form of explicit whitelisting of wmi identifiers (GUIDS) should be used instead. Mario Limonciello (Dell) had many useful suggestions.

Wei Wang (Intel) posted version 9 of a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration” in which he “implements two optimizations”. The first “tranfer[s] pages in chunks between the guest and host”. The second “transfer[s] the guest unused pages to the host so that they can be skipped in live migration”.

Dmitry Safonov posted “ARM32: Support mremap() for sigpage/vDSO” which allows CRIU (Checkpoint and Restart in Userspace) to complete its process of restoring all application VMA (Virtual Memory Area) mappings on restart by adding the ability to move the vDSO (Virtual Dynamic Shared Object) and sigpage kernel pages (data explicitly mapped into every process by the kernel to accelerate certain operations) into “the same place where they were before C/R”.

Matias Bjørling (Cnex Labs) prepared a git pull request for “LightNVM” targeting Linux 4.12. This is “a new host-side translation layer that implements support for exposing Open-Channel SSDs as block devices”.

Greg Thelen (Google) posted “slab: avoid IPIs when creating kmem caches”. Linux’s SLAB memory allocator (see also the paper on the original Solaris memory allocator) can be used to pre-allocate small caches of objects that can then be efficiently used by various kernel code. When these are allocated, per-cpu array caches are created, and a call is made to kick_all_cpus_sync() which will schedule all processors to run code to ensure that that there are no stale references to the old array caches. This global call is performed using an IPI (InterProcessor Interrupt), which is relatively expensive, especially in the case that a new cache is being created (and not replacing an old one). In that case wasteful IPIs are generated on the order of 47,741 additional ones in the example given vs. 1,170 in a patched kernel.

April 20, 2017 08:32 AM

April 19, 2017

Kernel Podcast: One Day Delay Due to Boston Marathon

The Podcast is delayed until Wednesday evening this week. Usually, I try to get it out on a Monday night (or at least write it up then and actually post on Tuesday), but when holidays or other events fall on a Monday, I will generally delay the podcast by a day. This week, I was volunteering at the Marathon all of Monday, which means the prep is taking place Tuesday night instead.

April 19, 2017 04:15 AM

April 16, 2017

Paul E. Mc Kenney: Book review: "Fooled by Randomness" and "The Black Swan"

I avoided “The Black Swan” for some years because I was completely unimpressed with the reviews. However, I was sufficiently impressed a recent Nassim Taleb essay to purchase his “Incerto” series. I have read the first two books so far (“Fooled by Randomness” and “The Black Swan”), and highly recommend both of them.

The key point of these two books is that in real life, extremely low-probability events can have extreme effects, and such events are the black swans of the second book's title. This should be well in the realm of common sense: Things like earthquakes, volcanoes, tidal waves, and asteroid strikes should illustrate this point. A follow-on point is that low-probability events are inherently difficult to predict. This also should be non-controversial: The lower the probability, the less the frequency, and thus the less the experience with that event. And of my four examples, we are getting semi-OK at predicting volcanic eruptions (Mt. St. Helens being perhaps the best example of a predicted eruption), not bad at tidal waves (getting this information to those who need it still being a challenge), and hopeless at earthquakes and asteroid strikes.

Taleb then argues that the increasing winner-takes-all nature of our economy increases the frequency and severity of economic black-swan events, in part by rendering normal-distribution-based statistics impotent. If you doubt this point, feel free to review the economic events of the year 2008. He further argues that this process began with the invention of writing, which allowed one person to have an outsized effect on contemporaries and on history. I grant that modern transportation and communication systems can amplify black-swan events in ways that weren't possible in prehistoric times, but would argue that individual prehistoric people had just as much fun with the black swans of the time, including plague, animal attacks, raids by neighboring tribes, changes in the habits of prey, and so on. Nevertheless, I grant Taleb's point that most prehistoric black swans didn't threaten the human race as a whole, at least with the exception of asteroid strikes.

My favorite quote of the book is “As individuals, we should love free markets because operators in them can be as incompetent as they wish.” My favorite question is implied by his surprise that so few people embrace both sexual and economic freedom. Well, ask a stupid question around me and you are likely to get a stupid answer. Here goes: Contraceptives have not been in widespread use for long enough for human natural selection to have taken much account of their existence. Therefore, one should expect the deep subconscious to assume that sexual freedom will produce lots of babies, and that these babies will need care and feeding. Who will pay for this? The usual answer is “everyone” with consequent restrictions on economic freedom. If you don't like this answer, fine, but please consider that it is worth at least what you are paying for it. ;–)

So what does all of this have to do with parallel programming???

As it turns out, quite a lot.

But first, I will also point out my favorite misconception in the book, which is that NP has all that much to do with incomputability. On the other hand, the real surprise is that the trader-philosopher author would say anything at all about them. Furthermore, Taleb would likely point out that in the real world, the distinction between “infeasible to compute” and “impossible to compute” is a distinction without a difference.

The biggest surprise for me personally from these books is that one of the most feared category of bugs, race conditions, are not black-swan bugs, but are instead white-swan bugs. They are quite random, and very amenable to the Gaussian statistical tools that Taleb so rightly denigrates for black-swan situations. You can even do finite amounts of testing and derive good confidence bounds for the reliability of your software—but only with respect to white-swan bugs such as race conditions. So I once again feel lucky to have the privilege of working primarily on race conditions in concurrent code!

What is a black-swan bug? One class of such bugs caused me considerable pain at Sequent in the 1990s. You see, we didn't have many single-CPU systems, and we not infrequently produced software that worked only on systems with at least two CPUs. Arbitrarily large amounts of testing on multi-CPU systems would fail to spot such bugs. And perhaps you have encountered bugs that happened only at specific times in specific states, or as they are sometimes called, “new moon on Tuesdays” bugs.

Taleb talks about using mathematics from fractals to turn some classes of black-swan events into grey-swan events, and something roughly similar can be done with validation. We have an ever-increasing suite of bugs that people have injected in the past, and we can make some statements about how likely someone is to make that same error again. We can then use this experience to guide our testing efforts, as I try to do with the rcutorture test suite. That said, I expect to continue pursuing additional bug-spotting methods, including formal verification. After all, that fact that race conditions are not black swans does not necessarily make them easy, particularly in cases, such as the Linux kernel, where there are billions of users.

In short, ignore the reviews of “Fooled by Randomness” and “The Black Swan”, including this one, and go read the actual books. If you only have time to read one of them, your should of course pick one at random. ;–)

April 16, 2017 09:13 PM

April 14, 2017

Linux Plumbers Conference: Registration for Linux Plumbers Conference is Now Open

The 2017 Linux Plumbers Conference organizing committee is pleased to announce that the registration for this year’s conference is now open. Information on how to register can be found here. Registration prices and cutoff dates are published in the ATTEND page. A reminder that we are following a quota system to release registration slots. Therefore the early registration rate will remain in effect until early registration closes on June 18 2017, or the quota limit (150) is reached, whatever comes earlier. As usual, contact us if you have questions.

April 14, 2017 12:06 PM

April 13, 2017

Pete Zaitcev: Amazon Snowmobile

I don't know how I missed this, it should've been hyped. But here it is: an Amazon truck trailer, which is basically a giant Snowball, used to overcome the data inertia. Apparently, its capacity is 100 PB (the article is not clear about it, but it mentions that 10 Snowmobiles transfer an EB). The service apparently works one way only: you cannot use a Snomobile to download your data from Amazon.

P.S. Amazon's official page for Snowmobile confirms the 100 PB capacity.

April 13, 2017 04:59 AM

April 12, 2017

Matthew Garrett: Disabling SSL validation in binary apps

Reverse engineering protocols is a great deal easier when they're not encrypted. Thankfully most apps I've dealt with have been doing something convenient like using AES with a key embedded in the app, but others use remote protocols over HTTPS and that makes things much less straightforward. MITMProxy will solve this, as long as you're able to get the app to trust its certificate, but if there's a built-in pinned certificate that's going to be a pain. So, given an app written in C running on an embedded device, and without an easy way to inject new certificates into that device, what do you do?

First: The app is probably using libcurl, because it's free, works and is under a license that allows you to link it into proprietary apps. This is also bad news, because libcurl defaults to having sensible security settings. In the worst case we've got a statically linked binary with all the symbols stripped out, so we're left with the problem of (a) finding the relevant code and (b) replacing it with modified code. Fortuntely, this is much less difficult than you might imagine.

First, let's find where curl sets up its defaults. Curl_init_userdefined() in curl/lib/url.c has the following code:
set->ssl.primary.verifypeer = TRUE;
set->ssl.primary.verifyhost = TRUE;
#ifdef USE_TLS_SRP
set->ssl.authtype = CURL_TLSAUTH_NONE;
#endif
set->ssh_auth_types = CURLSSH_AUTH_DEFAULT; /* defaults to any auth
type */
set->general_ssl.sessionid = TRUE; /* session ID caching enabled by
default */
set->proxy_ssl = set->ssl;

set->new_file_perms = 0644; /* Default permissions */
set->new_directory_perms = 0755; /* Default permissions */

TRUE is defined as 1, so we want to change the code that currently sets verifypeer and verifyhost to 1 to instead set them to 0. How to find it? Look further down - new_file_perms is set to 0644 and new_directory_perms is set to 0755. The leading 0 indicates octal, so these correspond to decimal 420 and 493. Passing the file to objdump -d (assuming a build of objdump that supports this architecture) will give us a disassembled version of the code, so time to fix our problems with grep:
objdump -d target | grep --after=20 ,420 | grep ,493

This gives us the disassembly of target, searches for any occurrence of ",420" (indicating that 420 is being used as an argument in an instruction), prints the following 20 lines and then searches for a reference to 493. It spits out a single hit:
43e864: 240301ed li v1,493
Which is promising. Looking at the surrounding code gives:
43e820: 24030001 li v1,1
43e824: a0430138 sb v1,312(v0)
43e828: 8fc20018 lw v0,24(s8)
43e82c: 24030001 li v1,1
43e830: a0430139 sb v1,313(v0)
43e834: 8fc20018 lw v0,24(s8)
43e838: ac400170 sw zero,368(v0)
43e83c: 8fc20018 lw v0,24(s8)
43e840: 2403ffff li v1,-1
43e844: ac4301dc sw v1,476(v0)
43e848: 8fc20018 lw v0,24(s8)
43e84c: 24030001 li v1,1
43e850: a0430164 sb v1,356(v0)
43e854: 8fc20018 lw v0,24(s8)
43e858: 240301a4 li v1,420
43e85c: ac4301e4 sw v1,484(v0)
43e860: 8fc20018 lw v0,24(s8)
43e864: 240301ed li v1,493
43e868: ac4301e8 sw v1,488(v0)

Towards the end we can see 493 being loaded into v1, and v1 then being copied into an offset from v0. This looks like a structure member being set to 493, which is what we expected. Above that we see the same thing being done to 420. Further up we have some more stuff being set, including a -1 - that corresponds to CURLSSH_AUTH_DEFAULT, so we seem to be in the right place. There's a zero above that, which corresponds to CURL_TLSAUTH_NONE. That means that the two 1 operations above the -1 are the code we want, and simply changing 43e820 and 43e82c to 24030000 instead of 24030001 means that our targets will be set to 0 (ie, FALSE) rather than 1 (ie, TRUE). Copy the modified binary back to the device, run it and now it happily talks to MITMProxy. Huge success.

(If the app calls Curl_setopt() to reconfigure the state of these values, you'll need to stub those out as well - thankfully, recent versions of curl include a convenient string "CURLOPT_SSL_VERIFYHOST no longer supports 1 as value!" in this function, so if the code in question is using semi-recent curl it's easy to find. Then it's just a matter of looking for the constants that CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are set to, following the jumps and hacking the code to always set them to 0 regardless of the argument)

comment count unavailable comments

April 12, 2017 06:10 PM

April 11, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/11

Audiohttp://traffic.libsyn.com/jcm/20170411.mp3

In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing,kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.

Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11. Which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[‘]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.

Stable kernels

Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9

Ben Hutchings announced Linux 3.2.88 and 3.16.43

Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears [presumably due to side channel attack]. Willy followed up noting that he would “check if the 3.12 patches…can be safely backported”.

Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)

Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Directory Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.

It appears that a feature of Intel’s L3 cache (LLC in Intel-speak) in these parts is that they can not only assign specific proportions of the L3 cache slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but also can control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that more complex. But it is cute, and looks great in marketing materials.

Coherent Device Memory (CDM) on top of HMM

Jérôme Glisse posted and RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogenous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability for certain contempory devices to fault on access to untranslated addresses managed in device page tables thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.

In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]

Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements

Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers [TLBs] are caches built into microprocessors that store a translation of a CPU virtual address to “physical” (or, for a virtual machine, into an intermediate hypervisor) address. They save an unnecessary page table walk (of the software managed hardware/software structure – depending upon architecture – that “walkers” navigate to perform a translation during a “page fault” or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries, etc.). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invlidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address and virtual memory space tag IDs in the architecture).

TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might be also used by other cores that have touched the same virtual memory space. And those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but some other legacy architectures don’t provide these kinds of capabilities. On those architectures that don’t provide for a hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter Processor Interrupt) to cause an IRQ (interrupt message) to be delivered to the remote interrupt controller CPU interface (e.g. LAPIC on Intel x86 architecture) of the destination core, which will run an IPI handler in response that does the TLB teardown.

As Vitaly notes, nobody is recommending doing local TLB flash using a hypercall, but there can be significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher'” he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.

Ongoing Development

David Howells posted an magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system [physcial presence]. Among the many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters” which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits”, a patch which tells the kernel to prefer to do memory allocations below 47-bits (the previous “Canonical Addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of the address space in pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls ith MAP_FIXED hints to cause larger allocations. There was some concern that larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature) and so Kirill avoids both in the same process.

Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for the POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.

Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t see Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux: https://www.youtube.com/watch?v=uccPR6X8vy8

Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.

Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concered that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zjilsta) noted that this absolutely is the intended behavior.

Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entires in a per-CPU buffer into order accoring to their backing swap deivce before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.

Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userpace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes using it nice for software folks].

Binoy Jayan posted “IV [Initial Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.

Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”. Then, he posted a followup with a minor fix based upon feedback. This should close the issue of certain bug reports posted by those using an IOMMU on a Stoney platform and seeing lockups under high TLB invalidation.

Born Helgass posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.

Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quick list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or somthing, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.

Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will now queue up indefintely audit messages for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit running.

Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Page] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle falure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.

Finally, Kees Cook followed up (as he had promised) on a discussion from last week, with an RFC (Request for Comments) patch series entitiled “mm: Tighten x86 /dev/mem with zeroing”, including the suggestion from Linus that reads from /dev/mem that aren’t permitted simply return zero data. This was just one of many security discussions he was involved in (as usual). Another included having suggested a patch posted by Eddie Kovsky entitled “module: verify address is read-only”, which modifies kernel functions that use modules to verify that they are in the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.

April 11, 2017 03:15 PM

April 10, 2017

Daniel Vetter: Review, not Rocket Science

About a week ago there where 2 articles on LWN, the first coverging memory management patch review and the second covering the trouble with making review happen. The take away from these two articles seems to be that review is hard, there’s a constant lack of capable and willing reviewers, and this has been the state of review since forever. I’d like to counter pose this with our experiences in the graphics subsystem, where we’ve rolled out a well-working review process for the Intel driver, core subsystem and now the co-maintained small driver efforts with success, and not all that much pain.

tl;dr: require review, no exceptions, but document your expectations

Aside: This is written with a kernel focus, from the point of view of a maintainer or group of maintainers trying to establish review within their subsystem. But the principles really work anywhere.

Require Review

When review doesn’t happen, that generally means no one regards it as important enough. You can try to improve the situation by highlighting review work more, and giving less focus for top committer stats. But that only goes so far, in the end when you want to make review happen, the one way to get there is to require it.

Of course if that then results in massive screaming, then maybe you need to start with improving the recognition of review and value it more. Trying to put a new process into place over the persistent resistance is not going to work. But as long as there’s general agreement that review is good, this is the easy part.

No Exceptions

The trouble is that there’s a few really easy way to torpedo reviews before you event started, and they’re all around special priviledges and exceptions. From one of the LWN articles:

… requiring reviews might be fair, but there should be one exception: when developers modify their own code.

Another similar exception is often demanded by maintainers for applying their own patches to code they maintain - in the Linux kernel only about 25% of all maintainer patches have any kind of review tag attached when they land. This is in contrast to other contributors, who always have to get past at least their direct maintainer to get a patch applied.

There’s a few reasons why having exceptions for the original developer of some code, or a maintainer of a subsystem, is a really bad idea:

On the flip side, requiring review from all your main contributors is a really easy way to kickstart a working review economy: Instantly you both have a big demand for review. And capable reviewers who are very much willing to trade a bit of review for getting reviews on their own patches.

Another easy pitfall is maintainers who demand unconditional NAck rights for the code they maintain, sometimes spiced up by claiming they don’t even need to provide reasons for the rejection. Of course more experienced people know more about the pitfalls of a code base, and hence are more likely to find serios defects in a change. But most often these rejections aren’t about clear bugs, but more design dogmas once established (and perhaps no longer valid), or just plain personal style preferences. Again, this is a great way to prevent review from happening:

And again, I haven’t seen unicorns who write perfect code yet, neither have I seen someone who’s review feedback was consistently impeccable.

But Document your Expectations

Training reviews through direct mentoring is great, but it doesn’t scale. Document what you expect from a review as much as possible. This includes everything from coding style, to how much and in which detail code correctness should be checked. But also related things like documentation, test-cases, and process details on how exactly, when and where review is happening.

And like always, executable documentation is much better, hence try to script as much as possible. That’s why build-bots, CI bots, coding style bots, and all these things are possible - it frees the carbon-based reviewers from wasting time on the easy things and instead concentrate on the harder parts of review like code design and overall architecture, and how to best get there from the current code base. But please make sure your scripting and automated testing is of high-quality, because if the results need interpretation by someone experienced you haven’t gained anything. The kernel’s checkpatch.pl coding style checker is a pretty bad example here, since it’s widely accepted that it’s too opinionated and it’s suggestions can’t be blindly followed.

As some examples we have the dim ingloriuos maintainer scripts and some fairly extensive documentation on what is expected from reviewers for Intel graphics driver patches. Contrast that to the comparetively lax review guidelines for small drivers in drm-misc. At Intel we’ve also done internal trainings on review best practices and guidelines. Another big thing we’re working on on the automation front is CI crunching through patchwork series to properly regression test new patches before they land.

April 10, 2017 12:00 AM

April 09, 2017

Matthew Garrett: A quick look at the Ikea Trådfri lighting platform

Ikea recently launched their Trådfri smart lighting platform in the US. The idea of Ikea plus internet security together at last seems like a pretty terrible one, but having taken a look it's surprisingly competent. Hardware-wise, the device is pretty minimal - it seems to be based on the Cypress[1] WICED IoT platform, with 100MBit ethernet and a Silicon Labs Zigbee chipset. It's running the Express Logic ThreadX RTOS, has no running services on any TCP ports and appears to listen on two single UDP ports. As IoT devices go, it's pleasingly minimal.

That single port seems to be a COAP server running with DTLS and a pre-shared key that's printed on the bottom of the device. When you start the app for the first time it prompts you to scan a QR code that's just a machine-readable version of that key. The Android app has code for using the insecure COAP port rather than the encrypted one, but the device doesn't respond to queries there so it's presumably disabled in release builds. It's also local only, with no cloud support. You can program timers, but they run on the device. The only other service it seems to run is an mdns responder, which responds to the _coap._udp.local query to allow for discovery.

From a security perspective, this is pretty close to ideal. Having no remote APIs means that security is limited to what's exposed locally. The local traffic is all encrypted. You can only authenticate with the device if you have physical access to read the (decently long) key off the bottom. I haven't checked whether the DTLS server is actually well-implemented, but it doesn't seem to respond unless you authenticate first which probably covers off a lot of potential risks. The SoC has wireless support, but it seems to be disabled - there's no antenna on board and no mechanism for configuring it.

However, there's one minor issue. On boot the device grabs the current time from pool.ntp.org (fine) but also hits http://fw.ota.homesmart.ikea.net/feed/version_info.json . That file contains a bunch of links to firmware updates, all of which are also downloaded over http (and not https). The firmware images themselves appear to be signed, but downloading untrusted objects and then parsing them isn't ideal. Realistically, this is only a problem if someone already has enough control over your network to mess with your DNS, and being wired-only makes this pretty unlikely. I'd be surprised if it's ever used as a real avenue of attack.

Overall: as far as design goes, this is one of the most secure IoT-style devices I've looked at. I haven't examined the COAP stack in detail to figure out whether it has any exploitable bugs, but the attack surface is pretty much as minimal as it could be while still retaining any functionality at all. I'm impressed.

[1] Formerly Broadcom

comment count unavailable comments

April 09, 2017 12:16 AM

April 07, 2017

Andi Kleen: Cheat sheet for Intel Processor Trace with Linux perf and gdb

What is Processor Trace

Intel Processor Trace (PT) traces program execution (every branch) with low overhead.

This is a cheat sheet of how to use PT with perf for common tasks

It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page.

PT support in hardware

CPU Support
Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
Skylake (6th generation Core, Xeon v5) Fine grained timing. Address filtering.
Goldmont (Apollo Lake, Denverton) Fine grained timing. Address filtering.

PT support in Linux

PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.

There are also other tools that support PT: VTune, simple-pt, gdb, JTAG debuggers.

In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features

Linux version Support
Linux 4.1 Initial PT driver
Linux 4.2 Support for Skylake and Goldmont
Linux 4.3 Initial user tools support in Linux perf
Linux 4.5 Support for JIT decoding using agent
Linux 4.6 Bug fixes. Support address filtering.
Linux 4.8 Bug fixes.
Linux 4.10 Bug fixes. Support for PTWRITE and power tracing

Many commands require recent perf tools, you may need to update them rom a recent kernel tree.

This article covers mainly Linux perf and briefly gdb.

Preparations

Only needed once.

Allow seeing kernel symbols (as root)


echo kernel.kptr_restrict=0' >> /etc/sysctl.conf
sysctl -p

Basic perf command lines for recording PT

ls /sys/devices/intel_pt/format

Check if PT is supported and what capabilities.

perf record -e intel_pt// program

Trace program

perf record -e intel_pt// -a sleep 1

Trace whole system for 1 second

perf record -C 0 -e intel_pt// -a sleep 1

Trace CPU 0 for 1 second

perf record --pid $(pidof program) -e intel_pt//

Trace already running program.

perf has to save the data to disk. The CPU can execute branches much faster than than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace, but work with shorter traces. Long traces also take a lot of time to decode.

When decoding kernel data the decoder usually has to run as root.
An alternative is to use the perf-with-kcore.sh script included with perf

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by defaults “samples” the data (only dumps a sample every 100us).
This can be configured using the –itrace option (see reference below)

Install xed first.

perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

For this it is also useful to get more accurate time stamps (see below)

perf script --itrace=i0ns --ns -F time,sym,srcline,ip

Show source lines executed (requires debug information)

perf script --itrace=s1Mi0ns ....

Often initialization code is not interesting. Skip initial 1M instructions while decoding:

perf script --time 1.000,2.000 ...

Slice trace into different time regions Generally the time stamps need to be looked up first in the trace, as they are absolute.

perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

Install Flame graph tools first.


perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workloaded.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us

Other ways to record data

perf record -a -e intel_pt// sleep 1

Capture whole system for 1 second

Use snapshot mode

This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.


perf record -a -e --snapshot intel_pt// sleep 1
PERF_PID=$!
*execute workload*

*event happens*
kill -USR2 $PERF_PID

*end of recording*
kill $PERF_PID>

Record kernel only, complete system

perf record -a -e intel_pt//k sleep 1

Record user space only, complete system

perf record -a -e intel_pt//u

Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)

perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...


echo $[100*1024*1024] > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...

Increase perf buffer to limit data loss


perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...

Only record main function in program


perf record -e intel_pt// -a --filter 'filter sys_write' program

Filter kernel code (needs 4.11+ kernel)


perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program

Start tracing in program at main and stop tracing at func2.


perf archive
rsync -r ~/.debug perf.data other-system:

Transfer data to a trace on another system. May also require using perf-with-kcore.sh if decoding
kernel.

Using gdb

Requires a new enough gdb built with libipt. For user space only.


gdb program
start
record btrace pt
cont

record instruction-history /m # show instructions
record function-history # show functions executed
prev # step backwards in time

For more information on gdb pt see the gdb documentation

References

The perf PT documentation

Reference for –itrace option (from perf documentation)


i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events

Reference for –filter option (from perf documentation)

A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/ /nr_addr_filters.

Address filters have the format:

filter|start|stop|tracestop [/ ] [@]

Where:
- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.

is the name of the object file, is the offset to the
code to trace in that file, and is the size of the region to
trace. 'start' and 'stop' filters need not specify a .

If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.

can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol. can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if is
omitted and is a symbol, then the size is calculated to the end
of that symbol.

If is omitted and is '*', then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
file.
If symbol names (or '*') are provided, they must be surrounded by white
space.

The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.

The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/ /maps) can be
examined to determine if that is a possibility.

Multiple filters can be separated with space or comma.

v2: Fix some typos/broken links

April 07, 2017 08:55 PM

April 05, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/04/04

Audiohttp://traffic.libsyn.com/jcm/20170404v2.mp3

Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state cacheing, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure. According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – Intel x86 term referring to the general case of an access violation into memory by an offending instruction in various different ISAs – Instruction Set Architectures] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.

Announcements

Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is refering to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf (trump.kremlin.gov@gmail.com) posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long” have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement in which “People all of the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site LWN.net said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around kernel.org and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel have presented on in various forums.

The first patch (01/16) provides a lengthy summary of their proposed design in the form of documentation that is added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure [Device Feature List). Features supported by the particular FPGA device are exposed throughg these data structures. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always access through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, and which are individually assignable into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference, in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service), and a need for fewer (relatively still quite precious even in 2017) PCIe lanes from the CPU into a PCIe switch placed downstream of its Root Complex on which peer-to-peer PCIe devices talk to one another without the intervening step of hopping through the Root Complex and into the system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to be only available for devices behind the same PCIe switch. This lead to a back and forth with Sinan Kaya who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date in combination with all devices exposing the p2p capability and a fallback blacklist. Ultimately, however, it was discovered that the feature peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state cacheing

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct to support context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning “on x88, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct'”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is because this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch lead to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays from the last member of structures (not legally permitted). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. In turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, when access is granted to this file, it is restricted to the root user, and then only certain parts of memory as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel that didn’t have STRICT_DEVMEM set on a system booting with EFI for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory that was a legal address but which wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this lead Linus to point out that simply doing a “dd” read on the first MB of the memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simple “just read zero” rather than trying to read random kernel data in the case that the memory is used for other purposes.

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, in part because it is so legacy, but in another part because there are two possible ways we might trap a bad access – one as in this case (synchronous exception) but another in which the access might manifest as a System Error due to hitting in the memory controller or other SoC logic later as an errant access.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigae watermarking attacks, IVs [Initial Vectors] are generated using the ESSIV algorthim.”

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djala notes that “Historically procfs was tied to pid namespaces, and moun options were propagated to all other procfs instances in the same pid namespace. This solved several use cases in that time. However today we face new problems, there are multiple container implementations there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with current implementation won’t work since all options will be propagated to all procfs mounts. This series allow to have new instances of procfs per pid namespace where each intance can have its own mount option”.

Zhou Chengming (Hauwei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for the Kernel Live Patch (klp) module to be loaded on a system in which the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to those needed for the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records with reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these in their kernels ever since.

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto PI cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this or the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding. Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski which has the side effect of incidentally flagging up various previous missuse of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented [ROP] programming attacks harder”. [aside: If you’re bored, and want some really interesting (well, I think so) bedtime reading, and you haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to “authenticate strip a PAC from a pointer” (which has the side effect of poisoning the pointer and causing a later fault if the authentication fails – allowing for detection of malicious intent).

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linking Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, but perhaps as low as 9,600 or lower is still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replacable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines”, and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug which is “reproducible as of Linux 0.11” and all the way back to 0.01. OLCUC is not part of POSIX, but this terminios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archiac Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.

April 05, 2017 07:31 AM

April 03, 2017

Arnaldo Carvalho de Melo: Looking for users of new syscalls

Recently Linux got a new syscall to get extended information about files, a super ‘stat’, if you will, read more about it at LWN.

So I grabbed the headers with the definitions for the statx arguments to tools/include/ so that ‘perf trace’ can use them to beautify, i.e. to appear as
a bitmap of strings, as described in this cset.

To test it I used one of things ‘perf trace’ can do and that ‘strace’ does not: system wide stracing. To look if any of the programs running on my machine was using the new syscall I simply did, using strace-like syntax:

# perf trace -e statx

After a few minutes, nothing… So this fedora 25 system isn’t using it in any of the utilities I used in these moments, not surprising, glibc still needs wiring statx up.

So I found out about samples/statx/test-statx.c, and after installing the kernel headers and pointing the compiler to where those files were installed, I restarted that system wide ‘perf trace’ session and ran the test program, much better:

# trace -e statx
16612.967 ( 0.028 ms): statx/562 statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/576 statx(dfd: CWD, filename: /etc/motd, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/584 statx(dfd: CWD, filename: /home/acme/.bashrc, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0
^C#

Ah, to get filenames fetched we need to put in place a special probe, that will collect filenames passed to the kernel right after the kernel copies it from user memory:

[root@jouet ~]# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
probe:vfs_getname    (on getname_flags:72 with pathname=result->name:string)

You can now use it in all perf tools, such as:

perf record -e probe:vfs_getname -aR sleep 1

[root@jouet ~]# trace -e open touch /etc/passwd
0.024 ( 0.011 ms): touch/649 open(filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.056 ( 0.018 ms): touch/649 open(filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.481 ( 0.014 ms): touch/649 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.553 ( 0.012 ms): touch/6649 open(filename: /etc/passwd, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet ~]#

Make sure you have CONFIG_DEBUG_INFO set in your kernel build or that the matching debuginfo packages are installed. This needs to be done just once per boot, ‘perf trace’ will find it in place and use it.

Lastly, if ‘perf’ is hardlinked to ‘trace’, then the later will be the same as ‘perf trace’.


April 03, 2017 03:23 PM

March 31, 2017

Daniel Vetter: X.org Foundation Election - Vote Now!

It is election season again for the X.org Foundation. Beside electing half of the board seats we again have some paperwork changes - after updating the bylaws last year we realized that the membership agreement hasn’t been changed since over 10 years. It talks about the previous-previous legal org, has old addresses and a bunch of other things that just don’t fit anymore. In the board we’ve updated it to reflect our latest bylaws (thanks a lot to Rob Clark doing the editing), with no material changes intended.

Like bylaw changes any change to the membership agreement needs a qualified supermajority of all members, every vote counts and not voting essentially means voting no.

To vote, please go to https://members.x.org, log in and hit the “Cast” button on the listed ballot.

Voting closes by  23:59 UTC on 11 April 2017, but please don’t cut it short, it’s a computer that decides when it’s over …

March 31, 2017 12:00 AM

March 28, 2017

Arnaldo Carvalho de Melo: Getting backtraces from arbitrary places

Needs debuginfo, either in a package-debuginfo rpm or equivalent or by building with ‘cc -g’:

[root@jouet ~]# perf probe -L icmp_rcv:52 | head -15

  52  	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
      		/*
      		 * RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
      		 *  silently ignored (we let user decide with a sysctl).
      		 * RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
      		 *  discarded if to broadcast/multicast.
      		 */
  59  		if ((icmph->type == ICMP_ECHO ||
  60  		     icmph->type == ICMP_TIMESTAMP) &&
      		    net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
      			goto error;
      		}
      		if (icmph->type != ICMP_ECHO &&
      		    icmph->type != ICMP_TIMESTAMP &&
[root@jouet ~]# perf probe icmp_rcv:59
Added new event:
  probe:icmp_rcv       (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@jouet ~]# perf trace --no-syscalls --event probe:icmp_rcv/max-stack=5/
     0.000 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
  1025.876 probe:icmp_rcv:(ffffffffb47b7f9b))
                          icmp_rcv ([kernel.kallsyms])
                          ip_local_deliver_finish ([kernel.kallsyms])
                          ip_local_deliver ([kernel.kallsyms])
                          ip_rcv_finish ([kernel.kallsyms])
                          ip_rcv ([kernel.kallsyms])
^C[root@jouet ~]#

Humm, lots of redundant info, guess we could do away with those ([kernel.kallsyms) in all the callchain lines…


March 28, 2017 08:23 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/28

Audiohttp://traffic.libsyn.com/jcm/20170328v2.mp3

Author’s Note: Apologies to Ulrich Drepper for incorrectly attributing his paper “Futexes are Tricky” to Rusty. Oops. In any case, everyone should probably read Uli’s paper: https://www.akkadia.org/drepper/futex.pdf

In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs.

Linus Torvalds announced Linux 4.11-rc4, noting that “So last week, I said that I was hoping that rc3 was the point where we’d start to shrink the rc’s, and yes, rc4 is smaller than rc3. By a tiny tiny sidgen. It does touch a few more files, but it has a couple fewer commits, and fewer lines changed overall. But on the whole the two are almost identical in size. Which isn’t actually all that bad, considering that rc4 has both a networking merge and the usual driver suspects from Greg [Kroah Hartman], _and_ some drm fixes”.

Announcements

Junio C Hamano announced Git v2.12.2.

Greg Kroah-Hartman announced Linux 4.4.57, 4.9.18, and 4.10.6.

Sebastian Andrezej Siewior announced Linux v4.9.18-rt14, which includes a “larger rework of the futex / rtmutex code. In v4.8-rt1 we added a workaround so we don’t de-boost too early in the unlock path. A small window remained in which the locking thread could de-boost the unlocking thread. This rework by Peter Zijlstra fixes the issue.”

Upcoming features

Greg K-H finally accepted the latest “USB Type-C Connector class” patch series from Heikki Krogerus (Intel). This patch series aims to provide various control over the capability for USB-C to be used both as a power source and as a delivery interface to supply to power to external devices (enabling the oft-cited use case of selecting between charging your cellphone/mobile device or using said device to charge your laptop). This will land a new generic management framework exposed to userspace in Linux 4.12, including a driver for “Intel Whiskey Cove PMIC [Power Management IC] USB Type-C PHY”. Your author looks forward to playing. Greg thanked Heikki for the 18(!) iterations this patch went through prior to being merged – not quite a record, but a lot of effort!

Kishon Vijay Abraham (TI) posted “PCI: Support for configurable PCI endpoint”, which provides generic infrastructure to handle PCI endpoint devices (Linux operating as a PCI endpoint “device”), such as those based upon IP blocks from DesignWare (DW). He’s only tested the design on his “dra7xx” boards and requires “the help of others to test the platforms they have access to”. The driver adds a configfs interface including an entry to which userspace should write “start” to bring up an endpoint device. He adds himself as the maintainer for this new kernel feature.

Rob Herring posted “dtc updates for 4.12”, which “syncs dtc [Device Tree Compiler] with current mainline [dtc]”. His “primary motivation is to pull in the new checks [he’s] worked on. This gives lots of new warnings which are turned off by default”.

60Hz vs 59.94Hz (Handling of reduced FPS in V4L2)

Jose Abreu (Synopsys) posted a patch series entitled “Handling of reduced FPS in V4L2”, which aims to provide a mechanism for the kernel to measure (in a generic way) the actual Frames Per Second for a Video For Linux (V4L) video device. The patches rely upon hardware drivers being able to signal that they can distinguish “between regular fps and 1000/1001 fps”.

This took your author on a journey of discovery. It turns out that (most of the time), when a video device claims to be “60fps” it’s actually running at 59.94fps, but not always. The latter frame rate is an artifact of the NTSC (National Television System Committee) color television standard in the United States. Early televisions used the 60Hz frequency (which is nationally synchronized, at least in each of the traditional three independent grids operated in the US, which are now interconnected using HVDC interconnects but presumably are still not directly in phase with one another – feel free to educate me!) of the AC supply to lock individual frame scan times. When color TV was introduced, a small frequency offset was used to make room in each frame for a color sub-carrier signal while retaining backward compatibility for black and white transmissions. This is where frequencies of 29.97 and 59.95 frames per second originate. In case you always wondered.

Jose and Hans Verkuil had a back and forth discussion about various real- world measured pixelclock frequencies that they had obtained using a variety of equipment (signal analyzers, certified HDMI analyzer, and the Synopsys IP supported by the patch series under discussion) to see whether it was in reality possible to reliably distinguish frame rates.

Early Debug with USB3 earlycon (early printk)

Lu Baolu (Intel) posted version 8 of a patch series entitled “usb: early: add support for early printk through USB3 debug port”. Contemporary (especially x86) desktop and server class systems don’t expose low level hardware debug interfaces, such as JTAG debug chains, which are used during chip bringup and early firmware and OS enablement activities, and which allow developers with suitable tools to directly control and interrogate hardware state. Or just dump out the kernel ringbuffer (the dmesg “log”).

Actually, all such systems do have low level debug capabilities, they’re just fused out during the production process (by blowing efuses embedded into the processor) and either not exposed on the external pins of the chip at all, or are simply disabled in the chip logic. Probably most of these can be re-enabled by writing the magic cryptographically signed hashes to undocumented memory regions in on-chip coprocessor spaces. In any case, vendors such as Intel aren’t going to tell you how.

Yet it is often desirable to have certain low level debug functionality for systems that are deployed into field settings, even to reliably dump out the kernel console log DEBUG log level messages somewhere. Traditionally this was done using PC serial ports, but most desktop (and all laptop) systems no longer ship with those exposed on the rear panel. If you’re lucky you’ll see an IDC10 connector on your motherboard to which you can attach a DB9 breakout cable. Consumers and end users have no idea what any of this means, and in the case that they don’t know what this means, they probably shouldn’t be encouraged to open the machine up and poke things. Yet even in the case that IDC10 connectors exist and can be hooked up, this is still a cumbersome interface that cannot be relied upon today.

Microsoft (who are often criticized but actually are full of many good ideas and usually help to drive industry standardization for the broader market) instituted sanity years ago by working with the USB Implementors Forum (IF) to ensure that the USB3 specification included a standardized feature known as xHCI debug capability (DbC), an “optional but standalone functionality by an xHCI hosst controller”. This suited Windows, which traditionally requires two UARTs (serial ports) for kernel development, and uses one of them for simple direct control of the running kernel without going through complex driver frameworks. Debug port (which also existed on USB2) traditionally required a special external partner hardware dongle but is cleaner in USB3, requiring only a USB A-to-A crossover cable connecting USB3.0 data lines.

As Lu Baolu notes in his patch, “With DbC hardware initialized, the system will present a debug device through the USB3 debug port (normally the first USB3 port)”. The patch series enables this as a high speed console log target on Linux, but it could be used for much more interesting purposes via KDB.

[Separately, but only really related to console drivers and not debugging, Thierry Escande posted “firmware: google memconsole” which adds support for importing the boot time BIOS memory based console into the kernel ringbuffer on Google Coreboot systems].

Ongoing Development

Pavel Tatashin (Oracle) posted “parallelized “struct page” zeroing”, which improves boot time performance significantly in the case that the “deferred struct page initialization feature is enabled”. In this case, zeroing out of the kernel’s vmemmap (Virtual Memory Map) is delayed until after the secondary CPU cores on a machine have been started. When this is done, those cores can be used to run zeroing threads that write to memory, taking one SPARC system down from 97.89 seconds to boot down to 46.91. Pavel notes that the savings are also considerable on x86 systems too.

Thomas Gleixner had a lengthy back and forth with Pasha Tatashin (Oracle) over the latter’s posting of “Early boot time stamps for x86” which use the TSC (Time Stamp Counter) on Intel x86 Architecture. The goal is to log how long the machine actually took to boot, including firmware, rather than just how long Linux took to boot from the time it was started. Peter Zijlstra responded (to Pasha), “Lol, how cute. You assume TSC starts at 0 on reset” (alluding to the fact that firmware often does crazy things playing with the TSC offset or directly writing to it). Thomas was unimpressed with Pavel’s posting of a v2 patch series, noting “Did you actually read my last reply on V1 of this? I made it clear that the way this is done, i.e. hacking it into the earliest boo[]t stage is not going to happen…I don’t care about you wasting your time, but I very much care about my time”. He provided a further more lengthy response, including various commentary on the best ways to handle feedback.

Peter Zijlstra posted version 6 of a patch series entitled “The arduous story of FUTEX_UNLOCK_PI” in which he adds “Another installment of the futex patches that give you nightmares”. Futexes (Fast User-space Mutexes) are a mechanism provided by the Linux kernel which leverage shared memory to provide a low overhead mutex (mutual exclusion primitave) to userspace in the case that such mutexes are uncontended (no conflicts between processes – tasks within the kernel – exist trying to acquire the same resource) but with a “slow path” through the kernel in the case of contention. They are used by many userspace applications, including extensively in the C library (see the famous paper by Rusty Russell entitled “Futexes are Tricky”). Peter is working on solving problems introduced by having to have Priority Inheritance (PI) aware futexes in Real Time kernels. These adjust priority of the associated tasks holding mutexes for short periods in order to prevent Priority Inversion (see Mars Pathfinder study papers) in which a low priority task holds a mutex that a high priority task wants to acquire. Peter’s patches “rework[] and document[] the locking” of existing code.

Separately, Waiman Long (Red Hat) posted version 6 of “futex” Introducing throughput-optimized (TP) futexes which “introduces a new futex implementation called throughput-optmized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes an be used in implementing both userspace mutexes and rwlocks. The provide[] better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives like conditional variables and semaphores that cannot be handled by the TP futexes”.

David Woodhouse posted “PCI resource mmap cleanup” which aims to clean up the use of various kernel interfaces that provide “user visible” resource addresses through (legacy) proc and (contemporary) sysfs. The purpose of these interfaces is to provide information about regions of PCI address space memory that can be directly mapped by userspace applications such as those linked against the DPDK (Data Plane Development Kit) library. An example of his cleanup included “Only allow WC [Write Combining] mmap on prefetchable resources” for the /proc/bus/pci mmap interface because this was the case for the preferred sysfs interface already. This lead some to debate why the 64-bit ARM Architecture didn’t provide the legacy procfs interface (since there was a little confusion about the dependencies for DPDK) but ultimately re-concluded that it shouldn’t.

Tyler Baicar (Codeaurora) posted version 13 of a patch series entitled “Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64”, which aims to introduce support to the 64-bit ARM Architecture for logging of RAS events using the shared “GHES” (Generic Hardware Error Source) memory location “with the proper GHES structures to notify the OS of the error”. This dovetails nicely with platforms performing “firmware first” error handling in which errors are trapped to secure firmare which first handles them and subsequently informs the Operating System using this ACPI feature.

Shaohua Li (Facebook) posted a patch entitled “add an option to disable iommu force on” in the case of the (x86) Trusted Boot (TBOOT) feature being enabled. The reason cited was that under a certain 40GBit networking load XDP (eXpress Data Path) test there were high numbers of IOTLB (IO Translation Look Aside Buffer) misses “which kills the performance”. What he is refering to is the mechanism through which an IOMMU (which sits logically between a hardware device, such as a network card, and memory, often as part of an integrated PCI Root Complex) translates underlying memory accesses by the adapter card into real host memory transactions. These are cached by the IOMMU in small caches (known as IOTLBS) after it performs such translations using its “page tables” (similar to how a host CPU’s MMU – Memory Management Unit – performs host memory translations). Badly designed IOMMU implementations or poor utilization can result in large numbers of misses that result in users disabling the feature. Alas, without an IOMMU, there’s little protection during boot from rogue devices that maliciously want to trash host memory. Nobody has noted this in the RFC (Request For Comments) discussion, yet.

Bodong Wang (Mellanox) posted a patch entitled “Add an option to probe VFs or not before enabling SR-IOV”, which aims to allow administrators to limit the probing of (PCIe) Virtual Functions (VFs) on adapters that will have those resources passed through to Virtual Machines (VMs) (using VFIO). This “can save host side resource usage by VF instances which would be eventually probed to VMs”. It adds a new sysfs interface to control this.

Viresh Kumar posted a patch entitled “cpufreq: Restore policy min/max limits on CPU online”. Apparently, existing code behavior was that “On CPU online the cpufreq core restores the previous governor [the in kernel logic that determines CPU frequency transitions based upon various metrics, such as saving energy, or prioritizing performance]…but it does not restore min/max limits at the same time”. The patch addresses this shortcoming.

Wanpeng Li posted a patch entitled “KVM: nVMX: Fix nested VPID vmx exec control” that aims to “hide and forbid” Virtual Processor IDentifiers in nested virtualization contexts where the hardware doesn’t support this. Apparently it was unconditionally being enabled (based upon real hardware realities of existing implementation) regardless of feature information (INVVPID) provided in the “vmx” capabilities.

Joerg Roedel posted a patch entitled “ACPI: Don’t create a platform_device for IOAPIC/IOxAPIC” since this was causing problems during hot remove (of CPUs). Rafael J. Wysocki noted that “it’s better to avoid using platform_device for hot-removable stuff” since it is “inherently fragile”.

Kees Cook (Google) posted a patch disabling hibernation support on 32-bit systems in the case that KASLR (Kernel Address Space Layout Randomization) was enabled at boot time, but allowing for “nokaslr” on the kernel command line to change this. Evgenii Shatokhin initially noted that “nokaslr” didn’t re-enable hibernation support correctly, but eventually determined that the ordering and placement of the “nokaslr” on the command line was to blame, which lead to Kees saying he would look into the command line parsing sequence and interaction with other options, such as “resume=”.

Separately, Baoquan He (Red Hat) noted that with KASLR an implicit assumption that EFI_VA_START < EFI_VA_END existed, while “In fact [the] EFI [(Unified) Extensible Firmware Interface] region reserved for runtime services [these are callbacks into firmware from Linux] virtual mapping will be allocated using a top-down schema”. His patches addressed this problem, and being “RESEND”s, he was keen to see that they get taken up soon.

Also separately, Kees posted “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. He cites a bug it would have prevented.

Kan Liang (Intel) posted “measure SMI cost”. This patch series aims to leverage hardware counters to inform perf of the amount of time spent (on Intel x86 Architecture systems) inside System Management Mode (SMM). SMIs (System Management Interrups) are events that are generated (usually) by Intel Platform Control Hub and similar chipset logic which can be programmed by firmare to generate regular interrupts that target a secure execution context known as SMM (System Management Mode). It is here that firmware temporarily steals CPU cycles from the Operating System (without its knowledge) to perform such things as CPU fan control, errata handling, and wholesale VGA graphics emulation in BMC “value add” from OEMs). Over the years, the amount of gunk hidden in SMIs has grown that this author even once wrote a latency detector (hwlat) and has a patent on SMI detection without using such dedicated counters…due to the impact of such on system performance. SMM is necessary on x86 due to its lack of a standardized on-SoC platform management controller, but so is accounting for bloat.

Finally, yes, Kirill A. Shutemov snuck in another iteration of his Intel “5-level paging support” in preparation for a 4.12 merge.

 

March 28, 2017 05:23 PM

March 27, 2017

Matthew Garrett: Buying a Utah teapot

The Utah teapot was one of the early 3D reference objects. It's canonically a Melitta but hasn't been part of their range in a long time, so I'd been watching Ebay in the hope of one turning up. Until last week, when I discovered that a company called Friesland had apparently bought a chunk of Melitta's range some years ago and sell the original teapot[1]. I've just ordered one, and am utterly unreasonably excited about this.

Update: Friesland have apparently always produced the Utah teapot, but were part of the Melitta group for some time - they didn't buy the range from Melitta.

[1] They have them in 0.35, 0.85 and 1.4 litre sizes. I believe (based on the measurements here) that the 1.4 litre one matches the Utah teapot.

comment count unavailable comments

March 27, 2017 11:45 PM

Pete Zaitcev: It was surprising

So here I was watching ACCA at Crunchyroll, when a commercial comes up... of VMware OpenStack.

I still remember times in 2010 when VMware was going to have their own cloud (possibly called VxCloud), with blackjack and hookers, as they say, or much better that OpenStack anyway. Looks like things have changed.

Also, what's up with this targeting? How did they link my account at Crunchyroll with OpenStack?

{Update: Thanks, Andreas!}

March 27, 2017 06:33 PM

March 26, 2017

Vegard Nossum: Writing a reverb filter from first principles

WARNING/DISCLAIMER: Audio programming always carries the risk of damaging your speakers and/or your ears if you make a mistake. Therefore, remember to always turn down the volume completely before and after testing your program. And whatever you do, don't use headphones or earphones. I take no responsibility for damage that may occur as a result of this blog post!

Have you ever wondered how a reverb filter works? I have... and here's what I came up with.

Reverb is the sound effect you commonly get when you make sound inside a room or building, as opposed to when you are outdoors. The stairwell in my old apartment building had an excellent reverb. Most live musicians hate reverb because it muddles the sound they're trying to create and can even throw them off while playing. On the other hand, reverb is very often used (and overused) in the studio vocals because it also has the effect of smoothing out rough edges and imperfections in a recording.

We typically distinguish reverb from echo in that an echo is a single delayed "replay" of the original sound you made. The delay is also typically rather large (think yelling into a distant hill- or mountainside and hearing your HEY! come back a second or more later). In more detail, the two things that distinguish reverb from an echo are:

  1. The reverb inside a room or a hall has a much shorter delay than an echo. The speed of sound is roughly 340 meters/second, so if you're in the middle of a room that is 20 meters by 20 meters, the sound will come back to you (from one wall) after (20 / 2) / 340 = ~0.029 seconds, which is such a short duration of time that we can hardly notice it (by comparison, a 30 FPS video would display each frame for ~0.033 seconds).
  2. After bouncing off one wall, the sound reflects back and reflects off the other wall. It also reflects off the perpendicular walls and any and all objects that are in the room. Even more, the sound has to travel slightly longer to reach the corners of the room (~14 meters instead of 10). All these echoes themselves go on to combine and echo off all the other surfaces in the room until all the energy of the original sound has dissipated.

Intuitively, it should be possible to use multiple echoes at different delays to simulate reverb.

We can implement a single echo using a very simple ring buffer:

    class FeedbackBuffer {
    public:
        unsigned int nr_samples;
        int16_t *samples;

        unsigned int pos;

        FeedbackBuffer(unsigned int nr_samples):
            nr_samples(nr_samples),
            samples(new int16_t[nr_samples]),
            pos(0)
        {
        }

        ~FeedbackBuffer()
        {
            delete[] samples;
        }

        int16_t get() const
        {
            return samples[pos];
        }

        void add(int16_t sample)
        {
            samples[pos] = sample;

            /* If we reach the end of the buffer, wrap around */
            if (++pos == nr_samples)
                pos = 0;
        }
    };

The constructor takes one argument: the number of samples in the buffer, which is exactly how much time we will delay the signal by; when we write a sample to the buffer using the add() function, it will come back after a delay of exactly nr_samples using the get() function. Easy, right?

Since this is an audio filter, we need to be able to read an input signal and write an output signal. For simplicity, I'm going to use stdin and stdout for this -- we will read 8 KiB at a time using read(), process that, and then use write() to output the result. It will look something like this:

    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>


    int main(int argc, char *argv[])
    {
        while (true) {
            int16_t buf[8192];
            ssize_t in = read(STDIN_FILENO, buf, sizeof(buf));
            if (in == -1) {
                /* Error */
                return 1;
            }
            if (in == 0) {
                /* EOF */
                break;
            }

            for (unsigned int j = 0; j < in / sizeof(*buf); ++j) {
                /* TODO: Apply filter to each sample here */
            }

            write(STDOUT_FILENO, buf, in);
        }

        return 0;
    }

On Linux you can use e.g. 'arecord' to get samples from the microphone and 'aplay' to play samples on the speakers, and you can do the whole thing on the command line:

    $ arecord -t raw -c 1 -f s16 -r 44100 |\
        ./reverb | aplay -t raw -c 1 -f s16 -r 44100

(-c means 1 channel; -f s16 means "signed 16-bit" which corresponds to the int16_t type we've used for our buffers; -r 44100 means a sample rate of 44100 samples per second; and ./reverb is the name of our executable.)

So how do we use class FeedbackBuffer to generate the reverb effect?

Remember how I said that reverb is essentially many echoes? Let's add a few of them at the top of main():

    FeedbackBuffer fb0(1229);
    FeedbackBuffer fb1(1559);
    FeedbackBuffer fb2(1907);
    FeedbackBuffer fb3(4057);
    FeedbackBuffer fb4(8117);
    FeedbackBuffer fb5(8311);
    FeedbackBuffer fb6(9931);

The buffer sizes that I've chosen here are somewhat arbitrary (I played with a bunch of different combinations and this sounded okay to me). But I used this as a rough guideline: simulating the 20m-by-20m room at a sample rate of 44100 samples per second means we would need delays roughly on the order of 44100 / (20 / 340) = 2594 samples.

Another thing to keep in mind is that we generally do not want our feedback buffers to be multiples of each other. The reason for this is that it creates a consonance between them and will cause certain frequencies to be amplified much more than others. As an example, if you count from 1 to 500 (and continue again from 1), and you have a friend who counts from 1 to 1000 (and continues again from 1), then you would start out 1-1, 2-2, 3-3, etc. up to 500-500, then you would go 1-501, 2-502, 3-504, etc. up to 500-1000. But then, as you both wrap around, you start at 1-1 again. And your friend will always be on 1 when you are on 1. This has everything to do with periodicity and -- in fact -- prime numbers! If you want to maximise the combined period of two counters, you have to make sure that they are relatively coprime, i.e. that they don't share any common factors. The easiest way to achieve this is to only pick prime numbers to start with, so that's what I did for my feedback buffers above.

Having created the feedback buffers (which each represent one echo of the original sound), it's time to put them to use. The effect I want to create is not simply overlaying echoes at fixed intervals, but to have the echos bounce off each other and feed back into each other. The way we do this is by first combining them into the output signal... (since we have 8 signals to combine including the original one, I give each one a 1/8 weight)

    float x = .125 * buf[j];
    x += .125 * fb0.get();
    x += .125 * fb1.get();
    x += .125 * fb2.get();
    x += .125 * fb3.get();
    x += .125 * fb4.get();
    x += .125 * fb5.get();
    x += .125 * fb6.get();
    int16_t out = x;

...then feeding the result back into each of them:

    fb0.add(out);
    fb1.add(out);
    fb2.add(out);
    fb3.add(out);
    fb4.add(out);
    fb5.add(out);
    fb6.add(out);

And finally we also write the result back into the buffer. I found that the original signal loses some of its power, so I use a factor 4 gain to bring it roughly back to its original strength; this number is an arbitrary choice by me, I don't have any specific calculations to support it:

    buf[j] = 4 * out;

That's it! 88 lines of code is enough to write a very basic reverb filter from first principles. Be careful when you run it, though, even the smallest mistake could cause very loud and unpleasant sounds to be played.

If you play with different buffer sizes or a different number of feedback buffers, let me know if you discover anything interesting :-)

March 26, 2017 10:07 AM

Vegard Nossum: Fuzzing the OpenSSH daemon using AFL

(EDIT 2017-03-25: All my patches to make OpenSSH more amenable to fuzzing with AFL are available at https://github.com/vegard/openssh-portable. This also includes improvements to the patches found in this post.)

American Fuzzy Lop is a great tool. It does take a little bit of extra setup and tweaking if you want to go into advanced usage, but mostly it just works out of the box.

In this post, I’ll detail some of the steps you need to get started with fuzzing the OpenSSH daemon (sshd) and show you some tricks that will help get results more quickly.

The AFL home page already displays 4 OpenSSH bugs in its trophy case; these were found by Hanno Böck who used an approach similar to that outlined by Jonathan Foote on how to fuzz servers with AFL.

I take a slightly different approach, which I think is simpler: instead of intercepting system calls to fake network activity, we just run the daemon in “inetd mode”. The inet daemon is not used very much anymore on modern Linux distributions, but the short story is that it sets up the listening network socket for you and launches a new process to handle each new incoming connection. inetd then passes the network socket to the target program as stdin/stdout. Thus, when sshd is started in inet mode, it communicates with a single client over stdin/stdout, which is exactly what we need for AFL.

Configuring and building AFL

If you are just starting out with AFL, you can probably just type make in the top-level AFL directory to compile everything, and it will just work. However, I want to use some more advanced features, in particular I would like to compile sshd using LLVM-based instrumentation (which is slightly faster than the “assembly transformation by sed” that AFL uses by default). Using LLVM also allows us to move the target program’s “fork point” from just before entering main() to an arbitrary location (known as “deferred forkserver mode” in AFL-speak); this means that we can skip some of the setup operations in OpenSSH, most notably reading/parsing configs and loading private keys.

Most of the steps for using LLVM mode are detailed in AFL’s llvm_mode/README.llvm. On Ubuntu, you should install the clang and llvm packages, then run make -C llvm_mode from the top-level AFL directory, and that’s pretty much it. You should get a binary called afl-clang-fast, which is what we’re going to use to compile sshd.

Configuring and building OpenSSH

I’m on Linux so I use the “portable” flavour of OpenSSH, which conveniently also uses git (as opposed to the OpenBSD version which still uses CVS – WTF!?). Go ahead and clone it from git://anongit.mindrot.org/openssh.git.

Run autoreconf to generate the configure script. This is how I run the config script:

./configure \
CC="$PWD/afl-2.39b/afl-clang-fast" \
CFLAGS="-g -O3" \
--prefix=$PWD/install \
--with-privsep-path=$PWD/var-empty \
--with-sandbox=no \
--with-privsep-user=vegard

You obviously need to pass the right path to afl-clang-fast. I’ve also created two directories in the current (top-level OpenSSH directory), install and var-empty. This is so that we can run make install without being root (although var-empty needs to have mode 700 and be owned by root) and without risking clobbering any system files (which would be extremely bad, as we’re later going to disable authentication and encryption!). We really do need to run make install, even though we’re not going to be running sshd from the installation directory. This is because sshd needs some private keys to run, and that is where it will look for them.

(EDIT 2017-03-25: Passing --without-pie to configure may help make the resulting binaries easier to debug since instruction pointers will not be randomised.)

If everything goes well, running make should display the AFL banner as OpenSSH is compiled.

You may need some extra libraries (zlib1g-dev and libssl-dev on Ubuntu) for the build to succeeed.

Run make install to install sshd into the install/ subdirectory (and again, please don’t run this as root).

We will have to rebuild OpenSSH a few times as we apply some patches to it, but this gives you the basic ingredients for a build. One particular annoying thing I’ve noticed is that OpenSSH doesn’t always detect source changes when you run make (and so your changes may not actually make it into the binary). For this reason I just adopted the habit of always running make clean before recompiling anything. Just a heads up!

Running sshd

Before we can actually run sshd under AFL, we need to figure out exactly how to invoke it with all the right flags and options. This is what I use:

./sshd -d -e -p 2200 -r -f sshd_config -i

This is what it means:

-d
“Debug mode”. Keeps the daemon from forking, makes it accept only a single connection, and keeps it from putting itself in the background. All useful things that we need.
-e
This makes it log to stderr instead of syslog; this first of all prevents clobbering your system log with debug messages from our fuzzing instance, and also gives a small speed boost.
-p 2200
The TCP port to listen to. This is not really used in inetd mode (-i), but is useful later on when we want to generate our first input testcase.
-r
This option is not documented (or not in my man page, at least), but you can find it in the source code, which should hopefully also explain what it does: preventing sshd from re-execing itself. I think this is a security feature, since it allows the process to isolate itself from the original environment. In our case, it complicates and slows things down unnecessarily, so we disable it by passing -r.
-f sshd_config
Use the sshd_config from the current directory. This just allows us to customise the config later without having to reinstall it or be unsure about which location it’s really loaded from.
-i
“Inetd mode”. As already mentioned, inetd mode will make the server process a single connection on stdin/stdout, which is a perfect fit for AFL (as it will write testcases on the program’s stdin by default).

Go ahead and run it. It should hopefully print something like this:

$ ./sshd -d -e -p 2200 -r -f sshd_config -i
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2g 1 Mar 2016
debug1: private host key #0: ssh-rsa SHA256:f9xyp3dC+9jCajEBOdhjVRAhxp4RU0amQoj0QJAI9J0
debug1: private host key #1: ssh-dss SHA256:sGRlJclqfI2z63JzwjNlHtCmT4D1WkfPmW3Zdof7SGw
debug1: private host key #2: ecdsa-sha2-nistp256 SHA256:02NDjij34MUhDnifUDVESUdJ14jbzkusoerBq1ghS0s
debug1: private host key #3: ssh-ed25519 SHA256:RsHu96ANXZ+Rk3KL8VUu1DBzxwfZAPF9AxhVANkekNE
debug1: setgroups() failed: Operation not permitted
debug1: inetd sockets after dupping: 3, 4
Connection from UNKNOWN port 65535 on UNKNOWN port 65535
SSH-2.0-OpenSSH_7.4

If you type some garbage and press enter, it will probably give you “Protocol mismatch.” and exit. This is good!

Detecting crashes/disabling privilege separation mode

One of the first obstacles I ran into was the fact that I saw sshd crashing in my system logs, but AFL wasn’t detecting them as crashes:

[726976.333225] sshd[29691]: segfault at 0 ip 000055d3f3139890 sp 00007fff21faa268 error 4 in sshd[55d3f30ca000+bf000]
[726984.822798] sshd[29702]: segfault at 4 ip 00007f503b4f3435 sp 00007fff84c05248 error 4 in libc-2.23.so[7f503b3a6000+1bf000]

The problem is that OpenSSH comes with a “privilege separation mode” that forks a child process and runs most of the code inside the child. If the child segfaults, the parent still exits normally, so it masks the segfault from AFL (which only observes the parent process directly).

In version 7.4 and earlier, privilege separation mode can easily be disabled by adding “UsePrivilegeSeparation no” to sshd_config or passing -o UsePrivilegeSeaparation=no on the command line.

Unfortunately it looks like the OpenSSH developers are removing the ability to disable privilege separation mode in version 7.5 and onwards. This is not a big deal, as OpenSSH maintainer Damien Miller writes on Twitter: “the infrastructure will be there for a while and it’s a 1-line change to turn privsep off”. So you may have to dive into sshd.c to disable it in the future.

(EDIT 2017-03-25: I’ve pushed the source tweak for disabling privilege separation for 7.5 and newer to my OpenSSH GitHub repo. This also obsoletes the need for a config change.)

Reducing randomness

OpenSSH uses random nonces during the handshake to prevent “replay attacks” where you would record somebody’s (encrypted) SSH session and then you feed the same data to the server again to authenticate again. When random numbers are used, the server and the client will calculate a new set of keys and thus thwart the replay attack.

In our case, we explicitly want to be able to replay traffic and obtain the same result two times in a row; otherwise, the fuzzer would not be able to gain any useful data from a single connection attempt (as the testcase it found would not be usable for further fuzzing).

There’s also the possibility that randomness introduces variabilities in other code paths not related to the handshake, but I don’t really know. In any case, we can easily disable the random number generator. On my system, with the configure line above, all or most random numbers seem to come from arc4random_buf() in openbsd-compat/arc4random.c, so to make random numbers very predictable, we can apply this patch:

diff --git openbsd-compat/arc4random.c openbsd-compat/arc4random.c
--- openbsd-compat/arc4random.c
+++ openbsd-compat/arc4random.c
@@ -242,7 +242,7 @@ void
arc4random_buf(void *buf, size_t n)
{
_ARC4_LOCK();
- _rs_random_buf(buf, n);
+ memset(buf, 0, n);
_ARC4_UNLOCK();
}
# endif /* !HAVE_ARC4RANDOM_BUF */

One way to test whether this patch is effective is to try to packet-capture an SSH session and see if it can be replayed successfully. We’re going to have to do that later anyway in order to create our first input testcase, so skip below if you want to see how that’s done. In any case, AFL would also tell us using its “stability” indicator if something was really off with regards to random numbers (>95% stability is generally good, <90% would indicate that something is off and needs to be fixed).

Increasing coverage

Disabling message CRCs

When fuzzing, we really want to disable as many checksums as we can, as Damien Miller also wrote on twitter: “fuzzing usually wants other code changes too, like ignoring MAC/signature failures to make more stuff reachable”. This may sound a little strange at first, but makes perfect sense: In a real attack scenario, we can always1 fix up CRCs and other checksums to match what the program expects.

If we don’t disable checksums (and we don’t try to fix them up), then the fuzzer will make very little progress. A single bit flip in a checksum-protected area will just fail the checksum test and never allow the fuzzer to proceed.

We could of course also fix the checksum up before passing the data to the SSH server, but this is slow and complicated. It’s better to disable the checksum test in the server and then try to fix it up if we do happen to find a testcase which can crash the modified server.

The first thing we can disable is the packet CRC test:

diff --git a/packet.c b/packet.c
--- a/packet.c
+++ b/packet.c
@@ -1635,7 +1635,7 @@ ssh_packet_read_poll1(struct ssh *ssh, u_char *typep)

cp = sshbuf_ptr(state->incoming_packet) + len - 4;
stored_checksum = PEEK_U32(cp);
- if (checksum != stored_checksum) {
+ if (0 && checksum != stored_checksum) {
error("Corrupted check bytes on input");
if ((r = sshpkt_disconnect(ssh, "connection corrupted")) != 0 ||
(r = ssh_packet_write_wait(ssh)) != 0)

As far as I understand, this is a simple (non-cryptographic) integrity check meant just as a sanity check against bit flips or incorrectly encoded data.

Disabling MACs

We can also disable Message Authentication Codes (MACs), which are the cryptographic equivalent of checksums, but which also guarantees that the message came from the expected sender:

diff --git mac.c mac.c
index 5ba7fae1..ced66fe6 100644
--- mac.c
+++ mac.c
@@ -229,8 +229,10 @@ mac_check(struct sshmac *mac, u_int32_t seqno,
if ((r = mac_compute(mac, seqno, data, dlen,
ourmac, sizeof(ourmac))) != 0)
return r;
+#if 0
if (timingsafe_bcmp(ourmac, theirmac, mac->mac_len) != 0)
return SSH_ERR_MAC_INVALID;
+#endif
return 0;
}

We do have to be very careful when making these changes. We want to try to preserve the original behaviour of the program as much as we can, in the sense that we have to be very careful not to introduce bugs of our own. For example, we have to be very sure that we don’t accidentally skip the test which checks that the packet is large enough to contain a checksum in the first place. If we had accidentally skipped that, it is possible that the program being fuzzed would try to access memory beyond the end of the buffer, which would be a bug which is not present in the original program.

This is also a good reason to never submit crashing testcases to the developers of a program unless you can show that they also crash a completely unmodified program.

Disabling encryption

The last thing we can do, unless you wish to only fuzz the unencrypted initial protocol handshake and key exchange, is to disable encryption altogether.

The reason for doing this is exactly the same as the reason for disabling checksums and MACs, namely that the fuzzer would have no hope of being able to fuzz the protocol itself if it had to work with the encrypted data (since touching the encrypted data with overwhelming probability will just cause it to decrypt to random and utter garbage).

Making the change is surprisingly simple, as OpenSSH already comes with a psuedo-cipher that just passes data through without actually encrypting/decrypting it. All we have to do is to make it available as a cipher that can be negotiated between the client and the server. We can use this patch:

diff --git a/cipher.c b/cipher.c
index 2def333..64cdadf 100644
--- a/cipher.c
+++ b/cipher.c
@@ -95,7 +95,7 @@ static const struct sshcipher ciphers[] = {
# endif /* OPENSSL_NO_BF */
#endif /* WITH_SSH1 */
#ifdef WITH_OPENSSL
- { "none", SSH_CIPHER_NONE, 8, 0, 0, 0, 0, 0, EVP_enc_null },
+ { "none", SSH_CIPHER_SSH2, 8, 0, 0, 0, 0, 0, EVP_enc_null },
{ "3des-cbc", SSH_CIPHER_SSH2, 8, 24, 0, 0, 0, 1, EVP_des_ede3_cbc },
# ifndef OPENSSL_NO_BF
{ "blowfish-cbc",

To use this cipher by default, just put “Ciphers none” in your sshd_config. Of course, the client doesn’t support it out of the box either, so if you make any test connections, you have to have to use the ssh binary compiled with the patched cipher.c above as well.

You may have to pass pass -o Ciphers=none from the client as well if it prefers to use a different cipher by default. Use strace or wireshark to verify that communication beyond the initial protocol setup happens in plaintext.

Making it fast

afl-clang-fast/LLVM “deferred forkserver mode”

I mentioned above that using afl-clang-fast (i.e. AFL’s LLVM deferred forkserver mode) allows us to move the “fork point” to skip some of the sshd initialisation steps which are the same for every single testcase we can throw at it.

To make a long story short, we need to put a call to __AFL_INIT() at the right spot in the program, separating the stuff that doesn’t depend on a specific input to happen before it and the testcase-specific handling to happen after it. I’ve used this patch:

diff --git a/sshd.c b/sshd.c
--- a/sshd.c
+++ b/sshd.c
@@ -1840,6 +1840,8 @@ main(int ac, char **av)
/* ignore SIGPIPE */
signal(SIGPIPE, SIG_IGN);

+ __AFL_INIT();
+
/* Get a connection, either from inetd or a listening TCP socket */
if (inetd_flag) {
server_accept_inetd(&sock_in, &sock_out);

AFL should be able to automatically detect that you no longer wish to start the program from the top of main() every time. However, with only the patch above, I got this scary-looking error message:

Hmm, looks like the target binary terminated before we could complete a
handshake with the injected code. Perhaps there is a horrible bug in the
fuzzer. Poke <lcamtuf@coredump.cx> for troubleshooting tips.

So there is obviously some AFL magic code here to make the fuzzer and the fuzzed program communicate. After poking around in afl-fuzz.c, I found FORKSRV_FD, which is a file descriptor pointing to a pipe used for this purpose. The value is 198 (and the other end of the pipe is 199).

To try to figure out what was going wrong, I ran afl-fuzz under strace, and it showed that file descriptors 198 and 199 were getting closed by sshd. With some more digging, I found the call to closefrom(), which is a function that closes all inherited (and presumed unused) file descriptors starting at a given number. Again, the reason for this code to exist in the first place is probably in order to reduce the attack surface in case an attacker is able to gain control the process. Anyway, the solution is to protect these special file descriptors using a patch like this:

diff --git openbsd-compat/bsd-closefrom.c openbsd-compat/bsd-closefrom.c
--- openbsd-compat/bsd-closefrom.c
+++ openbsd-compat/bsd-closefrom.c
@@ -81,7 +81,7 @@ closefrom(int lowfd)
while ((dent = readdir(dirp)) != NULL) {
fd = strtol(dent->d_name, &endp, 10);
if (dent->d_name != endp && *endp == '\0' &&
- fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
+ fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp) && fd != 198 && fd != 199)
(void) close((int) fd);
}
(void) closedir(dirp);

Skipping expensive DH/curve and key derivation operations

At this point, I still wasn’t happy with the execution speed: Some testcases were as low as 10 execs/second, which is really slow.

I tried compiling sshd with -pg (for gprof) to try to figure out where the time was going, but there are many obstacles to getting this to work properly: First of all, sshd exits using _exit(255) through its cleanup_exit() function. This is not a “normal” exit and so the gmon.out file (containing the profile data) is not written out at all. Applying a source patch to fix that, sshd will give you a “Permission denied” error as it tries to open the file for writing. The problem now is that sshd does a chdir("/"), meaning that it’s trying to write the profile data in a directory where it doesn’t have access. The solution is again simple, just add another chdir() to a writable location before calling exit(). Even with this in place, the profile came out completely empty for me. Maybe it’s another one of those privilege separation things. In any case, I decided to just use valgrind and its “cachegrind” tool to obtain the profile. It’s much easier and gives me the data I need without hassles of reconfiguring, patching, and recompiling.

The profile showed one very specific hot spot, coming from two different locations: elliptic curve point multiplication.

I don’t really know too much about elliptic curve cryptography, but apparently it’s pretty expensive to calculate. However, we don’t really need to deal with it; we can assume that the key exchange between the server and the client succeeds. Similar to how we increased coverage above by skipping message CRC checks and replacing the encryption with a dummy cipher, we can simply skip the expensive operations and assume they always succeed. This is a trade-off; we are no longer fuzzing all the verification steps, but allows the fuzzer to concentrate more on the protocol parsing itself. I applied this patch:

diff --git kexc25519.c kexc25519.c
--- kexc25519.c
+++ kexc25519.c
@@ -68,10 +68,13 @@ kexc25519_shared_key(const u_char key[CURVE25519_SIZE],

/* Check for all-zero public key */
explicit_bzero(shared_key, CURVE25519_SIZE);
+#if 0
if (timingsafe_bcmp(pub, shared_key, CURVE25519_SIZE) == 0)
return SSH_ERR_KEY_INVALID_EC_VALUE;

crypto_scalarmult_curve25519(shared_key, key, pub);
+#endif
+
#ifdef DEBUG_KEXECDH
dump_digest("shared secret", shared_key, CURVE25519_SIZE);
#endif
diff --git kexc25519s.c kexc25519s.c
--- kexc25519s.c
+++ kexc25519s.c
@@ -67,7 +67,12 @@ input_kex_c25519_init(int type, u_int32_t seq, void *ctxt)
int r;

/* generate private key */
+#if 0
kexc25519_keygen(server_key, server_pubkey);
+#else
+ explicit_bzero(server_key, sizeof(server_key));
+ explicit_bzero(server_pubkey, sizeof(server_pubkey));
+#endif
#ifdef DEBUG_KEXECDH
dump_digest("server private key:", server_key, sizeof(server_key));
#endif

With this patch in place, execs/second went to ~2,000 per core, which is a much better speed to be fuzzing at.

(EDIT 2017-03-25: As it turns out, this patch is not very good, because it causes a later key validity check to fail (dh_pub_is_valid() in input_kex_dh_init()). We could perhaps make dh_pub_is_valid() always return true, but then there is a question of whether this in turn makes something else fail down the line.)

Creating the first input testcases

Before we can start fuzzing for real, we have to create the first few input testcases. Actually, a single one is enough to get started, but if you know how to create different ones taking different code paths in the server, that may help jumpstart the fuzzing process. A few possibilities I can think of:

The way I created the first testcase was to record the traffic from the client to the server using strace. Start the server without -i:

./sshd -d -e -p 2200 -r -f sshd_config
[...]
Server listening on :: port 2200.

Then start a client (using the ssh binary you’ve just compiled) under strace:

$ strace -e trace=write -o strace.log -f -s 8192 ./ssh -c none -p 2200 localhost

This should hopefully log you in (if not, you may have to fiddle with users, keys, and passwords until you succeed in logging in to the server you just started).

The first few lines of the strace log should read something like this:

2945  write(3, "SSH-2.0-OpenSSH_7.4\r\n", 21) = 21
2945 write(3, "\0\0\4|\5\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0010curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\0\0\1\"ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\0\0\0\4none\0\0\0\4none\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\325umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\32none,zlib@openssh.com,zlib\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1152) = 1152

We see here that the client is communicating over file descriptor 3. You will have to delete all the writes happening on other file descriptors. Then take the strings and paste them into a Python script, something like:

import sys
for x in [
"SSH-2.0-OpenSSH_7.4\r\n"
"\0\0\4..."
...
]:
sys.stdout.write(x)

When you run this, it will print a byte-perfect copy of everything that the client sent to stdout. Just redirect this to a file. That file will be your first input testcase.

You can do a test run (without AFL) by passing the same data to the server again (this time using -i):

./sshd -d -e -p 2200 -r -f sshd_config -i < testcase 2>&1 > /dev/null

Hopefully it will show that your testcase replay was able to log in successfully.

Before starting the fuzzer you can also double check that the instrumentation works as expected using afl-analyze:

~/afl-2.39b/afl-analyze -i testcase -- ./sshd -d -e -p 2200 -r -f sshd_config -i

This may take a few seconds to run, but should eventually show you a map of the file and what it thinks each byte means. If there is too much red, that’s an indication you were not able to disable checksumming/encryption properly (maybe you have to make clean and rebuild?). You may also see other errors, including that AFL didn’t detect any instrumentation (did you compile sshd with afl-clang-fast?). This is general AFL troubleshooting territory, so I’d recommend checking out the AFL documentation.

Creating an OpenSSH dictionary

I created an AFL “dictionary” for OpenSSH, which is basically just a list of strings with special meaning to the program being fuzzed. I just used a few of the strings found by running ssh -Q cipher, etc. to allow the fuzzer to use these strings without having to discover them all at once (which is pretty unlikely to happen by chance).

s0="3des-cbc"
s1="aes128-cbc"
s2="aes128-ctr"
s3="aes128-gcm@openssh.com"
s4="aes192-cbc"
s5="aes192-ctr"
s6="aes256-cbc"
s7="aes256-ctr"
s8="aes256-gcm@openssh.com"
s9="arcfour"
s10="arcfour128"
s11="arcfour256"
s12="blowfish-cbc"
s13="cast128-cbc"
s14="chacha20-poly1305@openssh.com"
s15="curve25519-sha256@libssh.org"
s16="diffie-hellman-group14-sha1"
s17="diffie-hellman-group1-sha1"
s18="diffie-hellman-group-exchange-sha1"
s19="diffie-hellman-group-exchange-sha256"
s20="ecdh-sha2-nistp256"
s21="ecdh-sha2-nistp384"
s22="ecdh-sha2-nistp521"
s23="ecdsa-sha2-nistp256"
s24="ecdsa-sha2-nistp256-cert-v01@openssh.com"
s25="ecdsa-sha2-nistp384"
s26="ecdsa-sha2-nistp384-cert-v01@openssh.com"
s27="ecdsa-sha2-nistp521"
s28="ecdsa-sha2-nistp521-cert-v01@openssh.com"
s29="hmac-md5"
s30="hmac-md5-96"
s31="hmac-md5-96-etm@openssh.com"
s32="hmac-md5-etm@openssh.com"
s33="hmac-ripemd160"
s34="hmac-ripemd160-etm@openssh.com"
s35="hmac-ripemd160@openssh.com"
s36="hmac-sha1"
s37="hmac-sha1-96"
s38="hmac-sha1-96-etm@openssh.com"
s39="hmac-sha1-etm@openssh.com"
s40="hmac-sha2-256"
s41="hmac-sha2-256-etm@openssh.com"
s42="hmac-sha2-512"
s43="hmac-sha2-512-etm@openssh.com"
s44="rijndael-cbc@lysator.liu.se"
s45="ssh-dss"
s46="ssh-dss-cert-v01@openssh.com"
s47="ssh-ed25519"
s48="ssh-ed25519-cert-v01@openssh.com"
s49="ssh-rsa"
s50="ssh-rsa-cert-v01@openssh.com"
s51="umac-128-etm@openssh.com"
s52="umac-128@openssh.com"
s53="umac-64-etm@openssh.com"
s54="umac-64@openssh.com"

Just save it as openssh.dict; to use it, we will pass the filename to the -x option of afl-fuzz.

Running AFL

Whew, it’s finally time to start the fuzzing!

First, create two directories, input and output. Place your initial testcase in the input directory. Then, for the output directory, we’re going to use a little hack that I’ve found to speed up the fuzzing process and keep AFL from hitting the disk all the time: mount a tmpfs RAM-disk on output with:

sudo mount -t tmpfs none output/

Of course, if you shut down (or crash) your machine without copying the data out of this directory, it will be gone, so you should make a backup of it every once in a while. I personally just use a bash one-liner that just tars it up to the real on-disk filesystem every few hours.

To start a single fuzzer, you can use something like:

~/afl-2.39b/afl-fuzz -x sshd.dict -i input -o output -M 0 -- ./sshd -d -e -p 2100 -r -f sshd_config -i

Again, see the AFL docs on how to do parallel fuzzing. I have a simple bash script that just launches a bunch of the line above (with different values to the -M or -S option) in different screen windows.

Hopefully you should see something like this:

                         american fuzzy lop 2.39b (31)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│ run time : 0 days, 13 hrs, 22 min, 40 sec │ cycles done : 152 │
│ last new path : 0 days, 0 hrs, 14 min, 57 sec │ total paths : 1577 │
│ last uniq crash : none seen yet │ uniq crashes : 0 │
│ last uniq hang : none seen yet │ uniq hangs : 0 │
├─ cycle progress ────────────────────┬─ map coverage ─┴───────────────────────┤
│ now processing : 717* (45.47%) │ map density : 3.98% / 6.67% │
│ paths timed out : 0 (0.00%) │ count coverage : 3.80 bits/tuple │
├─ stage progress ────────────────────┼─ findings in depth ────────────────────┤
│ now trying : splice 4 │ favored paths : 117 (7.42%) │
│ stage execs : 74/128 (57.81%) │ new edges on : 178 (11.29%) │
│ total execs : 74.3M │ total crashes : 0 (0 unique) │
│ exec speed : 1888/sec │ total hangs : 0 (0 unique) │
├─ fuzzing strategy yields ───────────┴───────────────┬─ path geometry ────────┤
│ bit flips : n/a, n/a, n/a │ levels : 7 │
│ byte flips : n/a, n/a, n/a │ pending : 2 │
│ arithmetics : n/a, n/a, n/a │ pend fav : 0 │
│ known ints : n/a, n/a, n/a │ own finds : 59 │
│ dictionary : n/a, n/a, n/a │ imported : 245 │
│ havoc : 39/25.3M, 20/47.2M │ stability : 97.55% │
│ trim : 2.81%/1.84M, n/a ├────────────────────────┘
└─────────────────────────────────────────────────────┘ [cpu015: 62%]

Crashes found

In about a day of fuzzing (even before disabling encryption), I found a couple of NULL pointer dereferences during key exchange. Fortunately, these crashes are not harmful in practice because of OpenSSH’s privilege separation code, so at most we’re crashing an unprivileged child process and leaving a scary segfault message in the system log. The fix made it in CVS here: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/kex.c?rev=1.131&content-type=text/x-cvsweb-markup.

Conclusion

Apart from the two harmless NULL pointer dereferences I found, I haven’t been able to find anything else yet, which seems to indicate that OpenSSH is fairly robust (which is good!).

I hope some of the techniques and patches I used here will help more people get into fuzzing OpenSSH.

Other things to do from here include doing some fuzzing rounds using ASAN or running the corpus through valgrind, although it’s probably easier to do this once you already have a good sized corpus found without them, as both ASAN and valgrind have a performance penalty.

It could also be useful to look into ./configure options to configure the build more like a typical distro build; I haven’t done anything here except to get it to build in a minimal environment.

Please let me know in the comments if you have other ideas on how to expand coverage or make fuzzing OpenSSH faster!

Thanks

I’d like to thank Oracle (my employer) for providing the hardware on which to run lots of AFL instances in parallel :-)


  1. Well, we can’t fix up signatures we don’t have the private key for, so in those cases we’ll just assume the attacker does have the private key. You can still do damage e.g. in an otherwise locked down environment; as an example, GitHub uses the SSH protocol to allow pushing to your repositories. These SSH accounts are heavily locked down, as you can’t run arbitrary commands on them. In this case, however, we do have have the secret key used to authenticate and sign messages.

March 26, 2017 10:07 AM

March 24, 2017

James Morris: Linux Security Summit 2017: CFP Announcement

LSS logo

The 2017 Linux Security Summit CFP (Call for Participation) is now open!

See the announcement here.

The summit this year will be held in Los Angeles, USA on 14-15 September. It will be co-located with the Open Source Summit (formerly LinuxCon), and the Linux Plumbers Conference. We’ll follow essentially the same format as the 2016 event (you can find the recap here).

The CFP closes on June 5th, 2017.

March 24, 2017 01:10 PM

March 21, 2017

Matthew Garrett: Announcing the Shim review process

Shim has been hugely successful, to the point of being used by the majority of significant Linux distributions and many other third party products (even, apparently, Solaris). The aim was to ensure that it would remain possible to install free operating systems on UEFI Secure Boot platforms while still allowing machine owners to replace their bootloaders and kernels, and it's achieved this goal.

However, a legitimate criticism has been that there's very little transparency in Microsoft's signing process. Some people have waited for significant periods of time before being receiving a response. A large part of this is simply that demand has been greater than expected, and Microsoft aren't in the best position to review code that they didn't write in the first place.

To that end, we're adopting a new model. A mailing list has been created at shim-review@lists.freedesktop.org, and members of this list will review submissions and provide a recommendation to Microsoft on whether these should be signed or not. The current set of expectations around binaries to be signed documented here and the current process here - it is expected that this will evolve slightly as we get used to the process, and we'll provide a more formal set of documentation once things have settled down.

This is a new initiative and one that will probably take a little while to get working smoothly, but we hope it'll make it much easier to get signed releases of Shim out without compromising security in the process.

comment count unavailable comments

March 21, 2017 08:29 PM

Kernel Podcast: Linux Kernel Podcast for 2017/03/21

Audiohttp://traffic.libsyn.com/jcm/20170321.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc3, this week’s exciting installment of “5-level paging weekly”, the 2038 doomsday compliance “statx” systemcall, and heterogenous memory management. Also a summary of all ongoing active kernel development toward 4.12 onwards.

Linus Torvalds announced Linux 4.11-rc3. In his announcement, Linus noted that “rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down. We had a late typo in rc2 that affected arm and powerpc (the prep code for the 5-level page tables [on x86 systems]), and hopefully there are no similar brown-paper-bugs in rc3.”

Announcements

Kent Overstreet announced the latest developments in Bcachefs, in a post entitled “Bcachefs – encryption, fsck, and more”. One of the key new features is that “We now have whole filesystem encryption – and this is modern authenticated encryption”. He notes that they can’t currently encrypt only part of the filesystem (as is the case, for example, with ext4 – as used on Android devices, and of course with Apple’s multi-layered iOS filesystem implementation) but “it’s more of a better dm-crypt” in removing the layers between the filesystem and the underlying hardware. He also notes that there’s a “New inode format”, and many other changes. Further details at: https://bcache.evilpiepirate.org/Bcachefs/

Hongbo Wang (Intel) announced the 2016-Q4 release of XenGT and 2016-Q4 release of KVMGT. These are both “full GPU virtualization solution[s] with mediated pass-through”…of the hardware graphics resources into guest virtual machines. Further information is available from Intel’s github: https://github.com/01org/ (igvtg-xen for the Xen tree, and igvtg-kernel, and igvtg-qemu for the pieces needed for KVM support)

Julia Cartwright announced the Linux preempt-rt (Real Time) kernel version 4.1.39-rt47 stable kernel release.

Junio C Hamano announced Git v2.12.1. In his announcement, he noted that the tarballs “are NOT YET found at” the typical URL since “I am having trouble reaching there”. It’s unclear if this is due to recent changes in the architecture of kernel.org and its mirroring, or a local issue.

Intel 5-level paging

In this week’s episode of “merging Intel 5-level paging support” the fun but unexpected plot twist resulting in a “will it merge or not” cliffhanger comes from Linus. Kirill A. Shutemov (Intel) has been diligently posting this series for some time, and if you recall from last week’s episode, the foundational pieces needed to land this in 4.12 were merged after the closure of the 4.11 merge window following a special request from Linus. Kirill has since posted “x86: 5-level paging enabling for v4.12, Part 1”. In response to a comment from Kirill that “Let’s see if I’m on the right track addressing Ingo’s [Molnar’s] feedback”, Linus stated, “Considering the bug we just had with the HAVE_GENERIC_RCU_GUP code, I’m wondering if people would be willing to look at what it would take to make x86 use the generic version?”, and “The x86 version of __get_user_pages_fast() seems to be quite similar to the generic one. And it would be lovely if all the main architectures shared the same core gup code”.

The Linux kernel implements a set of code functions for pinning of usermode (userspace) pages (the smallest granule size upon which contemporary hardware operates via a Memory Management Unit under the control of software provided and (co-)maintained “page tables”, and the size tracked by the Operating System in its page table management code) whenever they must be shared between userspace (which has dynamically pageable memory that can come and go as the kernel needs to free up RAM temporarily for other tasks by “paging” those pages out to “swap”) and code running within a kernel driver (the Linux kernel does not have pageable memory). GUP (get_user_pages) handles this operation, which takes a set of pointers to the individual pages that should be present and marked as in use. It has a variant usually referred to as “fast GUP” which aims to perform this operation without taking an expensive lock in the corresponding userspace processes’ “mm” struct (an object that forms part of a task’s – the in-kernel term for a process – metadata, and linked from the corresponding task_struct). Fast GUP doesn’t always work, but when it doesn’t need to fallback to an expensive slow path, it can save considerable time. So Linus was expressing a desire for x86 to share the same generic code as used by other architectures for this operation.

Linus further added three “subtle issues” that he saw with switching over x86 to the generic GUP code:

“(a) we need to make sure that x86 actually matches the required semantics for the generic GUP.

(b) we need to make sure the atomicity of the page table reads is ok.

(c) need to verify the maximum VM address properly”

He said “I _think_ (a) is ok”. But he wanted to see “real work to make sure” that (b) is “ok on 32-bit PAE”. PAE means Physical Address Extension, a mechanism used on certain 32-bit Intel x86 systems to address greater than a 32-bit physical address space by leveraging the fact that many individual applications don’t need larger than a 32-bit address space but that an overall system might in aggregate use multiple such 32-bit applications. It was a hack that bought time before the widespread adoption of the 64-bit architecture, and one that others (such as ARM) have implemented in a similar sense of end purpose in “LPAE” and friends as well. PAE moved the x86 architecture from 32-bit PTE (Page Table Entries) to 64-bit hardware entries, which means that on 32-bit systems there are real concerns around atomicity of updates to these structures without very careful handling. And as this author can attest, you don’t want to have to debug that situation.

This discussion lead Kirill to point out that there were some obvious looking bugs in the existing x86 GUP code that needed fixing for PAE anyway. The thread is ongoing, and Kirill is certain to be enjoying this week’s episode of “so you thought you were only adding 5-level paging?”. Michal Hocko noted that he had pulled the current version of the 5-level paging patch series into the mmotm (mm of the moment) VM (Virtual Memory) subsystem development tree as co-maintained with Andrew Morton and others.

Borislav Petkov posted “x86/mce: Handle broadcasted MCE gracefully with kexec” which (as we covered previously) seeks to handle the unfortunate case of an MCE (Machine Check Exception) on Intel x86 systems arriving during the process of handoff from the crash kernel into “pergatory” prior to the new kernel beginning. At this phase, the old kernel’s MCE handler is running and will never complete a synchronization with other cores in the system that are waiting in a holding spinloop (probably MWAIT one would assume) for the new kernel to take over.

statx

Various subsystems gained support for the new “statx” system call, which is part of the ongoing “Year 2038” doomsday avoidance work to prevent a Y2K style disaster when 32-bit Unix time wraps in 2038 (this being an actual potential “disaster” in the making, unlike the much hyped Y2K nonsense). Many of us have aspiriations to be retired and living on boats by then, but this is neither assured, nor a prudent means to guarantee we won’t have to deal with this later (but presumably with at least some kind of lucrative consulting contract to bring us out of our early or late retirements).

The “statx” call adds 64-bit timestamps and replaces “stat”. It also does a lot more than just “make large” (David Howell’s words) the various fields in the previous stat structutures. The overall system call was covered much more generally by Linux Weekly News (which you should support as a purveyor of actual in-depth journalism on such topics) as recently as last week. Stafford Horne posted one example of the patches we refer to here, for the “asm-generic” reference includes used by emerging architectures, such as the OpenRISC architecture that he is maintaining. Another statx patch came from David Howells, for the ext4 filesytem, which lead to a longer discussion of how to implement various underlying flag changes required to ext4.

Eric Biggers noted that David used the ext4_get_inode_flags function “to sync the generic inode flags (inode->i_flags) to the ext4-specific inode flags (ei->i_flags)” bu that a problem can exist when doing this without holding an underlying lock due to “flag syncs…in both directions concurrently” which could “cause an update to be lost”. He walked an example of how this could occur, and then suggested that for ->getattr() it might be easier to skip the call to the offending function and “instead populating the generic attributes like STATX_ATTR_APPEND and STATX_ATTR_IMMUTABLE from the generic inode flags, rather than from the ext4-specific flags?”. Andreas Dilger suggested the other way around, pulling the flags directly from the ext4 flags rather than the generic ones. He also raised the eneral question of “when/where are the VFS inode flags changed that they need to be propagated into the ext4 disk inode?”.

Jan Kara replied that “you seem to be right. And actually I have checked and XFS does not bother to copy inode->i_flags to its on-disk flags so it seems generally we are not expected to reflect inode->i_flags in on-disk state”. Jan suggested to Andreas that it might be “better…to have ext4_quota_on() and ext4_quota_off() just update the flags as needed and avoid doing it anywhere else…I’ll have a look into it”.

Heterogeneous Memory Management

Jérôme Glisse posted version 18 of his patch series entitled “HMM (Heterogenous Memory Management)” which aims to serve two generic use cases: “First it allows to use device memory transparently inside any process without modifications to process program code. Second it allows to mirror process address space on a device”. His intro described these summaries as a “Cliff node” (a brand of examination-time study materials often used by students for preparation), which lead to an objection from Andrew Morton that “Cliff’s notes” “isn’t appropriate for a large feature such as this. Where’s the long-form description? One which permits readers to fully understand the requirements, design, alternative designs, the implementation, the interface(s), etc?”. He also asked for clarifcation of which was meant by “device memory” since “That’s very vague. What are the characteristics of this memory? Why is it a requirement that userspace code be unaltered? What are the security implications – does the process need particular permissions to access this memory? What is the proposed interface to set up this access?”

In a followup, Jérôme noted that he had previously given a longer form summary, which he attached, in the earlier revisions of the now version 18 patch series. In his summary, he makes clear his intent is to ease the overall management and programming of hybrid systems involving GPUs and other accelerators by introducing “a new kind of ZONE_DEVICE memory that does allow to allocate a struct page for each page of the device memory. Those page are special because the CPU can not map them. They however allow to migrate main memory to device memory using ex[]isting migration mechanism[s] and everything looks like it page was swap[ped] out to disk from CPU point of view. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms”. He notes that he isn’t trying to solve other problems, and in fact one could summarize HMM using the buzzword du jour: “mediated”.

In an HMM world, devices and host-side application software can share what appears to them as a “unified” memory map. One in which pointer addresses from within an application can be deferenced by code running on a GPU, and vice versa, through cunning use of page tables and a new underlying system framework for the device drivers touching the hardware. It’s not magic, but it does help to treat device memory “like regular memory” and accommodates “Advance in high level language construct (in C++ but others too) gives opportunities to compiler to leverage GPU transparently without programmer knowledge. But for this to happen we need a share[d] address space”.

This means that, if a host application (processor side of the equation) performs an access to part of a process (known as a “task” within the kernel) address space that is currently under control of a device, then the associated page fault will trigger generic framework code to handle handoff of that page back to the host CPU side. On the flip side, the framework still requires device drivers to use a new framework to manage their access to memory since few devices have generic page fault mechanisms today that can be leveraged to make this more transparent, and a lot of other device specific gunk is needed. It’s not a perfect solution, but it does arguably advance the state of the art, and is useful. Jérôme also states that “I do not wish to compete for the patchset with the highest revision count and i would like a clear cut position on w[h]ether it can be merge[d] or not. If not i would like to know why because i am more than willing to address any issues people might have. I just don’t want to keep submitting it over and over until i end up in hell…So please consider applying for 4.12”.

This author’s own personal opinion is that, while HMM is certainly useful, many such shared device/host memory situations can be greatly simplified by introducing coherent shared virtual memory between device and host. That model allows for direct address space sharing without some of the heavy lifting required in this patch set. Yet, as is noted in the posting, few devices today have such features (and there is no reason to presume that all future devices suddenly will implement shared virtual memory, not that every device will want to expand the energy required to maintain coherent memory for communication). So the HMM patches provide a means of tracking who owns memory shared between device and “host”, and they exploit split device and “host” system page tables as well as associated faults to ensure pages are handed off as cleanly as can be achieved with technology available in the market today.

Ongoing Development

Michal Hocko posted a patch entitled “rework memory hotplug onlining”, which seeks to rework the semantics for memory hotplug since the current implementation is “awkward and hard/impossible to use from the udev to online memory as movable. The main problem is that only the last memblock or the adjacent to highest movable memblock can be onlined as movable”. He posted a number of examples showing how things fall down today, as well as a patch (“just for x86 now but I will address other arches once there is an agreement this is the right approach”) removing “all the zone specific operations from __add_pages (aka arch_add_memory) path. Instead we do page->zone association from move_pfn_range which is called from online_pages. This criterion for movable/normal zone association is really simple now. We just have to guarantee that zone Normal is always lower than zone Movable”. This lead to a lengthy discussion around the ideal longer term approach and is likely to be a topic at the LSF/MM conference this week (one assumes?). [ It’s happening down the street from me…I’ll smile and wave at you 😉 ]

Gustavo Padovan posted “V4L2 explicit synchronization support”, an RFC (Request For Comments) that “adds support for Explicit Synchronization of shared buffers in V4L2” (Video For Linux 2, the general purpose video framework API used on Linux machines for certain multimedia purposes). This new RFC leverages the “Sync File Framework” as a means to “communicate the fences between kernel and userspace”. In English, what this means is that it’s often necessary to communicate using shared buffers between userspace, kernel, and hardware. And some (most) hardware might not guarantee that these buffers are fully coherent (observed identically between multiple concurrently operating agents that are manipulating it). The use of “fences” (barriers) enables explicit communication of certain points in time during which the state of a buffer is consistent and ready for access to be handed off between different parts of the system. The RFC is quite interesting and has a lot more detail, including the observation that it is intended to be a PoC (Proof of Concept) to get the conversation moving more than the eventual end result of that conversation that might actually be merged.

Wei Wang (Intel) posted a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration. Balloons aren’t just helium filled goodies that all of us love to play with from a young age. Well, they are that, but, they’re also a concept applied to the memory management of virtual machines, which “inflate” the amount of memory available to them by requesting more from a hypervisor during their lifetime (that they might also return). In Linux, the same concept is applied to the migration of virtual machines, which can use the virtio-balloon abstraction over the virtio bus (a hypervisor communications channel) to transfer “guest unused pages to the host so that they can be skipped to migrate in live migration”. One of the patches in his version 3 series (patch number 3 of 4), entitled “mm: add in[t]erface to offer info about unused pages” had some detailed discussion with Michael S. Tsirkin commenting on better documentation and Andrew Morton suggesting that it might be better for the code to live in the virtio-balloon driver rather than being made too generic as its use case is very targeted.

Elena Reshetova continued her work toward conversion of Linux kernel subsystems to her newer “refcount” explicit reference counting API with a posting entitled “net subsystem refcount conversions”.

Suzuki K Poulose posted a bunch of patches implementing support for detection and reporting of new ARMv8.3 architecture features, including one patch that was entitled “arm64: v8.3: Support for Javascript conversion instruction” (which really means a new double precision float to integer conversion instruction that will likely be used by high performance JavaScript JITs…). He also posted “arm64: v8.3: Support for weaker release consistency”. The new revision of the architecture adds new instructions to “support Release Consistent processor consistent (RCpc) model, which is weaker than the RCsc [Release Consistent sequential consistency] model”. Listeners are encouraged to read the C++ memory model and other fascinating bedtime literature for much more detail on the available RC options.

Markus Mayer (Broadcom) posted “Basic divider clock”, an RFC which aims to provide a generic means of expressing clock dividers that can be leveraged in an embedded system’s “DeviceTree”, for which he also posted bindings (descriptions to be used in creating these textual description “trees”). Stephen Boyd pushed back that the community had so far avoided generic implementations but instead preferred to keep things at the level of having drivers that target certain hardware IP from certain vendors based upon the compatible matching strings.

Michael S. Tsirkin posted “kvm: better MWAIT emulation for guests”. We have previously explained this patchset and the dynamics of MWAIT implementations. His goal for this patch is to handle guests that assume the presence of the (x86) MWAIT feature, which isn’t present on all x86 CPUs. If you were running (for example) MacOS inside a VM on an 86 machine, it would generally assume the presence of MWAIT without checking for it, because it’s present in all x86-based Apple Macs. Emulating MWAIT is useful in such situations.

Romain Perier posted “Replace PCI pool by DMA pool API”. As he notes in his posting, “The current PCI pool API are simple macro functions direct expanded to the appropriate dma pool functions. The prototypes are almost the same and semantically, they are very similar. I propose to use the DMA pool API directly and get rid of the old API”.

Daeseok Youn posted “staging: atomisp: use k{v}zalloc instead of k{v}alloc and memset”. Alan Cox replied “…please don’t apply this. There are about five other layers of indirection for memory allocators that want removing first so that the driver just uses the correct kmalloc/kzalloc/kv* functions in the right places”. Now does seem like a good time not to add more layers.

Peter Zijlstra posted various “x86 optimizations” that aimed to “shrink the kernel and generate better code”.

March 21, 2017 04:22 PM

March 20, 2017

Dave Airlie: how close to conformant is radv?

I spent some time staring into the results of the VK-GL-CTS test suite on radv, which contains the Vulkan 1.0 conformance tests.

In order to be conformant you have to pass all the tests on the mustpass list for the Vulkan version you want to conform to, from the branch of the test suite for that version.

The latest CTS tests for 1.0 is the vulkan-cts-1.0.2 branch, and the mustpass list is in external/vulkancts/mustpass/1.0.2/vk-default.txt

Using some WIP radv patches in my github radv-wip-conform branch and the 1.0.2 test suite, today's results are on my Tonga GPU:

Test run totals:
Passed: 82551/150950 (54.7%)
Failed: 0/150950 (0.0%)
Not supported: 68397/150950 (45.3%)
Warnings: 2/150950 (0.0%)

That is pretty conformant (in fact it would pass as-is). However I need to clean up the patches in the branch and maybe figure out how to do some bits properly without hacks (particularly some semaphore wait tweaks), but that is most of the work done.

Thanks again to Bas and all other radv contributors.

March 20, 2017 07:26 AM

March 19, 2017

Pete Zaitcev: Standards for ARM computers in 2017

I wrote this 7 years ago, in 2009:

Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space.

So, how are things nowadays?

The final question had to do with cross-platform drivers. There is an interpreted executable format known as EFI Byte Code (EBC); drivers compiled to that format can run on multiple architectures. [...]

Graf asked whether drivers could, instead, be shipped as a multiple-architecture binary. Progress is being made in this direction, and EFI supports multiple binary formats.

Hopeless.

P.S. Jon reminded me about SBSA, and indeed it's a solid advancement. But using EFI drivers is an idea so monstrously dumb that I don't even know what to say.

March 19, 2017 04:34 PM

Pete Zaitcev: Standards for ARM computers and Linaro

A year ago I posted the following comment at LWN in the context of excitement about Yet Another Little Linux Server:

If you buy an Atom or Geode barebones, it's guaranteed to boot a normal Fedora or Ubuntu. If you buy ARM, you're a hostage of vendor support. If that fails (which is the default), you're all alone in the bizarre maze of incompatible bootloaders and out-of-tree patches which quickly become obsolete. Until ARM comes up with a full computer instead of just a CPU, it's no contender in Linux server space. Instead, the vicious cycle of "make a product, patch something to boot on it, ship it, forget about it immediately" will continue forever, with publications hyping up the next wonderful widget and the platform going nowhere.

It looks like someone else figured it out, ergo Linaro. Unfortunately, they do not seem to be eager to create a real platform, but rather slap a veneer of something OpenFirmware-like on top of exising systems. Also, they are buddying with Ubuntu. So, a half-hearted effort and a top-down deal. But it's a step in the right direction.

March 19, 2017 04:27 PM

March 17, 2017

Andi Kleen: Intel Processor Trace resources

Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.

Basic information:

Implementations

JTAG support

Other presentations

Usages

Research papers using PT (subset):

 

March 17, 2017 04:46 PM

March 15, 2017

Linux Plumbers Conference: Linux Plumbers Conference Call for Refereed Presentations

We are pleased to announce the Call for Refereed Presentation
Proposals for the 2017 edition of the Linux Plumbers Conference, which
will be held in Los Angeles, CA, USA on 13-15 September in conjunction
with The Linux Foundation Open Source Summit.

Refereed Presentations are 45 minutes in length and should focus on a
specific aspect of the “plumbing” in the Linux system. Examples of
Linux plumbing include core kernel subsystems, core libraries,
windowing systems, management tools, device support, media
creation/playback, and so on. The best presentations are not about
finished work, but rather problems, proposals, or proof-of-concept
solutions that require face-to-face discussions and debate.

Following the feedback from 2016, we’re keeping the Refereed
Presentations track running throughout the three days of the
conference.  This track will be held jointly with the Open Source
Summit on Wednesday 13 September.

Linux Plumbers Conference Program Committee members will be reviewing
all submitted proposals.  High-quality submissions that cannot be
accepted due to the limited number of slots will be forwarded to both
the Open Source Summit and the Microconference leads for further
consideration.

To submit a Refereed Presentation proposal follow the instructions at

https://www.linuxplumbersconf.org/2017/how-to-submit-a-refereed-track-proposal/

Submissions are due on or before Saturday 6 May, 2017 at 11:59PM
Pacific Time.

Finally, please note we are still accepting Microconference submissions

https://www.linuxplumbersconf.org/2017/participate/#cfmc

March 15, 2017 10:57 PM

March 14, 2017

Kernel Podcast: Kernel Podcast for March 13th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170313.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle.

Linus Torvalds announced Linux 4.11-rc2. In his announcement, he said that the past week had been “fairly quiet” because “people are still looking for bugs and taking a breather after the merge window”. But he also noted that “we’ve got a healthy number of fixes in, and there’ssome cleanup/prep patches for the upcoming 5-level page table support that I took after the merge window just to make the next merge window easier”.

Various fixes and updates have been posted against the previous rc1, over the past week, including an urgent fix from Matthew (Willy) Wilcox for his idr rewrite in 4.11 (freeing the correct IDA bitmap).

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc1”. This compared build error/warning regressions and improvements between v4.11-rc1 and v4.10. According to Geert, the 4.11-rc1 kernel saw an increase of 19 build errors and 1108 warnings when compared to 4.10.

Announcements

Jiri Slaby announced Linux 3.12.71, Greg Kroah Hartman (KH) announced 4.4.53, 4.9.14, and 4.10.2 (which started a conversation about git tags being stale that we will address in a moment). Greg took the opportunity of various stable kernel work to prod the i915 graphics driver team with a message entitled “The i915 stable patch marking is totally broken”.

Sebastian Andrzej Siewior announced the v4.9.13-rt12 preempt-rt “Real Time” kernel patch set, which has a known issue that “CPU hotplug got a little better but can deadlock”, suggesting you might not want to try that then.

Julia Cartwright announced 4.1.38-rt46.

Steven Rostedt announced the 3.18.48-rt53 stable release of the RT kernel. He also announced the 3.10.105-rt119 and 3.2.86-rt124 releases.

Jair Ruusu announced “loop-AES-v3.7k file/swap crypto package”, which is available on sourceforge at: http://loop-aes.sourceforge.net/

Andy Lutomirski sent out detailed notes (along with a followup with yet more explanation) of the Intel SGX (“Secure Enclave”) feature discussion that occured at Kernel Summit and Linux Plumbers Conference last fall. The thread is called “SGX notes from KS/LPC”. In the thread, he explains what SGX is (a small region of virtual memory within a Linux process – known as a task inside the kernel – that is not visible to the host OS after the enclave is “launched”) and how it can be used to hide certain data from system administrators or providers – for example, cryptographic keys that one would rather were not compromised. SGX comes with a litany of new requirements upon the Operating System that Andy covers, along with some guidelines for how to expose this feature, and how to make it useable.

Packet.net are now sponsoring the kernel.org project to the tune of various geo-diverse bare metal frontend systems in datacenters around the globe. Each of these (powerful) frontends provides read-only public access to kernel.org git repositories and the public website (git.kernel.org and www.kernel.org). More information, including machine specifications can be found here: https://www.kernel.org/fast-new-frontends-with-packet.html

(this came to light because of a brief outage affecting the Newark, NJ mirror which was lagging behind other mirrors in picking up new git tags pushed, but one hopes that an official announcement and thanks was otherwise forthcoming)

Masahiro Yamada has been added as a Kbuild (co-)maintainer.

Intel 5-level paging

Kirill A. Shutemov posted version 4 of his “5-level paging” patch series that implements support for the la57 (56 bit Virtual Address space for x64 Canonical Addressing) feature on some future CPUs. We covered the underlying patch series before, explaining the benefit of a larger (virtual) address space, as well as the additional compexities required to implement backward compatibility (including new prctls to limit the virtual address space of certain legacy applications), and the lack (so far) of boot time switching between 4-and-5-level support, which is seen as important for the distros.

Linus responded by saying that he thought “we should just aim for this being in 4.12” as he didn’t “see any real reason to delay merging it”. After some discussion about whose tree to merge it through, it was decided (by Thomas Gleixner) that it could come in through the “-tip” x86 tree. Which resulted in Linus pulling a preparatory “5-level paging: prepare generic code” patch series from Kirill into 4.11 (even after the merge window had closed) to lay the groundwork for pulling the main feature into the next (4.12) cycle. This promptly broke PowerPC, which was promptly fixed by a followup patch. Following the merge of enabling support in 4.11, Kirill posted “5-level paging enabling for v4.12” which aims to complete the merge next cycle.

The earlier version 4 iteration of the patch series noted that the Xen hypervisor currently doesn’t support 5-level paging and thus CONFIG_XEN is disabled automatically when building CONFIG_X86_5LEVEL. It was pointed out by the Andrew Cooper that runtime (boottime) switching between 4 and 5 level support would be required in order to provide a clean experience, especially until Xen Dom0 support is available. That boottime switching is on the existing todo and presumably is going to land at some point.

Separately, Dmitry Safonov posted version 6 of a patch series entitled “Fix compatible mmap() return pointer over 4Gb” which has “some minor conflicts with Kirill’s set for 5-table paging”. Dmitry aims to solve a slightly different problem than Kirill’s PR_{SET,GET}_MAX_VADDR calls (which limit the virtual address ranges returned by mmap to avoid legacy programs breaking when suddenly able to receive much larger “Canonical Addresses” – in Intel parlance – than they were compiled with built-in and broken assumptions about once upon a time) insomuch as he is focused on 32-bit legacy syscalls on 64-bit x64 not returning memory above 4GB that cannot be used by older 32-bit code.

VMA based swap readahead

Ying Huang (Intel) posted an RFC (Request For Comments) entitled “mm, swap: VMA based swap readahead” in which he discussed the current kernel paging implementation for Virtual Memory Areas (VMAs) as well as how it could be improved to facilitate greater awareness of the in-memory access patterns of associated data by changing the corresponding readahead algorithm.

“Readahead” as a concept is what it sounds like. Locality (both spacial, in this case, as well as temporal, in other cases) of data means that when a memory access occurs, it is usually more likely than not that an access to a nearby memory location will soon follow (except in the case of pure random access workloads). Thus, the kernel contains support for preloading nearby data when performing various disk and memory operations. Examples include readahead of nearby disk blocks when loading filesystem data, and loading nearby disk blocks when reading pages back in from swap.

VMAs (Virtual Memory Areas) are regions of memory managed by the Linux kernel. A running application (process), known as a “task” by the kernel, contains a large number of different VMAs which form its overall address space. You can see this by inspecting /proc/self/maps (replacing “self” with a process ID that you have access to). The output will show a series of memory regions representing various memory owned by the task. Memory that doesn’t represent files is known as “anonymous memory” and it is what is paged (swapped) out under memory pressure situations.

As Ying notes in his RFC, the “original swap readahead algorithm does readahead based on the consecutive blocks in [the] swap device” but “the consecutive blocks in [the] swap device just reflect the order of page reclaiming” and not necessarily “the access sequence in RAM”. His patch series aims to change this by teaching the readahead algorithm about VMAs and how to bias the readahead to sequentially walk through the address space of a task (process), reading those parts of the swap space containing this data rather than simply walking through swap sequentially.

But wait! There’s more! Ying also posted a separate patch series entitled “THP swap: Delay splitting THP during swapping out”, which does what it sounds like it would do. THP (Transparent Huge Pages) is a technology used by the Linux kernel to dynamically allocate “huge” (optionally very large – up to 1GB in size, but in this case 2MB) pages of memory to contiguous regions of virtual memory address space, especially those backing shared large memory data (even including a huge zero page used for virtual machine RAM at boot). THP reduces pressure on limited CPU internal microarchitectural caches known as TLBs (Translation Lookaside Buffers) – as well as uTLBs at a lower level than the TLBs – which cache the translation performed by page table entries to physical or intermediate memory addresses. Reducing the number of TLBs required to map regions of virtual memory reduces the number of times TLBs must be reused by the underlying architecture during memory access operations.

The existing Linux kernel THP code splits THPs back into smaller pages whenever they are swapped (paged) out to disk. Yet it turns out that this is particularly inefficient on contemporary systems in which secondary disk or NVMe storage has far greater bandwidth than a single high end core can saturate if forced to do this work. Ying’s patch instead delays this split and pushes entire THPs out to swap, allowing for larger writes and reads of contiguous memory out to the backing storage.

Ongoing Development

“David F” inquired about RAID mode support for Intel m.2 chipsets. These devices continue the recent-ish legacy of certain Intel storage devices providing dual modes of operation: as an AHCI device, and as a hardware RAID device operating in a propietary mode for which no Linux drivers exist. David was quite concerned that the lack of a Linux driver was becoming particular problematic on newer machines, which might not provide a means to switch into AHCI mode (supported by Linux). Christoph Hellwig was…unsympathetic…suggesting that the RAID mode “provides worse performance”, and that its implementation was questionable. He also had a series of other suggestions for what to do with these devices – those are less family friendly to repeat in this podcast.

Michal Hocko posted “kvmalloc” which is a generic replacement for the many “open coded kmalloc with vmalloc fallback instances in the tree”. k-and-vmalloc are two different means by which kernel code allocates memory. The former is used to obtain small allocations (on the order of a few pages – the minimal granule size operated on by the virtual memory subsystem of Linux on contemporary processors) that are also linerally contiguous in physical memory. The latter is for larger allocations of strictly “virtual” memory – contiguous only when accessed using the underlying Memory Mangement Unit to perform a translation (this is usually automatic for kernel code, since the kernel runs with virtual memory of its own, just like user processes do, but it can be problematic if a driver would like to use this memory for certain hardware operations, such as DMA transfers). The generic wrapper aims to clean up the common case that kernel code just wants a chunk of memory and will try to allocate it with kmalloc, but fallback to the more generic vmalloc if that fails.

Christian Konig (AMD) posted “PCI: add resizeable BAR infrastructure” (version 2, and later an update with some fixes in a version 3 also), which aims to add support to the kernel for a PCI SIG (Peripheral Component Interconnect Special Interest Group) ECN (Engineering Change Notice) that enables BARs (Base Address Registers) to be resized at runtime. PCI(e) BARs are mapping windows (aperatures) in the system memory map that are used to talk to hardware add-on cards (or built-in devices within modern platforms) by determining where the device’s memory will live. Traditionally, BARs were fixed size and so on architectures not relying upon firmware configuration of underlying BARs, Linux would have to determine where to place certain PCI(e) resources at boot/hotplug time by checking how much memory a device needed to expose and programming the BARs. With the new extension comes the possibility to increase the size of a BAR to map larger regions of memory. This is a useful feature for graphics cards, which may want to map very large regions of memory. A subsequent patch wires up the AMD GPU driver to use this.

Javi Merino posted “Documentation/EDID fixes”, which aims to correct some broken assumptions in the kernel documentation for EDID (Extended Display Identification Data – the data provided over e.g. I2C from a VGA monitor when the cable is connected). The examples didn’t build correctly due to existing assumptions. This author is probably one of few people who always thinks of EDID and the interaction with Xorg every time he plugs in an external projector to his laptop.

David Howells posted “net: Work around lockdep limitation in sockets that use sockets” in which he corrected an erroneous assumption in the kernel “lockdep” (lock dependency checker) that prevented it from correctly identifying bad call chains involving TCP sockets when there exists a dependency between sockets created purely in the kernel and sockets created purely in userspace (which the lockdep could not distinguish between due to its use of broad lock classes). The AFS (Andrew File System) was generating a false lockdep warning because it was exposing such an implied dependency.

Charles Keepax posted “genirq: Add support for nested shared IRQs” to address an audio CODEC that also acts as an interrupt controller. The details sounded rather painful. Yet it was “fairly easy” to fix.

Steven Rostedt posted “tracing: Allow function tracing to start earlier in boot up”, which does roughly what it says on the can, “moving tracing up further in the boot process”, “right after memory is initialized”. He noted that his RFC was a start and could be futher improved upon.

Matthew (Willy) Wilcox posted an RFC entitled “memset_l and memfill” that provides a generic means for architectures to provide optimized functions that “fill regions of memory with patterns larger than those contained in a single byte”. This is intended to be used by zram as well as other code.

Paul McKenney noticed some of his RCU torture tests failing during hotplug early in boot due to calls to smp_store_cpu_info during that operation. The call is not safe because it indirectly invokes schedule_work() which wants to use RCU prior to RCU being enabled as a side effect of dealing with an unstable TSC (Time Stamp Counter) on the afflicted CPU. Peter Zijlstra had an opinion on hotplug, and also a patch to handle this situation.

Vlad Zakharov posted “update timer frequencies”, which inquired about the best means to implement a cpufreq driver for ARC CPUs. These having a special property that “ARC timers (including those are used for timekeeping) are driven by the same clock as ARC CPU core(s)”. Yup, they change frequency according to the current CPU frequency. Which as Thomas Gleixner noted in response is “broken by design and you really should go and tell your hardware folks to fix that”. He added that “It’s well known for more than TWO decades that changing the frequency of the timekeeper clocksource is a complete disaster”.

Thomas Gleixner posted “kexec, x86/purgatory: Cleanup the unholy mess”, which aims to address his opinion that “the whole machinery is undocumented and lacks any form of forward declarations” (of variables which were previously global but had been made static). Purgatory is a special piece of code which is provided by the kernel but runs in the interim period between the kernel crashing (or beginning kexec) and the new crash or kexec kernel that is then subsequently loaded – this is what performs the load and exec.

March 14, 2017 06:36 PM

March 13, 2017

James Morris: LSM mailing list archive: this time for sure!

Following various unresolved issues with existing mail archives for the Linux Security Modules mailing list, I’ve set up a new archive here.

It’s a mailman mirror of the vger list.

March 13, 2017 10:20 PM

March 09, 2017

James Morris: Hardening the LSM API

The Linux Security Modules (LSM) API provides security hooks for all security-relevant access control operations within the kernel. It’s a pluggable API, allowing different security models to be configured during compilation, and selected at boot time. LSM has provided enough flexibility to implement several major access control schemes, including SELinux, AppArmor, and Smack.

A downside of this architecture, however, is that the security hooks throughout the kernel (there are hundreds of them) increase the kernel’s attack surface. An attacker with a pointer overwrite vulnerability may be able to overwrite an LSM security hook and redirect execution to other code. This could be as simple as bypassing an access control decision via existing kernel code, or redirecting flow to an arbitrary payload such as a rootkit.

Minimizing the inherent security risk of security features, is, I believe, an essential goal.

Recently, as part of the Kernel Self Protection Project, support for marking kernel pages as read-only after init (ro_after_init) was merged, based on grsecurity/pax code. (You can read more about this in Kees Cook’s blog here). In cases where kernel pages are not modified after the kernel is initialized, hardware RO page protections are set on those pages at the end of the kernel initialization process. This is currently supported on several architectures (including x86 and ARM), with more architectures in progress.

It turns out that the LSM hook operations make an ideal candidate for ro_after_init marking, as these hooks are populated during kernel initialization and then do not change (except in one case, explained below). I’ve implemented support for ro_after_init hardening for LSM hooks in the security-next tree, aiming to merge it to Linus for v4.11.

Note that there is one existing case where hooks need to be updated, for runtime SELinux disabling via the ‘disable’ selinuxfs node. Normally, to disable SELinux, you would use selinux=0 at the kernel command line. The runtime disable feature was requested by Fedora folk to handle platforms where the kernel command line is problematic. I’m not sure if this is still the case anywhere. I strongly suggest migrating away from runtime disablement, as configuring support for it in the kernel (via CONFIG_SECURITY_SELINUX_DISABLE) will cause the ro_after_init protection for LSM to be disabled. Use selinux=0 instead, if you need to disable SELinux.

It should be noted, of course, that an attacker with enough control over the kernel could directly change hardware page protections. We are not trying to mitigate that threat here — rather, the goal is to harden the security hooks against being used to gain that level of control.

March 09, 2017 10:52 AM

Rusty Russell: Quick Stats on zstandard (zstd) Performance

Was looking at using zstd for backup, and wanted to see the effect of different compression levels. I backed up my (built) bitcoin source, which is a decent representation of my home directory, but only weighs in 2.3GB. zstd -1 compressed it 71.3%, zstd -22 compressed it 78.6%, and here’s a graph showing runtime (on my laptop) and the resulting size:

zstandard compression (bitcoin source code, object files and binaries) times and sizes

For this corpus, sweet spots are 3 (the default), 6 (2.5x slower, 7% smaller), 14 (10x slower, 13% smaller) and 20 (46x slower, 22% smaller). Spreadsheet with results here.

March 09, 2017 12:53 AM

March 08, 2017

Matthew Garrett: The Internet of Microphones

So the CIA has tools to snoop on you via your TV and your Echo is testifying in a murder case and yet people are still buying connected devices with microphones in and why are they doing that the world is on fire surely this is terrible?

You're right that the world is terrible, but this isn't really a contributing factor to it. There's a few reasons why. The first is that there's really not any indication that the CIA and MI5 ever turned this into an actual deployable exploit. The development reports[1] describe a project that still didn't know what would happen to their exploit over firmware updates and a "fake off" mode that left a lit LED which wouldn't be there if the TV were actually off, so there's a potential for failed updates and people noticing that there's something wrong. It's certainly possible that development continued and it was turned into a polished and usable exploit, but it really just comes across as a bunch of nerds wanting to show off a neat demo.

But let's say it did get to the stage of being deployable - there's still not a great deal to worry about. No remote infection mechanism is described, so they'd need to do it locally. If someone is in a position to reflash your TV without you noticing, they're also in a position to, uh, just leave an internet connected microphone of their own. So how would they infect you remotely? TVs don't actually consume a huge amount of untrusted content from arbitrary sources[2], so that's much harder than it sounds and probably not worth it because:

YOU ARE CARRYING AN INTERNET CONNECTED MICROPHONE THAT CONSUMES VAST QUANTITIES OF UNTRUSTED CONTENT FROM ARBITRARY SOURCES

Seriously your phone is like eleven billion times easier to infect than your TV is and you carry it everywhere. If the CIA want to spy on you, they'll do it via your phone. If you're paranoid enough to take the battery out of your phone before certain conversations, don't have those conversations in front of a TV with a microphone in it. But, uh, it's actually worse than that.

These days audio hardware usually consists of a very generic codec containing a bunch of digital→analogue converters, some analogue→digital converters and a bunch of io pins that can basically be wired up in arbitrary ways. Hardcoding the roles of these pins makes board layout more annoying and some people want more inputs than outputs and some people vice versa, so it's not uncommon for it to be possible to reconfigure an input as an output or vice versa. From software.

Anyone who's ever plugged a microphone into a speaker jack probably knows where I'm going with this. An attacker can "turn off" your TV, reconfigure the internal speaker output as an input and listen to you on your "microphoneless" TV. Have a nice day, and stop telling people that putting glue in their laptop microphone is any use unless you're telling them to disconnect the internal speakers as well.

If you're in a situation where you have to worry about an intelligence agency monitoring you, your TV is the least of your concerns - any device with speakers is just as bad. So what about Alexa? The summary here is, again, it's probably easier and more practical to just break your phone - it's probably near you whenever you're using an Echo anyway, and they also get to record you the rest of the time. The Echo platform is very restricted in terms of where it gets data[3], so it'd be incredibly hard to compromise without Amazon's cooperation. Amazon's not going to give their cooperation unless someone turns up with a warrant, and then we're back to you already being screwed enough that you should have got rid of all your electronics way earlier in this process. There are reasons to be worried about always listening devices, but intelligence agencies monitoring you shouldn't generally be one of them.

tl;dr: The CIA probably isn't listening to you through your TV, and if they are then you're almost certainly going to have a bad time anyway.

[1] Which I have obviously not read
[2] I look forward to the first person demonstrating code execution through malformed MPEG over terrestrial broadcast TV
[3] You'd need a vulnerability in its compressed audio codecs, and you'd need to convince the target to install a skill that played content from your servers

comment count unavailable comments

March 08, 2017 01:30 AM

March 06, 2017

Kernel Podcast: Kernel Podcast for March 6th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170306.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc1, rants about folks not correctly leveraging linux-next, the remainder of this cycle’s merge window pulls, and announcements concerning end of life for some features.

Linus Torvalds announced Linux 4.11-rc1, noting that “two weeks have passed, the merge window is over, and 4.11 has been tagged and pushed out.” He notes that the latest kernel cycle is set to be “on the smallish side”, but that is only in comparison with the most recent two cycles, which have been significantly larger than typical. He notes that 4.11 has a similar number of commits to 4.1, 4.3, 4.5, and 4.7 before it. With the release of 4.11-rc1 comes the closing of the “merge window” (defined by it, the period of time during which disruptive changes are allowed into the kernel prior to RC).

We covered most of the major pulls for 4.11 in last week’s podcast. But there were a few more stragglers. Here’s a sample of those:

J. Bruce Fields posted “nfsd changes for 4.11” which included two semantic changes: NFS security labels are “now off by default” and a “new security_label export flag reenables it per export” since this “only makes sense if all your clients and servers have similar enough selinux policies”. Secondly, NFSv4/UDP support is off because “It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that.”

Anna Schumaker followed up a little later with “Please pull NFS client changes for Linux 4.11”, which includes a memory leak in “_nfs4_open_and_get_state”, as well as various other fixes and new features.

Matthew (Willy) Wilcox posted “Please pull IDR rewrite” which seeks to harmonize the IDR (“Small id to pointer translation service avoding fixed sized tables”) and in-kernel radix tree code. Accoring to Willy, merging the two codebases “lets us share the memory alloction pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the fetures of the radix tree to users of the IDR”.

Will Deacon posted “arm64 fixes for -rc1” of which the “main fix here addresses a kernel panic triggered on Qualcomm QDF2400 due to incorrect register usage in an erratum workaround introduced during the merge window”.

Michael S. Tsirkin posted “vhost: cleanups and fixes”, of which there were very few for this kernel cycle.

Nicholas A. Bellinger posted “target updates for v4.11-rc1”, which includes support for “dual mode (initiator + target) qla2xxx operation”, and a number of other fixes and improvements. He pre-warns that things are “shaping up to be a busy cycle for v4.12 with a new fabric driver (efct) in flight, and a number of other patches on the list being discussed”.

Rafael J. Wysocki posted “Additional ACPI update for v4.11-rc1”, which includes a fix for “an apparant, but actually artificial, resource conflict between the ACPI NVS memory region and the ACPI BERT (Boot Error Record Table)”.

Jens Axboe posted “Block fixes for 4.11-rc1”, which includes a “collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in”. These include a performance fix for the allocation of nvme queues on the right node, along with others.

Miklos Szeredi posted “fuse update for 4.11” and “overlayfs update for 4.11”. the latter “allows concurrent copy up of regular files eliminating [the] potential problem” of (previously) serialized copy ups taking a long time.

Bjorn Helgaas posted “PCI fixes for v4.11”, including a couple of fixes for bugs introduced during code refactoring.

Dan Williams posted “libnvdimm fixes for 4.11-rc1”, which includes a fix for the generation of “nvdimm namespace label”s (metadata) checksums that “Linux was not calculating correcting leading to other environments rejecting the Linux label”.

Helge Deller posted “parisc updates for 4.11”, noting that there was “nothing really important” in this particular cycle to pull in.

James Bottomley posted “final round of SCSI updates for the 4.10+ merge window”, which “is the set of stuff that didn’t quite make the initial pull and a set of fixes for stuff which did”.

Radim Krcmar posted “Second batch of KVM changes for 4.11 merge window”, which includes a number of fixes for PPC and x86.

David Miller posted “Networking”, including many fixes.

A linux-next rant

In his 4.11-rc1 announcement, Linus noted that “it *does* feel like there was more stuff that I was asked to pull than was in linux-next. That always happens, but seems to have happened more now than usually. Comparing to the linux-next tree at the time of the 4.10 release, almost 18% of the non-merge commits were not in Linux-next. That seems higher than usual, although I guess Stephen Rothwell has actual numbers from past merges.” Let’s break what Linus said a little. Stephen Rothwell is an (overworked) kernel hacker based in Australia who produces a (daily, outside of the merge window) kernel tree (and accompanying test infrastructure, patch tracking, and announcement mechanisms) known as “linux-next”. Its raison d’etre is to be the proving ground for new features before they are sent to Linus for merging.

Typically, major new features soak in linux-next for a cycle prior to the one in which they are actually merged (so features landing in 4.11 would have been largely complete and tested via -next during 4.10). Linux kernel development cycles are generally on the order of about two months, so this isn’t an unreasonable long period of time for disruptive changes to languish. Contrast this with the multi-year wait that used to happen back when Linux had an odd/even minor version cycle in which even numbers (2.2, 2.4, 2.6) were the “supported” releases and the odd numbers (2.1, 2.3, 2.5) were development ones. That seems like ancient history now, but it’s really only in the past decade of git that kernel development tooling and community has reached a level of sophistication that the ship can keep moving while the engine is replaced.

Linus noted that there are a “few different classes” of changes that didn’t come to him following a previous test in linux-next. Those include fixes (which is “obviously ok and inevitable”), a specific example (statx) for a longstanding issue that has been ongoing for years (to which he said, “Yeah, I’ll allow this one too”), the “quite noticeable <linux/sched.h> split up series” which “had real reasons for late inclusion”. Finally, he includes the class of subsystems such as “drm, Infiniband, watchdog and btrfs”, which he “found rather annoying this merge window”. He reminded folks of the “linux-next sanity checks” and that if folks ingore them “you had better have your own sanity checks that you replaced them with” rather than “screw all the rules and processes we have in place to verify things”.

The bottom line? Linus says “You people know who you are. Next merge window I will not accept anything even remotely like that. Things that haven’t been in linux-next will be rejected, and since you’re already on my sh*t-list you’ll get shouted at again”. And nobody enjoys being shouted at by Linus. Well, almost nobody. There do seem to be a few people who perversely enjoy it.

Announcements

A couple of questions of code maintenance arose this week. The first was from Natale Patriciello, who asked whether UML (User Mode Linux) is “not maintained anymore?” by citing a few bugs that haven’t been resolved in some time. There were no followups at the time of this recording. The second question came in form of an RFC (Request For Comments) patch entitled “remove support for AVR32 architecture” from Hans-Christian Noren Egtvedt. He noted that AVR32 is “not keeping up with the development of the kernel”, “shares so much of the drivers with Atmel ARM SoC”, and “all AVR32 AP7 SoC processors are end of lifed from Atmel (now Microchip)”. This did seem like a fairly compelling set of reasons to kill it, which others agreed with also. This means that unless someone comes forward soon to maintain AVR32 (along with the associated GCC toolchain and other distribution pieces), its days in the upstream Linux kernel are numbered – and probably removed in 4.12.

Sebastian Andrzej Siewior announced Linux v4.9.13-rt11, which includes a fix for a previous fix (allowing the previous lockdep fix to compile on UP).

Drivers

Logan Gunthorpe posted “New Microsemi PCI Switch Management Driver”, which is in its 7th revision. The RFC (Request for Comments “proposes a management driver for Microsemi’s Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform”. Logan notes that “Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations”.

Ongoing Development

Michael S. Tsirkin continued his work on “vfio error recovery: kernel support” with version 4 of the patch series wich seeks to do more than simply ignoring non-fatal PCIe AER (Advanced Error Reporting) errors that hit assigned devices passed using VFIO into a guest Virtual Machine. Currently, only fatal errors (which cause a PCIe link reset) are reported – they stop the guest. In his summary email, Michael notes that his goal is to handle non-fatal errors by reporting them to the guest and having it handle them. And rather than surprising existing code, he calls out under “issues” that “this behavior should only be enabled with new userspace, old userspace should work without changes”. By “userspace” he means the code driving VFIO, which might be a QEMU process that is backing a KVM virtual machine context, or a container, or merely a bare metal userspace process that is using VFIO directly.

Johannes Weiner posted “mm: kswapd spinning on unreclaimable nodes – fixes and cleanups” in which he notes a previous posting from Jia He that he (and the team at Facebook) have reproduced. In the case of the problem scenario, the kernel’s kswapd (swap space daemon) for a given (memory) node spins indefinitely at 100% CPU usage when there are absolutely no reclaimable pages (granules of the smallest size of memory that can be managed by Linux and the underlying hardware) however the “condition for backing off is never met”. This results in kswapd busy-looping forever. In his patches, Johannes changes reclaim behavior so that kswapd will eventually really back off after failing 16 times (which is the same magic number of times we try during an OOM “Out Of Memory” situation) as defined by MAX_RECLAIM_RETRIES. He includes various examples.

Len Brown posted “cpufreq: Add the “cpufreq.off=1” cmdline option. This is a corollary to “cpuidle.off=1” and comes about for similar reasons for the purpose of testing. This author wonders aloud whether this will allow for buggy platforms that don’t support CPPC (Collaborative Processor Performance Control) to easily disable this at runtime too.

Aleksey Makarov posted “printk: fix double printing with earlycon”. On ACPI compliant platforms (including ARM servers), the SPCR (“Serial Port Console Redirection”) table provides information about the serial console UART that the kernel should be using, rather than having the user provide memory register addresses and baud rates on the kernel command line. This is a feature which is generally useful beyond ARM systems (although most x86 systems follow the traditional “PC” UART design). Prior to this fix, the kernel would double print output if given a “console=” and “earlycon”.

Minchan Kim posted “make try_to_unmap simple” which aims to remove some of the (apparently somewhat gratitous) complexity in the return value of this function. Currently it can return SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, and SWAP_MLOCK. But Minchan feels that it can be simply a boolean return by removing the latter three of those return values.

Matthew Gerlach (Intel) posted “Altera Partial Reconfiguration IP”, which adds support to the kernel’s (Alan Tull’s) “fpga-mgr” driver for the “Altera Partial Reconfiguration IP”. Partial Reconfiguration (sometimes known as “PR” in the reconfigurable logic community) allows an FPGA (Field Programmable Gate Array)’s logic fabric to be reconfigured in smaller than whole regions. This (for example) would allow a closely coupled datacenter (Xeon) processor to continue to drive certain FPGA contained IP while other IP were being replaced dynamically. If one were to couple this with support in OpenStack Nomad or Kubernetes for dynamic reconfiguration at VM/container setup it would begin to enable various use cases for the mainstream datacenter around FPGA acceleration.

Andi Kleen posted “pci: Allow lockless access path to PCI mmconfig”. “mmconfig” refers to the memory mapped configuration region used by contemporary PCIe devices during enumeration and configuration. This is a kind of out-of-band mechanism by which the kernel can talk to PCIe devices in a fully standards compliant means prior to having configured them. Intel processors include many “PCIe” devices that are in fact a logical means of expressing so called “uncore” non-compute features on the processor SoC. They’re not real PCIe devices but appear to the kernel as such. This wonderful abstraction comes with some overhead cost, especially when the kernel spends time grabbing the “pci_cfg_lock” which it actually doesn’t need to hold, according to Andi.

Jarkko Sakkinen posted version 3 of “in-kernel resource manager”, which adds support to the kernel for “TPM spaces that provide an isolated execution context for transient objects and HMAC policy sessions”.

Tomas Winkler posted a question about what the community considered to be the “correct usage of arrats of variable length within [the] Linux kernel”. The replies generally included language to the form of “don’t”. Both for reasons of general language ugliness, and also because (especially in the case of local variables) the Linux kernel’s fixed (and also small) size stack raises serious potential for stack overflow if one is not careful. There was a suggestion that the kernel should be built with a compiler option to disallow VLAs, but that this would require various code to be fixed first.

March 06, 2017 12:53 AM

March 02, 2017

Pete Zaitcev: PTG, Sheraton, Atlanta

Serious reports trickle in (one, two), but on the lighter side, how was the venue? It's on the other side of the downtown, you know.

Bum incursions were very mild. Most went to the coffee area at the 1st floor, under the lobby level of Sheraton. One was remarkable though. I saw him practicing an unusual style of fighting on the sidewalk, with deep squats and wild swings - probably one of them prison styles. Very impressive, and also somewhat disturbing since all I have for him is checking with feet as well as I could, then move in for grappling phase, and then it's luck... Once done with his routine, he proceeded right past the coffee into the 2 rooms where some other org was meeting (not OpenStack) and started begging food off the hotel workers setting up tables. I heard him claiming that he was very hungry. Seemed super energetic and powerful a few minutes prior lol. But as much as I know, no laptops were stolen at the PTG (unlike e.g. OLS and UDS). Only goes to show that the main hazard is the venue staff and bums are more of an amusement, unless it's a really rough area.

March 02, 2017 01:51 AM

February 28, 2017

Kernel Podcast: Kernel Podcast for Feb 27th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170228.mp3

In this week’s kernel podcast: the merge window for kernel 4.11 is open and patches are flying into Linus’s inbox, fixing NUMA node determination at runtime, Virtual Machine Aware Caches, Advisory Memory Allocations, and a non-fixed TASK_SIZE to bring excitement to your life. We will have this, and a summary of ongoing development in this week’s Linux Kernel podcast.

The merge window (period of time during which disruptive changes are allowed to be “merged” – incorporated into Linus’s official git tree – prior to a multi-week stabilization and Release Candidate cycle) for Linux 4.11 is currently open. This means that the most recent official kernel remains Linux 4.10. Meanwhile, many “pull requests” and merges are in flight for various kernel subsystems planning updates in 4.11. These include:

For a detailed sumary of current merge widow pulls and patches, consult this week’s Linux Weekly News at LWN.net (Thursday).

Geert Uytterhoeven posted a summary of “Build regressions/improvements in v4.10”. These show an increase in build errors and warnings vs the previous 4.9 kernel cycle. He posted a list of configs used, the error and warning messages, and thanked the “linux-next team for providing the build service”.

Pavel Machek has been posting about various problems running 4.10 kernels. In one instance, he saw a corrupted stack that implied a double call to “startup_32_smp” (the secondary CPU boot method on Intel x64 Architecture). This lead Josh Poimbeouf to ponder whether the GCC in use was somehow bad.

Announcements

Greg Kroah-Hartman announced Linux 4.4.52, 4.9.13, and 4.10.1. Ben Hutchings announced Linux 3.16.41, and 3.2.86.

Stephen Hemminger announced iproute2-4.10, including support for “new features in Linux 4.10”. Amongst those new features are “enhanced support for BPF [Berkley Packet Filter], VRF [Virtual Routing and Forwarding], and Flow based classifier (flower)”. The latest version is available here: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.10.0.tar.gz

Karel Zak announced util-linux v2.29.2, including a fix for a (nasty) “su” security issue, otherwise documented in CVE-2017-2616. According to Karel, it is “possible for any local user to send SIGKILL to other processes with root privileges. To exploit this, the user must be able to perform su with a successful login. SIGKILL can only be send to processes which were executed after the su process. It is not possible to send SIGKILL to processes which were already running”. A fix entitled “properly clear child PID” against “su” is included among the fixes listed.

Lucas De Marchi announced kmod 24, which includes enhanced support for kernel module dependency loop detection: ftp://ftp.kernel.org/pub/linux/utils/kernel/kmod/kmod-24.tar.xz

Junio C Hamano announced git version 2.12.0: https://www.kernel.org/pub/software/scm/git/

Con Kolivas announced his Linux-4.10-ck1 MuQSS (Multiple Queue Skiplist Scheduler) version 0.152. More details at: http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/

Ove Kent Karlsen has been performing various Linux gaming experiments. They posted links to YouTube videos showing results with “Doom 3”, which can be found here: https://www.youtube.com/watch?v=xDct6vVvFxA

NUMA node determination

Dou Liyang (Fujitsu) posted several revisions of a patch series entitled “Revert works for the mapping of cpuid <-> nodeid”. This is intended to clean up the process by which (Intel x64 Architecture) systems enumerate the mapping of physical processor IDs to NUMA (Non-Uniform Memory Architecture) multi-socket “node” IDs. Conventionally, Linux uses the MADT (Multiple APIC Description Table – otherwise known as the “APIC” table for legacy reasons). ACPI table to map processors to their “Local APIC ID” (the ID of the core connected to the Intel APIC interrupt controller’s LAPIC CPU interface). It then maps these to NUMA nodes using the _PXM node ID in the ACPI DSDT (Differentiated System Description Table) and determines NUMA topology using the SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table). But this is fragile. Firmware developers are known to make mistakes on occasion, and these have included “duplicated processor IDs in DSDT”, and having the “_PXM in DSDT…inconsistent with the one in [the] MADT”. For this reason, Dou seeks to move the proximity discovery into the system’s hotplug path by reverting two previous commits. Xiaolong Ye (Intel) said he would test these and followup.

As a footnote, it’s worth adding that modern processors have a very  oose notion of a “physical” core, since they usually (internally) support dynamic remapping of true physical cores to the IDs exposed even to system programmers. This affords the illusion of contiguously numbered processors, and prevents an easy analysis of binning and yield characteristics. It’s one of the reasons that processors such as Intel’s use various mapping schemes in order to determine NUMA node proximinity. But one should never assume that any information given about a processor in any table reflects reality other than as a microprocessor company wanted you to perceive it.

Virtual Machine Aware Caches

Shanker Donthineni (Codeaurora) posted “arm64: Add support for VMID aware PIPT instruction cache”. Caches on the ARMv8 architecture are defined to be PIPT (Physically Indexed, Physically Tagged) from a software perspective (although the underlying implementation might be different – for example, you could index virtually with VIPT underneath a PIPT facade if you implemented expensive logic for automatic homonym detection). The ARMv8.2 specification allows “VMID aware PIPT” which means a cache is PIPT but aware of the existence of Virtual Machine IDs (VMIDs), which might form part of the cache entry. Will Deacon responded that the approach “may well cause problems for KVM with non-VHE [Virtual Host Extension – the ability to run “type 2″ hypervisors with split page tables for the kernel and userspace, as opposed to non-VHE implemented on original ARMv8.0 machines in which a shim running with its own page tables is required for KVM] because the host VMID is different from the guest VMID, yet we assume that I-cache invalidation by the host *will* affect the guest when, for example, invalidating the I-cache for pages holding the guest kernel Image”. He noted that he had some other patches in flight that he would post soon (for 4.12).

Advisory Memory Allocations in real life

Shaohua Li (Facebook) posted “mm: fix some MADV_FREE issues”. MADV_FREE is part of relatively recent(ish) kernel infrastructure to support advisory mmaps that the kernel may need to arbitrarily reclaim later when low on available memory. It’s the kind of thing that other Operating Systems (such as Windows) have done for many years (Windows will even dynamically enlarge its swap (paging) file on low memory situations). Facebook apparently like to use the (alternative) “jemalloc” userspace memory allocator and have found a number of issues when attempting to combine this with MADV_FREE flags to mmap. Shaohua notes that MADV_FREE cannot be used on a machine without swap enabled, actually increases memory pressure (due to page reclaim being biases against anonymous pages), and the lack of global accounting. The patches aim to address these.

Non-fixed TASK_SIZE

Martin Schwidefsky and Linus Torvalds had a back and forth discussion about “Using TASK_SIZE for kernel threads”. As kernel programmers know, kernel threads (“tasks”, or “kernel processes” – these show up in brackets in “ps” and “top”) don’t have an associated “mm” struct (they have no userspace). On s390, just to be different, TASK_SIZE is not fixed. It can actually be one of several values that are determined by reading a field in a task’s mm struct (context.asce_limit). This was causing very subtle breakage as the kernel indirected into a null structure which happened to contain a value very close to zero that kinda worked. Martin has a fixed queued up but had some suggestions for changes to make to the kernel to avoid such a subtle issue in future. Linus was more convinced that s390 was just doing something that needed fixing.

Ongoing Development

Elena Reshetova (Intel) posted many patches converting various uses of the kernel’s “atomic_t” datatype as a reference counter over to the new “refcount_t”. As she notes, “[b]y doing this we prevent intentional or accidental underflows or overflows that can le[a]d to use-after-free vulnerabilities”. Examples including architecture and VM code fixes.

Xunlei Pang (Red Hat) posted version 2 of a patch entitled “x86/mce: Don’t participate in rendezvous process once nmi-shootdown_cpus() was  made’. This aims to juggle a post-crash conumdrum: system errors sufficient enough to generate an MCE (Machine Check Exception) should not be ignored (and thus the machine check handler should run in the kernel) but they might be generated during the process of actively taking a crash/kdump. The existing code might instead cause a panic on exit from the (old kernel provided) MCE handler. Borislav Petkov didn’t like some of the details of the patch. He wanted to also see explicit documentation as to the handling of MCEs.

Andy Lutomirski posted “KVM TSS cleanups and speedups”, which aims to refactor how the kernel handles guest TSS (Task Segment Selector) handling on Intel x64 Architecture systems. These are layered upon a series from Thomas Gleixner aimed at cleaning up GDT (Global Descriptor Table) use. He notes that there “may be a slight speedup, too, because they remove an STR [store] instruction from the VMX [Virtual Machine] entry path”.

Heikki Krogerus posted version 17 of a patch series implementing “USB Type-C Connector class” support. This is “meant to provide [a] unified interface to…userspace to present the USB Type-C ports in a system”. Your author is looking forward to trying this on his Dell XPS Skylake with USB-C.

Rob Herring posted a patch “Add SPDX license tag check for dts files and headers” to the kernel’s “checkpatch.pl” patch submission checking tool.

Finally this week, Lorenzo Pieralisi posted “PCI: fix config and I/O Address space memory mappings” intended to address the inconvenient fact that “ioremap” on 32-bit and 64-bit ARM platforms was failing to strictly comply with the PCI local bus specification’s “Transaction Ordering and Posting” requirements. These mandate that PCI configuration cycles (during startup or hotplug) and I/O address space accesses must be “non-posted” (in other words, they must always receive a write notification response and not be buffered arbitrarily). Lorenzo addresses this with a 20 part patch series that cleans this up.

February 28, 2017 07:53 AM

Kees Cook: security things in Linux v4.10

Previously: v4.9.

Here’s a quick summary of some of the interesting security things in last week’s v4.10 release of the Linux kernel:

PAN emulation on arm64

Catalin Marinas introduced ARM64_SW_TTBR0_PAN, which is functionally the arm64 equivalent of arm’s CONFIG_CPU_SW_DOMAIN_PAN. While Privileged eXecute Never (PXN) has been available in ARM hardware for a while now, Privileged Access Never (PAN) will only be available in hardware once vendors start manufacturing ARMv8.1 or later CPUs. Right now, everything is still ARMv8.0, which left a bit of a gap in security flaw mitigations on ARM since CONFIG_CPU_SW_DOMAIN_PAN can only provide PAN coverage on ARMv7 systems, but nothing existed on ARMv8.0. This solves that problem and closes a common exploitation method for arm64 systems.

thread_info relocation on arm64

As done earlier for x86, Mark Rutland has moved thread_info off the kernel stack on arm64. With thread_info no longer on the stack, it’s more difficult for attackers to find it, which makes it harder to subvert the very sensitive addr_limit field.

linked list hardening
I added CONFIG_BUG_ON_DATA_CORRUPTION to restore the original CONFIG_DEBUG_LIST behavior that existed prior to v2.6.27 (9 years ago): if list metadata corruption is detected, the kernel refuses to perform the operation, rather than just WARNing and continuing with the corrupted operation anyway. Since linked list corruption (usually via heap overflows) are a common method for attackers to gain a write-what-where primitive, it’s important to stop the list add/del operation if the metadata is obviously corrupted.

seeding kernel RNG from UEFI

A problem for many architectures is finding a viable source of early boot entropy to initialize the kernel Random Number Generator. For x86, this is mainly solved with the RDRAND instruction. On ARM, however, the solutions continue to be very vendor-specific. As it turns out, UEFI is supposed to hide various vendor-specific things behind a common set of APIs. The EFI_RNG_PROTOCOL call is designed to provide entropy, but it can’t be called when the kernel is running. To get entropy into the kernel, Ard Biesheuvel created a UEFI config table (LINUX_EFI_RANDOM_SEED_TABLE_GUID) that is populated during the UEFI boot stub and fed into the kernel entropy pool during early boot.

arm64 W^X detection

As done earlier for x86, Laura Abbott implemented CONFIG_DEBUG_WX on arm64. Now any dangerous arm64 kernel memory protections will be loudly reported at boot time.

64-bit get_user() zeroing fix on arm
While the fix itself is pretty minor, I like that this bug was found through a combined improvement to the usercopy test code in lib/test_user_copy.c. Hoeun Ryu added zeroing-on-failure testing, and I expanded the get_user()/put_user() tests to include all sizes. Neither improvement alone would have found the ARM bug, but together they uncovered a typo in a corner case.

no-new-privs visible in /proc/$pid/status
This is a tiny change, but I like being able to introspect processes externally. Prior to this, I wasn’t able to trivially answer the question “is that process setting the no-new-privs flag?” To address this, I exposed the flag in /proc/$pid/status, as NoNewPrivs.

That’s all for now! Please let me know if you saw anything else you think needs to be called out. :) I’m already excited about the v4.11 merge window opening…

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

February 28, 2017 06:31 AM

February 27, 2017

Dave Airlie: radv + steamvr

If anyone wants to run SteamVR on top of radv, the code is all public now.

https://github.com/airlied/mesa/tree/radv-wip-steamvr

The external memory code will be going upstream to master once I clean it up a bit, the semaphore hack is waiting on kernel
changes, and the NIR shader hack is waiting on a new SteamVR build that removes the bad use of SPIR-V.

I've run Serious SAM TFE in VR mode on this branch.

February 27, 2017 07:42 PM

Matthew Garrett: The Fantasyland Code of Professionalism is an abuser's fantasy

The Fantasyland Institute of Learning is the organisation behind Lambdaconf, a functional programming conference perhaps best known for standing behind a racist they had invited as a speaker. The fallout of that has resulted in them trying to band together events in order to reduce disruption caused by sponsors or speakers declining to be associated with conferences that think inviting racists is more important than the comfort of non-racists, which is weird in all sorts of ways but not what I'm talking about here because they've also written a "Code of Professionalism" which is like a Code of Conduct except it protects abusers rather than minorities and no really it is genuinely as bad as it sounds.

The first thing you need to know is that the document uses its own jargon. Important here are the concepts of active and inactive participation - active participation is anything that you do within the community covered by a specific instance of the Code, inactive participation is anything that happens anywhere ever (ie, active participation is a subset of inactive participation). The restrictions based around active participation are broadly those that you'd expect in a very weak code of conduct - it's basically "Don't be mean", but with some quirks. The most significant is that there's a "Don't moralise" provision, which as written means saying "I think people who support slavery are bad" in a community setting is a violation of the code, but the description of discrimination means saying "I volunteer to mentor anybody from a minority background" could also result in any community member not from a minority background complaining that you've discriminated against them. It's just not very good.

Inactive participation is where things go badly wrong. If you engage in community or professional sabotage, or if you shame a member based on their behaviour inside the community, that's a violation. Community sabotage isn't defined and so basically allows a community to throw out whoever they want to. Professional sabotage means doing anything that can hurt a member's professional career. Shaming is saying anything negative about a member to a non-member if that information was obtained from within the community.

So, what does that mean? Here are some things that you are forbidden from doing:


Now, clearly, some of these are unintentional - I don't think the authors of this policy would want to defend the idea that you can't report something to the police, and I'm sure they'd be willing to modify the document to permit this. But it's indicative of the mindset behind it. This policy has been written to protect people who are accused of doing something bad, not to protect people who have something bad done to them.

There are other examples of this. For instance, violations are not publicised unless the verdict is that they deserve banishment. If a member harasses another member but is merely given a warning, the victim is still not permitted to tell anyone else that this happened. The perpetrator is then free to repeat their behaviour in other communities, and the victim has to choose between either staying silent or warning them and risk being banished from the community for shaming.

If you're an abuser then this is perfect. You're in a position where your victims have to choose between their career (which will be harmed if they're unable to function in the community) and preventing the same thing from happening to others. Many will choose the former, which gives you far more freedom to continue abusing others. Which means that communities adopting the Fantasyland code will be more attractive to abusers, and become disproportionately populated by them.

I don't believe this is the intent, but it's an inevitable consequence of the priorities inherent in this code. No matter how many corner cases are cleaned up, if a code prevents you from saying bad things about people or communities it prevents people from being able to make informed choices about whether that community and its members are people they wish to associate with. When there are greater consequences to saying someone's racist than them being racist, you're fucking up badly.

comment count unavailable comments

February 27, 2017 01:40 AM

February 26, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: What if I Knew Then What I Know Now?

During my keynote at the 2017 Multicore World, Mark Moir asked what I would have done differently if I knew then what I know now, with the “then” presumably being the beginning of the RCU effort back in the early 1990s. Because I got the feeling that my admittedly glib response did not fully satisfy Mark, I figured I should try again. So imagine that you traveled back in time to the very end of the year 1993, not long after Jack Slingwine and I came up with read-copy lock (now read-copy update, or just RCU), and tried to pass on a few facts about my younger self's future. The conversation might have gone something like this:

You  By the year 2017, RCU will be part of the concurrency curriculum at numerous universities and will be very well-regarded in some circles.
Me  Nice! That must mean that DYNIX/ptx will also be doing well!

You  Well, no. DYNIX/ptx will disappear by 2005, being replaced by the combination of IBM's AIX and another operating system kernel started as a hobby.
Me  AIX??? Surely you mean Solaris, HP-UX or Ultrix! And I wouldn't say that BSD started as a hobby! It was after all fully funded research.

You  No, Sun Microsystems was acquired by Oracle in 2010, and Solaris was already in decline by that time. IBM's AIX was by then the last proprietary UNIX operating system standing. A new open-source kernel called "Linux" became the dominant OS.
Me  IBM??? But they are currently laying off more people each month than Sequent employs worldwide!!! Why would they even still be in business in 2010?

You  True. But their new CEO, Louis Gerstner, will turn IBM around.
Me  Well, yes, he did just become IBM's CEO, but before that he was CEO of RJR Nabisco. That should work about as well as John Sculley's tenure as CEO of Apple. What does Gerstner know about computers, anyway?

You  He apparently knew enough to get IBM back on its feet. In fact, IBM will buy Sequent, so that you will become an IBM employee on April 1, 2000.
Me  April Fools day? Now I know you are joking!!!

You  No joke. You will become an IBM employee on April 1, 2000, seven years to the day after Louis Gerstner became an IBM employee.
Me  OK, I guess that explains why DYNIX/ptx doesn't make it past 2005. That is really annoying! So the teaching of RCU in universities is some sort of pity play, then?

You  No. Dipankar Sarma will get RCU accepted into Linux in 2002.
Me  I could easily believe that—he is very capable. So what do I do instead?

You  You will take over maintainership of RCU in 2005.
Me  Is Dipankar going to be OK?

You  Of course! He will just move on to other projects. It is just that there will be a lot more work needed on RCU, which you will take on.
Me  What more work could there be? It is a pretty simple mechanism, way simpler than a memory allocator, for example.

You  Well, there will be quite a bit of scalability work needed. For example, you will receive a scalability bug report involving a 512-CPU shared-mmeory system.
Me  Hmmm... It took Sequent from 1985 to 1997 to get from 30 to 64 CPUs, so that is doubling every 12 years, so I am guessing that I received this bug report somewhere near the year 2019. So what did I do in the meantime?

You  No, you will receive this bug report in 2004.
Me  512-CPU system in 2004??? Well, suspending disbelief, this must be why I will start maintaining RCU in 2005.

You  No, a quick fix will be supplied by a guy named Manfred Spraul, who writes concurrent Linux-kernel code as a hobby. So you didn't do the scalability work until 2008.
Me  Concurrent Linux-kernel coding as a hobby? That sounds unlikely. But never mind. So what did I do between 2005 and 2008? Surely it didn't take me three years to create a highly scalable RCU implementation!

You  You will work with a large group of people adding real-time capabilities to the Linux kernel. You will create an RCU implementation that allowed readers to be preempted.
Me  That makes absolutely no sense! A context switch is a quiescent state, so preempting an RCU read-side critical section would result in a too-short grace period. That most certainly isn't going to help anything, given that a crashed kernel isn't going to offer much in the way of real-time response!

You  I don't know the details, but you will make it work. And this work will be absolutely necessary for the Linux kernel to achieve 20-microsecod interrupt and scheduling latencies.
Me  Given that this is a general-purpose OS, you obviously meant 20 milliseconds!!! But what could RCU possibly be doing that would contribute significantly to a 20-millisecond interrupt/scheduling delay???

You  No, I really did mean sub-20-microsecond latencies. By 2010 or so, even vanilla non-realtime Linux kernel will easily meet 20-millisecond latencies, assuming the hardware and software is properly configured.
Me  Ah, got it! CPU core clock rates should be somewhere around 50GHz by 2010, which might well make those sorts of latencies achievable.

You  No, power-consumption and heat-dissipation constraints will cap CPU core clock frequencies at about 5GHz in 2003. Most systems will run in the 1-3GHz range even as late as in 2017.
Me  Then I don't see how a general-purpose OS could possibly achieve sub-20-microsecond latencies, even on a single-CPU system, which wouldn't have all that much use for RCU.

You  No, this will be on SMP systemss. In fact, in 2012, you will receive a bug report complaining of excessively long 200-microsecond latencies on a system running 4096 CPUs.
Me  Come on! I believe that Amdahl's Law has something to say about lock contention on such large systems, which would rule out reasonable latencies, let alone 200-microsecond latencies! And there would be horrible reliability problems with that many CPUs! You wouldn't be able to keep the system running long enough to measure the latency!!!

You  Hey, I am just telling you what will happen.
Me  OK, so after I get RCU to handle insane scalability and real-time response, there cannot be anything left to do, right?

You  Actually, wrong. Energy efficiency becomes extremely important, and you will rewrite the energy-efficiency RCU code more than eight times before you get it right.
Me  Eight times??? You must be joking!!! Seems like it would be better to just waste a little energy. After all, computers don't consume all that much energy, especially compared to industrial and transportation systems.

You  No, that would not work. By 2005, there are quite a few datacenters that are limited by electrical power rather than by floor space. So much so that large data centers open in Eastern Oregon, on the sites of the old aluminum smelters. When you have that many servers, even a few percent of energy savings translates to millions of dollars a year, which is well worth spending some development effort on.
Me  That is an insanely large number of servers!!! How many Linux instances are running by that time, anyway?

You  By the mid-2010s, the number of Linux instances is well in excess of one billion, but no one knows the exact number.
Me  One billion??? That is almost one server for every family in the world! No way!!!

You  Well, most of the Linux instances are not servers. There are a lot of household appliances running Linux, to say nothing of battery-powered handl-held smartphones. By 2017, most of the smartphones will have multiple CPUs.
Me  Why on earth would you need multiple CPUs to make a phone call? And how would you fit multiple CPUs into a hand-held device? And where do you put the battery, in a large backpack or something???

You  No, the entire device, batteries, CPUs and all, will fit easily into your shirt pocket. And these smartphones can take pictures, record video, do video conference calls, find precise locations using GPS, translate among multiple languages, and much else besides. They are really full-fledged computers that fit in your pocket.
Me  A pocket-sized supercomputer??? And how would I possibly go about testing RCU code sufficiently for your claimed billion instances???

You  Interesting question. You will give a keynote at the 2017 Multicore World in February 2017 at Wellington, New Zealand describing some of your plans. These plans include the use of formal verification in your regression test suite.
Me  Formal verification of highly concurrent code in a regression test suite??? OK, now I know for sure that you are pulling my leg! It has been an interesting conversation, but I must get back to reality!!!


My 1993 self did not have a very accurate view of 2017, did he? As the old saying goes, predictions are hard, especially about the future! So it is quite wise to take such predictions with a considerable supply of salt.

February 26, 2017 11:09 PM

Pavel Machek: Using Linux notebook as an alarm clock

Is someone using notebook as an alarm clock? Yes, it would be easy if I did not suspend machine overnight, but that would waste power and produce noise from fans. I'd like version that suspends the machine...

February 26, 2017 10:32 PM