Kernel Planet

March 17, 2019

Michael S. Tsirkin:

Virtio Network Device Failover


Support for Virtio Network Device Failover, which was merged in Linux 4.17, presents an interesting study in interface design, both for operating systems and for hypervisors. Read on for an article examining the problem domain and the solution space, and describing the current status of the implementation.


PT versus PV NIC

Imagine a Virtual Machine running on a hypervisor on a host computer. The hypervisor has access to a network to which the host is attached, but how should the guest gain this access? The answer could depend on the type of the network and on the network interface on the host. For the sake of this article we focus on Ethernet networks and NICs. In this setup a popular solution extends (bridges) the Ethernet network into the guest by exposing a virtual Ethernet device as part of the VM.

In most setups a single host NIC would be shared between VMs. Two popular configurations are shown below:

vm network configuration

In the first diagram (on the left) the NIC exposes Virtual Function (VF) interfaces which the hypervisor “passes through” - makes accessible - to the guests. Using such Passthrough (PT) interfaces, packets can pass between the guest and the NIC directly. For PCI devices, device memory can actually be mapped into the address space of the virtual machine in such a way that the guest can access the device without invoking the hypervisor. In the setup on the right, packets are passed between the guest and the NIC by the hypervisor. The hypervisor interface used by the guest for this purpose would commonly be a PV - Para-virtual (i.e. designed for the hypervisor) NIC. One example would be the Virtio network device, used for example with the KVM hypervisor. By comparison, Microsoft HyperV guests use the netvsc device with its PV NICs.

Since the underlying packets are still handled by the physical NIC in both cases, it would be unusual for the second (PV) setup to outperform the first (PT) one. Besides removing some of the hypervisor overhead, passthrough allows the driver within the guest to be precisely tuned to the physical device.

However, the PV NIC setup obviously offers more flexibility - for example, the hypervisor can implement an arbitrary filtering policy for the networking packets. By comparison, with PT NICs we are limited to the features presented by the hardware, which are often more limited: some NICs only have the simplest filtering capabilities. As an example of a simple and effective filtering/security measure, the guest would often be prevented from modifying the MAC address of its devices, limiting the guest’s access to the host’s network.

But hardware limitations aside, a standardized interface independent of the physical NIC makes the system easier to manage: the use of a standard driver within the guest, as well as a well-known device state, enables features such as live migration of guests between hypervisors, where guests can often be moved with negligible network downtime.

The same cannot generally be said of the passthrough setup. For example, one of the issues encountered with it is that even a small difference between hypervisor hosts in their physical hardware would require expensive reconfiguration when switching hypervisors.

Can something be done about performance, so as to get the speed benefits of passthrough without giving up live migration and the other advantages of standardized PV NIC setups? One approach could be designing a passthrough NIC around a standard paravirtualized interface. This is the approach taken by the Virtio Data Path Accelerator devices. In the absence of such an accelerator, Virtual Network Device Failover presents another possible approach.

Network device Failover basics

Conceptually, the idea behind Virtual Network Device Failover is simple: assume that a standard PV interface only brings benefits part of the time. The system changes its configuration accordingly - e.g. when migration is required, use the PV interface; when it’s not, use a PT device.

When possible, the hypervisor passes the NIC through to the guest as a “primary” device. To reduce downtime, a “standby” PV device is kept around at all times. When PV features are not required, the hypervisor can give the guest access to the primary PT device. At other times the standby PV interface is used.

Accordingly, the guest is required to switch over between the primary and the standby interfaces depending on the availability of the primary interface.

network device failover basics

An astute reader might notice that the above switching sounds a bit like the active-backup configuration of the bond and team network drivers in Linux. That is true - in fact in the past one of these drivers has often been used to implement network device failover. Let’s take a quick look at how active-backup can be used for network device failover.

Network Device Failover using active-backup

This text will use the term bond when meaning the network device created by either the bond or the team driver: the differences between the two mostly have to do with how devices are created, configured and destroyed, and will not be covered here.

A bond device is a master software network device which can enslave multiple interfaces; in our case these would be the standby and the primary devices. For this, the bond master needs to be created and initialized with the slave interface names before the slaves are brought up. When the priority of the primary interface is set higher than that of the standby, the bond switches between the interfaces as required for failover.
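
For concreteness, here is a minimal sketch of driving the bonding sysfs ABI from C. The interface names eth0 (standby) and eth1 (primary) are placeholders, and error handling is reduced to perror() for brevity:

/* Minimal sketch: set up an active-backup bond through the bonding
 * sysfs ABI (Documentation/networking/bonding.txt).  Assumes the
 * bonding module is loaded; interface names are placeholders. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static int sysfs_write(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0 || write(fd, val, strlen(val)) < 0) {
        perror(path);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* Create the bond master device. */
    sysfs_write("/sys/class/net/bonding_masters", "+bond0");
    /* The mode must be set before any slaves are added. */
    sysfs_write("/sys/class/net/bond0/bonding/mode", "active-backup");
    /* Enslave both devices; they must be down at this point. */
    sysfs_write("/sys/class/net/bond0/bonding/slaves", "+eth0");
    sysfs_write("/sys/class/net/bond0/bonding/slaves", "+eth1");
    /* Prefer the passthrough device whenever it is present. */
    sysfs_write("/sys/class/net/bond0/bonding/primary", "eth1");
    return 0;
}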

The active-backup mode was designed to help create redundancy and improve uptime for systems with multiple NIC devices. To make it work for the virtual machine, we need the guest to detect interface failure on the primary interface and switch to the standby one. This can be achieved, for example, by having the hypervisor emulate a hotplug removal request for the interface.

However, the above already hints at some of the issues with this approach to failover. First, the bond needs to be set up by userspace. Configuring a bond for all devices unconditionally would be an option, but it would add overhead for all users. On the other hand, adding a slave to a bond requires bringing the slave down. For this reason, to avoid downtime, the bond has to be created upfront, even if only the standby device is present during guest initialization.

Further, setting up an active-backup bond is considered a question of policy and is thus left up to the guest admin. By comparison, network failover is a mechanism - there’s no good reason not to use a PT interface if it is available to the guest. Should the hypervisor want to force the guest to create a bond, the hypervisor would need a measure of control over the guest network configuration, which might conflict with the way some guest admins like to set up their networking.

Another issue is device selection. Bond tends to address devices using their names. While device names have recently become more predictable under many Linux distributions, that is not the case for all distributions, and the specific naming schemes might differ. It is thus a challenge for the hypervisor to specify to the guest which interfaces need to be bonded together.

To help reduce downtime, the bond also broadcasts location information on the network on every switch-over. This is not too problematic, but it might cause extra load on the network - likely unnecessary in the case of virtual device failover, since the packets end up traveling over the same physical wire anyway.

Maintaining a consistent MAC address for the guest is necessary to avoid all of the guest’s neighbours having to rediscover the MAC address using the slow ARP/Neighbour Discovery process. To help with that, the bond tries to program the MAC address into the primary device when it’s attached. If MAC programming is disabled as a security measure (as described above), the bond will generally fail to attach to this slave.

Failover goals; 1, 2 and 3 device models

The goal of the network device failover support in Linux is to address the above problems. Specifically:

- PT cards with MAC programming disabled need to be supported
- configuration should happen automatically, with no need for userspace to make a policy decision; in particular, the primary/standby pair of devices should be selected with no need for special configuration to be passed from the hypervisor
- as wide a range of existing network setup tools as possible should be supported, with minimal changes

Most of the design falls out of the above goals in a more or less straightforward manner:

- the design supports two devices: a standby PV device is present at all times and used by default; a primary PT device is used by preference when it’s available
- failover support is initialized by the PV device driver; e.g. in the case of Virtio this happens when the virtio-net driver detects a special feature bit set by the hypervisor on the Virtio PV device
- to support devices without MAC programming, both standby and primary can simply be required to be initialized (e.g. by the hypervisor) with the same MAC address; in that case, the MAC address can also be used by failover to locate and enslave the primary device (a sketch of this matching follows below)
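
The MAC-based pairing in the last point amounts to watching for new network devices and comparing addresses. Below is a condensed sketch in the spirit of the net_failover code, not the verbatim kernel source; get_failover_master() and enslave() are hypothetical helpers standing in for the real bookkeeping:

/* Sketch: a netdevice notifier pairs a newly registered device with
 * the failover master purely by MAC address - no names, no admin
 * supplied configuration. */
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

static int failover_event(struct notifier_block *nb,
                          unsigned long event, void *ptr)
{
    struct net_device *slave = netdev_notifier_info_to_dev(ptr);
    struct net_device *master = get_failover_master(); /* hypothetical */

    if (event != NETDEV_REGISTER || slave == master)
        return NOTIFY_DONE;

    if (!ether_addr_equal(slave->dev_addr, master->dev_addr))
        return NOTIFY_DONE;

    enslave(master, slave); /* hypothetical: bind as primary slave */
    return NOTIFY_OK;
}

static struct notifier_block failover_nb = {
    .notifier_call = failover_event, /* register_netdevice_notifier() */
};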

However, the requirement to minimize userspace changes caused a certain amount of debate about the best way to model the failover setup, with the debate centered around the number of network device structures being created and exposed to userspace. The options that have been debated are listed below:

1-device model

In the 1-device model, userspace sees a single failover device at all times, and at any given time this device is either the PT or the PV device. As userspace might need to configure devices differently depending on the specific driver used, a new interface would have to be introduced for the kernel to report driver changes to userspace, and for userspace to detect the actual driver in use. However, as long as userspace does not contain any driver-specific code, userspace tools that already work with the Virtio device seem guaranteed to keep working without any changes, but with better performance.

To best of author’s knowledge, no actual code supporting this mode has ever been posted.

2-device model

In the 2-device model, the standby and primary devices are exposed to userspace as two network devices. The devices aren’t independent: the primary device is a slave and the standby is the master, in the sense that when the primary is present, the standby device forwards outgoing packets to it for transmission.

PT driver discovery and device specific configuration can happen on the slave interface using standard device discovery interfaces.

Portable configuration affecting both PV and PT devices (such as the interface MTU), as well as configuration that is specific to the PV device, happens on the master interface.

The 2-device model is used by the netvsc driver in Linux. It has been used in production for a number of years with no significant issues reported. However, it diverges from the model used by the bond driver, and the combination of PV-specific and portable configuration on the master device was seen by some developers as confusing.

3-device model

The 3-device model basically follows bond: a master failover device forwards packets to either the primary or the standby slave, depending on the primary’s availability.

The failover device maintains the portable configuration; the primary and the standby can each have their own driver-specific configuration.

This model is used by the net_failover driver, which has been present in Linux since version 4.17. This model isn’t transparent to userspace: for example, the presence of at least two devices (failover master and primary slave) at all times seems to confuse some userspace tools such as dracut, udev, initramfs-tools and cloud-init. Most of these tools have since been updated to check the slave flag of each interface and ignore interfaces where it is set, as in the sketch below.
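
For instance, an updated tool might enumerate interfaces roughly as in this minimal userspace sketch, skipping anything with the slave flag set and acting only on the remaining devices:

/* Sketch: skip enslaved interfaces (the failover primary and standby)
 * the way updated userspace tools do, keying off IFF_SLAVE. */
#include <stdio.h>
#include <ifaddrs.h>
#include <net/if.h>

int main(void)
{
    struct ifaddrs *ifa, *p;

    if (getifaddrs(&ifa))
        return 1;
    for (p = ifa; p; p = p->ifa_next) {
        if (p->ifa_flags & IFF_SLAVE)
            continue; /* enslaved primary/standby: leave it alone */
        printf("configurable: %s\n", p->ifa_name);
    }
    freeifaddrs(ifa);
    return 0;
}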

3-device model with hidden slaves

It is possible that the compatibility of the 3-device model with existing userspace could be improved by somehow hiding the slave devices from most legacy userspace tools, unless they explicitly ask for them.

For example, it might be possible to move them to some kind of special network namespace. No patches implementing this idea have been posted so far.

Hypervisor failover support

At the time of this writing, support for virtual network device failover in the QEMU/KVM hypervisor is still being worked on. This work has uncovered a surprising number of subtle issues, some of which are covered below.

Primary device availability

The network failover driver relies on hotplug events for primary device availability. In other words, to make the primary device available to the guest, the hypervisor emulates a hot-add hotplug event on a bus within the VM (e.g. the virtual PCI bus). To make the primary device unavailable, a hot-unplug event is emulated.

Note that at the moment most PCI drivers expect a chance to be notified and to execute cleanup before a device is removed. From the hypervisor’s point of view, this means that it cannot remove the PT device, and e.g. cannot initiate migration, until it receives a response from the VM guest. Making the hypervisor depend on the guest being responsive in this way is problematic, e.g. from the security point of view.

As described earlier in an lwn.net article, most drivers do not at the moment support surprise removal well. When that is addressed, hypervisors will be able to switch to emulating surprise removal, removing the dependency on guest responsiveness.

Existing Guest compatibility

One of the issues that hypervisors take pains to handle well is compatibility with existing guests, that is, guests which have not been modified to support virtual network device failover.

One possible issue is that existing guests can become confused if they detect two Ethernet devices with the same MAC address.

To help address this issue, the hypervisor can defer making the primary device visible to the guest until after the PV driver has been initialized. The PV driver can signal the guest’s support for virtual network device failover to the hypervisor.

For example, in the case of the virtio-net driver, the hypervisor can signal failover support to the guest by setting the VIRTIO_NET_F_STANDBY host feature bit on the Virtio device. If failover is enabled, the driver can signal its own failover support to the hypervisor by acknowledging the matching VIRTIO_NET_F_STANDBY guest feature bit on the device.
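
In the driver, this negotiation reduces to a feature-bit check at probe time. The following is a condensed sketch of just that step of the virtio-net probe path, not the full function:

/* Condensed sketch: if both sides negotiated VIRTIO_NET_F_STANDBY,
 * create a failover master on top of the virtio-net device. */
#include <linux/err.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>
#include <net/net_failover.h>

static int virtnet_setup_failover(struct virtio_device *vdev,
                                  struct net_device *dev,
                                  struct failover **failover)
{
    /* True only if the host offered the bit and the guest acked it. */
    if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
        *failover = net_failover_create(dev);
        if (IS_ERR(*failover))
            return PTR_ERR(*failover);
    }
    return 0;
}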

After detecting a modern guest with failover support, the hypervisor can hot-add the primary device. The device has to be hot-removed again on guest reset, in case the VM reboots into a legacy guest without failover support.

This also helps avoid initializing a useless failover device on hypervisors without actual failover support.

As of the time of writing of this article, the definition of VIRTIO_NET_F_STANDBY and its support are present in Linux. Some preliminary hypervisor patches, with known issues, have been posted.

Packet filtering issues

Early implementations of failover in QEMU were tested with an emulated NIC. When tested on a physical one, it quickly became apparent that significant downtime occurs in many configurations.

The reason has to do with how incoming packets are processed by the host NIC. Generally, a packet is matched against some rules (e.g. the destination MAC is matched using a forwarding filter) and a decision is made to forward the packet either to the hypervisor or to a guest through a VF.

incoming packet filtered

Consider again a hypervisor transitioning from a configuration where the primary passthrough VF is available to the guest to one where it is unavailable.

When the primary device is available to the guest, we want incoming packets with a destination MAC matching the device to be forwarded through the primary. In many configurations this happens immediately once the hypervisor programs the MAC into the VF. In these setups, when the primary device becomes unavailable to the guest, incoming packets will, unless special steps are taken, still be filtered to it and eventually dropped.

incoming packet being dropped by device

One possible fix is to have the hypervisor update the host NIC filtering, e.g. by updating the MAC of the VF to a different value. Another is to change the filtering on the host NIC such that it only takes effect when a driver is attached to the VF. This already seems to be the case for some drivers (such as ice and mlx), so one can argue that others should be changed to behave consistently. Yet another approach would be to teach the hypervisor to detect the difference and handle both types of behaviour.

Conversely, when the primary interface becomes available to the guest, we would like packets to start flowing through the primary, but only after the driver is bound to it. Again, on some devices the hypervisor might need to intervene to update the forwarding filter on the host NIC. One issue is that it might take the guest a while to detect the hot-add event and bind a driver to the primary device, because hotplug is not generally considered a data path operation. Should the host NIC filter be updated by the hypervisor immediately on hot-add, there would be a large window during which the guest driver has not been initialized yet.

incoming packet being dropped by driver

As a possible fix, hypervisors can detect that the passthrough driver has been attached to the device. For example, drivers enable bus-mastering on the device when they start using it, and disable it when they stop using it. The hypervisor can detect this event and update the forwarding filter on the host NIC accordingly.
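
On the PCI level, bus mastering is bit 2 of the COMMAND register at config-space offset 4. A hypervisor typically traps the guest's config-space write directly, but the signal can be illustrated with a userspace sketch that reads a VF's config space through sysfs; the device address below is a placeholder:

/* Sketch: check the Bus Master Enable bit of a VF to infer whether a
 * guest driver is actively using the device.  The PCI address is a
 * placeholder for the passed-through VF. */
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define VF_CONFIG "/sys/bus/pci/devices/0000:3b:02.0/config"

int vf_driver_attached(void)
{
    uint16_t cmd;
    int fd = open(VF_CONFIG, O_RDONLY);

    if (fd < 0)
        return -1;
    if (pread(fd, &cmd, sizeof(cmd), 4) != sizeof(cmd)) {
        close(fd);
        return -1;
    }
    close(fd);
    return !!(cmd & 0x4); /* bit 2: Bus Master Enable */
}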

QEMU patches addressing both issues have been posted on the QEMU mailing list.

An alternative could be to add a way for the guest to request the switch between primary and standby through the PV device driver. This might reduce the downtime slightly: some PT drivers enable bus mastering before they are fully ready to receive packets, causing a small window during which packets are still dropped.

This alternative approach is used by the netvsc driver. Using it with net_failover would require extending the Virtio interface and adding support to the net_failover driver in Linux; as of today, no patches implementing this change have been posted.

As described above, some differences in behaviour between host NICs make the failover implementation harder. While not yet widely supported, the use of VF representors could make it easier to configure host NICs consistently for use by failover. However, for this to be helpful, wide support across many NICs would be necessary.

Non-MAC based pairing

One basic question that had to be addressed early in the design was: how does the failover master decide which slave devices to bind to? Unlike bond, failover by design cannot rely on the administrator supplying the configuration.

So far, implementations have focused on matching MAC addresses as a way to pair slave devices. However, in some configurations (sometimes called trusted VFs), VF MAC addresses are not supplied by the hypervisor.

This seems to call for an alternative mechanism for locating the primary that is not based on the MAC address.

The netvsc driver uses a serial number value to locate the primary device. The serial is typically communicated through the VMBus interface and attached to a para-virtual PCI bus slot created for the device. QEMU/KVM traditionally does not have a para-virtual bus implementation, relying instead on emulating a PCI bus for VMs. One possible approach for QEMU would be to attach an ID value to a PCI slot or bridge. For example, an ACPI Slot Unique Number, the PCI Physical Slot Number register, or an alternative vendor-specific ID register could fit this purpose. The ID would be supplied to the VM guest through the Virtio device. The failover driver would locate the slot based on the ID, and bind to any device located behind the slot. It would then program the MAC address from the standby device into the primary device.

An early implementation of this idea has been posted on the QEMU mailing list, however no patches to the failover driver have been posted yet.

Host network topology and other optimizations

In some configurations it might be better for the guest to use the PV interface in preference to the passthrough one. For example, if the PCI bus is very busy, and there’s spare CPU capacity on the host, it might be faster to send a packet that is destined to another VM on the same host through the hypervisor, bypassing the PCI bus.

This seems to call for keeping both interfaces active at all times. Supporting such an optimization would need to address the possibility of VM migration as well as the dynamic nature of the available CPU and PCI bus capacity, such that the specific interface used for sending packets to each destination can change at any time.

No patches for such support have been posted as of the time of writing of this article.

Specification status

Definition of the VIRTIO_NET_F_STANDBY has been included in the latest Virtio specification draft virtio-v1.1-csprd01.

Non-Linux/non-KVM support

Besides Linux, which systems could benefit from virtual network device failover support?

The DPDK set of userspace drivers is set to gain this support soon.

Drivers for other operating systems could also benefit from increased performance. One can expect the work on these drivers to start in earnest once the hypervisor support is widely available.

Other virtual devices besides Virtio could implement failover. netvsc already has a 2-device implementation that does not rely on the net_failover driver. It is possible that xen-netfront or vmxnet devices could use the failover driver. The author is not familiar with these devices.

Summary

The straightforward-sounding idea of improving performance for a Virtio network device by allowing networking traffic for the VM to temporarily travel over a pass-through device exposed a wealth of issues on both the VM host and guest sides.

Acknowledgements

The author thanks Jens Freimann for help analyzing netvsc as well as for proof-reading the draft and suggesting corrections. The author also thanks the multiple contributors who worked on the implementation and helped review and guide the feature design over time.

March 17, 2019 05:22 AM

March 12, 2019

Kees Cook: security things in Linux v5.0

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that the entire kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
...
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
...
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>

with:

8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
...
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
...
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series, which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, so getting the kernel ready to ignore non-VA bits is common groundwork. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on the heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-free and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).
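
For illustration, here is the class of bug in question and the annotation style GCC's -Wimplicit-fallthrough accepts; do_one() and do_two() are hypothetical helpers:

/* Sketch of the bug class -Wimplicit-fallthrough catches: without a
 * break or an explicit annotation, one case silently runs the next. */
#include <errno.h>

extern void do_one(int arg); /* hypothetical helpers */
extern void do_two(int arg);

int dispatch(int cmd, int arg)
{
    switch (cmd) {
    case 1:
        do_one(arg);
        /* fall through - annotation recognized by GCC */
    case 2:
        do_two(arg);
        break; /* forgetting this break would silently hit the
                * default error path: the bug being hunted */
    default:
        return -EINVAL;
    }
    return 0;
}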

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).
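
A typical conversion is mechanical; the following is a representative sketch rather than any specific commit:

/* Representative sketch of an atomic_t -> refcount_t conversion (not
 * from a specific commit).  refcount_t saturates instead of wrapping,
 * turning a counter overflow into a WARN plus a leak rather than a
 * potential use-after-free. */
#include <linux/refcount.h>
#include <linux/slab.h>

struct session {
    refcount_t refs;                  /* was: atomic_t refs; */
};

static void session_get(struct session *s)
{
    refcount_inc(&s->refs);           /* was: atomic_inc(&s->refs); */
}

static void session_put(struct session *s)
{
    /* was: if (atomic_dec_and_test(&s->refs)) kfree(s); */
    if (refcount_dec_and_test(&s->refs))
        kfree(s);
}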

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

March 12, 2019 11:04 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track talk proposals for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

For more information on submitting a Refereed-Track talk proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

The call for Microconferences proposals is also open, and we hope to see you in Lisbon this coming September!

March 12, 2019 04:22 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Microconference Proposals

We are pleased to announce the Call for Microconferences for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, services, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

Look for the upcoming call for refereed-track proposals, and we hope to see you in Lisbon this coming September!

March 12, 2019 12:07 AM

March 07, 2019

Michael Kerrisk (manpages): man-pages-5.00 is released

I've released man-pages-5.00. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 130 contributors. The release is rather larger than average, since it has been nearly a year since the last release. The release includes more than 600 commits that changed nearly 400 pages. In addition, 3 new manual pages were added.

Among the more significant changes in man-pages-5.00 are the following:

In addition, two pages have been removed in this release after encouragement from Ingo Schwarze: mdoc(7) and mdoc.samples(7). As the commit message notes, groff_mdoc(7) from the groff project provides a better equivalent of mdoc.samples(7) and the mandoc project provides a better mdoc(7). And nowadays, there are virtually no pages in man-pages that use mdoc markup.

Again special thanks to Eugene Syromyatnikov, who contributed nearly 30 patches to this release!

March 07, 2019 04:57 AM

March 06, 2019

James Bottomley: Using TPM Based Client Certificates on Firefox and Apache

One of the useful features of Apache (or indeed any competent web server) is the ability to use client side certificates. All this means is that a certificate from each end of the TLS transaction is verified: the browser verifies the website certificate, but the website requires the client also to present one and verifies it. Using client certificates linked to your own client certificate CA gives web transactions the strength of two factor authentication if you do it on the login page. I use this feature quite a lot for all the admin features my own website does. With apache it’s really simple to turn on with the

SSLCACertificateFile

Directive, which allows you to specify the CA for the accepted certificates. In my own setup I have my own self-signed certificate as CA and then all the authority certificates use it as the issuer. You can turn on Client Certificate verification on a per-location basis simply by doing

<Location /some/web/location>
SSLVerifyClient require
</Location>

And Apache will take care of requesting the client certificate and verifying it against the CA. The only caveat here is that TLSv1.3 currently fails to work for this, so you have to disable it with

SSLProtocol -TLSv1.3

Client Certificates in Firefox

Firefox is somewhat hard to handle for SSL because it includes its own hand written mozilla secure sockets code, which has a toolkit quite unlike any other ssl toolkit. In order to import a client certificate and key into firefox, you need to create a pkcs12 file containing them and import that into the “Your Certificates” box which is under Preferences > Privacy & Security > View Certificates

Obviously, simply supplying a key file to firefox presents security issues because you’d like to prevent a clever hacker from gaining access to it and thus running off with your client certificate. Firefox achieves a modicum of security by doing all key operations over the PKCS#11 API via a software token, which should mean that even malicious javascript cannot gain access to your key but merely to the signing API.

However, assuming you don’t quite trust this software separation, you need to store your client signing key in a secure vault like a TPM to make sure no web hacker can gain access to it. Various crypto system connectors, like the OpenSSL TPM2 and TPM2 engine, already exist but because Firefox uses its own crytographic code it can’t take advantage of them. In fact, the only external object the Firefox crypto code can use is a PKCS#11 module.

Aside about TPM2 and PKCS#11

The design of PKCS#11 is that it is a loadable library which can find and enumerate keys and certificates in some type of hardware device like a USB Key or a PCI attached HSM. However, since the connector is simply a library, nothing requires that it connect to something physical, and the OpenDNSSEC project actually produces a purely software based cryptographic token. In theory, then, it should be easy.

The problems come with the PKCS#11 expectation of key residency: the library allows the consuming program to enumerate a list of slots, each of which may, or may not, be occupied by a single token. Each token may contain one or more keys and certificates. Now, the TPM does have a concept of a key resident in NV memory, which is directly analogous to the PKCS#11 concept of a token based key. The problems start with the TPM2 PC Client Profile, which recommends this NV area be about 512 bytes, which is big enough for all of one key and thus not very scalable. In fact, the imagined use case of the TPM is with volatile keys which are demand loaded.

Demand loaded keys map very nicely to the OpenSSL idea of a key file, which is why OpenSSL TPM engines are very easy to understand and use, but they don’t map at all into the concept of token resident keys. The closest interface PKCS#11 has for handling key files is the provisioning calls, but even there they’re designed for placing keys inside tokens and, once provisioned, the keys are expected to be non-volatile. Worse still, very few PKCS#11 module consumers actually do provisioning, they mostly leave it up to a separate binary they expect the token producer to supply.

Even if the demand loading problem could be solved, the PKCS#11 API requires quite a bit of additional information about keys, like ids, serial numbers and labels that aren’t present in the standard OpenSSL key files and have to be supplied somehow.

Solving the Key File to PKCS#11 Mismatch

The solution seems reasonably simple: build a standard PKCS#11 library that is driven by a known configuration file. This configuration file can map keys to slots, as required by PKCS#11, and also supply all the missing information. The C_Login() operation is expected to supply a passphrase (or PIN in PKCS#11 speak), so that would be the point at which the private key could be loaded.

One of the interesting features of the above is that, while it could be implemented for the TPM engine only, it can also be implemented as a generic OpenSSL key exporter to PKCS#11 that happens also to take engine keys. That would mean it would work for non-engine keys as well as any engine that exists for OpenSSL … a nice little win.

Building an OpenSSL PKCS#11 Key Exporter

A token can be built from a very simple ini-like configuration file, with the global section setting global properties, like manufacturer id and library description, and each individual section being used to instantiate a slot containing one key. We can make the slot name, the id and the label the same if not overridden, and use key file directives to load the public and private keys. The serial number seems best constructed from a hash of the public key parameters (again, if not overridden). In order to support engine keys, the token library needs to know which engine to invoke, so I added an engine keyword to tell it.

With that, the mechanics of making the token library work with any OpenSSL key are set, the only thing is to plumb in the PKCS#11 glue API. At this point, I should add that the goal is simply to get keys and tokens working, not to replicate a full featured PKCS#11 API, so you shouldn’t use this as something to test against for a reference implementation (the softhsm2 token is much better for that). However, it should be functional enough to use for storing keys in Firefox (as well as other things, see below).

The current reasonably full featured source code is here, with a reference build using the OpenSUSE Build Service here. I should add that some of the build failures are due to problems with p11-kit and others due to the way Debian gets the wrong engine path for libp11.

At Last: Getting TPM Keys working with Firefox

A final problem with Firefox is that there seems to be no way to import a certificate file for which the private key is located on a token. The only way Firefox seems to support this is if the token contains both the private key and the certificate. At least this is my own project, so some coding later, the token now supports certificates as well.

The next problem is more mundane: generating the certificate and key. Obviously, the safest key is one which has never left the TPM, which means the certificate request needs to be built from it. I chose a CSR type that also includes my name and my machine name for later easy discrimination (and revocation if I ever lose my laptop). This is the sequence of commands for my machine called jarvis.

create_tpm2_key -a key.tpm
openssl req -subj "/CN=James Bottomley/UID=jarvis/" -new -engine tpm2 -keyform engine -key key.tpm -nodes -out jarvis.csr
openssl x509 -in jarvis.csr -req -CA my-ca.crt -engine tpm2 -CAkeyform engine -CAkey my-ca.key -days 3650 -out jarvis.crt

As you can see from the above, the key is first created by the TPM, then that key is used to create a certificate request where the common name is my name and the UID is the machine name (this is just my convention, feel free to use your own) and then finally it’s signed by my own CA, which you’ll notice is also based on a TPM key. Once I have this, I’m free to create an ini file to export it as a token to Firefox

manufacturer id = Firefox Client Cert
library description = Cert for hansen partnership
[mozilla-key]
certificate = /home/jejb/jarvis.crt
private key = /home/jejb/key.tpm
engine = tpm2

All I now need to do is load the PKCS#11 shared object library into Firefox using Settings > Privacy & Security > Security Devices > Load and I have a TPM based client certificate ready for use.

Additional Uses

It turns out once you have a generic PKCS#11 exporter for engine keys, there’s no end of uses for them. One of the most convenient has been using TPM2 keys with gnutls. Although gnutls was quick to adopt TPM 1.2 based keys, it’s been much slower with TPM2 but because gnutls already has a PKCS#11 interface using the p11 kit URI format, you can easily build a config file of all the TPM2 keys you want it to use and simply use them by URI in gnutls.

Unfortunately, this has also led to some problems, the biggest one being Firefox: Firefox assumes, once you load a PKCS#11 module library, that you want it to use every single key it can find, which is fine until it pops up 10 dialogue boxes each time you start it, one for each key password, particularly if there’s only one key you actually care about using. This problem doesn’t seem solvable in the Firefox token interface, so the eventual way I did it was to add the ability to specify the config file in the environment (variable OPENSSL_PKCS11_CONF) and modify my xfce Firefox action to set this in the environment pointing at a special configuration file with only Firefox’s key in it.

Conclusions and Future Work

Hopefully I’ve demonstrated this simple PKCS#11 converter can be useful both to keeping Firefox keys safe as well as uses in other things like gnutls. Unfortunately, it turns out that the world wide web is turning against PKCS#11 tokens as having usability problems and is moving on to something called FIDO2 tokens which have the web browser talking directly to the USB token. In my next technical post I hope to explain how you can use the Linux Kernel USB gadget system to connect a TPM up easily as a FIDO2 token so you can use the new passwordless webauthn protocol seamlessly.

March 06, 2019 08:21 PM

March 05, 2019

Paul E. Mc Kenney: Parallel Programming: March 2018 deferred-processing query

TL;DR: Do you know of additional publicly visible production uses of sequence locking, hazard pointers, or RCU not already called out in the remainder of this blog post?

I am updating the deferred-processing chapter of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” and would like to include a list of publicly visible production uses of sequence locking, hazard pointers, and RCU. I suppose I could also include reference counting, but given that it was well known before I was born, I expect that its list would be way too long to be useful!

The only production use of sequence locking that I am aware of is within the Linux kernel, but I would be surprised if it is not rather widely used. Can you tell me of more publicly visible production sequence-locking uses?

Hazard pointers is used within MongoDB (v3.0 and later) and within Facebook's Folly library, which is used in production at Facebook and perhaps elsewhere as well. It is also implemented by several libraries called out on its Wikipedia page (Concurrent Building Blocks, Concurrency Kit, Atomic Ptr Plus, and libcds). Hazard pointers is also sometimes called “safe memory reclamation” (SMR). Any other production hazard-pointers uses?

RCU is used within the Linux kernel, the FreeBSD kernel, the OpenBSD kernel, Linux Trace Toolkit Next Generation (LTTng), QEMU, Knot DNS, Netsniff-ng, Sheepdog, GlusterFS, and gdnsd. It is also implemented by several libraries, including Userspace RCU, Concurrency Kit, Facebook's Folly library, and libcds. RCU is also called “epochs” (from Keir Fraser), “generations” (from Tornado/K42), “passive serialization” (from IBM zVM), and probably other things as well. Any other production RCU uses?

So what do I mean by “publicly visible”? Open-source projects should qualify, as should scholarly publications regarding proprietary projects. Similarly, “production use” means use for getting some job done, as opposed to research, prototyping, or benchmarking. Not that there is necessarily anything wrong with research, prototyping, or benchmarking, but we are looking for things a little bit further along the hype cycle. ;-)

March 05, 2019 06:45 PM

February 28, 2019

Pete Zaitcev: Suddenly RISC-V

I knew about that thing because Rich Jones was a fan. Man, that guy is always ahead of the curve.

Coincidentally, a couple of days ago Amazon announced support for RISC-V in FreeRTOS (I have no idea how free that thing is. It's MIT license, but with Amazon, it might be patented up the gills.).

February 28, 2019 07:51 PM

Pete Zaitcev: Mu accounts

Okay, here's the breakdown:

@pro: Programming, computers, networking, maybe some technical fields. It's basically migrated from SeaLion and is the main account of interest for the readers of this journal.

@stuff: Pictures of butterflies, gardening, and general banality.

@gat: Boomsticks.

@avia: Flying.

@union: Politics.

@anime: Anime, manga, and weaboo. Note that Ani-nouto is still officially at Smug.

Thinking about adding @cars and @space, if needed.

You can subscribe from any Fediverse instance, just hit the "Remote follow" button.

February 28, 2019 05:06 PM

Pete Zaitcev: Multi-petabyte Swift cluster

In a Swift numbers post in 2017, I mentioned that the largest known cluster had about 20 PB. It is 2019 now and I just got word that TurkCell is operating a cluster with 36 PB, and they are looking at growing it up to 50 PB by the end of the year. The information about its make-up is proprietary, unfortunately. The cluster was started in the Icehouse release, so I'm sure there was a lot of churn and legacy, like 250 GB drives and RHEL 6.

February 28, 2019 04:44 PM

February 26, 2019

Linux Plumbers Conference: Welcome to the 2019 Linux Plumbers Conference blog

Planning for the 2019 Linux Plumbers Conference is well underway. The planning committee will be posting various informational blurbs here, including information on hotels, microconference acceptance, evening events, scheduling, and so on. Next up will be a “call for proposals” that should appear soon.

LPC will be held at the Corinthia Hotel, Lisbon, Portugal, 9-11 September 2019, colocated with the Linux Kernel Maintainer Summit. The Linux Kernel Summit Track will very much be taking place during LPC 2019 again this year.

February 26, 2019 02:21 PM

February 25, 2019

Pete Zaitcev: Elixir of your every fear

TFW you consider an O'Reilly animal-cover book and the blurb says:

Authors Simon St. Laurent and J. David Eisenberg show you how Elixir combines the robust functional programming of Erlang with an approach that looks more like Ruby, and includes powerful macro features for metaprogramming.

February 25, 2019 01:34 AM

February 24, 2019

Davidlohr Bueso: Linux v4.20: Performance Goodies

With v4.20 out for almost the entire v5.0 rc-cycle, here are some of the more interesting performance related changes that made their way in.

signal: Use a smaller struct siginfo in the kernel

Reduces the memory footprint of 'struct siginfo', most of which was just reserved space. Ultimately this shrinks the structure from spanning two cachelines down to one.
[Commit 4ce5f9c9e754]
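
The shape of the change, roughly: the kernel works with a trimmed struct internally and expands it to the full 128-byte userspace ABI layout only at the copy-out boundary. A simplified sketch (the real code clears just the unused tail rather than the whole struct):

/* Simplified sketch: fill the live prefix, zero the reserved tail,
 * so userspace still sees the full ABI-sized siginfo. */
#include <linux/signal.h>
#include <linux/uaccess.h>

static int copy_siginfo_out(struct siginfo __user *to,
                            const struct kernel_siginfo *from)
{
    if (clear_user(to, sizeof(struct siginfo)))
        return -EFAULT;
    if (copy_to_user(to, from, sizeof(struct kernel_siginfo)))
        return -EFAULT;
    return 0;
}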

sched/fair: Fix cpu_util_wake() for 'execl' type workloads

Fix an exec() related performance regression, which was caused by incorrectly calculating load and migrating tasks on exec() when they shouldn't be.
[Commit c469933e7721]

locking/rwsem: Exit read lock slowpath if queue empty and no writer

This change presents a new heuristic for optimizing rw-semaphores, specifically in read-mostly scenarios. Before the patch, a reader could find itself in the slowpath due to an occasional writer thread, even though the writer had since released the lock and only other readers were present. At that point the wait queue grew unnecessarily, causing other readers attempting to take the lock to see waiting readers and queue up behind them. This directly improves some issues found when (ab)using pread64() and XFS.
[Commit 4b486b535c33]

mm: mmap: zap pages with read mmap_sem in munmap

When a process unmaps a range of memory, the infamous mmap_sem would be held for the duration of the entire munmap() call, which can be a long time for big mappings (reportedly up to 18 seconds for a 320GB mapping). A two-phase approach was taken to address this: the key is to unmap the VMAs first, while the semaphore is held exclusively, then downgrade it so that it can be shared while doing the zapping and freeing of page tables.
[Commit dd2283f2605e b4cefb360512 cb4922496ae4]
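
Condensed into a sketch, the two-phase scheme looks like this; detach_vmas() and zap_and_free_pages() are hypothetical stand-ins for the actual detach and teardown steps:

/* Sketch of the two-phase munmap(): detach VMAs under the exclusive
 * mmap_sem, then downgrade to a shared hold for the long-running page
 * zapping so readers (e.g. page faults elsewhere) can proceed. */
#include <linux/mm.h>
#include <linux/rwsem.h>

static int do_munmap_sketch(struct mm_struct *mm,
                            unsigned long start, size_t len)
{
    struct vm_area_struct *vma;

    down_write(&mm->mmap_sem);
    vma = detach_vmas(mm, start, len);  /* hypothetical helper */
    downgrade_write(&mm->mmap_sem);     /* exclusive -> shared */

    zap_and_free_pages(mm, vma);        /* hypothetical: slow part */
    up_read(&mm->mmap_sem);
    return 0;
}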

net/tcp: optimize tcp internal pacing

When TCP implements its own pacing (when no fq packet scheduler is used), it arms a high resolution timer after a packet is sent. But in many cases (like TCP_RR kinds of workloads), this high resolution timer expires before the application attempts to write the following packet. Now the timer is set up only when a packet is about to be sent, and only if tcp_wstamp_ns is in the future, showing a ~10% performance increase in TCP_RR workloads.
[Commit 864e5c090749]

fs: better member layout of struct super_block

Re-organize 'struct super_block' to try to keep frequently accessed fields on the same cacheline, while grouping the rarely accessed members together. This was seen to address a regression in a concurrent unlink-intensive workload.
[Commit 99c228a994ec]


fs/fuse: improved scalability

Two changes with visible performance effects went in. The first series changes some of the protections for background requests, which allows async reads to avoid taking the fuseconn lock. The second implements a hash table for processing requests, which was seen to address 20% of the time spent in request_find() under some workloads with Virtuozzo storage over RDMA.
[Commit e287179afe21 2a23f2b8adbe 2b30a533148a ae2dffa39485 63825b4e1da5 c59fd85e4fd0 be2ff42c5d6e]

February 24, 2019 11:53 PM

February 22, 2019

Pete Zaitcev: Mu!

In the past several days, I inaugurated a private Fediverse instance, "Mu", running Pleroma for now. Although Mastodon is the dominant implementation, Pleroma is far easier to install, and uses less memory on small, private instances. By doing this, I'm bucking the trend of people hating to run their own infrastructure. Well, I do run my own e-mail service, so, what the heck, might as well join the Fediverse.

So far, it was pretty fun, but Pleroma has problem spots. For example, Pleroma has a concept of "local accounts" and "remote accounts": local ones are normal, into which users log in at the instance, and remote ones mirror accounts on other instances. This way, if users Alice@Mu and Bob@Mu follow user zaitcev@SLC, Mu creates a "remote" account UnIqUeStRiNg@Mu, which tracks zaitcev@SLC, so Alice and Bob subscribe to it locally. This permits sending zaitcev's updates over the network only once. Makes sense, right? Well... I have a "stuck" remote account now at Mu, let's call it Xprime@Mu and posit that it follows X@SPC. Updates posted by X@SPC are reflected in Xprime@Mu, but if Alice@Mu tries to follow X@SPC, she does not see updates that Xprime@Mu receives (the updates are not reflected in Alice's friends/main timeline) [1]. I asked at #pleroma about it, but all they could suggest was to try and resubscribe. I think I need to unsubscribe and purge Xprime@Mu somehow. Then, when Alice resubscribes, Pleroma will re-create a remote, say Xbis@Mu, and things hopefully ought to work. Well, maybe. I need to examine the source to be sure.

Unfortunately, aside from being somewhat complex by its nature, Pleroma is written in Elixir, which is to Erlang what Kotlin is to Java, I gather. Lain explains it thus:

As I had written a social network in Ruby for my work at around that time, I wanted to apply my [negative] experience to a new project. [...] This was also to get some experience with Elixir and the Erlang ecosystem, which seemed like a great fit for a fediverse server — and I think it is.

and to re-iterate:

When I started writing Pleroma I was already writing a social network in Ruby for my day job. Because of that, I knew a lot about the pain points of doing it with Ruby, mostly the bad performance for anything involving concurrency. I had written a Bittorrent DHT client in Elixir, so I knew that it would work well for this kind of software. I was also happy to work with functional programming again, which I like very much.

Anyway, it's all water under the bridge, and if I want to understand why Xprime@Mu is stuck, I need to learn Elixir. Early signs are not that good. Right away, it uses its own control entity, called "mix", that replaces make(1), packaging, and a few other things. Sasuga desu, as they say in my weeb neighbourhood. Every goddamn language does that nowadays.


[1] It's trickier, actually. For an inexplicable reason, Alice sees some updates by X: for example, re-posts.

February 22, 2019 09:14 PM

Pavel Machek: Certified danger

I suspected Linux Foundation went to the dark side when they started strange deals with Microsoft. But I'm pretty sure they have gone to the dark side now. https://venturebeat.com/2019/02/21/linux-foundation-elisa/ If Linux can be certified for safety-critical stuff, it means your certification requirements are _way_ too low. People are using microkernels for critical stuff for a reason...

February 22, 2019 01:38 PM

February 11, 2019

Pete Zaitcev: Feynman on discussions among great men

One of the first experiences I had in this project at Princeton was meeting great men. I had never met very many great men before. But there was an evaluation committee that had to try to help us along, and help us ultimately decide which way we were going to separate the uranium. This committee had men like Compton and Tolman and Smyth and Urey and Rabi and Oppenheimer on it. I would sit in because I understood the theory of how our process of separating isotopes worked, and so they'd ask me questions and talk about it. In these discussions one man would make a point. Then Compton, for example, would explain a different point of view. He would say it should be this way, and was perfectly right. Another guy would say, well, maybe, but there's this other possibility we have to consider against it.

So everybody is disagreeing, all around the table. I am surprised and disturbed that Compton doesn't repeat and emphasize his point. Finally, at the end, Tolman, who's the chairman, would say, “Well, having heard all these arguments, I guess it's true that Compton's argument is the best of all, and now we have to go ahead.”

It was such a shock to me to see that a committee of men could present a whole lot of ideas, each one thinking of a new facet, while remembering what the other fella said, so that, at the end, the decision is made as to which idea was the best — summing it all up — without having to say it three times. These were very great men indeed.

Life on l-k before CoC.

February 11, 2019 07:52 PM

February 06, 2019

Pete Zaitcev: SpaceBelt whitepaper

I pay special attention to my hometown rocket enterprise, Firefly. So, it didn't escape my notice when Dr. Tom Markusic mentioned SpaceBelt in SatMagazine as a potential user of launch services:

Cloud Constellation Corporation capped off 2018 funding announcements with a $100 million capital raise for space-based data centers [...]

Not a large amount of funding, but nonetheless, what are they trying to do? The official answer is provided in the whitepaper on their website.

The orbiting belt provides a greater level of security, independence from jurisdictional control, and eliminating the need for terrestrial hops for a truly worldwide network. Access to the global network is via direct satellite links, providing for a level of flexibility and security unheard of in terrestrial networks.

SpaceBelt provides a solution – a space-based storage layer for highly sensitive data providing isolation from conventional internet networks, extreme physical isolation, and sovereign independent data storage resources.

Although not pictured in the illustrations, the text permits users direct access, which will become important later:

Clients can purchase or lease specialized very-small-aperture terminals (VSATs) which have customized SpaceBelt transceivers allowing highly-secure access to the network.

Interesting. But a few thoughts spring to mind.

Isolation from the Internet is vulnerable to the usual gateway problem, unintentional or malicious. If only application-level access is provided, a compromised gateway only accesses its own account. So that's fine. However, if state security services were able to insert malware into Iran's nuclear facilities, I think that the isolation may not be as impregnable as purported.

Consider also that system control has to be provided somehow, so they must have a control facility. In terms of vulnerabilities to governments and physical attacks, it is the equivalent of a datacenter hosting an intercontinental cluster's control plane, located wherever the master ground station is. In the case of SpaceBelt, it is identified as the "Network Management Center".

In addition, the space location invites a new spectrum of physical attacks: now the adversary can cook your data with microwaves or lasers, instead of launching ICBMs. It's a significantly lower barrier to entry.

Turning this around, it might be cheaper to store the data where the NMC is, since the physical security measures are the same, but the vulnerabilities are fewer.

Of course the physical security includes a legal aspect. The whitepaper nods to "jurisdictional independence" several times. They don't explain what they mean, but they may be trying to imply that the data sent from the ground to the SpaceBelt does not traverse the ground infrastructure where the NMC is located, and therefore is not subject to any legal restrictions there, such as GDPR.

Very nice, and IANAL, but doesn't the Outer Space Treaty establish a regime of absolute responsibility for signatory nations? I only know that OST is quite unlike the Law of The Sea: because of the absolute responsibility there is no salvage. Therefore, a case can be made that if the responsible nation is under GDPR, the whole SpaceBelt is too.

The above considerations apply to the "sovereign" or national data, but the international business faces more. The whitepaper implies that accessing data may be a simple matter of "leasing VSATs", but the governments still have the power to deny this access. Usually radio frequency licensing is involved, as in the case of OneWeb in Russia. The whitepaper mentions using traditional GSO comsats as relays, thus shifting the radio spectrum licensing hurdles onto the comsat operators. But there may be outright bans as well. I'm sure the Communist government of mainland China will not be happy if SpaceBelt users start downloading Falun Gong literature from space.

One other thing. If frying SpaceBelt with lasers might be too hard, there are other ways. Russia, for example, is experimenting with a rogue satellite that approaches comsats. It's not doing anything too bad to them at present, but so much for the "extreme physical isolation". If you thought that using a SpaceBelt VSAT would isolate you from the risk of Russian submarines tapping undersea cables, then you might want to reconsider.

Overall, it's not like I would not love to work at Cloud Constellation Corporation, implementing the basic technologies their project needs. Sooner or later, humanity will have computing in space, might as well do it now. But their pitch needs work.

Finally, for your amusement:

In the future, the SpaceBelt system will be enabled to host docker containers allowing for on-orbit data processing in-situ with data storage.

Congratulations, Docker. You've become the xerox of the cloud. (In the U.S., Xerox was ultimately successful in fighting the dilution: everyone now uses the word "photocopy". Not that litigation helped them remain relevant.)

February 06, 2019 08:50 PM

January 29, 2019

Paul E. Mc Kenney: Article review: "The Hard Truth About Innovative Cultures"

There has been much ink spilled about innovation over the past decades, but this article from Harvard Business Review is the first one that really rings true with my experiences. The article's main point is that much prior writing has focused on the fun aspects of innovation, and it points out some additional hard work that is absolutely required for meaningful innovation. The authors put forth five maxims, each of which is discussed below.

Tolerance for failure but no tolerance for incompetence. This maxim is the one that rings most true with me: Innovation's progress is often measured in errors per hour, but the errors have to be productive errors that either eliminate classes of potential solutions from consideration or that better approximate a useful solution. And in my experience, extreme competence is required to make the right mistakes, that is, the mistakes that will generate the experience required to eventually arrive at a workable solution.

However, this maxim is also the one that I am most uncomfortable with. The discomfort stems from the choice of the word “incompetence”. After all, what is incompetence? The old apprentice/journeyman/master trichotomy is a useful guide. An apprentice is expected to do useful work if overseen by a journeyman or master. A journeyman is expected to be capable of carrying out a wide range of tasks without guidance. A master is expected to be able to extend the state of the art as needed to complete the task at hand. Clearly, there is a wide gulf between the definition of “incompetence” appropriate for an apprentice on the one hand and a master on the other. The level of competence required for this sort of work is not a function of education, certifications, or seniority, but instead requires a wide range of deep skills and experience combined with a willingness to learn things the hard way, along with a tolerance for the confusion and disorder that usually accompanies innovation. In short, successful innovation requires the team have a fair complement of masters. Yet it makes absolutely no sense to label as “incompetent” an accomplished journeyman, even if said journeyman is a bit uncreative and disorder-intolerant.

All that aside, “Tolerance for failure but no tolerance for non-mastery” doesn't exactly roll off the tongue, and besides which, large projects would have ample room for apprentices and journeymen, for example, our hypothetical accomplished but disorder-intolerant journeyman might be an excellent source of feedback. And in fact, master-only teams tend to be quite small [PDF, paywalled, sorry!]. I therefore have no suggestions for improvement. And wording quibbles aside, this maxim seems to me to be the most important of the five by far.

Willingness to experiment but highly disciplined. Although it is true that sometimes the only way forward is a random walk, it is still critically important to keep careful records of the experiments and their outcomes. It is often the case that last week's complete and utter failure turns out to contain the seeds of this week's step towards success, and sometimes patterns within a depressing morass of failures point the way to eventual success. The article also makes the excellent point that stress-testing ideas early on avoids over-investing in the inevitable blind alleys.

Psychologically safe but brutally candid. We all fall in love with our ideas, and therefore we all need the occasional round of “frank and open” feedback. If nothing else, we should design our experiments (or, in software, our validation suites) to provide that feedback.

Collaboration but with individual accountability. Innovation often requires that individuals and teams buck the common wisdom, but common wisdom often carries the day. Therefore, those individuals and teams must remain open to feedback, and accountability is one good way to help them seek out feedback and take that feedback seriously.

Flat but strong leadership. Most of my innovation has been carried out by very small teams, so this maxim has not been an issue for me. But people wishing to create large but highly innovative teams would do well to read this part of the article very carefully.

In short, this is a great article, and to the best of my knowledge the first one presenting both the fun and hard-work sides of the process of innovation. Highly recommended!

January 29, 2019 06:43 PM

James Morris: Save the Dates! Linux Security Summit Events for 2019.

There will be two Linux Security Summit (LSS) events again this year:

Stay tuned for CFP announcements!

January 29, 2019 05:35 PM

January 06, 2019

Pete Zaitcev: Reinventing a radio wheel

I tinker with software radio as a hobby and I am stuck solving a very basic problem. But first, a background exposition.

Bdale, what have you done to me

Many years ago, I attended an introductory lecture on software radio at a Linux conference we used to have - maybe OLS, maybe LCA, maybe ALS/Usenix even. Bdale Garbee was presenting, who I mostly knew as a Debian guy. He outlined a vision of Software Defined Radio: take what used to be a hardware problem, re-frame it as a software problem, let hackers hack on it.

Back then, people literally had sound cards as receiver back-ends, so all Bdale and his cohorts could do was HF, narrow band signals. Still, the idea seemed very powerful to me and caught my imagination.

A few years ago, the RTL-SDR appeared. I wanted to play with it, but nothing worthy came to mind, until I started flying and thus looking into various aviation data link signals, in particular ADS-B and its relatives TIS and FIS.

Silly government, were feet and miles not enough for you

At the time FAA became serious about ADS-B, two data link standards were available: Extended Squitter aka 1090ES at 1090 MHz and Universal Access Transceiver aka UAT at 978 MHz. The rest of the world was converging quickly onto 1090ES, while UAT had a much higher data rate, so permitted e.g. transmission of weather information. FAA sat like a Buridan's ass in front of two heaps of hay, and decided to adopt both 1090ES and UAT.

Now, if airplane A is equipped with 1090ES and airplane B is equipped with UAT, they can't communicate. No problem, said FAA, we'll install thousands of ground stations that re-transmit the signals between bands. Also, we'll transmit weather images and data on UAT. Result is, UAT has a lot of signals all the time, which I can receive.

Before I invent a wheel, I invent an airplane

Well, I could, if I had a receiver that could decode a 1 megabit/second signal. Unfortunately, RTL-SDR could only snap 2.8 million I/Q samples/second in theory. In practice, even less. So, I ordered an expensive receiver called AirSpy, which was said to capture 20 million samples/second.

But, I was too impatient to wait for my AirSpy, so I started thinking if I could somehow receive UAT with RTL-SDR, and I came up with a solution. I let it clock at twice the exact speed of UAT, a little more than 1 Mbit/s. Then, since UAT used PSK2 encoding, I would compare phase angles between samples. Now, you cannot know for sure where the bits fall over your samples. But you can look at the decoded bits and see if it's garbage or a packet. Voila, making the impossible possible, at Shannon's boundary.
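
The trick, sketched in C purely for illustration (hypothetical names; assuming interleaved unsigned 8-bit I/Q samples as RTL-SDR produces them, two samples per bit, and a caller that tries both bit alignments and keeps whichever yields a valid packet):

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Compare carrier phase two samples (one bit period) apart; a phase
     * reversal encodes one bit value, phase continuity the other. */
    static size_t decode_bits(const uint8_t *iq, size_t nsamples,
                              int offset, uint8_t *bits, size_t maxbits)
    {
        const float pi = 3.14159265f;
        size_t nbits = 0;

        for (size_t n = offset + 2; n < nsamples && nbits < maxbits; n += 2) {
            /* RTL-SDR sample bytes are unsigned, centered on ~127.5 */
            float i0 = iq[2*(n-2)] - 127.5f, q0 = iq[2*(n-2)+1] - 127.5f;
            float i1 = iq[2*n] - 127.5f,     q1 = iq[2*n+1] - 127.5f;
            float dphi = atan2f(q1, i1) - atan2f(q0, i0);

            /* wrap the phase difference into (-pi, pi] */
            if (dphi > pi)
                dphi -= 2.0f * pi;
            else if (dphi <= -pi)
                dphi += 2.0f * pi;
            bits[nbits++] = fabsf(dphi) > pi / 2 ? 1 : 0;
        }
        return nbits;
    }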

When I posted my code to github, it turned out that a British gentleman by the handle of mutability was thinking about the same thing. He contributed a patch or two, but he also had his own codebase, at which I hacked a bit too. His code performed better, and it found wide adoption under the name dump978.

Meanwhile, the AirSpy problem

AirSpy ended up collecting dust, until now. I started playing with it recently, and used the 1090ES signal for tests. It was supposed to be easy... Unlike the phase shifts of UAT, 1090ES is a much simpler signal: a rising edge is 1, a falling edge is 0, a steady level is invalid and is used in the preamble. How hard can it be, right? Even when I found that AirSpy only receives the real component, it seemed immaterial: 1090ES is not phase-encoded.

But boy, was I wrong. To begin with, I need to hunt for a preamble, which synchronizes the clocks for the remainder of the packet. Here's what it looks like:

The fat green square line on the top is a sample that I stole from our German friends. The thin green line is a 3-sample average of abs(sample). And the purple is raw samples off the AirSpy, real-only.

My first idea was to compute a "discriminant" function, a kind of integrated difference between the ideal function (in fat green) and the actual signal. If the discriminant is smaller than a threshold, we have our preamble. The idea was a miserable failure. The problem is, the signal is noisy. So, even when the signal is normalized, the noise in a more powerful signal inflates the discriminant enough that it becomes larger than the discriminant of background noise.
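
For concreteness, the discriminant itself is just an integrated squared difference, something like this sketch (the names and the pre-normalized representation are hypothetical):

    #include <stddef.h>

    /* Sum of squared differences between the ideal preamble shape and
     * the incoming magnitude samples, both assumed pre-normalized.
     * Small result => candidate preamble.  As described above, this
     * fails in practice: on a strong signal the noise is scaled up
     * along with the signal, inflating the result past any threshold
     * that still rejects background noise. */
    static float discriminant(const float *ideal, const float *actual, size_t n)
    {
        float sum = 0.0f;

        for (size_t i = 0; i < n; i++) {
            float d = ideal[i] - actual[i];
            sum += d * d;
        }
        return sum;
    }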

Mind, this is a long-solved problem. Software receiver for 1090ES with AirSpy exists. I'm just playing here. Still... How do real engineers do it?

January 06, 2019 03:47 AM

December 24, 2018

Kees Cook: security things in Linux v4.20

Previously: v4.19.

Linux kernel v4.20 has been released today! Looking through the changes, here are some security-related things I found interesting:

stackleak plugin

Alexander Popov’s work to port the grsecurity STACKLEAK plugin to the upstream kernel came to fruition. While it had received Acks from x86 (and arm64) maintainers, it has been rejected a few times by Linus. With everything matching Linus’s expectations now, it and the x86 glue have landed. (The arch-specific portions for arm64 from Laura Abbott actually landed in v4.19.) The plugin tracks function calls (with a sufficiently large stack usage) to mark the maximum depth of the stack used during a syscall. With this information, at the end of a syscall, the stack can be efficiently poisoned (i.e. instead of clearing the entire stack, only the portion that was actually used during the syscall needs to be written). There are two main benefits from the stack getting wiped after every syscall. First, there are no longer “uninitialized” values left over on the stack that an attacker might be able to use in the next syscall. Next, the lifetime of any sensitive data on the stack is reduced to only being live during the syscall itself. This is mainly interesting because any information exposures or side-channel attacks from other kernel threads need to be much more carefully timed to catch the stack data before it gets wiped.

Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use “uninitialized” bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.

raise faults for kernel addresses in copy_*_user()

Jann Horn reworked x86 memory exception handling to loudly notice when copy_{to,from}_user() tries to access unmapped kernel memory. Prior to this, those accesses would result in a silent error (usually visible to callers as EFAULT), making it indistinguishable from a “regular” userspace memory exception. The purpose of this is to catch cases where, for example, the unchecked __copy_to_user() is called against a kernel address. Fuzzers like syzkaller weren’t able to notice very nasty bugs because writes to kernel addresses would either corrupt memory (which may or may not get detected at a later time) or return an EFAULT that looked like things were operating normally. With this change, it’s now possible to much more easily notice missing access_ok() checks. This has already caught two other corner cases even during v4.20 in HID and Xen.
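
The bug class looks roughly like this hypothetical ioctl-style fragment (invented names, a kernel-style sketch, not from any real driver):

    #include <linux/uaccess.h>
    #include <linux/errno.h>

    /* 'arg' arrives straight from userspace and is used as a pointer. */
    static long fill_report(unsigned long arg)
    {
            void __user *dst = (void __user *)arg;
            u64 report = 42;

            /* BUG: __copy_to_user() skips the access_ok() range check,
             * so a kernel address smuggled in through 'arg' used to
             * fail silently at best, or corrupt memory at worst.
             * After this change, it faults loudly. */
            if (__copy_to_user(dst, &report, sizeof(report)))
                    return -EFAULT;

            /* Correct: copy_to_user() performs the access_ok() check. */
            if (copy_to_user(dst, &report, sizeof(report)))
                    return -EFAULT;
            return 0;
    }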

spectre v2 userspace mitigation

The support for Single Thread Indirect Branch Predictors (STIBP) has been merged. This allowed CPUs that support STIBP to effectively disable Hyper-Threading to avoid indirect branch prediction side-channels that expose information between userspace threads on the same physical CPU. Since this was a very expensive solution, this protection was made opt-in (via explicit prctl() or implicitly under seccomp()). LWN has a nice write-up of the details.
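
For reference, the explicit opt-in from userspace looks something like this minimal sketch (the fallback defines are only for older header files and mirror the values in linux/prctl.h):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_SPECULATION_CTRL
    #define PR_SET_SPECULATION_CTRL 53
    #endif
    #ifndef PR_SPEC_INDIRECT_BRANCH
    #define PR_SPEC_INDIRECT_BRANCH 1
    #endif
    #ifndef PR_SPEC_DISABLE
    #define PR_SPEC_DISABLE (1UL << 2)
    #endif

    int main(void)
    {
            /* Ask the kernel to disable indirect branch speculation
             * (i.e. enable the STIBP-based protection) for this task. */
            if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                      PR_SPEC_DISABLE, 0, 0) != 0)
                    perror("prctl(PR_SET_SPECULATION_CTRL)");
            return 0;
    }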

jump labels read-only after init

Ard Biesheuvel noticed that jump labels don’t need to be writable after initialization, so their data structures were made read-only. Since they point to kernel code, they might be used by attackers to manipulate the jump targets as a way to change kernel code that wasn’t intended to be changed. Better to just move everything into the read-only memory region to remove it from the possible kernel targets for attackers.

VLA removal finished

As detailed earlier for v4.17, v4.18, and v4.19, a whole bunch of people answered my call to remove Variable Length Arrays (VLAs) from the kernel. I count at least 153 commits having been added to the kernel since v4.16 to remove VLAs, with a big thanks to Gustavo A. R. Silva, Laura Abbott, Salvatore Mesoraca, Kyle Spiers, Tobin C. Harding, Stephen Kitt, Geert Uytterhoeven, Arnd Bergmann, Takashi Iwai, Suraj Jitindar Singh, Tycho Andersen, Thomas Gleixner, Stefan Wahren, Prashant Bhole, Nikolay Borisov, Nicolas Pitre, Martin Schwidefsky, Martin KaFai Lau, Lorenzo Bianconi, Himanshu Jha, Chris Wilson, Christian Lamparter, Boris Brezillon, Ard Biesheuvel, and Antoine Tenart. With all that done, “-Wvla” has been added to the top-level Makefile so we don’t get any more added back in the future.
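
The removals mostly share the same mechanical shape; a hypothetical before/after for illustration (names invented):

    #include <string.h>

    #define MAX_DIGEST_SIZE 64      /* hypothetical worst-case bound */

    /* Before, the buffer was a VLA whose stack footprint depended on
     * a runtime value, which -Wvla now rejects:
     *
     *     unsigned char digest[digest_len];
     *
     * After: a fixed worst-case buffer plus an explicit bounds check. */
    static int fill_digest(unsigned char *out, unsigned int digest_len)
    {
            unsigned char digest[MAX_DIGEST_SIZE];

            if (digest_len > sizeof(digest))
                    return -1;
            memset(digest, 0, digest_len);  /* stand-in for real work */
            memcpy(out, digest, digest_len);
            return 0;
    }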

per-task stack canaries, powerpc
For a long time, only x86 has had per-task kernel stack canaries. Other architectures would generate a single canary for the life of the boot and use it in every task. This meant that exposing a canary from one task would give an attacker everything they needed to spoof a canary in a separate attack in a different task. Christophe Leroy has solved this on powerpc now, integrating the new GCC support for the -mstack-protector-guard-reg and -mstack-protector-guard-offset options.

Given the holidays, Linus opened the merge window before v4.20 was released, letting everyone send in pull requests in the week leading up to the release. v4.21 is in the making. :) Happy New Year everyone!

Edit: clarified stackleak details, thanks to Alexander Popov. Added per-task canaries note too.

© 2018 – 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

December 24, 2018 11:59 PM

December 22, 2018

Pete Zaitcev: The New World

well I had to write a sysv init script today and I wished it was systemd

— moonman, 21 December 2018

December 22, 2018 04:41 AM

James Morris: Linux Security Summit Europe 2018 Wrap-up

The inaugural Linux Security Summit Europe (LSS-EU) was held in October, in Edinburgh, UK.

For 2018, the LSS program committee decided to add a new event in Europe, with the aim of fostering Linux security community engagement beyond North America. There are many Linux security developers and users in Europe who may not be able to obtain funding to travel to North America for the conference each year. The lead organizer and MC for LSS EU is Elena Reshetova, of Intel Finland.

This was my first LSS as a speaker, as I’ve always been the MC for the North American events. I provided a brief overview of the Linux kernel security subsystem.

Sub-maintainers of kernel security projects presented updates on their respective areas, and there were also several refereed presentations.

Slides may be found here, while videos of all talks are available via this youtube playlist.

There are photos, too!

The event overall seemed very successful, with around 150 attendees. We expect to continue now to have both NA and EU LSS events each year, although there are some scheduling challenges for 2019, with several LF events happening closely together. From 2020 on, it seems we will have 4-5 months separation between the EU and NA events, which will work much better for all involved.

 

December 22, 2018 03:53 AM

December 20, 2018

Pete Zaitcev: And to round out the 2018

To quoth:

Why not walk down the wider path, using GNU/Linux as DOM0? Well, if you like the kernel Linux, by all means, do that! I prefer an well-engineered kernel, so I choose NetBSD. [...]

Unfortunately, NetBSD's installer now fails on many PCs from 2010 and later. [...]

Update 2018-03-11: I have given up on NetBSD/Xen and now use Gentoo GNU/Linux/Xen instead. The reason is that I ran into stability problems which survived many NetBSD updates.

You have to have a heart of stone not to laugh out loud.

P.S. Use KVM already, sheesh.

P.P.S. This fate also awaits people who don't like SystemD.

December 20, 2018 09:30 PM

December 18, 2018

Pete Zaitcev: Firefox 64 autoplay in Fedora 29

With one of the recent Firefox releases (the current version is 64), autoplay videos began to play again, although they start muted now [1]. None of the previously-working methods work (e.g. the about:config setting media.autoplay.enabled); the documented preference is not there in 64 (promised for 63: it either never happened, or was removed). Extensions that purport to disable autoplay do not work.

The solution that does work is to set media.autoplay.default to 1.

Finding the working option required a bit of effort. I'm sure this post will become obsolete in a few months, and add to the Internet noise that makes it harder to find a working solution when Mozilla changes something again. But hey. Everything is shit, so whatever.

[1] Savour the bitterness of realization that an employee of Mozilla thought that autoplay was okay to permit as long as it was muted.

December 18, 2018 05:53 PM

December 13, 2018

Pete Zaitcev: IBM PC XT

By whatever chance, I visited an old science laboratory where I played at times when I was a teenager. They still have a pile of old equipment, including the IBM PC XT clone that I tinkered with.

Back in the day, they also had a PDP-11, already old, which had a magnetic tape unit. They also had data sets on those tapes. The PC XT was a new hotness, and they wanted to use it for data visualization. It was a difficult task to find a place that could read the data off the tape and write to 5.25" floppies. Impossible, really.

I stepped in and went to connect the two over RS-232. I threw together a program in Turbo Pascal, which did the job of shuffling characters between MS-DOS and the mini, thus allowing us to log in and initiate a transfer of the data. I don't remember if we used an ancient Kermit, or just printed the numbers in FORTRAN, then captured them on the PC.

The PDP-11 didn't survive for me to take a picture, but the PC XT did.

December 13, 2018 06:07 AM

December 09, 2018

Paul E. Mc Kenney: Parallel Programming: December 2018 Update

This weekend features a new release of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

This release features Makefile-automated running of litmus tests (both with herd and litmus tools), catch-ups with recent Linux-kernel changes, a great many consistent-style changes (including a new style-guide appendix), improved code cross-referencing, and a great many proofreading changes, all courtesy of Akira Yokosawa. SeongJae Park, Imre Palik, Junchang Wang, and Nicholas Krause also contributed much-appreciated improvements and fixes. This release also features numerous epigraphs, modernization of sample code, many random updates, and larger updates to the memory-ordering chapter, with much help from my LKMM partners in crime, whose names are now enshrined in the LKMM section of the Linux-kernel MAINTAINERS file.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

Oh, and the first edition is now available on Amazon in English as well as Chinese. I have no idea how this came about, but there it is!

December 09, 2018 07:42 PM

December 03, 2018

Dave Airlie (blogspot): Open source compute stack talk from Linux Plumbers Conference 2018

I spoke at Linux Plumbers Conference 2018 in Vancouver a few weeks ago, about CUDA and the state of open source compute stacks.

The video is now available.

https://www.youtube.com/watch?v=d94N2Lu4x9s


December 03, 2018 01:43 AM

December 02, 2018

Pete Zaitcev: Twitter

First things first: I am sorry for getting passive-aggressive on Twitter, although I was mad and the medium encourages this sort of thing. But this is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. Something about this world disagrees with me so much that I almost boycott Wikipedia and Stackoverflow. "Almost" means that I go very far, even Read The Fine Manuals, before I resort to them. As the path in the tweet indicated, I built Ceph from source in order to debug the problem. But as the software stacks get thicker and thicker, source gets less and less useful, or at least it loses the competition to googling for symptoms. My only hope at this point is for merciful death to take me away before these trends destroy human civilization.

December 02, 2018 04:22 AM

November 04, 2018

Paul E. Mc Kenney: Book review: "Skin in the Game: Hidden Asymmetries in Daily Life"

“Antifragile” was the last volume in Nassim Taleb's Incerto series, but it has lost that distinction with the publication of “Skin in the Game: Hidden Asymmetries in Daily Life”. This book covers a great many topics, but I will focus on only a few that relate most closely to my area of expertise.

Chapter 2 is titled “The Most Intolerant Wins: The Dominance of a Stubborn Minority”. Examples include kosher and halal food, the English language (I plead guilty!!!), and many others besides. In all cases, if the majority is not overly inconvenienced by the strongly expressed needs or desires of the minority, the minority's preferences will prevail. On the one hand, I have no problem eating either kosher or halal food, so would be part of the compliant majority in that case. On the other hand, although I know bits and pieces of several languages, the only one I am fluent in is English, and I have attended gatherings where the language was English solely for my benefit. But there are limits. For example, if I were to attend a gathering in certain parts of (say) rural India or China, English might not be within the realm of possibility.

But what does this have to do with parallel programming???

This same stubborn-minority dominance appears in software, including RCU. Very few machines have more than a few tens of CPUs, but RCU is designed to accommodate thousands. Very few systems run workloads featuring aggressive real-time requirements, but RCU is designed to support low latencies (and even more so the variant of RCU present in the -rt patchset). Very few systems allow physical removal of CPUs while the system is running, but RCU is designed to support that as well. Of course, as with human stubborn minorities, there are limits. RCU handles systems with a few thousand CPUs, but probably would not do all that well on a system with a few million CPUs. RCU supports deep sub-millisecond real-time latencies, but not sub-microsecond latencies. RCU supports controlled removal and insertion of CPUs, but not surprise removal or insertion.

Chapter 6 is titled Intellectual Yet Idiot (with the entertaining subtext “Teach a professor how to deadlift”), and, as might be expected from the title, takes a fair number of respected intellectuals to task, for but two examples, Cass Sunstein and Richard Thaler. I did find the style of this chapter a bit off-putting, but I happened to read Michael Lewis's “The Undoing Project” at about the same time. This informative and entertaining book covers the work of Daniel Kahneman and Amos Tversky (whose work helped to inform that of Sunstein and Thaler), but I found the loss-aversion experiments to be unsettling. After all, what does losing (say) $100 really mean? That I will be sad for a bit? That I won't be able to buy that new book I was looking forward to reading? That I don't get to eat dinner tonight? That I go hungry for a week? That I starve to death? I just might give a very different answer in these different scenarios, mightn't I?

This topic is also covered by Jared Diamond in his most excellent book entitled “The World Until Yesterday”. In the “Scatter your land” section, Diamond discusses how traditional farmers plant multiple small and widely separated plots of land. This practice puzzled anthropologists for some time, as it does the opposite of optimizing yields and minimizing effort. Someone eventually figured out that because these traditional farmers had no way to preserve food and limited opportunities to trade it, there was no value in producing more food than they could consume. But there was value in avoiding a year in which there was no food, and farming different crops in widely separated locations greatly decreased the odds that all their crops in all their plots would fail, thus in turn minimizing the probability of starvation. In short, these farmers were not optimizing for maximum average production, but rather for maximum probability of survival.

And this tradeoff is central to most of Taleb's work to date, including “Skin in the Game”.

But what does this have to do with parallel programming???

Quite a bit, as it turns out. In theory, RCU should just run its state machine and be happy. In practice, there are all kinds of things that can stall its state machine, ranging from indefinitely preempted readers to long-running kernel threads refusing to give up the CPU to who knows what all else. RCU therefore contains numerous forward-progress checks that reduce performance slightly but which also allow RCU to continue working when the going gets rough. This sort of thing is baked even more deeply into the physical engineering disciplines in the form of the fabled engineering factor of safety. For example, a bridge might be designed to handle three times the heaviest conceivable load, thus perhaps surviving a black-swan event such as a larger-than-expected earthquake or tidal wave.

Returning to Skin in the Game, Taleb makes much of the increased quality of decisions when the decider is directly affected by them, and rightly so. However, I became uneasy about cases where the decision and effect are widely separated in time. Taleb does touch obliquely on this topic in a section entitled “How to Put Skin in the Game of Suicide Bombers”, but does not address this topic in more prosaic settings. One could take a survival-based approach, arguing that tomorrow matters not unless you survive today, but in the absence of a very big black swan, a large fraction of the people alive today will still be alive ten years from now.

But what does this have to do with parallel programming???

There is a rather interesting connection, especially when you consider that Linux-kernel RCU's useful lifespan probably exceeds my own. This is not a new thought, and is in fact why I have put so much energy into speaking and writing about RCU. I also try my best to make RCU able to stand up to whatever comes its way, with varying degrees of success over the years.

However, beyond a certain point, this practice is labeled “overengineering”, which is looked down upon within the Linux kernel community. And with good reason: Many of the troubles one might foresee will never happen, and so the extra complexity added to deal with those troubles will provide nothing but headaches for no benefit. In short, my best strategy is to help make sure that there are bright, capable, and motivated people to look after RCU after I am gone. I therefore intend to continue writing and speaking about RCU. :–)

November 04, 2018 03:54 AM

October 30, 2018

Pete Zaitcev: Where is Amazon?

Imagine, purely hypothetically, that you were a kernel hacker working for Red Hat and for whatever reason you wanted to find a new challenge at a company with a strong commitment to open source. What are the possibilities?

To begin with, as the statistics from the Linux Foundation's 2016 report demonstrate, you have to be stark raving mad to leave Red Hat. If you do, Intel and AMD look interesting (hello, Alan Cox). IBM is not bad, although since yesterday, you don't need to quit Red Hat to work for IBM anymore. Even Google, famous for being a black hole that swallows good hackers who are never heard from again, manages to put up a decent showing, Fuchsia or no. Facebook looks unimpressive (no disrespect to DaveJ intended).

Now, the no-shows. Both of them hail from Seattle, WA: Microsoft and Amazon. Microsoft made an interesting effort to adopt Linux into its public cloud, but their strategy was to make Red Hat do all the work. Well, as expected. Amazon, though, is a problem. I managed to get into an argument with David "dwmw2" Woodhouse on Facebook about it, where I brought up a somewhat dated article at The Register. The central claim is, the lack of Amazon's contribution is the result of the policy rolled all the way from the top.

(...) as far as El Reg can tell, the internet titan has submitted patches and other improvements to very few projects. When it does contribute, it does so typically via a third party, usually an employee's personal account that is not explicitly linked to Amazon.

I don't know if this culture can be changed quickly, even if Bezos suddenly changes his mind.

October 30, 2018 03:26 AM

October 25, 2018

Davidlohr Bueso: Linux v4.19: Performance Goodies

This post marks one year since I began doing these kernel performance goodies write-ups, starting from v4.14. And this week Greg released Linux v4.19, so here are some of the changes related to software optimizations, performance and scalability topics across various subsystems.

epoll: loosen irq safety when possible

The epoll code uses an irq-safe spinlock to protect concurrent operations on the ready-event linked list. However, with the exception of the callback done from the wakequeues, the calls to the spinlock are never done in irq context, and therefore there is really no need to save and restore interrupts each time the lock is acquired and released. For example, on x86, a POPF (irqrestore) instruction can be quite expensive as it changes all the flags and is therefore potentially heavy on dependencies. These changes yield some measurable results on a range of epoll_wait(2) microbenchmarks, around 7-20% in raw throughput. This is unsurprising, as PUSHF + POPF is more expensive than STI + CLI.
[Commit 002b343669c4, 304b18b8d6af, 92e641784055, 679abf381a18]
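
The shape of the change, sketched (hypothetical names, not the actual epoll code):

    #include <linux/spinlock.h>
    #include <linux/list.h>

    static DEFINE_SPINLOCK(ready_lock);
    static LIST_HEAD(ready_list);

    static void queue_ready_event(struct list_head *link)
    {
            /* Before, every acquisition saved and restored the whole
             * flags word (PUSHF/POPF on x86):
             *
             *     unsigned long flags;
             *     spin_lock_irqsave(&ready_lock, flags);
             *     ...
             *     spin_unlock_irqrestore(&ready_lock, flags);
             *
             * After: this path is known never to run in irq context,
             * so a plain disable/enable (CLI/STI) suffices. */
            spin_lock_irq(&ready_lock);
            list_add_tail(link, &ready_list);
            spin_unlock_irq(&ready_lock);
    }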

sched/numa: migrate pages to local nodes quicker early in the lifetime of a task

Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently. Threads inherit both page tables and the preferred node ID from the parent. This means that threads can trigger hinting faults earlier than a new task which delays scanning for a number of seconds. As it can be load balanced very early in its lifetime there can be an unnecessary delay before it starts migrating thread-local data. This patch migrates private pages faster early in the lifetime of a thread using the sequence counter as an identifier of new tasks.
[Commit 37355bdc5a12]
 

rcu: check if GP already requested

This commit makes rcu_nocb_wait_gp() check whether the current CPU already knows that the needed grace period has been requested. If so, it avoids acquiring the corresponding leaf rcu_node structure's lock, thus decreasing contention. This optimization is intended for cases where either multiple leader rcu kthreads are running on the same CPU or these kthreads are running on a non-offloaded (e.g., housekeeping) CPU.
[Commit ab5e869c1f7a]

cpufreq/schedutil: take into account time spent in irq

Time spent in interrupt handlers was not being accounted for in the CPU utilization when selecting an operating performance point. This can be a significant amount of time, which is reported in the normal context time window. The new CPU utilization yields a 10% performance boost on iperf workloads.
[Commit 9033ea11889f]

mm/page_alloc: enlarge zone's batch size

The page allocator will first try to use a percpu set of pages, then, if all are used up, ask the Buddy allocator for a batch of pages. The size of this batch can have a number of consequences, including performance. The last time this magic number was increased was 13 years ago, and there have been numerous hardware improvements since then. As such, a recent study with allocator-intensive benchmarks shows that doubling the size of the batch can yield improvements on larger/modern machines.
[Commit d8a759b57035]

mm: skip invalid pages block at a time in zero_resv_unavail()

The role of zero_resv_unavail() is to make sure that every struct page that is allocated but is not backed by memory accessible to the kernel is zeroed and not left in some uninitialized state. Since struct pages are allocated in blocks, we can skip pageblock_nr_pages at a time when the first one is found to be invalid. This optimization may help since, on x86, every hole in the e820 maps is now marked as reserved in memblock, and thus will go through this function.
[Commit 720e14ebec64]

kvm, x86: implement paravirt "send IPI" hypercall

Replace sending IPIs one by one for xAPIC physical mode by a single hypercall (vmexit). This patchset lets a guest send multicast IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. An IPI microbenchmark shows non-trivial performance improvements for broadcast IPIs (send IPI to all online CPUs and force them to take/drop a spinlock).
[Commit 4180bf1b655a]

arm64: use queued spinlocks

Similar to x86, replace the old ticket spinlocks with fair qspinlocks, making use of MCS features and gaining better performance under virtualization. This is particularly suitable for larger multicore machines.
[Commit c11090474d70]

October 25, 2018 06:19 PM

October 22, 2018

Kees Cook: security things in Linux v4.19

Previously: v4.18.

Linux kernel v4.19 was released today. Here are some security-related things I found interesting:

L1 Terminal Fault (L1TF)

While it seems like ages ago, the fixes for L1TF actually landed at the start of the v4.19 merge window. As with the other speculation flaw fixes, lots of people were involved, and the scope was pretty wide: bare metal machines, virtualized machines, etc. LWN has a great write-up on the L1TF flaw and the kernel’s documentation on L1TF defenses is equally detailed. I like how clean the solution is for bare-metal machines: when a page table entry should be marked invalid, instead of only changing the “Present” flag, it also inverts the address portion so even a speculative lookup ignoring the “Present” flag will land in an unmapped area.

protected regular and fifo files

Salvatore Mesoraca implemented an O_CREAT restriction in /tmp directories for FIFOs and regular files. This is similar to the existing symlink restrictions, which take effect in sticky world-writable directories (e.g. /tmp) when the opening user does not match the owner of the existing file (or directory). When a program opens a FIFO or regular file with O_CREAT and this kind of user mismatch, it is treated like it was also opened with O_EXCL: it gets rejected because there is already a file there, and the kernel wants to protect the program from writing possibly sensitive contents to a file owned by a different user. This has become a more common attack vector now that symlink and hardlink races have been eliminated.
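
From userspace, the newly-protected case looks like this hypothetical snippet (path invented):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* In a sticky world-writable directory such as /tmp, if
             * report.txt already exists and belongs to another user,
             * this O_CREAT open is now treated as if O_EXCL had been
             * passed, and fails rather than reuse the foreign file. */
            int fd = open("/tmp/report.txt", O_WRONLY | O_CREAT, 0600);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* The robust pattern remains passing O_EXCL explicitly,
             * or better yet using mkstemp(3). */
            close(fd);
            return 0;
    }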

syscall register clearing, arm64

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Similar to what was done on x86, Mark Rutland implemented a full register-clearing syscall wrapper on arm64.

Variable Length Array removals, part 3

As mentioned in part 1 and part 2, VLAs continue to be removed from the kernel. While CONFIG_THREAD_INFO_IN_TASK and CONFIG_VMAP_STACK cover most issues with stack exhaustion attacks, not all architectures have those features, so getting rid of VLAs makes sure we keep a few classes of flaws out of all kernel architectures and configurations. It’s been a long road, and it’s shaping up to be a 4-part saga with the remaining VLA removals landing in the next kernel. For v4.19, several folks continued to help grind away at the problem: Arnd Bergmann, Kyle Spiers, Laura Abbott, Martin Schwidefsky, Salvatore Mesoraca, and myself.

shift overflow helper
Jason Gunthorpe noticed that while the kernel recently gained add/sub/mul/div helpers to check for arithmetic overflow, we didn’t have anything for shift-left. He added check_shl_overflow() to round out the toolbox and Leon Romanovsky immediately put it to use to solve an overflow in RDMA.
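
Usage is symmetric with the other overflow helpers: the macro evaluates to true when the shift would overflow the destination, roughly as in this sketch (invented function; PAGE_SHIFT used as an example shift count):

    #include <linux/overflow.h>
    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Hypothetical: convert a page count to a byte count, refusing
     * requests where npages << PAGE_SHIFT would not fit in size_t. */
    static int pages_to_bytes(size_t npages, size_t *out)
    {
            size_t bytes;

            if (check_shl_overflow(npages, PAGE_SHIFT, &bytes))
                    return -EOVERFLOW;
            *out = bytes;
            return 0;
    }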

Edit: I forgot to mention this next feature when I first posted:

trusted architecture-supported RNG initialization

The Random Number Generator in the kernel seeds its pools from many entropy sources, including any architecture-specific sources (e.g. x86’s RDRAND). Due to many people not wanting to trust the architecture-specific source due to the inability to audit its operation, entropy from those sources was not credited to RNG initialization, which wants to gather “enough” entropy before claiming to be initialized. However, because some systems don’t generate enough entropy at boot time, it was taking a while to gather enough system entropy (e.g. from interrupts) before the RNG became usable, which might block userspace from starting (e.g. systemd wants to get early entropy). To help these cases, Ted Ts’o introduced a toggle to trust the architecture-specific entropy completely (i.e. RNG is considered fully initialized as soon as it gets the architecture-specific entropy). To use this, the kernel can be built with CONFIG_RANDOM_TRUST_CPU=y (or booted with “random.trust_cpu=on“).

That’s it for now; thanks for reading. The merge window is open for v4.20! Wish us luck. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 22, 2018 11:17 PM

October 16, 2018

Matthew Garrett: Initial thoughts on MongoDB's new Server Side Public License

MongoDB just announced that they were relicensing under their new Server Side Public License. This is basically the Affero GPL except with section 13 largely replaced with new text, as follows:

If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge, under the terms of this License. Making the functionality of the Program or modified version available to third parties as a service includes, without limitation, enabling third parties to interact with the functionality of the Program or modified version remotely through a computer network, offering a service the value of which entirely or primarily derives from the value of the Program or modified version, or offering a service that accomplishes for users the primary purpose of the Software or modified version.

“Service Source Code” means the Corresponding Source for the Program or the modified version, and the Corresponding Source for all programs that you use to make the Program or modified version available as a service, including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.


MongoDB admit that this license is not currently open source in the sense of being approved by the Open Source Initiative, but say: “We believe that the SSPL meets the standards for an open source license and are working to have it approved by the OSI.”

At the broadest level, AGPL requires you to distribute the source code to the AGPLed work[1] while the SSPL requires you to distribute the source code to everything involved in providing the service. Having a license place requirements around things that aren't derived works of the covered code is unusual but not entirely unheard of - the GPL requires you to provide build scripts even if they're not strictly derived works, and you could probably make an argument that the anti-Tivoisation provisions of GPL3 fall into this category.

A stranger point is that you're required to provide all of this under the terms of the SSPL. If you have any code in your stack that can't be released under those terms then it's literally impossible for you to comply with this license. I'm not a lawyer, so I'll leave it up to them to figure out whether this means you're now only allowed to deploy MongoDB on BSD because the license would require you to relicense Linux away from the GPL. This feels sloppy rather than deliberate, but if it is deliberate then it's a massively greater reach than any existing copyleft license.

You can definitely make arguments that this is just a maximalist copyleft license, the AGPL taken to extreme, and therefore it fits the open source criteria. But there's a point where something is so far from the previously accepted scenarios that it's actually something different, and should be examined as a new category rather than already approved categories. I suspect that this license has been written to conform to a strict reading of the Open Source Definition, and that any attempt by OSI to declare it as not being open source will receive pushback. But definitions don't exist to be weaponised against the communities that they seek to protect, and a license that has overly onerous terms should be rejected even if that means changing the definition.

In general I am strongly in favour of licenses ensuring that users have the freedom to take advantage of modifications that people have made to free software, and I'm a fan of the AGPL. But my initial feeling is that this license is a deliberate attempt to make it practically impossible to take advantage of the freedoms that the license nominally grants, and this impression is strengthened by it being something that's been announced with immediate effect rather than something that's been developed with community input. I think there's a bunch of worthwhile discussion to have about whether the AGPL is strong and clear enough to achieve its goals, but I don't think that this SSPL is the answer to that - and I lean towards thinking that it's not a good faith attempt to produce a usable open source license.

(It should go without saying that this is my personal opinion as a member of the free software community, and not that of my employer)

[1] There's some complexities around GPL3 code that's incorporated into the AGPLed work, but if it's not part of the AGPLed work then it's not covered


October 16, 2018 10:44 PM

October 15, 2018

Davidlohr Bueso: Linux v4.18: Performance Goodies

Linux v4.18 has been out for two months now, making this post a bit late, but still in time before the next release. Also, there was too much drama around the CoC to care about performance topics :P As always, the release comes with a series of performance enhancements and optimizations across subsystems.

locking: avoid pointless TEST instructions

A number of places within the locking primitives have been optimized to avoid a superfluous TEST instruction on the CAS return value by relying on try_cmpxchg, generating slightly better code for x86-64 (for arm64 there is really no difference). Such is the case for the mutex fastpath (uncontended case) and queued spinlocks.
[Commit c427f69564e2, ae75d9089ff7]
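
The pattern change, sketched generically (not the exact mutex or qspinlock code):

    #include <linux/atomic.h>

    /* Old pattern: cmpxchg() returns the old value, which the caller
     * compares again, emitting a redundant TEST/CMP on x86-64:
     *
     *     if (atomic_cmpxchg(v, old, new) == old)
     *             ...
     *
     * New pattern: try_cmpxchg() returns a boolean derived directly
     * from the flags set by the CMPXCHG itself, and updates 'old' in
     * place on failure, which also tightens retry loops. */
    static bool try_take(atomic_t *v, int old, int new)
    {
            return atomic_try_cmpxchg(v, &old, new);
    }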

locking/mcs: optimize cpu spinning

Some architectures, such as arm64, can enter a low-power standby state (spin-waiting) instead of purely spinning on a condition. This is applied to the MCS spin loop, which in turn directly helps queued spinlocks. On x86, this can also be cheaper than spinning on smp_load_acquire().
[Commit 7f56b58a92aa]

mm/mremap: reduce amount of TLB shootdowns

It was discovered that on an mremap-dominated workload, the number of TLB flushes was excessive, causing overall performance issues. By removing the LATENCY_LIMIT magic number and handling TLB flushes on a PMD boundary instead of every 64 pages, the number of shootdowns can be reduced by a factor of 8 in the ideal case. The LATENCY_LIMIT was almost certainly used originally to limit the PTL hold times, but the latency savings are likely shadowed by the cost of IPIs in many cases.
[Commit 37a4094e828f]

mm: replace mmap_sem to protect cmdline and environ procfs files

Reducing (ab)users of the mmap_sem is always good for general address space performance. Introduce a new mm->arg_lock to protect against races when handling /proc/$PID/{cmdline,environ} files; this (mostly) removes the need for the semaphore.
[Commit 88aa7cc688d4]

mm/hugetlb: make better use of page clearing optimization

Pass the fault address (the address of the sub-page being accessed) to the no-page fault handler to make better use of the general huge page clearing optimization. This allows the sub-page being accessed to be cleared last, so that its cache lines are not evicted while the other sub-pages are cleared. Throughput improvements of around 30% were reported for the vm-scalability.anon-w-seq workload under hugetlbfs.
[Commit 285b8dcaacfc]

sched: don't schedule threads on pre-empted vCPUs

It can be determined whether a vCPU is running, in order to prioritize CPUs when scheduling threads. If a vCPU has been pre-empted, it will incur the extra cost of a VMENTER plus the time before it is actually running on the host CPU. If there are other vCPUs which are actually running on their host CPUs and are idle, we should schedule threads there instead.
[Commit 247f2f6f3c70, 943d355d7fee]

sched/numa: Stagger NUMA balancing scan periods for new threads

It is redundant and counterproductive for threads sharing an address space to change the protections to trap NUMA faults. Potentially only one thread is required, but that thread may be idle or it may not have any locality concerns and pick an unsuitable scan rate. This patch uses independent scan periods, staggered based on the number of address space users when the thread is created.

The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces the total scan activity early in the lifetime of the threads. The difference in headline performance across a range of machines and workloads is marginal, but the system CPU usage is reduced, as is the overall scan activity.
[Commit 137844759843]

block/bfq: postpone rq preparation to insert or merge

A lock contention point is removed (see the patch for details and justification) by postponing request preparation to insertion or merging, as the lock no longer needs to be grabbed in the prepare_request hook.
[Commit 18e5a57d7987]

btrfs: improve rmdir performance for large directories

When checking if a directory can be deleted, instead of ensuring all its children have been processed, this optimization keeps track of the directory index offset of the child last checked in the last call to can_rmdir(), and then uses it as the starting point for future calls. The changes were shown to yield massive performance benefits; for a test directory with two million files being deleted, the runtime is reduced from half an hour to less than two seconds.
[Commit 0f96f517dcaa]



KVM: VMX: Optimize tscdeadline timer latency

Add advance tscdeadline expiration support, in which the tscdeadline timer is emulated by the VMX preemption timer, to reduce the hypervisor latency (handle_preemption_timer -> vmentry). The guest can also set an expiration that is very small; in that case we set delta_tsc to 0, leading to an immediate vmexit when delta_tsc is not bigger than advance ns. This patch can reduce latency by ~63% for kvm-unit-tests/tscdeadline_latency when testing busy waits.
[Commit c5ce8235cffa]

net/sched: NOLOCK qdisc performance enhancements and fixes

There have been various performance-related core changes to the NOLOCK qdisc code. The first reduces the atomic operations on __QDISC_STATE_RUNNING. The bit is flipped twice per packet in the uncontended scenario with packet rate below the line rate: on packet dequeue, and on the next, failing dequeue attempt. The changes simplify the qdisc and move the bit manipulation into the qdisc_run_{begin,end} helpers, so that the bit is now flipped only once per packet, with measurable performance improvement in the uncontended scenario.

Later, the above is actually replaced by using a sequence spinlock instead of the atomic approach, to address pfifo_fast performance regressions. There is also a reduction in the Qdisc struct memory footprint (spanning one cacheline less).
[Commit 96009c7d500e, 021a17ed796b, e9be0e993d95]

lib/idr: improve scalability by reducing IDA lock granularity

Improve the scalability of the IDA by using the per-IDA xa_lock rather than the global simple_ida_lock.  IDAs are not typically used in performance-sensitive locations, but since we have this lock anyway, we can use it.
[Commit b94078e69533]

x86-64: micro-optimize __clear_user()

Using immediate constants saves two registers.
[Commit 1153933703d9]

arm64: select ARCH_HAS_FAST_MULTIPLIER

It is probably safe to assume that all Armv8-A implementations have a multiplier whose efficiency is comparable or better than a sequence of three or so register-dependent arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the few dusty old corners which care.
[Commit e75bef2a4fe2]

October 15, 2018 08:19 PM

October 11, 2018

Pete Zaitcev: I'd like to interject for a moment

In a comment on the death of G+, elisteran brought up something that long annoyed me out of all proportion with its actual significance. What do you call a collection of servers communicating through NNTP? You don't call them "INN", you call them "Usenet". The system of hosts communicating through SMTP is not called "Exim", it is called "e-mail". But when someone wants to escape G+, they often consider "Mastodon". Isn't it odd?

Mastodon is merely an implementation of Fediverse. As it happens, only one of my Fediverse channels runs on Mastodon (the Japanese language one at Pawoo). My main one still uses Gnusocial; the anime one was on Gnusocial and migrated to Pleroma a few months ago. All of them communicate using the OStatus protocol, although a movement is afoot to switch to ActivityPub. Hopefully it's more successful than the migration from RSS to Atom was.

Yet, I noticed that a lot of people fall for the idea that Mastodon is an exclusive brand. Rarely does one have to know or care what MTA someone else uses. Microsoft was somewhat successful in establishing Outlook as such a powerful brand, to the exclusion of compatible e-mail software. The maintainer of Mastodon is doing his hardest to present it as a similar brand, and regrettably, he's very successful at that.

I guess what really drives me mad about this is how Eugen uses his mindshare advantage to drive protocol extensions. All Fediverse implementations generally communicate freely with one another, but as Pleroma and Mastodon develop, they gradually leave Gnusocial behind in features. In particular, Eugen found a loophole in the protocol, which allows attaching pictures without using up space in the message for the URL. When Gnusocial displays a message with an attachment, it only displays the text, not the picture. This actually used to be a server setting, in case you want to spare your instance from NSFW imagery and save a little bandwidth. But these days pictures are so prevalent that it's pretty much impossible to live without receiving them. In this, Eugen has completed the "extend" phase and is moving on to "extinguish".

I'm not sure if this is a lost cause by now. At least I hope that members of my social circle migrate to Fediverse in general, and not to Mastodon from the outset. Of course, the implementation does matter when they make choices. As I mentioned, for anything but Linux discussions, pictures are essential, so one cannot reasonably use a Gnusocial instance for anime, for example. And, I can see some users liking Mastodon's UI. And, Mastodon's native app support is better (or not). So yes, by all means, if you want to install Mastodon, or join an instance that's running Mastodon, be my guest. Just realize that Mastodon is an implementation of Fediverse and not the Fediverse itself.

UPDATE 2019/02/11: Chris finds a silver lining.

October 11, 2018 01:16 PM

October 09, 2018

Pete Zaitcev: Ding-dong, the witch is dead

Reactions by G+ inhabitants were better than expected at times. Here's Jon Masters:

For the three people who care about G+: it's closing down. This is actually a good thing. If you work in kernel or other nerdy computery circles and this is your social media platform, I have news for you...there's a world outside where actual other people exist. Try it. You can then follow me on Twitter at @jonmasters when you get bored.

Rock on. Although LJ was designed as a shitty silo, it wasn't powerful enough to make itself useless. For example, outgoing links aren't limited. That said, LJ isn't bulletproof: the management is pushing the "new" editor that does not allow HTML. The point is though, there's a real world out there.

And, some people are afraid of it, and aren't ashamed to admit it. Here's Steven Rostedt in Jon's comments:

In other words, we are very aware of the world outside of here. This is where we avoided that world ;-)

So weak. Jon is a titan among his entourage.

Kir enumerated escape plans thus (in my translation):

Where to run is unclear. I don't want Facebook, Telegram is kind of a marginal platform (although Google+ is marginal too), and I'm too lazy to stand up a standalone. Nothing but LJ comes to mind.

One thing that comes across very strongly is how reluctant people are to run their own infrastructure. For one thing, the danger of a devastating DDoS is absolutely real. And then you have to deal with spam. Those who do not have the experience also tend to over-estimate the amount of effort you have to put into running "dnf update" once in a while.

Personally, I think that although of course it's annoying, the time wasted on the infra is not that great, or at least it wasn't for me. The spam can be kept under control with minimal effort. Or, it could be addressed in drastic ways: for example, my anime blog simply does not have comments at all. As far as DoS goes, yes, it's a lottery. But then the silo platform can easily die (like G+), or ban you. This actually happens a lot more often than those hiding their heads in the sand like to admit. And you don't need to go as far as admitting your support of President Trump in order to get banned. Anything can trigger it, and the same crazies that DoS you will also try to deplatform you.

One other idea I was very successful with, and many people have trouble accepting, is having several channels for social posting (obviously CKS was ahead of his time in separating pro and hobby). Lots and lots of G+ posters insist on dumping all the garbage into one bin, instead of separating the output. Perhaps now they'll find a client or device that allows them to switch accounts easily.

October 09, 2018 02:46 PM

October 07, 2018

Pete Zaitcev: Python and journalism

Back in July, the Economist wanted to judge the popularity of programming languages and used ... Google Trends. Python is rocketing up, BTW. Go is not even mentioned.

October 07, 2018 08:04 PM

October 04, 2018

Andy Grover: Stratis 1.0 released!

We just tagged Stratis 1.0.

I can’t believe I haven’t blogged about Stratis before, although I’ve written in other places about it. We’ve been working on it for two years.

Basically, it’s a fancy manager of device-mapper and XFS configuration, to provide a similar experience as ZFS and Btrfs, but completely different under the hood.

Four things that took the most development time (so far)
  1. Writing the design doc. Early on, much of the work was convincing people the approach we wanted was a good one. We spent a lot of time discussing details among ourselves and winning over internal stakeholders (or not), but most of all, showing that we had given serious thought to various alternatives, and had spent some time to comprehend the consequences of initial design choices. Having the design doc made these discussions easier, and solicited feedback that resulted in a much better design than what we started with.
  2. Implementing on-disk metadata formats and algorithms to protect maximally against corruption and over-write. People said it would take more time than we thought and they…weren’t wrong! I still think implementing this was the right call, however.
  3. The hordes of range lists Stratis manages internally. It was probably inevitable that using multiple device-mapper layers involves a lot of range mapping. Stratis does a lot of it now, and it will be doing way more in the future, once we start using DM devices like integrity, raid, and compression. Rust really came through for us here I think. Rust’s functional aspects work very well for things like mapping and allocating.
  4. The D-Bus interface was a big effort in the pre-0.5 timeframe, but now that it is up and running it’s easy to maintain and update. We owe much of this to the quality of the dbus-rs library, and the receptivity of its author, diwic, who helped us understand how to use it, and also helped add small bits that aided our usage of D-Bus.
People to thank

Thanks to Igor Gnatenko and Josh Stone, two people who played a large part in making Rust on Fedora a reality. When I started writing the prototype for Stratis, this was a big question mark! I just hoped that the value of Rust would ensure that sooner or later Rust would be supported on Fedora and RHEL, and thanks to these two (and others, and, oh, you know, Firefox needing it…) it worked out.

I’d also like to thank the Rust community, for making such a compelling, productive systems language through friendliness and respect, sweating the details, and sharing! Like I alluded to before, Rust’s functional style was a good match for our problem space, and Rust’s intense focus on error handling also was perfect for a critical piece of software like stratisd, where what to do about errors is the most important part of what it does.

Finally, I’d like to thank the other members of the Stratis core team: Todd, Mulhern, and Tony. Stratis 1.0 is immeasurably better because of the different backgrounds and strengths we each brought to bear on developing this new piece of software. Thanks, everybody. You made 1.0 happen.

The Future

The 1.0 release marks the end of the beginning, so to speak. We just left the Shire, Frodo! Stratis is a viable product, but there’s so much more to do. Integrating more high-value device-mapper layers, more integration with other storage APIs (both “above” and “below”), more flexibility around adding and removing storage devices, while keeping the UI clean and the admin work low, is the challenge.

Stratis is going to need some major help to get there. For people interested in doing development, testing, packaging, or using Stratis, I invite you to visit our website and GitHub, or just keep tabs by following the project on Google Plus or Twitter.

October 04, 2018 10:44 PM

September 27, 2018

James Morris: 2018 Linux Security Summit North America: Wrapup

The 2018 Linux Security Summit North America (LSS-NA) was held last month in Vancouver, BC.

Attendance continued to grow this year, with a record of 220+ attendees.  Our room was upgraded as a result, with spectacular views.

Linux Security Summit NA 2018, Vancouver, BC

We also had many great proposals and the schedule ended up being a very tight fit.  We’ve asked for an extra day for LSS-NA next year — here’s hoping.

Slides of all presentations are available here: https://events.linuxfoundation.org/events/linux-security-summit-north-america-2018/program/slides/

Videos may be found in this youtube playlist.

Once again, as is typical, the conference was focused around development, somewhat uniquely in the world of security conferences.  It’s interesting to see more attention seemingly being paid to the lower parts of the stack: secure booting, firmware, and hardware roots of trust, as well as the continued efforts in hardening the kernel.

LWN provided some excellent coverage of LSS-NA:

Paul Moore has a brief writeup here.

Thanks to everyone involved in the event for 2018: the speakers, attendees, the program committee, the sponsors, and the organizing team at the Linux Foundation.  LSS-NA would not be possible without all of you!

September 27, 2018 08:06 PM

September 26, 2018

Pete Zaitcev: Postgres vs MySQL

Unexpectedly in the fediverse:

[...] in my experience, postgres crashes less, and the results are less devastating if it does crash. I've had a mysql crash make the data unrecoverable. On the other hand I have a production postgres 8.1 installation (don't ask) that has been running without problems for over 10 years.

There is more community information and more third-party tools that require mysql, it has that advantage. the client tools for mysql are easier to use because the commands are in plain english ("show tables") unlike postgres that are commands like "\dt+". but if I'm doing my own thing though, I use postgres.

Reiser, move over. There's a new game in town.

September 26, 2018 01:15 AM

September 25, 2018

Pete Zaitcev: Huawei UI/UX fail

The Huawei M3 gave me an unpleasant surprise a short time ago. I had it in my hands while doing something and my daughter (age 30) offered to hold it for me. When I received it back and turned it on, it was factory reset. What happened?

It turned out that it's possible to reset the blasted thing merely by holding it. If someone grabs it and pays no attention to what's on the screen, then it's easy to press and hold the edge power button inadvertently. That brings up a dialog with two touch buttons, for power-off and reset. The same hand that's holding the power button touches the screen and causes the reset (a knuckle where the finger meets the palm does that perfectly).

The following combination of factors makes this happen.
  1. The power button is on the edge, and it sticks out. Some tablets like the Kindle or Nexus have somewhat slanted edges, so the buttons are somewhat protected. Holding the tablet across the face engages the power button. They could at least have placed the power button on the short edge of the device.
  2. The size of the tablet is just large enough that a normal person can hold it with one hand, but has to stretch. Therefore, the base knuckles touch the surface. On a larger tablet, a human hand is not large enough to hold it like that, and on a phone-sized device the palm cups, so it does not touch the center of the screen.
  3. The protection against accidental reset is essentially absent.

Huawei, not even once.

Google, bring back the Nexus 7, please.

September 25, 2018 01:20 AM

September 20, 2018

Valerie Aurora: Something is rotten in the Linux Foundation

When I agreed to talk about the management problems at the Linux Foundation to Noam Cohen, the reporter who wrote this story on Linux for the New Yorker, I expected to wait at least a year to see any significant change in the Linux community. Instead, before the story was even published, the Linux project …

September 20, 2018 07:04 PM

September 17, 2018

Pete Zaitcev: Robots on TV

Usually I do not watch TV, but I traveled and saw a few of them in public food-intake places and such. What caught my attention were ads for robotics companies, aimed at business customers. IIRC, the companies were called generic names like "Universal Robotics" and "Reach Robotics". Or so I recall; on second thought, Reach Robotics is a thing, but it focuses on gaming, not traditional robotics. But the ads depicted robots doing some unspecified stuff: moving objects from place to place. Not dueling bots. Anyway, what's up with this? Is there some sort of revolution going on? What was the enabler? Don't tell me it's all the money released by the end of Moore's Law, seeking random fields of application.

P.S. I know about the "Pentagon's Evil Mechanical Dogs" by Boston Dynamics. These were different, manipulating objects in the environment.

September 17, 2018 07:35 PM

September 12, 2018

Gustavo F. Padovan: linuxdev-br: a Linux international conference in Brazil

The second edition of linuxdev-br happened at the end of last month in Campinas, Brazil. We have put together a nice write-up about the conference at the link below. Soon we will start planning next year’s event. Come and join our community!

linuxdev-br: a Linux international conference in Brazil


September 12, 2018 11:48 AM

September 10, 2018

Matthew Garrett: The Commons Clause doesn't help the commons

The Commons Clause was announced recently, along with several projects moving portions of their codebase under it. It's an additional restriction intended to be applied to existing open source licenses with the effect of preventing the work from being sold[1], where the definition of being sold includes being used as a component of an online pay-for service. As described in the FAQ, this changes the effective license of the work from an open source license to a source-available license. However, the site doesn't go into a great deal of detail as to why you'd want to do that.

Fortunately one of the VCs behind this move wrote an opinion article that goes into more detail. The central argument is that Amazon make use of a great deal of open source software and integrate it into commercial products that are incredibly lucrative, but give little back to the community in return. By adopting the commons clause, Amazon will be forced to negotiate with the projects before being able to use covered versions of the software. This will, apparently, prevent behaviour that is not conducive to sustainable open-source communities.

But this is where things get somewhat confusing. The author continues:

Our view is that open-source software was never intended for cloud infrastructure companies to take and sell. That is not the original ethos of open source.

which is a pretty astonishingly unsupported argument. Open source code has been incorporated into proprietary applications without giving back to the originating community since before the term open source even existed. MIT-licensed X11 became part of not only multiple Unixes, but also a variety of proprietary commercial products for non-Unix platforms. Large portions of BSD ended up in a whole range of proprietary operating systems (including older versions of Windows). The only argument in favour of this assertion is that cloud infrastructure companies didn't exist at that point in time, so they weren't taken into consideration[2] - but no argument is made as to why cloud infrastructure companies are fundamentally different to proprietary operating system companies in this respect. Both took open source code, incorporated it into other products and sold them on without (in most cases) giving anything back.

There's one counter-argument. When companies sold products based on open source code, they distributed it. Copyleft licenses like the GPL trigger on distribution, and as a result selling products based on copyleft code meant that the community would gain access to any modifications the vendor had made - improvements could be incorporated back into the original work, and everyone benefited. Incorporating open source code into a cloud product generally doesn't count as distribution, and so the source code disclosure requirements don't trigger. So perhaps that's the distinction being made?

Well, no. The GNU Affero GPL has a clause that covers this case - if you provide a network service based on AGPLed code then you must provide the source code in a similar way to if you distributed it under a more traditional copyleft license. But the article's author goes on to say:

AGPL makes it inconvenient but does not prevent cloud infrastructure providers from engaging in the abusive behavior described above. It simply says that they must release any modifications they make while engaging in such behavior.

IE, the problem isn't that cloud providers aren't giving back code, it's that they're using the code without contributing financially. There's no difference between what cloud providers are doing now and what proprietary operating system vendors were doing 30 years ago. The argument that "open source" was never intended to permit this sort of behaviour is simply untrue. The use of permissive licenses has always allowed large companies to benefit disproportionately when compared to the authors of said code. There's nothing new to see here.

But that doesn't mean that the status quo is good - the argument for why the commons clause is required may be specious, but that doesn't mean it's bad. We've seen multiple cases of open source projects struggling to obtain the resources required to make a project sustainable, even as many large companies make significant amounts of money off that work. Does the commons clause help us here?

As hinted at in the title, the answer's no. The commons clause attempts to change the power dynamic of the author/user role, but it does so in a way that's fundamentally tied to a business model and in a way that prevents many of the things that make open source software interesting to begin with. Let's talk about some problems.

The power dynamic still doesn't favour contributors

The commons clause only really works if there's a single copyright holder - if not, selling the code requires you to get permission from multiple people. But the clause does nothing to guarantee that the people who actually write the code benefit, merely that whoever holds the copyright does. If I rewrite a large part of a covered work and that code is merged (presumably after I've signed a CLA that assigns a copyright grant to the project owners), I have no power in any negotiations with any cloud providers. There's no guarantee that the project stewards will choose to reward me in any way. I contribute to them but get nothing back in return - instead, my improved code allows the project owners to charge more and provide stronger returns for the VCs. The inequity has shifted, but individual contributors still lose out.

It discourages use of covered projects

One of the benefits of being able to use open source software is that you don't need to fill out purchase orders or start commercial negotiations before you're able to deploy. Turns out the project doesn't actually fill your needs? Revert it, and all you've lost is some development time. Adding additional barriers is going to reduce uptake of covered projects, and that does nothing to benefit the contributors.

You can no longer meaningfully fork a project

One of the strengths of open source projects is that if the original project stewards turn out to violate the trust of their community, someone can fork it and provide a reasonable alternative. But if the project is released with the commons clause, it's impossible to sell any forked versions - anyone who wishes to do so would still need the permission of the original copyright holder, and they can refuse that in order to prevent a fork from gaining any significant uptake.

It doesn't inherently benefit the commons

The entire argument here is that the cloud providers are exploiting the commons, and by forcing them to pay for a license that allows them to make use of that software, the commons will benefit. But there's no obvious link between these things. Maybe extra money will result in more development work being done and the commons benefiting, but maybe extra money will instead just result in greater payout to shareholders. Forcing cloud providers to release their modifications to the wider world would be of benefit to the commons, but this is explicitly ruled out as a goal. The clause isn't inherently incompatible with this - the negotiations between a vendor and a project to obtain a license to be permitted to sell the code could include a commitment to provide patches rather than money, for instance, but the focus on money makes it clear that this wasn't the authors' priority.

What we're left with is a license condition that does nothing to benefit individual contributors or other users, and costs us the opportunity to fork projects in response to disagreements over design decisions or governance. What it does is ensure that a range of VC-backed projects are in a better position to improve their returns, without any guarantee that the commons will be left better off. It's an attempt to solve a problem that's existed since before the term "open source" was even coined, by simply layering on a business model that's also existed since before the term "open source" was even coined[3]. It's not anything new, and open source derives from an explicit rejection of this sort of business model.

That's not to say we're in a good place at the moment. It's clear that there is a giant level of power disparity between many projects and the consumers of those projects. But we're not going to fix that by simply discarding many of the benefits of open source and going back to an older way of doing things. Companies like Tidelift[4] are trying to identify ways of making this sustainable without losing the things that make open source a better way of doing software development in the first place, and that's what we should be focusing on rather than just admitting defeat to satisfy a small number of VC-backed firms that have otherwise failed to develop a sustainable business model.

[1] It is unclear how this interacts with licenses that include clauses that assert you can remove any additional restrictions that have been applied
[2] Although companies like Hotmail were making money from running open source software before the open source definition existed, so this still seems like a reach
[3] "Source available" predates my existence, let alone any existing open source licenses
[4] Disclosure: I know several people involved in Tidelift, but have no financial involvement in the company


September 10, 2018 11:38 PM

September 08, 2018

Paul E. Mc Kenney: Ancient Hardware I Have Hacked: Back to Basics!

My return to the IBM mainframe was delayed by my high school's acquisition of a teletype connected via a 110-baud serial line to a timesharing system featuring the BASIC language. I was quite impressed with this teletype because it could type quite a bit faster than I could. But this is not as good as it might sound, given that I came in dead last in every test of manual dexterity that the school ever ran us through. In fact, on a good day, I might have been able to type 20 words a minute, and it took decades of constant practice to eventually get above 70 words a minute. In contrast, one of the teachers could type 160 words a minute, more than half again faster than the teletype could!

Aside from output speed, I remained unimpressed with computers compared to paper and pencil, let alone compared to my pocket calculator. And given that this was old-school BASIC, there was much to be unimpressed about. You could name your arrays anything you wanted, as long as that name was a single upper-case character. Similarly, you could name your scalar variables anything you wanted, as long as that name was either a single upper-case character or a single upper-case character followed by a single digit. This allowed you to use up to 286 variables, up to 26 of which could be arrays. If you felt that GOTO was harmful, too bad. If you wanted a while loop, you could make one out of IF statements. Not only did IF statements have no else clause, the only thing that could be in the THEN clause was the number of the line to which control would transfer when the IF condition evaluated to true. And each line had to be numbered, and the numbers had to be monotonically increasing, that is, in the absence of control-flow statements, the program would execute the lines of code in numerical order, regardless of the order in which you typed those lines of code. Definitely a step down, even from FORTRAN.

But then the teacher showed the class a documentary movie showing several problems that could be solved by computer. I was unimpressed by most of the problems: Printing out prime numbers was impressive but pointless, and maximizing the volume of a box given limited materials was a simple pencil-and-paper exercise in calculus. But the finite-element analysis fluid-flow problem did get my attention. This featured a rectangular aquarium with a glass divider, so that initially the right-hand half of the aquarium was full of water and the left-hand half was full of air. They abruptly removed the glass divider, causing the water to slosh back and forth. They then showed a video of a computer simulation of the water flow, which matched the actual water flow quite well. There was no way I could imagine doing anything like that by hand, and was thus inspired to continue studying computer programming.

We students therefore searched out things that the computer could do that we were unwilling or unable to. One of my classmates ran the teletype's punch-tape output through its punch-tape reader, thus giving us all great insight as to why teletypes on television shows appeared to be so busy. For some reason, our teacher felt that this project was a waste of both punched tape and paper. He was more impressed with the work of another classmate, who calculated and ASCII-art printed magnetic lines of force. Despite the teletype's use of eight-bit ASCII, its print head was quite innocent of lower-case characters.

I coded up a project that plotted the zeroes of functions of two variables as ASCII art on the teletype. My teacher expressed some disappointment in my brute-force approach to locating the zeroes, but as far as I could see the bottleneck was the teletype, not the CPU. Besides, the timesharing service charged only for connect time, so CPU time was free, and why conserve a zero-cost resource?

I worked around the computer's limited arithmetic using crude multi-precision code with the goal of computing one thousand factorial. In this case, CPU was definitely the bottleneck, especially given my naive multiplication algorithm. The largest timeslot I could reserve on the teletype was an hour, and during that time, the computer was only able to make it to 659 factorial. In contrast, Maxima takes a few tens of milliseconds to compute 1000 factorial on my laptop. What a difference four decades makes!

I wrote my first professional program on this computer, a pro bono effort for a charity fundraiser. This charity was the work of the local branch of the National Honor Society, and the fundraiser was a computer-dating dance. Given that I was 160 pounds (73 kilograms) of computer-geeky social plutonium, I felt the need to consult an expert. The expert I chose was the home-economics teacher, who unfortunately seemed much more interested in working out why I was such a hopeless geek than in helping with matching criteria. I nevertheless extracted sufficient information to construct a simple Hamming-distance matcher. Fortunately most people seemed reasonably satisfied with their computer-chosen dance partners, the most notable exception being a senior girl who objected strenuously to having been matched only with freshmen boys. Further investigation determined that this mismatch was due to a data-entry error. Apparently, even Cupid is subject to Murphy's Law.

I also did my one and (thus far) only stint of white-hat hacking. In those trusting times, the school-administration software printed the user's password in cleartext as it was typed. But it was not necessary to memorize the characters that the user typed. You see, this teletype had what is called a ``HERE IS'' key. When this key was pressed, the teletype would send a 20-character sequence recorded on a mechanical drum located inside the teletype. And the sequence recorded on this particular teletype's mechanical drum was, you guessed it, the password to the school-administration software. I demonstrated this to my teacher, which resulted in the teletype being under continuous guard by a school official until such time as the mechanical drum could be replaced with one containing 20 ASCII NUL characters. (And here you thought that security theater was a recent phenomenon!)

Despite its limitations, my two years with this system were quite entertaining and educational. But then it was time to move on to college.

September 08, 2018 10:29 PM

September 04, 2018

Paul E. Mc Kenney: Ancient Hardware I Have Hacked: My First Computer

For the first couple of decades of my life, computers as we know them today were exotic beasts that filled rooms, each requiring the care of a cadre of what were then called systems programmers. Therefore, in my single-digit years the closest thing to a computer that I laid my hands on was a typewriter-sized electromechanical calculator that did addition, subtraction, multiplication, and division. I had the privilege of using this captivating device when helping out with accounting at the small firm at which my mother and father worked.

I was an early fan of hand-held computing devices. In fact, I was in the last math class in my high school that was required to master a slide rule, of which I still have several. I also learned how to use an abacus, including not only addition and subtraction, but multiplication and division as well. Finally, I had the privilege of living through the advent of the electronic pocket calculator. My first pocket calculator was a TI SR-50, which put me firmly on the infix side of the ensuing infix/Polish religious wars.

But none of these qualified as “real computers”.

Unusually for an early 1970s rural-Oregon high school, mine offered computer programming courses. About the only thing I knew about computers was that they would be important in the future, so I signed up. Even more unusually for that time and place, we got to use a real computer, namely an IBM 360. This room-filling monster was located fourteen miles (23 kilometers) away at Chemeketa Community College. As far as I know, this was the closest computer to my home and school. Somehow my math teacher managed to wangle use of this machine on Tuesday and Thursday evenings, and he bussed us there and back.

This computer used punched cards and a state-of-the-art chain lineprinter. We were allowed to feed the card reader ourselves, but operating the lineprinter required special training. This machine's console had an attractive red button labeled EMERGENCY PULL. The computer's operator, who would later distinguish himself by creating a full-motion video on an Apple II, quite emphatically stated that this button should be pulled only in case of a bona fide emergency. He also gave us a simple definition of “emergency” that featured flames shooting out of the top of the computer. I never did see any flames anywhere near the computer, much less shooting out of its top, so I never had occasion to pull that button. But perhaps the manufacturers of certain incendiary laptops should have equipped each of them with an attractive red EMERGENCY PULL button.

Having provided us the necessary hardware training, the operator then gave us a sample card deck. We were to put our program at one specific spot in the deck, and our input data in another. Those of us wishing more information about how this worked were directed to an impressively large JCL manual.

The language of the class was FORTRAN, except that FORTRAN was deemed too difficult an initial language for our tender high-school minds. They therefore warmed us up with assembly language. Not IBM's celebrated Basic Assembly Language (BAL), but a simulated assembly language featuring base-10 arithmetic. After a couple of sessions with the simulated assembly, we moved up to FORTRAN, and even used PL/1 for one of our assignments. There were no error messages: There were instead error numbers that you looked up in a thick printed manual located in the same bookcase containing the JCL manual.

I was surprised by the computer's limitations, especially the 6-to-7 digit limits for single-precision floating point. After all, even my TI SR-50 pocket calculator did ten digits! That said, the computer could also do alphabetic characters (but only upper case) and a few symbols—though the exclamation point was notably missing. The state-of-the-art 029 keypunches were happy to punch an exclamation mark, but alas! It printed as “0” (zero) on the lineprinter.

I must confess that I was not impressed with the computer. In addition to its arithmetic limitations, its memory was quite small. Most of our assignments were small exercises in arithmetic that I could complete much more quickly using paper and pencil. In retrospect, this is not too surprising, given that my early laissez-faire programming methodology invariably resulted in interminable debugging sessions. However, it was quite clear that computers were becoming increasingly important, and I therefore resolved to take the class again the following year.

So, the last time I walked out of that machine room in Spring of 1974, I fully expected to walk back the following Fall. Little did I know that it would be almost 30 years before I would once again write code for an IBM mainframe. Nor did I suspect that it would be more than 15 years before work started on the operating system that was to be running on that 30-years-hence mainframe.

My limited foresight notwithstanding, somewhere in Finland a small boy was growing up.

September 04, 2018 12:55 AM

September 03, 2018

Pete Zaitcev: gai.conf

A couple Fedora releases back, I noticed that my laptop stopped using IPv6 to access dual-hosted services. I gave RFC-6724 a read, but it was much too involved for my small mind. Fortunately, it contained a simplified explanation:

Another effect of the default policy table is to prefer communication using IPv6 addresses to communication using IPv4 addresses, if matching source addresses are available.

My IPv6 is NAT-ed, so the laptop sees an RFC-4193 address in fc00::/7. This does not match the globally assigned address of the external service. Therefore, a matching source address is not available, and things develop from there.

For now, I forced RFC-3484 behavior with gai.conf. Basically, I reverted to the Fedora 26 behavior.
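
For reference, the gai.conf(5) man page documents the mechanism: specifying any label line replaces the built-in (RFC 6724) table wholesale, so listing only the RFC 3484 default labels, with no entry for ULAs, restores the old source-address matching. Roughly like this, per the defaults described in the man page (verify against your glibc's documentation before copying):

    # /etc/gai.conf - RFC 3484 style label table
    label ::1/128       0
    label ::/0          1
    label 2002::/16     2
    label ::/96         3
    label ::ffff:0:0/96 4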

September 03, 2018 03:01 AM

September 01, 2018

Pete Zaitcev: Vladimir Butenko 1962-2018

Butenko was simply the most capable programmer that I've ever worked with. He was also very accomplished. I'm sure everyone has an idea what UNIX v7 was. Although BSD, sockets, and VFS were still in the future, it was a sophisticated OS for its time. Butenko wrote his own OS that was about a peer for the v7 in features (including vi). He also wrote a Fortran 77 compiler with IDE, an SQL database, and a myriad other things. Applications, too: games, communications, industrial control.

I still remember one of our first meetings, in late 1983. I wanted someone to explain to me the instruction set of Mitra-15, a French 16-bit mini. Documentation was practically impossible to get back then, especially for undergrads. Someone referred me, and I received a lecture at a smoking area near an elevator, which founded my understanding of computer architecture.

The only time I ever got one up on him was when I wrote a utility to monitor processes (years later, top(1) would do the same thing). Apparently the concept never occurred to Butenko, who was perfectly capable of analyzing the system with a debugger and profiler. Seeing just my UI, he knocked out a clone in a couple of days. Of course, it was superior in every respect.

Butenko worked a lot. The combination of genius and workaholic was unstoppable. Or maybe they were sides of the same coin.

Unfortunately, Butenko was not in with the open source. He used to post to Usenet, lampooning and dismissing Linux. I suspect once you can code your own Linux any time you want, your perspective changes a bit. This was a part of the way we drifted apart later on. I was plugging away at my little corner of Linux, while Butenko was somewhere out in the larger world, revolutionizing computer-mediated communications.

He died suddenly, from a heart failure. Way too early, I think.

September 01, 2018 04:17 AM

August 30, 2018

Pete Zaitcev: The shutdown of the project Hummingbird at Rackspace

Wait, wasn't this supposed to be our future?

The abridged history, as I recall it, was as follows. Mike Burton started the work to port Swift to Go in early 2016, inside the Swift tree. As such, it was a community effort. There was even a discussion at the OpenStack Technical Committee about allowing development in Go (the TC disallowed it, but later posted some conditions). At the end of the year, I managed to write an object with MIME and collapsed the stock Swift auditor (the one in Python). That was an impetus for PUT+POST, BTW. But in 2017, the RAX cabal - creiht, redbo, and later gholt - weren't very happy with trying to supplicate to the TC, as well as with the massive overhead of developing in the established community, and went their own way. In addition to the TC shenanigans, the upstream Swift at SwiftStack and Red Hat needed a migration path. A Hummingbird without support for Erasure Coding was a non-starter, and the RAX group wasn't interested in accommodating that. By the end of 2017, they were completely on their own, and started going off at the deep end by adding database servers and such. They managed to throw off some good ideas about what next-generation replication ought to look like. But by cutting themselves off from Swift, they committed to re-capturing the lightning in the bottle anew, and they just could not pull it off.

On reflection, I suspect their chances would have been better if they were serious about interoperating with Swift. The performance gains that they demonstrated were quite impressive. But their paymasters at RAX weren't into this community development and open-source toys (note that RAX went through a change of ownership while Hummingbird was going on).

I think a port of Swift to Go is still on the agenda, but it's unknown at this point if it's going to happen.

August 30, 2018 04:38 AM

August 27, 2018

Daniel Vetter: Why no 2D Userspace API in DRM?

The DRM (direct rendering manager, not the content protection stuff) graphics subsystem in the linux kernel does not have a generic 2D acceleration API, despite an awful lot of GPUs having more or less featureful blitter units. And many systems need them for a lot of use-cases, because the 3D engine is a bit too slow or too power hungry for just rendering desktops.

It’s a FAQ why this doesn’t exist and why it won’t get added, so I figured I’ll answer this once and for all.

A bit of nomenclature upfront: A 2D engine (or blitter) is a bit of hardware that can copy stuff with some knowledge of the 2D layout usually used for pixel buffers. Some blitters can also do more, like basic blending, converting color spaces, or stretching/scaling. A 3D engine on the other hand is the fancy high-performance compute block, which runs small programs (called shaders) on a massively parallel architecture, generally with huge memory bandwidth and a dedicated controller to feed this beast through an asynchronous command buffer. 3D engines happen to be really good at rendering the pixels for 3D action games, among other things.

There’s no 2D Acceleration Standard

3D has it easy: There’s OpenGL and Vulkan and DirectX that require a certain feature set. And huge market forces that make sure if you use these features like a game would, rendering is fast.

Aside: This means the 2D engine in a browser actually needs to work like a 3D action game, or the GPU will crawl. The impedance mismatch compared to traditional 2D rendering designs is huge.

On the 2D side there's no such thing: Every blitter engine is its own bespoke thing, with its own features, limitations and performance characteristics. There are also no standard benchmarks that would drive common performance characteristics - today blitters are needed mostly in small systems, with very specific use cases. Anything big enough to run more generic workloads will have a 3D rendering block anyway. These systems still have blitters, but mostly just to help move data in and out of VRAM for the 3D engine to consume.

Now the huge problem here is that you need to fill these gaps in various hardware 2D engines using CPU side software rendering. The crux with any 2D render design is that transferring buffers and data too often between the GPU and CPU will kill performance. Usually the cliff is so steep that pure CPU rendering using only software easily beats any simplistic 2D acceleration design.

The only way to fix this is to be really careful when moving data between the CPU and GPU for different rendering operations. Sticking to one side, even if it's a bit slower, tends to be an overall win. But these decisions highly depend upon the exact features and performance characteristics of your 2D engine. Putting a generic abstraction layer in the middle of this stack, where it's guaranteed to be if you make it part of the kernel/userspace interface, will not result in actual acceleration.

So either you make your 2D rendering look like it’s a 3D game, using 3D interfaces like OpenGL or Vulkan. Or you need a software stack that’s bespoke to your use-case and the specific hardware you want to run on.

2D Acceleration is Really Hard

This is the primary reason really. If you don’t believe that, look at all the tricks a browser employs to render CSS and HTML and text really fast, while still animating all that stuff smoothly. Yes, a web-browser is the pinnacle of current 2D acceleration tech, and you really need all the things in there for decent performance: Scene graphs, clever render culling, massive batching and huge amounts of pains to make sure you don’t have to fall back to CPU based software rendering at the wrong point in a rendering pipeline. Plus managing all kinds of assorted caches to balance reuse against running out of memory.

Unfortunately lots of people assume 2D must be a lot simpler than 3D rendering, and therefore they can design a 2D API that’s fast enough for everyone. No one jumps in and suggests we’ll have a generic 3D interface at the kernel level, because the lessons there are very clear:

There are a bunch of DRM drivers which have a support for 2D render engines exposed to userspace. But they all use highly hardware specific interfaces, fully streamlined for the specific engine. And they all require a decently sized chunk of driver code in userspace to translate from a generic API to the hardware formats. This is what DRM maintainers will recommend you to do, if you submit a patch to add a generic 2D acceleration API.

Exactly like a 3D driver.

If All Else Fails, There’s Options

Now if you don’t care about the last bit of performance, and your use-case is limited, and your blitter engine is limited, then there’s already options:

You can take whatever pixel buffer you have, export it as a dma-buf, and then import it into some other subsystem which already has some kind of limited 2D acceleration support. Depending upon your blitter engine, that could be a v4l2 mem2m device, or for simpler things there are also dmaengines.

On top of that, the DRM subsystem does allow you to implement the traditional acceleration methods exposed by the fbdev subsystem, in case you have userspace that really insists on using these; it's not recommended for anything new.

What about KMS?

The above is kinda a lie, since the KMS (kernel modesetting) IOCTL userspace API is a fairly full-featured 2D rendering interface. The aim of course is to render different pixel buffers onto a screen. With the recently added writeback support, operations targeting memory are now possible. This could be used to expose a traditional blitter, if you only expose writeback support and no other outputs in your KMS driver.
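
As a rough illustration of that last point, a writeback connector is programmed through the ordinary atomic API. A hypothetical libdrm fragment, where the object and property IDs (conn_id, crtc_id, fb_id, prop_crtc_id, prop_wb_fb_id) are assumed to have been looked up beforehand via drmModeObjectGetProperties(), and error handling is mostly elided:

    #include <errno.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Sketch only: route a CRTC's output into a framebuffer through a
     * writeback connector instead of scanning it out to a display. */
    static int blit_via_writeback(int fd, uint32_t conn_id, uint32_t crtc_id,
                                  uint32_t fb_id, uint32_t prop_crtc_id,
                                  uint32_t prop_wb_fb_id)
    {
        drmModeAtomicReqPtr req = drmModeAtomicAlloc();
        int ret;

        if (!req)
            return -ENOMEM;
        drmModeAtomicAddProperty(req, conn_id, prop_crtc_id, crtc_id);
        drmModeAtomicAddProperty(req, conn_id, prop_wb_fb_id, fb_id);
        ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_ALLOW_MODESET, NULL);
        drmModeAtomicFree(req);
        return ret;
    }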

There’s a few downsides:

So all together this isn’t the high-speed 2D accelaration API you’re looking for either. It is a valid alternative to the options above though, e.g. instead of a v4l2 mem2m device.

FAQ for the FAQ, or: OpenVG?

OpenVG isn’t the standard you’re looking for either. For one it’s a userspace API, like OpenGL. All the same reasons for not implementing a generic OpenGL interface at the kernel/userspace apply to OpenVG, too.

Second, the Mesa3D userspace library did support OpenVG once. Didn’t gain traction, got canned. Just because it calls itself a standard doesn’t make it a widely adopted industry default. Unlike OpenGL/Vulkan/DirectX on the 3D side.

Thanks to Dave Airlie and Daniel Stone for reading and commenting on drafts of this text.

August 27, 2018 12:00 AM

August 24, 2018

Greg Kroah-Hartman: What stable kernel should I use?

I get a lot of questions from people asking me what stable kernel they should be using for their product/device/laptop/server/etc. Especially given the now-extended length of time that some kernels are being supported by me and others, this isn’t always a very obvious thing to determine. So this post is an attempt to write down my opinions on the matter. Of course, you are free to use whatever kernel version you want, but here’s what I recommend.

As always, the opinions written here are my own, I speak for no one but myself.

What kernel to pick

Here’s my short list of what kernel you should use, ranked from best to worst options. I’ll go into the details of all of these below, but if you just want the summary of all of this, here it is:

Hierarchy of what kernel to use, from best solution to worst:

  1. Supported kernel from your favorite Linux distribution
  2. Latest stable release
  3. Latest LTS release
  4. Older LTS release that is still being maintained

What kernel to never use:

  - Unmaintained kernel release

To give numbers to the above, today, as of August 24, 2018, the front page of kernel.org looks like this:

So, based on the above list that would mean that:

Quite easy, right?

Ok, now for some justification for all of this:

Distribution kernels

The best solution for almost all Linux users is to just use the kernel from your favorite Linux distribution. Personally, I prefer the community based Linux distributions that constantly roll along with the latest updated kernel and it is supported by that developer community. Distributions in this category are Fedora, openSUSE, Arch, Gentoo, CoreOS, and others.

All of these distributions use the latest stable upstream kernel release and make sure that any needed bugfixes are applied on a regular basis. That is one of the most solid and best kernels that you can use when it comes to having the latest fixes (remember, all fixes are security fixes) in it.

There are some community distributions that take a bit longer to move to a new kernel release, but eventually get there and support the kernel they currently have quite well. Those are also great to use, and examples of these are Debian and Ubuntu.

Just because I did not list your favorite distro here does not mean its kernel is not good. Look on the web site for the distro and make sure that the kernel package is constantly updated with the latest security patches, and all should be well.

Lots of people seem to like the old, “traditional” model of a distribution and use RHEL, SLES, CentOS or the “LTS” Ubuntu release. Those distros pick a specific kernel version and then camp out on it for years, if not decades. They do loads of work backporting the latest bugfixes and sometimes new features to these kernels, all in a quixotic quest to keep the version number from ever changing, despite having many thousands of changes on top of that older kernel version. This work is a truly thankless job, and the developers assigned to these tasks do some wonderful work in order to achieve these goals. If you like never seeing your kernel version number change, then use these distributions. They usually cost some money to use, but the support you get from these companies is worth it when something goes wrong.

So again, the best kernel you can use is one that someone else supports, and you can turn to for help. Use that support, usually you are already paying for it (for the enterprise distributions), and those companies know what they are doing.

But, if you do not want to trust someone else to manage your kernel for you, or you have hardware that a distribution does not support, then you want to run the Latest stable release:

Latest stable release

This kernel is the latest one from the Linux kernel developer community that they declare as “stable”. About every three months, the community releases a new stable kernel that contains all of the newest hardware support, the latest performance improvements, as well as the latest bugfixes for all parts of the kernel. Over the next 3 months, bugfixes that go into the next kernel release to be made are backported into this stable release, so that any users of this kernel are sure to get them as soon as possible.

This is usually the kernel that most community distributions use as well, so you can be sure it is tested and has a large audience of users. Also, the kernel community (all 4000+ developers) are willing to help support users of this release, as it is the latest one that they made.

After 3 months, a new kernel is released and you should move to it to ensure that you stay up to date, as support for this kernel is usually dropped a few weeks after the newer release happens.

If you have new hardware that was purchased after the last LTS release came out, you are almost guaranteed to have to run this kernel in order to have it supported. So for desktops or new servers, this is usually the recommended kernel to be running.

Latest LTS release

If your hardware relies on a vendor’s out-of-tree patch in order to work properly (like almost all embedded devices these days), then the next best kernel to be using is the latest LTS release. That release gets all of the latest kernel fixes that go into the stable releases where applicable, and lots of users test and use it.

Note, no new features and almost no new hardware support is ever added to these kernels, so if you need to use a new device, it is better to use the latest stable release, not this release.

Also this release is common for users that do not like to worry about “major” upgrades happening on them every 3 months. So they stick to this release and upgrade every year instead, which is a fine practice to follow.

The downside of using this release is that you do not get the performance improvements that happen in newer kernels, except when you update to the next LTS kernel, potentially a year in the future. That could be significant for some workloads, so be very aware of this.

Also, if you have problems with this kernel release, the first thing that any developer you report the issue to is going to ask is, “does the latest stable release have this problem?” So you will need to be aware that support might not be as easy to get as with the latest stable releases.

Now if you are stuck with a large patchset and can not update to a new LTS kernel once a year, perhaps you want the older LTS releases:

Older LTS release

These releases have traditionally been supported by the community for 2 years, sometimes longer for when a major distribution relies on this (like Debian or SLES). However, in the past year, thanks to a lot of support and investment in testing and infrastructure from Google, Linaro, Linaro member companies, kernelci.org, and others, these kernels are starting to be supported for much longer.

Here’s the latest LTS releases and how long they will be supported for, as shown at kernel.org/category/releases.html on August 24, 2018:

The reason that Google and other companies want to have these kernels live longer is due to the crazy (some will say broken) development model of almost all SoC chips these days. Those devices start their development lifecycle a few years before the chip is released, however that code is never merged upstream, resulting in a brand new chip being released based on a 2 year old kernel. These SoC trees usually have over 2 million lines added to them, making them something that I have started calling “Linux-like” kernels.

If the LTS releases stop happening after 2 years, then support from the community instantly stops, and no one ends up doing bugfixes for them. This results in millions of very insecure devices floating around in the world, not something that is good for any ecosystem.

Because of this dependency, these companies now require new devices to constantly update to the latest LTS releases as they happen for their specific release version (i.e. every 4.9.y release that happens). An example of this is the Android kernel requirements for new devices shipping for the “O” and now “P” releases, which specified the minimum kernel version allowed; Android security releases might also start to require those “.y” releases to happen more frequently on devices.

I will note that some manufacturers are already doing this today. Sony is one great example of this, updating to the latest 4.4.y release on many of their new phones for their quarterly security release. Another good example is the small company Essential which has been tracking the 4.4.y releases faster than anyone that I know of.

There is one huge caveat when using a kernel like this. The number of security fixes that get backported is not as great as with the latest LTS release, because the traditional model for devices that use these older LTS kernels assumes a much more reduced user model. These kernels are not to be used in any type of “general computing” model where you have untrusted users or virtual machines, as the ability to do some of the recent Spectre-type fixes for older releases is greatly reduced, if present at all in some branches.

So again, only use older LTS releases in a device that you fully control, or lock down with a very strong security model (like Android enforces using SELinux and application isolation). Never use these releases on a server with untrusted users, programs, or virtual machines.

Also, support from the community for these older LTS releases is greatly reduced even compared to the normal LTS releases, if available at all. If you use these kernels, you really are on your own, and need to be able to support the kernel yourself, or rely on your SoC vendor to provide that support for you (note that almost none of them do provide that support, so beware…)

Unmaintained kernel release

Surprisingly, many companies do just grab a random kernel release, slap it into their product and proceed to ship it in hundreds of thousands of units without a second thought. One crazy example of this would be the Lego Mindstorm systems that shipped a random -rc release of a kernel in their device for some unknown reason. A -rc release is a development release that not even the Linux kernel developers feel is ready for everyone to use just yet, let alone millions of users.

You are of course free to do this if you want, but note that you really are on your own here. The community can not support you, as no one is watching all kernel versions for specific issues, so you will have to rely on in-house support for everything that could go wrong. For some companies and systems this could be just fine, but be aware of the “hidden” cost it brings if you do not plan for it up front.

Summary

So, here’s a short list of different types of devices, and what I would recommend for their kernels:

Laptop / Desktop: latest stable release
Server: latest stable release or latest LTS release
Embedded device: latest LTS release, or an older LTS release if the security model used is very strong and tight

And as for me, what do I run on my machines? My laptops run the latest development kernel (i.e. Linus’s development tree) plus whatever kernel changes I am currently working on and my servers run the latest stable release. So despite being in charge of the LTS releases, I don’t run them myself, except in testing systems. I rely on the development and latest stable releases to ensure that my machines are running the fastest and most secure releases that we know how to create at this point in time.

August 24, 2018 04:11 PM

August 20, 2018

Paul E. Mc Kenney: Performance and Scalability Systems Microconference Accepted into 2018 Linux Plumbers Conference

Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the “free lunch” of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.

Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.

We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.

We hope to see you there!

August 20, 2018 09:04 PM

Kees Cook: security things in Linux v4.18

Previously: v4.17.

Linux kernel v4.18 was released last week. Here are details on some of the security things I found interesting:

allocation overflow detection helpers

One of the many ways C can be dangerous to use is that it lacks strong primitives to deal with arithmetic overflow. A developer can’t just wrap a series of calculations in a try/catch block to trap any calculations that might overflow (or underflow). Instead, C will happily wrap values back around, causing all kinds of flaws. Some time ago GCC added a set of single-operation helpers that will efficiently detect overflow, so Rasmus Villemoes suggested implementing these (with fallbacks) in the kernel. While it still requires explicit use by developers, it’s much more fool-proof than doing open-coded type-sensitive bounds checking before every calculation. As a first-use of these routines, Matthew Wilcox created wrappers for common size calculations, mainly for use during memory allocations.
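As a rough sketch (illustrative kernel-style code, not a quote from the tree), check_mul_overflow() from <linux/overflow.h> turns a silently wrapping multiplication into an explicit, checkable failure:

#include <linux/overflow.h>
#include <linux/slab.h>

/* Allocate count records of the given size, failing cleanly if
 * count * size wraps instead of returning a too-small buffer. */
static void *alloc_records(size_t count, size_t size)
{
        size_t bytes;

        if (check_mul_overflow(count, size, &bytes))
                return NULL;    /* the multiplication overflowed */

        return kmalloc(bytes, GFP_KERNEL);
}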

removing open-coded multiplication from memory allocation arguments

A common flaw in the kernel is integer overflow during memory allocation size calculations. As mentioned above, C doesn’t provide much in the way of protection, so it’s on the developer to get it right. In an effort to reduce the frequency of these bugs, and inspired by a couple flaws found by Silvio Cesare, I did a first-pass sweep of the kernel to move from open-coded multiplications during memory allocations into either their 2-factor API counterparts (e.g. kmalloc(a * b, GFP...) -> kmalloc_array(a, b, GFP...)), or to use the new overflow-checking helpers (e.g. vmalloc(a * b) -> vmalloc(array_size(a, b))). There’s still lots more work to be done here, since frequently an allocation size will be calculated earlier in a variable rather than in the allocation arguments, and overflows happen in way more places than just memory allocation. Better yet would be to have exceptions raised on overflows where no wrap-around was expected (e.g. Emese Revfy’s size_overflow GCC plugin).
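Schematically, the conversions look like this (struct foo is a made-up example):

#include <linux/slab.h>

struct foo { int x; };

static struct foo *alloc_foos(size_t count)
{
        /* Before: kmalloc(count * sizeof(struct foo), GFP_KERNEL),
         * where the multiplication could silently wrap. After:
         * kmalloc_array() checks the multiplication and returns NULL
         * on overflow instead of allocating a too-small buffer. */
        return kmalloc_array(count, sizeof(struct foo), GFP_KERNEL);
}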

Variable Length Array removals, part 2

As discussed previously, VLAs continue to get removed from the kernel. For v4.18, we continued to get help from a bunch of lovely folks: Andreas Christoforou, Antoine Tenart, Chris Wilson, Gustavo A. R. Silva, Kyle Spiers, Laura Abbott, Salvatore Mesoraca, Stephan Wahren, Thomas Gleixner, Tobin C. Harding, and Tycho Andersen. Almost all the rest of the VLA removals have been queued for v4.19, but it looks like the very last of them (deep in the crypto subsystem) won’t land until v4.20. I’m so looking forward to being able to add -Wvla globally to the kernel build so we can be free from the classes of flaws that VLAs enable, like stack exhaustion and stack guard page jumping. Eliminating VLAs also simplifies the porting work of the stackleak GCC plugin from grsecurity, since it no longer has to hook and check VLA creation.
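For reference, a typical removal follows this pattern (a schematic example with made-up names, not any specific commit):

#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>

#define FOO_MAX_LEN 64    /* hypothetical worst-case bound */

static int process(const u8 *data, size_t len)
{
        /* Before: "u8 buf[len];" - a VLA whose stack usage scales with
         * a caller-supplied length. After: a fixed worst-case buffer
         * plus an explicit bounds check. */
        u8 buf[FOO_MAX_LEN];

        if (len > FOO_MAX_LEN)
                return -EINVAL;

        memcpy(buf, data, len);
        /* ... operate on buf ... */
        return 0;
}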

Kconfig compiler detection

While not strictly a security thing, Masahiro Yamada made giant improvements to the kernel’s Kconfig subsystem so that kernel build configuration now knows what compiler you’re using (among other things), meaning that configuration is no longer separate from the compiler’s features. For example, in the past, one could select CONFIG_CC_STACKPROTECTOR_STRONG even if the compiler didn’t support it, and the build would later fail. Or in other cases, configurations would silently downgrade to what was available, potentially leading to confusing kernel images where the compiler would change the meaning of a configuration. Going forward, configurations that aren’t available to the compiler will simply be unselectable in Kconfig. This makes configuration much more consistent, though in some cases it makes it harder to discover why some configuration is missing (e.g. CONFIG_GCC_PLUGINS no longer gives you a hint about needing to install the plugin development packages).

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.19; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

August 20, 2018 06:29 PM

August 17, 2018

Paul E. Mc Kenney: Testing & Fuzzing Microconference Accepted into 2018 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have greatly increased the robustness of the Linux ecosystem, but embarrassing bugs still escape to end users. Furthermore, a bug that strikes only once per million years of machine time would still happen several tens of times per day across Linux's installed base (said to number more than 20 billion), so the best we can possibly do is hardly good enough.

The Testing and Fuzzing Microconference intends to raise the bar with further progress on syzbot/syzkaller, distribution/stable testing, kernel continuous integration, and unit testing. The best evidence of progress in these efforts will of course be the plethora of bug reports produced by these and similar tools!

Join us for an important and spirited discussion!

August 17, 2018 07:38 PM

August 01, 2018

Dave Airlie (blogspot): virgl - exposes GLES3.1/3.2 and GL4.3

I'd had a bit of a break from adding features to virgl while I was working on radv, but recently Google and Collabora have started to invest in virgl as a solution. A number of developers from both companies have joined the project.

This meant trying to get virgl to pass their dEQP suite and adding support for newer GL/GLES feature levels. They also have a goal for the renderer to run on a host GLES implementation, whereas it currently runs only on host GL.

Over the past few months I've worked with the group to add support for all the necessary features needed for guest GLES3.1 support (for them) and GL4.3 (for me).

The feature list was roughly:
tessellation shaders
fp64 support
ARB_gpu_shader5 support
Shader buffer objects
Shader image objects
Compute shaders
Copy Image
Texture views

With this list implemented we achieved GL4.3 and GLES3.1.

However, Marek@AMD did some work on exposing ASTC for gallium drivers, and with some extra work on EXT_shader_framebuffer_fetch, we now expose GLES3.2.

There was also plenty of work done on avoiding crashes from rogue guests (rewrote the whole feature/capability bit handling), and lots of bug fixes. There are still ongoing fixes to finish the dEQP tests, but it looks like all the feature work should be landed now.

What next?

Well, there is one big problem facing virgl in exposing GL 4.4: GL_ARB_buffer_storage requires exposing coherent buffer memory, and with the virgl architecture we currently don't have a way to map the pages behind a host GL buffer mapping into a guest GL buffer mapping in order to achieve coherency. This is going to require some thought, and it may even require exposing some new GL extensions to export a buffer to a dma-buf.
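To make the problem concrete, here is roughly what a GL 4.4 application is allowed to do (standard API usage, nothing virgl-specific; assumes a current context with entry points loaded):

/* Create a persistently, coherently mapped buffer. */
GLuint buf;
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferStorage(GL_ARRAY_BUFFER, 4096, NULL, flags);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, 4096, flags);

/* From here on, plain CPU stores through ptr must become visible to the
 * GPU with no further GL calls - leaving the renderer no point at which
 * it could copy data from a guest page to the host mapping. */
((char *)ptr)[0] = 1;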

There has also been a GSoC student, Nathan, working on vulkan/virgl support. He's made some initial progress; however, vulkan also has requirements on coherent memory, so that tricky problem needs to be solved as well.

Thanks again to all the contributors to the virgl project.

August 01, 2018 04:26 AM

July 31, 2018

Matthew Garrett: Porting Coreboot to the 51NB X210

The X210 is a strange machine. A set of Chinese enthusiasts developed a series of motherboards that slot into old Thinkpad chassis, providing significantly more up to date hardware. The X210 has a Kabylake CPU, supports up to 32GB of RAM, has an NVMe-capable M.2 slot and has eDP support - and it fits into an X200 or X201 chassis, which means it also comes with a classic Thinkpad keyboard. We ordered some from a Facebook page (a process that involved wiring a large chunk of money to a Chinese bank which wasn't at all stressful), and a couple of weeks later they arrived. Once I'd put mine together I had a quad-core i7-8550U with 16GB of RAM, a 512GB NVMe drive and a 1920x1200 display. I'd transplanted over the drive from my XPS13, so I was running stock Fedora for most of this development process.

The other fun thing about it is that none of the firmware flashing protection is enabled, including Intel Boot Guard. This means running a custom firmware image is possible, and what would a ridiculous custom Thinkpad be without ridiculous custom firmware? A shadow of its potential, that's what. So, I read the Coreboot[1] motherboard porting guide and set to.

My life was made a great deal easier by the existence of a port for the Purism Librem 13v2. This is a Skylake system, and Skylake and Kabylake are very similar platforms. So, the first job was to just copy that into a new directory and start from there. The first step was to update the Inteltool utility so it understood the chipset - this commit shows what was necessary there. It's mostly just adding new PCI IDs, but it also needed some adjustment to account for the GPIO allocation being different on mobile parts when compared to desktop ones. One thing that bit me - Inteltool relies on being able to mmap() arbitrary bits of physical address space, and the kernel doesn't allow that if CONFIG_STRICT_DEVMEM is enabled. I had to disable that first.

The GPIO pins got dropped into gpio.h. I ended up just pushing the raw values into there rather than parsing them back into more semantically meaningful definitions, partly because I don't understand what these things do that well and largely because I'm lazy. Once that was done, on to the next step.

High Definition Audio devices (or HDA) have a standard interface, but the codecs attached to the HDA device vary - both in terms of their own configuration, and in terms of dealing with how the board designer may have laid things out. Thankfully the existing configuration could be copied from /sys/class/sound/card0/hwC0D0/init_pin_configs[2] and then hda_verb.h could be updated.

One more piece of hardware-specific configuration is the Video BIOS Table, or VBT. This contains information used by the graphics drivers (firmware or OS-level) to configure the display correctly, and again is somewhat system-specific. This can be grabbed from /sys/kernel/debug/dri/0/i915_vbt.

A lot of the remaining platform-specific configuration has been split out into board-specific config files, and this also needed updating. Most stuff was the same, but I confirmed the GPE and genx_dec register values by using Inteltool to dump them from the vendor system and copy them over. lspci -t gave me the bus topology and told me which PCIe root ports were in use, and lsusb -t gave me port numbers for USB. That let me update the root port and USB tables.

The final code update required was to tell the OS how to communicate with the embedded controller. Various ACPI functions are actually handled by this autonomous device, but it's still necessary for the OS to know how to obtain information from it. This involves writing some ACPI code, but that's largely a matter of cutting and pasting from the vendor firmware - the EC layout depends on the EC firmware rather than the system firmware, and we weren't planning on changing the EC firmware in any way. Using ifdtool told me that the vendor firmware image wasn't using the EC region of the flash, so my assumption was that the EC had its own firmware stored somewhere else. I was ready to flash.

The first attempt involved isis' machine, using their Beaglebone Black as a flashing device - the lack of protection in the firmware meant we ought to be able to get away with using flashrom directly on the host SPI controller, but using an external flasher meant we stood a better chance of being able to recover if something went wrong. We flashed, plugged in the power and… nothing. Literally. The power LED didn't turn on. The machine was very, very dead.

Things like managing battery charging and status indicators are up to the EC, and the complete absence of anything going on here meant that the EC wasn't running. The most likely reason for that was that the system flash did contain the EC's firmware even though the descriptor said it didn't, and now the system was very unhappy. Worse, the flash wouldn't speak to us any more - the power supply from the Beaglebone to the flash chip was sufficient to power up the EC, and the EC was then holding onto the SPI bus desperately trying to read its firmware. Bother. This was made rather more embarrassing because isis had explicitly raised concern about flashing an image that didn't contain any EC firmware, and now I'd killed their laptop.

After some digging I was able to find EC firmware for a related 51NB system, and looking at that gave me a bunch of strings that seemed reasonably identifiable. Looking at the original vendor ROM showed very similar code located at offset 0x00200000 into the image, so I added a small tool to inject the EC firmware (basing it on an existing tool that does something similar for the EC in some HP laptops). I now had an image that I was reasonably confident would get further, but we couldn't flash it. Next step seemed like it was going to involve desoldering the flash from the board, which is a colossal pain. Time to sleep on the problem.

The next morning we were able to borrow a Dediprog SPI flasher. These are much faster than doing SPI over GPIO lines, and also support running the flash at different voltages. At 3.5V the behaviour was the same as we'd seen the previous night - nothing. According to the datasheet, the flash required at least 2.7V to run, but flashrom listed 1.8V as the next lower voltage so we tried that. And, amazingly, it worked - not reliably, but sufficiently. Our hypothesis is that the chip is marginally able to run at that voltage, but that the EC isn't - we were no longer powering the EC up, so we could communicate with the flash. After a couple of attempts we were able to write enough that we had EC firmware on there, at which point we could shift back to flashing at 3.5V because the EC was leaving the flash alone.

So, we flashed again. And, amazingly, we ended up staring at a UEFI shell prompt[3]. USB wasn't working, and nor was the onboard keyboard, but we had graphics and were executing actual firmware code. I was able to get USB working fairly quickly - it turns out that Linux numbers USB ports from 1 and the FSP numbers them from 0, and fixing that up gave us working USB. We were able to boot Linux! Except there were a whole bunch of errors complaining about EC timeouts, and also we only had half the RAM we should.

After some discussion on the Coreboot IRC channel, we figured out the RAM issue - the Librem13 only has one DIMM slot. The FSP expects to be given a set of i2c addresses to probe, one for each DIMM socket. It is then able to read back the DIMM configuration and configure the memory controller appropriately. Running i2cdetect against the system SMBus gave us a range of devices, including one at 0x50 and one at 0x52. The detected DIMM was at 0x50, which made 0x52 seem like a reasonable bet - and grepping the tree showed that several other systems used 0x52 as the address for their second socket. Adding that to the list of addresses and passing it to the FSP gave us all our RAM.

So, now we just had to deal with the EC. One thing we noticed was that if we flashed the vendor firmware, ran it, flashed Coreboot and then rebooted without cutting the power, the EC worked. This strongly suggested that there was some setup code happening in the vendor firmware that configured the EC appropriately, and if we duplicated that it would probably work. Unfortunately, figuring out what that code was was difficult. I ended up dumping the PCI device configuration for the vendor firmware and for Coreboot in case that would give us any clues, but the only thing that seemed relevant at all was that the LPC controller was configured to pass io ports 0x4e and 0x4f to the LPC bus with the vendor firmware, but not with Coreboot. Unfortunately the EC was supposed to be listening on 0x62 and 0x66, so this wasn't the problem.

I ended up solving this by using UEFITool to extract all the code from the vendor firmware, and then disassembling every object and grepping it for port IO. x86 systems have two separate IO address spaces - memory and port IO. Port IO is well suited to simple devices that don't need a lot of bandwidth, and the EC is definitely one of these - there's no way to talk to it other than using port IO, so any configuration was almost certainly happening that way. I found a whole bunch of stuff that touched the EC, but was clearly depending on it already having been enabled. I found a wide range of cases where port IO was being used for early PCI configuration. And, finally, I found some code that reconfigured the LPC bridge to route 0x4e and 0x4f to the LPC bus (explaining the configuration change I'd seen earlier), and then wrote a bunch of values to those addresses. I mimicked those, and suddenly the EC started responding.

It turns out that the writes that made this work weren't terribly magic. PCs used to have a SuperIO chip that provided most of the legacy port functionality, including the floppy drive controller and parallel and serial ports. Individual components (called logical devices, or LDNs) could be enabled and disabled using a sequence of writes that was fairly consistent between vendors. Someone on the Coreboot IRC channel recognised that the writes that enabled the EC were simply using that protocol to enable a series of LDNs, which apparently correspond to things like "Working EC" and "Working keyboard". And with that, we were done.
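For the curious, that protocol looks roughly like this (a hypothetical sketch; 0x07 and 0x30 are the conventional LDN-select and activate registers, and 51NB's actual register values may differ):

#include <stdint.h>
#include <sys/io.h>     /* outb(); requires iopl()/ioperm() first */

/* Index/data register pair at 0x4e/0x4f, as routed by the LPC bridge. */
static void superio_write(uint8_t reg, uint8_t val)
{
        outb(reg, 0x4e);        /* select a configuration register */
        outb(val, 0x4f);        /* write its value */
}

static void superio_enable_ldn(uint8_t ldn)
{
        superio_write(0x07, ldn);       /* select the logical device */
        superio_write(0x30, 0x01);      /* activate it */
}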

Coreboot doesn't currently have ACPI support for the latest Intel graphics chipsets, so right now my image doesn't have working backlight control. Backlight control also turned out to be interesting. Most modern Intel systems handle the backlight via registers in the GPU, but the X210 uses the embedded controller (possibly because it supports both LVDS and eDP panels). This means that adding a simple display stub is sufficient - all we have to do on a backlight set request is store the value in the EC, and it does the rest.

Other than that, everything seems to work (although there's probably a bunch of power management optimisation to do). I started this process knowing almost nothing about Coreboot, but thanks to the help of people on IRC I was able to get things working in about two days of work[4] and now have firmware that's about as custom as my laptop.

[1] Why not Libreboot? Because modern Intel SoCs haven't had their memory initialisation code reverse engineered, so the only way to boot them is to use the proprietary Intel Firmware Support Package.
[2] Card 0, device 0
[3] After a few false starts - it turns out that the initial memory training can take a surprisingly long time, and we kept giving up before that had happened
[4] Spread over 5 or so days of real time


July 31, 2018 08:44 AM

July 29, 2018

Pavel Machek: Pretty big side-effect

Timing and side-channels are not normally considered side-effects, meaning compilers and CPUs feel free to do whatever they want. And they do. Unfortunately, I consider leaking my passwords to remote attackers a pretty significant side-effect... Imagine a simple function.

void handle(char secret) {}
That's obviously safe, right? And now
void handle(char secret) { int i; for (i=0; i<secret*1000000; i++) ; }
That's obviously a bad idea, because now secret is exposed via timing. That used to be the only side-channel for a while, but then caches were invented. These days,
static char font[16*256]; void handle(char secret) { font[secret*16]; }
may be a bad idea. But C has not changed: it knows nothing about caches, and nothing about side-channels. And caches are old news - today we also have complex branch predictors and speculative execution. The resulting attack is called Spectre. This is a bad idea:
static char font[16*256]; void handle(char secret) { if (0) font[secret*16]; }
as is this:
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) big[small[untrusted]]; }
A CPU bug... unfortunately it is tricky to fix, and it only affects caches / timing, so it "does not exist" as far as C is concerned. The canonical fix is something like
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) { asm volatile("lfence"); big[small[untrusted]]; }}
which is okay as long as the compiler compiles it the obvious way. But again, the compiler knows nothing about caches / side-channels, so it may do something unexpected and re-introduce the bug. Unfortunately, it seems there's not even agreement on whose bug it is. Is it time C was extended to know about side-channels? What about
void handle(int please_do_not_leak_this secret) {}
? Do we need new language to handle modern (speculative, multi-core, fast, side-channels all around) CPUs?
(Now, you may say that it is impossible to eliminate all the side-channels. I believe eliminating most of them is well possible, if we are willing to live with ... huge slowdown. You can store each variable twice, to at least detect rowhammer. Caches can still be disabled -- the row buffer in DRAM will be more problematic, and if you disable hyperthreading and make every second instruction an lfence, you can get rid of Spectre-like problems. You may get a 100x? 1000x? slowdown, but if that's applied only to software that needs it, it may be acceptable. You probably want your editors & mail readers protected. You probably don't need gcc to be protected. No, running a modern web browser does not look sustainable.)
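For completeness: the fix the Linux kernel settled on for the bounds-check-bypass case above does not depend on an lfence surviving compilation. Instead, array_index_nospec() in <linux/nospec.h> clamps the index with a branchless mask. A rough userspace rendering of the idea (a sketch, not the kernel's exact code):

#include <stdint.h>

static char small[16], big[256];

/* Return all-ones if index < size, 0 otherwise, computed without a
 * conditional branch the CPU could speculate past. Like the kernel's
 * array_index_mask_nospec(), this relies on arithmetic right shift of
 * a negative value sign-extending. */
static uintptr_t index_mask(uintptr_t index, uintptr_t size)
{
        intptr_t x = (intptr_t)(index | (size - 1 - index));
        return (uintptr_t)(~x >> (8 * sizeof(uintptr_t) - 1));
}

void foo(unsigned int untrusted)
{
        if (untrusted < 16) {
                /* Even if this branch is mispredicted, the masked index
                 * is 0, so no secret-dependent cache line is touched. */
                untrusted &= (unsigned int)index_mask(untrusted, 16);
                (void)big[(unsigned char)small[untrusted]];
        }
}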

July 29, 2018 06:20 AM

July 23, 2018

Paul E. Mc Kenney: RDMA Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2018 Linux Plumbers Conference!

RDMA (remote direct memory access) is a well-established technology that is used in environments requiring both maximum throughput and minimum latency. For a long time, this technology was used primarily in high-performance computing, high-frequency trading, and supercomputing. For example, the three most powerful computers are based on Linux and RDMA (in the guise of Infiniband).

However, the latest trends in cloud computing (more bandwidth at larger scales) and storage (more IOPS) make RDMA increasingly important outside of its initial niches. Therefore, clean integration between RDMA and various kernel subsystems is paramount. We are looking to build on previous years' successful RDMA microconferences, this year discussing our 2018-2019 plans and roadmap.

Topics proposed for this year's event include the interaction between RDMA and DAX (direct access for files), how to solve the get_user_pages() problem (see https://lwn.net/Articles/753027/ and https://lwn.net/Articles/753272/), IOMMU and PCI-E issues, continuous integration, python integration, and Syzkaller testing.

July 23, 2018 08:07 PM