Kernel Planet

October 15, 2021

Paul E. Mc Kenney: Verification Challenges

You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!

  1. Stupid RCU Tricks: rcutorture Catches an RCU Bug, also known as Verification Challenge 1.
  2. Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE.
  3. Verification Challenge 3: cbmc.
  4. Verification Challenge 4: Tiny RCU.
  5. Verification Challenge 5: Uses of RCU.
  6. Verification Challenge 6: Linux-Kernel Tree RCU.
  7. Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU.

October 15, 2021 06:22 PM

October 13, 2021

Paul E. Mc Kenney: TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) They also assume that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives, as sketched following this list. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I were doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed.
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.
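
As a concrete (if minimal) illustration of the wrappering called out in the first item, consider locking. Because Rust cannot inline C's static inline functions without LTO, each wrapped primitive would presumably become an out-of-line C helper that Rust's foreign-function interface can call. The following sketch makes that assumption, and the rust_helper_ names are illustrative only, not an established API:

#include <linux/spinlock.h>

/* Out-of-line helpers exposing inline C locking primitives to Rust. */
void rust_helper_spin_lock(spinlock_t *lock)
{
  spin_lock(lock);
}

void rust_helper_spin_unlock(spinlock_t *lock)
{
  spin_unlock(lock);
}

The cost is an extra function call per operation, which is another reason why the short-term focus on performance-insensitive device drivers makes sense.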

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much of the wrappering work called out in the first three items above has already been done.

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help.

Long-Term Recommendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables; a sketch of this hazard follows this list.)
  4. Data races involving sequence-locking updaters should be detected, for example, via KCSAN.
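
To make the third item's refetch hazard concrete, here is a minimal C sketch, assuming a hypothetical seqlock named foo_lock protecting a pair of variables foo_x and foo_y. The READ_ONCE() calls are what rule out compiler refetching and load tearing; with plain C-language accesses, the consistency that read_seqretry() just verified could be destroyed after the fact:

static DEFINE_SEQLOCK(foo_lock);
static int foo_x, foo_y;

static void foo_read(int *x, int *y)
{
  unsigned int seq;

  do {
    seq = read_seqbegin(&foo_lock);
    *x = READ_ONCE(foo_x);  /* Marked: no refetching, no load tearing. */
    *y = READ_ONCE(foo_y);
  } while (read_seqretry(&foo_lock, seq));
}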

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause other threads' ownership of the specified object to transition from readers to unowned at the end of a grace period. Again, this is not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period. (The C-language pattern that a Rust wrapper would need to capture is sketched following this list.)
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some way of marking such pointers so that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.
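
As a reference point for the first item, here is the C-language call_rcu() pattern whose deferred ownership transition a Rust wrapper would need to capture. The struct foo type and its fields are hypothetical:

struct foo {
  struct rcu_head rh;
  int data;
};

static void foo_reclaim(struct rcu_head *rhp)
{
  struct foo *fp = container_of(rhp, struct foo, rh);

  kfree(fp);  /* Now unowned: all pre-existing readers are done. */
}

static void foo_remove(struct foo *fp)
{
  /* ... unlink fp so that new readers cannot find it ... */
  call_rcu(&fp->rh, foo_reclaim);  /* Ownership transition is deferred. */
}

Between the call_rcu() invocation and the end of the grace period, fp is owned neither by the updater nor by any one reader, which is precisely what makes this desideratum interesting.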

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.

Rationale

Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of iterator-like constructs found not only in modern languages, but also in C code within the Linux kernel. Except that the Linux kernel still contains "while" loops, as does a great deal of other software. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" ;-)

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.

History

October 18, 2021: Add "Rationale" section.

October 13, 2021 09:08 PM

October 06, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoid undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption in fact inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:


  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores), or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model. (A sketch of this acquire/release discipline follows this list.)
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects, might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.
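
As a sketch of the acquire/release-only discipline of the second option, here is the classic message-passing pattern expressed with Linux-kernel primitives; the flag and message variables are hypothetical:

static int message;
static int flag;

static void producer(void)
{
  message = 42;                 /* Plain store... */
  smp_store_release(&flag, 1);  /* ...ordered before this release store. */
}

static void consumer(void)
{
  if (smp_load_acquire(&flag))    /* Acquire load orders... */
    WARN_ON_ONCE(message != 42);  /* ...this read after it. */
}

This subset maps directly onto the C/C++ (and thus presumably the Rust) memory model, which is what makes the second option attractive.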

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel Rust code. In contrast, the last two options would require code-style restrictions (preferably automated), just as is already the case for current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! ;-)

History

October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda.
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.

October 06, 2021 09:48 PM

Paul E. Mc Kenney: Can the Kernel Concurrency Sanitizer Own Rust Code?

Given the data-race-freedom guarantees of Rust's non-unsafe code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need unsafe Rust code. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the unsafe Rust code and the C code can interfere with Rust non-unsafe code, and furthermore there are special cases where safe code can violate unsafe code's assumptions. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.

As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:


  1. __kcsan_check_access(): This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, KCSAN_ACCESS_WRITE for write, and KCSAN_ACCESS_ATOMIC for atomic read-modify-write).
  2. The __tsan_{read,write}{1,2,4,8}() family of functions serve the same purpose as __kcsan_check_access(), but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.
  3. ASSERT_EXCLUSIVE_ACCESS() is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion ASSERT_EXCLUSIVE_ACCESS_SCOPED() expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.
  4. ASSERT_EXCLUSIVE_WRITER() is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion ASSERT_EXCLUSIVE_WRITER_SCOPED() expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.
  5. ASSERT_EXCLUSIVE_BITS() is similar to ASSERT_EXCLUSIVE_WRITER(), but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.

These last three categories of interface members are designed to be directly invoked in order to check concurrency designs. See for example the direct call to ASSERT_EXCLUSIVE_WRITER() from the rcu_cpu_starting() function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.
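
For example, a variable that is supposed to be written only by one particular updater, but that may be read locklessly, might be instrumented as follows; the my_counter variable and its updater are hypothetical:

static unsigned long my_counter;

static void my_counter_inc(void)
{
  /* Lockless readers are fine, but any concurrent writer is a bug
   * that KCSAN should report. */
  ASSERT_EXCLUSIVE_WRITER(my_counter);
  WRITE_ONCE(my_counter, my_counter + 1);
}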

KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenienced end users, so it just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the rustc compiler already has sanitizer support, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant -mllvm options to rustc and to ensure that it is not attempting to link against compiler-rt because doing so is not appropriate when building the Linux kernel. Marco also noted that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly confirmed that it does. So there is hope!

For more information on KCSAN and its reason for being:

  1. "Concurrency bugs should fear the big bad data-race detector" (Part 1 and Part 2).
  2. Who's afraid of a big bad optimizing compiler?
  3. Calibrating your fear of big bad optimizing compilers.
  4. Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of perfbook.


History

October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.
October 12, 2021: Self-review.

October 06, 2021 07:11 PM

October 05, 2021

Paul E. Mc Kenney: Will Your Rust Code Survive the Attack of the Zombie Pointers?

Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.

But first, what on earth is a zombie pointer?

To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat tip to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:

 1 struct node_t* _Atomic top;
 2
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 6
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
13
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
17
18   while (p) {
19     struct node_t *next = p->next;
20      
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }

Those used to the Linux-kernel cmpxchg() atomic operation should keep in mind that when the C11 atomic_compare_exchange_weak() fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing atomic_compare_exchange_weak() the new value, you instead pass it a pointer to the new value. Each time atomic_compare_exchange_weak() fails, it updates the pointed-to new value. This sometimes allows this function to be used in a tight loop, as in line 11 above.

With these atomic_compare_exchange_weak() semantics in mind, let's step through execution of list_push(C) on a stack initially containing objects A and B:

  1. Line 5 allocates memory for the new object C.
  2. Line 7 initializes this newly allocated memory.
  3. Line 8 sets the new object's ->next pointer to the current top-of-stack pointer.
  4. Line 11 invokes atomic_compare_exchange_weak(), which atomically compares the value of newnode->next to the value of top, and if they are equal, stores the pointer newnode into top and returns true. (As noted before, if the comparison is not-equal, atomic_compare_exchange_weak() instead stores the value of top into newnode->next and returns false, causing the loop to repeat until such time as the pointers compare equal.)
  5. One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.

Oddly enough, this code would still work even if line 8 were omitted; however, that would result in a high probability that the first call to atomic_compare_exchange_weak() would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.

While we are at it, let's step through list_pop_all():

  1. Line 16 atomically stores NULL into top and returns its previous value. After this line executes, the stack is empty and its previous contents are referenced by local variable p.
  2. Lines 18-24 execute for each object on list p:
    1. Line 19 prefetches a pointer to the next object.
    2. Line 21 passes the current object to function foo().
    3. Line 22 frees the current object.
    4. Line 23 advances to the next object.

Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:

  1. Thread 1 starts pushing object C, but is interrupted, preempted, or otherwise impeded just after executing line 8.
  2. Thread 2 pops the entire list and frees all of the objects. Object C's ->next pointer is now invalid because it points to now-freed memory formerly known as object A.
  3. Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.
  4. Object C's ->next pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly language viewpoint happens to reference the type-compatible object A'. This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.
  5. Thread 1 resumes and executes the atomic_compare_exchange_weak() on line 11. Now this primitive, like its Linux-kernel counterpart cmpxchg(), operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that atomic_compare_exchange_weak() succeeds despite the fact that the object C's ->next pointer is invalid.
  6. Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.

As noted earlier, this has a trivial solution within the Linux kernel. Please do not peek too early. ;-)

But what can be done outside of the Linux kernel?

The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to malloc() and free(), it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability were important to you.
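
A minimal sketch of this hiding, assuming that the wrapper functions live in a separate translation unit compiled without LTO, and with the names being illustrative:

/* my_alloc.h */
#include <stddef.h>

void *my_alloc(size_t size);
void my_free(void *p);

/* my_alloc.c: a separate translation unit, opaque to callers. */
#include <stdlib.h>

void *my_alloc(size_t size)
{
  return malloc(size);
}

void my_free(void *p)
{
  free(p);
}

Because the compiler compiling the callers cannot see the free() inside my_free(), it cannot conclude that a given pointer has become invalid, and so cannot apply invalid-pointer optimizations to later uses of its bit pattern.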

However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer. Assuming that this even makes sense in the context of the implementation of Rust.

Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in unsafe Rust.

The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:

  1. CPPCON 2020 presentation
  2. Pointer lifetime-end zap (informational/historical)
  3. Pointer lifetime-end zap proposed solutions

So there is good progress, but it might still be some time before this is supported by either the standard or by compilers.

History

October 6, 2021: Expand description per feedback from Jonathan Corbet.
October 12, 2021: Self-review.

October 05, 2021 06:40 PM

October 04, 2021

Paul E. Mc Kenney: How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In this particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated only from the corresponding CPU, and only when preemption is disabled. However, updates are also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU is the one orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)
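
One way to make the second example's rule explicit is a runtime assertion. The following hedged sketch uses simplified stand-ins, including an assumed ->offlining_cpu field, rather than the actual Linux-kernel definitions:

struct rcu_data {
  int cpu;            /* CPU owning this structure. */
  int offlining_cpu;  /* CPU orchestrating the offlining, or -1. */
};

static void assert_cblist_ownership(struct rcu_data *rdp)
{
  bool owner = rdp->cpu == raw_smp_processor_id() && !preemptible();
  bool offliner = cpu_is_offline(rdp->cpu) &&
                  rdp->offlining_cpu == raw_smp_processor_id();

  WARN_ON_ONCE(!owner && !offliner);  /* Complain about any other updater. */
}

Any tooling, Rust-based or otherwise, that hopes to own this code must be able to express this sort of disjunction.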

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:


  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below, followed by a brief usage sketch:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read(): Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.
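
For example, the fifth ownership example's dual-lock rule for updaters might be documented and enforced as follows; the lock names and the update function are hypothetical:

static DEFINE_SPINLOCK(irq_side_lock);  /* For readers in interrupt handlers. */
static DEFINE_MUTEX(sleep_side_mutex);  /* For readers that must sleep. */

static void update_shared_data(void)
{
  lockdep_assert_held(&irq_side_lock);
  lockdep_assert_held(&sleep_side_mutex);
  /* ... now safe to update fields protected by both locks ... */
}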

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.

History

October 12, 2021: Self-review.

October 04, 2021 11:37 PM

Paul E. Mc Kenney: Can Rust Code Own RCU?

Read-copy update (RCU) vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantics permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, rcu_read_lock() and rcu_read_unlock() generate no code, and rcu_dereference() emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel CONFIG_PREEMPT_NONE=y builds. Additional information on RCU is presented at the end of this post.

Can Rust Ownership be Adapted to RCU?

The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that at least some RCU use cases are likely to pose challenges to Rust's ownership semantics. However, discussions with Wedson Almeida Filho at Linux Plumbers Conference suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU flexes lockdep's semantics. In addition, the data races between the read-side rcu_dereference() primitive and the update-side rcu_assign_pointer() primitive might reasonably be resolved by having the Rust implementation of these two primitives use unsafe mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the sparse static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.

Linux-Kernel Tooling and RCU

Here are a few examples of these Linux-kernel tools:

  1. A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer() (see the sketch following this list). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
  2. In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of rcu_dereference().
  3. Passing the same pointer twice in quick succession to a pair of call_rcu() invocations is just as bad as doing the same with a pair of kfree() invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.
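
As a sketch of the first of these tools in action, consider a hypothetical __rcu-marked global pointer. Sparse will complain if global_foo is dereferenced directly, and lockdep will complain if the rcu_dereference() executes outside of an RCU read-side critical section:

struct foo {
  int a;
};
static struct foo __rcu *global_foo;

static int foo_read_a(void)
{
  struct foo *fp;
  int ret = -1;

  rcu_read_lock();
  fp = rcu_dereference(global_foo);  /* sparse- and lockdep-checked. */
  if (fp)
    ret = fp->a;
  rcu_read_unlock();
  return ret;
}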

This is by no means a complete list. And although I am not claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust unsafe mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.

RCU Use Cases

There are in fact Rust implementations of variants of RCU, including:

  1. crossbeam-rs
  2. ArcSwapAny

I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.

But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.

The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by rcu_dereference() and rcu_assign_pointer() and friends. These special primitives might be implemented using unsafe Rust code or C code, as the case might be.

In-place updates of RCU-protected objects are sometimes handled using the list_replace_rcu() primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then list_replace_rcu() is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.
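
A hedged sketch of this copy-update pattern follows, assuming a list protected by a hypothetical foo_lock:

struct foo {
  struct list_head list;
  struct rcu_head rh;
  int data;
};
static DEFINE_SPINLOCK(foo_lock);

static void foo_update(struct foo *oldp, int new_data)
{
  struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);

  if (!newp)
    return;
  spin_lock(&foo_lock);
  *newp = *oldp;          /* Copy the old version... */
  newp->data = new_data;  /* ...then update the new version. */
  list_replace_rcu(&oldp->list, &newp->list);
  spin_unlock(&foo_lock);
  kfree_rcu(oldp, rh);    /* Old version is freed after a grace period. */
}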

However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the Kernel Concurrency Sanitizer (KCSAN) handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the call_rcu() function.

Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.

Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using SLAB_TYPESAFE_BY_RCU). Rust could presumably use appropriate atomic or volatile operations in this case.

One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrates how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.
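
A hedged sketch of this flag-and-lock pattern, with the structure layout and names again being illustrative:

struct foo {
  struct list_head list;
  struct rcu_head rh;
  spinlock_t lock;
  bool removed;  /* Set only while holding ->lock. */
};

static void foo_del(struct foo *fp)
{
  spin_lock(&fp->lock);
  list_del_rcu(&fp->list);
  fp->removed = true;  /* Readers must now pretend fp is absent. */
  spin_unlock(&fp->lock);
  kfree_rcu(fp, rh);
}

static bool foo_usable(struct foo *fp)
{
  lockdep_assert_held(&fp->lock);
  return !fp->removed;  /* If false, act as if fp had not been found. */
}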

RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.

It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).

Sequence locking is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named rename_lock to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (part 1, part 2, part 3). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.

Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers. There are a number of other approaches to read-side consistency both within and across data structures, including the Issaquah Challenge.

One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (rcu_sync_init() and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.
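
A hedged sketch of bare phased state change, that is, without the rcu-sync machinery, and with hypothetical do_old() and do_new() actions:

static void do_old(void);
static void do_new(void);
static int state;

static void reader(void)
{
  rcu_read_lock();
  if (READ_ONCE(state))
    do_new();
  else
    do_old();
  rcu_read_unlock();
}

static void updater(void)
{
  WRITE_ONCE(state, 1);
  synchronize_rcu();  /* Wait for readers that might see the old state. */
  /* No reader can now be executing do_old(). */
}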

This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.

Rust RCU Options

The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.

A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.

So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section. Another way is to acquire a lock within that object before exiting that critical section.
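
A hedged sketch of the reference-count variant, with the structure and global_foo pointer again being hypothetical:

struct foo {
  atomic_t refcnt;
  int a;
};
static struct foo __rcu *global_foo;

static struct foo *foo_get(void)
{
  struct foo *fp;

  rcu_read_lock();
  fp = rcu_dereference(global_foo);
  if (fp && !atomic_inc_not_zero(&fp->refcnt))
    fp = NULL;  /* Object on its way out: pretend not to have found it. */
  rcu_read_unlock();
  return fp;  /* A non-NULL pointer may now be used outside the reader. */
}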

So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN and build the kernel with CONFIG_RCU_STRICT_GRACE_PERIOD=y, and to run this on a small system. Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.

For More Information

Additional information on RCU may be found here:

  1. Linux-kernel documentation, with special emphasis on Linux-kernel RCU's requirements.
  2. The following sections of perfbook discuss other aspects of RCU:

    1. Section 9.5 ("Read-Copy Update").
    2. Section 10.3 ("Read-Mostly Data Structures").
    3. Section 13.5 ("RCU Rescues").
    4. Section 15.4.2 ("RCU [memory-ordering properties]").
  3. The RCU API, 2019 Edition.
  4. Userspace RCU.
  5. Folly-library RCU.
  6. Section 9.6.3.3 of perfbook lists several other RCU implementations.
  7. Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU) (C++ Working Paper).

RCU's semantics ordered by increasing formality:

  1. The "Fundamental Requirements" section of the aforementioned Linux-kernel RCU requirements. Extremely informal, but you have to start somewhere!
  2. RCU Semantics: A First Attempt. Also quite informal, but with a bit more math involved.
  3. Verifying Highly Concurrent Algorithms with Grace (extended version). Formal semantics expressed in separation logic.
  4. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. Executable formal semantics expressed in Cat language. This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.
  5. Linux-kernel memory model (LKMM), see especially the Sync-rcu and Sync-srcu relations in file linux-kernel.cat.

Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:

  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (paper).
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (paper).
  3. Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit 418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU"). Sadly, the scripting has not aged well.

For more information, see Verification Challenge 6.

For those interested in the userspace RCU library, there have been both manual and mechanical proofs of correctness. A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out.

Endnotes

[1]  At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by rcu_read_lock() and rcu_read_unlock(). For their part, synchronize_rcu() and call_rcu() do the waiting, synchronously and asynchronously, respectively.

It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor-man's garbage collector. Section 9.3.3 ("RCU Usage") of perfbook provides more detail on these and other RCU use cases.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add RCU-protected phased state change use case.
October 18, 2021: Add read/update common code and Tassarotti citation.

October 04, 2021 07:41 PM

October 02, 2021

Paul E. Mc Kenney: Can Rust Code Own Sequence Locks?

What Are Sequence Locks?

Sequence locks vaguely resemble reader-writer locks except that readers cannot block writers. If a writer runs concurrently with a reader, that reader is forced to retry its critical section. If a reader successfully completes its critical section (as in there were no concurrent writers), then its accesses will have been data-race free. Of course, an incessant stream of writers could starve all readers, and the Linux-kernel sequence-lock implementation provides extensions to deal with this in cases where incessant writing is a possibility. The usual approach is to instead arrange things so that writers are never incessant.

In the simple case where readers loop until they succeed, with no provisions for incessant writers, a reader looks like this:

do {
  seq = read_seqbegin(&test_seqlock);
  /* read-side access. */
} while (read_seqretry(&test_seqlock, seq));

A writer looks like this:

write_seqlock(&test_seqlock);
  /* Update */
write_sequnlock(&test_seqlock);

The read-side access can run concurrently with the writer's update, which is not consistent with the data-race-free nature of non-unsafe Rust. How can this be handled?

Linux-Kernel Sequence-Lock Use Cases

First, please note that if Rust-language sequence-locking readers are to interoperate with C-language sequence-locking updaters (or vice versa), then the Rust-language and C-language implementations must be compatible. One easy way to achieve such compatibility is to provide Rust code wrappers around the C-language implementation, or perhaps better yet, around higher-level C-language functions making use of sequence locking. Of course, if Rust-language use of sequence locks only ever accesses data that is never accessed by C-language code, then Rust could provide its own implementation of sequence locking. As always, choose carefully!

Next, let's look at a few classes of sequence-locking use cases within the Linux kernel.

One common use case is the textbook case where the read-side loop picks up a few values, all within a single function. For example, sequence locking is sometimes used to fetch a 64-bit value on a 32-bit system (see the sock_read_timestamp() function in include/net/sock.h; a sketch in this style follows the list below). The remainder of this section looks at some additional use cases that receive significant Linux-kernel use:

  1. Readers can gather data from multiple sources, including functions in other translation units that are invoked via function pointers. See for example the timer_cs_read() function in arch/sparc/kernel/time_32.c.
  2. Readers often perform significant computations. See for example the badblocks_check() function in block/badblocks.c.
  3. Readers can contain sequence-lock writers for that same sequence lock, as is done in the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. Although one could reasonably argue that this could be structured in other ways, this approach does avoid an added check, which could be valuable on fastpaths.
  4. The call to read_seqretry() is sometimes buried in a helper function, for example, in the sdma_progress() function in drivers/infiniband/hw/hfi1/sdma.h and the follow_dotdot_rcu() function in fs/namei.c.
  5. Readers sometimes use memcpy(), as might be expected by those suggesting that sequence-lock readers should clone/copy the data. See for example the neigh_hh_output() function in include/net/neighbour.h.
  6. Sequence-locking readers are often used in conjunction with RCU readers, as is described in the Can Rust Code Own RCU? posting.
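
Returning to the textbook use case called out before this list, here is a hedged sketch in the style of sock_read_timestamp(), with the seqlock and the 64-bit variable being hypothetical:

static DEFINE_SEQLOCK(ts_lock);
static u64 timestamp;

static u64 ts_read(void)
{
  unsigned int seq;
  u64 ts;

  do {
    seq = read_seqbegin(&ts_lock);
    ts = timestamp;  /* Two 32-bit loads on a 32-bit system. */
  } while (read_seqretry(&ts_lock, seq));
  return ts;  /* An untorn value, despite the lack of 64-bit atomicity. */
}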

Some of these use cases impose significant constraints on sequence-locking implementations. If the constraints rule out all reasonable Rust implementations, one approach would be to provide different Rust implementations for different use cases. However, it would be far better to avoid further API explosion!

Rust Sequence-Lock Options

One approach would be for all sequence-lock critical sections to be marked unsafe, but this removes at least some of the rationale for switching from C to Rust. In addition, some believe that this approach can pose severe Rust-syntax challenges in some use cases.

Another approach is to require all sequence-lock critical sections to use marked accesses. This avoids the data races and presumably also the need for unsafe, but it prevents the kernel concurrency sanitizer (KCSAN) from locating data races involving sequence-lock write-side critical sections. Or rather it would, were it not for the use of kcsan_flat_atomic_begin() and kcsan_flat_atomic_end() in sequence-locking read-side primitives and the use of kcsan_nestable_atomic_begin() and kcsan_nestable_atomic_end() in sequence-locking write-side primitives. So perhaps Rust could make use of similar hooks.

Yet another approach is to place the data referenced by readers into a separate structure and to switch back and forth between a pair of instances of such structures. This has been attempted in the Linux kernel with unsatisfactory results. In fact, in those cases where it is feasible to use multiple instances of a separate structure, Linux-kernel RCU is usually a better choice.

One more approach would be creating something like a "tentatively readable" Rust ownership class that could directly handle sequence-locking read-side critical sections. For example, if a quantity computed from a read within a failed critical section were used subsequently, Rust could emit a warning. Hopefully without much in the way of false positives. (We can dream, can't we?)

One might instead search the Linux-kernel source tree for uses of sequence locking and note that it is used only by a very few device drivers:

     16 drivers/dma-buf/
      1 drivers/firmware/efi/
      5 drivers/gpu/drm/
      2 drivers/gpu/drm/amd/amdgpu/
      2 drivers/gpu/drm/i915/gem/
     15 drivers/gpu/drm/i915/gt/
      5 drivers/hwmon/
     72 drivers/infiniband/hw/hfi1/
     11 drivers/md/
     15 drivers/net/ethernet/mellanox/mlx4/
     19 drivers/net/ethernet/mellanox/mlx5/core/lib/

These device drivers (or at least those portions of them that make use of sequence locking) could then be relegated to C code.

However, the use of sequence locking has been increasing, and it is likely that it will continue to increase, including within device drivers. Longer term, Rust's ownership model should therefore be extended to cover sequence locking in a natural and safe manner.

For More Information

For more on sequence locking, see the seqlock.rst documentation. Sequence locking is also covered in Sections 9.4 ("Sequence Locking") and 13.4 ("Sequence-Locking Specials") of perfbook.

History

October 6, 2021: Updated to add sectioning, Linux-kernel use cases, and updated KCSAN commentary.
October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add a note discussing possible interoperability between Rust-language and C-language sequence locking.

October 02, 2021 12:16 AM

October 01, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (OOTA)

Out-of-thin-air (OOTA) values are a troubling theoretical problem, but thus far do not actually occur in practice. In the words of the Bard: "It is a tale told by an idiot, full of sound and fury, signifying nothing."

But what is life without a little sound and fury? Besides, "mere theory" can loom large for those creating the tools required to analyze concurrent software. This post therefore gives a brief introduction to OOTA, and then provides some simple practical solutions, along with some difficult ones.

Here is the canonical OOTA example, involving two threads each having but one statement:

T1: x.store(y.load(mo_relaxed), mo_relaxed);
T2: y.store(x.load(mo_relaxed), mo_relaxed);

Believe it or not, according to the C/C++ memory model, even if both x and y are initially 0, after both threads complete it might be that x==y==42!

LKMM avoids OOTA by respecting dependencies, at least when headed by marked accesses. Plus the dependent access must either be marked or be free of data races. For example:

T1: WRITE_ONCE(x, READ_ONCE(y));
T2: WRITE_ONCE(y, READ_ONCE(x));

If x and y are both initially zero, then they are both guaranteed to stay that way!

Great quantities of ink have been spilled on the OOTA problem, but these three C++ working papers give a reasonable overview and cite a few of the more prominent papers:

  1. P0422R0: Out-of-Thin-Air Execution is Vacuous. This presents a fixed-point analysis that to the best of my knowledge is the first scheme capable of differentiating OOTA scenarios from straightforward reordering scenarios. Unfortunately, this analysis cannot reasonably be applied to automated tooling. This paper also includes a number of references to other work, which unfortunately either impose unnecessary overhead on weakly ordered systems on the one hand or are well outside of what compiler developers are willing to countenance on the other.
  2. P1916R0: There might not be an elegant OOTA fix. This diabolically clever paper shows how back-propagation of undefined behavior can further complicate the OOTA question.
  3. P2055R0: A Relaxed Guide to memory_order_relaxed. This paper presents memory_order_relaxed use cases that are believed to avoid the OOTA problem.

As with address, data, and control dependencies, the trivial solution is for Rust to simply promote all READ_ONCE() operations to smp_load_acquire(), but presumably while leaving the C-language READ_ONCE() untouched. This approach increases overhead, but keeps the people building analysis tools happy, or at least happier. Unfortunately, this trivial solution incurs overhead on weakly ordered systems, overhead that is justified only in theory, never in practice.
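
In the Linux kernel, such a promotion could be confined to a Rust-facing wrapper, for example as in the following hypothetical sketch (rust_read_once_int() is a made-up name):

    /* Rust-visible loads get acquire ordering, which forbids the
     * problematic reordering, while C-language uses of READ_ONCE()
     * remain untouched. */
    static inline int rust_read_once_int(const int *p)
    {
        return smp_load_acquire(p);
    }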

But if analysis tools are not an issue, then Rust could simply use volatile loads (perhaps even directly using an appropriately Rust-wrapped instance of the Linux kernel's READ_ONCE() primitive), just as the Linux kernel currently does (but keeping architecture-specific requirements firmly in mind). The problem is after all strictly theoretical.

History

October 12, 2021: Self-review. Note that the comments are specific to earlier versions of this blog post.

October 01, 2021 11:27 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Address/Data)

An address dependency involves a load whose return value directly or indirectly determines the address of a later load or store, which results in the earlier load being ordered before the later load or store. A data dependency involves a load whose return value directly or indirectly determines the value stored by a later store, which results in the load being ordered before the store. These are used heavily by RCU. Although they are not quite as fragile as control dependencies, compilers still do not know about them. Therefore, care is still required, as can be seen in the rcu_dereference.rst Linux-kernel coding guidelines. As with control dependencies, address and data dependencies enjoy very low overheads, but unlike control dependencies, they are used heavily in the Linux kernel via rcu_dereference() and friends.
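
As a concrete example, here is the usual RCU read-side idiom in sketch form (struct foo and gp are made-up names). The rcu_dereference() load heads an address dependency, ordering it before the dereference that follows:

    struct foo {
        int a;
    };
    static struct foo __rcu *gp;

    int reader(void)
    {
        struct foo *p;
        int val = 0;

        rcu_read_lock();
        p = rcu_dereference(gp);  /* Load heading the dependency. */
        if (p)
            val = p->a;           /* Address-dependent load: ordered. */
        rcu_read_unlock();
        return val;
    }

A later store whose stored value is computed from the loaded pointer (for example, WRITE_ONCE(z, p->a)) would instead be ordered by a data dependency.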


  1. As with control dependencies, the trivial solution is to promote READ_ONCE() to smp_load_acquire(). Unlike with control dependencies, there are only a few such READ_ONCE() instances, and almost all of them are conveniently located in definitions of the rcu_dereference() family of functions. Because the rest of the kernel might not be happy with the increase in overhead due to such a promotion, it would likely be necessary to provide Rust-specific implementations of rcu_dereference() and friends.
  2. Again as with control dependencies, an even more trivial solution is to classify code containing address and data dependencies as core Linux-kernel code that is outside of Rust's scope. However, given that there are some thousands of instances of rcu_dereference() scattered across the Linux kernel, this solution might be a bit more constraining than many Rust advocates might hope for.
  3. Provide Rust wrappers for the rcu_dereference() family of primitives. This does incur additional function-call overhead, but on the other hand, if initial use of Rust is confined to performance-insensitive device drivers, this added overhead is unlikely to be a problem.
  4. But if wrapper overhead nevertheless proves problematic, provide higher-level C-language functions that encapsulate the required address and data dependencies, and that do enough work that the overhead of the wrappering for Rust-language use is insignificant.
  5. And yet again as with control dependencies, the best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the address/data-dependency code-style restrictions documented in the aforementioned rcu_dereference.rst. There is some reason to hope that this enforcement would be significantly easier than for control dependencies.
  6. Wait for compiler backends to learn about address and data dependencies. This might take some time, but there is ongoing work along these lines that is described below.

One might hope that C/C++'s memory_order_consume would correctly handle address and data dependencies, and in fact it does. Unfortunately, in all known compilers, it does so by promoting memory_order_consume to memory_order_acquire, which adds overhead just as surely as does smp_load_acquire(). There has been considerable work done over a period of some years towards remedying this situation, including these working papers:

  1. P0371R1: Temporarily discourage memory_order_consume (for some definition of "temporarily").
  2. P0098R1: Towards Implementation and Use of memory_order_consume, which reviews a number of potential remedies.
  3. P0190R4: Proposal for New memory_order_consume Definition, which selects a solution involving marking pointers carrying dependencies. Note that many Linux-kernel developers would likely demand a compiler command-line argument that caused the compiler to act as if all pointers had been so marked. Akshat Garg prototyped marked dependency-carrying pointers in gcc as part of a Google Summer of Code project.
  4. P0750R1: Consume, which proposes carrying dependencies in an extra bit associated with the pointer. This approach is much more attractive to some compiler developers, but many committee members did not love the resulting doubling of pointer sizes.
  5. P0735R1: Interaction of memory_order_consume with release sequences, which addresses an obscure C/C++ memory-model corner case.

There appears to be a recent uptick in interest in a solution to this problem, so there is some hope for progress. However, much more work is needed.
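
For reference, here is what a consume load looks like in C11 (a minimal sketch, with struct foo and gp again being made-up names). As noted above, all known compilers currently implement this by promoting the load to memory_order_acquire:

    #include <stdatomic.h>

    struct foo { int a; };
    static struct foo *_Atomic gp;

    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        /* The dependent load p->a is ordered after the load of gp. */
        return p ? p->a : 0;
    }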

More information on address and data dependencies may be found in Section 15.3.2 ("Address- and Data-Dependency Difficulties") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 09:04 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Control)

At the assembly language level on many weakly ordered architectures, a conditional branch acts as a very weak, very cheap, but very useful memory-barrier instruction. It orders any load whose return value feeds into the condition codes before all stores that execute after the branch instruction completes, whether the branch is taken or not. ARMv8 also has a conditional-move instruction (CSEL) that provides similar ordering.

Because the ordering properties of the conditional branch involve dependencies from the load to the branch and from the branch to the store, and because the branch is a control-flow instruction, this ordering is said to be due to a control dependency.

Because compilers do not understand them, control dependencies are quite fragile, as documented by the many cautionary tales in the Linux kernel's memory-barriers.txt documentation (search for the "CONTROL DEPENDENCIES" heading). But they are very low cost, so they are used on a few critically important fastpaths in the Linux kernel.
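
The canonical pattern is shown in the following minimal sketch, where flag and data are made-up variables that would be shared with other threads:

    /* The load from flag is ordered before either store to data because
     * neither store can execute until the branch resolves.  Note that
     * later loads are *not* ordered, and that compiler optimizations can
     * destroy the ordering, for example, if both branches were to store
     * the same value. */
    if (READ_ONCE(flag))
        WRITE_ONCE(data, 1);
    else
        WRITE_ONCE(data, 2);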

Rust could deal with control dependencies in a number of ways:


  1. The trivial solution is to promote the loads heading the control dependencies to smp_load_acquire(). This works, but adds instruction overhead on some architectures and needlessly limits compiler optimizations on all architectures (but to be fair, ARMv8 does exactly this when built with link-time optimizations). Another difficulty is identifying (whether manually or automatically) exactly which READ_ONCE() calls need to be promoted.
  2. An even more trivial solution is to classify code containing control dependencies as core Linux-kernel code that is outside of Rust's scope. Because there are very few uses of control dependencies in the Linux kernel, Rust would not lose much by taking this approach. In addition, there is the possibility of creating higher-level C-language primitives containing the needed control dependencies which are then wrappered for Rust-language use.
  3. The best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the code style restrictions documented in memory-barriers.txt. However, there is some chance that this approach might prove to be non-trivial.
  4. Wait for compiler backends to learn about control dependencies. This might be a bit of a wait, especially given the difficulty even defining control dependencies within the current nomenclature of the C/C++ standards.

More information on control dependencies may be found in Section 15.3.3 ("Control-Dependency Calamities") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 08:24 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Atomics and Barriers and Locks, Oh My!

One way to reduce the number of occurrences of unsafe in Rust code in Linux is to push the unsafety down into atomic operations, memory barriers, and locking primitives, which are the topic of this post. But first, here are some materials describing LKMM:


  1. P0124R7: Linux-Kernel Memory Model: A C++ standards-committee working paper comparing the C/C++ memory model to LKMM.
  2. Linux Weekly News series on LKMM (Part 1 and Part 2).
  3. The infamous ASPLOS'18 paper entitled Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled), with a title-based tip of the hat to the irrepressible Mel Gorman.
  4. Chapter 15 of perfbook ("Advanced Synchronization: Memory Ordering").
  5. The Linux kernel's tools/memory-model directory, featuring an executable version of LKMM.

For all of these references, I give a big "Thank You!!!" to my co-authors.

LKMM is not the most complex memory model out there, but neither is it the simplest. In addition, it is in some ways more strict than the C/C++ memory models, which means that strict adherence to coding guidelines is required in order to prevent compiler optimizations from breaking Linux-kernel code. Many of these optimizations are not localized, but are instead scattered hither and yon throughout the compilers, including throughout the compiler backends. The optimizations in the backends are a special challenge to Rust, which seems to take the approach of layering safety on top of (or perhaps within) the compiler frontend. Later posts in this series will look at several pragmatic options available to Rust Linux-kernel code.

There is one piece of good news: Compilers are forbidden from introducing data races, at least into code that is free of undefined behavior.

With all of that out of the way, let's look at Rust's options for dealing with Linux-kernel atomics and barriers and locks.

The first approach is to carefully read the P0124R7: Linux-Kernel Memory Model working paper and even more carefully follow its advice in selecting C/C++ primitives that best match Linux-kernel atomics, barriers, and locks. This approach works well for data whose definition and use are confined to Rust code, and with sufficient care and ongoing attention can also work for atomic operations and memory barriers involving data shared with C code. However, expecting Rust locking primitives to interoperate with Linux-kernel locking primitives might not be a strategy to win. It seems wise to make direct use of the existing Linux-kernel locking primitives, keeping in mind that this means properly wrappering them in order to make Rust ownership work as intended. Those who doubt the wisdom of wrappering the C-language Linux-kernel locking primitives should consider the following:

  1. Linux-kernel locks are complex and highly optimized. Keeping two implementations is an excellent way to inject profound bugs into the Linux kernel.
  2. Linux-kernel locks are deeply entwined with the lockdep lock dependency checker. The data structures implementing each lock class would need to be shared between C and Rust code, which is another excellent way to inject bugs.
  3. On some architectures, Linux-kernel locks must interact with memory-mapped I/O (MMIO) accesses. Any Rust-language implementation of Linux-kernel locks must therefore be architecture-dependent and must know quite a bit about Linux-kernel MMIO.
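
Wrappering can be as simple as exporting trivial C-language shims for Rust bindings to call, keeping the one and only lock implementation (along with its lockdep hooks) in C. A hypothetical sketch, with made-up rust_helper_ names:

    #include <linux/spinlock.h>

    /* Shims are needed in part because spin_lock() and friends may be
     * macros or inline functions, which foreign-language bindings cannot
     * invoke directly. */
    void rust_helper_spin_lock(spinlock_t *lock)
    {
        spin_lock(lock);
    }

    void rust_helper_spin_unlock(spinlock_t *lock)
    {
        spin_unlock(lock);
    }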

As described in later sections, it might be useful to promote READ_ONCE() to smp_load_acquire() instead of implementing it as a volatile load. It might also be useful to promote WRITE_ONCE() to smp_store_release() instead of implementing it as a volatile store, depending on what sort of data-race analysis Rust provides for unsafe code. There is some C/C++ work in flight towards providing better definitions for volatile operations, but it is still early days for this work.

If READ_ONCE() and WRITE_ONCE() are instead to be implemented as volatile operations in Rust, please take care to check the individual architectures that are affected. DEC Alpha requires a full memory-barrier instruction at the end of READ_ONCE(), Itanium requires promotion of volatile loads to acquire loads (but this is carried out by the compiler), and ARMv8 requires READ_ONCE() to be promoted to acquire (but only in CONFIG_LTO=y builds).
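
To illustrate the issue (and this is emphatically not the kernel's actual definition), a volatile-load-based READ_ONCE() must leave room for an architecture hook, with arch_read_once_barrier() below serving as a hypothetical placeholder:

    /* Sketch only.  The hypothetical arch_read_once_barrier() would be a
     * full memory-barrier instruction on DEC Alpha and a no-op on most
     * other architectures. */
    #define MY_READ_ONCE(x)                                              \
    ({                                                                   \
        typeof(x) ___v = *(const volatile typeof(x) *)&(x);              \
        arch_read_once_barrier();                                        \
        ___v;                                                            \
    })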

Device drivers make heavy use of volatile accesses and memory barriers for MMIO accesses, and Linux-kernel device drivers are no exception. As noted earlier, some architectures require that these accesses interact with locking primitives. Furthermore, there are many device-specific special cases surrounding device control in general and MMIO in particular. Therefore, Rust-language device drivers should access the existing Linux-kernel C-language primitives rather than creating their own, especially to start with. There might well be exceptions to this rule, for example, Rust might be applied to a device driver that is only used by architectures that do not require interaction with locking primitives. But if you write a driver containing Rust-language MMIO primitives, please carefully and prominently document the resulting architecture restrictions.

This suggests another approach, namely not bothering to implement any of these primitives in Rust, but rather making direct use of the Linux-kernel implementations, as suggested earlier for locking and MMIO primitives. And again, this requires wrappering them for use by Rust code. However, such wrappering introduces another level of function call, potentially for tiny functions. Although it is expected that LTO will successfully inline tiny functions, not all of the world is yet ready for LTO. In the meantime, where feasible, developers should avoid invoking tiny C functions from Rust-language fastpaths.

This being the real world, we should expect that the Rust/C determination will need to be made on a case-by-case basis, with many devils in the details.

History

October 12, 2021: Self-review changes.
October 13, 2021: Add explicit justification for wrappering the Linux kernel's C-language locks and add a few observations about MMIO accesses.

October 01, 2021 07:02 PM

    Paul E. Mc Kenney: Rust Concurrency Philosophy: A Historical Perspective

    At first glance, Rust's concurrency philosophy resembles that of Sequent's DYNIX and DYNIX/ptx in the 1980s and early 1990s: "Lock data, not code" (see Jack Inman's classic USENIX'85 paper "Implementing Loosely Coupled Functions on Tightly Coupled Engines", sadly invisible to search engines). Of course, Sequent lacked Rust's automatic checking, and Sequent's software engineers made much less disciplined use of ownership than Rust fans recommend. Nevertheless, this resemblance has resulted in some comparisons of Rust with the DEC Alpha, which had a similar concurrency philosophy.

    Interestingly enough, DYNIX and early versions of DYNIX/ptx used compile-time-allocated arrays for almost all of their data structures. You want your kernel to support up to N tasks? Very well, build your kernel to have an array of N task structures. This worked surprisingly well, perhaps because the important concurrent applications of that time had very predictable resource requirements, including numbers of tasks. Nevertheless, as you might expect, this did become quite the configuration nightmare. So why were arrays used in the first place?

    To the best of my knowledge, the earliest published complete articulation of the reason appeared in Gamsa et al.'s landmark paper  "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System". The key point is that you cannot protect a dynamically allocated object with a lock located within that object. The DYNIX arrays avoided deallocation (or, alternatively, provided a straightforward implementation of type-safe memory), thus allowing these objects to be protected with internal locks. Avoiding the need for global locks or reference counters was an important key to the performance and scalability prized by Sequent's customers.
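
    In miniature, the problem looks something like the following sketch, in which myobj and myobj_lookup() are made-up names:

        struct myobj {
            spinlock_t lock;  /* Protects this object's fields. */
            int data;
        };

        int unsafe_read(int key)
        {
            struct myobj *p = myobj_lookup(key);  /* Hypothetical lookup. */
            int val;

            if (!p)
                return -1;
            /* Unsafe: p may be freed between the lookup and this point,
             * making the lock acquisition a use-after-free.  Static
             * arrays, type-safe memory, and RCU are all ways of closing
             * this hole. */
            spin_lock(&p->lock);
            val = p->data;
            spin_unlock(&p->lock);
            return val;
        }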

    This strategy worked less well when Sequent added a distributed lock manager because the required number of locks was not predictable, nor was there a useful upper bound. This problem was solved in part by the addition of RCU, which provided a high-performance and scalable means of resolving races between acquiring a given object's lock and deletion (and subsequent freeing) of that same object. Given that DEC Alpha famously had difficulty with RCU, it is only reasonable to ask how Rust will do with it. Or must the concurrency designs of those portions of the Linux kernel that are to be written in Rust be "fitted" to a Rust-language ProcRustean bed [1]? Those who prize Rust's fearless-concurrency goal above all else might reasonably argue that this ProcRustean bed is in fact a most excellent thing. However, some Linux-kernel maintainers (including this one) might in their turn reasonably argue that within the context of some portions of the Linux kernel, a proper level of fear is a very healthy thing. As is the ability to do one's job in a reasonably straightforward manner!

    One way to avoid this ProcRustean bed is to use the Rust unsafe facility, and in fact "unsafe" has been the answer to a disturbingly large number of my questions about Rust [2]. However, use of this facility introduces the possibility of data races, which in turn raises the question of Rust's memory model. Within the Linux kernel, the answer to this question is of course LKMM, or perhaps some reasonable subset of LKMM.

    However, in my personal experience, I have most frequently seen Rust being used to rewrite scripts that became performance problems upon being more widely deployed than expected. In some cases, these rewrites greatly improved user experience as well as performance. This means that Rust is heavily used outside of the Linux kernel, which in turn means that LKMM might not be the right answer for Rust in general, though some in the Rust community have come out strongly in favor of extending the ProcRustean bed. But this blog series is focused on Rust for the Linux kernel, so the question of the memory model for Rust in general is out of scope. Again, for Rust in the Linux kernel, some subset of LKMM is clearly the correct memory model.

    Many of the following posts in this series cover ways that Rust might work with specific aspects of LKMM, including some wild speculation about how Rust's ownership model might be generalized in a manner similar to the way that the Linux kernel's lockdep checking has been generalized for cross-released locks and for RCU. Readers wishing to learn more about non-ProcRustean concurrency designs are invited to peruse "Is Parallel Programming Hard, And, If So, What Can You Do About It?", hereinafter called "perfbook". Specific chapters and sections of the Second Edition of this book will be cited as appropriate by later posts in this series.

    Endnotes

    [1]  Making this 1990s-style concurrency scale usually involves hashed arrays of locks. These are often deadlock-prone, but there are heavily used techniques that (mostly) avoid the deadlocks. See Section 7.1.1.6 ("Acquire Needed Locks First") in perfbook for one such technique. However, hashed arrays of locks are prone to scalability problems, especially on multi-socket systems, due to poor locality of reference. See for example Section 10.2.3 ("Hash-Table Performance") for performance results on hash tables, also in perfbook. As a result, most attempts to apply hashed arrays of locks to the Linux kernel resulted in RCU being used instead. The performance and scalability benefits of RCU (and hazard pointers) are shown in Section 10.3 ("Read-Mostly Data Structures"), again in perfbook.
     
    [2]  But please note that Rust's unsafe code has only limited undefined-behavior unsafe superpowers:

    1. Dereferencing a raw pointer (and it is the programmer's responsibility to avoid destructive wild-pointer dereferences).
    2. Calling an unsafe function or method.
    3. Accessing or modifying a mutable static variable (and it is the programmer's responsibility to avoid destructive data races).
    4. Implementing an unsafe trait.
    5. Accessing fields of a union (and it is the programmer's responsibility to avoid accesses that invoke undefined behavior, or, alternatively, understand how the compiler at hand reacts to any undefined behavior that might be invoked).
    I do find the Rust community's sharp focus on undefined-behavior-induced bugs over other types of bugs to be rather surprising. Perhaps this is because I have not had so much trouble with undefined-behavior-induced bugs. Conspiracy theorists might imagine an unholy alliance between UB-happy developers of compiler optimizations on the one hand and old-school parallel programmers wishing to turn the clock back to the 1990s on the other hand. I am happy to let such theorists imagine such things. ;-)

    Besides, more recent discussions have focused on memory safety rather than the full gamut of undefined behavior.


    History

    October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
    October 13, 2021: Add note on memory safety specifically rather than undefined behavior in general.

    October 01, 2021 06:08 PM

    Paul E. Mc Kenney: So You Want to Rust the Linux Kernel?

    There has been much discussion of using the Rust language in the Linux kernel (for example, here, here, and here) and at the Kangrejos Rust for Linux Workshop (here, here, and here), and the 2021 Linux Plumbers Conference had a number of sessions on this topic, as did the Maintainers Summit. At least two of these sessions mentioned the question of how Rust is to handle the Linux-kernel memory model (LKMM), and I volunteered to write this blog series on this topic.

    This series focuses mostly on use cases and opportunities, rather than on any non-trivial solutions. Please note that I am not in any way attempting to dictate or limit Rust's level of ambition. I am instead noting the memory-model consequences of a few potential levels of ambition, ranging from "portions of a few drivers" through "a few drivers" and "some core code" up to and including "the entire kernel". Greater levels of ambition will require greater willingness to accommodate a wider variety of LKMM requirements.

    One could argue that portions or even all of the Linux kernel should instead be hammered into the Rust ownership model. On the other hand, might the rumored sudden merge of the ksmbd driver (https://lwn.net/Articles/871098/) have been due to the implicit threat of its being rewritten in Rust? [1] Nevertheless, in cases where Rust is shown to offer particularly desirable advantages, it is quite possible that Rust and some parts of the Linux kernel might meet somewhere in the middle.

    These blog posts will therefore present approaches ranging upwards from trivial workarounds. But be warned that some of the high-quality approaches require profound reworking of compiler backends that have thus far failed to spark joy in the hearts of compiler writers. In addition, Rust enjoys considerable use outside of the Linux kernel, for but one example that I have personally observed, as something into which to rewrite inefficient Python scripts. (A megawatt here, a megawatt there, and pretty soon you are talking about real power consumption!) Therefore, there might well be sharp limits beyond which the core Rust developers are unwilling to go.

    The remaining posts in this series are as follows:


    1. Rust Concurrency Philosophy: A Historical Perspective
    2. Atomics and Barriers and Locks, Oh My!
    3. Compiler Writers Hate Dependencies (Control)
    4. Compiler Writers Hate Dependencies (Address/Data)
    5. Compiler Writers Hate Dependencies (OOTA)
    6. Can Rust Code Own Sequence Locks?
    7. Can Rust Code Own RCU?
    8. How Much of the Kernel Can Rust Own?
    9. Will Your Rust Code Survive the Attack of the Zombie Pointers?
    10. Can the Kernel Concurrency Sanitizer Own Rust Code?
    11. Summary and Conclusions
    12. TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

    Endnotes


    [1]  Some in the Linux-kernel community might be happy with either outcome: (1) The threat of conversion to Rust caused people to push more code into mainline and (2) Out-of-tree code was converted to Rust by Rust advocates and then pushed into mainline. The latter case might need special care for longer-term maintenance of the resulting Rust code, but perhaps the original authors might be persuaded to declare victory, learn Rust, and maintain the code. Who knows? ;-)


    History

    October 8, 2021: Fix s/LInux/Linux/ typo noted by Miguel Ojeda
    October 12, 2021: Self-review, including making it clear that Rust might have use cases other than rewriting inefficient scripts.
    October 13, 2021: Add a link to the recommendations post.

    The October 12 update affected the whole series, for example, removing the "under construction" markings. Summary of significant updates to other posts:

    October 01, 2021 12:39 AM

    September 30, 2021

    James Bottomley: Linux Plumbers Conference Matrix and BBB integration

    The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

    One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

    Integrating Matrix Chat into BBB

    Most of this integration was done by Guy Lunardi.

    It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did was to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass-through login credentials at the last minute (life’s no fun unless you live dangerously).

    Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email address and registration confirmation code as username/password, checked via LDAP. The lpcfe front end Jon created is here: git://git.lwn.net/lpcfe.git; it manages the whole of the conference login process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

    The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine that if anyone wishes to take this work forward, adding native Matrix device/session support to riot-embedded would be better.

    The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

    Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

    Setting up Matrix for the Conference

    We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

    This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (We did this in riot-embedded but ran out of time to do it in Element web as well.)

    Because we used LDAP, the actual matrix account for each user was created the first time they log into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

    Testing our Matrix and BBB integration

    We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

    In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

    What happened on the Day

    Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this was clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% on one of the CPUs, still taking 30 seconds per first login.

    Fixing the First Day problems

    The first thing we realized is we had to multi-thread the synapse server. This is well known, but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. Our first scaling attempt, simply adding 16 generic worker apps to scale across all our physical CPUs, failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

    Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

    Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

    Our configuration files are

    Once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target.

    Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

    emaillist=lpc2021-all-attendees.txt
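    # Log each attendee in once via the login API so that account creation
    # and room auto-joins happen before the sessions start.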
    IFS=$'\t'	# the attendee list is tab-separated
    
    while read first last confirmation email; do
        bbblogin=${email/+*@/@}
        matrixlogin=${bbblogin/@/.}
        curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
        sleep 1
    done < ${emaillist}

    The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

    Conclusion

    With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

    In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

    September 30, 2021 04:24 PM

    September 25, 2021

    Brendan Gregg: The Speed of Time

    How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective.

    A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!?

    ## 1. CLI tools

    The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my [60s checklist]). Was there some other program consuming CPU, like a misbehaving Ubuntu service that wasn't in CentOS? top(1) showed that only the Cassandra database was consuming CPU. What about short-lived processes, like a service restarting in a loop? These can be invisible to top(1). My execsnoop(8) tool (back then my [Ftrace version]) showed nothing. It seemed that the extra CPU time really was in Cassandra, but how?

    ## 2. CPU profile

    Understanding CPU time should be easy by comparing [CPU flame graphs]. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side. The CentOS flame graph:

    The Ubuntu flame graph:
    Darn, they didn't work. There's no Java stack (there should be a tower of green Java methods); instead there's only a single green frame or two. This is how Java flame graphs looked at the time. Later that year I prototyped the c2 frame pointer fix that became -XX:+PreserveFramePointer, which fixes Java stacks in these profiles.

    Even with the broken Java stacks, I noticed a big difference: On Ubuntu, there's a massive amount of CPU time in a libjvm call: os::javaTimeMillis(), 30.14% in the middle of the flame graph. Searching shows it was elsewhere as well for a total of 32.1%. This server is spending about a third of its CPU cycles just checking the time! This was a weird problem to think about: Time itself had now become a resource and target of performance analysis.

    The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls, which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere. If that were the case, I'd have zoomed in to see what they were, then zoomed out and searched for them for the cumulative percent; or flipped the merge order so it's an icicle graph, merging leaf to root. But in this case I didn't have to do anything, as it was mostly merged by accident. (This is one of the motivating reasons for switching to a d3 version of flame graphs, as I want the interactivity of d3 to do things like collapse all the Java frames, all the user-mode frames, etc., to expose different groupings like this.)

    ## 3. Studying the flame graph

    os::javaTimeMillis() fetches the current time. Browsing the flame graph shows it is calling the gettimeofday(2) syscall, which enters the tracesys() and syscall_trace_enter/exit() kernel functions. This gave me two theories: A) Some syscall tracing is enabled in Ubuntu (auditing? apparmor?). B) Fetching time was somehow slower on Ubuntu, which could be a library change or a kernel/clocksource change. Theory (A) is most likely based on the frame widths in the flame graph, but I'm not completely sure. As a Xen guest, this profile was gathered using perf(1) and the kernel's software cpu-clock soft interrupts, not the hardware NMI. Without NMI, some kernel code paths (interrupts disabled) can't be profiled. Also, since it's a Xen guest, hypervisor time can never be profiled. These two factors mean that there can be missing kernel and hypervisor time in the flame graph, so the true breakdown of time in os::javaTimeMillis() may be a little different.

    Note that Ubuntu also has a frame to show entry into vDSO (virtual dynamic shared object). This is a user-mode syscall accelerator, and gettimeofday(2) is a classic use case that is cited in the vdso(7) man page. At the time, the Xen pvclock source didn't support vDSO, so you can see the syscall code above the vdso frame. It's the same on CentOS, although it doesn't include a vdso frame in the flame graph (I'd guess due to a perf(1) difference alone).

    ## 4. Colleagues/Internet

    I love using [Linux performance tools]. But I also love solving issues quickly, and sometimes that means just asking colleagues or searching the Internet. I'm including this step as a reminder for anyone following this kind of analysis. Others I asked hadn't hit this issue, and the Internet at the time had nothing using the search terms os::javaTimeMillis, clocksource, tracesys(), Ubuntu, EC2, Xen, etc. (That changed by the end of the year.)

    ## 5. Experimentation

    To further analyze this with observability tools, I could:
    A. Fix the Java stacks to see if there's a difference in how time is used on Ubuntu. Maybe Java is calling it more often for some reason.
    B. Trace the gettimeofday() and related syscall paths, to see if there's some difference there, e.g., errors.
    But as I summarized in my [What is Observability] post, the term observability can be a reminder not to get stuck on that one type of analysis. Here are some experimental approaches I could also explore:
    C. Disable tracesys/syscall_trace.
    D. Microbenchmark os::javaTimeMillis() on both systems.
    E. Try changing the kernel clocksource.
    As (C) looked like a kernel rebuild, I started with (D) and (E).

    ## 6. Measuring the speed of time

    Is there already a microbenchmark for os::javaTimeMillis()? This would help confirm that these calls really were slower on Ubuntu. I couldn't find such a microbenchmark, so I wrote something simple. I'm not going to try to make before and after time calls to time the duration in time (which I guess would work if you factored in the extra time in the timing calls). Instead, I'm just going to call time millions of times in a loop and time how long it takes (sorry, that's too many different usages of the word "time" in one paragraph):
    $ cat TimeBench.java
    public class TimeBench {
        public static void main(String[] args) {
    	for (int i = 0; i < 100 * 1000 * 1000; i++) {
    		long t0 = System.currentTimeMillis();
    		if (t0 == 87362) {
    			System.out.println("Bingo");
    		}
    	}
        }
    }
    
    This does 100 million calls of currentTimeMillis(). I then executed it via the shell time(1) command to give an overall runtime for those 100 million calls. (There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize out an otherwise empty loop. This will slow this test a little.) Trying it out:
    centos$ time java TimeBench
    real	0m12.989s
    user	0m3.368s
    sys	0m18.561s
    
    ubuntu# time java TimeBench
    real	1m8.300s
    user	0m38.337s
    sys	0m29.875s
    
    How long is each time call? Assuming the loop is dominated by the time call, it works out to be about 0.13 us on CentOS and 0.68 us on Ubuntu: Ubuntu is 5x slower. As I'm interested in the relative comparison, I can just compare the total runtimes (the "real" time) for the same result. I also rewrote this in C and called gettimeofday(2) directly:
    $ cat gettimeofdaybench.c
    #include <sys/time.h>
    
    int
    main(int argc, char *argv[])
    {
    	int i, ret;
    	struct timeval tv;
    	struct timezone tz;
    
    	for (i = 0; i < 100 * 1000 * 1000; i++) {
    		ret = gettimeofday(&tv, &tz);
    	}
    
    	return (0);
    }
    
    I compiled this with -O0 to avoid dropping the loop. Running this on the two systems saw similar results. I love short benchmarks like this, as I can disassemble the resulting binary and ensure that the compiled instructions match my expectations, and that the compiler hasn't messed with it.

    ## 7. clocksource Experimentation

    My second experiment was to change the clocksource. Checking those available:
    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    xen tsc hpet acpi_pm
    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    xen
    
    Ok, so it's defaulted to xen, which we saw in the flame graph (the tower ends with pvclock_clocksource_read()). Let's try tsc, which should be the fastest:
    # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    tsc
    $ time java TimeBench
    real    0m3.370s
    user    0m3.353s
    sys     0m0.026s
    
    The change is immediate, and my Java microbenchmark is now running over 20x faster than before! (And nearly 4x faster than on CentOS.) Now that it's reaching 33 ns, the loop instructions are likely inflating this result. If I wanted more accuracy, I'd partially [unroll the loop] so that the loop instructions become negligible.

    ## 8. Workaround

    The time stamp counter (TSC) clocksource is fast as it retrieves time using just an RDTSC instruction, and with vDSO it can do this without the syscall. TSC traditionally was not the default because of concerns about time drift; software-based clocksources could fix those issues and provide accurate monotonically-increasing time. I happened to be speaking at a technical conference while still debugging this, and mentioned what I was working on to a processor engineer. He said that tsc had been stable for years, and any advice about avoiding it was old. I asked if he knew of a public reference saying so, but he didn't. That chance encounter, coupled with Netflix's fault-tolerant cloud, gave me enough confidence to suggest trying tsc in production as a workaround for the issue. The change was obvious in the production graphs, showing a drop in write latencies:
    Once tested more broadly, it showed the write latencies dropped by 43%, delivering slightly better performance than on CentOS. The CPU flame graph for Ubuntu now looked like:
    os::javaTimeMillis() was now 1.6% in total. Note that it now enters "[[vdso]]" and nothing more: No kernel calls above it.

    ## 9. Aftermath

    I provided details to AWS and Canonical, and then moved on to the other performance issues as part of the migration. A colleague, Mike Huang, also hit this for a different service at Netflix and enabled tsc. We ended up setting it in the BaseAMI for all cloud services. Later that year (2014), Anthony Liguori from AWS gave a [re:Invent talk] recommending users switch the clocksource to tsc to improve performance. I also shared setting the clocksource in my talks and in my 2015 [Linux tunables] post. Over the years, more and more articles have been published about clocksource in virtual machines, and it's now a well-known issue. Amazon even provides an official [recommendation] (2021):
    "For EC2 instances launched on the AWS Xen Hypervisor, it's a best practice to use the tsc clock source. Other EC2 instance types, such as C5 or M5, use the AWS Nitro Hypervisor. The recommended clock source for the AWS Nitro Hypervisor is kvm-clock."
    As this indicates, things have changed with [Nitro], where clocksources are much faster (thanks!). In 2019 myself and others tested kvm-clock and found it was only about 20% slower than tsc. That's much better than the xen clocksource, but still slow enough to resist switching over absent a reason (such as a reemergence of tsc clock drift issues). I'm not sure if Intel ever published something to clarify tsc stability on newer processors. If you know they did, please drop a comment. The JMH benchmark suite can also now test System.currentTimeMillis(), so it's no longer necessary to roll your own (unless you want to disassemble it, in which case it's easier to have something short and simple).

    As for tracesys: I investigated the overhead for other syscalls and found it to be negligible, and before I returned to work on it further the kernel code paths changed and it was no longer present in the stacks. Did that Ubuntu release have a misconfiguration of auditing that was later fixed? I like to get to the rock bottom of issues, so it was a bit unsatisfying that the problem went away before I did. Even if I had figured it out, we'd still have preferred to go with tsc instead of the xen clocksource for the 4x improvement.

    ## 10. Summary

    Reading time itself can become a bottleneck for some clocksources. This was much worse many years ago on Xen virtual machine guests. For Linux I've been recommending the faster tsc clocksource for years, although I'm not a processor vendor so I can't make assurances about tsc issues of clock drift. At least AWS has now included it in their recommendations. Also, while I often post about superpower tracing tools, sometimes some humble hacking is best. In this case it was a couple of ad hoc microbenchmarks, only several lines of code each. Any time you're investigating the performance of some small discrete system component, consider finding or rolling your own microbenchmark to get more information on it experimentally. You have two hands: observation and experimentation.

    [What is Observability]: /blog/2021-05-23/what-is-observability.html
    [Ftrace version]: https://github.com/brendangregg/perf-tools
    [CPU flame graphs]: /FlameGraphs/cpuflamegraphs.html
    [Linux performance tools]: /linuxperf.html
    [Linux tunables]: /blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
    [60s checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
    [re:Invent talk]: https://youtu.be/ujGx0tiI1L4?t=2160
    [recommendation]: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
    [Nitro]: /blog/2017-11-29/aws-ec2-virtualization-2017.html
    [unroll the loop]: /blog/2014-04-26/the-noploop-cpu-benchmark.html

    September 25, 2021 02:00 PM

    September 20, 2021

    Linux Plumbers Conference: Welcome to LPC 2021 — Registration Closed

    Hi,
    thank you for attending LPC 2021!
    We have now reached our limit for attendees. Registration is now closed.
    If you are still intending to watch the conference you can do this by watching on YouTube.

    September 20, 2021 02:25 PM

    September 17, 2021

    Linux Plumbers Conference: Get ready for LPC 2021!

    The LPC 2021 conference is just around the corner. We wanted to share the logistics on how to participate and watch the virtual conference.

    For those that are not registered for the conference, we will have live streaming of the sessions on YouTube, like last year. This is free of charge. We will provide the URLs where to watch each day, on this page. The only limitation is that you cannot participate and ask questions live with audio. However this year we will have the chat in each Big Blue Button room also available externally via the Matrix open communication network. Anyone is invited to join with their personal Matrix account.

    Those who are registered for the conference will be able to log into our Big Blue Button server through our front end page, starting Monday September 20 at 7:00AM US Pacific time.
    To log in to BBB, please go to meet.lpc.events. You will find a front end showing the schedule for the current day with all the active sessions you can join. Your credentials are the email address you used for registration, and the confirmation code you received in email when you registered. Please make sure you have those available in advance of trying to log in.

    Please review the LPC 2021 Participant Guide before you join the conference.

    Looking forward to seeing you there!

    September 17, 2021 10:21 PM

    Linux Plumbers Conference: Linux Plumbers Conference 2021 is Almost Here

    We are only three days away from the start of LPC 2021!

    Thank you to all that made our conference possible:
    – Our generous Sponsors, listed here on the right
    – The Linux Foundation, which provides as always impeccable support
    – Our speakers and leaders, who are providing a lot of great content and planning great discussions

    As you can see, the schedule is finalized now. There are going to be seven parallel tracks each day, lasting four hours each. We have a total of 23 different tracks and Microconferences, with 191 sessions.

    At this time we are closing the CfPs for all tracks. We still have room for a limited number of Birds of a Feather sessions. If you want to propose one, even during the conference, and the necessary participants are all registered, please send an email to our lpc-contact@lists.linuxplumbersconf.org mailing list.

    Take a look at all the great technical content at this year's virtual LPC.
    You can view the schedule by main blocks, or by track, or as a complete detailed view.

    Note that at the end of the first day we’ll have a plenary keynote by Jon “maddog” Hall.
    Additionally, at the end of the last day we’ll have a plenary session as a wrap-up for this year's conference.

    The conference will be entirely virtual, offered on a completely free and open software stack.

    We look forward to five days filled with great discussions, and we hope that LPC 2021 will provide once again a creative and productive environment where ideas can be exchanged and problems tackled. Many great ideas have sprung in the past from these meetings, driving innovation in the Linux plumbing layer!

    September 17, 2021 02:43 AM

    September 14, 2021

    Pete Zaitcev: Scalability of a varying degree

    Seen at official site of Qumulo:

    Scale

    Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

    Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

    Google's Chubby paper has a little section on the scalability problem of talking to a cluster over TCP/IP. Basically, at low tens of thousands of connections you're starting to have serious issues with kernel sockets and TIME_WAIT. So maybe that.

    September 14, 2021 08:27 PM

    Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

    Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

    Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

    Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

    Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

    OK, theory is all well and good, but what happens in practice?

    As it turns out, on dual-socket systems, quite a bit.

    With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

    More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:


    1. The RCU CPU stall warning timeout is set to only 21 seconds.
    2. There was absolutely no console output during the full stall duration.
    3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds. (Kudos to Willy Tarreau for pointing out offlist that 2,199.02 seconds is almost exactly 2 to the 41st power worth of nanoseconds. Coincidence? You decide!)
    4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
    5. At the end of the stall, the kernel happily continued, usually with no other complaints.
    These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.

    But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

    However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

    But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

    The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.
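    For illustration, such a check might look like the following minimal sketch, in which do_suspect_work() and the 100-second limit are illustrative stand-ins rather than the actual debug code:

    /* Hypothetical duration check around a suspect code region. */
    unsigned long since = jiffies;

    do_suspect_work();

    /* time_after() copes correctly with jiffies wraparound. */
    WARN_ONCE(time_after(jiffies, since + 100 * HZ),
              "suspect code region ran for more than 100 seconds");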

    Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

    I switched to ktime_get() in order to do cross-CPU time comparisons, and out of sheer paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level of paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)
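    A minimal sketch of such a backwards-time check, again with illustrative names rather than the actual debug code:

    /* Hypothetical cross-check for time moving backwards. */
    ktime_t t_before = ktime_get();

    do_suspect_work();

    ktime_t t_after = ktime_get();

    WARN_ONCE(ktime_before(t_after, t_before),
              "time went backwards by %lld ns",
              ktime_to_ns(ktime_sub(t_before, t_after)));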

    Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

    Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

    An extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.
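    For reference, the states in question are given by the following enum, paraphrased from kernel/stop_machine.c; all participating CPUs must reach a given state before any of them may advance to the next one:

    enum multi_stop_state {
            MULTI_STOP_NONE,        /* Dummy starting state. */
            MULTI_STOP_PREPARE,     /* Awaiting everyone to be scheduled. */
            MULTI_STOP_DISABLE_IRQ, /* Disable interrupts. */
            MULTI_STOP_RUN,         /* Run the stop-machine callback. */
            MULTI_STOP_EXIT,        /* Exit. */
    };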

    Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

    Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace with only one runnable task. Maybe the CPUs need to manually enable their scheduling-clock interrupts when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

    The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.
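    A minimal sketch of this sort of diagnostic IPI, where stall_ipi_handler() and stuck_cpu are illustrative names rather than the actual test code:

    /* Hypothetical diagnostic IPI handler. */
    static void stall_ipi_handler(void *unused)
    {
            pr_info("diagnostic IPI handled on CPU %d\n", smp_processor_id());
    }

    /* Sent from another CPU every 100 seconds of multi_cpu_stop() execution. */
    smp_call_function_single(stuck_cpu, stall_ipi_handler, NULL, 0);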

    This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.
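    The NMI backtrace mentioned above can be requested from another CPU using an in-kernel helper, along these lines (stuck_cpu again being illustrative):

    /* Dump the stuck CPU's stack via NMI, even if it has interrupts disabled. */
    trigger_single_cpu_backtrace(stuck_cpu);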

    Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which at 100,000 interrupts per second corresponds to about one third of the stuck CPU's time (3.5 µs × 100,000/second ≈ 0.35 seconds of handler execution per second). It appears that the other two-thirds was consumed repeatedly entering and exiting the interrupt.

    The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.
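    For context, KTIME_MAX is the tick code's conventional value for “no further events needed”, as in this simplified paraphrase of the logic in kernel/time/tick-oneshot.c (not the exact source):

    int tick_program_event(ktime_t expires, int force)
    {
            struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);

            if (unlikely(expires == KTIME_MAX)) {
                    /* Stop the tick device instead of programming an event. */
                    clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
                    dev->next_event = KTIME_MAX;
                    return 0;
            }

            /* Otherwise, program the device for the requested expiration time. */
            return clockevents_program_event(dev, expires, force);
    }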

    So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

    And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

    ./kvm.sh --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"

    This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

    --- a/kernel/softirq.c
    +++ b/kernel/softirq.c
    @@ -595,7 +595,8 @@ void irq_enter_rcu(void)
     {
            __irq_enter_raw();
     
    -       if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
    +       if (tick_nohz_full_cpu(smp_processor_id()) ||
    +           (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
                    tick_irq_enter();
     
            account_hardirq_enter(current);
    

    This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

    And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

    This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

    So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.

    Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

    All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

    September 14, 2021 04:43 AM

    September 05, 2021

    Brendan Gregg: ZFS Is Mysteriously Eating My CPU

    A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

    ## 1. Problem Statement

    The microservice was for metrics ingestion and had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer.

    ## 2. Monitoring

    I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS.

    ## 3. Next Steps

    I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd cut to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph:

    Note that this sample flame graph is dominated by Java, shown by the green frames.

    ## 4. Flame Graph

    Here's the CPU flame graph from one of the problem instances:
    The kernel CPU time is pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower:
    This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS; they were right! The ZFS Adaptive Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low-memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes), creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes. The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory-freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: why is ZFS in use?

    ## 5. Configuration

    There was only one use of ZFS so far at Netflix that I knew of: a new infrastructure project was using ZFS for containers. This led me to a theory: if they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path. But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer:
    "We aren't using ZFS."
    What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why.

    ## 6. cd & ls

    I should be able to debug this using nothing more than the cd and ls(1) commands: cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted:
    df -h
    mount
    zfs list
    
    This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used?

    ## 7. arcstats

    I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them:
    # cat /proc/spl/kstat/zfs/arcstats
    name                            type data
    hits                            4    0
    misses                          4    0
    demand_data_hits                4    0
    demand_data_misses              4    0
    demand_metadata_hits            4    0
    demand_metadata_misses          4    0
    prefetch_data_hits              4    0
    prefetch_data_misses            4    0
    prefetch_metadata_hits          4    0
    prefetch_metadata_misses        4    0
    mru_hits                        4    0
    mru_ghost_hits                  4    0
    mfu_hits                        4    0
    mfu_ghost_hits                  4    0
    deleted                         4    0
    mutex_miss                      4    0
    evict_skip                      4    0
    evict_not_enough                4    0
    evict_l2_cached                 4    0
    evict_l2_eligible               4    0
    [...]
    
    Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat?? The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery.

    ## 8. Code Analysis

    I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit. The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._ (A small sketch contrasting the two approaches appears after the link list below.) The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator. I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again.

    [Kernel Recipes]: https://youtu.be/UVM3WX8Lq2k?t=133
    [Kernel Recipes2]: https://www.slideshare.net/brendangregg/kernel-recipes-2017-using-linux-perf-at-netflix/2
    [talks]: /index.html#Videos
    [issue #6531]: https://github.com/openzfs/zfs/issues/6531
    [source code]: https://github.com/openzfs/zfs/blob/4e33ba4c389f59b74138bf7130e924a4230d64e9/module/zfs/arc.c
    [Atlas]: https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
    [Vector]: https://netflixtechblog.com/introducing-vector-netflixs-on-host-performance-monitoring-tool-c0d3058c3f6f
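    As promised above, here is a sketch contrasting the expensive selection strategy with cheaper alternatives. This is hypothetical Linux-kernel-style code, not the actual OpenZFS source; num_sublists and the per-CPU counter evict_rotor are illustrative:

    unsigned int idx;

    /*
     * What the flame graph showed: crypto-quality randomness on every
     * eviction pass (hence get_random_bytes() and extract_entropy()).
     */
    get_random_bytes(&idx, sizeof(idx));
    idx %= num_sublists;

    /* Two cheaper alternatives: */
    idx = this_cpu_inc_return(evict_rotor) % num_sublists; /* per-CPU round robin */
    idx = prandom_u32_max(num_sublists);                   /* non-crypto PRNG */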

    September 05, 2021 02:00 PM

    August 29, 2021

    Brendan Gregg: Analyzing a High Rate of Paging

    These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this, including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots.

    ## 1. Problem Statement

    The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads:

    The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone.

    ## 2. iostat

    Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload:
    # iostat -xz 1
    Linux 4.4.0-1072-aws (xxx) 	12/18/2018 	_x86_64_	(16 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               5.03    0.00    0.83    1.94    0.02   92.18
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.29    0.21    0.17     6.29     3.09    49.32     0.00   12.74    6.96   19.87   3.96   0.15
    xvdb              0.00     0.08   44.39    9.98  5507.39  1110.55   243.43     2.28   41.96   41.75   42.88   1.52   8.25
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.81    0.00    1.08   29.94    0.06   54.11
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  745.00    0.00 91656.00     0.00   246.06    25.32   33.84   33.84    0.00   1.35 100.40
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.86    0.00    0.89   24.76    0.06   59.43
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  739.00    0.00 92152.00     0.00   249.40    24.75   33.49   33.49    0.00   1.35 100.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.95    0.00    0.89   28.75    0.06   55.35
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  734.00    0.00 91704.00     0.00   249.87    24.93   34.04   34.04    0.00   1.36 100.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.54    0.00    1.14   29.40    0.06   54.86
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  750.00    0.00 92104.00     0.00   245.61    25.14   33.37   33.37    0.00   1.33 100.00
    
    ^C
    
    I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_await of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). The reads are largeish, roughly 123 Kbytes each (divide rkB/s by r/s: 91656 / 745 ≈ 123; the avgrq-sz column agrees at about 246 512-byte sectors).

    ## 3. biolatency

    From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue:
    # /usr/share/bcc/tools/biolatency -m
    Tracing block device I/O... Hit Ctrl-C to end.
    ^C
         msecs               : count     distribution
             0 -> 1          : 83       |                                        |
             2 -> 3          : 20       |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 41       |                                        |
            16 -> 31         : 1620     |*******                                 |
            32 -> 63         : 8139     |****************************************|
            64 -> 127        : 176      |                                        |
           128 -> 255        : 95       |                                        |
           256 -> 511        : 61       |                                        |
           512 -> 1023       : 93       |                                        |
    
    This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reach the 0.5-to-1.0-second range but, again, there's quite a bit of queueing here, as seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload.

    ## 4. bitesize

    As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd:
    # /usr/share/bcc/tools/bitesize 
    Tracing... Hit Ctrl-C to end.
    ^C
    Process Name = java
         Kbytes              : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 31       |                                        |
            16 -> 31         : 15       |                                        |
            32 -> 63         : 15       |                                        |
            64 -> 127        : 15       |                                        |
           128 -> 255        : 1682     |****************************************|
    
    The I/O is mostly in the 128-to-255-Kbyte bucket, as expected from the iostat(1) output. Nothing odd here.

    ## 5. free

    Also from the 60-second checklist:
    # free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64414       15421         349           5       48643       48409
    Swap:             0           0           0
    
    There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit?

    ## 6. cachestat

    [cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache:
    # /apps/perf-tools/bin/cachestat
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
        1811      632        2    74.1%           17      48009
        1630    15132       92     9.7%           17      48033
        1634    23341       63     6.5%           17      48029
        1851    13599       17    12.0%           17      48019
        1941     3689       33    34.5%           17      48007
        1733    23007      154     7.0%           17      48034
        1195     9566       31    11.1%           17      48011
    [...]
    
    This shows many cache misses, with a hit ratio varying between 6.5% and 74.1% (RATIO is simply HITS / (HITS + MISSES): for the first interval, 1811 / (1811 + 632) ≈ 74.1%). I usually like to see that ratio in the upper 90s. This is "cache busting": the 100-Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100-Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).

    ## 7. Smaller File Test

    For further confirmation, I collected the same outputs for a 32-Gbyte file upload. cachestat shows a ~100% cache hit ratio:
    # /apps/perf-tools/bin/cachestat
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
       61831        0      126   100.0%           41      33680
       53408        0       78   100.0%           41      33680
       65056        0      173   100.0%           41      33680
       65158        0       79   100.0%           41      33680
       55052        0      107   100.0%           41      33680
       61227        0      149   100.0%           41      33681
       58669        0       71   100.0%           41      33681
       33424        0       73   100.0%           41      33681
    ^C
    
    This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. free(1) shows it fitting in the page cache:
    # free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64414       18421       11218           5       34773       45407
    Swap:             0           0           0
    
    And iostat(1) shows little disk I/O, as expected:
    # iostat -xz 1
    Linux 4.4.0-1072-aws (xxx) 	12/19/2018 	_x86_64_	(16 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              12.25    0.00    1.24    0.19    0.03   86.29
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.32    0.31    0.19     7.09     4.85    47.59     0.01   12.58    5.44   23.90   3.09   0.15
    xvdb              0.00     0.07    0.01   11.13     0.10  1264.35   227.09     0.91   82.16    3.49   82.20   2.80   3.11
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              57.43    0.00    2.95    0.00    0.00   39.62
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              53.50    0.00    2.32    0.00    0.00   44.18
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00    0.00    2.00     0.00    19.50    19.50     0.00    0.00    0.00    0.00   0.00   0.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              39.02    0.00    2.14    0.00    0.00   58.84
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    [...]
    
    ## 8. Final Notes

    cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working.

    [cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
    [60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
    [bcc]: https://github.com/iovisor/bcc

    August 29, 2021 02:00 PM

    August 27, 2021

    Michael Kerrisk (manpages): man-pages-5.13 released

    Alex Colomar and I have released released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

    The most notable of the changes in man-pages-5.13 are the following:

    Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

    August 27, 2021 09:07 PM

    August 26, 2021

    Brendan Gregg: Slack's Secret STDERR Messages

    These are rough notes. I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing. These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included. Note that I don't know anything about Slack internals, and there may be better ways to solve this.

    ## 1. Enabling core dumps

    I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root:

    # cat /proc/sys/kernel/core_pattern
    |/usr/share/apport/apport %p %s %c %d %P
    # mkdir /var/cores
    # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
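    # core(5) format specifiers: %e = executable name, %p = PID, %h = hostname, %t = time of dump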
    [...another crash...]
    # ls /var/cores
    #
    
    This didn't work: no core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first.

    ## 2. exitsnoop

    Using an eBPF/bcc tool to look for exit reasons:
    # exitsnoop -t
    TIME-AEST    PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE 
    13:51:19.432 kworker/dying    3663305 2      3663305 1241.59 0
    13:51:30.948 kworker/dying    3663626 2      3663626 835.76  0
    13:51:33.296 systemd-udevd    3664149 2054939 3664149 3.55    0
    13:53:09.256 kworker/dying    3662179 2      3662179 2681.15 0
    13:53:25.636 kworker/dying    3663520 2      3663520 1122.60 0
    13:53:30.705 grep             3664239 6009   3664239 0.08    0
    13:53:30.705 ps               3664238 6009   3664238 0.08    0
    13:53:40.297 slack            3663135 1786   3663135 1459.54 signal 6 (ABRT)
    13:53:40.298 slack            3663208 3663140 3663208 1457.86 0
    13:53:40.302 slack            3663140 1786   3663140 1459.18 0
    13:53:40.302 slack            3663139 1786   3663139 1459.18 0
    13:53:40.303 slack            3663171 1786   3663171 1458.22 0
    13:53:40.317 slack            3663197 1786   3663197 1458.03 0
    13:53:44.827 gdm-session-wor  3664269 1778   3664269 0.02    0
    [...]
    
    This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead.

    ## 3. killsnoop

    Running killsnoop (eBPF/bcc) to get more info:
    # killsnoop
    TIME      PID    COMM             SIG  TPID   RESULT
    13:45:01  2053366 systemd-journal  0    1024   0
    13:45:01  2053366 systemd-journal  0    3663525 -3
    13:45:01  2053366 systemd-journal  0    3663528 -3
    13:49:00  2054939 systemd-udevd    15   3664053 0
    13:51:33  2054939 systemd-udevd    15   3664149 0
    13:53:44  2053366 systemd-journal  0    4265   -1
    13:53:44  2053366 systemd-journal  0    972    0
    13:53:44  2053366 systemd-journal  0    1778   0
    13:53:44  2053366 systemd-journal  0    6414   -1
    [...]
    
    A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet, so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem.

    ## 4. perf trace

    Nowadays this can be done as a perf one-liner:
    # perf list syscalls:sys_enter_*kill
    
    List of pre-defined events (to be used in -e):
    
      syscalls:sys_enter_kill                            [Tracepoint event]
      syscalls:sys_enter_tgkill                          [Tracepoint event]
      syscalls:sys_enter_tkill                           [Tracepoint event]
    
    # perf trace -e 'syscalls:sys_enter_*kill'
     15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)
    
    Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it, though. I want to see a stack trace. I could use perf record or bpftrace... and that reminds me, didn't I write another signal tool using bpftrace?

    ## 5. signals.bt

    The signals.bt bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out:
    # bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt
    Attaching 3 probes...
    Counting signals. Hit Ctrl-C to end.
    ^C
    @[SIGNAL, PID, COMM] = COUNT
    
    @[SIGPIPE, 1883, Xorg]: 1
    @[SIGCHLD, 1797, dbus-daemon]: 1
    @[SIGINT, 3665167, bpftrace]: 1
    @[SIGTERM, 3665198, gdm-session-wor]: 1
    @[SIGCHLD, 3665197, gdm-session-wor]: 1
    @[SIGABRT, 3664940, slack]: 1
    @[SIGTERM, 3665197, gdm-session-wor]: 1
    @[SIGKILL, 3665207, dbus-daemon]: 1
    @[SIGWINCH, 859450, bash]: 2
    @[SIGCHLD, 1778, gdm-session-wor]: 2
    @[, 3665201, gdbus]: 2
    @[, 3665199, gmain]: 2
    @[SIGWINCH, 3665167, bpftrace]: 2
    @[SIGWINCH, 3663319, vi]: 2
    @[SIGCHLD, 1786, systemd]: 6
    @[SIGALRM, 1883, Xorg]: 106
    
    Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.)

    ## 6. Signal Stacks

    Moving from signals.bt to a bpftrace one-liner:
    # bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }'
    Attaching 1 probe...
    6, 
            __send_signal+579
            __send_signal+579
            send_signal+221
            do_send_sig_info+81
            do_send_specific+110
            do_tkill+171
            __x64_sys_tgkill+34
            do_syscall_64+73
            entry_SYSCALL_64_after_hwframe+68
    
            0x7f4a2e2e2f95
    
    This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There are ways to fix this, but it usually gets time consuming, so let me try something else first. Logs!

    ## 7. Logs

    Does Slack have logs? I have no idea. Maybe they contain the error message.
    # lsof -p `pgrep -n slack` | grep -i log
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
          Output information may be incomplete.
    
    Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one.
    # pstree -ps `pgrep -n slack`
    systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549)
                                                                                  ├─{slack}(3666551)
                                                                                  ├─{slack}(3666552)
                                                                                  ├─{slack}(3666553)
                                                                                  ├─{slack}(3666554)
                                                                                  ├─{slack}(3666555)
                                                                                  ├─{slack}(3666556)
                                                                                  ├─{slack}(3666557)
                                                                                  ├─{slack}(3666558)
                                                                                  ├─{slack}(3666559)
                                                                                  ├─{slack}(3666560)
                                                                                  ├─{slack}(3666564)
                                                                                  ├─{slack}(3666566)
                                                                                  ├─{slack}(3666568)
                                                                                  └─{slack}(3666609)
    
    Ok, how about I try the great-grandparent, PID 3666477:
    # lsof -p 3666477 | grep -i log
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
          Output information may be incomplete.
    slack   3666477 bgregg   37r    REG      253,1     32768   140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log
    slack   3666477 bgregg   40r    REG      253,1     32768   131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log
    slack   3666477 bgregg   71w    REG      253,1     15566  1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log
    slack   3666477 bgregg   72w    REG      253,1       347  1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG
    slack   3666477 bgregg   73w    REG      253,1   2324236  1718407 /home/bgregg/.config/Slack/logs/browser.log
    slack   3666477 bgregg   90w    REG      253,1    363600  1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log
    slack   3666477 bgregg   92w    REG      253,1       274  1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG
    slack   3666477 bgregg  108w    REG      253,1   4182513  1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log
    slack   3666477 bgregg  116w    REG      253,1       259  1704369 /home/bgregg/.config/Slack/Session Storage/LOG
    slack   3666477 bgregg  122w    REG      253,1     31536  1749325 /home/bgregg/.config/Slack/Session Storage/000036.log
    slack   3666477 bgregg  126w    REG      253,1   3970909  1704566 /home/bgregg/.config/Slack/logs/webapp-console.log
    slack   3666477 bgregg  127w    REG      253,1   2330006  1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log
    slack   3666477 bgregg  131w    REG      253,1       330  1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG
    slack   3666477 bgregg  640r    REG      253,1     32768   140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted)
    
    Lots of logs in ~/.config/Slack/logs!
    # cd ~/.config/Slack/logs
    # ls -lrth
    total 26M
    -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log
    -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log
    -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log
    drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/
    -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log
    -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log
    -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log
    
    Ok, so how about this one:
    # cat webapp-console.log
    [...]
    [08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED 
    [08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED 
    [08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L 
    [08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L 
    [08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455 
    [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455 
    [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred 
    [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms 
    [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms 
    [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events 
    [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now 
    [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms 
    [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms 
    
    No, I don't see any errors jumping out at me. How about searching for errors:
    # egrep -i 'error|fail' webapp-console.log
    [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [...]
    # egrep -i 'error|fail' browser.log 
    [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/16/21, 08:18:27:653] error: Failed to load empty window url in window 
      "error": {
        "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n    at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n    at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n    at Object.emit (events.js:315:20)\n    at Object.EventEmitter.emit (domain.js:467:12)",
    [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel 
      "error": {
        "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n    at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n    at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n    at Module.load (internal/modules/cjs/loader.js:935:32)\n    at Module._load (internal/modules/cjs/loader.js:776:14)\n    at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n    at Module.require (internal/modules/cjs/loader.js:959:19)\n    at require (internal/modules/cjs/helpers.js:88:18)\n    at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n    at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)",
    [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links 
    [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag 
    [...]
    
    I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR...

    ## 8. STDERR Tracing

    Where is STDERR written?
    # lsof -p 3666477
    [...]
    slack   3666477 bgregg  mem     REG               7,16    141930     7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak
    slack   3666477 bgregg  mem     REG               7,16    165680     7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin
    slack   3666477 bgregg    0r    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    1w    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    2w    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    3r   FIFO               0,12       0t0 29532192 pipe
    slack   3666477 bgregg    4u   unix 0x00000000134e3c45       0t0 29526717 type=SEQPACKET
    slack   3666477 bgregg    5r    REG               7,16  10413488     7167 /snap/slack/44/usr/lib/slack/icudtl.dat
    [...]
    
    /dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that:
    # shellsnoop 3666477
    [...]
    [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred 
    [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms 
    [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms 
    [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events 
    [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now 
    [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms 
    
    
    [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms 
    
    (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme.
    This may indicate that pixbuf loaders or the mime database could not be found.
    **
    Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)
    
    Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final message before crashing, this is a smoking gun. Note that this was not in any log:
    # grep image-missing.png *
    grep: recorded-trace: Is a directory
    
    It's the .so file that is missing, not the .png:
    # ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png
    -rw-r--r-- 1 root root 535 Nov  6  2020 /usr/share/icons/Yaru/16x16/status/image-missing.png
    # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory
    
    But there is a .so file with a similar path:
    # ls -lh /snap/slack/
    total 0
    drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
    drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
    lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
    # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    
    Hmm, I wonder...

    ## 9. Workaround

    This is obviously a hack and is not guaranteed to be safe:
    # cd /snap/slack
    # ln -s current 42
    # ls -lh
    total 0
    lrwxrwxrwx 1 root root   7 Aug 27 15:01 42 -> current/
    drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
    drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
    lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
    # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    
    I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing! I'm guessing this is a problem with how the snap is built. Needs more debugging.

    ## 10. Other debugging

    In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed.

    ## 11. Bonus: opensnoop

    I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens:
    # opensnoop -Tx
    TIME(s)       PID    COMM       FD ERR PATH
    [...]
    11.412358000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/subclasses
    11.412360000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/icons
    11.412363000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/generic-icons
    11.412495000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    11.412527000  3673057 slack      -1   2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo
    11.412537000  3673057 slack      -1   2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo
    11.412559000  3673057 slack      -1   2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo
    11.412916000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    11.425405000  1786   systemd    -1   2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events
    
    That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).

    August 26, 2021 02:00 PM

    Linux Plumbers Conference: BOFs Call for Proposals Now Open

    We have formally opened the CfP for Birds of a Feather. Select the BOFs track when submitting a BOF here.

    As a reminder:

    August 26, 2021 12:06 AM

    August 19, 2021

    Linux Plumbers Conference: Diversity, Equity & Inclusion Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Diversity, Equity & Inclusion Microconference has been accepted into the 2021 Linux Plumbers Conference.

    Creating diverse communities requires effort and commitment to creating inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in the Linux 5.8 release. Understanding whether this sort of change has been effective is a topic of active research. This MC will take the pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with perspectives from Linux community members. This microconference will build on what was started at the LPC 2020 BoF session on Improving Diversity.

    This year’s topics to be discussed include:

    Come and join us in the discussion of how we can improve the diversity of the Linux Kernel community and help keep it vibrant for the next 30 years!

    We hope to see you there.

    August 19, 2021 04:35 PM

    August 16, 2021

    Linux Plumbers Conference: GPU/media/AI buffer management and interop Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the GPU/media/AI buffer management and interop Microconference has been accepted into the 2021 Linux Plumbers Conference. The Linux GPU subsystem has long had three major tenets:

    Forthcoming hardware makes the first two difficult, if not impossible, to achieve. In order to give user space the fastest possible path to support modern complex workloads, forthcoming hardware is removing the notion of a small number of kernel-controlled job queues, replacing it with direct user space access to command queues to submit and control their own jobs.

    This, and evolution in the Vulkan API, make it difficult to retain the existing implicit synchronization model, where the kernel tracks all access and ensures that the hardware executes jobs in the order of user space submission, so that multiple independent clients can reuse the same buffers without data hazards. As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community.

    This year’s topics to be discussed include:

    Come and join us in the discussion of keeping Linux a first-class citizen in the world of graphics and media.

    We hope to see you there.

    August 16, 2021 06:51 PM

    August 04, 2021

    Linux Plumbers Conference: Android Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Android Microconference has been accepted into the 2021 Linux Plumbers Conference. Past Android microconferences were centered around being primarily a synchronization point between the Android kernel team and the rest of the community, informing the community of what the Android team has been doing. With the help of last year’s focus on the Generic Kernel Image[1] (GKI), this year’s Android microconference will instead be an opportunity to foster a higher level of collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that Android and Linux development move in lockstep going forward.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion of making Android a better partner with Linux.

    We hope to see you there.

    August 04, 2021 07:40 PM

    Dave Airlie (blogspot): crocus misrendering of the week

     I've been chasing a crocus misrendering bug shown in a Qt trace.


    The bottom image is crocus, vs 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing jumped out at me, except that the vertex shaders produced were different.

    However, they were different for many reasons, due to the different optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places, so there was a lot of noise in the shaders.

    I ported the trace to a piglit GL application so I could hack on the shaders and GL more easily; with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

    Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I could then finally spot the difference in the NIR shaders.

    What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization that mesa/st calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

    In a parallel thread on another part of the planet, Ian Romanick filed an MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However, his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

    It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

    Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

    August 04, 2021 08:04 AM

    July 29, 2021

    Linux Plumbers Conference: System Boot and Security Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the System Boot and Security Microconference has been accepted into the 2021 Linux Plumbers Conference. This microconference brings together those who are interested in firmware, bootloaders, system boot, and security. The events around last year’s BootHole showed how crucial platform initialization is for overall system security. Those events may have shown the shortcomings in the current boot process, but they have also tightened the cooperation between various companies and organizations. Now is the time to use this opportunity to discuss the lessons learned and what can be done to improve in the future. Other cooperation discussions are also welcome, such as those based on legal and organizational issues which may hinder working together.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion about how to keep your system secure even at bootup.

    We hope to see you there.

    July 29, 2021 07:10 PM

    July 28, 2021

    Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Kernel Dependability and Assurance Microconference has been accepted into the 2021 Linux Plumbers Conference.

    Linux development is producing kernels at an ever increasing rate, and at the same time with arguably increasing software quality. The process of kernel development has been adapting to handle the increasing number of contributors over the years, to ensure sufficient software quality. This quality is key in that Linux is now being used in applications that require a high degree of trust that the kernel is going to behave as expected. Some of the key areas where we're seeing Linux start to be used are medical devices, civil infrastructure, caregiving robots, and automotive systems.

    Last year’s miniconference raised awareness about this topic with the wider community. Since then the ELISA team has made contributions to documentation and tools. The team has deployed a CI server that runs static analysis tools and syzkaller on the Linux kernel repos, and is making the results of the last 10 days of linux-next available to the community.

    This year’s topics to be discussed include:

    Come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

    We hope to see you there.

    July 28, 2021 02:28 PM

    July 27, 2021

    Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVIII: Preventing Involuntary Generosity

    I recently learned that all that is required for someone to take out a loan in some random USA citizen's name is that citizen's full name, postal address, email address, date of birth, and social security number. If you are above a certain age, all of these are for all intents and purposes a matter of public record. If you are younger, then your social security number is of course supposed to be secret—and it will be, right up to that data breach that makes it available to all the wrong people.

    This sort of thing can of course be a bit annoying to our involuntarily generous USA citizen. Some unknown person out there gets a fancy toy, and our citizen gets some bank's dunning notices. Fortunately, there are quite a few things you can do, although I will not try to reproduce the entirety of the volumes of good advice that are available out there. Especially given that laws, processes, and procedures are all subject to change.

    But at present, one important way to prevent this is to put a hold on your credit information through any of Experian, Equifax, or TransUnion. I strongly suggest that you save yourself considerable time and hassle by doing this, which is free of charge for a no-questions-asked one-year hold. Taking this step is especially important if you are among the all too many of us whose finances don't have much slack, as was the case with my family back when my children were small. After all, it is one thing to have to deal with a few hassles in the form of phone calls, email, and paperwork, but it is quite another if you and your loved ones end up missing meals. Thankfully, it never came to that for my family, although one of my children did complain bitterly to a medical professional about the woefully insufficient stores of candy in our house.

    Of course, I also have some advice for the vendor, retailer, digital-finance company, and bank that were involved in my case of attempted involuntary generosity:


    1. Put just a little more effort into determining who you are really doing business with.
    2. If the toy contains a computer and connects to the internet, consider the option of directing your dunning notices through the toy rather than to the email and phone of your involuntarily generous USA citizen.
    3. A loan application for a toy that is shipped to a non-residential address should be viewed with great suspicion.
    4. In fact, such a loan application should be viewed with some suspicion even if the addresses match. Porch pirates and all that.
    5. If the toy is of a type that must connect to the internet to do anything useful, you have an easy method of dealing with non-payment, don't you?

    I should hasten to add that after only a little discussion, these companies have thus far proven quite cooperative in my particular case, which is why they are going nameless.

    Longer term, it is hard to be optimistic, especially given advances in various easy-to-abuse areas of information technology. In the meantime, I respectfully suggest that you learn from my experience and put a hold on your credit information!

    July 27, 2021 03:12 AM

    July 26, 2021

    Linux Plumbers Conference: RISC-V Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the RISC-V Microconference has been accepted into the 2021 Linux Plumbers Conference. The RISC-V software ecosystem is gaining momentum at breakneck speed, with three new Linux development platforms available this year. The new platforms bring new issues to deal with.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

    We hope to see you there.

    July 26, 2021 09:26 PM

    July 22, 2021

    Pete Zaitcev: MinIO liberates your storage from rebalancing

    MinIO posted a blog entry a few days ago where they bragged about adding capacity without a need to rebalance.

    First, they went into a full marketoid mode, whipping up the fear:

    Rebalancing a massive distributed storage system can be a nightmare. There’s nothing worse than adding a storage node and watching helplessly as user response time increases while the system taxes its own resources rebalancing to include the new node.

    Seems like MinIO folks assume that operators of distributed storage such as Swift and Ceph have no tools to regulate the resource consumption of rebalancing. So they have no choice but to "wait helplessly". Very funny.

    But it gets worse when obviously senseless statements are made:

    Rebalancing doesn’t just affect performance - moving many objects between many nodes across a network can be risky. Devices and components fail and that often leads to data loss or corruption.

    Often, man! Also, a commit protocol? Never heard of her!

    Then, we talk about some unrelated matters:

    A group of drives is an erasure set and MinIO uses a Reed-Solomon algorithm to split objects into data and parity blocks based on the size of the erasure set and then uniformly distributes them across all of the drives in the erasure such that each drive in the set contains no more than one block per object.

    Understood, your erasure set is what we call "partition" in Swift or a placement group in Ceph.

    Finally, we get to the matter at hand:

    To enable rapid growth, MinIO scales by adding server pools and erasure sets. If we had built MinIO to allow you to add a drive or even a single hardware node to an existing server pool, then you would have to suffer through rebalancing.

    MinIO scales up quickly by adding server pools, each an independent set of compute, network and storage resources.

    Add hardware, run MinIO server to create and name server processes, then update MinIO with the name of the new server pool. MinIO leaves existing data in their original server pools while exposing the new server pools to incoming data.

    My hot take on social media was: "Placing new sets on new storage impacts utilization and risks hotspotting because of time affinity. There's no free lunch." Even on second thought, I think that is about right. But let us not ignore the cost of the data movement associated with rebalancing. What if the operator wants to implement in Swift what the MinIO blog post talks about?

    It is possible to emulate MinIO, to an extent. Some operators add a new storage policy when they expand the cluster, configure all the new nodes and/or volumes in its ring, then make it the default, so newly-created objects end up on the new hardware. This accomplishes the same goals that MinIO outlines above, but it's a kludge: Swift was not intended for this originally, and it shows. In particular, storage policies were intended for low numbers of storage classes, such as rotating media and SSD, or Silver/Gold/Platinum. Once you make a new policy for each new forklift visit, you run a risk of hitting scalability issues. Well, most clusters only upgrade a few times over their lifetime, but potentially it's a problem. Also, policies are customer-visible; they are intended to be.
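
    To make that concrete, here is a minimal sketch of the storage-policy kludge (the policy index and name, IP, device, and weight are illustrative, not from the post):

    # In /etc/swift/swift.conf, declare the new policy and make it the default:
    #   [storage-policy:1]
    #   name = expansion-2021
    #   default = yes
    # Then build a ring for policy 1 that contains only the new hardware:
    swift-ring-builder object-1.builder create 10 3 1
    swift-ring-builder object-1.builder add r1z1-203.0.113.10:6200/sdb 100
    swift-ring-builder object-1.builder rebalance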

    In the end, I still think that a balanced cluster is the way to go. Just think rationally about it.

    Interestingly, the reverse emulation appears not to be possible for MinIO: if you wanted to rebalance your storage, you would not be able to. Or at least the blog post above says: "If we had built MinIO to allow you to add a drive or ... a node to an existing server pool". I take it to mean that they don't allow it, and the blog post is very much a case of sour grapes, then.

    July 22, 2021 11:08 PM

    Linux Plumbers Conference: Open Printing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Open Printing Microconference has been accepted into the 2021 Linux Plumbers Conference. Over the years OpenPrinting has been actively working on improving and modernizing the way we print in Linux. We have been working on multiple areas of printing and scanning. Driverless print and scan technologies in particular have helped the world do away with a lot of the hassle involved in finding and installing the correct driver. Users can now just plug in their printer and do what they need.

    Based on the discussions that we had last year, we have been able to achieve the following:

    – Significant progress in deciding on the structure of PAPPL, a framework/library for developing Printer Applications as a replacement for Printer Drivers.

    – Progress on LPrint, a Label Printer Application implementing printing for a variety of common label and receipt printers connected via network or USB.

    – Helped give shape to the Printer Application concept. Sample printer applications for HP PCL printers have been created that use PAPPL to support IPP printing from multiple operating systems. This prototype will help others looking to adopt this new concept of Printer Applications. The first production Printer Application started from this prototype is the PostScript Printer Application.

    Development is in continuous progress; see the state of the art in OpenPrinting’s monthly news posts[6].

    This year’s topics to be discussed include:

    Come join us and participate in the discussion on bringing a better experience to Linux printing, scanning, and fax.

    We hope to see you there.

    July 22, 2021 04:21 PM

    July 21, 2021

    Dave Airlie (blogspot): llvmpipe/lavapipe: anisotropic texture filtering

    The last missing feature llvmpipe needs in order to expose OpenGL 4.6 is anisotropic texture filtering. Adding support for this also allows lavapipe to expose the Vulkan samplerAnisotropy feature.

    I started writing anisotropic support more than six months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out that it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

    I had also considered porting SwiftShader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

    Porting the code to llvmpipe means rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working, it passed the GL CTS but failed the VK CTS. The results also looked worse to my eye than I'd have thought acceptable, and softpipe seemed to be as bad.

    Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially where the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

    The softpipe code was based on an implementation of an Elliptical Weighted Average Filter (EWA). The paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter" described this. I sat down with the paper and softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.

    I then ported the same fix to my llvmpipe code, and VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed in [2] today.

    For GL 4.6 there are still some fixes needed in other areas.

    [1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

    [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

    July 21, 2021 01:07 AM

    July 18, 2021

    Linux Plumbers Conference: GNU Tools Track Added to Linux Plumbers Conference 2021

    We are very excited to announce that for 2021 our friends from the GNU Toolchain will once again join the Linux Plumbers Conference with an additional track: the GNU Tools track. The track will run for all 5 days of the conference.
    For more information about what types of proposals are accepted, please see the GNU Tools track wiki page.
    The call for papers is now open and will close on August 31, 2021. To submit a proposal, please go to our CFP page and select the GNU Tools Track.

     

    July 18, 2021 07:49 PM

    July 16, 2021

    Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the VFIO/IOMMU/PCI Microconference has been accepted into the 2021 Linux Plumbers Conference. Today’s high-speed components commonly utilize devices that implement the PCI interconnect specification, together with system IOMMUs that provide memory and access control between the devices and system resources. This domain is constantly gaining new features, such as:

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion on helping Linux keep up with the new features being added to the PCI interconnect specification.

    We hope to see you there.

    July 16, 2021 04:15 PM

    July 13, 2021

    Matthew Garrett: Does free software benefit from ML models being derived works of training data?

    Github recently announced Copilot, a machine learning system that makes suggestions for you when you're writing code. It's apparently trained on all public code hosted on Github, which means there's a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.

    Unsurprisingly, this has led to a number of questions along the lines of "If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?". This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work and not abiding by the terms of the GPL is.

    But free software licenses only have power to the extent that copyright permits them to. If your code isn't a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot's output doesn't count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don't apply to the output. Some people have interpreted this as an attack on free software - Copilot may insert code that's either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.

    I'm completely unqualified to hold a strong opinion on whether Github's legal position is justifiable or not, and right now I'm also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn't? Having been involved in a bunch of GPL enforcement activities, it's very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that's shifted over the past few days.

    Let's look at the GNU manifesto, specifically this section:

    The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.

    The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

    The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle's argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple's copyright, we might be living in a future where we had no free software desktop environments.

    When we argue for an interpretation of copyright law that enhances the power of the GPL, we're also enhancing the power of giant corporations with a lot of lawyers on hand. So let's look at this another way. If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won't enter the commons, but the ideas it embodies will. No more worries about whether you're literally copying the code that implements an algorithm you want to duplicate - simply start typing and let the model remove the risk for you.

    There's a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It's not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren't under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don't know how to measure that. But what I do know is that free software was founded in a belief that software shouldn't be constrained by copyright, and our default stance shouldn't be to argue against the idea that copyright is weaker than we imagined.

    (Edit: this post by Julia Reda makes some of the same arguments, but spends some more time focusing on a legal analysis of why having copyright cover the output of Copilot would be a problem)


    July 13, 2021 08:09 AM

    July 12, 2021

    Linux Plumbers Conference: File system Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the File System Microconference has been accepted into the 2021 Linux Plumbers Conference. File systems are key to any operating system, and especially for the Linux kernel. They are the gateway to the underlying storage, or could simply live in RAM as a virtual information repository. The file system developers are constantly adding features and improvements. Some of these new features are slow to be utilized by application developers, or they may be used in interesting ways that the file system developers never thought of.

    This year’s topics to be discussed include:

    These are big ongoing projects that have implications across all file systems as well as users, and would be good to discuss across a large portion of attendees.

    Come and join us in the discussion of improving the state of saving, reading, and accessing your data.

    We hope to see you there.

    July 12, 2021 10:49 PM

    July 09, 2021

    Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Testing and Fuzzing Microconference has been accepted into the 2021 Linux Plumbers Conference. In spite of the huge number of products shipping with the Linux kernel which are being thoroughly tested by OEMs and distribution providers, there is still no enforced quality standard upstream. How can we make best use of all the publicly available infrastructure and test frameworks in order to fill this gap? Testing and fuzzing upstream as well as gathering results from products is crucial to keeping a project that has over 5,000 commits every month stable for all to use.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion of keeping Linux at the best quality it can be.

    We hope to see you there.

    July 09, 2021 08:23 PM

    July 04, 2021

    Brendan Gregg: USENIX LISA2021 Computing Performance: On the Horizon

    It's an exciting time for developments in computer performance, not just for the BPF technology (which I often [write about]) but also for processors with 3D stacking and cloud vendor CPUs (e.g., AWS Graviton2); for memory with the arrival of DDR5 and High Bandwidth Memory (HBM) on-processor; for storage including new uses for 3D XPoint as a 3D NAND accelerator; for networking with the rise of QUIC and eXpress Data Path (XDP); and so on. I summarized these topics and more as a plenary conference talk, including my own predictions (as a senior performance engineer) for the future of computing performance, with a focus on back-end servers. The video is on [youtube]:

    The slides are on [slideshare] or as a [PDF]:
    I work on many areas of performance, but recently I've had a lot of demand to talk about BPF. This was a chance to talk about other things I've been working on, such as the present and future of hardware performance. I also wrote about these topics in detail for my recent [Systems Performance 2nd Edition] book. Note that my predictions in this talk may be wrong, but they should be thought-provoking. I hope you enjoy it!

    ## References

    I've reproduced the talk references below, so you can click on links:

    - [Gregg 08] Brendan Gregg, “ZFS L2ARC,” http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html, Jul 2008
    - [Gregg 10] Brendan Gregg, “Visualizations for Performance Analysis (and More),” https://www.usenix.org/conference/lisa10/visualizations-performance-analysis-and-more, 2010
    - [Greenberg 11] Marc Greenberg, “DDR4: Double the speed, double the latency? Make sure your system can handle next-generation DRAM,” https://www.chipestimate.com/DDR4-Double-the-speed-double-the-latencyMake-sure-your-system-can-handle-next-generation-DRAM/Cadence/Technical-Article/2011/11/22, Nov 2011
    - [Hruska 12] Joel Hruska, “The future of CPU scaling: Exploring options on the cutting edge,” https://www.extremetech.com/computing/184946-14nm-7nm-5nm-how-low-can-cmos-go-it-depends-if-you-ask-the-engineers-or-the-economists, Feb 2012
    - [Gregg 13] Brendan Gregg, “Blazing Performance with Flame Graphs,” https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg, 2013
    - [Shimpi 13] Anand Lal Shimpi, “Seagate to Ship 5TB HDD in 2014 using Shingled Magnetic Recording,” https://www.anandtech.com/show/7290/seagate-to-ship-5tb-hdd-in-2014-using-shingled-magnetic-recording, Sep 2013
    - [Borkmann 14] Daniel Borkmann, “net: tcp: add DCTCP congestion control algorithm,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce, 2014
    - [Macri 15] Joe Macri, “Introducing HBM,” https://www.amd.com/en/technologies/hbm, Jul 2015
    - [Cardwell 16] Neal Cardwell, et al., “BBR: Congestion-Based Congestion Control,” https://queue.acm.org/detail.cfm?id=3022184, 2016
    - [Gregg 16] Brendan Gregg, “Unikernel Profiling: Flame Graphs from dom0,” http://www.brendangregg.com/blog/2016-01-27/unikernel-profiling-from-dom0.html, Jan 2016
    - [Gregg 16b] Brendan Gregg, “Linux 4.X Tracing Tools: Using BPF Superpowers,” https://www.usenix.org/conference/lisa16/conference-program/presentation/linux-4x-tracing-tools-using-bpf-superpowers, 2016
    - [Alcorn 17] Paul Alcorn, “Seagate To Double HDD Speed With Multi-Actuator Technology,” https://www.tomshardware.com/news/hdd-multi-actuator-heads-seagate,36132.html, 2017
    - [Alcorn 17b] Paul Alcorn, “Hot Chips 2017: Intel Deep Dives Into EMIB,” https://www.tomshardware.com/news/intel-emib-interconnect-fpga-chiplet,35316.html#xenforo-comments-3112212, 2017
    - [Corbet 17] Jonathan Corbet, “Two new block I/O schedulers for 4.12,” https://lwn.net/Articles/720675, Apr 2017
    - [Gregg 17] Brendan Gregg, “AWS EC2 Virtualization 2017: Introducing Nitro,” http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html, Nov 2017
    - [Russinovich 17] Mark Russinovich, “Inside the Microsoft FPGA-based configurable cloud,” https://www.microsoft.com/en-us/research/video/inside-microsoft-fpga-based-configurable-cloud, 2017
    - [Gregg 18] Brendan Gregg, “Linux Performance 2018,” http://www.brendangregg.com/Slides/Percona2018_Linux_Performance.pdf, 2018
    - [Hady 18] Frank Hady, “Achieve Consistent Low Latency for Your Storage-Intensive Workloads,” https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/low-latency-for-storage-intensive-workloads-article-brief.html, 2018
    - [Joshi 18] Amit Joshi, et al., “Titus, the Netflix container management platform, is now open source,” https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436, Apr 2018
    - [Cutress 19] Dr. Ian Cutress, “Xilinx Announces World Largest FPGA: Virtex Ultrascale+ VU19P with 9m Cells,” https://www.anandtech.com/show/14798/xilinx-announces-world-largest-fpga-virtex-ultrascale-vu19p-with-9m-cells, Aug 2019
    - [Gallatin 19] Drew Gallatin, “Kernel TLS and hardware TLS offload in FreeBSD 13,” https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf, 2019
    - [Redestad 19] Claes Redestad, Staffan Friberg, Aleksey Shipilev, “JEP 230: Microbenchmark Suite,” http://openjdk.java.net/jeps/230, updated 2019
    - [Bearman 20] Ian Bearman, “Exploring Profile Guided Optimization of the Linux Kernel,” https://linuxplumbersconf.org/event/7/contributions/771, 2020
    - [Burnes 20] Andrew Burnes, “GeForce RTX 30 Series Graphics Cards: The Ultimate Play,” https://www.nvidia.com/en-us/geforce/news/introducing-rtx-30-series-graphics-cards, Sep 2020
    - [Charlene 20] Charlene, “800G Is Coming: Set Pace to More Higher Speed Applications,” https://community.fs.com/blog/800-gigabit-ethernet-and-optics.html, May 2020
    - [Cutress 20] Dr. Ian Cutress, “Insights into DDR5 Sub-timings and Latencies,” https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies, Oct 2020
    - [Ford 20] A. Ford, et al., “TCP Extensions for Multipath Operation with Multiple Addresses,” https://datatracker.ietf.org/doc/html/rfc8684, Mar 2020
    - [Gregg 20] Brendan Gregg, “Systems Performance: Enterprise and the Cloud, Second Edition,” Addison-Wesley, 2020
    - [Hruska 20] Joel Hruska, “Intel Demos PCIe 5.0 on Upcoming Sapphire Rapids CPUs,” https://www.extremetech.com/computing/316257-intel-demos-pcie-5-0-on-upcoming-sapphire-rapids-cpus, Oct 2020
    - [Liu 20] Linda Liu, “Samsung QVO vs EVO vs PRO: What’s the Difference? [Clone Disk],” https://www.partitionwizard.com/clone-disk/samsung-qvo-vs-evo.html, 2020
    - [Moore 20] Samuel K. Moore, “A Better Way to Measure Progress in Semiconductors,” https://spectrum.ieee.org/semiconductors/devices/a-better-way-to-measure-progress-in-semiconductors, Jul 2020
    - [Peterson 20] Zachariah Peterson, “DDR5 vs. DDR6: Here's What to Expect in RAM Modules,” https://resources.altium.com/p/ddr5-vs-ddr6-heres-what-expect-ram-modules, Nov 2020
    - [Salter 20] Jim Salter, “Western Digital releases new 18TB, 20TB EAMR drives,” https://arstechnica.com/gadgets/2020/07/western-digital-releases-new-18tb-20tb-eamr-drives, Jul 2020
    - [Spier 20] Martin Spier, Brendan Gregg, et al., “FlameScope,” https://github.com/Netflix/flamescope, 2020
    - [Tolvanen 20] Sami Tolvanen, Bill Wendling, and Nick Desaulniers, “LTO, PGO, and AutoFDO in the Kernel,” Linux Plumber’s Conference, https://linuxplumbersconf.org/event/7/contributions/798, 2020
    - [Vega 20] Juan Camilo Vega, Marco Antonio Merlini, Paul Chow, “FFShark: A 100G FPGA Implementation of BPF Filtering for Wireshark,” IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020
    - [Warren 20] Tom Warren, “Microsoft reportedly designing its own ARM-based chips for servers and Surface PCs,” https://www.theverge.com/2020/12/18/22189450/microsoft-arm-processors-chips-servers-surface-report, Dec 2020
    - [Google 21] Google, “Cloud TPU,” https://cloud.google.com/tpu, 2021
    - [Haken 21] Michael Haken, et al., “Delta Lake 1S Server Design Specification 1v05,” https://www.opencompute.org/documents/delta-lake-1s-server-design-specification-1v05-pdf, 2021
    - [Intel 21] Intel Corporation, “Intel® Optane™ Technology,” https://www.intel.com/content/www/us/en/products/docs/storage/optane-technology-brief.html, 2021
    - [Quach 21a] Katyanna Quach, “Global chip shortage probably won't let up until 2023, warns TSMC: CEO 'still expects capacity to tighten more',” https://www.theregister.com/2021/04/16/tsmc_chip_forecast, Apr 2021
    - [Quach 21b] Katyanna Quach, “IBM says it's built the world's first 2nm semiconductor chips,” https://www.theregister.com/2021/05/06/ibm_2nm_semiconductor_chips, May 2021
    - [Ridley 21] Jacob Ridley, “IBM agrees with Intel and TSMC: this chip shortage isn't going to end anytime soon,” https://www.pcgamer.com/ibm-agrees-with-intel-and-tsmc-this-chip-shortage-isnt-going-to-end-anytime-soon, May 2021
    - [Shilov 21] Anton Shilov, “Samsung Develops 512GB DDR5 Module with HKMG DDR5 Chips,” https://www.tomshardware.com/news/samsung-512gb-ddr5-memory-module, Mar 2021
    - [Shilov 21b] Anton Shilov, “Seagate Ships 20TB HAMR HDDs Commercially, Increases Shipments of Mach.2 Drives,” https://www.tomshardware.com/news/seagate-ships-hamr-hdds-increases-dual-actuator-shipments, 2021
    - [Shilov 21c] Anton Shilov, “SK Hynix Envisions 600-Layer 3D NAND & EUV-Based DRAM,” https://www.tomshardware.com/news/sk-hynix-600-layer-3d-nand-euv-dram, Mar 2021
    - [Shilov 21d] Anton Shilov, “Sapphire Rapids Uncovered: 56 Cores, 64GB HBM2E, Multi-Chip Design,” https://www.tomshardware.com/news/intel-sapphire-rapids-xeon-scalable-specifications-and-features, Apr 2021
    - [SuperMicro 21] SuperMicro, “B12SPE-CPU-25G (For SuperServer Only),” https://www.supermicro.com/en/products/motherboard/B12SPE-CPU-25G, 2021
    - [Thaler 21] Dave Thaler, Poorna Gaddehosur, “Making eBPF work on Windows,” https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows, May 2021
    - [TornadoVM 21] TornadoVM, “TornadoVM: Run your software faster and simpler!” https://www.tornadovm.org, 2021
    - [Trader 21] Tiffany Trader, “Cerebras Second-Gen 7nm Wafer Scale Engine Doubles AI Performance Over First-Gen Chip,” https://www.enterpriseai.news/2021/04/21/latest-cerebras-second-gen-7nm-wafer-scale-engine-doubles-ai-performance-over-first-gen-chip, Apr 2021
    - [Vahdat 21] Amin Vahdat, “The past, present and future of custom compute at Google,” https://cloud.google.com/blog/topics/systems/the-past-present-and-future-of-custom-compute-at-google, Mar 2021
    - [Wikipedia 21] “Semiconductor device fabrication,” https://en.wikipedia.org/wiki/Semiconductor_device_fabrication, 2021
    - [Wikipedia 21b] “Silicon,” https://en.wikipedia.org/wiki/Silicon, 2021
    - [ZonedStorage 21] Zoned Storage, “Zoned Namespaces (ZNS) SSDs,” https://zonedstorage.io/introduction/zns, 2021

    I've taken care to cite the author names along with the talk title and date, including for Internet sources, instead of the common practice of just listing URLs. I followed that practice when writing some earlier books, and it has since struck me as unfair that some references had author names and some didn't. Nowadays I always include full names when known.

    In case you are interested, at the same conference I also gave a talk on [BPF Internals].

    [youtube]: https://www.youtube.com/watch?v=5nN1wjA_S30
    [PDF]: /Slides/LISA2021_ComputingPerformance.pdf
    [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html
    [BPF Internals]: /blog/2021-06-15/bpf-internals.html
    [slideshare]: https://www.slideshare.net/brendangregg/computing-performance-on-the-horizon-2021
    [write about]: /blog/2021-07-03/how-to-add-bpf-observability.html

    July 04, 2021 02:00 PM

    July 02, 2021

    Brendan Gregg: How To Add eBPF Observability To Your Product

    There's an arms race to add [eBPF] (BPF) to commercial observability products, and in this post I'll describe how to quickly do that. This is also applicable for people adding it to their own in-house monitoring systems.

    People like to show me their BPF observability products after they have prototyped or built them, but I often wish I had given them advice before they started. As the leader of BPF observability, it's advice I've been including in recent talks, and now I'm including it in this post.

    First, I know you're busy. You might not even like BPF. To be pragmatic, I'll describe how to spend the least effort to get the most value. Think of this as "version 1": A starting point that's pretty useful. Whether you follow this advice or not, at least please understand it to avoid later regrets and pain.

    If you're using an open source monitoring platform, first check if it already has a BPF agent. This post assumes it doesn't, and you'll be adding something for the first time.

    ## 1. Run your first tool

    Start by installing the [bcc] or [bpftrace] tools. E.g., bcc on Ubuntu:

    # apt-get install bpfcc-tools
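
    Or, if you'd rather start with bpftrace, the install is similar (the package name below is as in recent Ubuntu releases; older releases may need a snap or a source build):

    # apt-get install bpftrace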
    
    Then try running a tool. E.g., to see process execution with timestamps using execsnoop(8):
    # execsnoop-bpfcc -T
    TIME     PCOMM            PID    PPID   RET ARGS
    19:36:15 service          828567 6009     0 /usr/sbin/service --status-all
    19:36:15 basename         828568 828567   0 
    19:36:15 basename         828569 828567   0 /usr/bin/basename /usr/sbin/service
    19:36:15 env              828570 828567   0 /usr/bin/env -i LANG=en_AU.UTF-8 LANGUAGE=en_AU:en LC_CTYPE= LC_NUMERIC= LC_TIME= LC_COLLATE= LC_MONETARY= LC_MESSAGES= LC_PAPER= LC_NAME= LC_ADDRESS= LC_TELEPHONE= LC_MEASUREMENT= LC_IDENTIFICATION= LC_ALL= PATH=/opt/local/bin:/opt/local/sbin:/usr/local/git/bin:/home/bgregg/.local/bin:/home/bgregg/bin:/opt/local/bin:/opt/local/sbin:/ TERM=xterm-256color /etc/init.d/acpid 
    19:36:15 acpid            828570 828567   0 /etc/init.d/acpid status
    19:36:15 run-parts        828571 828570   0 /usr/bin/run-parts --lsbsysinit --list /lib/lsb/init-functions.d
    19:36:15 systemctl        828572 828570   0 /usr/bin/systemctl -p LoadState --value show acpid.service
    19:36:15 readlink         828573 828570   0 /usr/bin/readlink -f /etc/init.d/acpid
    [...]
    
    While basic, I've solved many perf issues with this tool alone, including for misconfigured systems where a shell script is launching failing processes in a loop, and when some minor application is crashing and restarting every few minutes but has not yet been noticed.

    ## 2. Add a tool to your product

    Now imagine adding execsnoop(8) to your product. You likely already have agents running on all your customer systems. Do they have a way to run a command and return the text output? Or run a command and send the output elsewhere for aggregation (S3, Hive, Druid, etc.)? There are so many options that it's really your own preference, based on your existing system and customer environments. When you add your first tool to your product, have it run for a short duration such as 10 to 60 seconds. I just noticed execsnoop(8) doesn't have a duration option yet, so in the interim you could wrap it with timeout(1): timeout -s INT 60 execsnoop-bpfcc. If you want to run these tools 24x7, study overheads to understand the cost first. Low-frequency events such as process execution should be negligible to capture.

    Instead of bcc, you can also use the [bpftrace] versions. These typically don't have canned options (-v, -l, etc.), but do have a json output mode. E.g.:
    # bpftrace -f json execsnoop.bt 
    {"type": "attached_probes", "data": {"probes": 2}}
    {"type": "printf", "data": "TIME(ms)   PID   ARGS\n"}
    {"type": "printf", "data": "2737       849176 "}
    {"type": "join", "data": "ls -F"}
    {"type": "printf", "data": "5641       849178 "}
    {"type": "join", "data": "date"}
    
    This mode was added so that BPF observability products can be built on top of bpftrace.

    ## 3. Don't worry about dependencies

    I am indeed suggesting that you install bcc or bpftrace on your customer systems, and they currently have llvm dependencies. This can add up to tens of Mbytes, which can be a problem for some resource-constrained environments (embedded). We've been doing lots of work to fix this in the future. bcc has newer versions of the tools (libbpf-tools) that use [BTF and CO-RE] (and not Python) and will ultimately mean you can install 100-Kbyte binary versions of the tools with no dependencies. bpftrace has a similar plan to produce a small dependency-less binary using the newer kernel features. This does require at least Linux 5.8 to work well, and your customers may not run that for years. In the interim I'd suggest not worrying about the llvm dependencies for now, since it will be fixed later.

    Note that not all Linux distributions have enabled CONFIG_DEBUG_INFO_BTF=y, which is necessary for the future of BTF and CO-RE. Major distros have set it, such as in Ubuntu 20.10, Fedora 30, and RHEL 8.2. But if you know some of your customers are running something uncommon, please check and encourage them or the distro vendor to set CONFIG_DEBUG_INFO_BTF=y and CONFIG_DEBUG_INFO_BTF_MODULES=y to avoid pain in the future. (A quick check for this is sketched at the end of this post.)

    ## 4. Version 1 dashboard

    Now that you have one BPF observability tool in your product, it's time to add more. Here are the top ten tools you can run and present as a generic BPF observability dashboard, along with suggested visualizations:

    This is based on my [bcc Tutorial], and many also exist in bpftrace. I chose these to find the most performance wins with the fewest tools. Note that runqlat and profile can have noticeable overheads, so I'd run these tools for between 10 and 60 seconds only and generate a report. Some are low enough overhead to be run 24x7 if desired (e.g., execsnoop, biolatency, tcplife, tcpretrans).

    There is already documentation as man pages and example files in the bcc and bpftrace repositories that you can link to, to help your customers understand the tool output. E.g., here are the execsnoop(8) example files in bcc and bpftrace.

    Once you have this all working, you have version 1!

    ## bcc vs bpftrace

    The bcc tools are the easiest to use, as they usually have many command-line options. The bpftrace tools are easier to edit and customize, and bpftrace has a json output mode. If you're completely new to tracing, go with bcc. If you want to do some hacking and customizing of the tools, go with bpftrace. In the end, they are both good options.

    ## Case study: Netflix

    Netflix is building a new GUI that does this tool dashboard and more, based on the bpftrace versions of these tools. The architecture is:

    While the bpftrace binary is installed on all the target systems, the bpftrace tools (text files) live on a web server and are pushed out when needed. This means we can ensure we're always running the latest version of the tools by updating them in one place. This is currently part of our FlameCommander UI, which also runs flame graphs across the cloud. Our previous BPF GUI was part of [Vector], and used bcc, but we've since deprecated that. We'll likely open source the new one at some point and have a post about it on the Netflix tech blog.

    ## Case study: Facebook

    Facebook are advanced users of BPF, but deep details of how they run the tools fleet-wide aren't fully public. Based on the activity in bcc, and their development of the BTF and CO-RE technologies, I'd strongly suspect their solution is based on the bcc libbpf-tool versions.

    ## Porting Pitfalls

    BPF tracing tools are like application and kernel patches. They need constant updates to keep working across different software versions. Porting them to a different language and then not maintaining them may be like trying to apply a Linux 4.15 patch to Linux 5.12. If you're lucky, it blows up! If you're unlucky, the patch applies but corrupts some things in a subtle way that you don't notice until later. It depends on the tool.

    As an extreme example, I wrote cachestat(8) while on vacation in 2014 for use on the Netflix cloud, which was a mix of Linux 3.2 and 3.13 at the time. BPF didn't exist on those versions, so I used basic Ftrace capabilities that were available on Linux 3.2. I described this approach as [brittle] and a [sandcastle] that would need maintenance as the kernel changed. It was later ported to BPF with kprobes, and has now been rewritten and included in commercial observability products. Unsurprisingly, I've heard it has problems on newer kernels, printing output that doesn't make sense. It really needs an overhaul. When I (or someone) does that, anyone pulling updates from bcc will automatically get the fixed version, with no effort. Those that have rewritten it will need to rewrite theirs. I fear they won't, and customers will be running a broken version of cachestat(8) for years. Note that if BPF had been available on my target environment when I wrote cachestat(8), I would have coded it completely differently. People are porting something written for Linux 3.2 and running it on Linux 5.x.

    In a previous blog post, [An Unbelievable Demo], I talked about how something similar happened many years ago, where old tracing tool versions were used without updates.

    The problems I'm describing are specific to BPF software and kernel tracing. As a different example, my flame graph software has been rewritten over a dozen times, and since it's a simple and finished algorithm I don't see a big problem with that. I prefer people help with the newer [d3 version], but if people do their own it's no big deal. You can code it and it'll work forever. That's not the case with uprobe- and kprobe-based BPF tools, because they do need maintenance.

    ## Think like a sysadmin, not like a programmer

    In summary, start by checking if there's already a BPF agent for your monitoring systems, and if not, build one based on the existing [bcc] or [bpftrace] tools rather than rewriting everything from scratch. This is thinking like a sysadmin who installs and maintains software, and not like a programmer who codes everything. Install the bcc or bpftrace tools, add them to your observability product, and pull package updates as needed. That will be a quick and useful version 1. BPF up and running!

    I see people think like a programmer instead and feel they must start by learning bcc and BPF programming in depth. Then, having discovered everything is C or Python, some rewrite it all in a different language. First, learning bcc and BPF well takes weeks; learning the subtleties and pitfalls of system tracing can take months or years. To give you a taste of what you're in for, check out my [BPF Internals] talk. If you really want to do this and have the time, you certainly can (you'll probably wind up at tracing conferences and bumping into me: see you at Linux Plumber's or the Tracing Summit!). But if you're under some deadline to add BPF observability, try thinking like a sysadmin instead and just build upon the existing tools. That's the fast way. Think like a programmer later, if or when you have the time.

    Second, the BPF software, especially certain kprobe-based tools, requires ongoing maintenance. A tool may work on Linux 5.3 but break on 5.4, as a traced function was renamed or a new code path added. The BPF libraries and frameworks are also changing and evolving, most recently with the BTF and CO-RE support. This is something I hope people consider before choosing to rewrite them: Do you have a plan to rewrite all the updates as well, or will you end up stuck on an old port of the library? It's easier to pull updates of everything than to maintain your own versions.

    Finally, what if you have a great idea for a _better_ BPF library or framework than what we're using in bcc and bpftrace? Talk to us, try it out, innovate. We're at the start of the BPF era and there's lots more to explore. But please understand what exists first and the maintenance burden you are taking on. Your energies may be better spent creating something new, on top of what exists, than porting something old.

    [bcc]: https://github.com/iovisor/bcc
    [bpftrace]: https://github.com/iovisor/bpftrace
    [book]: /bpf-performance-tools-book.html
    [choosing]: /blog/2015-07-08/choosing-a-linux-tracer.html
    [An Unbelievable Demo]: /blog/2021-06-04/an-unbelievable-demo.html
    [d3 version]: https://github.com/spiermar/d3-flame-graph
    [bcc Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
    [brittle]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
    [sandcastle]: https://github.com/brendangregg/perf-tools/blob/master/fs/cachestat
    [BTF and CO-RE]: /blog/2020-11-04/bpf-co-re-btf-libbpf.html
    [Vector]: https://github.com/Netflix/vector
    [eBPF]: https://ebpf.io/
    [BPF Internals]: /blog/2021-06-15/bpf-internals.html
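
    As promised above, here is a quick way to check whether a kernel was built with BTF. This is a minimal sketch, assuming the usual distro location for the kernel config (some kernels expose /proc/config.gz instead):

    # Look for CONFIG_DEBUG_INFO_BTF=y (and CONFIG_DEBUG_INFO_BTF_MODULES=y):
    grep BTF /boot/config-$(uname -r)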

    July 02, 2021 02:00 PM

    June 30, 2021

    Linux Plumbers Conference: Real-time Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Real-time Microconference has been accepted into the 2021 Linux Plumbers Conference. Since 2004, the project that has become known as PREEMPT_RT, formerly the real-time patch, has improved the real-time and low-latency features of the Linux kernel. Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include: mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers, and more. The number of patches that still need integration has been significantly reduced, and the rest are mature enough to make their way into mainline Linux.

    The following accomplishments have been made as a result of last year’s microconference:

    This year’s topics to be discussed include:

    Come and join us in the discussion of controlling what tasks get to run on your machine and when.

    We hope to see you there.

    June 30, 2021 10:08 PM

    June 22, 2021

    Michael Kerrisk (manpages): man-pages-5.12 released

    Alex Colomar and I have released man-pages-5.12. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release includes more than 300 commits that changed around 180 manual pages.

    The most notable of the changes in man-pages-5.12 are the following:

    Special thanks to Alex, who was once again the largest contributor in this release!

    June 22, 2021 12:48 AM

    June 21, 2021

    Linux Plumbers Conference: Toolchains and Kernel Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Toolchains and Kernel Microconference has been accepted into the 2021 Linux Plumbers Conference. Toolchains are a central part of any development effort, as they create the executables from the code a developer writes. For applications to run efficiently on the operating system, there needs to be a strong understanding of the interface between the application and the kernel it runs on. This microconference is focused on the integration of toolchains and the Linux kernel.

    Since last year’s meet up, the following has been accomplished:

    This year’s topics to be discussed include:

    Come and join us in the discussion of making the toolchains work better with the Linux kernel.

    We hope to see you there.

    June 21, 2021 10:17 PM

    June 18, 2021

    Linux Plumbers Conference: Tracing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Tracing Microconference has been accepted into the 2021 Linux Plumbers Conference. Tracing in the Linux kernel is constantly improving. Tracing was officially added to Linux in 2008. Since then, more tooling has been constantly added to help out with visibility. The work is still ongoing, with Perf, ftrace, LTTng, and eBPF. User space tooling is expanding, and as the kernel gets more complex, so does the need for visibility into what is going on under the hood.

    Since the last tracing meetup at Linux Plumbers in 2019, a few accomplishments have come out of it:

    This year’s topics to be discussed include:

    Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

    We hope to see you there!

    June 18, 2021 09:15 PM

    June 14, 2021

    Brendan Gregg: USENIX LISA2021 BPF Internals (eBPF)

    For USENIX LISA2021 I gave a 40 minute deep dive talk on BPF internals for Linux, focusing on observability tracing tools. Since there are already BPF internals references online (listed in this post), I used the opportunity to create some new content, showing how bpftrace instrumentation works from user space down to machine code. I break it down into all the small components involved, where you'll find it's actually quite easy. The video is on [youtube]:

    The slides are on [slideshare] or as a [PDF]:
    Thanks to USENIX LISA for not only hosting this talk, but also for suggesting it. Internals talks can feel like they don't have strong take-aways, so I usually share that content in websites and books instead where people can browse as needed. But other USENIX events have had success with these "Core Principles" topics, so I gave it a try this time. How do you like it? As this is content that otherwise wouldn't exist without USENIX's help, my thanks to everyone who supports USENIX. Links from my references slide: - https://events.static.linuxfound.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf - Linux include/uapi/linux/bpf_common.h - Linux include/uapi/linux/bpf.h - Linux include/uapi/linux/filter.h - https://docs.cilium.io/en/v1.9/bpf/#bpf-guide - BPF Performance Tools, Addison-Wesley 2020 - https://ebpf.io/what-is-ebpf - http://www.brendangregg.com/ebpf.html - https://github.com/iovisor/bcc - https://github.com/iovisor/bpftrace Capabilities continue to be added to BPF, so to stay current you will need to keep an eye on updates to the Linux header files listed above. For high-frequency updates you can also subscribe to the [bpf-next] mailing list, or for low-frequency summaries search for "BPF" in the [KernelNewbies summaries]. There is also a substantially different implementation of BPF internals that I didn't cover at all in this talk: [eBPF on Windows] by Microsoft, only recently made public. In other BPF news, I just found out that my Addison-Wesley [BPF Performance Tools] book is in a snap 5-day sale [until June 19]. [youtube]: https://www.youtube.com/watch?v=_5Z2AU7QTH4 [slideshare]: https://www.slideshare.net/brendangregg/bpf-internals-ebpf [PDF]: /Slides/LISA2021_BPF_Internals.pdf [BPF Performance Tools]: /bpf-performance-tools-book.html [until June 19]: https://twitter.com/InformIT/status/1404569134042603520 [bpf-next]: http://vger.kernel.org/vger-lists.html#bpf [KernelNewbies summaries]: https://kernelnewbies.org/LinuxVersions [eBPF on Windows]: https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows/

    June 14, 2021 02:00 PM

    Linux Plumbers Conference: IoThree’s Company Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the IoThree’s Company Microconference has been accepted into the 2021 Linux Plumbers Conference. As everyday devices become more connected to the internet, the infrastructure around them constantly needs to be developed. Linux is showing up more in products that are not normally considered to be computers, but now need to interact with a central location (cloud). This brings new challenges that need to be addressed.

    Last year’s meetup produced the following:

    This year’s topics to be discussed include:

    Come and join us in some heated but productive discussions in making your everyday devices communicate with the world around them.

    We hope to see you there.

    June 14, 2021 12:35 AM

    June 04, 2021

    Matthew Garrett: Mike Lindell's Cyber "Evidence"

    Mike Lindell, notable for absolutely nothing relevant in this field, today filed a lawsuit against a couple of voting machine manufacturers in response to them suing him for defamation after he claimed that they were covering up hacks that had altered the course of the US election. Paragraph 104 of his suit asserts that he has evidence of at least 20 documented hacks, including the number of votes that were changed. The citation is just a link to a video called Absolute 9-0, which claims to present sufficient evidence that the US supreme court will come to a 9-0 decision that the election was tampered with.

    The claim is that Lindell was provided with a set of files on the 9th of January, and gave these to some cyber experts to verify. These experts identified them as packet captures. The video contains scrolling hex, and we are told that this is the raw encrypted data from the files. In reality, the hex values correspond very clearly to printable ASCII, and appear to just be the Pennsylvania voter roll. They're not encrypted, and they're not packet captures (they contain no packet headers).

    20 of these packet captures were then selected and analysed, giving us the tables contained within Exhibit 12. The alleged source IPs appear to correspond to the networks the tables claim, and the latitude and longitude presumably just come from a geoip lookup of some sort (although clearly those values are far too precise to be accurate). But if we look at the target IPs, we find something interesting. Most of them resolve to the website for the county that was the nominal target (eg, 198.108.253.104 is www.deltacountymi.org). So, we're supposed to believe that in many cases, the county voting infrastructure was hosted on the county website.

    Unfortunately we're not given the destination port, but 198.108.253.104 isn't listening on anything other than 80 and 443. We're told that the packet data is encrypted, so presumably it's over HTTPS. So, uh, how did they decrypt this to figure out how many votes were switched? If Mike's hackers have broken TLS, they really don't need to be dealing with this.

    We're also given some background information on how it's impossible to reconstruct packet captures after the fact (untrue), or that modifying them would change their hashes (true, but in the absence of known good hash values that tells us nothing), but it's pretty clear that nothing we're shown actually demonstrates what we're told it does.

    In summary: yes, any supreme court decision on this would be 9-0, just not the way he's hoping for.

    Update: It was pointed out that this data appears to be part of a larger dataset. This one is even more dubious - it somehow has MAC addresses for both the source and destination (which is impossible), and almost none of these addresses are in actual issued ranges.

    comment count unavailable comments

    June 04, 2021 05:49 AM

    Linux Plumbers Conference: Performance and Scalability Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Performance and Scalability Microconference has been accepted into the 2021 Linux Plumbers Conference.

    All parts of the Linux ecosystem, kernel and userspace, should account for performance and scalability. The purpose of this microconference is for developers from different projects to meet and collaborate, as the entire stack must perform well for the user to see good results. Because performance and scalability are very generic topics, this microconference focuses on issues that may also be addressed in other, more specific sessions.

    The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.

    Here are some of the outcomes from the last time the event was held in 2018:

    This year’s topics tentatively include:

    Come and join us in the discussion of improving performance and scalability of your system.

    We hope to see you there.

    June 04, 2021 01:12 AM

    June 03, 2021

    Brendan Gregg: An Unbelievable Demo

    This is the story of the most unbelievable demo I've been given in the world of open source. You can't make this stuff up. It was 2005, and I felt like I was in the eye of a hurricane. I was an independent performance consultant and Sun Microsystems had just released DTrace, a tool that could instrument all software. This gave performance analysts like myself X-ray vision. While I was busy writing and publishing advanced performance tools using DTrace (my open source [DTraceToolkit] and other [DTrace tools], aka scripts), I noticed something odd: I was producing more DTrace tools than were coming out of Sun itself. Perhaps there was some internal project that was consuming all their DTrace expertise?


    DTraceToolkit v0.96 tools (2006)

    As I wasn't a Sun Microsystems employee I wasn't privy to Sun's internal projects. However, I was doing training and consulting for Sun, helping their customers with system administration and performance. Sun sometimes invited me to their own customer meetings and other events I might be interested in, as a local expert. I was living in Sydney, Australia. This time I was told that there was a Very Important Person visiting from the US whom I'd want to meet. I didn't recognize the name, but was told that he was a DTrace expert and developer at Sun, and was on a world tour demonstrating Sun's new DTrace-based product. Ah-hah – this must be the internal project! But this would be no ordinary project. I'd seen some amazing technologies from Sun, but I'd never seen a developer on a world tour. This was going to be big, and would likely blow away my earlier DTrace work. The VIP was returning to Sydney for a few days before going to the next Australian city, so we agreed to meet at the Sun Sydney office.

    ## The Meeting

    The DTrace expert arrived wearing casual business attire and a heavy American accent, and seemed a bit weary from his world tour. He had just been to South Africa and New Zealand, and listed other countries and cities he was heading to next. Two other Australian Sun staff joined the meeting, and one introduced me with: "Brendan teaches some classes for us, and has been doing some DTrace stuff." Low-key introductions are the norm in Australia (especially for Australians) and I wondered whether he knew of this cultural difference. Another difference was that there were few roles in Australia for engineers in 2005, unlike the US. The Sun Microsystems Australia jobs, for example, were all in support and none in development, and other tech giants had not yet arrived. So back then in Australia you could find amazing engineers doing whatever roles were available. I tried to expand on the "stuff" a bit by saying that I'd written the DTraceToolkit, but he wasn't impressed. He didn't recognize my name, nor had he heard of the DTraceToolkit. To him, I was just some random guy.

    He was kind enough to give me a quick demo anyway. His DTrace product was an add-on for a larger Sun GUI that I was already familiar with. After it loaded, he showed how you could run one of several DTrace tools by double clicking an icon. Either the raw output would be printed in a separate window, or the results would be shown as a line graph. This seemed __quite underwhelming__. The GUI already had this functionality: Showing the raw output of tools or drawing a line graph. I was hoping for a new GUI feature. The only new work was the tools themselves, of which there were several. He gave a quick sales pitch about the new and amazing observability they provided, something he must have said many times to impress customers. I got the feeling he wasn't expecting me to properly appreciate their value. But I _did_ understand these tools, since I had coded similar functionality for my own DTraceToolkit. They were useful, but...I was expecting a hurricane of awesome _new_ DTrace content.

    "I've done these before – I've written tools that do these things myself!"

    "Yeah, sure." He didn't quite say it, but gave me a look like he didn't really believe me, or that I could even truly understand what they were. This was an important innovation by Sun Microsystems, a US-based multinational company worth billions. I was just some random Aussie.

    ## Socket Tracing

    I browsed the GUI icons for something new, and the closest was a tool for tracing socket I/O. I had tried this in 2004 ([socketsnoop.d]) and published it as open source, but my tool was incomplete: I didn't have access to the kernel source code, so I had to figure out everything the hard way using black box analysis. It worked for most TCP traffic types but not others, which I warned about in the script comments. I'd also not included it in the DTraceToolkit yet as I didn't consider it finished. So of all the tools he had, I was most interested to see this one. Sun could do a much better job just by referring to the source code they were instrumenting, and actually finish this tool.

    "Can I see the socket I/O script?" I fired up a terminal. He looked alarmed at first, as if I wasn't supposed to look behind the curtain, then realized another selling feature: "Well, sure, you could even add more tools to the GUI!" and after a pause, added "if you have them". Sure, I have them all right.

    He gave me a path to start looking under, and after a bit of searching I found the directory with all the tools he had been demoing. The tools all had familiar names. One was even called socketsnoop.d. A new possibility dawned on me. No way.

    I printed socketsnoop.d. The screen filled with _my own script_. It was the same incomplete attempt I had hacked up a year earlier, and published as open source. It included some weird code that only made sense when I wrote it (use of PFORMAT, prior to defaultargs) and was written in my earlier coding style. I was looking at _my own fucking script_.

    "This is MY script."

    I printed the other tools and saw the same – they were _all mine_. This hot new Sun product that Mr. VIP was touring the world showing off was actually just my own open source tools. My jaw was on the floor. He didn't seem to believe me.

    ## You Can't Do That

    I used grep to search all his tools for my name, which was in the header comment of all my tools, to prove beyond a doubt that these were mine. But I found nothing. My name had been stripped. Some of my tools had even included the line:
        # Author: Brendan Gregg  [Sydney, Australia]
    
    And now, here he was, in Sydney, Australia, trying to sell Brendan Gregg's tools to Brendan Gregg.

    One of the Australian Sun staff interrupted: "Those say copyright Sun Microsystems." Most of my tools had my own copyright and a GPLv2 or CDDL license. But these only had Sun's standard copyright message, and the open source licenses had been stripped.

    "You deleted my name! And the copyrights and licenses!"

    The other Aussie added, to the VIP: "You can't do that."

    A silence fell over the room as the magnitude of what had happened sunk in. While some at Sun were encouraging open source contributions and building a community, others were ripping off that same community. Taking their work, changing the license and copyrights, and then selling it.

    The VIP wasn't prepared for this and had a look of confusion. He didn't say much, other than that he didn't know what had happened, and that he may have gotten the tools from someone else already like this (ie, don't blame me). He seemed to be only half believing what we were saying. The meeting ended quickly.

    I suggested that he get newer copies of my tools, directly from the DTraceToolkit, since these older versions from my homepage were out of date, and some had errors that I had already fixed. I also reminded him to keep my name, copyright, and license on all of them. In his defense, perhaps the meeting may have gone differently had I not been given a low-key Australian introduction. That's an Australian cultural problem (tall poppy syndrome). To an Australian, introductions in the US can sound boastful, but they can also be useful as a quick way to share one's specialties.

    ## Other Cases

    Of all the tools I had published as open source, I still can't believe socketsnoop.d was included. It wasn't even very good. Later on I wrote much better socket tools (in my [DTrace] and [BPF] books).

    A few years later, Apple added dozens of my tools to OS X. They left my name, copyright, and CDDL open source license intact, and even improved and enhanced some of them. Years later, Oracle did the same for Oracle Solaris 11, and the BSD community did for FreeBSD. My thanks to all of you.

    You might say that this wasn't really Sun the company doing this, but rather, a careless individual. But there was something in Sun's culture that contributed to this kind of carelessness. It was something I and my consulting colleagues had run into before: The belief at Sun that only Sun could make good use of its own technologies, and anything created outside of Sun was trash. When these Sun employees found something that was good, they were inclined to assume it came from Sun, and it was therefore safe to reuse and rebrand (and relicense) as they assumed they already held the copyrights.

    There were also others at Sun that did try hard to do the right thing by me and my work. On at least four other occasions my DTraceToolkit was built into observability products, without stripping licenses. (In one case they wanted to relicense to GPL, and talked to me and Sun legal about it, but that's another story.) This also wasn't the last time someone unwittingly tried to sell me my own work, it was just the first. I've learned not to tell sales people that I invented what they are showing me, as they then give me funny looks like I'm a crazy person, but instead to simply say "I have a lot of experience with that technology" and leave it at that.

    I'm reminded of this first case since my BPF tools are now appearing in observability products, and will grow to a scale much bigger than my DTrace tools. I'll write about it more in future posts, but my immediate advice to developers is this: Please do not rewrite my BPF tools and the bcc libraries; try to build upon them as-is (either bcc Python or bcc libbpf-tool versions) and fetch regular updates. This is because they are works-in-progress, and rewriting (forking) them divides engineering resources and will have your customers using out of date versions. (Note that I think my flame graph software is different: Since it is a simple and finished algorithm that doesn't need much maintenance, I don't see a big problem with people rewriting it. It is nice to get some thanks, however, just as I have done for those that inspired flame graphs.)

    As for the unbelievable demo: This wasn't the great DTrace product I imagined when hearing about a world tour. It was, in fact, my own tools. I suspect that it's not uncommon for an open source developer to discover, at some point, that their own code has been rebranded. But the circumstance in this case may be a little unusual. A US developer got a world tour for software he didn't write, which included giving a sales pitch and demo in Australia, unwittingly, to the author. I don't think he even said thank you.

    [socketsnoop.d]: http://www.brendangregg.com/DTrace/socketsnoop.d
    [DTrace]: /dtrace.html
    [BPF]: /bpf-performance-tools-book.html
    [DTraceToolkit]: /dtracetoolkit.html
    [DTrace tools]: /dtrace.html

    June 03, 2021 02:00 PM

    June 02, 2021

    Matthew Garrett: Producing a trustworthy x86-based Linux appliance

    Let's say you're building some form of appliance on top of general purpose x86 hardware. You want to be able to verify the software it's running hasn't been tampered with. What's the best approach with existing technology?

    Let's split this into two separate problems. The first is to do as much as we can to ensure that the software can't be modified without our consent[1]. This requires that each component in the boot chain verify that the next component is legitimate. We call the first component in this chain the root of trust, and in the x86 world this is the system firmware[2]. This firmware is responsible for verifying the bootloader, and the easiest way to do this on x86 is to use UEFI Secure Boot. In this setup the firmware contains a set of trusted signing certificates and will only boot executables with a chain of trust to one of these certificates. Switching the system into setup mode from the firmware menu will allow you to remove the existing keys and install new ones.

    (Note: You shouldn't use the trusted certificate directly for signing bootloaders - instead, the trusted certificate should be used to sign another certificate and the key for that certificate used to sign your bootloader. This way, if you ever need to revoke the signing certificate, you can simply sign a new one with the trusted parent and push out a revocation update instead of having to provision new keys)
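
    A sketch of that two-level hierarchy using the Python cryptography package (names, key sizes, and lifetimes here are illustrative assumptions; actual bootloader signing would be done with tooling such as sbsign):

        import datetime
        from cryptography import x509
        from cryptography.x509.oid import NameOID
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa

        def name(cn):
            return x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, cn)])

        def make_cert(subject, issuer, pubkey, signing_key, days):
            now = datetime.datetime.utcnow()
            return (x509.CertificateBuilder()
                    .subject_name(name(subject))
                    .issuer_name(name(issuer))
                    .public_key(pubkey)
                    .serial_number(x509.random_serial_number())
                    .not_valid_before(now)
                    .not_valid_after(now + datetime.timedelta(days=days))
                    .sign(signing_key, hashes.SHA256()))

        # Long-lived trusted certificate, enrolled in the firmware's db.
        root_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        root = make_cert("Appliance Root", "Appliance Root",
                         root_key.public_key(), root_key, days=3650)

        # Shorter-lived certificate that actually signs boot images; it can
        # be revoked and replaced without re-provisioning the firmware db.
        boot_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        boot = make_cert("Boot Signing", "Appliance Root",
                         boot_key.public_key(), root_key, days=365)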

    But what do you want to sign? In the general purpose Linux world, we use an intermediate bootloader called Shim to bridge from the Microsoft signing authority to a distribution one. Shim then verifies the signature on grub, and grub in turn verifies the signature on the kernel. This is a large body of code that exists because of the use cases that general purpose distributions need to support - primarily, booting on arbitrary off-the-shelf hardware, and allowing arbitrary and complicated boot setups. This is unnecessary in the appliance case, where the hardware target can be well defined, where there's no need for interoperability with the Microsoft signing authority, and where the boot configuration can be extremely static.

    We can skip all of this complexity using systemd-boot's unified Linux image support. This has the format described here, but the short version is that it's simply a kernel and initramfs linked into a small EFI executable that will run them. Instructions for generating such an image are here, and if you follow them you'll end up with a single static image that can be directly executed by the firmware. Signing this avoids dealing with a whole host of problems associated with relying on shim and grub, but note that you'll be embedding the initramfs as well. Again, this should be fine for appliance use-cases, but you'll need your build system to support building the initramfs at image creation time rather than relying on it being generated on the host.

    At this point we have a single image that can be verified by the firmware and will get us to the point of a running kernel and initramfs. Unless you've got enough RAM that you can put your entire workload in the initramfs, you're going to want a filesystem as well, and you're going to want to verify that that filesystem hasn't been tampered with. The easiest approach to this is to use dm-verity, a device-mapper layer that uses a hash tree to verify that the filesystem contents haven't been modified. The kernel needs to know what the root hash is, so this can either be embedded into your initramfs image or into the kernel command line. Either way, it'll end up in the signed boot image, so nobody will be able to tamper with it.
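
    dm-verity's hash tree is conceptually simple: hash every data block, then hash the concatenated hashes, and repeat until a single root hash remains. A toy sketch of the idea (the real on-disk format adds a salt and fixed layouts, and is produced by veritysetup format; this assumes non-empty data):

        import hashlib

        BLOCK = 4096

        def root_hash(data: bytes) -> bytes:
            # Level 0: one hash per data block.
            level = [hashlib.sha256(data[i:i + BLOCK]).digest()
                     for i in range(0, len(data), BLOCK)]
            per_node = BLOCK // hashlib.sha256().digest_size  # hashes per tree node
            # Hash groups of child hashes until a single root remains.
            while len(level) > 1:
                level = [hashlib.sha256(b"".join(level[i:i + per_node])).digest()
                         for i in range(0, len(level), per_node)]
            return level[0]

    Changing any byte of the partition changes the root hash, and since the root hash travels inside the signed boot image, the tampering is detected when the affected block is read.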

    It's important to note that a dm-verity partition is read-only - the kernel doesn't have the cryptographic secret that would be required to generate a new hash tree if the partition is modified. So if you require the ability to write data or logs anywhere, you'll need to add a new partition for that. If this partition is unencrypted, an attacker with access to the device will be able to put whatever they want on there. You should treat any data you read from there as untrusted, and ensure that it's validated before use (ie, don't just feed it to a random parser written in C and expect that everything's going to be ok). On the other hand, if it's encrypted, remember that you can't just put the encryption key in the boot image - an attacker with access to the device is going to be able to dump that and extract it. You'll probably want to use a TPM-sealed encryption secret, which will be discussed later on.

    At this point everything in the boot process is cryptographically verified, and so should be difficult to tamper with. Unfortunately this isn't really sufficient - on x86 systems there's typically no verification of the integrity of the secure boot database. An attacker with physical access to the system could attach a programmer directly to the firmware flash and rewrite the secure boot database to include keys they control. They could then replace the boot image with one that they've signed, and the machine would happily boot code that the attacker controlled. We need to be able to demonstrate that the system booted using the correct secure boot keys, and the only way we can do that is to use the TPM.

    I wrote an introduction to TPMs a while back. The important thing to know here is that the TPM contains a set of Platform Configuration Registers that are large enough to contain a cryptographic hash. During boot, each component of the boot process will generate a "measurement" of other security critical components, including the next component to be booted. These measurements are a representation of the data in question - they may simply be a hash of the object being measured, or the hash of a structure containing various pieces of metadata. Each measurement is passed to the TPM, along with the PCR it should be measured into. The TPM takes the new measurement, appends it to the existing value, and then stores the hash of this concatenated data in the PCR. This means that the final PCR value depends not only on the measurement, but also on every previous measurement. Without breaking the hash algorithm, there's no way to set the PCR to an arbitrary value. The hash values and some associated data are stored in a log that's kept in system RAM, which we'll come back to later.
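
    The extend operation itself is tiny: the new PCR value is the hash of the old value concatenated with the measurement. A sketch of the semantics (assuming the SHA-256 PCR bank; the event names are made up):

        import hashlib

        def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
            # H(old PCR || measurement): order matters, so the final value
            # encodes the entire sequence of measurements.
            return hashlib.sha256(pcr + measurement).digest()

        pcr = bytes(32)  # PCRs start zeroed
        for event in (b"secure boot config", b"db certificates"):
            pcr = pcr_extend(pcr, hashlib.sha256(event).digest())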

    Different PCRs store different pieces of information, but the one that's most interesting to us is PCR 7. Its use is documented in the TCG PC Client Platform Firmware Profile (section 3.3.4.8), but the short version is that the firmware will measure the secure boot keys that are used to boot the system. If the secure boot keys are altered (such as by an attacker flashing new ones), the PCR 7 value will change.

    What can we do with this? There's a couple of choices. For devices that are online, we can perform remote attestation, a process where the device can provide a signed copy of the PCR values to another system. If the system also provides a copy of the TPM event log, the individual events in the log can be replayed in the same way that the TPM would use to calculate the PCR values, and then compared to the actual PCR values. If they match, that implies that the log values are correct, and we can then analyse individual log entries to make assumptions about system state. If a device has been tampered with, the PCR 7 values and associated log entries won't match the expected values, and we can detect the tampering.
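
    The verifier's side of that replay is just folding the extend operation over the event log and comparing the result with the quoted value; a sketch, assuming the quote's signature has already been checked:

        import hashlib

        def replay(log_digests) -> bytes:
            pcr = bytes(32)
            for digest in log_digests:
                pcr = hashlib.sha256(pcr + digest).digest()
            return pcr

        def log_is_consistent(quoted_pcr: bytes, log_digests) -> bool:
            # A match implies the log faithfully describes what was
            # measured; the individual events can then be inspected.
            return replay(log_digests) == quoted_pcr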

    If a device is offline, or if there's a need to permit local verification of the device state, we still have options. First, we can perform remote attestation to a local device. I demonstrated doing this over Bluetooth at LCA back in 2020. Alternatively, we can take advantage of other TPM features. TPMs can be configured to store secrets or keys in a way that renders them inaccessible unless a chosen set of PCRs have specific values. This is used in tpm2-totp, which uses a secret stored in the TPM to generate a TOTP value. If the same secret is enrolled in any standard TOTP app, the value generated by the machine can be compared to the value in the app. If they match, the PCR values the secret was sealed to are unmodified. If they don't, or if no numbers are generated at all, that demonstrates that PCR 7 is no longer the same value, and that the system has been tampered with.
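
    TOTP itself is small enough to sketch: both sides derive a short code from the shared secret and the current time, so matching codes imply the TPM was still willing to release the secret (ie, PCR 7 is unchanged). A standard-library sketch of RFC 6238 with common defaults:

        import hashlib, hmac, struct, time

        def totp(secret: bytes, now=None, step=30, digits=6) -> str:
            counter = struct.pack(">Q", int((now or time.time()) // step))
            mac = hmac.new(secret, counter, hashlib.sha1).digest()
            offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226)
            code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
            return str(code % 10 ** digits).zfill(digits)

    The device's tpm2-totp output and the phone app should show the same digits for the same 30-second window.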

    Unfortunately, TOTP requires that both sides have possession of the same secret. This is fine when a user is making that association themselves, but works less well if you need some way to ship the secret on a machine and then separately ship the secret to a user. If the user can simply download the secret via some API, so can an attacker. If an attacker has the secret, they can modify the secure boot database and re-seal the secret to the new PCR 7 value. That means having to add some form of authentication, along with a strong binding of machine serial number to a user (in order to avoid someone with valid credentials simply downloading all the secrets).

    Instead, we probably want some mechanism that uses asymmetric cryptography. A keypair can be generated on the TPM, which will refuse to release an unencrypted copy of the private key. The public key, however, can be exported and stored. If it's acceptable for a verification app to connect to the internet then the public key can simply be obtained that way - if not, a certificate can be issued to the key, and this exposed to the verifier via a QR code. The app then verifies that the certificate is signed by the vendor, and if so extracts the public key from that. The private key can have an associated policy that only permits its use when PCR 7 has an appropriate value, so the app then generates a nonce and asks the user to type that into the device. The device generates a signature over that nonce and displays that as a QR code. The app verifies the signature matches, and can then assert that PCR 7 has the expected value.
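
    The verification half of that flow is plain signature checking. A sketch with the Python cryptography package (in the real flow the private key never leaves the TPM and signing only succeeds while PCR 7 satisfies the key's policy; the key type here is my assumption):

        from cryptography.exceptions import InvalidSignature
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import ec

        # Stand-in for the TPM-resident key; only the public half is exported.
        device_key = ec.generate_private_key(ec.SECP256R1())
        public_key = device_key.public_key()  # shipped via API or QR-encoded cert

        nonce = b"verifier-chosen-nonce"      # typed into the device
        signature = device_key.sign(nonce, ec.ECDSA(hashes.SHA256()))  # device side

        try:  # app side
            public_key.verify(signature, nonce, ec.ECDSA(hashes.SHA256()))
            print("PCR 7 policy satisfied: boot state as expected")
        except InvalidSignature:
            print("signature invalid: device state cannot be attested")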

    Once we can assert that PCR 7 has the expected value, we can assert that the system booted something signed by us and thus infer that the rest of the boot chain is also secure. But this is still dependent on the TPM obtaining trustworthy information, and unfortunately the bus that the TPM sits on isn't really terribly secure (TPM Genie is an example of an interposer for i2c-connected TPMs, but there's no reason an LPC one can't be constructed to attack the sort usually used on PCs). TPMs do support encrypted communication channels, but bootstrapping those isn't straightforward without firmware support. The easiest way around this is to make use of a firmware-based TPM, where the TPM is implemented in software running on an ancillary controller. Intel's solution is part of their Platform Trust Technology and runs on the Management Engine, AMD run it on the Platform Security Processor. In both cases it's not terribly feasible to intercept the communications, so we avoid this attack. The downside is that we're then placing more trust in components that are running much more code than a TPM would and which have a correspondingly larger attack surface. Which is preferable is going to depend on your threat model.

    Most of this should be achievable using Yocto, which now has support for dm-verity built in. It's almost certainly going to be easier using this than trying to build on top of a general purpose distribution. I'd love to see this become a largely push-button "receive secure image" process, so I might have a go at that if I have some free time in the near future.

    [1] Obviously technologies that can be used to ensure nobody other than me is able to modify the software on devices I own can also be used to ensure that nobody other than the manufacturer is able to modify the software on devices that they sell to third parties. There's no real technological solution to this problem, but we shouldn't allow the fact that a technology can be used in ways that are hostile to user freedom to cause us to reject that technology outright.
    [2] This is slightly complicated due to the interactions with the Management Engine (on Intel) or the Platform Security Processor (on AMD). Here's a good writeup on the Intel side of things.

    comment count unavailable comments

    June 02, 2021 04:36 PM

    May 28, 2021

    Brendan Gregg: Moving my US tech job to Australia

    I've moved from the San Francisco Bay Area to Sydney, Australia, where I will continue the best job so far of my career: Performance engineering at Netflix. I'm grateful for the support of Netflix engineering management, Netflix HRBPs, and others for helping to make this happen. While my move is among the first from the Linux cloud teams, Netflix has had staff in Australia for years (for content, marketing, and the FreeBSD OCA). It's been a privilege and an adventure to work in Silicon Valley with so many amazing people. But I'm now excited about my new adventure: Doing an advanced tech role remotely from Australia.

    I know others who have also left the Bay Area or are planning to. Back in 2015 we'd have BPF (iovisor) meetups in Santa Clara and most contributors would be there in person, with some having travelled. Now we're more scattered, either to other US cities or worldwide. As another indicator of tech moving elsewhere, last year brought the [headline]: "Bay Area's share of VC deals predicted to fall below 20% for first time in 2021."

    Day to day things won't be much different. I'm still online, doing the same work, answering the same emails. And many of us expect (when travel is possible) to make regular visits to the US for company-wide meetings and events. I think some coworkers will still see me occasionally in the US office and won't even realize I've moved.

    Why Australia?

    When I told people I was moving to Australia they'd guess why: "Is it because of X? Or Y? ... or Z?" Well, the answer is yes, all of the above. I began discussing Australian tech roles with different companies in Jan 2020. The pandemic then added another reason to move. Both the US and Australia have their pros and cons, and I have many favorite places and people in both (sorry I didn't come say goodbye: We'll meet again). But in the end I'm a proud Australian and I do prefer Australia for various reasons, many of which Deirdré wrote about in [Why move to Australia?]. Additional reasons for me included visa uncertainty (and the abuse it leads to), voting rights, and complex international taxation. (Disclaimer: Netflix is an exception, as they have been great with visa workers including myself.)

    Another reason is that the tech market became stronger in Australia. I moved to the US in 2006 as there were many more opportunities there, especially in kernel engineering and performance. Now, in 2021, Australia has a thriving tech market. Sydney has AWS and Google offices and even a small Netflix office, just to name a few. There is also a wider variety of roles available. If you want to do kernel engineering work you no longer need to move to California to work for Sun Microsystems in the MPK17 building. You can work on Linux anywhere.

    Linux is Already Remote

    Linux has been described as the world's most successful open source project, and it's all engineers working remotely. There's no Linux kernel headquarters where all the engineers sit in an open office layout, typing furiously then dashing for the break room coffee during kernel builds, and where maintainers can yell across the room at someone for their bad patch (when it's Linus yelling, everyone takes off their headphones to listen). That doesn't happen. Engineers are remote, and may only meet once or twice a year at Linux kernel conferences. And it's worked very well for years.

    Another example of remote work I've already done is book writing. Last year I published [Systems Performance 2nd Edition], which I wrote from my home office with help from remote contributors. The entire project was run via emails, a Google drive, and Google docs, and was delivered to the publisher on time.

    Making it Work

    While tech workers are well suited for remote work (savvy with communications technologies) there are benefits to office work, and I don't think remote work is for everyone. (One benefit I'll miss is playing in the Netflix cricket team.) In the future I'd expect hybrid teams, where the remote workers visit the office on a regular cadence (e.g., once a quarter) for meetings. This is a model that's already been successfully used by some teams, including at Netflix.

    As for work hours, I set my own schedule where I start around 7am, giving between 3 and 5 hours overlap with California time (depending on daylight savings). About once a month I'll have an early morning meeting (e.g., 4am). Back when I did [SRE oncall] for Netflix I'd have more wakeups at unpredictable times, so this feels easier to manage. (I also had prior jobs in the Bay Area where I'd be in the office most days past midnight, so compared to that this is like a health retreat!) As more people move to other timezones I think this will improve further. Some meetings may move to an asynchronous format, and others may be run twice for world coverage, at 9am and 4pm California time.

    To work remote I think you have to really want it and be willing to put in extra effort, including doing the occasional early meeting. Personally, I use a stopwatch to help me stay productive: I pause it whenever I have an interruption, and measure how many hours of uninterrupted work I get done each day, log it, and then plot it on graphs to see the trends. Yes, I'm performance analyzing myself. It's been a slow process, but I've been figuring out how to become more productive each day. It's really satisfying to finish a full day's work and then realize I'm no longer in the Bay Area, but instead have a two minute walk to the beach. It's just one of many reasons to put in that extra effort.

    [Why move to Australia?]: http://www.beginningwithi.com/2020/12/01/why-move-to-australia/
    [headline]: https://www.bizjournals.com/sanjose/news/2020/12/14/bay-area-vc-deal-share-predicted-to-fall-below-20.html
    [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html
    [SRE oncall]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html

    May 28, 2021 02:00 PM

    May 23, 2021

    David Sterba: Authenticated hashes for btrfs (part 1)

    There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun, but there's always more to it, even if it sounds easy to implement.

    Johannes T., at that time at SUSE, sent the patchset adding support for SHA256 [1], along with a Labs conference paper summarizing existing solutions and giving details about the proposed implementation and use cases.

    The first version of the patchset got some feedback; issues were found and some ideas suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes. Until now I've assumed you know what an authenticated hash means, but for clarity and in simple terms: it's a checksum that depends on a key. The main point is that it's impossible to generate the same checksum for given data without knowing the key, where impossible is meant in the cryptographic sense: there's a near-zero probability of doing it by chance, and a brute-force attack is not practical.
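
    As a minimal illustration of the idea, here's a keyed checksum using HMAC-SHA256 from the Python standard library (HMAC is one standard construction of a keyed hash; the exact construction used by the patchset may differ):

        import hashlib, hmac

        key = b"secret key, eg. from the kernel keyring"
        data = b"filesystem block contents"

        # The checksum depends on both the data and the key.
        csum = hmac.new(key, data, hashlib.sha256).digest()

        # Verification recomputes the checksum with the same key; without
        # the key, forging a value for modified data is infeasible.
        ok = hmac.compare_digest(
            csum, hmac.new(key, data, hashlib.sha256).digest())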

    Auth hash, fsverity

    A notable existing solution is fsverity, which works in a read-only fashion: the key is securely hidden and used only to verify that data read from the media haven't been tampered with. A typical use case is an OS image in your phone. But that's not all. OS images appear in all sorts of boxed devices and IoT. Nowadays, with the explosion of edge computing, assuring the integrity of the end devices is a fundamental requirement.

    Where btrfs can add some value is read AND write support with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, then use something else, this is not your use case, or maybe there's another way to make sure the key won't be compromised easily. That is beyond the scope of what a filesystem can do, though.

    As an example use case of a writable filesystem with an authenticated hash: detecting outside tampering with on-disk data, eg. while the filesystem was unmounted. Filesystem metadata formats are public, and interesting data can be located by patterns on the device, so changing a few bytes and updating the checksum(s) is not hard.

    There’s one issue that was brought up and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify a basic integrity of the data. Ie. without the key it’s not possible to say if the data are valid as if a basic checksum was used. This might be still useful for a read-only access to the filesystem, but absence of key makes this impossible.

    Existing implementations

    As was noted in the LWN discussion [2], ZFS uses two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It's said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. The hashes are stored by simply trimming each to its first 16 bytes and storing them consecutively. As both hashes are cryptographically strong, the first 16 bytes should provide enough strength despite the truncation. (16 bytes being 128 bits.)
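
    In code, the described split would look something like this (a sketch of the layout only, taking the comment's description of ZFS at face value; HMAC-SHA512 stands in here for the authenticated hash):

        import hashlib, hmac

        HALF = 16  # bytes kept from each hash; 2 x 16 fills a 32-byte slot

        def split_csum(key: bytes, data: bytes) -> bytes:
            auth = hmac.new(key, data, hashlib.sha512).digest()  # authenticated
            plain = hashlib.sha256(data).digest()                # plain
            return auth[:HALF] + plain[:HALF]

    Without the key, only the plain half can be recomputed and checked, which is exactly the dependency issue discussed above.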

    When I was thinking about that, I had a different idea of how to do it. Not that copying the scheme would not work for btrfs: anything the linux kernel crypto API provides is usable, so the same is achievable. I'm not judging the decisions about which hashes to use or how to do the split; it works and I don't see a problem in the strength. Where I see potential for an improvement is performance, without sacrificing strength too much. Trade-offs.

    The CPU (software) implementation of SHA256 is comparatively slower than checksums with hardware aids (like the CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so we now compete against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better. But the support is not as widespread as eg. for CRC32C. Anyway, there's always choice and hardware improves over time. The number of hashes may seem to explode, but as long as it's manageable inside the filesystem, we take it. And a coffee, please.
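
    To get a rough feel for the software-only differences from userspace (an illustration only: results vary by CPU, Python overhead is included, and zlib's CRC32 is not the CRC32C the kernel uses):

        import hashlib, timeit, zlib

        data = bytes(4096)  # one filesystem block
        for name, fn in [
            ("sha256",  lambda: hashlib.sha256(data).digest()),
            ("blake2b", lambda: hashlib.blake2b(data).digest()),
            ("crc32",   lambda: zlib.crc32(data)),  # plain CRC32, not CRC32C
        ]:
            print(name, timeit.timeit(fn, number=100_000), "s per 100k blocks")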

    Secondary hash

    The proposed checksum scheme is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is a given. There are two of them and that's fine. I'm not drawing an exact parallel with ZFS; the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. It's the non-cryptographic hash that can be debated. I want to call it the secondary hash, with the obvious meaning that it's not too important by default and comes second when the authenticated hash is available.

    We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.

    Another design point was where to do the split and truncation. As XXHASH has a fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for the full XXHASH.

    Here we are, we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree what to do when the checksum partially matches. For a single checksum it’s a simple yes/no decision. The partial match is the interesting case:

    This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome, but for two hashes it's not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification, and has to reflect the key availability, eg. in case the data are auto-repaired during scrub or when there's a copy.
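
    A sketch of the proposed layout and the four verification outcomes (both concrete choices here are mine for illustration: HMAC-SHA256 stands in for the authenticated hash, and the third-party xxhash package provides XXH64):

        import hashlib, hmac
        import xxhash  # third-party package

        AUTH_BYTES = 24  # 192 bits of the authenticated hash
        SEC_BYTES = 8    # 64 bits, the full XXH64 digest

        def make_csum(key: bytes, data: bytes) -> bytes:
            auth = hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]
            return auth + xxhash.xxh64(data).digest()

        def verify(csum: bytes, data: bytes, key=None):
            # Returns (auth_ok, secondary_ok); auth_ok is None without a key.
            secondary_ok = csum[AUTH_BYTES:] == xxhash.xxh64(data).digest()
            if key is None:
                return (None, secondary_ok)
            auth = hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]
            return (hmac.compare_digest(csum[:AUTH_BYTES], auth), secondary_ok)

    Callers then have to map these tuples onto actions: a full match is good, a full mismatch is a plain checksum error, and the mixed cases are where the interesting decisions (corruption vs tampering, key availability) live.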

    Performance considerations

    The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH, for each metadata and data block, vs slow SHA512 and slow SHA256. As far as I can tell it's also possible to select a SHA256/SHA256 split in ZFS, but that can't beat SHA256/XXHASH.

    The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation assumed, for simplicity, that the raw key bytes are provided to the kernel and to the userspace utilities. This is maybe OK for a prototype, but it can't survive until a final release under any circumstances. There's key management wired deep into the linux kernel, and there's a library for the whole API and command line tools. We ought to use that. Pass the key by name, not as raw bytes.

    Key management has its own pitfalls and surprises (key owned vs possessed), but let's assume that there's a standardized way to obtain the key bytes from the key name. In the kernel it's "READ_USER_KEY_BYTES", in userspace it's either keyctl_read from libkeyutils or a raw syscall to keyctl. Problem solved, on the low level. But, well, don't try that over ssh.
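
    For the userspace side, reading a key's payload by its serial number might look like the following sketch (libkeyutils via Python's ctypes; the library name and error handling are simplified assumptions):

        import ctypes

        keyutils = ctypes.CDLL("libkeyutils.so.1", use_errno=True)
        keyutils.keyctl_read.restype = ctypes.c_long

        def read_key_bytes(key_serial: int) -> bytes:
            # A NULL buffer asks only for the payload length.
            n = keyutils.keyctl_read(key_serial, None, 0)
            if n < 0:
                raise OSError(ctypes.get_errno(), "keyctl_read")
            buf = ctypes.create_string_buffer(n)
            if keyutils.keyctl_read(key_serial, buf, n) < 0:
                raise OSError(ctypes.get_errno(), "keyctl_read")
            return buf.raw[:n]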

    Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify the data, or even to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.

    This results in a global option, like btrfs --auth-key 1234 inspect-internal dump-tree, compared to btrfs inspect-internal dump-tree --auth-key 1234. This is not finalized, but a global option is now the preferred choice.

    Final words

    I have a prototype that does not work in all cases but at least passes mkfs and mount. The number of checksum verification cases got beyond what I was able to fix by the time of writing this. I think this has enough matter on its own, so I'm pushing it out as part 1. There are open questions regarding the command line interface, and also some kind of proof or discussion regarding attacks is still needed. Stay tuned.

    References:

    May 23, 2021 10:00 PM

    May 22, 2021

    Brendan Gregg: What is Observability

    It's a made-up computer word that my word processor decorates with a wiggly red you-can't-spell line. At least it did until I clicked "Add to Dictionary" (it got too annoying as I was writing a book on computer observability). Some people abbreviate it as o11y.

    Observability: The ability to observe.

    Observe-ability. Observability. In computer engineering we use it to describe the tools, data sources, and methods for understanding (observing!) how a technology is operating. We don't use the _real_ word "observable" since that implies the wrong thing. Imagine "observable metrics": Are there metrics that _aren't_ observable?

    Using observability in sentences:

    - What observability tools are installed? (Means: What tools exist that only read state?)
    - What observability does that database have? (Means: What metrics and logs does it have?)
    - How do you do observability? (Means: What products do you use for metrics, tracing, etc.?)
    - Let me try some observability first. (Means: Let me look at the system without changing it.)

    Wait, aren't all performance tools observability tools? No. _Experimental_ tools change the state of the system to understand it. For example, benchmarks. As an analogy, a car's dashboard is a collection of observability tools that let you understand how the car is operating (speed, rpm, temperature). A car's 0-60 mph time is an _experiment_.

    When I was a performance consultant I'd show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn't touch: Which were observability tools and which were experimental tools. "I'll start with observability tools only" is something I'd say at the start of every engagement. Note that observability tools aren't completely harmless: Their execution consumes resources, usually negligible, but in some cases it's enough to perturb the target of study. This is the "observer effect."

    Another use of the term observability is as a reminder to switch between tool types, and not to get stuck on one. A colleague (Roch Bourbonnais, from memory) once told me:
    "You have two hands. Observability and experimentation."

    It stuck with me as it also makes the point that when you're only using one type to solve a performance problem __you're working one-handed__.

    May 22, 2021 02:00 PM

    May 20, 2021

    Linux Plumbers Conference: Scheduler Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Scheduler Microconference has been accepted into the 2021 Linux Plumbers Conference! The scheduler is a core piece of the Linux kernel, deciding what process gets to run when, where, and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference is responsible for helping many of these topics make progress.

    At last year’s meet up, the Scheduler microconference achieved the following results:

    Not only were enhancements made, but the meetup also helped prove that some topics were not feasible and we do not need to spend more time on them.

    This year’s topics to be discussed include:

    Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

    May 20, 2021 01:23 AM

    May 14, 2021

    Linux Plumbers Conference: Confidential Computing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Confidential Computing Microconference has been accepted into the 2021 Linux Plumbers Conference! In this microconference we will discuss how Linux can support encryption technologies which protect data during processing on the CPU. Examples are AMD SEV, Intel TDX, IBM Secure Execution for s390x and ARM Secure Virtualization. These are recent additions compared to technologies which protect data while in transit (SSL, VPNs) and at rest (disk encryption).

    The Linux kernel recently gained support for SEV-ES and support for Intel TDX is upcoming. AMD SEV will be further enhanced by Secure Nested Paging (SNP). Support for these technologies requires intrusive changes to the Linux kernel for memory integrity and secure interrupt delivery to virtual machines. Designing these changes in a way that works for different confidential computing technologies is one goal of this microconference.

    Topics to be included, but not limited to, are:

    Please come and join us in the discussion for solutions to the open problems for supporting these technologies.

    We hope to see you there!

    May 14, 2021 01:09 AM

    May 13, 2021

    James Bottomley: The Community Corrosive Effects of CLAs

    As one of the kernel DCO advocates, I’ve written many times about using the DCO instead of a CLA for copyright and patent contributions under open source licences. In spite of my obvious biases, I’ll try to give a factual overview of the cases for the DCO and CLA system. First, it should be noted that both the DCO and any CLA are types of Contribution Agreements (a set of terms by which contributors are agreeing to be bound). It should also be acknowledged that the DCO is a far more recent invention than CLAs. The DCO was pioneered by the Linux kernel in 2004 (having been designed by Diane Peters, then of OSDL) and was subsequently adopted by a broad range of open source projects.

    However, in legal terms, the DCO is much less well understood than a standard CLA-type agreement between the contributor and some entity, which is largely the reason you find a number of lawyers still advocating for the use of CLAs in various open source projects: because they’d like to stick with something that has more miles on it, or because they’re invested in the older model of community, largely pioneered by Apache. The biggest problem today is that the operation of most CLAs is asymmetrical: they take from the contributor more rights than the open source code actually needs, so let’s begin with a summary of each type of Contribution Agreement.

    DCO

    The DCO is a legal representation by the contributor to everyone who might ever use the code. It requires no second party on the other side to counter-sign it or act as the receiving entity, so it exactly mirrors the inbound=outbound licensing model first coined by Richard Fontana. The DCO explicitly grants to all downstream recipients only the exact rights the Open Source licence requires (and nothing more). In this sense it is fully symmetrical: the rights granted by the contributor are the same as the rights received by the downstream (i.e. inbound=outbound). Every contributor under the DCO retains their own copyright (or their company does if the contribution is a work for hire).

    The main alleged disadvantage of the DCO is that it encourages distributed ownership and makes it very hard to change the licence of the project, because each contributor has only granted the rights necessary for the current licence; so if the new one requires more or different rights, all the current contributors have to re-grant those new or different rights (which can be a huge number of people for large, long-running projects).

    Since the DCO is a representation to everyone and requires no receiving entity, the project collecting the code doesn’t require any formal legal entity, like a foundation, to operate, and thus the DCO gives rise to a truly lightweight structure for any project. The other big advantage of the DCO is that all of the representations are tracked by the Signed-off-by: tag on the commit, which goes in the git repository of the project code, so anyone with a clone of the repository has complete access to information about who changed what and where their DCO signoff is.

    CLA

    All current Open Source CLAs are structured as agreements between the contributor and a second party. Most often, the second party is a Foundation or a Corporation, making them quite heavyweight in terms of setup, admin and overhead. Every current CLA that I know about takes more rights from the contributor than the open source licence actually requires. For instance the Apache Individual CLA grants the right to copy, derive and sublicence to the Apache foundation, who then relicence the contribution to the project, usually under the Apache 2.0 licence. This is a classic asymmetric grant because the Apache foundation receives far more rights in the contribution than it grants to the downstream recipients. The FSF CLA is even more extreme because they require assignment of the copyright (so they will own the code and you, the author, will have no further right or interest in it except possibly for minimal moral rights to be named the author).

    Apart from the asymmetric grant, which places the receiving entity in a privileged position in the ecosystem, the other problem with CLAs is that they’re legal agreements, so they require a lawyer to prepare them, a mechanism to ensure people sign them and a mechanism to keep all the signatures … sometimes this can be in filing cabinets, if paper instead of electronic copies are used. This repository of agreements then isn’t available to anyone except the tracking entity, meaning that if someone needs to know if John Doe signed a CLA, they have to reach out and ask. In some cases the actual filing cabinets got lost as projects changed offices, so some CLA-based projects don’t actually have complete records of all their CLAs.

    CLAs Catalyse Community Corrosion

    The main driver of community corrosion is the temptation to abuse a position of power (this temptation becomes irresistible over time because, as Baron Acton put it, “all power corrupts”). Since CLAs by their nature force a power imbalance between the contributor and the receiving entity, they act as focal points for this corrosion. Communities are very sensitive to what they see as their work being misused, so the fastest way to lose community trust is to abuse the power the CLA gave you to go against the community itself. There are numerous examples of this in the Corporate World, the most topical one today being the Elastic change from Apache 2.0 to SSPL to better monetize the code the community contributed freely to.

    One might think the solution to this is never to sign a CLA if the holder of the power imbalance is a corporation … i.e. only do it if the other entity is a not-for-profit foundation. But ask yourself, how much do you trust the people running the foundation, and do its bylaws guarantee your rights in the code? Relicensing for commercial gain isn’t the only way the community could be abused, so how sure are you that the power you’re handing to a foundation (which, after all, is an entity governed by some type of board, all of whom likely have political agendas) won’t be abused? To see some examples of foundations not being in tune with their community, one only has to look at the FSF and Richard Stallman. Based on all of this I conclude, like Drew DeVault, that you should never sign a CLA under any circumstances.

    The bottom line is that if you do sign a CLA some decision will happen at some point that you don’t agree with but which you already gave away the power to block because of the rights imbalance inherent in the CLA you signed. Inevitably this decision will cause you to feel betrayed because your views are being ignored and as a contributor you feel you should be heard, so you’ll sour on the project. This is the community corrosion catalyst buried deep inside all CLAs.

    One final thing to note is that it is possible to craft a CLA that only takes the rights it needs, in the same way the DCO does; it’s just that no project I know of has ever done this. However, even if this experiment were attempted, you would still need a recipient entity, plus all the infrastructure to do signing and track the signed agreements, so you’d still be better off using a lightweight DCO process.

    Conclusion: For Community Small is Beautiful

    The way to avoid the community corrosion problem is to do everything minimally: use a DCO to take only the rights the downstream requires, and avoid all the heavyweight recipient, signing and tracking infrastructure. Don’t set up a foundation unless you absolutely need an entity, say to handle cash, and if you must set one up, never give it any control over the project (like appointing a change control or architecture control board, for instance). Everything you set up should be as small as possible and clearly serve the project and its community. Above all, don’t use a CLA, because it will cause a rights imbalance that corrodes your community and it will require a large amount of overhead to run.

    May 13, 2021 10:51 PM