We all know how it's possible for a guest VM to access various host functions by accessing a PCI device, right? When KVM traps an access to this fake PCI, QEMU emulates the device, which allows packets sent, console updated, or whatever. This is called "virtio".
NVIDIA took it a step further: they have a real PCI device that emuilates QEMU. No joke. And, they have a firmware bug! The following patch works around it:
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 9193c30d640a..6bbb34f9b088 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -438,6 +438,7 @@ static void vp_reset(struct virtio_device *vdev)
{
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
+ int i;
/* 0 status means a reset. */
vp_modern_set_status(mdev, 0);
@@ -446,8 +447,16 @@ static void vp_reset(struct virtio_device *vdev)
* This will flush out the status write, and flush in device writes,
* including MSI-X interrupts, if any.
*/
- while (vp_modern_get_status(mdev))
+ i = 0;
+ while (vp_modern_get_status(mdev)) {
+ if (++i >= 10000) {
+ printk(KERN_INFO
+ "virtio reset ignoring status 0x%02x\n",
+ vp_modern_get_status(mdev));
+ break;
+ }
msleep(1);
+ }
vp_modern_avq_cleanup(vdev);
I'm not dumping on NVIDIA here at all, I think it's awesome for this devious hardware to exist. And bugs are just a way of life.
They pushed the "You're one of a few experts invited to answer" notifications for a long time - maybe a year, I don't remember. When I had enough and started to capture them with the intent of mockery, they stopped. So sad. Here's what I got:
"You're facing pushback from vendors on cloud integration. How can you convince them to collaborate?"
"You're focused on cutting costs in cloud computing. How do you ensure security protocols aren't compromised?"
"You're overseeing a code review process. How do you ensure feedback boosts developer morale?"
What a dystopia. LinkedIn is owned by Microsoft, so I'm not suprised someone in a giant corporation thought this sort of nonsense was a good idea. But still, the future is stupid, and all that.
P.S. The notification inserts were non-persistent — inserted on the fly. That was just fraud w.r.t. the idea of notification ticker.
P.P.S. Does anyone else think that this sort of thing would cause self-selection? They made their AI trained by the most vain and also least bright members of their user population. I'm not an expert in any of these fields.
UPDATE 2024-10-31: Spoke too soon! They hit me with the notificantion insert: "Here's how you can craft a personalized learning plan for advancing in Cloud Computing." That is not even a formed question. Getting lazy, are we?
UPDATE 2024-11-02: "You're facing budget disputes over cloud solutions. How can you align IT and non-technical teams effectively?" They are not stopping.
Meanwhile, how about another perspective: I saw an update that Hubbert Smith contributed an answer to: "You're facing a ransomware attack crisis. How do you convey the severity to a non-technical executive?" Instead of answering what LinkedIn AI asked, he answered a question of how to deal with ransomware ("Ransomware is fixable with snapshots of sensitive data."). Unless he is an AI himself, he may be thinking that he's dealing with a LinkedIn equivalent of Quora.
Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 20301. At Intel we've been creating a new analyzer tool to help reduce AI costs called AI Flame Graphs: a visualization that shows an AI accelerator or GPU hardware profile along with the full software stack, based on my CPU flame graphs. Our first version is available to customers in the Intel Tiber AI Cloud as a preview for the Intel Data Center GPU Max Series (previously called Ponte Vecchio). Here is an example:
(Click for interactive SVG.) The green frames are the actual instructions running on the AI or GPU accelerator, aqua shows the source code for these functions, and red (C), yellow (C++), and orange (kernel) show the CPU code paths that initiated these AI/GPU programs. The gray "-" frames just help highlight the boundary between CPU and AI/GPU code. The x-axis is proportional to cost, so you look for the widest things and find ways to reduce them.
Layers
This flame graph shows a simple program for SYCL (a high-level C++ language for accelerators) that tests three implementations of matrix multiply, running them with the same input workload. The flame graph is dominated by the slowest implementation, multiply_basic(), which doesn't use any optimizations and consumes at 72% of stall samples and is shown as the widest tower. On the right are two thin towers for multiply_local_access() at 21% which replaces the accessor with a local variable, and multiply_local_access_and_tiling() at 6% which also adds matrix tiling. The towers are getting smaller as optimizations are added.
This flame graph profiler is a prototype based on Intel EU stall profiling for hardware profiling and eBPF for software instrumentation. It's designed to be easy and low-overhead, just like a CPU profiler. You should be able to generate a flame graph of an existing AI workload whenever you want, without having to restart anything or launch additional code via an interposer.
Instruction-offset Profiling
This is not the first project to build an AI profiler or even something called an AI Flame Graph, however, others I've seen focus on tracing CPU stacks and timing accelerator execution, but don't profile the instruction offsets running on the accelerator; or do profile them but via expensive binary instrumentation. I wanted to build AI flame graphs that work like CPU flame graphs: Easy to use, negligible cost, production safe, and shows everything. A daily tool for developers, with most of the visualization in the language of the developer: source code functions.
This has been an internal AI project at Intel for the past year. Intel was already investing in this space, building the EU stall profiler capability for the Intel Data Center GPU Max Series that provides an approximation of HW instruction sampling. I was lucky to have Dr. Matthew (Ben) Olson, an Intel AI engineer who has also worked on eBPF performance tooling (processwatch) as well as memory management research, join my team and do most of the development work. His background has helped us power through difficulties that seemed insurmountable. We've also recently been joined by Dr. Brandon Kammerdiener (coincidentally another graduate of the University of Tennessee, like Ben), who also has eBPF and memory internals experience, and has been helping us take on harder and harder workloads. And Gabriel Muñoz just joined today to help with releases. Now that our small team has shown that this is possible, we'll be joined by other teams at Intel to develop this further.
We could have built a harder-to-use and higher-overhead version months ago using Intel GTPin but for widespread adoption it needs minimal overhead and ease of use so that developers don't hesitate to use this daily and to add it to deployment pipelines.
What's a Flame Graph?
A flame graph is a visualization I invented in 2011 for showing sampled code stack traces. It has become the standard for CPU profiling and analysis, helping developers quickly find performance improvements and eliminate regressions. A CPU flame graph shows the "big picture" of running software, with x-axis proportional to CPU cost. The example picture on the right summarizes how easy it can be to go from compute costs to responsible code paths. Prior to flame graphs, it could take hours to understand a complex profile by reading through hundreds of pages of output. Now it takes seconds: all you have to do is look for the widest rectangles.
Flame graphs have had worldwide adoption. They have been the basis for five startups so far, have been adopted in over thirty performance analysis products, and have had over eighty implementations.
My first implementation of flame graphs took a few hours on a Wednesday night after work. The real effort has been in the decade since, where I worked with different profilers, runtimes, libraries, kernels, compilers, and hypervisors to get flame graphs working properly in different environments, including fixing stack walking and symbolization. Earlier this year I posted about the final missing piece: Helping distros enable frame pointers so that profiling works across standard system libraries.
Similar work is necessary for AI workloads: fixing stacks and symbols and getting profiling to work for different hardware, kernel drivers, user-mode drivers, frameworks, runtimes, languages, and models. A lot more work, too, as AI analysis has less maturity than CPU analysis.
Searching Samples
If you are new to flame graphs, it's worth highlighting the built-in search capability. In the earlier example, most of the stall samples are caused by sbid: software scoreboard dependency. As that may be a unique search term, you can run search (Ctrl-F, or click "Search") on "sbid" and it will highlight it in magenta:
Search also shows the total number of stack samples that contained sbid in the bottom right: 78.4%. You can search for any term in the flame graph: accelerator instructions, source paths, function names, etc., to quickly calculate the percentage of stacks where it is present (excluding vertical overlap) helping you prioritise performance work.
Note that the samples are EU stall-based, which means theoretical performance wins can take the percentages down to zero. This is different to timer-based samples as are typically used in CPU profiling. Stalls mean you better focus on the pain, the parts of the code that aren't making forward progress, but you aren't seeing resource usage by unstalled instructions. I'd like to supuport timer-based samples in the future as well, so we can have both views.
Who will use this?
At a recent golang conference, I asked the audience of 200+ to raise their hands if they were using CPU flame graphs. Almost every hand went up. I know of companies where flame graphs are a daily tool that developers use to understand and tune their code, reducing compute costs. This will become a daily tool for AI developers.
My employer will use this as well for evaluation analysis, to find areas to tune to beat competitors, as well as to better understand workload performance to aid design.
Why is AI profiling hard?
Consider CPU instruction profiling: This is easy when the program and symbol table are both in the file system and in a standardized file format (such as ELF) as is the case with native compiled code (C). CPU profiling gets hard for JIT-complied code, like Java, as instructions and symbols are dynamically generated and placed in main memory (the process heap) without following a universal standard. For such JITted code we use runtime-specific methods and agents to retrieve snapshots of the heap information, which is different for each runtime.
AI workloads also have different runtimes (and frameworks, languages, user-mode drivers, compilers, etc.) any of which can require special tinkering to get their CPU stacks and symbols to work. These CPU stacks are shown as the red, orange, and yellow frames in the AI Flame Graph. Some AI workloads are easy to get these frames working, some (like PyTorch) are a lot more work.
But the real challenge is instruction profiling of actual GPU and AI accelerator programs -- shown as the aqua and green frames -- and correctly associating them with the CPU stacks beneath them. Not only may these GPU and AI programs not exist in the file system, but they may not even exist in main memory! Even for running programs. Once execution begins, they may be deallocated from main memory and only exist in special accelerator memory, beyond the direct reach of OS profilers and debuggers. Or within reach, but only through a prohibitively high-overhead HW-specific debugger interface.
There's also no /proc representation for these programs either (I've been proposing building an equivalent) so there's no direct way to even tell what is running and what isn't, and all the other /proc details. Forget instruction profiling, even ps(1) and all the other process tools do not work.
It's been a mind-bending experience, revealing what gets taken for granted because it has existed in CPU land for decades: A process table. Process tools. Standard file formats. Programs that exist in the file system. Programs running from main memory. Debuggers. Profiliers. Core dumping. Disassembling. Single stepping. Static and dynamic instrumentation. Etc. For GPUs and AI, this is all far less mature. It can make the work exciting at times, when you think something is impossible and then find or devise a way.
Fortunately we have a head start as some things do exist. Depending on the runtime and kernel driver, there are debug interfaces where you can list running accelerator programs and other statistics, as used by tools like intel_gpu_top(1). You can kill -9 a GPU workload using intel_gpu_abrt(1). Some interfaces can even generate basic ELF files for the running accelerator programs that you can try to load in a debugger like gdb(1). And there is support for GPU/AI program disassembly, if you can get your hands on the binary. It feels to me like GPU/AI debugging, OS style, is about two years old. Better than zero, but still early on, and lots more ahead of us. A decade, at least.
What do AI developers think of this?
We've shown AI Flame Graphs to other AI developers at Intel and a common reaction is to be a bit puzzled, wondering what to do with it. AI developers think about their bit of code, but with AI Flame Graphs they can now see the entire stack for the first time, including the HW, and many layers they don't usually think about or don't know about. It basically looks like a pile of gibberish with their code only a small part of the flame graph.
CPU Flame Graph Implementations
This reaction is similar to people's first experiences with CPU flame graphs, which show parts of the system that developers and engineers typically don't work on, such as runtime internals, system libraries, and kernel internals. Flame graphs are great at highlighting the dozen or so functions that matter the most, so it becomes a problem of learning what those functions do across a few different code bases, which are typically open source. Understanding a dozen such functions can take a few hours or even a few days -- but if this leads to a 10% or 2x cost win, it is time well spent. And the next time the user looks at a flame graph, they start saying "I've seen that function before" and so on. You can get to the point where understanding the bulk of a CPU flame graph takes less than a minute: look for the widest tower, click to zoom, read the frames, done.
I'm encouraged by the success of CPU flame graphs, with over 80 implementations and countless real world case studies. Sometimes I'm browsing a performance issue I care about on github and hit page down and there's a CPU flame graph. They are everywhere.
I expect AI developers will also be able to understand AI Flame Graphs in less than a minute, but to start with people will be spending a day or more browsing code bases they didn't know were involved. Publishing case studies of found wins will also help people learn how to interpret them, and also help explain the value.
What about PyTorch?
Another common reaction we've had is that AI developers are using PyTorch, and initially we didn't support it as it meant walking Python stacks, which isn't trivial. But prior work has been done there (to support CPU profiling) and after a lot of tinkering we now have the first PyTorch AI Flame Graph:
PyTorch frames in pink
(Click for interactive SVG.) The PyTorch functions are at the bottom and are colored pink. This example runs oneDNN kernels that are JIT-generated, and don't have a source path so that layer just reads "jit". Getting all other the layers included was a real pain to get going, but an important milestone. We think if we can do PyTorch we can do anything.
In this flame graph, we show PyTorch running the Llama 2 7B model using the Intel Extensions for PyTorch (IPEX). This flame graph shows the origin of the GPU kernel execution all the way back to the Python source code shown in pink. Most samples are from a stack leading up to a gemm_kernel (matrix multiply) shown in aqua, which like the previous example has many stalls due to software scoreboarding.
There are two instructions here (0xa30 and 0xa90) that combined are 27% of the entire profile. I expect someone will ask: Can't we just click on instructions and have it bring up a dissassembly view with full source? Yes, that should be possible, but I can't answer how we're going to provide this yet. Another expected question I can't yet answer: Since there are now multiple products providing AI auto-tuning of CPU workloads using CPU flame graphs (including Intel Granulate) can't we have AI auto-tuning of AI workloads using AI Flame Graphs?
First Release: Sometimes hard and with moderate overhead
Getting AI Flame Graphs to work with some workloads is easy, but others are currently hard and cost moderate overhead. It's similar to CPU profiling, where some workloads and languages are easy to profile, whereas others need various things fixed. Some AI workloads use many software dependencies that need various tweaks and recompilation (e.g., enabling frame pointers so that stack walking works) making setup time consuming. PyTorch is especially difficult and can take over a week of OS work to be ready for AI Flame Graphs. We will work on getting these tweaks changed upstream in their respective repositories, something involving teams inside and outside of Intel, and is a process I'd expect to take at least a year. During that time AI workloads will gradually become easier to flame graph, and with lower-overhead as well.
I'm reminded of eBPF in the early days: You had to patch and recompile the kernel and LLVM and Clang, which could take multiple days if you hit errors. Since then all the eBPF dependency patches have been merged, and default settings changed, so that eBPF "just works." We'll get there with AI Flame Graphs too, but right now it's still those early days.
The changes necessary for AI Flame Graphs are really about improving debugging in general, and are a requirement for Fast by Friday: A vision where we can root-cause analyze anything in five days or less.
Availability
AI Flame Graphs will first become available on the Intel Tiber AI Cloud as a preview feature for the Intel Data Center GPU Max Series. If you are currently deployed there you can ask through the Intel service channel for early access. As for if or when it will support other hardware types, be in other Intel products, be officially launched, be open source, etc., these involve various other teams at Intel and they need to make their own announcements before I can discuss them here.
Conclusions
Finding performance improvements for AI data centers of just fractions of a percent can add up to planetary savings in electricity, water, and money. If AI flame graphs have the success that CPU flame graphs have had, I'd expect finding improvements of over 10% will be common, and 50% and higher will eventually be found*. But it won't be easy in these early days as there are still many software components to tweak and recompile, and software layers to learn about that are revealed in the AI flame graph.
In the years ahead I imagine others will build their own AI flame graphs that look the same as this one, and there may even be startups selling them, but if they use more difficult-to-use and higher-overhead technologies I fear they could turn companies off the idea of AI flame graphs altogether and prevent them from finding sorely needed wins. This is too important to do badly. AI flame graphs should be easy to use, cost negligible overhead, be production safe, and show everything. Intel has proven it's possible.
Disclaimer
* This is a personal blog post that makes personal predictions but not guarantees of possible performance improvements. Feel free to take any claim with a grain of salt, and feel free to wait for an official publication and public launch by Intel on this technology.
Thanks to everyone at Intel who have helped us make this happen. Markus Flierl has driven this project and made it a top priority, and Greg Lavender has expressed his support. Special thanks to Michael Cole, Matthew Roper, Luis Strano, Rodrigo Vivi, Joonas Lahtinen, Stanley Gambarin, Timothy Bauer, Brandon Yates, Maria Kraynyuk, Denis Samoylov, Krzysztof Raszknowski, Sanchit Jain, Po-Yu Chen, Felix Degrood, Piotr Rozenfeld, Andi Kleen, and all of the other coworkers that helped clear things up for us, and thanks in advance for everyone else who will be helping us in the months ahead.
My final thanks is to the companies and developers who do the actual hands-on work with flame graphs, collecting them, examining them, finding performance wins, and applying them. You are helping save the planet.
I sincerely regret to see Linux kernel patches like this one removing Russian developers from the MAINTAINERS
file. To me, it is a sign or maybe even
a symbol of how far the Linux kernel developer community I remember from ~ 20 years ago has changed, and how
much it has alienated itself from what I remember back in the day.
In my opinion this commit is wrong at so many different levels:
it is intransparent. Initially it gave no explanation whatsoever (other than some compliance
hand-waving). There was some follow-up paraphrasing one paragraph of presumed legal advice that was given
presumably by Linux Foundation to Linus. That's not a thorough legal analysis at all. It doesn't even say
to whom it was given, and who (the individual developers? Linux Foundation? Distributors?) is presumed to be
subject to the unspecified regulations in which specific jurisdiction
it discriminates developers based on their presumed [Russian] nationality based on their name, e-mail
address domain name or employer.
A later post in the thread has clarified
that it's about an U.S. embargo list against certain Russian individuals / companies. It is news to me that
the MAINTAINERS file was usually containing Companies or that the Linux kernel development is Companies
engaging with each other. I was under the naive assumption that it's individual developers who work together,
and their employers do not really matter. Contributions are judged by their merit, and not by the author or
their employer / affiliation. In the super unlikely case that indeed those individual developers removed from
the MAINTAINERS file would be personally listed in the embargo list: Then yes, of course, I agree, they'd have
to be removed. But then the commit log should of course point to [the version] of that list and explicitly
mention that they were personally listed there.
And no, I am of course not a friend of the Russian government at all. They are committing war crimes,
no doubt about it. But since when has the collaboration of individual developers in an open source project
been something related to actions completely unrelated to those individuals? Should I as a German developer
be excluded due to the track record of Germany having started two world wars killing millions? Should
Americans be excluded due to a very extensive track record of violating international law? Should we exclude
Palestinians? Israelis? Syrians? Iranians? [In case it's not obvious: Those are rhetorical questions, my
position is of course no to all of them].
I just think there's nothing more wrong than discriminating against people just because of their passport,
their employer or their place of residence. Maybe it's my German upbringing/socialization, but we've had
multiple times in our history where the concept of **Sippenhaft** (kin liability) existed. In those dark ages of history you
could be prosecuted for crimes committed by other family members.
Now of course removal from the MAINTAINERS file or any other exclusion from the Linux kernel development
process is of course not in any way comparable to prosecution like imprisonment or execution. However, the
principle seems the same: An individual is punished for mere association with some others who happen to be
committing crimes.
Now if there really was a compelling legal argument for this (I doubt it, but let's assume for a second
there is): In that case I'd expect a broad discussion against it; a reluctance to comply with it; a search
for a way to circumvent said legal requirement; a petition or political movement against that requirement.
Even if there was absolutely no way around performing such a "removal of names": At the very least I'd expect
some civil disobedience by at least then introducing a statement into the file that one would have hoped to
still be listing those individuals as co-maintainers but one was forced by [regulation, court order, ...] to
remove them.
But the least I would expect is for senior Kernel developers to simply do apply the patch with a one-sentence
commit log message and thereby disrespect the work of said [presumed] Russian developers. All that does is to
alienate individuals of the developer community. Not just those who are subject to said treatment today, but
any others who see this sad example how Linux developers treat each other and feel discouraged from becoming
or remaining active in a community with such behaviour.
It literally hurts me personally to see this happening. It's like a kick in the gut. I used to be proud
about having had an involvement with the Linux kernel community in a previous life. This doesn't feel like
the community I remember being part of.
As some of you know, I've been spending a lot of time in recent years researching (and practically exploring +
re-implementing) historical telecommunications with my retronetworking project.
Retrocomputing itself is not my main focus. I usually feel there's more than enough people operating,
repairing, documenting at least many older computers, as well as keeping archives of related software and
continuing to spread knowledge on how they operated. Nevertheless, it is a very interesting topic - I just
decided that with my limited spare time I want to focus on retro-communications which is under-explored and
under-represented.
What's equally important than keeping the old technology alive, is keeping the knowledge around its creation
alive. How did it happen that certain technologies were created and became successful or not? How where they
key people behind it? etc.
Given my personal history with Taiwan during the last 18 years, it's actually surprising I haven't yet given
thought on how or where the history of the Taiwanese IT industry is documented or kept alive. So far I didn't
know of any computer museums that would focus especially on the Taiwanese developments. It didn't even occur
to me to even check if there are any.
During my work in Taiwan I've had the chance to briefly meet a few senior people at FIC (large mainboard maker
that made many PC mainboards I personally used) and both at VIA (chipset + CPU maker). But I didn't ever have
a chance to talk about the history.
In any case, I now found those transcripts of interviews. And what a trove of interesting first-hand
information they are! If you have an interest in computer history, and want to understand how it came about
that Taiwan became such a major player in either the PC industry or in the semiconductor design +
manufacturing, then I believe those transcripts are a "must read".
Now they've made me interested to learn more. I have little hope of many books being published on that
subject, particularly in a Language I can read (i.e. English, not mandarin Chinese). But I shall research
that subject. I'd also be interested to hear about any other information, like collections of historical
artifacts, archives, libraries, etc. So in the unlikely case anybody reading this has some pointers on
information about the history of the Taiwanese Chip and Computer history, please by all means do reach out and
share!.
Once I have sufficiently prepared myself in reading whatever I can find in terms of written materials, I might
be tempted to try to reach out and see if I can find some first-hand witnesses who'd want to share their
stories on a future trip to Taiwan...
Some of the readers of this blog know that I have a very special relationship with Taiwan. As a teenager, it
was the magical far-away country that built most of the PC components in all my PCs since my first 286-16 I
got in 1989. Around 2006-2008 I had the very unexpected opportunity to work in Taiwan for some time (mainly
for Openmoko, later some consulting for VIA). During that time I have always felt most welcome in and fascinated
by the small island nation who managed to turn themselves into a high-tech development and manufacturing site
for ever more complex electronics. And who managed to evolve from decades of military dictatorship and turn
into a true democracy - all the while being discriminated by pretty much all of the countries around the
world, as everybody wanted to benefit from cheap manufacturing in mainland China and hence expel democratic
Taiwan from the united nations in favour of communist mainland Chine.
I have the deepest admiration for Taiwan to manage all of their economic success and progress in terms of
democracy and freedom despite the political situation across the Taiwan strait, and despite everything that
comes along with it. May they continue to have the chance of continuing their path.
Setting economy, society and politics behind: On a more personal level I've enjoyed their culinary marvels
from excellent dumplings around every street corner to niu rou mien (beef noodle soup) to ma la huo guo
(spicy hot pot). Plus then the natural beauty, particularly of the rural mountainous regions once you leave
the densely populated areas around the coast line and the plains of the north west.
While working in Taiwan in 2006/2007 I decided to buy a motorbike. Using that bike I've first made humble
day trips and later (once I was no longer busy with stressful work at Openmoko) multiple week-long road trips
around the island, riding on virtually any passable road you can find. My typical routing algorithm is "take
the smallest possible road from A to B".
So even after concluding my work in Taiwan, I returned again and again for holidays, each one with more road
trips. For some time, Taiwan had literally become my second home. I had my favorite restaurants, shops, as
well as some places around the rural parts of the Island I cam back to several times. I even managed to take
up some mandarin classes, something I never had the time for while doing [more than] full time work. To my
big regret, it's still very humble beginner level; I guess had I not co-started a company (sysmocom) in Berlin
in 2011, I'd have spent more time for a more serious story.
In any case, I have nothing but the fondest memory of Taiwan. My frequent visits cam to a forcible halt with
the COVID-19 pandemic, Taiwan was in full isolation in 2020/21, and even irrespective of government
regulations, I've been very cautious about travel and contact. Plus of course, there's always the bad
conscience of frequent intercontinental air travel.
Originally I was planning to finally go on an extended Taiwan holiday in Summer 2024, but then the island was
hit by a relatively serious earthquake in April, affecting particularly many of the remote mountain regions
that are of main interest to me. There are some roads that I'd have wanted to ride ever since 2008, but which
had been closed every successive year when I went there, due to years of reconstructions after [mostly
landslides following] earthquakes and typhoons. So I decided to postpone it for another year to 2025.
However, in an unexpected change of faith, the opportunity arose to give the opening Keyonte at the 2024
Open Compliance Summit in Japan, and along with that the opportunity to do a stop-over in Taiwan. It will
just be a few days of Taipei this time (no motorbike trips), but I'm very much looking forward to being
back in the city I probably know second or third-best on the planet (after Berlin, my home for 23 years, as
well as Nuernberg, my place of birth). Let's see what is still the same and what has changed during the past
5 years!
First, let me paraphrase something from my LiveJournal profile: These posts are my own, and in particular do not necessarily reflect my employer's positions, strategies, or opinions.
With that said, some say that the current geopolitical outlook is grim. And far be it from me to minimize the present-day geopolitical problems, nor am I at all interested in comparing them to their counterparts in the "good old days". But neither do I wish to obsess on these problems. I will instead call attention to a few instances of global cooperation, current and past.
Last month, NASA's oldest active astronaut traveled to Kazakhstan's Baikonur Cosmodrome, entered a Soyuz capsule atop a Roscosmos rocket and flew to the International Space Station. For me, this is especially inspiring: If he can do that at age 69, I should certainly be able to continue doing my much less demanding job for many years to come.
Some decades ago, during the Cold War, I purchased an English translation of Gradshteyn's and Ryzhik's classic "Table of Integrals, Series, and Products". Although computer-algebra systems have largely replaced this book, I have used it within the past few years and I used it heavily in the 1980s and early 1990s. Thus, along with many others, I am indebted to the longstanding Russian tradition of excellence in mathematics.
Many other countries have also made many excellent contributions to mathematics, science, and technology. For example, the smartphone that I used hails from South Korea. And earlier this year, SeongJae (SJ) Park completed a Korean translation of the Second Edition of "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
Returning to rocketry, China started working with rockets in the 1200s, if not earlier, and has made a great deal of more recent progress in a wide variety of fields. And rumor has it that a Chinese translation of the Second Edition will be appearing shortly.
So if you tried reading this book, but the English got in the way, you now have two other options and hopefully soon a third! But what if you want a fourth option? Then you, too, can do a translation! Just send me a translated chapter and I will add it to the list in the book's FAQ.txt file.
Because FreeCAD was such a disaster for me, I started looking at crazy solutions, like exporting STEP from OpenSCAD. I even stooped to looking at proprietary alternatives. First on the runway was SolidWorks. If it's good for Mark Serbu, surely it's good for me, right?
The first thing I found, you cannot tap your card and download. You have to contact a partner representative — never a good sign. The representative quoted me for untold thousands. I'm not going to post the amount, I'm sure they vary it every time, like small shop owners who vary prices according to the race of the shopper.
In addition, they spam like you would not believe. First you have to unsubscribe from the partner, next from community.3ds.com, next from draftsight.3ds.com, and so on. Eventually, you'll get absolutely random spam, you try to unsubscribe, and they just continue and spam. Fortunately, I used a one-time address, and I killed it. Phew.
A few years ago Mike and I discussed adding video support to zink, so that we could provide vaapi on top of vulkan video implementations.
This of course got onto a long TODO list and we nerdsniped each other into moving it along, this past couple of weeks we finally dragged it over the line.
This MR adds initial support for zink video decode on top of Vulkan Video. It provides vaapi support. Currently it only support H264 decode, but I've implemented AV1 decode and I've played around a bit with H264 encode. I think adding H265 decode shouldn't be too horrible.
I've tested this mainly on radv, and a bit on anv (but there are some problems I should dig into).
Thank you to everyone who attended Linux Plumbers 2024 both in person and virtually!
This year we were able to accommodate huge demand for in-person participation and we were glad to see more than 700 people in the Austria Center.
As in previous years after the pandemic we also had a virtual component with more than 200 participants.
We had a lot of great content in Refereed Track, Kernel Summit, eBPF and Networking Summits and Toolchains Track and a lot of productive discussions in 24 microconferences.
There also were 25 Birds-of-a-Feather sessions, many of them were added during the event to continue a discussion that started in a microconference or in the Hallway Track.
There are recordings of live streams and we hope to have recordings of all the sessions soon.
Finally, I want to thank all those that were involved in making Linux Plumbers the best technical conference there is. This would not have happened without the hard work from the planning committee (Alice Ferrazzi, André Almeida, Christian Brauner, David Woodhouse, James Bottomley, Kate Stewart, Lorenzo Pieralisi, Shuah Khan, Song Liu, Steve Rostedt, Tim Bird), the runners of the Networking and BPF Summit tracks, the Toolchain track, Kernel Summit, and those that put together the very productive microconferences. I would also like to thank all those that presented as well as those who attended both in-person and virtually.
I want to thank our sponsors for their continued support, without them Linux Plumbers Conference would not be possible.
And a very special thanks to the Linux Foundation and their staff who did really great job behind the scenes and on-site to make this conference run smoothly. Their work is greatly appreciated by the LPC planning committee.
Sincerely,
Mike Rapoport
Linux Plumbers 2024 Conference chair
To get a feel for how the BBB platform works. In addition, your credentials are the email address you registered with in cvent and the confirmation number of the registration it sent you back. You can use those to log in here:
There's been a couple of mentions of Rust4Linux in the past week or two, one from Linus on the speed of engagement and one about Wedson departing the project due to non-technical concerns. This got me thinking about project phases and developer types.
Archetypes:
I will regret making an analogy, in an area I have no experience in, but let's give it a go with a road building analogy.
Let's sort developers into 3 rough categories. Let's preface by saying not all developers fit in a single category throughout their careers, and some developers can do different roles on different projects, or on the same project simultaneously.
1. Wayfinders/Mapmakers
I want to go build a hotel somewhere but there exists no map or path. I need to travel through a bunch of mountains, valleys, rivers, weather, animals, friendly humans, antagonistic humans and some unknowns. I don't care deeply about them, I want to make a path to where I want to go. I hit a roadblock, I don't focus on it, I get around it by any means necessary and move onto the next one. I document the route by leaving maps, signs. I build a hotel at the end.
2. Road builders
I see the hotel and path someone has marked out. I foresee that larger volumes will want to traverse this path and build more hotels. The roadblocks the initial finder worked around, I have to engage with. I engage with each roadblock differently. I build a bridge, dig a tunnel, blow up some stuff, work with with/against humans, whatever is necessary to get a road built to the place the wayfinder built the hotel. I work on each roadblock until I can open the road to traffic. I can open it in stages, but it needs a completed road.
3. Road maintainers
I've got a road, I may have built the road initially. I may no longer build new roads. I've no real interest in hotels. I deal with intersections with other roads controlled by other people, I interact with builders who want to add new intersections for new roads, and remove old intersections for old roads. I fill in the holes, improve safety standards, handle the odd wayfinder wandering across my 8 lanes.
Interactions:
Wayfinders and maintainers is the most difficult interaction. Wayfinders like to move freely and quickly, maintainers have other priorities that slow them down. I believe there needs to be road builders engaged between the wayfinders and maintainers.
Road builders have to be willing to expend the extra time to resolving roadblocks in the best way possible for all parties. The time it takes to resolve a single roadblock may be greater than the time expended on the whole wayfinding expedition, and this frustrates wayfinders. The builder has to understand what the maintainers concerns are and where they come from, and why the wayfinder made certain decisions. They work via education and trust building to get them aligned to move past the block. They then move down the road and repeat this process until the road is open. How this is done might change depending on the type of maintainers.
Maintainer types:
Maintainers can fall into a few different groups on a per-new road basis, and how do road builders deal with existing road maintainers depends on where they are for this particular intersection:
1. Positive and engaged
Aligned with the goal of the road, want to help out, design intersections, help build more roads and more intersections. Will often have helped wayfinders out.
2. Positive with real concerns
Agrees with the road's direction, might not like some of the intersections, willing to be educated and give feedback on newer intersection designs. Moves to group 1 or trusts that others are willing to maintain intersections on their road.
3. Negative with real concerns
Don't agree fully with road's direction or choice of building material. Might have some resistance to changing intersections, but may believe in a bigger picture so won't actively block. Hopefully can move to 1 or 2 with education and trust building.
4. Negative and unwilling
Don't agree with the goal, don't want the intersection built, won't trust anyone else to care about their road enough. Education and trust building is a lot more work here, and often it's best to leave these intersections until later, where they may be swayed by other maintainers having built their intersections. It might be possible to build a reduced intersection. but if they are a major enough roadblock in a very busy road, then a higher authority might need to be brought in.
5. Don't care/Disengaged
Doesn't care where your road goes and won't talk about intersections. This category often just need to be told that someone else will care about it and they will step out of the way. If they are active blocks or refuse interaction then again a higher authority needs to be brought in.
Where are we now?
I think the r4l project has a had lot of excellent wayfinding done, has a lot of wayfinding in progress and probably has a bunch of future wayfinding to do. There are some nice hotels built. However now we need to build the roads to them so others can build hotels.
To the higher authority, the road building process can look slow. They may expect cars to be driving on the road already, and they see roadblocks from a different perspective. A roadblock might look smaller to them, but have a lot of fine details, or a large roadblock might be worked through quickly once it's engaged with.
For the wayfinders the process of interacting with maintainers is frustrating and slow, and they don't enjoy it as much as wayfinding, and because they still only care about the hotel at the end, when a maintainer gets into the details of their particular intersection they don't want to do anything but go stay in their hotel.
The road will get built, it will get traffic on it. There will be tunnels where we should have intersections, there will be bridges that need to be built from both sides, but I do think it will get built.
I think my request from this is that contributors should try and identify the archetype they currently resonate with and find the next group over to interact with.
For wayfinders, it's fine to just keep wayfinding, just don't be surprised when the road building takes longer, or the road that gets built isn't what you envisaged.
For road builder, just keep building, find new techniques for bridging gaps and blowing stuff up when appropriate. Figure out when to use higher authorities. Take the high road, and focus on the big picture.
For maintainers, try and keep up with modern road building, don't say 20 year old roads are the pinnacle of innovation. Be willing to install the rumble strips, widen the lanes, add crash guardrails, and truck safety offramps. Understand that wayfinders show you opportunities for longer term success and that road builders are going to keep building the road, and the result is better if you engage positively with them.
Every year the Android Micro-conference brings the upstream Linux community and the Android systems developers together at the Linux Plumbers Conference. They discuss how they can effectively engage the existing issues and collaborate on upcoming changes to the Android platform and their upstream dependencies.
This year Android MC is scheduled to start at 10am on Friday, 20th Sep at Hall L1 (Austria Center). Attending Android MC gives you a chance to contribute to the broader discussion on Android platform ecosystem and Linux kernel development. You can share your own experiences, offer feedback, and help shape the future direction of these technologies.
Discussion topics for this year include:
Android kernel support and long term AOSP maintainership to support device longevity
Android MC will be followed by a Android BoF session, which will be a audience directed discussion. It can be a follow-up of the discussions from any of the Android MC topics or a free-form discussion on Android related topics.
Long version: When UEFI Secure Boot was specified, everyone involved was, well, a touch naive. The basic security model of Secure Boot is that all the code that ends up running in a kernel-level privileged environment should be validated before execution - the firmware verifies the bootloader, the bootloader verifies the kernel, the kernel verifies any additional runtime loaded kernel code, and now we have a trusted environment to impose any other security policy we want. Obviously people might screw up, but the spec included a way to revoke any signed components that turned out not to be trustworthy: simply add the hash of the untrustworthy code to a variable, and then refuse to load anything with that hash even if it's signed with a trusted key.
Unfortunately, as it turns out, scale. Every Linux distribution that works in the Secure Boot ecosystem generates their own bootloader binaries, and each of them has a different hash. If there's a vulnerability identified in the source code for said bootloader, there's a large number of different binaries that need to be revoked. And, well, the storage available to store the variable containing all these hashes is limited. There's simply not enough space to add a new set of hashes every time it turns out that grub (a bootloader initially written for a simpler time when there was no boot security and which has several separate image parsers and also a font parser and look you know where this is going) has another mechanism for a hostile actor to cause it to execute arbitrary code, so another solution was needed.
And that solution is SBAT. The general concept behind SBAT is pretty straightforward. Every important component in the boot chain declares a security generation that's incorporated into the signed binary. When a vulnerability is identified and fixed, that generation is incremented. An update can then be pushed that defines a minimum generation - boot components will look at the next item in the chain, compare its name and generation number to the ones stored in a firmware variable, and decide whether or not to execute it based on that. Instead of having to revoke a large number of individual hashes, it becomes possible to push one update that simply says "Any version of grub with a security generation below this number is considered untrustworthy".
So why is this suddenly relevant? SBAT was developed collaboratively between the Linux community and Microsoft, and Microsoft chose to push a Windows update that told systems not to trust versions of grub with a security generation below a certain level. This was because those versions of grub had genuine security vulnerabilities that would allow an attacker to compromise the Windows secure boot chain, and we've seen real world examples of malware wanting to do that (Black Lotus did so using a vulnerability in the Windows bootloader, but a vulnerability in grub would be just as viable for this). Viewed purely from a security perspective, this was a legitimate thing to want to do.
(An aside: the "Something has gone seriously wrong" message that's associated with people having a bad time as a result of this update? That's a message from shim, not any Microsoft code. Shim pays attention to SBAT updates in order to avoid violating the security assumptions made by other bootloaders on the system, so even though it was Microsoft that pushed the SBAT update, it's the Linux bootloader that refuses to run old versions of grub as a result. This is absolutely working as intended)
The problem we've ended up in is that several Linux distributions had not shipped versions of grub with a newer security generation, and so those versions of grub are assumed to be insecure (it's worth noting that grub is signed by individual distributions, not Microsoft, so there's no externally introduced lag here). Microsoft's stated intention was that Windows Update would only apply the SBAT update to systems that were Windows-only, and any dual-boot setups would instead be left vulnerable to attack until the installed distro updated its grub and shipped an SBAT update itself. Unfortunately, as is now obvious, that didn't work as intended and at least some dual-boot setups applied the update and that distribution's Shim refused to boot that distribution's grub.
What's the summary? Microsoft (understandably) didn't want it to be possible to attack Windows by using a vulnerable version of grub that could be tricked into executing arbitrary code and then introduce a bootkit into the Windows kernel during boot. Microsoft did this by pushing a Windows Update that updated the SBAT variable to indicate that known-vulnerable versions of grub shouldn't be allowed to boot on those systems. The distribution-provided Shim first-stage bootloader read this variable, read the SBAT section from the installed copy of grub, realised these conflicted, and refused to boot grub with the "Something has gone seriously wrong" message. This update was not supposed to apply to dual-boot systems, but did anyway. Basically:
1) Microsoft applied an update to systems where that update shouldn't have been applied 2) Some Linux distros failed to update their grub code and SBAT security generation when exploitable security vulnerabilities were identified in grub
The outcome is that some people can't boot their systems. I think there's plenty of blame here. Microsoft should have done more testing to ensure that dual-boot setups could be identified accurately. But also distributions shipping signed bootloaders should make sure that they're updating those and updating the security generation to match, because otherwise they're shipping a vector that can be used to attack other operating systems and that's kind of a violation of the social contract around all of this.
It's unfortunate that the victims here are largely end users faced with a system that suddenly refuses to boot the OS they want to boot. That should never happen. I don't think asking arbitrary end users whether they want secure boot updates is likely to result in good outcomes, and while I vaguely tend towards UEFI Secure Boot not being something that benefits most end users it's also a thing you really don't want to discover you want after the fact so I have sympathy for it being default on, so I do sympathise with Microsoft's choices here, other than the failed attempt to avoid the update on dual boot systems.
Anyway. I was extremely involved in the implementation of this for Linux back in 2012 and wrote the first prototype of Shim (which is now a massively better bootloader maintained by a wider set of people and that I haven't touched in years), so if you want to blame an individual please do feel free to blame me. This is something that shouldn't have happened, and unless you're either Microsoft or a Linux distribution it's not your fault. I'm sorry.
(The issues described in this post have been fixed, I have not exhaustively researched whether any other issues exist)
Feeld is a dating app aimed largely at alternative relationship communities (think "classier Fetlife" for the most part), so unsurprisingly it's fairly popular in San Francisco. Their website makes the claim:
Can people see what or who I'm looking for? No. You're the only person who can see which genders or sexualities you're looking for. Your curiosity and privacy are always protected.
which is based on you being able to restrict searches to people of specific genders, sexualities, or relationship situations. This sort of claim is one of those things that just sits in the back of my head worrying me, so I checked it out.
First step was to grab a copy of the Android APK (there are multiple sites that scrape them from the Play Store) and run it through apk-mitm - Android apps by default don't trust any additional certificates in the device certificate store, and also frequently implement certificate pinning. apk-mitm pulls apart the apk, looks for known http libraries, disables pinning, and sets the appropriate manifest options for the app to trust additional certificates. Then I set up mitmproxy, installed the cert on a test phone, and installed the app. Now I was ready to start.
What became immediately clear was that the app was using graphql to query. What was a little more surprising is that it appears to have been implemented such that there's no server state - when browsing profiles, the client requests a batch of profiles along with a list of profiles that the client has already seen. This has the advantage that the server doesn't need to keep track of a session, but also means that queries just keep getting larger and larger the more you swipe. I'm not a web developer, I have absolutely no idea what the tradeoffs are here, so I point this out as a point of interest rather than anything else.
Anyway. For people unfamiliar with graphql, it's basically a way to query a database and define the set of fields you want returned. Let's take the example of requesting a user's profile. You'd provide the profile ID in question, and request their bio, age, rough distance, status, photos, and other bits of data that the client should show. So far so good. But what happens if we request other data?
graphql supports introspection to request a copy of the database schema, but this feature is optional and was disabled in this case. Could I find this data anywhere else? Pulling apart the apk revealed that it's a React Native app, so effectively a framework for allowing writing of native apps in Javascript. Sometimes you'll be lucky and find the actual Javascript source there, but these days it's more common to find Hermes blobs. Fortunately hermes-dec exists and does a decent job of recovering something that approximates the original input, and from this I was able to find various lists of database fields.
So, remember that original FAQ statement, that your desires would never be shown to anyone else? One of the fields mentioned in the app was "lookingFor", a field that wasn't present in the default profile query. What happens if we perform the incredibly complicated hack of exporting a profile query as a curl statement, add "lookingFor" into the set of requested fields, and run it?
Oops.
So, point 1 is that you can't simply protect data by having your client not ask for it - private data must never be released. But there was a whole separate class of issue that was an even more obvious issue.
Looking more closely at the profile data returned, I noticed that there were fields there that weren't being displayed in the UI. Those included things like "ageRange", the range of ages that the profile owner was interested in, and also whether the profile owner had already "liked" or "disliked" your profile (which means a bunch of the profiles you see may already have turned you down, but the app simply didn't show that). This isn't ideal, but what was more concerning was that profiles that were flagged as hidden were still being sent to the app and then just not displayed to the user. Another example of this is that the app supports associating your profile with profiles belonging to partners - if one of those profiles was then hidden, the app would stop showing the partnership, but was still providing the profile ID in the query response and querying that ID would still show the hidden profile contents.
Reporting this was inconvenient. There was no security contact listed on the website or in the app. I ended up finding Feeld's head of trust and safety on Linkedin, paying for a month of Linkedin Pro, and messaging them that way. I was then directed towards a HackerOne program with a link to terms and conditions that 404ed, and it took a while to convince them I was uninterested in signing up to a program without explicit terms and conditions. Finally I was just asked to email security@, and successfully got in touch. I heard nothing back, but after prompting was told that the issues were fixed - I then looked some more, found another example of the same sort of issue, and eventually that was fixed as well. I've now been informed that work has been done to ensure that this entire class of issue has been dealt with, but I haven't done any significant amount of work to ensure that that's the case.
You can't trust clients. You can't give them information and assume they'll never show it to anyone. You can't put private data in a database with no additional acls and just rely on nobody ever asking for it. You also can't find a single instance of this sort of issue and fix it without verifying that there aren't other examples of the same class. I'm glad that Feeld engaged with me earnestly and fixed these issues, and I really do hope that this has altered their development model such that it's not something that comes up again in future.
(Edit to add: as far as I can tell, pictures tagged as "private" which are only supposed to be visible if there's a match were appropriately protected, and while there is a "location" field that contains latitude and longitude this appears to only return 0 rather than leaking precise location. I also saw no evidence that email addresses, real names, or any billing data was leaked in any way)
This year there was a huge demand to attend Linux Plumbers Conference in person and at last we were able to add more places and reopen the registration.
With the imminent release of kmod 33, I thought it’d be good to have a post
about the different types of module dependencies that we have in the Linux
kernel and kmod. The new version adds another type, weak dependency,
and as the name implies, is the weakest of all. But let’s revisit what are
the other types first.
Hard (symbol) dependency
This is the first dependency that every appeared in kmod (and
module-init-tools). A hard (or as some call, “symbol”) dependency occurs when
your module calls or uses an exported symbol of another module. The most common
way is by calling a function that is exported in another module. Example: the
xe.ko calls a function ttm_bo_pin() that is provided and exported by
another module, ttm.ko. Looking to the source:
$ nmbuild/drivers/gpu/drm/ttm/ttm.ko|grep-e"\bttm_bo_pin\b"0000000000000bc0 T ttm_bo_pin$ nmbuild64/drivers/gpu/drm/xe/xe.ko|grep-e"\bttm_bo_pin\b" U ttm_bo_pin
It is not possible to insert the xe.ko module before ttm.ko and if you try
via insmod (that doesn’t handle dependencies) it will fail with the kernel
complaining that the ttm_bo_pin symbol is undefined.
The manual invocations to nm illustrates what the depmod tool does: it
opens the modules and reads the ELF headers. Then it takes note of all the
symbols required and provided by each module, creating a graph of symbol
dependencies. Ultimately that leads to module dependencies: xe ➛ ttm.
This is recorded in the modules.dep file and its sibling modules.dep.bin.
The former is human-readable and the latter is used by libkmod, but they
contain the same information: all dependencies for all the modules. Also note
that each line reflects indirect dependencies: module A calls symbol from B
and B calls symbol from C will lead to A depending on both B and C. Real world example:
So the xe.ko (with .zst extension since it’s compressed) directly or indirectly
depends on drm_gpuvm.ko, drm_exec.ko, gpu-sched.ko, drm_buddy.ko, i2c-algo-bit.ko,
drm_suballoc_helper.ko, drm_ttm_helper.ko, ttm.ko, drm_display_helper.ko, cec.ko,
video.ko, wmi.ko. Same information, but using the kmod tools rather than looking
at the raw index:
There are situations in the kernel where it’s not possible or desired to use a
symbol directly from another module - they may interact by registering in a
subsystem, scanning a bus etc.. In this case depmod doesn’t have enough
information from the ELF file about that. Yet the user would have a more complete
support if both modules were available - it may even cause failures visible to
the end user and not only “partial support for features”.
The softdep implementation contains 2 parts: pre and post dependencies.
The post dependencies are not very much used in practice: they instruct kmod to
load another module after loading the target one.
They can come from a configuration file like e.g. /etc/modprobe.d/foo.conf or
from the kernel itself by embedding that info in the module. From the kernel source
this is achieved by using the macro MODULE_SOFTDEP(). Example:
lib/libcrc32c.c:MODULE_SOFTDEP("pre: crc32c");
When libkmod is loading a module it will first load, in order:
hard dependencies
soft pre dependencies
target module
soft post dependencies
Historically softdeps were also a way to (mostly) get rid of install rules, in
which the configuration instructs libkmod to execute something instead of loading
the module - people would add an install rule to execute something and then call
modprobe again with --ignore-install to fake a dependency. That could easily
lead to a runtime loop which is avoided with softdep since kmod can (and does)
check for loops.
Weak dependency
After explaining the other types of dependencies, back to the new addition
in kmod 33. These are very similar to pre softdep: they come either from a
configuration file or embedded in the module and they express a dependency
that wouldn’t cause the target module to fail to load, but that may cause
the initialization to export less features or fail while initializing.
There is one important difference: weak dependencies don’t cause libkmod
to actually load the module. Rather the dependency information may be
used by tools like dracut and other tools responsible for assembling an
initrd to make the module available since it may or may not be used.
Why are they called “weak”? This was a borrowed terminology from
“weak symbols”:
a weak symbol is there, waiting to be used, but it may or may not be, with
the final decision happening in the final link stage.
With weak dependencies, hopefully some of the pre softdep embedded in the kernel
may be replaced: if the target module is already doing a request_module() or
in some way getting the other module to be loaded, it doesn’t need a softdep that
would serialize the module load order and possibly load more modules than required.
I don't know how that happened anymore, but previously someone made me think that 1. there will be no more numbered releases of CentOS, which is why it is called "CentOS Stream" now, and 2. CentOS Stream is now the upstream of RHEL, replacing Fedora. I really was concerned for the future of Fedora, that was superfluous in that arrangement, you know!
But apparently, Fedora is still the trunk upstream, and CenOS Stream is only named like that. Nothing changes except CentOS is no longer a clone of RHEL, but instead RHEL is a clone of CentOS. What was all the panic for?
I made a VM at Kamatera a few months ago, and they didn't even have Fedora among images. I ended using Rocky.
In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.
Friday July 19th provided an unprecedented example of the inherent dangers of kernel programming, and has been called the largest outage in the history of information technology. Windows computers around the world encountered blue-screens-of-death and boot loops, causing outages for hospitals, airlines, banks, grocery stores, media broadcasters, and more. This was caused by a config update by a security company for their widely used product that included a kernel driver on Windows systems. The update caused the kernel driver to try to read invalid memory, an error type that will crash the kernel.
For Linux systems, the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes. Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well. These security agents will then be safe and unable to cause a Windows kernel crash.
eBPF (no longer an acronym) is a secure kernel execution environment, similar to the secure JavaScript runtime built into web browsers. If you're using Linux, you likely already have eBPF available on your systems whether you know it or not, as it was included in the kernel several years ago. eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox. If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage.
Some eBPF-based security startups (e.g., Oligo, Uptycs) have made their own statements about the recent outage, and the advantages of migrating to eBPF. Larger tech companies are also adopting eBPF for security. As an example, Cisco acquired the eBPF-startup Isovalent and has announced a new eBPF security product: Cisco Hypershield, a fabric for security enforcement and monitoring. Google and Meta already rely on eBPF to detect and stop bad actors in their fleet, thanks to eBPF's speed, deep visibility, and safety guarantees. Beyond security, eBPF is also used for networking and observability.
The worst thing an eBPF program can do is to merely consume more resources than is desirable, such as CPU cycles and memory. eBPF cannot prevent developers writing poor code -- wasteful code -- but it will prevent serious issues that cause a system to crash. That said, as a new technology eBPF has had some bugs in its management code, including a Linux kernel panic discovered by the same security company in the news today. This doesn't mean that eBPF has solved nothing, substituting a vendor's bug for its own. Fixing these bugs in eBPF means fixing these bugs for all eBPF vendors, and more quickly improving the security of everyone.
There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general. What's important about the eBPF method is that it is a software solution that will be available in both Linux and Windows kernels by default, and has already been adopted for this use case.
If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers. Please help raise awareness, and together we can make such global outages a lesson of the past.
Authors: Brendan Gregg, Intel; Daniel Borkmann, Isovalent; Joe Stringer, Isovalent; KP Singh, Google.
The System Boot and Security Microconference has been a critical platform for enthusiasts and professionals working on firmware, bootloaders, system boot, and security. This year, the conference focuses on the challenges of upstreaming boot process improvements to the Linux kernel. Cryptography, an ever-evolving field, poses unique demands on secure elements and TPMs as newer algorithms are introduced and older ones are deprecated. Additionally, new hardware architectures with DRTM capabilities, such as ARM’s D-RTM specification and the increased use of fTPMs in innovative applications, add to the complexity of the task. This is the fifth time the conference has been held in the last six years.
Trusted Platform Modules (TPMs) for encrypting disks have become widespread across various distributions. This highlights the vital role that TPMs play in ensuring platform security. As the field of confidential computing continues to grow, virtual machine firmware must evolve to meet end-users’ demands, and Linux would have to leverage exposed capabilities to provide relevant security properties. Mechanisms like UEFI Secure Boot that were once limited to OEMs now empower end-users. The System Boot and Security Microconference aims to address these challenges collaboratively and transparently. We welcome talks on the following technologies that can help achieve this goal.
TPMs, HSMs, secure elements
Roots of Trust: SRTM and DRTM
Intel TXT, SGX, TDX
AMD SKINIT, SEV
ARM DRTM
Growing Attestation ecosystem
IMA
TrenchBoot, tboot
TianoCore EDK II (UEFI), SeaBIOS, coreboot, U-Boot, LinuxBoot, hostboot
legal, organizational, and other similar issues relevant to people interested in system boot and security.
If you want to participate in this microconference and have ideas to share, please use the Call for Proposals (CFP) process. Your submissions should focus on new advancements, innovations, and solutions related to firmware, bootloader, and operating system development. It’s essential to explain clearly what will be discussed, why, and what outcomes you expect from the discussion.
Edit: The submission deadline has been updated to July 14th!
sched_ext is a Linux kernel feature which enables implementing host-wide, safe kernel thread schedulers in BPF, and dynamically loading them at runtime. sched_ext enables safe and rapid iterations of scheduler implementations, thus radically widening the scope of scheduling strategies that can be experimented with and deployed, even in massive and complex production environments.
sched_ext was first sent to the upstream list as an RFC patch set back in November 2022. Since then, the project has evolved a great deal, both technically, as well as in the significant growth of the community of sched_ext users and contributors.
This MC is the space for the community to discuss the developments of sched_ext, its impact on the community, and to outline future strategies aimed at integrating this feature into the Linux kernel and mainstream Linux distributions.
Ideas of topics to be discussed include (but are not limited to):
Challenges and plans to facilitate the upstream merge of sched_ext
User-space scheduling (offload part / all of the scheduling from kernel to user-space)
Scheduling for gaming and latency-sensitive workloads
Scheduling & cpufreq integration
Distro support
While we anticipate having a schedule with existing talk proposals at the MC, we invite you to submit proposals for any topic(s) you’d like to discuss. Time permitting, we are happy to readjust the schedule for additional topics that are of relevance to the sched_ext community.
Submissions are made via LPC submission system, selecting the track Sched-Ext: The BPF extensible scheduler class.
This year it took us a bit more time, but we did run out of places and the conference is currently sold out for in-person registration.
We are setting up a waitlist for in-person registration (virtual attendee places are still available).
Please fill in this form and try to be clear about your reasons for wanting to attend.
We are giving waitlist priority to new attendees and people expected to contribute content.
Rust in the kernel (e.g. status update, next steps).
Use cases for Rust around the kernel (e.g. subsystems, drivers, other modules…).
Discussions on how to abstract existing subsystems safely, on API design, on coding guidelines.
Integration with kernel systems and other infrastructure (e.g. build system, documentation, testing and CIs, maintenance, unstable
features, architecture support, stable/LTS releases, Rust versioning, third-party crates).
It comes with great sadness that on June 24th, 2024 we lost a great contributor to the Linux Plumbers Conference and the whole of Linux generally. Daniel Bristot de Oliveira passed away unexpectedly at the age of 37. Daniel has been an active participant of Linux Plumbers since 2017. Not only has he given numerous talks, which were extremely educational, he also took leadership roles in running Microconferences. He was this year’s main Microconference runner for both the Scheduler Microconference as well as the Real-Time Microconference. This year’s conference will be greatly affected by his absence. Many have stated how Daniel made them feel welcomed at Linux Plumbers. He always had a smile, would make jokes and help developers come to a conclusion for those controversial topics. He perfectly embodied the essence of what Linux Plumbers was all about. He will be missed.
The Linux kernel has grown in complexity over the years. Complete understanding of how it works via code inspection has become virtually impossible. Today, tracing is used to follow the kernel as it performs its complex tasks. Tracing is used today for much more than simply debugging. Its framework has become the way for other parts of the Linux kernel to enhance and even make possible new features. Live kernel patching is based on the infrastructure of function tracing, as well as BPF. It is now even possible to model the behavior and correctness of the system via runtime verification which attaches to trace points. There is still much more that is happening in this space, and this microconference will be the forum to explore current and new ideas.
This year, focus will also be on perf events:
Perf events are a mechanism for presenting performance counters and Linux software events to users. There are kernel and userland components to perf events. The kernel supplies the APIs and the perf tooling presents the data to users.
Possible ideas for topics for this year’s conference:
Feedback about the tracing/perf subsystems overall (e.g. how can people help the maintainers).
Reboot persistent in-memory tracing buffers, this would make ftrace a very powerful debugging and performance analysis tool for kexec and could also be used for post crash debugging.
The counted_by attribute was introduced in Clang-18 and will soon be available in GCC-15. Its purpose is to associate a flexible-array member with a struct member that will hold the number of elements in this array at some point at run-time. This association is critical for enabling runtime bounds checking via the array bounds sanitizer and the __builtin_dynamic_object_size() built-in function. In user-space, this extra level of security is enabled by -D_FORTIFY_SOURCE=3. Therefore, using this attribute correctly enhances C codebases with runtime bounds-checking coverage on flexible-array members.
Here is an example of a flexible array annotated with this attribute:
In the above example, count is the struct member that will hold the number of elements of the flexible array at run-time. We will call this struct member the counter.
In the Linux kernel, this attribute facilitates bounds-checking coverage through fortified APIs such as the memcpy() family of functions, which internally use __builtin_dynamic_object_size() (CONFIG_FORTIFY_SOURCE). As well as through the array-bounds sanitizer (CONFIG_UBSAN_BOUNDS).
The __counted_by() macro
In the kernel we wrap the counted_by attribute in the __counted_by() macro, as shown below.
ceba9725fb45 (“cxgb4: Annotate struct sched_table with …”)
However, as we are about to see, not all __counted_by() annotations are always as straightforward as the one above.
__counted_by() annotations in the kernel
There are a number of requirements to properly use the counted_by attribute. One crucial requirement is that the counter must be initialized before the first reference to the flexible-array member. Another requirement is that the array must always contain at least as many elements as indicated by the counter. Below you can see an example of a kernel patch addressing these requirements.
In the patch above, datalen is the counter for the flexible-array member data. Notice how the assignment to the counterevent->datalen = datalen had to be moved to before calling memcpy(event->data, data, datalen), this ensures the counter is initialized before the first reference to the flexible array. Otherwise, the compiler would complain about trying to write into a flexible array of size zero, due to datalen being zeroed out by a previous call to kzalloc(). This assignment-after-memcpy has been quite common in the Linux kernel. However, when dealing with counted_by annotations, this pattern should be changed. Therefore, we have to be careful when doing these annotations. We should audit all instances of code that reference both the counter and the flexible array and ensure they meet the proper requirements.
In the kernel, we’ve been learning from our mistakes and have fixed some buggy annotations we made in the beginning. Here are a couple of bugfixes to make you aware of these issues:
6dc445c19050 (“clk: bcm: rpi: Assign ->num before accessing…”)
9368cdf90f52 (“clk: bcm: dvp: Assign ->num before accessing…”)
Another common issue is when the counter is updated inside a loop. See the patch below.
34c34c242a1b (“wifi: wil6210: cfg80211: Use __counted_by…”)
The patch above does a bit more than merely annotating the flexible array with the __counted_by() macro, but that’s material for a future post. For now, let’s focus on the following excerpt.
Notice that in this case, num_channels is our counter, and it’s set to zero before the for loop. Inside the for loop, the original code used this variable as an index to access the flexible array, then updated it via a post-increment, all in one line: cmd.cmd.channel_list[cmd.cmd.num_channels++]. The issue is that once channel_list was annotated with the __counted_by() macro, the compiler enforces dynamic array indexing of channel_list to stay below num_channels. Since num_channels holds a value of zero at the moment of the array access, this leads to undefined behavior and may trigger a compiler warning.
As shown in the patch, the solution is to increment num_channels before accessing the array, and then access the array through an index adjustment below num_channels.
Another option is to avoid using the counter as an index for the flexible array altogether. This can be done by using an auxiliary variable instead. See an excerpt of a patch below.
ea9e148c803b (“Bluetooth: hci_conn: Use __counted_by() and…”)
Again, the entire patch does more than merely annotate the flexible-array member, but let’s just focus on how aux_num_cis is used to access flexible array pdu->cis[].
In this case, the counter is num_cis. As in our previous example, originally, the counter is used to directly access the flexible array: &pdu.cis[pdu.cp.num_cis++]. However, the patch above introduces a new variable aux_num_cis to be used instead of the counter: &pdu->cis[aux_num_cis++]. The counter is then updated after the loop: pdu->num_cis = aux_num_cis.
Both solutions are acceptable, so use whichever is convenient for you.
Here, you can see a recent bugfix for some buggy annotations that missed the details discussed above:
[PATCH] wifi: iwlwifi: mvm: Fix _counted_by usage in cfg80211_wowlan_nd*
In a future post, I’ll address the issue of annotating flexible arrays of flexible structures. Spoiler alert: don’t do it!
We are excited about the submissions that are coming in to Linux Plumbers 2024. If you want to discuss a topic at one of the Microconferences, you should start putting together a problem statement and submit. Each Microconference has its own defined deadline. To submit, go to the Call for Proposals page and select Submit new abstract. After filling out your problem statement in the Content section, make sure to select the proper Microconference in the Track pull down list. It is recommended to read this blog before writing up your submission.
And a reminder that the other tracks submissions are ending soon as well.
The deadline for submitting refereed track proposals for the 2024 Linux Plumbers Conference has been extended until 23 June.
If you have already submitted a proposal, thank you very much! For the rest of you, there is one additional week in which to get your proposal submitted. We very much look forward to seeing what you all come up with.
A while back, I wrote about using the SSH agent protocol to satisfy WebAuthn requests. The main problem with this approach is that it required starting the SSH agent with a special argument and also involved being a little too friendly with the implementation - things worked because I could provide an arbitrary public key and the implementation never validated that, but it would be legitimate for it to start doing so and then break everything. And it also only worked for keys stored on tokens that ssh supports - there was no way to extend this to other keystores on the client (such as the Secure Enclave on Macs, or TPM-backed keys on PCs). I wanted a better solution.
It turns out that it was far easier than I expected. The ssh agent protocol is documented here, and the interesting part is the extension support extension mechanism. Basically, you can declare an extension and then just tunnel whatever you want over it. As before, my goto was the go ssh agent package which conveniently implements both the client and server side of this. Implementing the local agent is trivial - look up SSH_AUTH_SOCK, connect to it, create a new agent client that can communicate with that by calling NewClient, and then implement the ExtendedAgent interface, create a new socket, and call ServeAgent against that. Most of the ExtendedAgent functions should simply call through to the original agent, with the exception of Extension(). Just add a case statement against extensionType, define some reasonably namespaced extension, and you're done.
Now you need to use this agent. You probably don't want to use this for arbitrary hosts (agent forwarding should only be enabled for remote systems you trust, not arbitrary machines you connect to - if you enabled agent forwarding for github and github got compromised, github would be able to use any private keys loaded into your agent, and you probably don't want that). So the right approach is to add a Host entry to the ssh config with a ForwardAgent stanza pointing at the socket you created in your new agent. This way the configured subset of remote hosts will automatically talk to this new custom agent, while forwarding for anything else will still be at the user's discretion.
For the remote end things are even easier. Look up SSH_AUTH_SOCK and call NewClient as before, and then simply call client.Extension(). Whatever you stick in the contents argument will simply end up being received at the client end. You now have a communication channel between a the remote system and the local client, and what you do with that is up to you. I'm using it to allow a remote system to obtain auth tokens from Okta and forward WebAuthn challenges that can either be satisfied via a local WebAuthn token or by passing the query off to Mac TouchID, but there's fundamentally no constraints whatsoever on what can be done here.
(If you want to do this on Windows and still have everything work with existing clients you'll need to take this into account - Windows didn't really do Unix sockets until recently so everything there is awful)
We’re happy to announce that registration for LPC 2024 is now open. To register please go to our attend page.
To try to prevent the instant sellout we had in previous years we are keeping our cancellation policy of no refunds, only transfers of registrations. You will find more details during the registration process. LPC 2024 follows the Linux Foundation’s health & safety policy.
As usual we expect to sell our rather quickly so don’t delay your registration for too long!
Unfortunately we still do not know the total cost of the 4th track yet. We are still in the process of looking at the costs of adding another room, but we do not want to delay the acceptance of topics to Microconferences any further. We have decided to accept all pending Microconferences with one caveat. That is, we are not accepting the rest as full Microconferences. The Microconferences being accepted now will become one of the following at Linux Plumbers 2024:
A full 3 hour Microconference
A 1 and a half hour Microconference (Nanoconference)
A full 3 hour Microconference but without normal Audio/Video
That last one is another option we are looking at. The main cost to having a 4th track is the manned AV operations. But we could add the 4th track without normal AV. Instead, these would get a BBB room where an Owl video camera (or the like) and a Jabra speaker will be in place. The quality of the AV will not be as good as having a fully manned room, but this would be better than being rejected from the conference, or having half the time of a full microconference.
Even with a 4th track, two still need to become Nanoconferences.
In the mean time, we will be accepting the rest of the Microconferences so that they can start putting together content. How they are presented at Linux Plumbers is still to be determined.
Note that this also means we will likely be dropping the 3 free passes that a Microconference usually gets down to only 2 passes.
The accepted Microconferences (as full, half or no A/V) are:
Sched_ext
Containers and Checkpoint/Restore
Confidential Computing
Real-Time
Build Systems
RISC-V
Compute Express Link
X86
VFIO/IOMMU/PCI
System Boot and Security
Zone Storage
Internet of Things
Embedded
Complex Cameras
Power Management and Thermal Control
Kernel <-> Userspace/Init/System Management Boundaries and APIs
I've presented a talk
Introduction to XDP, eBPF and AF_XDP
as part of the
OsmoDevCon 2024 conference on Open Source Mobile Communications.
This talk provides a generic introduction to a set of modern Linux kernel technologies:
eBPF (extended Berkeley Packet Filter) is a kind of virtual machine that runs sandboxed programs inside the Linux kernel.
XDP (eXpress Data Path) is a framework for eBPF that enables high-performance programmable packet processing in the Linux kernel
AF_XDP is an address family that is optimized for high-performance packet processing. It allows in-kernel XDP eBPF programs to efficiently pass packets to userspace via memory-mapped ring buffers.
The talk provides a high-level overview. It should provide some basics before the other/later talks on bpftrace and eUPF.
I've presented a talk
Using bpftrace to analyze osmocom performance
as part of the
OsmoDevCon 2024 conference on Open Source Mobile Communications.
bpftrace is a utility that uses the Linux kernel tracing infrastructure (and eBPF) in order to provide tracing capabilities within the kernel, like uprobe, kprobe, tracepoints, etc.
bpftrace can help us to analyze the performance of [unmodified] Osmocom programs and quickly provide information like, for example:
Histogram of time spent in a specific system call
Histogram of any argument or return value of any system call
The Call-for-Proposals for Microconferences has come to a close, and with that, this year’s list of Microconferences is to be decided. A Microconference is a 3 and a half hour session with a half hour break (giving a total of 3 hours of content). Linux Plumbers has three Microconference tracks running per day, with each track having two Microconferences (one in the morning and one in the afternoon). Linux Plumbers runs for three days allowing for 18 Microconferences total (2 per track, with 3 tracks a day for 3 days).
This year we had a total of 26 quality submissions! Linux Plumbers is known as the conference that gets work done, and its success is proof of that. But sometimes success brings its own problems. How can we accept 26 Microconferences when we only have 18 slots to place them? Two of the Microconferences have agreed to merge as one bringing the total down to just 25. But that still is 7 more than we can handle.
We want to avoid rejecting 7 microconferences, but to do so, we need to make compromises. The first idea we have is to add a 4th Microconference track. But that still only gives us 6 more slots. As it will also require more A/V and manpower, the cost will increase and may not be within the budget to do so.
Pros to a 4 track are:
Have 24 full Microconferences and reject one (or 23 and keep 2 as per the next option).
Cons to a 4 track are:
Increased costs.
Having 4 Microconference tracks running simultaneously will cause more conflicts in the schedule. People have complained in the past about conflicts between sessions with just 3 tracks, having 4 will exacerbate the situation.
Still may need to reject 1 Microconference.
Another solution is to create a half Microconference (Nanoconference?). That is an hour and a half session, run the same as the full sessions, with the 30 minute break between two Nanoconferences. Doing so will allow for 11 full Microconferences and 14 Nanoconferences which will allow for all submissions to be accepted and fit within the 3 tracks.
The difference between a Nanoconference and a BOF is that a Nanoconference still has all the rules of a Microconference. That is, all sessions should be strictly discussion focus. If presentations are needed, they should be submitted as Refereed talks (the CFP for them are still open). A BOF is usually focused on a single issue. A Nanoconference should still be broken up into small discussions about different issues with sessions lasting 15 to 20 minutes each.
Pros for Nanoconferences are:
Can accommodate all submitted Microconferences.
Cons for Nanoconferences are:
Shortened time for Microconferences, even for topics in the past that had filled a full Microconference may now only get half the time.
Note, as BOFs will be in a separate track, a Nanoconference may be able to still submit for topics there (BOF submissions are still open).
Currently, we also give out 3 free passes to each Microconference that can be handed to anyone in their session. For 18 Microconferences, that is 54 passes. This will not be feasible to give out 3 passes to 25 Microconferences (totaling 75 passes), thus one solution is to drop it down to 2 free passes. The problem with passes is still an issue with the Nanoconference approach, as you can not give out 1 and a half passes. Thus, the Nanoconferences may only get 1 pass each, or perhaps have both the Microconferences and Nanoconferences all get just 2 passes each.
Anyway, since the above solutions still allow for 11 full Microconferences, we have accepted 9 so far. They are:
Android
KVM
Kernel Memory Management
Scheduling
Rust
Kernel Testing and Dependability
Graphics and DRM
Safe Systems with Linux
Tracing / Perf events (This is the merged Microconference)
We are still weighing our options so stay tuned for updates on the situation, and thank you to all the Microconference submitters that make
Linux Plumbers the best technical conference around!
I've co-presented a talk (together with Andreas EversbergHigh-performance I/O using io_uring via osmo_io
as part of the
OsmoDevCon 2024 conference on Open Source Mobile Communications.
Traditional socket I/O via read/write/recvfrom/sendto/recvmsg/sendmsg and friends creates a very high system call load. A highly-loaded osmo-bsc spends most of its time in syscall entry and syscall exit.
io_uring is a modern Linux kernel mechanism to avoid this syscall overhead. We have introduced the osmo_io`API to libosmocore as a generic back-end for non-blocking/asynchronous I/O and a back-end for our classic `osmo_fd / poll approach as well as a new backend for io_uring.
The talk will cover
a very basic io_uring introduction
a description of the osmo_io API
the difficulties porting from osmo_fd to osmo_io
status of porting various sub-systems over to osmo_io
The only way to obtain STEP from OpenSCAD that I know is an external tool that someone made. It's pretty crazy actually: it parses OpenSCAD's native export, CSG, and issues commands to OpenCASCADE's CLI, OCC-CSG. The biggest issue for me here is that his approach cannot handle transformations that the CLI does not support. I use hull all over the place and a tool that does not support hull is of no use for me.
So I came up with a mad lad idea: just add a native export of STEP to OpenSCAD. The language itself is constructive, and an export to CSG exists. I just need to duplicate whatever it does, and then at each node, transform it into something that can be expressed in STEP.
As it turned out, STEP does not have any operations. It only has manifolds assembled from faces, which are assembled from planes and lines, which are assembled from cartesian points and vectors. Thus, I need to walk the CSG, compile it into a STEP representation, and only then write it out. Operations like union, difference, or hull have to be computed by my code. The plan is to borrow from OpenSCAD's compiler that builds the mesh, only build with larger pieces - possibly square or round.
Not sure if this is sane and can be made to work, but it's pretty fun at least.
Problem: After an update to F39, a system continues to boot F38 kernels
The /bin/kernel-install generates entries in /boot/efi/loader/entries instead of /boot/loader/entries. Also, they are in BLS Type 1 format, and not in the legacy GRUB format. So I cannot copy them over.
I've read a bunch of docs and the man page for kernel-install(8), but they are incomprehensible. Still the key insight was that all that Systemd stuff loves to autodetect by finding this directory or that.
The way to test is: [root@chihiro zaitcev]# /bin/kernel-install -v add 6.8.5-201.fc39.x86_64 /lib/modules/6.8.5-201.fc39.x86_64/vmlinuz
I'm a strong proponent of self-hosting all your services, if not on your own hardware than at least on
dedicated rented hardware. For IT nerds of my generation, this has been the norm sicne the early 1990s: If
you wante to run your own webserver/mailserver/... back then, the only way was to self-host.
So over the last 30 years, I've always been running a fleet of machines, some my own hardware colocated,
and during the past ~18 years also some rented dedicated "root servers". They run a plethora of services for
either my personal stuff (like this blog, or my personal email server), or any of the IT services of the open
source projects I'm involved in (like osmocom) or the company I co-founded and run (sysmocom).
Every few years there's the need to migrate to new hardware. Either due to power consumption/efficiency, or
to increase performance, or to simply avoid aging hardware that may be dying soon.
When upgrading from one [hosted] server to another [hosted] server, there's always the question of how to
manage the migration with minimal interruption to services. For very simple services like http/https,
it can usually be done entirely within DNS: You reduce the TTL of the records, bring up the service
on the new server (with a new IP), make the change in the DNS and once the TTL of the DNS record is expired in
all caches, everybody will access the new server/IP.
However, there are services where the IP address must be retained. SMTP is a prime example of that. Given how
spam filtering works, you certainly will not want to give up your years if not decadeds of good reputation for
your IP address. As a result, you will want to keep the IP address while doing the migration.
If it's a physical machine in colocation or your home, you can of course do that all rather easily under your
control. You can synchronize the various steps from stopping the services on the old machine, rsync'ing over
the spool files to the new, then migrate the IP over to the new machine.
However, if it's a rented "root" server at a company like Hetzner or KVH, then you do not have full control
over when exactly the IP address will be migrated over to the new server.
Also, if there are many different services on that same physical machine, running on a variety of different
IPv4/IPv6 addresess and ports, it may be difficult to migrate all of them at once. It would be much more
practical, if individual services could be migrated step by step.
The poor man's approach would be to use port-forwarding / reverse-proxying. In this case, the client
establishes a TCP connection to the old IP address on the old server, and a small port-forward proxy accepts
that TCP connectin, creates a second TCP connection to the new server, and bridges those two together.
This approach only works for the most simplistic of services (like web servers), where
there are only inbound connections from remote clients (as outbound connections from the new server would
originate from the new IP, not the old one), and
where the source IP of the client doesn't matter. To the new server all connections' source IP addresses
suddenly are masked and there's only one source IP (the old server) for all connections.
For more sophisticated serviecs (like e-mail/SMTP, again), this is not an option. The SMTP client IP address
matters for whitelists/blacklists/relay rules, etc. And of course there are also plenty of outbound SMTP
connections which need to originate from the old IP, not the new IP.
So in bed last night [don't ask], I was brainstorming if the goal of fully transparent migration of individual
TCP+UDP/IP (IPv4+IPv6) services could be made between and old and new server. In theory it's rather simple,
but in practice the IP stack is not really designed for this, and we're breaking a lot of the assumptions
and principles of IP networking.
After some playing around earlier today, I was actually able to create a working setup!
It fulfills the followign goals / exhibits the following properties:
old and new server run concurrently for any amount of time
individual IP:port tuples can be migrated from old to new server, as services are migrated step by step
fully transparent to any remote peer: Old IP:port of server visible to client
fully transparent to the local service: Real client IP:port of client visible to server
no constraints on whether or not the old and new IPs are in the same subnet, link-layer, data centre, ...
use only stock features of the Linux kernel, no custom software, kernel patches, ...
no requirement for controlling a router in front of either old or new server
General Idea
The general idea is to receive and classify incoming packets on the old server, and then selectively tunnel
some of them via a GRE tunnel from the old machine to the new machine, where they are decapsulated and passed
to local processes on the new server. Any packets generated by the service on the new server (responses to
clients or outbound connections to remote serveers) will take the opposite route: They will be encapsulated
on the new server, passed through that GRE tunnel back to the old server, from where they will be sent off to
the internet.
That sounds simple in theory, but it poses a number of challenges:
packets destined for a local IP address of the old server need to be re-routed/forwarded, not delivered to
local sockets. This is easily done with fwmark, multiple routing tables and a rule, similar ro many other
policy routing setups.
Linux Plumbers Conference 2024 is pleased to host the Networking Track!
LPC Networking track is an in-person manifestation of the netdev mailing list, bringing together developers, users and vendors to discuss topics related to Linux networking. Relevant topics span from proposals for kernel changes, through user space tooling, netdev testing and CI, to presenting interesting use cases, new protocols or new, interesting problems waiting for a solution.
The goal is to allow gathering early feedback on proposals, reach consensus on long running mailing list discussions and raise awareness of interesting work and use cases.
After four years of co-locating BPF & Networking Tracks together this year we separated the two, again. Please submit to the track which feels suitable, the committee will transfer submissions between tracks as it deems necessary.
Please come and join us in the discussion. We hope to see you there!
When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names that they come from:
bpftrace, basic versions of opensnoop(8), execsnoop(8), runqlat(8), biosnoop(8), etc.
eBPF scripting[1]
trace-cmd
trace-cmd(1)
Ftrace CLI
nicstat
nicstat(1)
net device stats
ethtool
ethtool(8)
net device info
tiptop
tiptop(1)
PMU/PMC top
cpuid
cpuid(1)
CPU details
msr-tools
rdmsr(8), wrmsr(8)
CPU digging
(This is based on Table 4.1 "Linux Crisis Tools" in SysPerf 2.)
Some longer notes: [1] bcc and bpftrace have many overlapping tools: the bcc ones are more capable (e.g., CLI options), and the bpftrace ones can be edited on the fly. But that's not to say that one is better or faster than the other: They emit the same BPF bytecode and are equally fast once running. Also note that bcc is evolving and migrating tools from Python to libbpf C (with CO-RE and BTF) but we haven't reworked the package yet. In the future "bpfcc-tools" should get replaced with a much smaller "libbpf-tools" package that's just tool binaries.
This list is a minimum. Some servers have accelerators and you'll want their analysis tools installed as well: e.g., on Intel GPU servers, the intel-gpu-tools package; on NVIDIA, nvidia-smi. Debugging tools, like gdb(1), can also be pre-installed for immediate use in a crisis.
Essential analysis tools like these don't change that often, so this list may only need updating every few years. If you think I missed a package that is important today, please let me know (e.g., in the comments).
The main downside of adding these packages is their on-disk size. On cloud instances, adding Mbytes to the base server image can add seconds, or fractions of a second, to instance deployment time. Fortunately the packages I've listed are mostly quite small (and bcc will get smaller) and should cost little size and time. I have seen this size concern prevent debuginfo (totaling around 1 Gbyte) from being included by default.
Can't I just install them later when needed?
Many problems can occur when trying to install software during a production crisis. I'll step through a made-up example that combines some of the things I've learned the hard way:
4:00pm: Alert! Your company's site goes down. No, some people say it's still up. Is it up? It's up but too slow to be usable.
4:01pm: You look at your monitoring dashboards and a group of backend servers are abnormal. Is that high disk I/O? What's causing that?
4:02pm: You SSH to one server to dig deeper, but SSH takes forever to login.
4:03pm: You get a login prompt and type "iostat -xz 1" for basic disk stats to begin with. There is a long pause, and finally "Command 'iostat' not found...Try: sudo apt install sysstat". Ugh. Given how slow the system is, installing this package could take several minutes. You run the install command.
4:07pm: The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration. Since the server owners are now in the SRE chatroom to help with the outage, you ask: "how do you install system packages?" They respond "We never do. We only update our app." Ugh. You find a different server and copy its working /etc/apt config over.
4:10pm: You need to run "apt-get update" first with the fixed config, but it's miserably slow.
4:12pm: ...should it really be taking this long??
4:13pm: apt returned "failed: Connection timed out." Maybe this system is too slow with the performance issue? Or can't it connect to the repos? You begin network debugging and ask the server team: "Do you use a firewall?" They say they don't know, ask the network security team.
4:17pm: The network security team have responded: Yes, they have blocked any unexpected traffic, including HTTP/HTTPS/FTP outbound apt requests. Gah. "Can you edit the rules right now?" "It's not that easy." "What about turning off the firewall completely?" "Uh, in an emergency, sure."
4:20pm: The firewall is disabled. You run apt-get update again. It's slow, but works! Then apt-get install, and...permission errors. What!? I'm root, this makes no sense. You share your error in the SRE chatroom and someone points out: Didn't the platform security team make the system immutable?
4:24pm: The platform security team are now in the SRE chatroom explaining that some parts of the file system can be written to, but others, especially for executable binaries, are blocked. Gah! "How do we disable this?" "You can't, that's the point. You'd have to create new server images with it disabled."
4:27pm: By now the SRE team has announced a major outage and informed the executive team, who want regular status updates and an ETA for when it will be fixed. Status: Haven't done much yet.
4:30pm: You start running "cat /proc/diskstats" as a rudimentary iostat(1), but have to spend time reading the Linux source (admin-guide/iostats.rst) to make sense of it. It just confirms the disks are busy which you knew anyway from the monitoring dashboard. You really need the disk and file system tracing tools, like biosnoop(8), but you can't install them either. Unless you can hack up rudimentary tracing tools as well...You "cd /sys/kernel/debug/tracing" and start looking for the FTrace docs.
4:55pm: New server images finally launch with all writable file systems. You login – gee it's fast – and "apt-get install sysstat". Before you can even run iostat there are messages in the chatroom: "Website's back up! Thanks! What did you do?" "We restarted the servers but we haven't fixed anything yet." You have the feeling that the outage will return exactly 10 minutes after you've fallen asleep tonight.
12:50am: Ping! I knew this would happen. You get out of bed and open your work laptop. The site is down – it's been hacked – someone disabled the firewall and file system security.
I've fortunately not experienced the 12:50am event, but the others are based on real world experiences. In my prior job this sequence can often take a different turn: a "traffic team" may initiate a cloud region failover by about the 15 minute mark, so I'd eventually get iostat installed but then these systems would be idle.
Default install
The above scenario explains why you ideally want to pre-install crisis tools so you can start debugging a production issue quickly during an outage. Some companies already do this, and have OS teams that create custom server images with everything included. But there are many sites still running default versions of Linux that learn this the hard way. I'd recommend Linux distros add these crisis tools to their enterprise Linux variants, so that companies large and small can hit the ground running when performance outages occur.
I’ve had a couple of reasons recently to wonder about ipsec: one was doing private overlay networks in confidential VMs and the other was trying to be more efficient than my IPv4 openVPN when I’m remote on an IPv6 capable network. Usually ipsec descriptions begin with tools like raccoon or strong/open/libreswan; however, I’m going to try to explain how you do ipsec at a very basic level within Linux networking stack without using an ipsec toolkit. I’m going to concentrate on my latter use case, so this post is going to be ipsec over IPv6 (although most of the concepts should be applicable to IPv4). To attempt to do this, I’ll be delving into the ip xfrm commands extensively and trying to explain how the transform filters and policy work with the rest of the Linux networking stack.
The basics of ipsec
ipsec has two “protocols”: Authentication Header (AH) which means the packet is fully authenticated (or integrity protected) by an HMAC but not encrypted; and Encapsulating Security Payload (ESP), where the packet is encrypted but not necessarily integrity protected (ipsec was invented before AEAD ciphers, so previously you used both protocols to ensure confidentiality and integrity but, in the modern world, you can use ESP with an AEAD cipher and dispense entirely with AH). Once you have the protocol set, you encapsulate either in transport or tunnel mode. Transport mode means that the protocol headers are simply added to an existing IP packet (so the source and destination address remain the same) in the case of AH and an added header plus an encryption transformed payload with ESP and tunnel mode means that the entire IP packet is encapsulated and a new outer source and destination address is added (this is sometimes referred to as an ipsec VPN).
Understanding ipsec flows
This diagram should help understand how ipsec transforms work. There are two aspects to this: policy which basically does accept/reject and tagging and state, which does the encode/decode
The square boxes are the firewall filters and the ellipses are the ip xfrm policy and state transforms. xfrm decode is unconditionally activated whenever an ipsec packet reaches the input flow (provided there’s a matching state rule), output encoding only occurs if a matching output policy says it should (otherwise the packet is passed unencoded) and the xfrm policy fwd has no matching encode/decode, so it’s not possible to ipsec transform forwarded packets.
ip xfrm policy
To decapsulate, there is no requirement for a policy: every ipsec packet coming into the INPUT flow will be checked for a state match and decapsulated if one is found. However, for the packet to progress further you may need a policy. For transport mode, the decapsulated packet will go back around the input loop, so a dir in policy likely isn’t required but for tunnel mode, the decapsulated packet will likely traverse the FORWARD table and a dir fwd policy plus a firewall rule is likely required to permit the packet. Forwarded packets also hit the dir out policy as well.
A policy is specified with two parts: a selector which contains a set of matches (the only mandatory part of which is the direction dir) and which may match on partial addresses an action (block or allow, with allow being default) and a template (tmpl) which can specify the encoding (for dir out) or additional rules based on encapsulation.
Encoding Templates
Encoding only applies to dir out policies. Transport mode simply requires a statement of which encapsulation to use (proto ah or esp) and doesn’t require IP addresses in the ID section. In tunnel mode, the template must also have the source and destination outer IP addresses (the current source and destination become the inner addresses).
Every packet matching an encoding policy must also have a corresponding ip xfrm state match to specify the encapsulation parameters. Note that if a state transform is missing, the kernel will signal this on a netlink socket (which you can monitor with ip xfrm monitor). This socket is mostly used by ipsec toolkits to add state transforms just in time.
Other Policy Templates
For the in and fwd directions, the template acts as an additional filter on what packets to allow and what to block. For instance, if only decapsulated packets should be forwarded, then there should be a policy like
ip xfrm policy add dst net/netmask dir fwd tmpl proto ah mode tunnel level required
Which says the only allowed packets are those which were encapsulated in tunnel mode. Note that for this policy to be reached at all, the FORWARD table must allow the packets to pass. For most network security people, having a blanket forward permit rule is an anathema, so they often achieve the same thing by applying a firewall mark in decapsulation (the output-mark option of ip xfrm state) and only allowing market packets to pass the FORWARD chain (which dispenses with the need for a xfrm dir fwd policy). The two levers for controlling filter policies are the action (allow or block) the default is allow which is why this statement usually doesn’t appear) and the template level (required or use). The default level is required, which means for the allow rule to match the packet must be decapsulate (level use means pass regardless of decapsulation status)
Security Parameters Index (SPI) and reqid
For ipsec to work, every encapsulated packet must have a SPI value. You can specify this in the state transformation. The standards (RFC2409, RFC4303, etc) specify that SPI values 1-255 are “reserved”. Additionally the standards allow SPI value 0 to be used internally, which the Linux Kernel takes advantage of. SPI is mostly used to distinguish packet streams from the same host for complex ipsec policy, and don’t have much use in a simple policy situation. However, you must provide a value that isn’t 0-255 otherwise strange things can happen. In particular 0, the value you’ll get if you don’t specify spi, often causes the packet to get lost after decapsulation, so always specify a large number for spi. requid is a label which is attached to an unencapsulated packet that effectively remembers what the SPI value was; it’s mostly used as a label based discriminator in policy template to state transforms.
For the purposes of the following example I’ll simply use the randomly chosen 4321 for the spi value (but you could choose anything outside the 0-255 range).
ip xfrm state
Unlike policy, which attaches to a particular location (in, out or fwd), state is location agnostic and the same state match could theoretically be used both to encapsulate or decapsulate. State matching also isn’t subnet based: the address matching is either exact or fully wild card (match everything). However, a state encapsulation transform rule must match on dst and a decapsulation one may match on either src or dst, but must have an exact match on one or other. The main thing the state specifies is the algorithm to encapsulate (see man ip-xfrm for a full list). Remember you also must specify spi. The only other thing you might want to specify is the sel parameter. The selector applies to the inner address of decapsulated packets and is to ensure that a mode tunnel packet is going to an address you approve of.
Simple Example: HMAC authentication between two nodes
Assume a [4321::]/64 subnet with two nodes [4321::1] and [4321::2]. To set up authenticated headers one way (from 1->2) you need a policy specifying AH (can be specific or subnet based, so this is subnet)
ip xfrm policy add dst 4321::/64 dir out tmpl proto ah mode tunnel
Followed by a state that’s specific to the destination (using random spi 4321 and short key 1234):
ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport
If you ping from [4321::1] and do a tcp dump from [4321::2] you’ll see
But nothing will come back until you add on [4321::2]
ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport
Which will cause a ping response to be seen. Note the ping packet has an authentication header, but the response is a simple icmp6 response packet (no AH) demonstrating ipsec can be set up asymmetrically.
Example: Private network for Cloud Nodes
Assume we have N nodes with public IP addresses [4321::1]…[4321::N] (which could be provided by the cloud overlay or simply by virtue of the physical network the nodes are on) and we want to connect them in a private mesh network using encryption. There are two ways of doing this: the first is to encrypt all traffic between the nodes on the public network using transport mode and the second would be to set up overlay tunnels between the nodes (this latter can be used even if the public addresses aren’t on a single network segment).
Simple Transport Mode Encryption
Firstly each node needs a policy to require encryption both to and from the private network
ip xfrm policy add dst 4321::0/64 dir out tmpl proto esp
ip xfrm policy add scr 4321::/64 dir in tmpl proto esp
And then for each node on the star and encrypt and a decrypt policy (the aes cipher type is taken from the key length, so I’ve chosen a 128 bit key “1234567890123456”)
ip xfrm state add dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport
ip xfrm state add src 4321::1 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode transport
...
ip xfrm state add dst 4321::N proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport
ip xfrm state add src 4321::N proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode transport
This encryption scheme has one key for the entire network, but you could use 1 key per node if you wished (although this wouldn’t necessarily increase security that much). Note that what’s described above is not an overlay network because it relies on using the characteristics of the underlying network (in this case that all nodes are on an IPv6 /64 segment) to do opportunistic transport encryption. To get a true single subnet overlay on top of a disjoint network there must be some sort of tunnel. One way to get the tunnel is simply to use ipsec in tunnel mode, but another is to set up gre tunnels (or another, not necessarily trusted, network overlay which the cloud can likely provide) for the virtual overlay and then use ipsec in transport mode to ensure the packets are always encrypted.
Tunnel Mode Overlay Network
For this example, we’ll allow unencrypted packets to flow over a routed network [4321::1] and [4322::2] (assume network device eth0 on each node) but set up an overlay network on [6666::N]/64 which is fully encrypted. Firstly, each node requires a local address addition for the [6666::N] address. So on node [4321::1] do
ip addr add 6666::1/128 dev eth0
Now add policies and state transforms in both directions (in this case a dir in policy is required otherwise the decapsulated packets won’t get sent up the input flow):
# required policy for encapsulation
ip xfrm policy add dst 6666::2 dir out tmpl src 4321::1 dst 4322::2 proto esp mode tunnel
# state transform for encapsulation
ip xfrm state add src 4321::1 dst 4322::2 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel
# policy to allow passing of decapsulated packets
ip xfrm policy add dst 6666::1 dir in tmpl proto esp mode tunnel level required
# automatic decapsulation. sel ensures addresses after decapsulation
ip xfrm state add src 4322::2 dst 4321::1 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::2 dst 6666::1
And on the other node [4322::2] do the same in reverse
ip addr add 6666::2/128 dev eth0
ip xfrm policy add dst 6666::1 dir out tmpl src 4321::1 dst 4322::2 proto esp mode tunnel
ip xfrm state add src 4322::2 dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel
ip xfrm policy add dst 6666::2 dir in tmpl proto esp mode tunnel level required
ip xfrm state add src 4321::1 dst 4322::2 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::1 dst 6666::2
Note there’s no need to add any routing entries because the decapsulation is point to point (incoming decapsulated packets always end up in the input flow destined for the local [6666::N] address). With the above rules you should be able to ping from [6666::1] to [6666::2] and tcpdump should show fully encrypted packets going over the wire.
Obviously, you can add more nodes but each time you have to add rules for all the other nodes, making this an N(N-1) scaling problem. The need for a specific source and destination template in the out policy means you must have one for each connection. The in policy can be subnet based. The reason why people use ipsec toolkits is that they can add the transforms just in time for the subset of nodes you’re actually communicating with rather than having to add all the rules up front.
Final Example: correcly keyed AH Inbound Packet Acceptance
The final example is me trying to penetrate my router firewall when on an external IPv6 connection. I have a class /60 set of IPv6 space, so each of my systems has its own IPv6 address but, as is usual, inbound packets in the NEW state are blocked. It occurred to me that I should be able to use ipsec AH (no real need for encryption since most of the protocols I use are encrypted anyway) to accept packets to an internal destination with the NEW state. This would be an asymmetric use of ipsec because inbound would have AH but return wouldn’t.
My initial thought was to use AH in transport mode, but as you can see from the diagram above that won’t work because the router merely forwards the packets and to get to a state decapsulation on the router they have to go up the INPUT flow.
The next attempt used tunnel mode, so the packet was aimed at the router and then the inner destination was the real node. The next problem was the fwd policy to permit this: the xfrm policy has no connection tracker and return packets from connections originating in the interior node also have to pass this filter. The solution to this conundrum is to install a level use policy and rely on the FORWARD table firewall rules to allow RELATED,ESTABLISHED and MARKed packets (so the decapsulation can add the MARK to pass this rule).
IPSEC on the Router
Assume my external router address is [4321::1] and my internal network is [4444::]/60. On the router, I install a catch all state transform
ip xfrm state add add dst 4321::1 proto ah spi 4321 auth "hmac(sha1)" 1234 mode tunnel sel dst 4444::/60 output-mark 0x1
But I also need a policy to permit the decapsulated packets (and the state RELATED,ESTABLISHED unencapsulated ones) to pass:
ip xfrm policy add dst 4444::/60 dir fwd tmpl proto ah spi 4321 mode tunnel level use
And finally I need an addition to the firewall rules to allow packets in state NEW but with mark 0x1 to pass
ip6tables -A FORWARD -m conntrack --ctstate NEW -m mark --mark 0x1/0x1 -j ACCEPT
Which should be placed directly after the RELATED,ESTABLISHED state check.
Since there’s no encapsulation on outbound, the return packets simply pass through the firewall as normal. This means that any external entity wishing to use this AH packet acceptance simply needs a policy and state to tunnel
ip xfrm policy dst 4444::/60 dir out tmpl proto ah dst 4321::1 mode tunnel
ip xfrm state add dst 4321::1 spi 4321 proto ah auth "hmac(sha1)" 1234 mode tunnel
And with that, any machine known by inner IPv6 address can be reached (for an IPv6 connected remote machine).
Conclusion
Hopefully this post has demystified some of the ip xfrm rules for you. I’m afraid the commands have a huge range of options, so I’ve only covered the essential ones above and there are still loads of interesting but not at all well documented ones remaining, but thanks to the examples you should have some scope now for playing with them.
Linux Plumbers Conference 2024 is pleased to host the eBPF Track!
After four years in a row of co-locating eBPF & Networking Tracks together, this year we separated the two in order to allow for both tracks to grow further individually as well as to bring more diversity into LPC by attracting more developers from each community.
The eBPF Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s BPF subsystem and its surrounding user space ecosystem such as libraries, loaders, compiler backends, and other related low-level system tooling.
The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of the BPF subsystem.
Proposals can cover a wide range of topics related to BPF covering improvements in areas such as (but not limited to) BPF infrastructure and its use in tracing, security, networking, scheduling and beyond, as well as non-kernel components like libraries, compilers, testing infra and tools.
Please come and join us in the discussion. We hope to see you there!
Sometimes debuggers and profilers are obviously broken, sometimes it's subtle and hard to spot. From my flame graphs page:
CPU flame graph (partly broken)
(Click for original SVG.) This is pretty common and usually goes unnoticed as the flame graph looks ok at first glance. But there are 15% of samples on the left, above "[unknown]", that are in the wrong place and missing frames. The problem is that this system has a default libc that has been compiled without frame pointers, so any stack walking stops at the libc layer, producing a partial stack that's missing the application frames. These partial stacks get grouped together on the left.
Click here for a longer explanation.
To explain this example in more detail: The profiler periodically interrupts software execution, and for those disconnected stacks it happens to be the execution of the kernel software ("vfs*", "ext*", etc.). Once interrupted, the profiler begins at the top edge of the flame graph (or bottom edge if you are using the icicle layout) and then "stack walks" down through the rectangles to collect a stack trace. It finally gets through the kernel frames then steps through the syscall (sys_write()) to userspace, hits the libc syscall wrapper (__GI___libc_write()), then tries to resolve the symbol for the next frame but fails and records "[unknown]". It fails because of a compiler optimization where the frame pointer register is used to store data instead of the frame pointer, but it's just a number so the profiler is unaware this happened and tries to match that address to a function symbol and fails (it is therefore an unknown symbol). Then, the profiler is usually unable to walk any more frames because that data doesn't point to the next frame either (and likely doesn't point to any valid mapping, because to the profiler it's effectively a random number), so it stops before it can reach the application frames. There's probably several frames missing from that left disconnected tower, similar to the application frames you see on the right (this example happens to be the bash(1) shell). What happens if the random data is a valid pointer by coincidence? You usually get an extra junk frame. I've seen situations where the random data ends up pointing to itself, so the profiler gets stuck in a loop and you get a tower of junk frames until perf hits its max frame limit.
Other types of profiling hit this more often. Off-CPU flame graphs, for example, can be dominated by libc read/write and mutex functions, so without frame pointers end up mostly broken. Apart from library code, maybe your application doesn't have frame pointers either, in which case everything is broken.
I'm posting about this problem now because Fedora and Ubuntu are releasing versions that fix it, by compiling libc and more with frame pointers by default. This is great news as it not only fixes these flame graphs, but makes off-CPU flame graphs far more practical. This is also a win for continuous profilers (my employer, Intel, just announced one) as it makes customer adoption easier.
What are frame pointers?
The x86-64 ABI documentation shows how a CPU register, %rbp, can be used as a "base pointer" to a stack frame, aka the "frame pointer." I pictured how this is used to walk stack traces in my BPF book.
Figure 3.3: Stack Frame with Base Pointer (x86-64 ABI)
This stack-walking technique is commonly used by external profilers and debuggers, including Linux perf and eBPF, and ultimately visualized by flame graphs. However, the x86-64 ABI has a footnote [12] to say that this register use is optional:
"The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp
(the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and
epilogue and makes one additional general-purpose register (%rbp) available."
(Trivia: I had penciled the frame pointer function prologue and epilogue on my Netflix office wall, lower left.)
2004: Their removal
In 2004 a compiler developer, Roger Sayle, changed gcc to stop generating frame pointers, writing:
"The simple patch below tweaks the i386 backend, such that we now
default to the equivalent of "-fomit-frame-pointer -ffixed-ebp" on
32-bit targets"
i386 (32-bit microprocessors) only have four general purpose registers, so freeing up %ebp takes you from four to five (or if you include %si and %di, from six to seven). I'm sure this delivered large performance improvements and I wouldn't try arguing against it. Roger cited two other reasons for this change: The desire to outperform Intel's icc compiler, and the belief that it didn't break debuggers (of the time) since they supported other stack walking techniques.
2005-2023: The winter of broken profilers
However, the change was then applied to x86-64 (64-bit) as well, which had over a dozen registers and didn't benefit so much from freeing up one more. And there are debuggers/profilers that this change did break (typically system profilers, not language specific ones), more so today with eBPF, which didn't exist back then. As my former Sun Microsystems colleague Eric Schrock (nickname Schrock) wrote in November 2004:
"On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn't going to open up a whole new world of compiler optimizations. You're just saving a pushl, movl, an series of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit - circumstances which usually resort to hand-coded assembly anyway. Given the benefit and the relative cost of losing debuggability, this hardly seems worth it."
In Schrock's conclusion:
"it's when people start compiling /usr/bin/ without frame pointers that it gets out of control."
This is exactly what happened on Linux, not just /usr/bin but also /usr/lib and application code! I'm sure there are people who are too new to the industry to remember the pre-2004 days when profilers would "just work" without OS and runtime changes.
2014: Java in Flames
Broken Java Stacks (2014)
When I joined Netflix in 2014, I found Java's lack of frame pointer support broke all application stacks (pictured in my 2014 Surge talk on the right). I ended up developing a fix for the JVM c2 compiler which Oracle reworked and added as the -XX:+PreserveFramePointer option in JDK8u60 (see my Java in Flames post for details [PDF]).
While that Java change led to discovering countless performance wins in application code, libc was still breaking some portion of the samples (as pictured in the example at the top of this post) and was breaking most stacks in off-CPU flame graphs. I started by compiling my own libc for production use with frame pointers, and then worked with Canonical to have one prebuilt for Ubuntu. For a while I was promoting the use of Canonical's libc6-prof, which was libc6 with frame pointers.
2015-2020: Overhead
As part of production rollout I did many performance overhead tests, which I've described publicly before: The overhead of adding frame pointers to everything (libc and Java) was usually less than 1%, with one exception of 10%. That 10% was an unusual application that was generating stack traces over 1000 frames deep (via Groovy), so deep that it broke Linux's perf profiler. Arnaldo Carvalho de Melo (Red Hat) added the kernel.perf_event_max_stack sysctl just for this Netflix workload. It was also a virtual machine that lacked low-level hardware profiling capabilities, so I wasn't able to do cycle analysis to confirm that the 10% was entirely frame pointer-based.
The actual overhead depends on your workload. Others have reported around 1% and around 2%. Microbenchmarks can be the worst, hitting 10%: This doesn't surprise me since they resolve to running a small funciton in a loop, and adding any instructions to that function can cause it to spill out of L1 cache warmth (or cache lines) causing a drop in performance. If I were analyzing such a microbenchmark, apart from observability anaylsis (cycles, instructions, PMU, PMCs, PEBS) there is also an experiment I'd like to try:
To test the theory of I-cache spillover: Compile the microbenchmark with and without frame pointers and find the performance delta. Then flame graph the microbenchmark to understand the hot function. Then add some inline assembly to the hot function where you add enough NOPs to the start and end to mimic the frame pointer prologue and epilogue (I recommend writing them on your office wall in pencil), compile it without frame pointers, disassemble the compiled binary to confirm those NOPs weren't stripped, and now test that. If the performance delta is still large (10%) you've confirmed that it is due to cache effects, and anyone who was worked at this level in production will tell you that it's the straw that broke the camel's back. Don't blame the straw, in this case, the frame pointers. Adding anything will cause the same effect. Having done this before, it reminds me of CSS programming: you make a little change here and everything breaks, and you spend hours chasing your own tail.
Another extreme example of overhead was the Python scimark_sparse_mat_mult benchmark, which could reach 10%. Fortunately this was analyzed by Andrii Nakryiko (Meta) who found it was a unusual case of a large function where gcc switched from %rsp offsets to %rbp-relative offsets, which took more bytes to store, causing performance issues. I've heard this has since been fixed so that Python can reenable frame pointers by default.
As I've seen frame pointers help find performance wins ranging from 5% to 500%, the typical "less than 1%" cost (or even 1% or 2% cost) is easily justified. But I'd rather the cost be zero, of course! We may get there with future technologies I'll cover later. In the meantime, frame pointers are the most practical way to find performance wins today.
What about Linux on devices where there is no chance of profiling or debugging, like electric toothbrushes? (I made that up, AFAIK they don't run Linux, but I may be wrong!) Sure, compile without frame pointers. The main users of this change are enterprise Linux. Back-end servers.
2022: Upstreaming, first attempt
Other large companies with OS and perf teams (Meta, Google) hinted strongly that they had already enabled frame pointers for everything years earlier. (Google should be no surprise because they pioneered continuous profiling.) So at this point you had Google, Meta, and Netflix running their own libc with frame pointers and able to enjoy profiling capabilities that most other companies – without dedicated OS teams – couldn't get working. Can't we just upstream this so everyone can benefit?
There's a bunch of difficulties when taking "works well for me" changes and trying to make them the default for everyone. Among the difficulties is that end-user companies don't have a clear return on the investment from telling their Linux vendor what they fixed, since they already fixed it. I guess the investment is quite small, we're talking about a single email, right?...Wrong! Your suggestion is now a 116-post thread where everyone is sharing different opinions and demanding this and that, as we found out the hard way. For Fedora, one person requested:
"Meta and/or Netflix should provide infrastructure for a side repository in which the change can be tested and benchmarked and the code size measured."
(Bear in mind that Netflix doesn't even use Fedora!)
Jonathan Corbet, who writes the best Linux articles, summarized this in "Fedora's tempest in a stack frame" which is so detailed that I feel PTSD when reading it. It's good that the Fedora community wants to be so careful, but I'd rather spend time discussing building something better than frame pointers, perhaps involving ORC, LBR, eBPF, and other technologies, than so much worry about looking bad in kitchen-sink benchmarks that I wouldn't trust in the first place.
2023, 2024: Frame Pointers in Fedora and Ubuntu!
Fedorarevisited the proposal and has accepted it this time, making it the first distro to reenable frame pointers. Thank you!
UPDATE: I've now heard that Arch Linux is also enabling frame pointers! Thanks Daan De Meyer (Meta).
While this fixes stack walking through OS libraries, you might find your application still doesn't support stack tracing, but that's typically much easier to fix. Java, for example, has the -XX:+PreserveFramePointer option. There were ways to get Golang to support frame pointers, but that became the default years ago. Just to name a couple of languages.
2034+: Beyond Frame Pointers
There's more than one way to walk a stack. These could be separate blog posts, but I want to comment briefly on alternates:
LBR (Last Branch Record): Intel's hardware feature that was limited to 16 or 32 frames. Most application stacks are deeper, so this can't be used to build flame graphs, but it is better than nothing. I use it as a last resort as it gives me some stack insights.
BTS (Branch Trace Store): Another Intel thing. Not so limited to stack depth, but has overhead from memory load/stores and BTS buffer overflow interrupt handling.
AET (Archetectural Event Trace): Another Intel thing. It's a JTAG-based tracer that can trace low-level CPU, BIOS, and device events, and apparently can be used for stack traces as well. I haven't used it. (I spent years as a cloud customer where I couldn't access many HW-level things.) I hope it can be configured to output to main memory, and not just a physical debug port.
DWARF: Binary debuginfo, has been used forever with debuggers. Update: I'd said it doesn't exist for JIT'd runtimes like the Java JVM, but others have pointed out there has been some JIT->DWARF work done. I still don't expect it to be practical on busy production servers that are constantly in c2. The overhead just to walk DWARF is also high, as it was designed for non-realtime use. Javier Honduvilla Coto (Polar Signals) did some interesting work using an eBPF walker to reduce the overhead, but...Java.
eBPF stack walking: Mark Wielaard (Red Hat) demonstrated a Java JVM stack walker using SystemTap back at LinuxCon 2014, where an external tracer walked a runtime with no runtime support or help. Very cool. This can be done using eBPF as well. The performmance overhead could be too high, however, as it may mean a lot of user space reads of runtime internals depending on the runtime. It would also be brittle; such eBPF stack walkers should ship with the language code base and be maintained with it.
ORC (oops rewind capability): The Linux kernel's new lightweight stack unwinder by Josh Poimboeuf (Red Hat) that has allowed newer kernels to remove frame pointers yet retain stack walking. You may be using ORC without realizing it; the rollout was smooth as the kernel profiler code was updated to support ORC (perf_callchain_kernel()->unwind_orc.c) at the same time as it was compiled to support ORC. Can't ORCs invade user space as well?
SFrames (Stack Frames): ...which is what SFrames does: lightweight user stack unwinding based on ORC. There have been recent talks to explain them by Indu Bhagat (Oracle) and Steven Rostedt (Google). I should do a blog post just on SFrames.
Shadow Stacks: A newer Intel and AMD security feature that can be configured to push function return addresses onto a separate HW stack so that they can be double checked when the return happens. Sounds like such a HW stack could also provide a stack trace, without frame pointers.
(And this isn't even all of them.)
Daan De Meyer (Meta) did a nice summary as well of different stack walkers on the Fedora wiki.
So what's next? Here's my guesses:
2029: Ubuntu and Fedora release new versions with SFrames for OS components (including libc) and ditches frame pointers again. We'll have had five years of frame pointer-based performance wins and new innovations that make use of user space stacks (e.g., better automated bug reporting), and will hit the ground running with SFrames.
2034: Shadow stacks have been enabled by default for security, and then are used for all stack tracing.
Conclusion
I could say that times have changed and now the original 2004 reasons for omitting frame pointers are no longer valid in 2024. Those reasons were that it improved performance significantly on i386, that it didn't break the debuggers of the day (prior to eBPF), and that competing with another compiler (icc) was deemed important. Yes, times have indeed changed. But I should note that one engineer, Eric Schrock, claimed that it didn't make sense back in 2004 either when it was applied to x86-64, and I agree with him. Profiling has been broken for 20 years and we've only now just fixed it.
Fedora and Ubuntu have now returned frame pointers, which is great news. People should start running these releases in 2024 and will find that CPU flame graphs make more sense, Off-CPU flame graphs work for the first time, and other new things become possible. It's also a win for continuous profilers, as they don't need to convince their customers to make OS changes to get profiles to fully work.
Thanks
The online threads about this change aren't even everything, there have been many discussions, meetings, and work put into this, not just for frame pointers but other recent advances including ORC and SFrames. Special thanks to Andrii Nakryiko (Meta), Daan De Meyer (Meta), Davide Cavalca (Meta), Neal Gompa (Velocity Limitless), Ian Rogers (Google), Steven Rostedt (Google), Josh Poimboeuf (Red Hat), Arjan Van De Ven (Intel), Indu Bhagat (Oracle), Mark Shuttleworth (Canonical), Jon Seager (Canonical), Oliver Smith (Canonical), Javier Honduvilla Coto (Polar Signals), Mark Wielaard (Red Hat), Ben Cotton (Red Hat), and many others (see the Fedora discussions). And thanks to Schrock.
Appendix: Fedora
For reference, here's my writeup for the Fedora change:
I enabled frame pointers at Netflix, for Java and glibc, and summarized the effect in BPF Performance Tools (page 40):
"Last time I studied the performance gain from frame pointer omission in our production environment, it was usually less than one percent, and it was often so close to zero that it was difficult to measure. Many microservices at Netflix are running with the frame pointer reenabled, as the performance wins found by CPU profiling outweigh the tiny loss of performance."
I've spent a lot of time analyzing frame pointer performance, and I did the original work to add them to the JVM (which became -XX:+PreserveFramePoiner). I was also working with another major Linux distro to make frame pointers the default in glibc, although I since changed jobs and that work has stalled. I'll pick it up again, but I'd be happy to see Fedora enable it in the meantime and be the first to do so.
We need frame pointers enabled by default because of performance. Enterprise environments are monitored, continuously profiled, and analyzed on a regular basis, so this capability will indeed be put to use. It enables a world of debugging and new performance tools, and once you find a 500% perf win you have a different perspective about the <1% cost. Off-CPU flame graphs in particular need to walk the pthread functions in glibc as most blocking paths go through them; CPU flame graphs need them as well to reconnect the floating glibc tower of futex/pthread functions with the developers code frames.
I see the comments about benchmark results of up to 10% slowdowns. It's good to look out for regressions, although in my experience all benchmarks are wrong or deeply misleading. You'll need to do cycle analysis (PEBS-based) to see where the extra cycles are, and if that makes any sense. Benchmarks can be super sensitive to degrading a single hot function (like "CPU benchmarks" that really just hammer one function in a loop), and if extra instructions (function prologue) bump it over a cache line or beyond L1 cache-warmth, then you can get a noticeable hit. This will happen to the next developer who adds code anyway (assuming such a hot function is real world) so the code change gets unfairly blamed. It will only regress in this particular scenario, and regression is inevitable. Hence why you need the cycle analysis ("active benchmarking") to make sense of this.
There was one microservice that was an outlier and had a 10% performance loss with Java frame pointers enabled (not glibc, I've never seen a big loss there). 10% is huge. This was before PMCs were available in the cloud, so I could do little to debug it. Initially the microservice ran a "flame graph canary" instance with FPs for flame graphs, but the developers eventually just enabled FPs across the whole microservice as the gains they were finding outweighed the 10% cost. This was the only noticeable (as in, >1%) production regression we saw, and it was a microservice that was bonkers for a variety of reasons, including stack traces that were over 1000 frames deep (and that was after inlining! Over 3000 deep without. ACME added the perf_event_max_stack sysctl just so Netflix could profile this microservice, as the prior limit was 128). So one possibility is that the extra function prologue instructions add up if you frequently walk 1000 frames of stack (although I still don't entirely buy it). Another attribute was that the microservice had over 1 Gbyte of instruction text (!), and we may have been flying close to the edge of hardware cache warmth, where adding a bit more instructions caused a big drop. Both scenarios are debuggable with PMCs/PEBS, but we had none at the time.
So while I think we need to debug those rare 10%s, we should also bear in mind that customers can recompile without FPs to get that performance back. (Although for that microservice, the developers chose to eat the 10% because it was so valuable!) I think frame pointers should be the default for enterprise OSes, and to opt out if/when necessary, and not the other way around. It's possible that some math functions in glibc should opt out of frame pointers (possibly fixing scimark, FWIW), but the rest (especially pthread) needs them.
In the distant future, all runtimes should come with an eBPF stack walker, and the kernel should support hopping between FPs, ORC, LBR, and eBPF stack walking as necessary. We may reach a point where we can turn off FPs again. Or maybe that work will never get done. Turning on FPs now is an improvement we can do, and then we can improve it more later.
For some more background: Eric Schrock (my former colleague at Sun Microsystems) described the then-recent gcc change in 2004 as "a dubious optimization that severely hinders debuggability" and that "it's when people start compiling /usr/bin/* without frame pointers that it gets out of control" I recommend reading his post: [0].
The original omit FP change was done for i386 that only had four general-purpose registers and saw big gains freeing up a fifth, and it assumed stack walking was a solved problem thanks to gdb(1) without considering real-time tracers, and the original change cites the need to compete with icc [1]. We have a different circumstance today -- 18 years later -- and it's time we updated this change.
[0] http://web.archive.org/web/20131215093042/https://blogs.oracle.com/eschrock/entry/debugging_on_amd64_part_one
[1] https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01033.html
Closing arguments in the trial between various people and Craig Wright over whether he's Satoshi Nakamoto are wrapping up today, amongst a bewildering array of presented evidence. But one utterly astonishing aspect of this lawsuit is that expert witnesses for both sides agreed that much of the digital evidence provided by Craig Wright was unreliable in one way or another, generally including indications that it wasn't produced at the point in time it claimed to be. And it's fascinating reading through the subtle (and, in some cases, not so subtle) ways that that's revealed.
One of the pieces of evidence entered is screenshots of data from Mind Your Own Business, a business management product that's been around for some time. Craig Wright relied on screenshots of various entries from this product to support his claims around having controlled meaningful number of bitcoin before he was publicly linked to being Satoshi. If these were authentic then they'd be strong evidence linking him to the mining of coins before Bitcoin's public availability. Unfortunately the screenshots themselves weren't contemporary - the metadata shows them being created in 2020. This wouldn't fundamentally be a problem (it's entirely reasonable to create new screenshots of old material), as long as it's possible to establish that the material shown in the screenshots was created at that point. Sadly, well.
One part of the disclosed information was an email that contained a zip file that contained a raw database in the format used by MYOB. Importing that into the tool allowed an audit record to be extracted - this record showed that the relevant entries had been added to the database in 2020, shortly before the screenshots were created. This was, obviously, not strong evidence that Craig had held Bitcoin in 2009. This evidence was reported, and was responded to with a couple of additional databases that had an audit trail that was consistent with the dates in the records in question. Well, partially. The audit record included session data, showing an administrator logging into the data base in 2011 and then, uh, logging out in 2023, which is rather more consistent with someone changing their system clock to 2011 to create an entry, and switching it back to present day before logging out. In addition, the audit log included fields that didn't exist in versions of the product released before 2016, strongly suggesting that the entries dated 2009-2011 were created in software released after 2016. And even worse, the order of insertions into the database didn't line up with calendar time - an entry dated before another entry may appear in the database afterwards, indicating that it was created later. But even more obvious? The database schema used for these old entries corresponded to a version of the software released in 2023.
This is all consistent with the idea that these records were created after the fact and backdated to 2009-2011, and that after this evidence was made available further evidence was created and backdated to obfuscate that. In an unusual turn of events, during the trial Craig Wright introduced further evidence in the form of a chain of emails to his former lawyers that indicated he had provided them with login details to his MYOB instance in 2019 - before the metadata associated with the screenshots. The implication isn't entirely clear, but it suggests that either they had an opportunity to examine this data before the metadata suggests it was created, or that they faked the data? So, well, the obvious thing happened, and his former lawyers were asked whether they received these emails. The chain consisted of three emails, two of which they confirmed they'd received. And they received a third email in the chain, but it was different to the one entered in evidence. And, uh, weirdly, they'd received a copy of the email that was submitted - but they'd received it a few days earlier. In 2024.
And again, the forensic evidence is helpful here! It turns out that the email client used associates a timestamp with any attachments, which in this case included an image in the email footer - and the mysterious time travelling email had a timestamp in 2024, not 2019. This was created by the client, so was consistent with the email having been sent in 2024, not being sent in 2019 and somehow getting stuck somewhere before delivery. The date header indicates 2019, as do encoded timestamps in the MIME headers - consistent with the mail being sent by a computer with the clock set to 2019.
But there's a very weird difference between the copy of the email that was submitted in evidence and the copy that was located afterwards! The first included a header inserted by gmail that included a 2019 timestamp, while the latter had a 2024 timestamp. Is there a way to determine which of these could be the truth? It turns out there is! The format of that header changed in 2022, and the version in the email is the new version. The version with the 2019 timestamp is anachronistic - the format simply doesn't match the header that gmail would have introduced in 2019, suggesting that an email sent in 2022 or later was modified to include a timestamp of 2019.
This is by no means the only indication that Craig Wright's evidence may be misleading (there's the whole argument that the Bitcoin white paper was written in LaTeX when general consensus is that it's written in OpenOffice, given that's what the metadata claims), but it's a lovely example of a more general issue.
Our technology chains are complicated. So many moving parts end up influencing the content of the data we generate, and those parts develop over time. It's fantastically difficult to generate an artifact now that precisely corresponds to how it would look in the past, even if we go to the effort of installing an old OS on an old PC and setting the clock appropriately (are you sure you're going to be able to mimic an entirely period appropriate patch level?). Even the version of the font you use in a document may indicate it's anachronistic. I'm pretty good at computers and I no longer have any belief I could fake an old document.
(References: this Dropbox, under "Expert reports", "Patrick Madden". Initial MYOB data is in "Appendix PM7", further analysis is in "Appendix PM42", email analysis is "Sixth Expert Report of Mr Patrick Madden")
In a different epoch, before the pandemic, I’ve done a presentation about
upstream first at the Siemens Linux Community
Event 2018, where I’ve tried to explain the fundamentals of open source using
microeconomics. Unfortunately that talk didn’t work out too well with an
audience that isn’t well-versed in upstream and open source concepts, largely
because it was just too much material crammed into too little time.
Last year I got the opportunity to try again for an Intel-internal event series,
and this time I’ve split the material into two parts. I think that worked a lot
better. For obvious reasons I cannot publish the recordings, but I can publish
the slides.
The first part “Upstream, Why?” covers a
few concepts from microeconomcis 101, and then applies them to upstream stream
open source. The key concept is on one hand that open source achieves an
efficient software market in the microeconomic sense by driving margins and
prices to zero. And the only way to make money in such a market is to either
have more-or-less unstable barriers to entry that prevent the efficient market
from forming and destroying all monetary value. Or to sell a complementary
product.
Individual engineers, who have skills and create a product with zero economic
value, and might still be stupid enough and try to build a career on that.
Upstream communities, often with a formal structure as a foundation, and what
exactly their goals should be to build a thriving upstream open source project
that can actually pay some bills, generate some revenue somewhere else and get
engineers paid. Because without that you’re not going to have much of a
project with a long term future.
Engineering organizations, what exactly their incentives and goals should
be, and the fundamental conflicts of interest this causes. Specifically on
this I’ve only seen bad solutions, and ugly solutions, but not yet a really
good one. A relevant pre-pandemic talk of mine on this topic is also
“Upstream Graphics: Too Little, Too
Late”
And finally the overall business and more importantly, what kind of business
strategy is needed to really thrive with an open source upstream first
approach: You need to clearly understand which software market’s economic
value you want to destroy by driving margins and prices to zero, and which
complemenetary product you’re selling to still earn money.
At least judging by the feedback I’ve received internally taking more time and
going a bit more in-depth on the various concept worked much better than the
keynote presentation I’ve done at Siemens, hence I decided to publish at the
least the slides.
eBPF is a crazy technology – like putting JavaScript into the Linux kernel – and getting it accepted had so far been an untold story of strategy and ingenuity. The eBPF documentary, published late last year, tells this story by interviewing key players from 2014 including myself, and touches on new developments including Windows. (If you are new to eBPF, it is the name of a kernel execution engine that runs a variety of new programs in a performant and safe sandbox in the kernel, like how JavaScript can run programs safely in a browser sandbox; it is also no longer an acronym.) The documentary was played at KubeCon and is on youtube:
Watching this brings me right back to 2014, to see the faces and hear their voices discussing the problems we were trying to fix. Thanks to Speakeasy Productions for doing such a great job with this documentary, and letting you experience what it was like in those early days. This is also a great example of all the work that goes on behind the scenes to get code merged in a large codebase like Linux.
When Alexei Starovoitov visited Netflix in 2014 to discuss eBPF with myself and Amer Ather, we were so entranced that we lost track of time and were eventually kicked out of the meeting room as another meeting was starting. It was then I realized that we had missed lunch! Alexei sounded so confident that I was convinced that eBPF was the future, but a couple of times he added "if the patches get merged." If they get merged?? They have to get merged, this idea is too good to waste.
While only several of us worked on eBPF in 2014, more joined in 2015 and later, and there are now hundreds contributing to make it what it is. A longer documentary could also interview Brendan Blanco (bcc), Yonghong Song (bcc), Sasha Goldshtein (bcc), Alastair Robertson (bpftrace), Tobais Waldekranz (ply), Andrii Nakryiko, Joe Stringer, Jakub Kicinski, Martin KaFai Lau, John Fastabend, Quentin Monnet, Jesper Dangaard Brouer, Andrey Ignatov, Stanislav Fomichev, Teng Qin, Paul Chaignon, Vicent Marti, Dan Xu, Bas Smit, Viktor Malik, Mary Marchini, and many more. Thanks to everyone for all the work.
Ten years later it still feels like it's early days for eBPF, and a great time to get involved: It's likely already available in your production kernels, and there are tools, libraries, and documentation to help you get started.
I hope you enjoy the documentary. PS. Congrats to Isovalent, the role-model eBPF startup, as Cisco recently announced they would acquire them!
Linux Plumbers Conference 2024 is pleased to host the Toolchains Track!
The aim of the Toolchains track is to fix particular toolchain issues which are of the interest of the kernel and, ideally, find solutions in situ, making the best use of the opportunity of live discussion with kernel developers and maintainers. In particular, this is not about presenting research nor abstract/miscellaneous toolchain work.
The track will be composed of activities, of variable length depending on the topic being discussed. Each activity is intended to cover a particular topic or issue involving both the Linux kernel and one or more of its associated toolchains and development tools. This includes compiling, linking, assemblers, debuggers and debugging formats, ABI analysis tools, object manipulation, etc. Few slides shall be necessary, and most of the time shall be devoted to actual discussion, brainstorming and seeking agreement.
Please come and join us in the discussion. We hope to see you there!
Following generic guides (e.g. at Akamai Linode) almost got it all working with ease. There were a few minor problems with permissions.
Problem: Feb 28 11:45:17 takane postfix/smtpd[1137214]: warning: connect to Milter service local:/run/opendkim/opendkim.sock: Permission denied Solution: add postfix to opendkim group; no change to Umask etc.
Problem: Feb 28 13:36:39 takane opendkim[1136756]: 5F1F4DB81: no signing table match for 'zaitcev@kotori.zaitcev.us' Solution: change SigningTable from refile: to a normal file
We have a cabin out in the forest, and when I say "out in the forest" I mean "in a national forest subject to regulation by the US Forest Service" which means there's an extremely thick book describing the things we're allowed to do and (somewhat longer) not allowed to do. It's also down in the bottom of a valley surrounded by tall trees (the whole "forest" bit). There used to be AT&T copper but all that infrastructure burned down in a big fire back in 2021 and AT&T no longer supply new copper links, and Starlink isn't viable because of the whole "bottom of a valley surrounded by tall trees" thing along with regulations that prohibit us from putting up a big pole with a dish on top. Thankfully there's LTE towers nearby, so I'm simply using cellular data. Unfortunately my provider rate limits connections to video streaming services in order to push them down to roughly SD resolution. The easy workaround is just to VPN back to somewhere else, which in my case is just a Wireguard link back to San Francisco.
This worked perfectly for most things, but some streaming services simply wouldn't work at all. Attempting to load the video would just spin forever. Running tcpdump at the local end of the VPN endpoint showed a connection being established, some packets being exchanged, and then… nothing. The remote service appeared to just stop sending packets. Tcpdumping the remote end of the VPN showed the same thing. It wasn't until I looked at the traffic on the VPN endpoint's external interface that things began to become clear.
This probably needs some background. Most network infrastructure has a maximum allowable packet size, which is referred to as the Maximum Transmission Unit or MTU. For ethernet this defaults to 1500 bytes, and these days most links are able to handle packets of at least this size, so it's pretty typical to just assume that you'll be able to send a 1500 byte packet. But what's important to remember is that that doesn't mean you have 1500 bytes of packet payload - that 1500 bytes includes whatever protocol level headers are on there. For TCP/IP you're typically looking at spending around 40 bytes on the headers, leaving somewhere around 1460 bytes of usable payload. And if you're using a VPN, things get annoying. In this case the original packet becomes the payload of a new packet, which means it needs another set of TCP (or UDP) and IP headers, and probably also some VPN header. This still all needs to fit inside the MTU of the link the VPN packet is being sent over, so if the MTU of that is 1500, the effective MTU of the VPN interface has to be lower. For Wireguard, this works out to an effective MTU of 1420 bytes. That means simply sending a 1500 byte packet over a Wireguard (or any other VPN) link won't work - adding the additional headers gives you a total packet size of over 1500 bytes, and that won't fit into the underlying link's MTU of 1500.
And yet, things work. But how? Faced with a packet that's too big to fit into a link, there are two choices - break the packet up into multiple smaller packets ("fragmentation") or tell whoever's sending the packet to send smaller packets. Fragmentation seems like the obvious answer, so I'd encourage you to read Valerie Aurora's article on how fragmentation is more complicated than you think. tl;dr - if you can avoid fragmentation then you're going to have a better life. You can explicitly indicate that you don't want your packets to be fragmented by setting the Don't Fragment bit in your IP header, and then when your packet hits a link where your packet exceeds the link MTU it'll send back a packet telling the remote that it's too big, what the actual MTU is, and the remote will resend a smaller packet. This avoids all the hassle of handling fragments in exchange for the cost of a retransmit the first time the MTU is exceeded. It also typically works these days, which wasn't always the case - people had a nasty habit of dropping the ICMP packets telling the remote that the packet was too big, which broke everything.
What I saw when I tcpdumped on the remote VPN endpoint's external interface was that the connection was getting established, and then a 1500 byte packet would arrive (this is kind of the behaviour you'd expect for video - the connection handshaking involves a bunch of relatively small packets, and then once you start sending the video stream itself you start sending packets that are as large as possible in order to minimise overhead). This 1500 byte packet wouldn't fit down the Wireguard link, so the endpoint sent back an ICMP packet to the remote telling it to send smaller packets. The remote should then have sent a new, smaller packet - instead, about a second after sending the first 1500 byte packet, it sent that same 1500 byte packet. This is consistent with it ignoring the ICMP notification and just behaving as if the packet had been dropped.
All the services that were failing were failing in identical ways, and all were using Fastly as their CDN. I complained about this on social media and then somehow ended up in contact with the engineering team responsible for this sort of thing - I sent them a packet dump of the failure, they were able to reproduce it, and it got fixed. Hurray!
(Between me identifying the problem and it getting fixed I was able to work around it. The TCP header includes a Maximum Segment Size (MSS) field, which indicates the maximum size of the payload for this connection. iptables allows you to rewrite this, so on the VPN endpoint I simply rewrote the MSS to be small enough that the packets would fit inside the Wireguard MTU. This isn't a complete fix since it's done at the TCP level rather than the IP level - so any large UDP packets would still end up breaking)
I've no idea what the underlying issue was, and at the client end the failure was entirely opaque: the remote simply stopped sending me packets. The only reason I was able to debug this at all was because I controlled the other end of the VPN as well, and even then I wouldn't have been able to do anything about it other than being in the fortuitous situation of someone able to do something about it seeing my post. How many people go through their lives dealing with things just being broken and having no idea why, and how do we fix that?
(Edit: thanks to this comment, it sounds like the underlying issue was a kernel bug that Fastly developed a fix for - under certain configurations, the kernel fails to associate the MTU update with the egress interface and so it continues sending overly large packets)
Speaking of S3 becoming strongly consistent, Swift was strongly consistent for objects in practice from the very beginning. All the "in practice" assumes a reasonably healthy cluster. It's very easy, really. When your client puts an object into a cluster and receives a 201, it means that a quorum of back-end nodes stored that object. Therefore, for you to get a stale version, you need to find a proxy that is willing to fetch you an old copy. That can only happen if the proxy has no access to any of the back-end nodes with the new object.
Somewhat unfortunately for the lovers of consistency, we made a decision that makes the observation of eventual consistency easier some 10 years ago. We allowed a half of an even replication factor to satisfy quorum. So, if you have a distributed cluster with factor of 4, a client can write an object into 2 nearest nodes, and receive a success. That opens a window for another client to read an old object from the other nodes.
Oh, well. Original Swift defaulted for odd replication factors, such as 3 and 5. They provided a bit of resistance to intercontinental scenarios, at the cost of the client knowing immediately if a partition is occurring. But a number of operators insisted that they preferred the observable eventual consistency. Remember that the replication factor is configurable, so there's no harm, right?
Alas, being flexible that way helps private clusters only. Because Swift generally hides the artitecture of the cluster from clients, they cannot know if they can rely on consistency of a random public cluster.
Either way, S3 does something that Swift cannot do here: the consistency of bucket listings. These things start lagging in Swift at a drop of a hat. If the container servers fail to reply in milliseconds, storage nodes push the job onto updaters and proceed. The delay in listigs is often observable in Swift, which is one reason why people doing POSIX overlays often have their own manifests.
I'm quite impressed by S3 doing this, given their scale especially. Curious, too. I wish I could hack on their system a little bit. But it's all proprietary, alas.
As was recently announced, the Linux kernel project has been accepted as a CNA as a CVE Numbering Authority (CNA) for vulnerabilities found in Linux.
This is a trend, of more open source projects taking over the haphazard assignments of CVEs against their project by becoming a CNA so that no other group can assign CVEs without their involvment. Here’s the curl project doing much the same thing for the same reasons.
The Linux Plumbers Conference is proud to announce that it’s website for 2024 is up and the CFP has been issued. We will be running a hybrid conference as usual, but the in-person venue will be Vienna, Austria from 18-20 September. Deadlines to submit are 4 April for Microconferences and 16 June for Refereed and Kernel Summit track presentations. Details for other tracks and accepted Microconferences will be posted later.
Vulkan Video AV1 decode has been released, and I had some partly working support on Intel ANV driver previously, but I let it lapse.
The branch is currently [1]. It builds, but is totally untested, I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames.
Update: the current branch decodes one frame properly, reference frames need more work unfortunately.
In order to save mass-storage space and to reduce boot times, rcutorture runs out of a tiny initrd filesystem that consists only of a root directory and a statically linked init program based on nolibc. This init program binds itself to a randomly chosen CPU, spins for a short time, sleeps for a short time, and repeats, the point being to inject at least a little userspace execution for the benefit of nohz_full CPUs.
This works very well most of the time. But what if you want to use a full userspace when torturing RCU, perhaps because you want to test runtime changes to RCU's many sysfs parameters, run heavier userspace loads on nohz_full CPUs, or even to run BPF programs?
What you do is go back to the old way of building rcutorture's initrd.
Which raises the question as to what the new way might be.
What rcutorture does is to look at the tools/testing/selftests/rcutorture/initrd directory. If this directory does not already a file named init, the tools/testing/selftests/rcutorture/bin/mkinitrd.sh script builds the aforementioned statically linked init program.
Which means that you can put whatever initrd file tree you wish into that initrd directory, and as long as it contains a /init program, rcutorture will happily bundle that file tree into an initrd in each the resulting rcutorture kernel images.
And back in the old days, that is exactly what I did. I grabbed any convenient initrd and expanded it into my tools/testing/selftests/rcutorture/initrd directory. This still works, so you can do this too!
The Khronos Group announced VK_KHR_video_decode_av1 [1], this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out.
This extension is one of the ones I've been wanting for a long time, since having royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a MESA extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors.
I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. This is a great feature of having open source drivers, it allows a lot quicker turn around time in bug fixes when devs can fix them themselves!