Kernel Planet

October 15, 2021

Paul E. Mc Kenney: Verification Challenges

You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!

  1. Stupid RCU Tricks: rcutorture Catches an RCU Bug, also known as Verification Challenge 1.
  2. Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE.
  3. Verification Challenge 3: cbmc.
  4. Verification Challenge 4: Tiny RCU.
  5. Verification Challenge 5: Uses of RCU.
  6. Verification Challenge 6: Linux-Kernel Tree RCU.
  7. Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU.

October 15, 2021 06:22 PM

October 13, 2021

Paul E. Mc Kenney: TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) They also assume that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives, as sketched following this list. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I were doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed.
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.
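
As a concrete (if minimal) illustration of the wrappering called out in the first item, consider locking. Because Rust cannot inline C's static inline functions without LTO, each wrapped primitive would presumably become an out-of-line C helper that Rust's foreign-function interface can call. The following sketch makes that assumption, and the rust_helper_ names are illustrative only, not an established API:

#include <linux/spinlock.h>

/* Out-of-line helpers exposing inline C locking primitives to Rust. */
void rust_helper_spin_lock(spinlock_t *lock)
{
  spin_lock(lock);
}

void rust_helper_spin_unlock(spinlock_t *lock)
{
  spin_unlock(lock);
}

The cost is an extra function call per operation, which is another reason why the short-term focus on performance-insensitive device drivers makes sense.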

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much of the wrappering work called out in the first three items above has already been done.

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help.

Long-Term Recommendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables; a sketch of this hazard follows this list.)
  4. Data races involving sequence-locking updaters should be detected, for example, via KCSAN.
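
To make the third item's refetch hazard concrete, here is a minimal C sketch, assuming a hypothetical seqlock named foo_lock protecting a pair of variables foo_x and foo_y. The READ_ONCE() calls are what rule out compiler refetching and load tearing; with plain C-language accesses, the consistency that read_seqretry() just verified could be destroyed after the fact:

static DEFINE_SEQLOCK(foo_lock);
static int foo_x, foo_y;

static void foo_read(int *x, int *y)
{
  unsigned int seq;

  do {
    seq = read_seqbegin(&foo_lock);
    *x = READ_ONCE(foo_x);  /* Marked: no refetching, no load tearing. */
    *y = READ_ONCE(foo_y);
  } while (read_seqretry(&foo_lock, seq));
}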

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause other threads' ownership of the specified object to transition from readers to unowned at the end of a grace period. Again, this is not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period. (The C-language pattern that a Rust wrapper would need to capture is sketched following this list.)
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some way of marking such pointers so that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.
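
As a reference point for the first item, here is the C-language call_rcu() pattern whose deferred ownership transition a Rust wrapper would need to capture. The struct foo type and its fields are hypothetical:

struct foo {
  struct rcu_head rh;
  int data;
};

static void foo_reclaim(struct rcu_head *rhp)
{
  struct foo *fp = container_of(rhp, struct foo, rh);

  kfree(fp);  /* Now unowned: all pre-existing readers are done. */
}

static void foo_remove(struct foo *fp)
{
  /* ... unlink fp so that new readers cannot find it ... */
  call_rcu(&fp->rh, foo_reclaim);  /* Ownership transition is deferred. */
}

Between the call_rcu() invocation and the end of the grace period, fp is owned neither by the updater nor by any one reader, which is precisely what makes this desideratum interesting.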

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.

Rationale

Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of iterator-like constructs found not only in modern languages, but also in C code within the Linux kernel. Except that the Linux kernel still contains "while" loops, as does a great deal of other software. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" ;-)

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.

History

October 18, 2021: Add "Rationale" section.

October 13, 2021 09:08 PM

October 06, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoid undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption in fact inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:


  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores), or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model. (A sketch of this acquire/release discipline follows this list.)
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects, might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.
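
As a sketch of the acquire/release-only discipline of the second option, here is the classic message-passing pattern expressed with Linux-kernel primitives; the flag and message variables are hypothetical:

static int message;
static int flag;

static void producer(void)
{
  message = 42;                 /* Plain store... */
  smp_store_release(&flag, 1);  /* ...ordered before this release store. */
}

static void consumer(void)
{
  if (smp_load_acquire(&flag))    /* Acquire load orders... */
    WARN_ON_ONCE(message != 42);  /* ...this read after it. */
}

This subset maps directly onto the C/C++ (and thus presumably the Rust) memory model, which is what makes the second option attractive.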

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel Rust code. In contrast, the last two options would require code-style restrictions (preferably automated), just as is already the case for current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! ;-)

History

October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda.
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.

October 06, 2021 09:48 PM

Paul E. Mc Kenney: Can the Kernel Concurrency Sanitizer Own Rust Code?

Given the data-race-freedom guarantees of Rust's non-unsafe code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need unsafe Rust code. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the unsafe Rust code and the C code can interfere with Rust non-unsafe code, and furthermore there are special cases where safe code can violate unsafe code's assumptions. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.

As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:


  1. __kcsan_check_access(): This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, KCSAN_ACCESS_WRITE for write, and KCSAN_ACCESS_ATOMIC for atomic read-modify-write).
  2. The __tsan_{read,write}{1,2,4,8}() family of functions serve the same purpose as __kcsan_check_access(), but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.
  3. ASSERT_EXCLUSIVE_ACCESS() is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion ASSERT_EXCLUSIVE_ACCESS_SCOPED() expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.
  4. ASSERT_EXCLUSIVE_WRITER() is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion ASSERT_EXCLUSIVE_WRITER_SCOPED() expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.
  5. ASSERT_EXCLUSIVE_BITS() is similar to ASSERT_EXCLUSIVE_WRITER(), but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.

These last three categories of interface members are designed to be directly invoked in order to check concurrency designs. See for example the direct call to ASSERT_EXCLUSIVE_WRITER() from the rcu_cpu_starting() function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.
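
For example, a variable that is supposed to be written only by one particular updater, but that may be read locklessly, might be instrumented as follows; the my_counter variable and its updater are hypothetical:

static unsigned long my_counter;

static void my_counter_inc(void)
{
  /* Lockless readers are fine, but any concurrent writer is a bug
   * that KCSAN should report. */
  ASSERT_EXCLUSIVE_WRITER(my_counter);
  WRITE_ONCE(my_counter, my_counter + 1);
}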

KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenienced end users, so it just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the rustc compiler already has sanitizer support, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant -mllvm options to rustc and to ensure that it is not attempting to link against compiler-rt because doing so is not appropriate when building the Linux kernel. Marco also noted that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly confirmed that it does. So there is hope!

For more information on KCSAN and its reason for being:

  1. "Concurrency bugs should fear the big bad data-race detector" (Part 1 and Part 2).
  2. Who's afraid of a big bad optimizing compiler?
  3. Calibrating your fear of big bad optimizing compilers.
  4. Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of perfbook.


History

October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.
October 12, 2021: Self-review.

October 06, 2021 07:11 PM

October 05, 2021

Paul E. Mc Kenney: Will Your Rust Code Survive the Attack of the Zombie Pointers?

Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.

But first, what on earth is a zombie pointer?

To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat tip to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:

 1 struct node_t* _Atomic top;
 2
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 6
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
13
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
17
18   while (p) {
19     struct node_t *next = p->next;
20      
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }

Those used to the Linux-kernel cmpxchg() atomic operation should keep in mind that when the C11 atomic_compare_exchange_weak() fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing atomic_compare_exchange_weak() the new value, you instead pass it a pointer to the new value. Each time atomic_compare_exchange_weak() fails, it updates the pointed-to new value. This sometimes allows this function to be used in a tight loop, as in line 11 above.

With these atomic_compare_exchange_weak() semantics in mind, let's step through execution of list_push(C) on a stack initially containing objects A and B:

  1. Line 5 allocates memory for the new object C.
  2. Line 7 initializes this newly allocated memory.
  3. Line 8 sets the new object's ->next pointer to the current top-of-stack pointer.
  4. Line 11 invokes atomic_compare_exchange_weak(), which atomically compares the value of newnode->next to the value of top, and if they are equal, stores the pointer newnode into top and returns true. (As noted before, if the comparison is not-equal, atomic_compare_exchange_weak() instead stores the value of top into newnode->next and returns false, causing the loop to repeat until such time as the pointers compare equal.)
  5. One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.

Oddly enough, this code would still work even if line 8 were omitted; however, that would result in a high probability that the first call to atomic_compare_exchange_weak() would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.

While we are at it, let's step through list_pop_all():

  1. Line 16 atomically stores NULL into top and returns its previous value. After this line executes, the stack is empty and its previous contents are referenced by local variable p.
  2. Lines 18-24 execute for each object on list p:
    1. Line 19 prefetches a pointer to the next object.
    2. Line 21 passes the current object to function foo().
    3. Line 22 frees the current object.
    4. Line 23 advances to the next object.

Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:

  1. Thread 1 starts pushing object C, but is interrupted, preempted, or otherwise impeded just after executing line 8.
  2. Thread 2 pops the entire list and frees all of the objects. Object C's ->next pointer is now invalid because it points to now-freed memory formerly known as object A.
  3. Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.
  4. Object C's ->next pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly language viewpoint happens to reference the type-compatible object A'. This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.
  5. Thread 1 resumes and executes the atomic_compare_exchange_weak() on line 11. Now this primitive, like its Linux-kernel counterpart cmpxchg(), operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that atomic_compare_exchange_weak() succeeds despite the fact that the object C's ->next pointer is invalid.
  6. Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.

As noted earlier, this has a trivial solution within the Linux kernel. Please do not peek too early. ;-)

But what can be done outside of the Linux kernel?

The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to malloc() and free(), it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability were important to you.
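
A minimal sketch of this hiding, assuming that the wrapper functions live in a separate translation unit compiled without LTO, and with the names being illustrative:

/* my_alloc.h */
#include <stddef.h>

void *my_alloc(size_t size);
void my_free(void *p);

/* my_alloc.c: a separate translation unit, opaque to callers. */
#include <stdlib.h>

void *my_alloc(size_t size)
{
  return malloc(size);
}

void my_free(void *p)
{
  free(p);
}

Because the compiler compiling the callers cannot see the free() inside my_free(), it cannot conclude that a given pointer has become invalid, and so cannot apply invalid-pointer optimizations to later uses of its bit pattern.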

However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer. Assuming that this even makes sense in the context of the implementation of Rust.

Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in unsafe Rust.

The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:

  1. CPPCON 2020 presentation
  2. Pointer lifetime-end zap (informational/historical)
  3. Pointer lifetime-end zap proposed solutions

So there is good progress, but it might still be some time before this is supported by either the standard or by compilers.

History

October 6, 2021: Expand description per feedback from Jonathan Corbet.
October 12, 2021: Self-review.

October 05, 2021 06:40 PM

October 04, 2021

Paul E. Mc Kenney: How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In this particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated only from the corresponding CPU, and only when preemption is disabled. However, updates are also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU is the one orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)
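
One way to make the second example's rule explicit is a runtime assertion. The following hedged sketch uses simplified stand-ins, including an assumed ->offlining_cpu field, rather than the actual Linux-kernel definitions:

struct rcu_data {
  int cpu;            /* CPU owning this structure. */
  int offlining_cpu;  /* CPU orchestrating the offlining, or -1. */
};

static void assert_cblist_ownership(struct rcu_data *rdp)
{
  bool owner = rdp->cpu == raw_smp_processor_id() && !preemptible();
  bool offliner = cpu_is_offline(rdp->cpu) &&
                  rdp->offlining_cpu == raw_smp_processor_id();

  WARN_ON_ONCE(!owner && !offliner);  /* Complain about any other updater. */
}

Any tooling, Rust-based or otherwise, that hopes to own this code must be able to express this sort of disjunction.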

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:


  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below, followed by a brief usage sketch:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read(): Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.
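
For example, the fifth ownership example's dual-lock rule for updaters might be documented and enforced as follows; the lock names and the update function are hypothetical:

static DEFINE_SPINLOCK(irq_side_lock);  /* For readers in interrupt handlers. */
static DEFINE_MUTEX(sleep_side_mutex);  /* For readers that must sleep. */

static void update_shared_data(void)
{
  lockdep_assert_held(&irq_side_lock);
  lockdep_assert_held(&sleep_side_mutex);
  /* ... now safe to update fields protected by both locks ... */
}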

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.

History

October 12, 2021: Self-review.

October 04, 2021 11:37 PM

Paul E. Mc Kenney: Can Rust Code Own RCU?

Read-copy update (RCU) vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantics permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, rcu_read_lock() and rcu_read_unlock() generate no code, and rcu_dereference() emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel CONFIG_PREEMPT_NONE=y builds. Additional information on RCU is presented at the end of this post.

Can Rust Ownership be Adapted to RCU?

The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that at least some RCU use cases are likely to pose challenges to Rust's ownership semantics. However, discussions with Wedson Almeida Filho at Linux Plumbers Conference suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU flexes lockdep's semantics. In addition, the data races between the read-side rcu_dereference() primitive and the update-side rcu_assign_pointer() primitive might reasonably be resolved by having the Rust implementation of these two primitives use unsafe mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the sparse static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.

Linux-Kernel Tooling and RCU

Here are a few examples of these Linux-kernel tools:

  1. A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer() (see the sketch following this list). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
  2. In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has therefore been adapted to complain about unprotected uses of rcu_dereference().
  3. Passing the same pointer twice in quick succession to a pair of call_rcu() invocations is just as bad as doing the same with a pair of kfree() invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.
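
As a sketch of the first of these tools in action, consider a hypothetical __rcu-marked global pointer. Sparse will complain if global_foo is dereferenced directly, and lockdep will complain if the rcu_dereference() executes outside of an RCU read-side critical section:

struct foo {
  int a;
};
static struct foo __rcu *global_foo;

static int foo_read_a(void)
{
  struct foo *fp;
  int ret = -1;

  rcu_read_lock();
  fp = rcu_dereference(global_foo);  /* sparse- and lockdep-checked. */
  if (fp)
    ret = fp->a;
  rcu_read_unlock();
  return ret;
}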

This is by no means a complete list. And although I am not claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust unsafe mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.

RCU Use Cases

There are in fact Rust implementations of variants of RCU, including:

  1. crossbeam-rs
  2. ArcSwapAny

I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.

But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.

The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by rcu_dereference() and rcu_assign_pointer() and friends. These special primitives might be implemented using unsafe Rust code or C code, as the case might be.

In-place updates of RCU-protected objects are sometimes handled using the list_replace_rcu() primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then list_replace_rcu() is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.
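
A hedged sketch of this copy-update pattern follows, assuming a list protected by a hypothetical foo_lock:

struct foo {
  struct list_head list;
  struct rcu_head rh;
  int data;
};
static DEFINE_SPINLOCK(foo_lock);

static void foo_update(struct foo *oldp, int new_data)
{
  struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);

  if (!newp)
    return;
  spin_lock(&foo_lock);
  *newp = *oldp;          /* Copy the old version... */
  newp->data = new_data;  /* ...then update the new version. */
  list_replace_rcu(&oldp->list, &newp->list);
  spin_unlock(&foo_lock);
  kfree_rcu(oldp, rh);    /* Old version is freed after a grace period. */
}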

However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the Kernel Concurrency Sanitizer (KCSAN) handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the call_rcu() function.

Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.

Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using SLAB_TYPESAFE_BY_RCU). Rust could presumably use appropriate atomic or volatile operations in this case.

One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrates how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.
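
A hedged sketch of this flag-and-lock pattern, with the structure layout and names again being illustrative:

struct foo {
  struct list_head list;
  struct rcu_head rh;
  spinlock_t lock;
  bool removed;  /* Set only while holding ->lock. */
};

static void foo_del(struct foo *fp)
{
  spin_lock(&fp->lock);
  list_del_rcu(&fp->list);
  fp->removed = true;  /* Readers must now pretend fp is absent. */
  spin_unlock(&fp->lock);
  kfree_rcu(fp, rh);
}

static bool foo_usable(struct foo *fp)
{
  lockdep_assert_held(&fp->lock);
  return !fp->removed;  /* If false, act as if fp had not been found. */
}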

RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.

It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).

Sequence locking is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named rename_lock to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (part 1, part 2, part 3). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.

Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers. There are a number of other approaches to read-side consistency both within and across data structures, including the Issaquah Challenge.

One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (rcu_sync_init() and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.
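
A hedged sketch of bare phased state change, that is, without the rcu-sync machinery, and with hypothetical do_old() and do_new() actions:

static void do_old(void);
static void do_new(void);
static int state;

static void reader(void)
{
  rcu_read_lock();
  if (READ_ONCE(state))
    do_new();
  else
    do_old();
  rcu_read_unlock();
}

static void updater(void)
{
  WRITE_ONCE(state, 1);
  synchronize_rcu();  /* Wait for readers that might see the old state. */
  /* No reader can now be executing do_old(). */
}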

This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.

Rust RCU Options

The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.

A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.

So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section. Another way is to acquire a lock within that object before exiting that critical section.
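
A hedged sketch of the reference-count variant, with the structure and global_foo pointer again being hypothetical:

struct foo {
  atomic_t refcnt;
  int a;
};
static struct foo __rcu *global_foo;

static struct foo *foo_get(void)
{
  struct foo *fp;

  rcu_read_lock();
  fp = rcu_dereference(global_foo);
  if (fp && !atomic_inc_not_zero(&fp->refcnt))
    fp = NULL;  /* Object on its way out: pretend not to have found it. */
  rcu_read_unlock();
  return fp;  /* A non-NULL pointer may now be used outside the reader. */
}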

So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN and build the kernel with CONFIG_RCU_STRICT_GRACE_PERIOD=y, and to run this on a small system. Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.

For More Information

Additional information on RCU may be found here:

  1. Linux-kernel documentation, with special emphasis on Linux-kernel RCU's requirements.
  2. The following sections of perfbook discuss other aspects of RCU:

    1. Section 9.5 ("Read-Copy Update").
    2. Section 10.3 ("Read-Mostly Data Structures").
    3. Section 13.5 ("RCU Rescues").
    4. Section 15.4.2 ("RCU [memory-ordering properties]").
  3. The RCU API, 2019 Edition.
  4. Userspace RCU.
  5. Folly-library RCU.
  6. Section 9.6.3.3 of perfbook lists several other RCU implementations.
  7. Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU) (C++ Working Paper).

RCU's semantics ordered by increasing formality:

  1. The "Fundamental Requirements" section of the aforementioned Linux-kernel RCU requirements. Extremely informal, but you have to start somewhere!
  2. RCU Semantics: A First Attempt. Also quite informal, but with a bit more math involved.
  3. Verifying Highly Concurrent Algorithms with Grace (extended version). Formal semantics expressed in separation logic.
  4. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. Executable formal semantics expressed in Cat language. This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.
  5. Linux-kernel memory model (LKMM), see especially the Sync-rcu and Sync-srcu relations in file linux-kernel.cat.

Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:

  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (paper).
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (paper).
  3. Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit 418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU"). Sadly, the scripting has not aged well.

For more information, see Verification Challenge 6.

For those interested in the userspace RCU library, there have been both manual and mechanical proofs of correctness. A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out.

Endnotes

[1]  At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by rcu_read_lock() and rcu_read_unlock(). For their part, synchronize_rcu() and call_rcu() do the waiting, synchronously and asynchronously, respectively.

It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor-man's garbage collector. Section 9.3.3 ("RCU Usage") of perfbook provides more detail on these and other RCU use cases.


History

October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add RCU-protected phased state change use case.
October 18, 2021: Add read/update common code and Tassarotti citation.

October 04, 2021 07:41 PM

October 02, 2021

Paul E. Mc Kenney: Can Rust Code Own Sequence Locks?

What Are Sequence Locks?

Sequence locks vaguely resemble reader-writer locks except that readers cannot block writers. If a writer runs concurrently with a reader, that reader is forced to retry its critical section. If a reader successfully completes its critical section (as in there were no concurrent writers), then its accesses will have been data-race free. Of course, an incessant stream of writers could starve all readers, and the Linux-kernel sequence-lock implementation provides extensions to deal with this in cases where incessant writing is a possibility. The usual approach is to instead arrange things so that writers are never incessant.

In the simple case where readers loop until they succeed, with no provisions for incessant writers, a reader looks like this:

do {
  seq = read_seqbegin(&test_seqlock);
  /* read-side access. */
} while (read_seqretry(&test_seqlock, seq));

A writer looks like this:

write_seqlock(&test_seqlock);
  /* Update */
write_sequnlock(&test_seqlock);

The read-side access can run concurrently with the writer's update, which is not consistent with the data-race-free nature of non-unsafe Rust. How can this be handled?

Linux-Kernel Sequence-Lock Use Cases

First, please note that if Rust-language sequence-locking readers are to interoperate with C-language sequence-locking updaters (or vice versa), then the Rust-language and C-language implementations must be compatible. One easy way to achieve such compatibility is to provide Rust code wrappers around the C-language implementation, or perhaps better yet, around higher-level C-language functions making use of sequence locking. Of course, if Rust-language use of sequence locks only ever accesses data that is never accessed by C-language code, then Rust could provide its own implementation of sequence locking. As always, choose carefully!

Next, let's look at a few classes of sequence-locking use cases within the Linux kernel.

One common use case is the textbook case where the read-side loop picks up a few values, all within a single function. For example, sequence locking is sometimes used to fetch a 64-bit value on a 32-bit system (see the sock_read_timestamp() function in include/net/sock.h; a sketch in this style follows the list below). The remainder of this section looks at some additional use cases that receive significant Linux-kernel use:

  1. Readers can gather data from multiple sources, including functions in other translation units that are invoked via function pointers. See for example the timer_cs_read() function in arch/sparc/kernel/time_32.c.
  2. Readers often perform significant computations. See for example the badblocks_check() function in block/badblocks.c.
  3. Readers can contain sequence-lock writers for that same sequence lock, as is done in the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. Although one could reasonably argue that this could be structured in other ways, this approach does avoid an added check, which could be valuable on fastpaths.
  4. The call to read_seqretry() is sometimes buried in a helper function, for example, in the sdma_progress() function in drivers/infiniband/hw/hfi1/sdma.h and the follow_dotdot_rcu() function in fs/namei.c.
  5. Readers sometimes use memcpy(), as might be expected by those suggesting that sequence-lock readers should clone/copy the data. See for example the neigh_hh_output() function in include/net/neighbour.h.
  6. Sequence-locking readers are often used in conjunction with RCU readers, as is described in the Can Rust Code Own RCU? posting.
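
Returning to the textbook use case called out before this list, here is a hedged sketch in the style of sock_read_timestamp(), with the seqlock and the 64-bit variable being hypothetical:

static DEFINE_SEQLOCK(ts_lock);
static u64 timestamp;

static u64 ts_read(void)
{
  unsigned int seq;
  u64 ts;

  do {
    seq = read_seqbegin(&ts_lock);
    ts = timestamp;  /* Two 32-bit loads on a 32-bit system. */
  } while (read_seqretry(&ts_lock, seq));
  return ts;  /* An untorn value, despite the lack of 64-bit atomicity. */
}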

Some of these use cases impose significant constraints on sequence-locking implementations. If the constraints rule out all reasonable Rust implementations, one approach would be to provide different Rust implementations for different use cases. However, it would be far better to avoid further API explosion!

Rust Sequence-Lock Options

One approach would be for all sequence-lock critical sections to be marked unsafe, but this removes at least some of the rationale for switching from C to Rust. In addition, some believe that this approach can pose severe Rust-syntax challenges in some use cases.

Another approach is to require all sequence-lock critical sections to use marked accesses. This avoids the data races and presumably also the need for unsafe, but it prevents the kernel concurrency sanitizer (KCSAN) from locating data races involving sequence-lock write-side critical sections. Or rather it would, were it not for the use of kcsan_flat_atomic_begin() and kcsan_flat_atomic_end() in sequence-locking read-side primitives and the use of kcsan_nestable_atomic_begin() and kcsan_nestable_atomic_end() in sequence-locking write-side primitives. So perhaps Rust could make use of similar hooks.

Yet another approach is to place the data referenced by readers into a separate structure and to switch back and forth between a pair of instances of such structures. This has been attempted in the Linux kernel with unsatisfactory results. In fact, in those cases where it is feasible to use multiple instances of a separate structure, Linux-kernel RCU is usually a better choice.

One more approach would be creating something like a "tentatively readable" Rust ownership class that could directly handle sequence-locking read-side critical sections. For example, if a quantity computed from a read within a failed critical section were used subsequently, Rust could emit a warning. Hopefully without much in the way of false positives. (We can dream, can't we?)

One might instead search the Linux-kernel source tree for uses of sequence locking and note that it is used only by a very few device drivers:

     16 drivers/dma-buf/
      1 drivers/firmware/efi/
      5 drivers/gpu/drm/
      2 drivers/gpu/drm/amd/amdgpu/
      2 drivers/gpu/drm/i915/gem/
     15 drivers/gpu/drm/i915/gt/
      5 drivers/hwmon/
     72 drivers/infiniband/hw/hfi1/
     11 drivers/md/
     15 drivers/net/ethernet/mellanox/mlx4/
     19 drivers/net/ethernet/mellanox/mlx5/core/lib/

These device drivers (or at least those portions of them that make use of sequence locking) could then be relegated to C code.

However, the use of sequence locking has been increasing, and it is likely that it will continue to increase, including within device drivers. Longer term, Rust's ownership model should therefore be extended to cover sequence locking in a natural and safe manner.

For More Information

For more on sequence locking, see the seqlock.rst documentation. Sequence locking is also covered in Sections 9.4 ("Sequence Locking") and 13.4 ("Sequence-Locking Specials") of perfbook.

History

October 6, 2021: Updated to add sectioning, Linux-kernel use cases, and updated KCSAN commentary.
October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add a note discussing possible interoperability between Rust-language and C-language sequence locking.

October 02, 2021 12:16 AM

October 01, 2021

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (OOTA)

Out-of-thin-air (OOTA) values are a troubling theoretical problem, but thus far do not actually occur in practice. In the words of the Bard: "It is a tale told by an idiot, full of sound and fury, signifying nothing."

But what is life without a little sound and fury? Besides, "mere theory" can loom large for those creating the tools required to analyze concurrent software. This post therefore gives a brief introduction to OOTA, and then provides some simple practical solutions, along with some difficult ones.

Here is the canonical OOTA example, involving two threads each having but one statement:

T1: x.store(y.load(mo_relaxed), mo_relaxed);
T2: y.store(x.load(mo_relaxed), mo_relaxed);

Believe it or not, according to the C/C++ memory model, even if both x and y are initially 0, after both threads complete it might be that x==y==42!

LKMM avoids OOTA by respecting dependencies, at least when headed by marked accesses. Plus the dependent access must either be marked or be free of data races. For example:

T1: WRITE_ONCE(x, READ_ONCE(y));
T2: WRITE_ONCE(y, READ_ONCE(x));

If x and y are both initially zero, then they are both guaranteed to stay that way!

Great quantities of ink have been spilled on the OOTA problem, but these three C++ working papers give a reasonable overview and cite a few of the more prominent papers:

  1. P0422R0: Out-of-Thin-Air Execution is Vacuous. This presents a fixed-point analysis that to the best of my knowledge is the first scheme capable of differentiating OOTA scenarios from straightforward reordering scenarios. Unfortunately, this analysis cannot reasonably be applied to automated tooling. This paper also includes a number of references to other work, which unfortunately either impose unnecessary overhead on weakly ordered systems on the one hand or are well outside of what compiler developers are willing to countenance on the other.
  2. P1916R0: There might not be an elegant OOTA fix. This diabolically clever paper shows how back-propagation of undefined behavior can further complicate the OOTA question.
  3. P2055R0: A Relaxed Guide to memory_order_relaxed. This paper presents memory_order_relaxed use cases that are believed to avoid the OOTA problem.

As with address, data, and control dependencies, the trivial solution is for Rust to simply promote all READ_ONCE() operations to smp_load_acquire(), but presumably while leaving the C-language READ_ONCE() untouched. This approach increases overhead, but keeps the people building analysis tools happy, or at least happier. Unfortunately, this trivial solution incurs overhead on weakly ordered systems, overhead that is justified only in theory, never in practice.
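
In the Linux kernel, such a promotion could be confined to a Rust-facing wrapper, for example as in the following hypothetical sketch (rust_read_once_int() is a made-up name):

    /* Rust-visible loads get acquire ordering, which forbids the
     * problematic reordering, while C-language uses of READ_ONCE()
     * remain untouched. */
    static inline int rust_read_once_int(const int *p)
    {
        return smp_load_acquire(p);
    }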

But if analysis tools are not an issue, then Rust could simply use volatile loads (perhaps even directly using an appropriately Rust-wrapped instance of the Linux kernel's READ_ONCE() primitive), just as the Linux kernel currently does (but keeping architecture-specific requirements firmly in mind). The problem is after all strictly theoretical.

History

October 12, 2021: Self-review. Note that the comments are specific to earlier versions of this blog post.

October 01, 2021 11:27 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Address/Data)

An address dependency involves a load whose return value directly or indirectly determines the address of a later load or store, which results in the earlier load being ordered before the later load or store. A data dependency involves a load whose return value directly or indirectly determines the value stored by a later store, which results in the load being ordered before the store. These are used heavily by RCU. Although they are not quite as fragile as control dependencies, compilers still do not know about them. Therefore, care is still required, as can be seen in the rcu_dereference.rst Linux-kernel coding guidelines. As with control dependencies, address and data dependencies enjoy very low overheads, but unlike control dependencies, they are used heavily in the Linux kernel via rcu_dereference() and friends.
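
As a concrete example, here is the usual RCU read-side idiom in sketch form (struct foo and gp are made-up names). The rcu_dereference() load heads an address dependency, ordering it before the dereference that follows:

    struct foo {
        int a;
    };
    static struct foo __rcu *gp;

    int reader(void)
    {
        struct foo *p;
        int val = 0;

        rcu_read_lock();
        p = rcu_dereference(gp);  /* Load heading the dependency. */
        if (p)
            val = p->a;           /* Address-dependent load: ordered. */
        rcu_read_unlock();
        return val;
    }

A later store whose stored value is computed from the loaded pointer (for example, WRITE_ONCE(z, p->a)) would instead be ordered by a data dependency.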


  1. As with control dependencies, the trivial solution is to promote READ_ONCE() to smp_load_acquire(). Unlike with control dependencies, there are only a few such READ_ONCE() instances, and almost all of them are conveniently located in definitions of the rcu_dereference() family of functions. Because the rest of the kernel might not be happy with the increase in overhead due to such a promotion, it would likely be necessary to provide Rust-specific implementations of rcu_dereference() and friends.
  2. Again as with control dependencies, an even more trivial solution is to classify code containing address and data dependencies as core Linux-kernel code that is outside of Rust's scope. However, given that there are some thousands of instances of rcu_dereference() scattered across the Linux kernel, this solution might be a bit more constraining than many Rust advocates might hope for.
  3. Provide Rust wrappers for the rcu_dereference() family of primitives. This does incur additional function-call overhead, but on the other hand, if initial use of Rust is confined to performance-insensitive device drivers, this added overhead is unlikely to be a problem.
  4. But if wrapper overhead nevertheless proves problematic, provide higher-level C-language functions that encapsulate the required address and data dependencies, and that do enough work that the overhead of the wrappering for Rust-language use is insignificant.
  5. And yet again as with control dependencies, the best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the address/data-dependency code-style restrictions documented in the aforementioned rcu_dereference.rst. There is some reason to hope that this enforcement would be significantly easier than for control dependencies.
  6. Wait for compiler backends to learn about address and data dependencies. This might take some time, but there is ongoing work along these lines that is described below.

One might hope that C/C++'s memory_order_consume would correctly handle address and data dependencies, and in fact it does. Unfortunately, in all known compilers, it does so by promoting memory_order_consume to memory_order_acquire, which adds overhead just as surely as does smp_load_acquire(). There has been considerable work done over a period of some years towards remedying this situation, including these working papers:

  1. P0371R1: Temporarily discourage memory_order_consume (for some definition of "temporarily").
  2. P0098R1: Towards Implementation and Use of memory_order_consume, which reviews a number of potential remedies.
  3. P0190R4: Proposal for New memory_order_consume Definition, which selects a solution involving marking pointers carrying dependencies. Note that many Linux-kernel developers would likely demand a compiler command-line argument that caused the compiler to act as if all pointers had been so marked. Akshat Garg prototyped marked dependency-carrying pointers in gcc as part of a Google Summer of Code project.
  4. P0750R1: Consume, which proposes carrying dependencies in an extra bit associated with the pointer. This approach is much more attractive to some compiler developers, but many committee members did not love the resulting doubling of pointer sizes.
  5. P0735R1: Interaction of memory_order_consume with release sequences, which addresses an obscure C/C++ memory-model corner case.

There appears to be a recent uptick in interest in a solution to this problem, so there is some hope for progress. However, much more work is needed.
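
For reference, here is what a consume load looks like in C11 (a minimal sketch, with struct foo and gp again being made-up names). As noted above, all known compilers currently implement this by promoting the load to memory_order_acquire:

    #include <stdatomic.h>

    struct foo { int a; };
    static struct foo *_Atomic gp;

    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        /* The dependent load p->a is ordered after the load of gp. */
        return p ? p->a : 0;
    }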

More information on address and data dependencies may be found in Section 15.3.2 ("Address- and Data-Dependency Difficulties") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 09:04 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Control)

At the assembly language level on many weakly ordered architectures, a conditional branch acts as a very weak, very cheap, but very useful memory-barrier instruction. It orders any load whose return value feeds into the condition codes before all stores that execute after the branch instruction completes, whether the branch is taken or not. ARMv8 also has a conditional-move instruction (CSEL) that provides similar ordering.

Because the ordering properties of the conditional branch involve dependencies from the load to the branch and from the branch to the store, and because the branch is a control-flow instruction, this ordering is said to be due to a control dependency.

Because compilers do not understand them, control dependencies are quite fragile, as documented by the many cautionary tales in the Linux kernel's memory-barriers.txt documentation (search for the "CONTROL DEPENDENCIES" heading). But they are very low cost, so they are used on a few critically important fastpaths in the Linux kernel.
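
The canonical pattern is shown in the following minimal sketch, where flag and data are made-up variables that would be shared with other threads:

    /* The load from flag is ordered before either store to data because
     * neither store can execute until the branch resolves.  Note that
     * later loads are *not* ordered, and that compiler optimizations can
     * destroy the ordering, for example, if both branches were to store
     * the same value. */
    if (READ_ONCE(flag))
        WRITE_ONCE(data, 1);
    else
        WRITE_ONCE(data, 2);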

Rust could deal with control dependencies in a number of ways:


  1. The trivial solution is to promote the loads heading the control dependencies to smp_load_acquire(). This works, but adds instruction overhead on some architectures and needlessly limits compiler optimizations on all architectures (but to be fair, ARMv8 does exactly this when built with link-time optimizations). Another difficulty is identifying (whether manually or automatically) exactly which READ_ONCE() calls need to be promoted.
  2. An even more trivial solution is to classify code containing control dependencies as core Linux-kernel code that is outside of Rust's scope. Because there are very few uses of control dependencies in the Linux kernel, Rust would not lose much by taking this approach. In addition, there is the possibility of creating higher-level C-language primitives containing the needed control dependencies which are then wrappered for Rust-language use.
  3. The best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the code style restrictions documented in memory-barriers.txt. However, there is some chance that this approach might prove to be non-trivial.
  4. Wait for compiler backends to learn about control dependencies. This might be a bit of a wait, especially given the difficulty even defining control dependencies within the current nomenclature of the C/C++ standards.

More information on control dependencies may be found in Section 15.3.3 ("Control-Dependency Calamities") of perfbook.

History

October 12, 2021: Self-review.

October 01, 2021 08:24 PM

Paul E. Mc Kenney: Rusting the Linux Kernel: Atomics and Barriers and Locks, Oh My!

One way to reduce the number of occurrences of unsafe in Rust code in Linux is to push the unsafety down into atomic operations, memory barriers, and locking primitives, which are the topic of this post. But first, here are some materials describing LKMM:


  1. P0124R7: Linux-Kernel Memory Model: A C++ standards-committee working paper comparing the C/C++ memory model to LKMM.
  2. Linux Weekly News series on LKMM (Part 1 and Part 2).
  3. The infamous ASPLOS'18 paper entitled Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled), with a title-based tip of the hat to the irrepressible Mel Gorman.
  4. Chapter 15 of perfbook ("Advanced Synchronization: Memory Ordering").
  5. The Linux kernel's tools/memory-model directory, featuring an executable version of LKMM.

For all of these references, I give a big "Thank You!!!" to my co-authors.

LKMM is not the most complex memory model out there, but neither is it the simplest. In addition, it is in some ways more strict than the C/C++ memory models, which means that strict adherence to coding guidelines is required in order to prevent compiler optimizations from breaking Linux-kernel code. Many of these optimizations are not localized, but are instead scattered hither and yon throughout the compilers, including throughout the compiler backends. The optimizations in the backends are a special challenge to Rust, which seems to take the approach of layering safety on top of (or perhaps within) the compiler frontend. Later posts in this series will look at several pragmatic options available to Rust Linux-kernel code.

There is one piece of good news: Compilers are forbidden from introducing data races, at least into code that is free of undefined behavior.

With all of that out of the way, let's look at Rust's options for dealing with Linux-kernel atomics and barriers and locks.

The first approach is to carefully read the P0124R7: Linux-Kernel Memory Model working paper and even more carefully follow its advice in selecting C/C++ primitives that best match Linux-kernel atomics, barriers, and locks. This approach works well for data whose definition and use are confined to Rust code, and with sufficient care and ongoing attention can also work for atomic operations and memory barriers involving data shared with C code. However, expecting Rust locking primitives to interoperate with Linux-kernel locking primitives might not be a strategy to win. It seems wise to make direct use of the existing Linux-kernel locking primitives, keeping in mind that this means properly wrappering them in order to make Rust ownership work as intended. Those who doubt the wisdom of wrappering the C-language Linux-kernel locking primitives should consider the following:

  1. Linux-kernel locks are complex and highly optimized. Keeping two implementations is an excellent way to inject profound bugs into the Linux kernel.
  2. Linux-kernel locks are deeply entwined with the lockdep lock dependency checker. The data structures implementing each lock class would need to be shared between C and Rust code, which is another excellent way to inject bugs.
  3. On some architectures, Linux-kernel locks must interact with memory-mapped I/O (MMIO) accesses. Any Rust-language implementation of Linux-kernel locks must therefore be architecture-dependent and must know quite a bit about Linux-kernel MMIO.
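
Wrappering can be as simple as exporting trivial C-language shims for Rust bindings to call, keeping the one and only lock implementation (along with its lockdep hooks) in C. A hypothetical sketch, with made-up rust_helper_ names:

    #include <linux/spinlock.h>

    /* Shims are needed in part because spin_lock() and friends may be
     * macros or inline functions, which foreign-language bindings cannot
     * invoke directly. */
    void rust_helper_spin_lock(spinlock_t *lock)
    {
        spin_lock(lock);
    }

    void rust_helper_spin_unlock(spinlock_t *lock)
    {
        spin_unlock(lock);
    }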

As described in later sections, it might be useful to promote READ_ONCE() to smp_load_acquire() instead of implementing it as a volatile load. It might also be useful to promote WRITE_ONCE() to smp_store_release() instead of implementing it as a volatile store, depending on what sort of data-race analysis Rust provides for unsafe code. There is some C/C++ work in flight towards providing better definitions for volatile operations, but it is still early days for this work.

If READ_ONCE() and WRITE_ONCE() are instead to be implemented as volatile operations in Rust, please take care to check the individual architectures that are affected. DEC Alpha requires a full memory-barrier instruction at the end of READ_ONCE(), Itanium requires promotion of volatile loads to acquire loads (but this is carried out by the compiler), and ARMv8 requires READ_ONCE() to be promoted to acquire (but only in CONFIG_LTO=y builds).
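
To illustrate the issue (and this is emphatically not the kernel's actual definition), a volatile-load-based READ_ONCE() must leave room for an architecture hook, with arch_read_once_barrier() below serving as a hypothetical placeholder:

    /* Sketch only.  The hypothetical arch_read_once_barrier() would be a
     * full memory-barrier instruction on DEC Alpha and a no-op on most
     * other architectures. */
    #define MY_READ_ONCE(x)                                              \
    ({                                                                   \
        typeof(x) ___v = *(const volatile typeof(x) *)&(x);              \
        arch_read_once_barrier();                                        \
        ___v;                                                            \
    })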

Device drivers make heavy use of volatile accesses and memory barriers for MMIO accesses, and Linux-kernel device drivers are no exception. As noted earlier, some architectures require that these accesses interact with locking primitives. Furthermore, there are many device-specific special cases surrounding device control in general and MMIO in particular. Therefore, Rust-language device drivers should access the existing Linux-kernel C-language primitives rather than creating their own, especially to start with. There might well be exceptions to this rule, for example, Rust might be applied to a device driver that is only used by architectures that do not require interaction with locking primitives. But if you write a driver containing Rust-language MMIO primitives, please carefully and prominently document the resulting architecture restrictions.

This suggests another approach, namely not bothering to implement any of these primitives in Rust, but rather making direct use of the Linux-kernel implementations, as suggested earlier for locking and MMIO primitives. And again, this requires wrappering them for use by Rust code. However, such wrappering introduces another level of function call, potentially for tiny functions. Although it is expected that LTO will successfully inline tiny functions, not all of the world is yet ready for LTO. In the meantime, where feasible, developers should avoid invoking tiny C functions from Rust-language fastpaths.

This being the real world, we should expect that the Rust/C determination will need to be made on a case-by-case basis, with many devils in the details.

History

October 12, 2021: Self-review changes.
October 13, 2021: Add explicit justification for wrappering the Linux kernel's C-language locks and add a few observations about MMIO accesses.

October 01, 2021 07:02 PM

    Paul E. Mc Kenney: Rust Concurrency Philosophy: A Historical Perspective

    At first glance, Rust's concurrency philosophy resembles that of Sequent's DYNIX and DYNIX/ptx in the 1980s and early 1990s: "Lock data, not code" (see Jack Inman's classic USENIX'85 paper "Implementing Loosely Coupled Functions on Tightly Coupled Engines", sadly invisible to search engines). Of course, Sequent lacked Rust's automatic checking, and Sequent's software engineers made much less disciplined use of ownership than Rust fans recommend. Nevertheless, this resemblance has resulted in some comparisons of Rust with the DEC Alpha, which had a similar concurrency philosophy.

    Interestingly enough, DYNIX and early versions of DYNIX/ptx used compile-time-allocated arrays for almost all of their data structures. You want your kernel to support up to N tasks? Very well, build your kernel to have an array of N task structures. This worked surprisingly well, perhaps because the important concurrent applications of that time had very predictable resource requirements, including numbers of tasks. Nevertheless, as you might expect, this did become quite the configuration nightmare. So why were arrays used in the first place?

    To the best of my knowledge, the earliest published complete articulation of the reason appeared in Gamsa et al.'s landmark paper  "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System". The key point is that you cannot protect a dynamically allocated object with a lock located within that object. The DYNIX arrays avoided deallocation (or, alternatively, provided a straightforward implementation of type-safe memory), thus allowing these objects to be protected with internal locks. Avoiding the need for global locks or reference counters was an important key to the performance and scalability prized by Sequent's customers.
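
    In miniature, the problem looks something like the following sketch, in which myobj and myobj_lookup() are made-up names:

        struct myobj {
            spinlock_t lock;  /* Protects this object's fields. */
            int data;
        };

        int unsafe_read(int key)
        {
            struct myobj *p = myobj_lookup(key);  /* Hypothetical lookup. */
            int val;

            if (!p)
                return -1;
            /* Unsafe: p may be freed between the lookup and this point,
             * making the lock acquisition a use-after-free.  Static
             * arrays, type-safe memory, and RCU are all ways of closing
             * this hole. */
            spin_lock(&p->lock);
            val = p->data;
            spin_unlock(&p->lock);
            return val;
        }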

    This strategy worked less well when Sequent added a distributed lock manager because the required number of locks was not predictable, nor was there a useful upper bound. This problem was solved in part by the addition of RCU, which provided a high-performance and scalable means of resolving races between acquiring a given object's lock and deletion (and subsequent freeing) of that same object. Given that DEC Alpha famously had difficulty with RCU, it is only reasonable to ask how Rust will do with it. Or must the concurrency designs of those portions of the Linux kernel that are to be written in Rust be "fitted" to a Rust-language ProcRustean bed [1]? Those who prize Rust's fearless-concurrency goal above all else might reasonably argue that this ProcRustean bed is in fact a most excellent thing. However, some Linux-kernel maintainers (including this one) might in their turn reasonably argue that within the context of some portions of the Linux kernel, a proper level of fear is a very healthy thing. As is the ability to do one's job in a reasonably straightforward manner!

    One way to avoid this ProcRustean bed is to use the Rust unsafe facility, and in fact "unsafe" has been the answer to a disturbingly large number of my questions about Rust [2]. However, use of this facility introduces the possibility of data races, which in turn raises the question of Rust's memory model. Within the Linux kernel, the answer to this question is of course LKMM, or perhaps some reasonable subset of LKMM.

    However, in my personal experience, I have most frequently seen Rust being used to rewrite scripts that became performance problems upon being more widely deployed than expected. In some cases, these rewrites greatly improved user experience as well as performance. This means that Rust is heavily used outside of the Linux kernel, which in turn means that LKMM might not be the right answer for Rust in general, though some in the Rust community have come out strongly in favor of extending the ProcRustean bed. But this blog series is focused on Rust for the Linux kernel, so the question of the memory model for Rust in general is out of scope. Again, for Rust in the Linux kernel, some subset of LKMM is clearly the correct memory model.

    Many of the following posts in this series cover ways that Rust might work with specific aspects of LKMM, including some wild speculation about how Rust's ownership model might be generalized in a manner similar to the way that the Linux kernel's lockdep checking has been generalized for cross-released locks and for RCU. Readers wishing to learn more about non-ProcRustean concurrency designs are invited to peruse "Is Parallel Programming Hard, And, If So, What Can You Do About It?", hereinafter called "perfbook". Specific chapters and sections of the Second Edition of this book will be cited as appropriate by later posts in this series.

    Endnotes

    [1]  Making this 1990s-style concurrency scale usually involves hashed arrays of locks. These are often deadlock-prone, but there are heavily used techniques that (mostly) avoid the deadlocks. See Section 7.1.1.6 ("Acquire Needed Locks First") in perfbook for one such technique. However, hashed arrays of locks are prone to scalability problems, especially on multi-socket systems, due to poor locality of reference. See for example Section 10.2.3 ("Hash-Table Performance") for performance results on hash tables, also in perfbook. As a result, most attempts to apply hashed arrays of locks to the Linux kernel resulted in RCU being used instead. The performance and scalability benefits of RCU (and hazard pointers) are shown in Section 10.3 ("Read-Mostly Data Structures"), again in perfbook.
     
    [2]  But please note that Rust's unsafe code has only limited undefined-behavior unsafe superpowers:

    1. Dereferencing a raw pointer (and it is the programmer's responsibility to avoid destructive wild-pointer dereferences).
    2. Calling an unsafe function or method.
    3. Accessing or modifying a mutable static variable (and it is the programmer's responsibility to avoid destructive data races).
    4. Implementing an unsafe trait.
    5. Accessing fields of a union (and it is the programmer's responsibility to avoid accesses that invoke undefined behavior, or, alternatively, understand how the compiler at hand reacts to any undefined behavior that might be invoked).
    I do find the Rust community's sharp focus on undefined-behavior-induced bugs over other types of bugs to be rather surprising. Perhaps this is because I have not had so much trouble with undefined-behavior-induced bugs. Conspiracy theorists might imagine an unholy alliance between UB-happy developers of compiler optimizations on the one hand and old-school parallel programmers wishing to turn the clock back to the 1990s on the other hand. I am happy to let such theorists imagine such things. ;-)

    Besides, more recent discussions have focused on memory safety rather than the full gamut of undefined behavior.


    History

    October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
    October 13, 2021: Add note on memory safety specifically rather than undefined behavior in general.

    October 01, 2021 06:08 PM

    Paul E. Mc Kenney: So You Want to Rust the Linux Kernel?

    There has been much discussion of using the Rust language in the Linux kernel (for example, here, here, and here) and at the Kangrejos Rust for Linux Workshop (here, here, and here), and the 2021 Linux Plumbers Conference had a number of sessions on this topic, as did the Maintainers Summit. At least two of these sessions mentioned the question of how Rust is to handle the Linux-kernel memory model (LKMM), and I volunteered to write this blog series on this topic.

    This series focuses mostly on use cases and opportunities, rather than on any non-trivial solutions. Please note that I am not in any way attempting to dictate or limit Rust's level of ambition. I am instead noting the memory-model consequences of a few potential levels of ambition, ranging from "portions of a few drivers" through "a few drivers" and "some core code" up to and including "the entire kernel". Greater levels of ambition will require greater willingness to accommodate a wider variety of LKMM requirements.

    One could argue that portions or even all of the Linux kernel should instead be hammered into the Rust ownership model. On the other hand, might the rumored sudden merge of the ksmbd driver (https://lwn.net/Articles/871098/) have been due to the implicit threat of its being rewritten in Rust? [1] Nevertheless, in cases where Rust is shown to offer particularly desirable advantages, it is quite possible that Rust and some parts of the Linux kernel might meet somewhere in the middle.

    These blog posts will therefore present approaches ranging upwards from trivial workarounds. But be warned that some of the high-quality approaches require profound reworking of compiler backends that have thus far failed to spark joy in the hearts of compiler writers. In addition, Rust enjoys considerable use outside of the Linux kernel, for but one example that I have personally observed, as something into which to rewrite inefficient Python scripts. (A megawatt here, a megawatt there, and pretty soon you are talking about real power consumption!) Therefore, there might well be sharp limits beyond which the core Rust developers are unwilling to go.

    The remaining posts in this series are as follows:


    1. Rust Concurrency Philosophy: A Historical Perspective
    2. Atomics and Barriers and Locks, Oh My!
    3. Compiler Writers Hate Dependencies (Control)
    4. Compiler Writers Hate Dependencies (Address/Data)
    5. Compiler Writers Hate Dependencies (OOTA)
    6. Can Rust Code Own Sequence Locks?
    7. Can Rust Code Own RCU?
    8. How Much of the Kernel Can Rust Own?
    9. Will Your Rust Code Survive the Attack of the Zombie Pointers?
    10. Can the Kernel Concurrency Sanitizer Own Rust Code?
    11. Summary and Conclusions
    12. TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

    Endnotes


    [1]  Some in the Linux-kernel community might be happy with either outcome: (1) The threat of conversion to Rust caused people to push more code into mainline and (2) Out-of-tree code was converted to Rust by Rust advocates and then pushed into mainline. The latter case might need special care for longer-term maintenance of the resulting Rust code, but perhaps the original authors might be persuaded to declare victory, learn Rust, and maintain the code. Who knows? ;-)


    History

    October 8, 2021: Fix s/LInux/Linux/ typo noted by Miguel Ojeda
    October 12, 2021: Self-review, including making it clear that Rust might have use cases other than rewriting inefficient scripts.
    October 13, 2021: Add a link to the recommendations post.

    The October 12 update affected the whole series, for example, removing the "under construction" markings. Summary of significant updates to other posts:

    October 01, 2021 12:39 AM

    September 30, 2021

    James Bottomley: Linux Plumbers Conference Matrix and BBB integration

    The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

    One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

    Integrating Matrix Chat into BBB

    Most of this integration was done by Guy Lunardi.

    It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did was to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass-through login credentials at the last minute (life’s no fun unless you live dangerously).

    Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email address and registration confirmation code as username/password, checked via LDAP. The lpcfe front end Jon created is here: git://git.lwn.net/lpcfe.git; it manages the whole of the conference login process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

    The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine that if anyone wishes to take this work forward, adding native Matrix device/session support to riot-embedded would be better.

    The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

    Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

    Setting up Matrix for the Conference

    We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

    This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (We did this in riot-embedded but ran out of time to do it in Element web as well.)

    Because we used LDAP, the actual matrix account for each user was created the first time they log into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

    Testing our Matrix and BBB integration

    We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

    In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

    What happened on the Day

    Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this was clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% on one of the CPUs, still taking 30 seconds per first login.

    Fixing the First Day problems

    The first thing we realized is we had to multi-thread the synapse server. This is well known, but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. Our first scaling attempt, simply adding 16 generic worker apps to scale across all our physical CPUs, failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

    Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

    Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

    Our configuration files are

    Once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target.

    Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

    emaillist=lpc2021-all-attendees.txt
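    # Log each attendee in once via the login API so that account creation
    # and room auto-joins happen before the sessions start.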
    IFS=$'\t'	# the attendee list is tab-separated
    
    while read first last confirmation email; do
        bbblogin=${email/+*@/@}
        matrixlogin=${bbblogin/@/.}
        curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
        sleep 1
    done < ${emaillist}

    The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

    Conclusion

    With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

    In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

    September 30, 2021 04:24 PM

    September 25, 2021

    Brendan Gregg: The Speed of Time

    How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective.

    A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!?

    ## 1. CLI tools

    The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my [60s checklist]). Was there some other program consuming CPU, like a misbehaving Ubuntu service that wasn't in CentOS? top(1) showed that only the Cassandra database was consuming CPU. What about short-lived processes, like a service restarting in a loop? These can be invisible to top(1). My execsnoop(8) tool (back then my [Ftrace version]) showed nothing. It seemed that the extra CPU time really was in Cassandra, but how?

    ## 2. CPU profile

    Understanding CPU time should be easy by comparing [CPU flame graphs]. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side. The CentOS flame graph:

    The Ubuntu flame graph:
    Darn, they didn't work. There's no Java stack (there should be a tower of green Java methods); instead there's only a single green frame or two. This is how Java flame graphs looked at the time. Later that year I prototyped the c2 frame pointer fix that became -XX:+PreserveFramePointer, which fixes Java stacks in these profiles.

    Even with the broken Java stacks, I noticed a big difference: On Ubuntu, there's a massive amount of CPU time in a libjvm call: os::javaTimeMillis(), 30.14% in the middle of the flame graph. Searching shows it was elsewhere as well for a total of 32.1%. This server is spending about a third of its CPU cycles just checking the time! This was a weird problem to think about: Time itself had now become a resource and target of performance analysis.

    The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls, which otherwise might have been scattered on top of different Java code paths, appearing as thin stacks everywhere. If that were the case, I'd have zoomed in to see what they were, then zoomed out and searched for them for the cumulative percent; or flipped the merge order so it's an icicle graph, merging leaf to root. But in this case I didn't have to do anything, as it was mostly merged by accident. (This is one of the motivating reasons for switching to a d3 version of flame graphs, as I want the interactivity of d3 to do things like collapse all the Java frames, all the user-mode frames, etc., to expose different groupings like this.)

    ## 3. Studying the flame graph

    os::javaTimeMillis() fetches the current time. Browsing the flame graph shows it is calling the gettimeofday(2) syscall, which enters the tracesys() and syscall_trace_enter/exit() kernel functions. This gave me two theories: A) Some syscall tracing is enabled in Ubuntu (auditing? apparmor?). B) Fetching time was somehow slower on Ubuntu, which could be a library change or a kernel/clocksource change. Theory (A) is most likely based on the frame widths in the flame graph, but I'm not completely sure. As a Xen guest, this profile was gathered using perf(1) and the kernel's software cpu-clock soft interrupts, not the hardware NMI. Without NMI, some kernel code paths (interrupts disabled) can't be profiled. Also, since it's a Xen guest, hypervisor time can never be profiled. These two factors mean that there can be missing kernel and hypervisor time in the flame graph, so the true breakdown of time in os::javaTimeMillis() may be a little different.

    Note that Ubuntu also has a frame to show entry into vDSO (virtual dynamic shared object). This is a user-mode syscall accelerator, and gettimeofday(2) is a classic use case that is cited in the vdso(7) man page. At the time, the Xen pvclock source didn't support vDSO, so you can see the syscall code above the vdso frame. It's the same on CentOS, although it doesn't include a vdso frame in the flame graph (I'd guess due to a perf(1) difference alone).

    ## 4. Colleagues/Internet

    I love using [Linux performance tools]. But I also love solving issues quickly, and sometimes that means just asking colleagues or searching the Internet. I'm including this step as a reminder for anyone following this kind of analysis. Others I asked hadn't hit this issue, and the Internet at the time had nothing using the search terms os::javaTimeMillis, clocksource, tracesys(), Ubuntu, EC2, Xen, etc. (That changed by the end of the year.)

    ## 5. Experimentation

    To further analyze this with observability tools, I could:
    A. Fix the Java stacks to see if there's a difference in how time is used on Ubuntu. Maybe Java is calling it more often for some reason.
    B. Trace the gettimeofday() and related syscall paths, to see if there's some difference there, e.g., errors.
    But as I summarized in my [What is Observability] post, the term observability can be a reminder not to get stuck on that one type of analysis. Here are some experimental approaches I could also explore:
    C. Disable tracesys/syscall_trace.
    D. Microbenchmark os::javaTimeMillis() on both systems.
    E. Try changing the kernel clocksource.
    As (C) looked like a kernel rebuild, I started with (D) and (E).

    ## 6. Measuring the speed of time

    Is there already a microbenchmark for os::javaTimeMillis()? This would help confirm that these calls really were slower on Ubuntu. I couldn't find such a microbenchmark, so I wrote something simple. I'm not going to try to make before and after time calls to time the duration in time (which I guess would work if you factored in the extra time in the timing calls). Instead, I'm just going to call time millions of times in a loop and time how long it takes (sorry, that's too many different usages of the word "time" in one paragraph):
    $ cat TimeBench.java
    public class TimeBench {
        public static void main(String[] args) {
    	for (int i = 0; i < 100 * 1000 * 1000; i++) {
    		long t0 = System.currentTimeMillis();
    		if (t0 == 87362) {
    			System.out.println("Bingo");
    		}
    	}
        }
    }
    
    This does 100 million calls of currentTimeMillis(). I then executed it via the shell time(1) command to give an overall runtime for those 100 million calls. (There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize out an otherwise empty loop. This will slow this test a little.) Trying it out:
    centos$ time java TimeBench
    real	0m12.989s
    user	0m3.368s
    sys	0m18.561s
    
    ubuntu# time java TimeBench
    real	1m8.300s
    user	0m38.337s
    sys	0m29.875s
    
    How long is each time call? Assuming the loop is dominated by the time call, it works out to be about 0.13 us on CentOS and 0.68 us on Ubuntu: Ubuntu is 5x slower. As I'm interested in the relative comparison, I can just compare the total runtimes (the "real" time) for the same result. I also rewrote this in C and called gettimeofday(2) directly:
    $ cat gettimeofdaybench.c
    #include <sys/time.h>
    
    int
    main(int argc, char *argv[])
    {
    	int i, ret;
    	struct timeval tv;
    	struct timezone tz;
    
    	for (i = 0; i < 100 * 1000 * 1000; i++) {
    		ret = gettimeofday(&tv, &tz);
    	}
    
    	return (0);
    }
    
    I compiled this with -O0 to avoid dropping the loop. Running this on the two systems saw similar results. I love short benchmarks like this, as I can disassemble the resulting binary and ensure that the compiled instructions match my expectations, and that the compiler hasn't messed with it.

    ## 7. clocksource Experimentation

    My second experiment was to change the clocksource. Checking those available:
    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    xen tsc hpet acpi_pm
    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    xen
    
    Ok, so it's defaulted to xen, which we saw in the flame graph (the tower ends with pvclock_clocksource_read()). Let's try tsc, which should be the fastest:
    # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    tsc
    $ time java TimeBench
    real    0m3.370s
    user    0m3.353s
    sys     0m0.026s
    
    The change is immediate, and my Java microbenchmark is now running over 20x faster than before! (And nearly 4x faster than on CentOS.) Now that it's reaching 33 ns, the loop instructions are likely inflating this result. If I wanted more accuracy, I'd partially [unroll the loop] so that the loop instructions become negligible.

    ## 8. Workaround

    The time stamp counter (TSC) clocksource is fast as it retrieves time using just an RDTSC instruction, and with vDSO it can do this without the syscall. TSC traditionally was not the default because of concerns about time drift; software-based clocksources could fix those issues and provide accurate monotonically-increasing time. I happened to be speaking at a technical conference while still debugging this, and mentioned what I was working on to a processor engineer. He said that tsc had been stable for years, and any advice about avoiding it was old. I asked if he knew of a public reference saying so, but he didn't. That chance encounter, coupled with Netflix's fault-tolerant cloud, gave me enough confidence to suggest trying tsc in production as a workaround for the issue. The change was obvious in the production graphs, showing a drop in write latencies:
    Once tested more broadly, it showed the write latencies dropped by 43%, delivering slightly better performance than on CentOS. The CPU flame graph for Ubuntu now looked like:
    os::javaTimeMillis() was now 1.6% in total. Note that it now enters "[[vdso]]" and nothing more: No kernel calls above it.

    ## 9. Aftermath

    I provided details to AWS and Canonical, and then moved on to the other performance issues as part of the migration. A colleague, Mike Huang, also hit this for a different service at Netflix and enabled tsc. We ended up setting it in the BaseAMI for all cloud services. Later that year (2014), Anthony Liguori from AWS gave a [re:Invent talk] recommending users switch the clocksource to tsc to improve performance. I also shared setting the clocksource in my talks and in my 2015 [Linux tunables] post. Over the years, more and more articles have been published about clocksource in virtual machines, and it's now a well-known issue. Amazon even provides an official [recommendation] (2021):
    "For EC2 instances launched on the AWS Xen Hypervisor, it's a best practice to use the tsc clock source. Other EC2 instance types, such as C5 or M5, use the AWS Nitro Hypervisor. The recommended clock source for the AWS Nitro Hypervisor is kvm-clock."
    As this indicates, things have changed with [Nitro], where clocksources are much faster (thanks!). In 2019 myself and others tested kvm-clock and found it was only about 20% slower than tsc. That's much better than the xen clocksource, but still slow enough to resist switching over absent a reason (such as a reemergence of tsc clock drift issues). I'm not sure if Intel ever published something to clarify tsc stability on newer processors. If you know they did, please drop a comment. The JMH benchmark suite can also now test System.currentTimeMillis(), so it's no longer necessary to roll your own (unless you want to disassemble it, in which case it's easier to have something short and simple).

    As for tracesys: I investigated the overhead for other syscalls and found it to be negligible, and before I returned to work on it further the kernel code paths changed and it was no longer present in the stacks. Did that Ubuntu release have a misconfiguration of auditing that was later fixed? I like to get to the rock bottom of issues, so it was a bit unsatisfying that the problem went away before I did. Even if I had figured it out, we'd still have preferred to go with tsc instead of the xen clocksource for the 4x improvement.

    ## 10. Summary

    Reading time itself can become a bottleneck for some clocksources. This was much worse many years ago on Xen virtual machine guests. For Linux I've been recommending the faster tsc clocksource for years, although I'm not a processor vendor so I can't make assurances about tsc issues of clock drift. At least AWS has now included it in their recommendations. Also, while I often post about superpower tracing tools, sometimes some humble hacking is best. In this case it was a couple of ad hoc microbenchmarks, only several lines of code each. Any time you're investigating the performance of some small discrete system component, consider finding or rolling your own microbenchmark to get more information on it experimentally. You have two hands: observation and experimentation.

    [What is Observability]: /blog/2021-05-23/what-is-observability.html
    [Ftrace version]: https://github.com/brendangregg/perf-tools
    [CPU flame graphs]: /FlameGraphs/cpuflamegraphs.html
    [Linux performance tools]: /linuxperf.html
    [Linux tunables]: /blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
    [60s checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
    [re:Invent talk]: https://youtu.be/ujGx0tiI1L4?t=2160
    [recommendation]: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
    [Nitro]: /blog/2017-11-29/aws-ec2-virtualization-2017.html
    [unroll the loop]: /blog/2014-04-26/the-noploop-cpu-benchmark.html

    September 25, 2021 02:00 PM

    September 20, 2021

    Linux Plumbers Conference: Welcome to LPC 2021 — Registration Closed

    Hi,
    thank you for attending LPC 2021!
    We have now reached our limit for attendees. Registration is now closed.
    If you are still intending to watch the conference you can do this by watching on YouTube.

    September 20, 2021 02:25 PM

    September 17, 2021

    Linux Plumbers Conference: Get ready for LPC 2021!

    The LPC 2021 conference is just around the corner. We wanted to share the logistics on how to participate and watch the virtual conference.

    For those that are not registered for the conference, we will have live streaming of the sessions on YouTube, like last year. This is free of charge. We will provide the URLs where to watch each day, on this page. The only limitation is that you cannot participate and ask questions live with audio. However this year we will have the chat in each Big Blue Button room also available externally via the Matrix open communication network. Anyone is invited to join with their personal Matrix account.

    Those who are registered for the conference will be able to log into our Big Blue Button server through our front end page, starting Monday September 20 at 7:00AM US Pacific time.
    To log in to BBB, please go to meet.lpc.events. You will find a front end showing the schedule for the current day with all the active sessions you can join. Your credentials are the email address you used for registration, and the confirmation code you received in email when you registered. Please make sure you have those available in advance of trying to log in.

    Please review the LPC 2021 Participant Guide before you join the conference.

    Looking forward to seeing you there!

    September 17, 2021 10:21 PM

    Linux Plumbers Conference: Linux Plumbers Conference 2021 is Almost Here

    We are only three days away from the start of LPC 2021!

    Thank you to all that made our conference possible:
    – Our generous Sponsors, listed here on the right
    – The Linux Foundation, which provides as always impeccable support
    – Our speakers and leaders, who are providing a lot of great content and planning great discussions

    As you can see, the schedule is finalized now. There are going to be seven parallel tracks each day, lasting four hours each. We have a total of 23 different tracks and Microconferences, with 191 sessions.

    At this time we are closing the CfPs for all tracks. We still have room for a limited number of Birds of a Feather sessions. If you want to propose one, even during the conference, and the necessary participants are all registered, please send an email to our lpc-contact@lists.linuxplumbersconf.org mailing list.

    Take a look at all the great technical content at this year's virtual LPC.
    You can view the schedule by main blocks, or by track, or as a complete detailed view.

    Note that at the end of the first day we’ll have a plenary keynote by Jon “maddog” Hall.
    Additionally, at the end of the last day we’ll have a plenary session as a wrap-up for this year's conference.

    The conference will be entirely virtual, offered on a completely free and open software stack.

    We look forward to five days filled with great discussions, and we hope that LPC 2021 will provide once again a creative and productive environment where ideas can be exchanged and problems tackled. Many great ideas have sprung in the past from these meetings, driving innovation in the Linux plumbing layer!

    September 17, 2021 02:43 AM

    September 14, 2021

    Pete Zaitcev: Scalability of a varying degree

    Seen at official site of Qumulo:

    Scale

    Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

    Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

    Google's Chubby paper has a little section on the scalability problem of talking to a cluster over TCP/IP. Basically, at low tens of thousands of connections you're starting to have serious issues with kernel sockets and TIME_WAIT. So maybe that.

    September 14, 2021 08:27 PM

    Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

    Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

    Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

    Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

    Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

    OK, theory is all well and good, but what happens in practice?

    As it turns out, on dual-socket systems, quite a bit.

    With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

    More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:


    1. The RCU CPU stall warning timeout is set to only 21 seconds.
    2. There was absolutely no console output during the full stall duration.
    3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds. (Kudos to Willy Tarreau for pointing out offlist that 2,199.02 seconds is almost exactly 2 to the 41st power worth of nanoseconds. Coincidence? You decide!)
    4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
    5. At the end of the stall, the kernel happily continued, usually with no other complaints.
    These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.

    But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

    However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

    But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

    The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.
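    For illustration, such a check might look like the following minimal sketch, in which do_suspect_work() and the 100-second limit are illustrative stand-ins rather than the actual debug code:

    /* Hypothetical duration check around a suspect code region. */
    unsigned long since = jiffies;

    do_suspect_work();

    /* time_after() copes correctly with jiffies wraparound. */
    WARN_ONCE(time_after(jiffies, since + 100 * HZ),
              "suspect code region ran for more than 100 seconds");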

    Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

    I switched to ktime_get() in order to do cross-CPU time comparisons, and out of sheer paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level of paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)
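    A minimal sketch of such a backwards-time check, again with illustrative names rather than the actual debug code:

    /* Hypothetical cross-check for time moving backwards. */
    ktime_t t_before = ktime_get();

    do_suspect_work();

    ktime_t t_after = ktime_get();

    WARN_ONCE(ktime_before(t_after, t_before),
              "time went backwards by %lld ns",
              ktime_to_ns(ktime_sub(t_before, t_after)));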

    Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

    Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

    An extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.
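    For reference, the states in question are given by the following enum, paraphrased from kernel/stop_machine.c; all participating CPUs must reach a given state before any of them may advance to the next one:

    enum multi_stop_state {
            MULTI_STOP_NONE,        /* Dummy starting state. */
            MULTI_STOP_PREPARE,     /* Awaiting everyone to be scheduled. */
            MULTI_STOP_DISABLE_IRQ, /* Disable interrupts. */
            MULTI_STOP_RUN,         /* Run the stop-machine callback. */
            MULTI_STOP_EXIT,        /* Exit. */
    };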

    Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

    Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace with only one runnable task. Maybe the CPUs need to manually enable their scheduling-clock interrupts when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

    The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.
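    A minimal sketch of this sort of diagnostic IPI, where stall_ipi_handler() and stuck_cpu are illustrative names rather than the actual test code:

    /* Hypothetical diagnostic IPI handler. */
    static void stall_ipi_handler(void *unused)
    {
            pr_info("diagnostic IPI handled on CPU %d\n", smp_processor_id());
    }

    /* Sent from another CPU every 100 seconds of multi_cpu_stop() execution. */
    smp_call_function_single(stuck_cpu, stall_ipi_handler, NULL, 0);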

    This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.
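    The NMI backtrace mentioned above can be requested from another CPU using an in-kernel helper, along these lines (stuck_cpu again being illustrative):

    /* Dump the stuck CPU's stack via NMI, even if it has interrupts disabled. */
    trigger_single_cpu_backtrace(stuck_cpu);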

    Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which at 100,000 interrupts per second corresponds to about one third of the stuck CPU's time (3.5 µs × 100,000/second ≈ 0.35 seconds of handler execution per second). It appears that the other two-thirds was consumed repeatedly entering and exiting the interrupt.

    The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.
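    For context, KTIME_MAX is the tick code's conventional value for “no further events needed”, as in this simplified paraphrase of the logic in kernel/time/tick-oneshot.c (not the exact source):

    int tick_program_event(ktime_t expires, int force)
    {
            struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);

            if (unlikely(expires == KTIME_MAX)) {
                    /* Stop the tick device instead of programming an event. */
                    clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
                    dev->next_event = KTIME_MAX;
                    return 0;
            }

            /* Otherwise, program the device for the requested expiration time. */
            return clockevents_program_event(dev, expires, force);
    }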

    So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

    And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

    ./kvm.sh --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"

    This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

    --- a/kernel/softirq.c
    +++ b/kernel/softirq.c
    @@ -595,7 +595,8 @@ void irq_enter_rcu(void)
     {
            __irq_enter_raw();
     
    -       if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
    +       if (tick_nohz_full_cpu(smp_processor_id()) ||
    +           (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
                    tick_irq_enter();
     
            account_hardirq_enter(current);
    

    This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

    And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

    This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

    So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.

    Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

    All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

    September 14, 2021 04:43 AM

    September 05, 2021

    Brendan Gregg: ZFS Is Mysteriously Eating My CPU

    A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

    ## 1. Problem Statement

    The microservice was for metrics ingestion and had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer.

    ## 2. Monitoring

    I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS.

    ## 3. Next Steps

    I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd cut to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph:

    Note that this sample flame graph is dominated by Java, shown by the green frames.

    ## 4. Flame Graph

    Here's the CPU flame graph from one of the problem instances:
    The kernel CPU time is pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower:
    This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS; they were right! The ZFS Adaptive Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low-memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes), creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes. The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory-freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: why is ZFS in use?

    ## 5. Configuration

    There was only one use of ZFS so far at Netflix that I knew of: a new infrastructure project was using ZFS for containers. This led me to a theory: if they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path. But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer:
    "We aren't using ZFS."
    What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why.

    ## 6. cd & ls

    I should be able to debug this using nothing more than the cd and ls(1) commands: cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted:
    df -h
    mount
    zfs list
    
    This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used?

    ## 7. arcstats

    I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them:
    # cat /proc/spl/kstat/zfs/arcstats
    name                            type data
    hits                            4    0
    misses                          4    0
    demand_data_hits                4    0
    demand_data_misses              4    0
    demand_metadata_hits            4    0
    demand_metadata_misses          4    0
    prefetch_data_hits              4    0
    prefetch_data_misses            4    0
    prefetch_metadata_hits          4    0
    prefetch_metadata_misses        4    0
    mru_hits                        4    0
    mru_ghost_hits                  4    0
    mfu_hits                        4    0
    mfu_ghost_hits                  4    0
    deleted                         4    0
    mutex_miss                      4    0
    evict_skip                      4    0
    evict_not_enough                4    0
    evict_l2_cached                 4    0
    evict_l2_eligible               4    0
    [...]
    
    Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat?? The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery.

    ## 8. Code Analysis

    I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit. The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._ (A small sketch contrasting the two approaches appears after the link list below.) The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator. I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again.

    [Kernel Recipes]: https://youtu.be/UVM3WX8Lq2k?t=133
    [Kernel Recipes2]: https://www.slideshare.net/brendangregg/kernel-recipes-2017-using-linux-perf-at-netflix/2
    [talks]: /index.html#Videos
    [issue #6531]: https://github.com/openzfs/zfs/issues/6531
    [source code]: https://github.com/openzfs/zfs/blob/4e33ba4c389f59b74138bf7130e924a4230d64e9/module/zfs/arc.c
    [Atlas]: https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
    [Vector]: https://netflixtechblog.com/introducing-vector-netflixs-on-host-performance-monitoring-tool-c0d3058c3f6f
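    As promised above, here is a sketch contrasting the expensive selection strategy with cheaper alternatives. This is hypothetical Linux-kernel-style code, not the actual OpenZFS source; num_sublists and the per-CPU counter evict_rotor are illustrative:

    unsigned int idx;

    /*
     * What the flame graph showed: crypto-quality randomness on every
     * eviction pass (hence get_random_bytes() and extract_entropy()).
     */
    get_random_bytes(&idx, sizeof(idx));
    idx %= num_sublists;

    /* Two cheaper alternatives: */
    idx = this_cpu_inc_return(evict_rotor) % num_sublists; /* per-CPU round robin */
    idx = prandom_u32_max(num_sublists);                   /* non-crypto PRNG */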

    September 05, 2021 02:00 PM

    August 29, 2021

    Brendan Gregg: Analyzing a High Rate of Paging

    These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this, including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots.

    ## 1. Problem Statement

    The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads:

    The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone.

    ## 2. iostat

    Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload:
    # iostat -xz 1
    Linux 4.4.0-1072-aws (xxx) 	12/18/2018 	_x86_64_	(16 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               5.03    0.00    0.83    1.94    0.02   92.18
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.29    0.21    0.17     6.29     3.09    49.32     0.00   12.74    6.96   19.87   3.96   0.15
    xvdb              0.00     0.08   44.39    9.98  5507.39  1110.55   243.43     2.28   41.96   41.75   42.88   1.52   8.25
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.81    0.00    1.08   29.94    0.06   54.11
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  745.00    0.00 91656.00     0.00   246.06    25.32   33.84   33.84    0.00   1.35 100.40
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.86    0.00    0.89   24.76    0.06   59.43
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  739.00    0.00 92152.00     0.00   249.40    24.75   33.49   33.49    0.00   1.35 100.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.95    0.00    0.89   28.75    0.06   55.35
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  734.00    0.00 91704.00     0.00   249.87    24.93   34.04   34.04    0.00   1.36 100.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.54    0.00    1.14   29.40    0.06   54.86
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00  750.00    0.00 92104.00     0.00   245.61    25.14   33.37   33.37    0.00   1.33 100.00
    
    ^C
    
    I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_await of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). The reads are largeish, roughly 123 Kbytes each (divide rkB/s by r/s: 91656 / 745 ≈ 123; the avgrq-sz column agrees at about 246 512-byte sectors).

    ## 3. biolatency

    From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue:
    # /usr/share/bcc/tools/biolatency -m
    Tracing block device I/O... Hit Ctrl-C to end.
    ^C
         msecs               : count     distribution
             0 -> 1          : 83       |                                        |
             2 -> 3          : 20       |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 41       |                                        |
            16 -> 31         : 1620     |*******                                 |
            32 -> 63         : 8139     |****************************************|
            64 -> 127        : 176      |                                        |
           128 -> 255        : 95       |                                        |
           256 -> 511        : 61       |                                        |
           512 -> 1023       : 93       |                                        |
    
    This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reach the 0.5-to-1.0-second range but, again, there's quite a bit of queueing here, as seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload.

    ## 4. bitesize

    As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd:
    # /usr/share/bcc/tools/bitesize 
    Tracing... Hit Ctrl-C to end.
    ^C
    Process Name = java
         Kbytes              : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 31       |                                        |
            16 -> 31         : 15       |                                        |
            32 -> 63         : 15       |                                        |
            64 -> 127        : 15       |                                        |
           128 -> 255        : 1682     |****************************************|
    
    The I/O is mostly in the 128-to-255-Kbyte bucket, as expected from the iostat(1) output. Nothing odd here.

    ## 5. free

    Also from the 60-second checklist:
    # free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64414       15421         349           5       48643       48409
    Swap:             0           0           0
    
    There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit?

    ## 6. cachestat

    [cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache:
    # /apps/perf-tools/bin/cachestat
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
        1811      632        2    74.1%           17      48009
        1630    15132       92     9.7%           17      48033
        1634    23341       63     6.5%           17      48029
        1851    13599       17    12.0%           17      48019
        1941     3689       33    34.5%           17      48007
        1733    23007      154     7.0%           17      48034
        1195     9566       31    11.1%           17      48011
    [...]
    
    This shows many cache misses, with a hit ratio varying between 6.5% and 74.1% (RATIO is simply HITS / (HITS + MISSES): for the first interval, 1811 / (1811 + 632) ≈ 74.1%). I usually like to see that ratio in the upper 90s. This is "cache busting": the 100-Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100-Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).

    ## 7. Smaller File Test

    For further confirmation, I collected the same outputs for a 32-Gbyte file upload. cachestat shows a ~100% cache hit ratio:
    # /apps/perf-tools/bin/cachestat
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
       61831        0      126   100.0%           41      33680
       53408        0       78   100.0%           41      33680
       65056        0      173   100.0%           41      33680
       65158        0       79   100.0%           41      33680
       55052        0      107   100.0%           41      33680
       61227        0      149   100.0%           41      33681
       58669        0       71   100.0%           41      33681
       33424        0       73   100.0%           41      33681
    ^C
    
    This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. free(1) shows it fitting in the page cache:
    # free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64414       18421       11218           5       34773       45407
    Swap:             0           0           0
    
    And iostat(1) shows little disk I/O, as expected:
    # iostat -xz 1
    Linux 4.4.0-1072-aws (xxx) 	12/19/2018 	_x86_64_	(16 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              12.25    0.00    1.24    0.19    0.03   86.29
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda              0.00     0.32    0.31    0.19     7.09     4.85    47.59     0.01   12.58    5.44   23.90   3.09   0.15
    xvdb              0.00     0.07    0.01   11.13     0.10  1264.35   227.09     0.91   82.16    3.49   82.20   2.80   3.11
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              57.43    0.00    2.95    0.00    0.00   39.62
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              53.50    0.00    2.32    0.00    0.00   44.18
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvdb              0.00     0.00    0.00    2.00     0.00    19.50    19.50     0.00    0.00    0.00    0.00   0.00   0.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              39.02    0.00    2.14    0.00    0.00   58.84
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    [...]
    
    ## 8. Final Notes

    cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working.

    [cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
    [60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
    [bcc]: https://github.com/iovisor/bcc

    August 29, 2021 02:00 PM

    August 27, 2021

    Michael Kerrisk (manpages): man-pages-5.13 released

    Alex Colomar and I have released released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

    The most notable of the changes in man-pages-5.13 are the following:

    Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

    August 27, 2021 09:07 PM

    August 26, 2021

    Brendan Gregg: Slack's Secret STDERR Messages

    These are rough notes. I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing. These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included. Note that I don't know anything about Slack internals, and there may be better ways to solve this.

    ## 1. Enabling core dumps

    I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root:

    # cat /proc/sys/kernel/core_pattern
    |/usr/share/apport/apport %p %s %c %d %P
    # mkdir /var/cores
    # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
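    # core(5) format specifiers: %e = executable name, %p = PID, %h = hostname, %t = time of dump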
    [...another crash...]
    # ls /var/cores
    #
    
    This didn't work: no core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first.

    ## 2. exitsnoop

    Using an eBPF/bcc tool to look for exit reasons:
    # exitsnoop -t
    TIME-AEST    PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE 
    13:51:19.432 kworker/dying    3663305 2      3663305 1241.59 0
    13:51:30.948 kworker/dying    3663626 2      3663626 835.76  0
    13:51:33.296 systemd-udevd    3664149 2054939 3664149 3.55    0
    13:53:09.256 kworker/dying    3662179 2      3662179 2681.15 0
    13:53:25.636 kworker/dying    3663520 2      3663520 1122.60 0
    13:53:30.705 grep             3664239 6009   3664239 0.08    0
    13:53:30.705 ps               3664238 6009   3664238 0.08    0
    13:53:40.297 slack            3663135 1786   3663135 1459.54 signal 6 (ABRT)
    13:53:40.298 slack            3663208 3663140 3663208 1457.86 0
    13:53:40.302 slack            3663140 1786   3663140 1459.18 0
    13:53:40.302 slack            3663139 1786   3663139 1459.18 0
    13:53:40.303 slack            3663171 1786   3663171 1458.22 0
    13:53:40.317 slack            3663197 1786   3663197 1458.03 0
    13:53:44.827 gdm-session-wor  3664269 1778   3664269 0.02    0
    [...]
    
    This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead.

    ## 3. killsnoop

    Running killsnoop (eBPF/bcc) to get more info:
    # killsnoop
    TIME      PID    COMM             SIG  TPID   RESULT
    13:45:01  2053366 systemd-journal  0    1024   0
    13:45:01  2053366 systemd-journal  0    3663525 -3
    13:45:01  2053366 systemd-journal  0    3663528 -3
    13:49:00  2054939 systemd-udevd    15   3664053 0
    13:51:33  2054939 systemd-udevd    15   3664149 0
    13:53:44  2053366 systemd-journal  0    4265   -1
    13:53:44  2053366 systemd-journal  0    972    0
    13:53:44  2053366 systemd-journal  0    1778   0
    13:53:44  2053366 systemd-journal  0    6414   -1
    [...]
    
    A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet, so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem.

    ## 4. perf trace

    Nowadays this can be done as a perf one-liner:
    # perf list syscalls:sys_enter_*kill
    
    List of pre-defined events (to be used in -e):
    
      syscalls:sys_enter_kill                            [Tracepoint event]
      syscalls:sys_enter_tgkill                          [Tracepoint event]
      syscalls:sys_enter_tkill                           [Tracepoint event]
    
    # perf trace -e 'syscalls:sys_enter_*kill'
     15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)
    
    Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it, though. I want to see a stack trace. I could use perf record or bpftrace... and that reminds me, didn't I write another signal tool using bpftrace?

    ## 5. signals.bt

    The signals.bt bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out:
    # bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt
    Attaching 3 probes...
    Counting signals. Hit Ctrl-C to end.
    ^C
    @[SIGNAL, PID, COMM] = COUNT
    
    @[SIGPIPE, 1883, Xorg]: 1
    @[SIGCHLD, 1797, dbus-daemon]: 1
    @[SIGINT, 3665167, bpftrace]: 1
    @[SIGTERM, 3665198, gdm-session-wor]: 1
    @[SIGCHLD, 3665197, gdm-session-wor]: 1
    @[SIGABRT, 3664940, slack]: 1
    @[SIGTERM, 3665197, gdm-session-wor]: 1
    @[SIGKILL, 3665207, dbus-daemon]: 1
    @[SIGWINCH, 859450, bash]: 2
    @[SIGCHLD, 1778, gdm-session-wor]: 2
    @[, 3665201, gdbus]: 2
    @[, 3665199, gmain]: 2
    @[SIGWINCH, 3665167, bpftrace]: 2
    @[SIGWINCH, 3663319, vi]: 2
    @[SIGCHLD, 1786, systemd]: 6
    @[SIGALRM, 1883, Xorg]: 106
    
    Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.)

    ## 6. Signal Stacks

    Moving from signals.bt to a bpftrace one-liner:
    # bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }'
    Attaching 1 probe...
    6, 
            __send_signal+579
            __send_signal+579
            send_signal+221
            do_send_sig_info+81
            do_send_specific+110
            do_tkill+171
            __x64_sys_tgkill+34
            do_syscall_64+73
            entry_SYSCALL_64_after_hwframe+68
    
            0x7f4a2e2e2f95
    
    This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There are ways to fix this, but it usually gets time consuming, so let me try something else first. Logs!

    ## 7. Logs

    Does Slack have logs? I have no idea. Maybe they contain the error message.
    # lsof -p `pgrep -n slack` | grep -i log
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
          Output information may be incomplete.
    
    Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one.
    # pstree -ps `pgrep -n slack`
    systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549)
                                                                                  ├─{slack}(3666551)
                                                                                  ├─{slack}(3666552)
                                                                                  ├─{slack}(3666553)
                                                                                  ├─{slack}(3666554)
                                                                                  ├─{slack}(3666555)
                                                                                  ├─{slack}(3666556)
                                                                                  ├─{slack}(3666557)
                                                                                  ├─{slack}(3666558)
                                                                                  ├─{slack}(3666559)
                                                                                  ├─{slack}(3666560)
                                                                                  ├─{slack}(3666564)
                                                                                  ├─{slack}(3666566)
                                                                                  ├─{slack}(3666568)
                                                                                  └─{slack}(3666609)
    
    Ok, how about I try the great-grandparent, PID 3666477:
    # lsof -p 3666477 | grep -i log
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
          Output information may be incomplete.
    slack   3666477 bgregg   37r    REG      253,1     32768   140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log
    slack   3666477 bgregg   40r    REG      253,1     32768   131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log
    slack   3666477 bgregg   71w    REG      253,1     15566  1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log
    slack   3666477 bgregg   72w    REG      253,1       347  1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG
    slack   3666477 bgregg   73w    REG      253,1   2324236  1718407 /home/bgregg/.config/Slack/logs/browser.log
    slack   3666477 bgregg   90w    REG      253,1    363600  1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log
    slack   3666477 bgregg   92w    REG      253,1       274  1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG
    slack   3666477 bgregg  108w    REG      253,1   4182513  1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log
    slack   3666477 bgregg  116w    REG      253,1       259  1704369 /home/bgregg/.config/Slack/Session Storage/LOG
    slack   3666477 bgregg  122w    REG      253,1     31536  1749325 /home/bgregg/.config/Slack/Session Storage/000036.log
    slack   3666477 bgregg  126w    REG      253,1   3970909  1704566 /home/bgregg/.config/Slack/logs/webapp-console.log
    slack   3666477 bgregg  127w    REG      253,1   2330006  1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log
    slack   3666477 bgregg  131w    REG      253,1       330  1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG
    slack   3666477 bgregg  640r    REG      253,1     32768   140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted)
    
    Lots of logs in ~/.config/Slack/logs!
    # cd ~/.config/Slack/logs
    # ls -lrth
    total 26M
    -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log
    -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log
    -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log
    drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/
    -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log
    -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log
    -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log
    
    Ok, so how about this one:
    # cat webapp-console.log
    [...]
    [08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED 
    [08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED 
    [08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L 
    [08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L 
    [08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455 
    [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455 
    [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred 
    [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms 
    [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms 
    [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events 
    [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now 
    [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms 
    [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms 
    
    No, I don't see any errors jumping out at me. How about searching for errors:
    # egrep -i 'error|fail' webapp-console.log
    [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
    [...]
    # egrep -i 'error|fail' browser.log 
    [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/16/21, 08:18:27:653] error: Failed to load empty window url in window 
      "error": {
        "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n    at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n    at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n    at Object.emit (events.js:315:20)\n    at Object.EventEmitter.emit (domain.js:467:12)",
    [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
    [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel 
      "error": {
        "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n    at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n    at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n    at Module.load (internal/modules/cjs/loader.js:935:32)\n    at Module._load (internal/modules/cjs/loader.js:776:14)\n    at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n    at Module.require (internal/modules/cjs/loader.js:959:19)\n    at require (internal/modules/cjs/helpers.js:88:18)\n    at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n    at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)",
    [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links 
    [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag 
    [...]
    
    I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR...

    ## 8. STDERR Tracing

    Where is STDERR written?
    # lsof -p 3666477
    [...]
    slack   3666477 bgregg  mem     REG               7,16    141930     7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak
    slack   3666477 bgregg  mem     REG               7,16    165680     7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin
    slack   3666477 bgregg    0r    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    1w    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    2w    CHR                1,3       0t0        6 /dev/null
    slack   3666477 bgregg    3r   FIFO               0,12       0t0 29532192 pipe
    slack   3666477 bgregg    4u   unix 0x00000000134e3c45       0t0 29526717 type=SEQPACKET
    slack   3666477 bgregg    5r    REG               7,16  10413488     7167 /snap/slack/44/usr/lib/slack/icudtl.dat
    [...]
    
    /dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that:
    # shellsnoop 3666477
    [...]
    [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
    [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred 
    [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms 
    [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms 
    [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events 
    [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now 
    [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms 
    
    
    [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms 
    
    (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme.
    This may indicate that pixbuf loaders or the mime database could not be found.
    **
    Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)
    
    Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final message before crashing, this is a smoking gun. Note that this was not in any log:
    # grep image-missing.png *
    grep: recorded-trace: Is a directory
    
    It's the .so file that is missing, not the .png:
    # ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png
    -rw-r--r-- 1 root root 535 Nov  6  2020 /usr/share/icons/Yaru/16x16/status/image-missing.png
    # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory
    
    But there is a .so file with a similar path:
    # ls -lh /snap/slack/
    total 0
    drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
    drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
    lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
    # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    
    Hmm, I wonder...

    ## 9. Workaround

    This is obviously a hack and is not guaranteed to be safe:
    # cd /snap/slack
    # ln -s current 42
    # ls -lh
    total 0
    lrwxrwxrwx 1 root root   7 Aug 27 15:01 42 -> current/
    drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
    drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
    lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
    # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    
    I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing! I'm guessing this is a problem with how the snap is built. Needs more debugging.

    ## 10. Other debugging

    In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed.

    ## 11. Bonus: opensnoop

    I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens:
    # opensnoop -Tx
    TIME(s)       PID    COMM       FD ERR PATH
    [...]
    11.412358000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/subclasses
    11.412360000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/icons
    11.412363000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/generic-icons
    11.412495000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    11.412527000  3673057 slack      -1   2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo
    11.412537000  3673057 slack      -1   2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo
    11.412559000  3673057 slack      -1   2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo
    11.412916000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
    11.425405000  1786   systemd    -1   2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events
    
    That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).

    August 26, 2021 02:00 PM

    Linux Plumbers Conference: BOFs Call for Proposals Now Open

    We have formally opened the CfP for Birds of a Feather. Select the BOFs track when submitting a BOF here.

    As a reminder:

    August 26, 2021 12:06 AM

    August 19, 2021

    Linux Plumbers Conference: Diversity, Equity & Inclusion Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Diversity, Equity & Inclusion Microconference has been accepted into the 2021 Linux Plumbers Conference.

    Creating diverse communities requires effort and commitment to creating inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in the Linux 5.8 release. Understanding whether this sort of change has been effective is a topic of active research. This MC will take the pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with perspectives from Linux community members. This microconference will build on what was started at the LPC 2020 BoF session on Improving Diversity.

    This year’s topics to be discussed include:

    Come and join us in the discussion of how we can improve the diversity of the Linux Kernel community and help keep it vibrant for the next 30 years!

    We hope to see you there.

    August 19, 2021 04:35 PM

    August 16, 2021

    Linux Plumbers Conference: GPU/media/AI buffer management and interop Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the GPU/media/AI buffer management and interop Microconference has been accepted into the 2021 Linux Plumbers Conference. The Linux GPU subsystem has long had three major tenets:

    Forthcoming hardware makes the first two difficult, if not impossible, to achieve. In order to give user space the fastest possible path to support modern complex workloads, forthcoming hardware is removing the notion of a small number of kernel-controlled job queues, replacing it with direct user space access to command queues to submit and control their own jobs.

    This, and evolution in the Vulkan API, make it difficult to retain the existing implicit synchronization model, where the kernel tracks all access and ensures that the hardware executes jobs in the order of user space submission, so that multiple independent clients can reuse the same buffers without data hazards. As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community.

    This year’s topics to be discussed include:

    Come and join us in the discussion of keeping Linux a first-class citizen in the world of graphics and media.

    We hope to see you there.

    August 16, 2021 06:51 PM

    August 04, 2021

    Linux Plumbers Conference: Android Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Android Microconference has been accepted into the 2021 Linux Plumbers Conference. Past Android microconferences were centered around being primarily a synchronization point between the Android kernel team and the rest of the community, informing the community of what the Android team has been doing. With the help of last year’s focus on the Generic Kernel Image[1] (GKI), this year’s Android microconference will instead be an opportunity to foster a higher level of collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that Android and Linux development move in lockstep going forward.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion of making Android a better partner with Linux.

    We hope to see you there.

    August 04, 2021 07:40 PM

    Dave Airlie (blogspot): crocus misrendering of the week

     I've been chasing a crocus misrendering bug shown in a Qt trace.


    The bottom image is crocus, vs 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing jumped out at me, except that the vertex shaders produced were different.

    However, they were different for many reasons, due to the different optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places, so there was a lot of noise in the shaders.

    I ported the trace to a piglit GL application so I could hack on the shaders and GL more easily; with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

    Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I could then finally spot the difference in the NIR shaders.

    What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization that mesa/st calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

    In a parallel thread on another part of the planet, Ian Romanick filed an MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However, his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

    It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

    Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

    August 04, 2021 08:04 AM

    July 29, 2021

    Linux Plumbers Conference: System Boot and Security Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the System Boot and Security Microconference has been accepted into the 2021 Linux Plumbers Conference. This microconference brings together those who are interested in firmware, bootloaders, system boot, and security. The events around last year’s BootHole showed how crucial platform initialization is for overall system security. Those events may have shown the shortcomings in the current boot process, but they have also tightened the cooperation between various companies and organizations. Now is the time to use this opportunity to discuss the lessons learned and what can be done to improve in the future. Other cooperation discussions are also welcome, such as those based on legal and organizational issues which may hinder working together.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion about how to keep your system secure even at bootup.

    We hope to see you there.

    July 29, 2021 07:10 PM

    July 28, 2021

    Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Kernel Dependability and Assurance Microconference has been accepted into the 2021 Linux Plumbers Conference.

    Linux development is producing kernels at an ever increasing rate, and at the same time with arguably increasing software quality. The process of kernel development has been adapting to handle the increasing number of contributors over the years, to ensure sufficient software quality. This quality is key in that Linux is now being used in applications that require a high degree of trust that the kernel is going to behave as expected. Some of the key areas where we're seeing Linux start to be used are medical devices, civil infrastructure, caregiving robots, and automotive systems.

    Last year’s miniconference raised awareness about this topic with the wider community. Since then the ELISA team has made contributions to documentation and tools. The team has deployed a CI server that runs static analysis tools and syzkaller on the Linux kernel repos, and is making the results of the last 10 days of linux-next available to the community.

    This year’s topics to be discussed include:

    Come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

    We hope to see you there.

    July 28, 2021 02:28 PM

    July 27, 2021

    Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVIII: Preventing Involuntary Generosity

    I recently learned that all that is required for someone to take out a loan in some random USA citizen's name is that citizen's full name, postal address, email address, date of birth, and social security number. If you are above a certain age, all of these are for all intents and purposes a matter of public record. If you are younger, then your social security number is of course supposed to be secret—and it will be, right up to that data breach that makes it available to all the wrong people.

    This sort of thing can of course be a bit annoying to our involuntarily generous USA citizen. Some unknown person out there gets a fancy toy, and our citizen gets some bank's dunning notices. Fortunately, there are quite a few things you can do, although I will not try to reproduce the entirety of the volumes of good advice that are available out there. Especially given that laws, processes, and procedures are all subject to change.

    But at present, one important way to prevent this is to put a hold on your credit information through any of Experian, Equifax, or TransUnion. I strongly suggest that you save yourself considerable time and hassle by doing this, which is free of charge for a no-questions-asked one-year hold. Taking this step is especially important if you are among the all too many of us whose finances don't have much slack, as was the case with my family back when my children were small. After all, it is one thing to have to deal with a few hassles in the form of phone calls, email, and paperwork, but it is quite another if you and your loved ones end up missing meals. Thankfully, it never came to that for my family, although one of my children did complain bitterly to a medical professional about the woefully insufficient stores of candy in our house.

    Of course, I also have some advice for the vendor, retailer, digital-finance company, and bank that were involved in my case of attempted involuntary generosity:


    1. Put just a little more effort into determining who you are really doing business with.
    2. If the toy contains a computer and connects to the internet, consider the option of directing your dunning notices through the toy rather than to the email and phone of your involuntarily generous USA citizen.
    3. A loan application for a toy that is shipped to a non-residential address should be viewed with great suspicion.
    4. In fact, such a loan application should be viewed with some suspicion even if the addresses match. Porch pirates and all that.
    5. If the toy is of a type that must connect to the internet to do anything useful, you have an easy method of dealing with non-payment, don't you?

    I should hasten to add that after only a little discussion, these companies have thus far proven quite cooperative in my particular case, which is why they are going nameless.

    Longer term, it is hard to be optimistic, especially given advances in various easy-to-abuse areas of information technology. In the meantime, I respectfully suggest that you learn from my experience and put a hold on your credit information!

    July 27, 2021 03:12 AM

    July 26, 2021

    Linux Plumbers Conference: RISC-V Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the RISC-V Microconference has been accepted into the 2021 Linux Plumbers Conference. The RISC-V software ecosystem is gaining momentum at breakneck speed, with three new Linux development platforms available this year. The new platforms bring new issues to deal with.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

    We hope to see you there.

    July 26, 2021 09:26 PM

    July 22, 2021

    Pete Zaitcev: MinIO liberates your storage from rebalancing

    MinIO posted a blog entry a few days ago where they bragged about adding capacity without a need to rebalance.

    First, they went into a full marketoid mode, whipping up the fear:

    Rebalancing a massive distributed storage system can be a nightmare. There’s nothing worse than adding a storage node and watching helplessly as user response time increases while the system taxes its own resources rebalancing to include the new node.

    Seems like MinIO folks assume that operators of distributed storage such as Swift and Ceph have no tools to regulate the resource consumption of rebalancing. So they have no choice but to "wait helplessly". Very funny.

    But it gets worse when obviously senseless statements are made:

    Rebalancing doesn’t just affect performance - moving many objects between many nodes across a network can be risky. Devices and components fail and that often leads to data loss or corruption.

    Often, man! Also, a commit protocol? Never heard of her!

    Then, we talk about some unrelated matters:

    A group of drives is an erasure set and MinIO uses a Reed-Solomon algorithm to split objects into data and parity blocks based on the size of the erasure set and then uniformly distributes them across all of the drives in the erasure such that each drive in the set contains no more than one block per object.

    Understood, your erasure set is what we call "partition" in Swift or a placement group in Ceph.

    Finally, we get to the matter at hand:

    To enable rapid growth, MinIO scales by adding server pools and erasure sets. If we had built MinIO to allow you to add a drive or even a single hardware node to an existing server pool, then you would have to suffer through rebalancing.

    MinIO scales up quickly by adding server pools, each an independent set of compute, network and storage resources.

    Add hardware, run MinIO server to create and name server processes, then update MinIO with the name of the new server pool. MinIO leaves existing data in their original server pools while exposing the new server pools to incoming data.

    My hot take on social media was: "Placing new sets on new storage impacts utilization and risks hotspotting because of time affinity. There's no free lunch." Even on second thought, I think that is about right. But let us not ignore the cost of the data movement associated with rebalancing. What if the operator wants to implement in Swift what the MinIO blog post talks about?

    It is possible to emulate MinIO, to an extent. Some operators add a new storage policy when they expand the cluster, configure all the new nodes and/or volumes in its ring, then make it the default, so newly-created objects end up on the new hardware. This accomplishes the same goals that MinIO outlines above, but it's a kludge: Swift was not intended for this originally, and it shows. In particular, storage policies were intended for low numbers of storage classes, such as rotating media and SSD, or Silver/Gold/Platinum. Once you make a new policy for each new forklift visit, you run a risk of hitting scalability issues. Well, most clusters only upgrade a few times over their lifetime, but potentially it's a problem. Also, policies are customer-visible; they are intended to be.
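
    To make that concrete, here is a minimal sketch of the storage-policy kludge (the policy index and name, IP, device, and weight are illustrative, not from the post):

    # In /etc/swift/swift.conf, declare the new policy and make it the default:
    #   [storage-policy:1]
    #   name = expansion-2021
    #   default = yes
    # Then build a ring for policy 1 that contains only the new hardware:
    swift-ring-builder object-1.builder create 10 3 1
    swift-ring-builder object-1.builder add r1z1-203.0.113.10:6200/sdb 100
    swift-ring-builder object-1.builder rebalance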

    In the end, I still think that a balanced cluster is the way to go. Just think rationally about it.

    Interestingly, the reverse emulation appears not to be possible for MinIO: if you wanted to rebalance your storage, you would not be able to. Or at least the blog post above says: "If we had built MinIO to allow you to add a drive or ... a node to an existing server pool". I take it to mean that they don't allow it, and the blog post is very much a case of sour grapes, then.

    July 22, 2021 11:08 PM

    Linux Plumbers Conference: Open Printing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Open Printing Microconference has been accepted into the 2021 Linux Plumbers Conference. Over the years OpenPrinting has been actively working on improving and modernizing the way we print in Linux. We have been working on multiple areas of printing and scanning. Driverless print and scan technologies in particular have helped the world do away with a lot of the hassle involved in finding and installing the correct driver. Users can now just plug in their printer and do what they need.

    Based on the discussions that we had last year, we have been able to achieve the following:

    – Significant progress in deciding on the structure of PAPPL, a framework/library for developing Printer Applications as a replacement for Printer Drivers.

    – Progress on LPrint, a Label Printer Application implementing printing for a variety of common label and receipt printers connected via network or USB.

    – Helped give shape to the Printer Application concept. Sample printer applications for HP PCL printers have been created that use PAPPL to support IPP printing from multiple operating systems. This prototype will help others looking to adopt this new concept of Printer Applications. The first production Printer Application started from this prototype is the PostScript Printer Application.

    Development is in continuous progress; see the state of the art in OpenPrinting’s monthly news posts[6].

    This year’s topics to be discussed include:

    Come join us and participate in the discussion on bringing a better experience to Linux printing, scanning, and fax.

    We hope to see you there.

    July 22, 2021 04:21 PM

    July 21, 2021

    Dave Airlie (blogspot): llvmpipe/lavapipe: anisotropic texture filtering

    The last missing feature llvmpipe needs in order to expose OpenGL 4.6 is anisotropic texture filtering. Adding support for this also allows lavapipe to expose the Vulkan samplerAnisotropy feature.

    I started writing anisotropic support more than six months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out that it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

    I had also considered porting SwiftShader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

    Porting the code to llvmpipe means rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working, it passed the GL CTS but failed the VK CTS. The results also looked worse to my eye than I'd have thought acceptable, and softpipe seemed to be as bad.

    Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially where the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

    The softpipe code was based on an implementation of an Elliptical Weighted Average Filter (EWA). The paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter" described this. I sat down with the paper and softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.

    I then ported the same fix to my llvmpipe code, and VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed in [2] today.

    For GL 4.6 there are still some fixes needed in other areas.

    [1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

    [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

    July 21, 2021 01:07 AM

    July 18, 2021

    Linux Plumbers Conference: GNU Tools Track Added to Linux Plumbers Conference 2021

    We are very excited to announce that for 2021 our friends from the GNU Toolchain will once again join the Linux Plumbers Conference with an additional track: the GNU Tools track. The track will run for all 5 days of the conference.
    For more information about what types of proposals are accepted, please see the GNU Tools track wiki page.
    The call for papers is now open and will close on August 31, 2021. To submit a proposal, please go to our CFP page and select the GNU Tools Track.

     

    July 18, 2021 07:49 PM

    July 16, 2021

    Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the VFIO/IOMMU/PCI Microconference has been accepted into the 2021 Linux Plumbers Conference. Today’s high-speed components commonly utilize devices that implement the PCI interconnect specification, together with system IOMMUs that provide memory and access control between the devices and system resources. This domain is constantly gaining new features, such as:

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion on helping Linux keep up with the new features being added to the PCI interconnect specification.

    We hope to see you there.

    July 16, 2021 04:15 PM

    July 13, 2021

    Matthew Garrett: Does free software benefit from ML models being derived works of training data?

    Github recently announced Copilot, a machine learning system that makes suggestions for you when you're writing code. It's apparently trained on all public code hosted on Github, which means there's a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.

    Unsurprisingly, this has led to a number of questions along the lines of "If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?". This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work and not abiding by the terms of the GPL is.

    But free software licenses only have power to the extent that copyright permits them to. If your code isn't a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot's output doesn't count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don't apply to the output. Some people have interpreted this as an attack on free software - Copilot may insert code that's either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.

    I'm completely unqualified to hold a strong opinion on whether Github's legal position is justifiable or not, and right now I'm also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn't? Having been involved in a bunch of GPL enforcement activities, it's very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that's shifted over the past few days.

    Let's look at the GNU manifesto, specifically this section:

    The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.

    The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

    The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle's argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple's copyright, we might be living in a future where we had no free software desktop environments.

    When we argue for an interpretation of copyright law that enhances the power of the GPL, we're also enhancing the power of giant corporations with a lot of lawyers on hand. So let's look at this another way. If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won't enter the commons, but the ideas it embodies will. No more worries about whether you're literally copying the code that implements an algorithm you want to duplicate - simply start typing and let the model remove the risk for you.

    There's a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It's not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren't under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don't know how to measure that. But what I do know is that free software was founded in a belief that software shouldn't be constrained by copyright, and our default stance shouldn't be to argue against the idea that copyright is weaker than we imagined.

    (Edit: this post by Julia Reda makes some of the same arguments, but spends some more time focusing on a legal analysis of why having copyright cover the output of Copilot would be a problem)


    July 13, 2021 08:09 AM

    July 12, 2021

    Linux Plumbers Conference: File system Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the File System Microconference has been accepted into the 2021 Linux Plumbers Conference. File systems are key to any operating system, and especially for the Linux kernel. They are the gateway to the underlying storage, or could simply live in RAM as a virtual information repository. The file system developers are constantly adding features and improvements. Some of these new features are slow to be utilized by application developers, or they may be used in interesting ways that the file system developers never thought of.

    This year’s topics to be discussed include:

    These are big ongoing projects that have implications across all file systems as well as users, and would be good to discuss across a large portion of attendees.

    Come and join us in the discussion of improving the state of saving, reading, and accessing your data.

    We hope to see you there.

    July 12, 2021 10:49 PM

    July 09, 2021

    Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Testing and Fuzzing Microconference has been accepted into the 2021 Linux Plumbers Conference. In spite of the huge number of products shipping with the Linux kernel which are being thoroughly tested by OEMs and distribution providers, there is still no enforced quality standard upstream. How can we make best use of all the publicly available infrastructure and test frameworks in order to fill this gap? Testing and fuzzing upstream as well as gathering results from products is crucial to keeping a project that has over 5,000 commits every month stable for all to use.

    Last year’s meetup achieved the following:

    This year’s topics to be discussed include:

    Come and join us in the discussion of keeping Linux at the best quality it can be.

    We hope to see you there.

    July 09, 2021 08:23 PM

    July 04, 2021

    Brendan Gregg: USENIX LISA2021 Computing Performance: On the Horizon

    It's an exciting time for developments in computer performance, not just for the BPF technology (which I often [write about]) but also for processors with 3D stacking and cloud vendor CPUs (e.g., AWS Graviton2); for memory with the arrival of DDR5 and High Bandwidth Memory (HBM) on-processor; for storage including new uses for 3D XPoint as a 3D NAND accelerator; for networking with the rise of QUIC and eXpress Data Path (XDP); and so on. I summarized these topics and more as a plenary conference talk, including my own predictions (as a senior performance engineer) for the future of computing performance, with a focus on back-end servers. The video is on [youtube]:

    The slides are on [slideshare] or as a [PDF]:
    I work on many areas of performance, but recently I've had a lot of demand to talk about BPF. This was a chance to talk about other things I've been working on, such as the present and future of hardware performance. I also wrote about these topics in detail for my recent [Systems Performance 2nd Edition] book. Note that my predictions in this talk may be wrong, but they should be thought-provoking. I hope you enjoy it!

    ## References

    I've reproduced the talk references below, so you can click on links:

    - [Gregg 08] Brendan Gregg, “ZFS L2ARC,” http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html, Jul 2008
    - [Gregg 10] Brendan Gregg, “Visualizations for Performance Analysis (and More),” https://www.usenix.org/conference/lisa10/visualizations-performance-analysis-and-more, 2010
    - [Greenberg 11] Marc Greenberg, “DDR4: Double the speed, double the latency? Make sure your system can handle next-generation DRAM,” https://www.chipestimate.com/DDR4-Double-the-speed-double-the-latencyMake-sure-your-system-can-handle-next-generation-DRAM/Cadence/Technical-Article/2011/11/22, Nov 2011
    - [Hruska 12] Joel Hruska, “The future of CPU scaling: Exploring options on the cutting edge,” https://www.extremetech.com/computing/184946-14nm-7nm-5nm-how-low-can-cmos-go-it-depends-if-you-ask-the-engineers-or-the-economists, Feb 2012
    - [Gregg 13] Brendan Gregg, “Blazing Performance with Flame Graphs,” https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg, 2013
    - [Shimpi 13] Anand Lal Shimpi, “Seagate to Ship 5TB HDD in 2014 using Shingled Magnetic Recording,” https://www.anandtech.com/show/7290/seagate-to-ship-5tb-hdd-in-2014-using-shingled-magnetic-recording, Sep 2013
    - [Borkmann 14] Daniel Borkmann, “net: tcp: add DCTCP congestion control algorithm,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce, 2014
    - [Macri 15] Joe Macri, “Introducing HBM,” https://www.amd.com/en/technologies/hbm, Jul 2015
    - [Cardwell 16] Neal Cardwell, et al., “BBR: Congestion-Based Congestion Control,” https://queue.acm.org/detail.cfm?id=3022184, 2016
    - [Gregg 16] Brendan Gregg, “Unikernel Profiling: Flame Graphs from dom0,” http://www.brendangregg.com/blog/2016-01-27/unikernel-profiling-from-dom0.html, Jan 2016
    - [Gregg 16b] Brendan Gregg, “Linux 4.X Tracing Tools: Using BPF Superpowers,” https://www.usenix.org/conference/lisa16/conference-program/presentation/linux-4x-tracing-tools-using-bpf-superpowers, 2016
    - [Alcorn 17] Paul Alcorn, “Seagate To Double HDD Speed With Multi-Actuator Technology,” https://www.tomshardware.com/news/hdd-multi-actuator-heads-seagate,36132.html, 2017
    - [Alcorn 17b] Paul Alcorn, “Hot Chips 2017: Intel Deep Dives Into EMIB,” https://www.tomshardware.com/news/intel-emib-interconnect-fpga-chiplet,35316.html#xenforo-comments-3112212, 2017
    - [Corbet 17] Jonathan Corbet, “Two new block I/O schedulers for 4.12,” https://lwn.net/Articles/720675, Apr 2017
    - [Gregg 17] Brendan Gregg, “AWS EC2 Virtualization 2017: Introducing Nitro,” http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html, Nov 2017
    - [Russinovich 17] Mark Russinovich, “Inside the Microsoft FPGA-based configurable cloud,” https://www.microsoft.com/en-us/research/video/inside-microsoft-fpga-based-configurable-cloud, 2017
    - [Gregg 18] Brendan Gregg, “Linux Performance 2018,” http://www.brendangregg.com/Slides/Percona2018_Linux_Performance.pdf, 2018
    - [Hady 18] Frank Hady, “Achieve Consistent Low Latency for Your Storage-Intensive Workloads,” https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/low-latency-for-storage-intensive-workloads-article-brief.html, 2018
    - [Joshi 18] Amit Joshi, et al., “Titus, the Netflix container management platform, is now open source,” https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436, Apr 2018
    - [Cutress 19] Dr. Ian Cutress, “Xilinx Announces World Largest FPGA: Virtex Ultrascale+ VU19P with 9m Cells,” https://www.anandtech.com/show/14798/xilinx-announces-world-largest-fpga-virtex-ultrascale-vu19p-with-9m-cells, Aug 2019
    - [Gallatin 19] Drew Gallatin, “Kernel TLS and hardware TLS offload in FreeBSD 13,” https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf, 2019
    - [Redestad 19] Claes Redestad, Staffan Friberg, Aleksey Shipilev, “JEP 230: Microbenchmark Suite,” http://openjdk.java.net/jeps/230, updated 2019
    - [Bearman 20] Ian Bearman, “Exploring Profile Guided Optimization of the Linux Kernel,” https://linuxplumbersconf.org/event/7/contributions/771, 2020
    - [Burnes 20] Andrew Burnes, “GeForce RTX 30 Series Graphics Cards: The Ultimate Play,” https://www.nvidia.com/en-us/geforce/news/introducing-rtx-30-series-graphics-cards, Sep 2020
    - [Charlene 20] Charlene, “800G Is Coming: Set Pace to More Higher Speed Applications,” https://community.fs.com/blog/800-gigabit-ethernet-and-optics.html, May 2020
    - [Cutress 20] Dr. Ian Cutress, “Insights into DDR5 Sub-timings and Latencies,” https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies, Oct 2020
    - [Ford 20] A. Ford, et al., “TCP Extensions for Multipath Operation with Multiple Addresses,” https://datatracker.ietf.org/doc/html/rfc8684, Mar 2020
    - [Gregg 20] Brendan Gregg, “Systems Performance: Enterprise and the Cloud, Second Edition,” Addison-Wesley, 2020
    - [Hruska 20] Joel Hruska, “Intel Demos PCIe 5.0 on Upcoming Sapphire Rapids CPUs,” https://www.extremetech.com/computing/316257-intel-demos-pcie-5-0-on-upcoming-sapphire-rapids-cpus, Oct 2020
    - [Liu 20] Linda Liu, “Samsung QVO vs EVO vs PRO: What’s the Difference? [Clone Disk],” https://www.partitionwizard.com/clone-disk/samsung-qvo-vs-evo.html, 2020
    - [Moore 20] Samuel K. Moore, “A Better Way to Measure Progress in Semiconductors,” https://spectrum.ieee.org/semiconductors/devices/a-better-way-to-measure-progress-in-semiconductors, Jul 2020
    - [Peterson 20] Zachariah Peterson, “DDR5 vs. DDR6: Here's What to Expect in RAM Modules,” https://resources.altium.com/p/ddr5-vs-ddr6-heres-what-expect-ram-modules, Nov 2020
    - [Salter 20] Jim Salter, “Western Digital releases new 18TB, 20TB EAMR drives,” https://arstechnica.com/gadgets/2020/07/western-digital-releases-new-18tb-20tb-eamr-drives, Jul 2020
    - [Spier 20] Martin Spier, Brendan Gregg, et al., “FlameScope,” https://github.com/Netflix/flamescope, 2020
    - [Tolvanen 20] Sami Tolvanen, Bill Wendling, and Nick Desaulniers, “LTO, PGO, and AutoFDO in the Kernel,” Linux Plumber’s Conference, https://linuxplumbersconf.org/event/7/contributions/798, 2020
    - [Vega 20] Juan Camilo Vega, Marco Antonio Merlini, Paul Chow, “FFShark: A 100G FPGA Implementation of BPF Filtering for Wireshark,” IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020
    - [Warren 20] Tom Warren, “Microsoft reportedly designing its own ARM-based chips for servers and Surface PCs,” https://www.theverge.com/2020/12/18/22189450/microsoft-arm-processors-chips-servers-surface-report, Dec 2020
    - [Google 21] Google, “Cloud TPU,” https://cloud.google.com/tpu, 2021
    - [Haken 21] Michael Haken, et al., “Delta Lake 1S Server Design Specification 1v05,” https://www.opencompute.org/documents/delta-lake-1s-server-design-specification-1v05-pdf, 2021
    - [Intel 21] Intel Corporation, “Intel® Optane™ Technology,” https://www.intel.com/content/www/us/en/products/docs/storage/optane-technology-brief.html, 2021
    - [Quach 21a] Katyanna Quach, “Global chip shortage probably won't let up until 2023, warns TSMC: CEO 'still expects capacity to tighten more',” https://www.theregister.com/2021/04/16/tsmc_chip_forecast, Apr 2021
    - [Quach 21b] Katyanna Quach, “IBM says it's built the world's first 2nm semiconductor chips,” https://www.theregister.com/2021/05/06/ibm_2nm_semiconductor_chips, May 2021
    - [Ridley 21] Jacob Ridley, “IBM agrees with Intel and TSMC: this chip shortage isn't going to end anytime soon,” https://www.pcgamer.com/ibm-agrees-with-intel-and-tsmc-this-chip-shortage-isnt-going-to-end-anytime-soon, May 2021
    - [Shilov 21] Anton Shilov, “Samsung Develops 512GB DDR5 Module with HKMG DDR5 Chips,” https://www.tomshardware.com/news/samsung-512gb-ddr5-memory-module, Mar 2021
    - [Shilov 21b] Anton Shilov, “Seagate Ships 20TB HAMR HDDs Commercially, Increases Shipments of Mach.2 Drives,” https://www.tomshardware.com/news/seagate-ships-hamr-hdds-increases-dual-actuator-shipments, 2021
    - [Shilov 21c] Anton Shilov, “SK Hynix Envisions 600-Layer 3D NAND & EUV-Based DRAM,” https://www.tomshardware.com/news/sk-hynix-600-layer-3d-nand-euv-dram, Mar 2021
    - [Shilov 21d] Anton Shilov, “Sapphire Rapids Uncovered: 56 Cores, 64GB HBM2E, Multi-Chip Design,” https://www.tomshardware.com/news/intel-sapphire-rapids-xeon-scalable-specifications-and-features, Apr 2021
    - [SuperMicro 21] SuperMicro, “B12SPE-CPU-25G (For SuperServer Only),” https://www.supermicro.com/en/products/motherboard/B12SPE-CPU-25G, 2021
    - [Thaler 21] Dave Thaler, Poorna Gaddehosur, “Making eBPF work on Windows,” https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows, May 2021
    - [TornadoVM 21] TornadoVM, “TornadoVM: Run your software faster and simpler!” https://www.tornadovm.org, 2021
    - [Trader 21] Tiffany Trader, “Cerebras Second-Gen 7nm Wafer Scale Engine Doubles AI Performance Over First-Gen Chip,” https://www.enterpriseai.news/2021/04/21/latest-cerebras-second-gen-7nm-wafer-scale-engine-doubles-ai-performance-over-first-gen-chip, Apr 2021
    - [Vahdat 21] Amin Vahdat, “The past, present and future of custom compute at Google,” https://cloud.google.com/blog/topics/systems/the-past-present-and-future-of-custom-compute-at-google, Mar 2021
    - [Wikipedia 21] “Semiconductor device fabrication,” https://en.wikipedia.org/wiki/Semiconductor_device_fabrication, 2021
    - [Wikipedia 21b] “Silicon,” https://en.wikipedia.org/wiki/Silicon, 2021
    - [ZonedStorage 21] Zoned Storage, “Zoned Namespaces (ZNS) SSDs,” https://zonedstorage.io/introduction/zns, 2021

    I've taken care to cite the author names along with the talk title and date, including for Internet sources, instead of the common practice of just listing URLs. I followed that practice when writing some earlier books, and it has since struck me as unfair that some references had author names and some didn't. Nowadays I always include full names when known.

    In case you are interested, at the same conference I also gave a talk on [BPF Internals].

    [youtube]: https://www.youtube.com/watch?v=5nN1wjA_S30
    [PDF]: /Slides/LISA2021_ComputingPerformance.pdf
    [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html
    [BPF Internals]: /blog/2021-06-15/bpf-internals.html
    [slideshare]: https://www.slideshare.net/brendangregg/computing-performance-on-the-horizon-2021
    [write about]: /blog/2021-07-03/how-to-add-bpf-observability.html

    July 04, 2021 02:00 PM

    July 02, 2021

    Brendan Gregg: How To Add eBPF Observability To Your Product

    There's an arms race to add [eBPF] (BPF) to commercial observability products, and in this post I'll describe how to quickly do that. This is also applicable for people adding it to their own in-house monitoring systems.

    People like to show me their BPF observability products after they have prototyped or built them, but I often wish I had given them advice before they started. As the leader of BPF observability, it's advice I've been including in recent talks, and now I'm including it in this post.

    First, I know you're busy. You might not even like BPF. To be pragmatic, I'll describe how to spend the least effort to get the most value. Think of this as "version 1": A starting point that's pretty useful. Whether you follow this advice or not, at least please understand it to avoid later regrets and pain.

    If you're using an open source monitoring platform, first check if it already has a BPF agent. This post assumes it doesn't, and you'll be adding something for the first time.

    ## 1. Run your first tool

    Start by installing the [bcc] or [bpftrace] tools. E.g., bcc on Ubuntu:

    # apt-get install bpfcc-tools
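
    Or, if you'd rather start with bpftrace, the install is similar (the package name below is as in recent Ubuntu releases; older releases may need a snap or a source build):

    # apt-get install bpftrace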
    
    Then try running a tool. E.g., to see process execution with timestamps using execsnoop(8):
    # execsnoop-bpfcc -T
    TIME     PCOMM            PID    PPID   RET ARGS
    19:36:15 service          828567 6009     0 /usr/sbin/service --status-all
    19:36:15 basename         828568 828567   0 
    19:36:15 basename         828569 828567   0 /usr/bin/basename /usr/sbin/service
    19:36:15 env              828570 828567   0 /usr/bin/env -i LANG=en_AU.UTF-8 LANGUAGE=en_AU:en LC_CTYPE= LC_NUMERIC= LC_TIME= LC_COLLATE= LC_MONETARY= LC_MESSAGES= LC_PAPER= LC_NAME= LC_ADDRESS= LC_TELEPHONE= LC_MEASUREMENT= LC_IDENTIFICATION= LC_ALL= PATH=/opt/local/bin:/opt/local/sbin:/usr/local/git/bin:/home/bgregg/.local/bin:/home/bgregg/bin:/opt/local/bin:/opt/local/sbin:/ TERM=xterm-256color /etc/init.d/acpid 
    19:36:15 acpid            828570 828567   0 /etc/init.d/acpid status
    19:36:15 run-parts        828571 828570   0 /usr/bin/run-parts --lsbsysinit --list /lib/lsb/init-functions.d
    19:36:15 systemctl        828572 828570   0 /usr/bin/systemctl -p LoadState --value show acpid.service
    19:36:15 readlink         828573 828570   0 /usr/bin/readlink -f /etc/init.d/acpid
    [...]
    
    While basic, I've solved many perf issues with this tool alone, including for misconfigured systems where a shell script is launching failing processes in a loop, and when some minor application is crashing and restarting every few minutes but has not yet been noticed.

    ## 2. Add a tool to your product

    Now imagine adding execsnoop(8) to your product. You likely already have agents running on all your customer systems. Do they have a way to run a command and return the text output? Or run a command and send the output elsewhere for aggregation (S3, Hive, Druid, etc.)? There are so many options that it's really your own preference, based on your existing system and customer environments. When you add your first tool to your product, have it run for a short duration such as 10 to 60 seconds. I just noticed execsnoop(8) doesn't have a duration option yet, so in the interim you could wrap it with timeout(1): timeout -s INT 60 execsnoop-bpfcc. If you want to run these tools 24x7, study overheads to understand the cost first. Low-frequency events such as process execution should be negligible to capture.

    Instead of bcc, you can also use the [bpftrace] versions. These typically don't have canned options (-v, -l, etc.), but do have a json output mode. E.g.:
    # bpftrace -f json execsnoop.bt 
    {"type": "attached_probes", "data": {"probes": 2}}
    {"type": "printf", "data": "TIME(ms)   PID   ARGS\n"}
    {"type": "printf", "data": "2737       849176 "}
    {"type": "join", "data": "ls -F"}
    {"type": "printf", "data": "5641       849178 "}
    {"type": "join", "data": "date"}
    
    This mode was added so that BPF observability products can be built on top of bpftrace.

    ## 3. Don't worry about dependencies

    I am indeed suggesting that you install bcc or bpftrace on your customer systems, and they currently have llvm dependencies. This can add up to tens of Mbytes, which can be a problem for some resource-constrained environments (embedded). We've been doing lots of work to fix this in the future. bcc has newer versions of the tools (libbpf-tools) that use [BTF and CO-RE] (and not Python) and will ultimately mean you can install 100-Kbyte binary versions of the tools with no dependencies. bpftrace has a similar plan to produce a small dependency-less binary using the newer kernel features. This does require at least Linux 5.8 to work well, and your customers may not run that for years. In the interim I'd suggest not worrying about the llvm dependencies for now, since it will be fixed later.

    Note that not all Linux distributions have enabled CONFIG_DEBUG_INFO_BTF=y, which is necessary for the future of BTF and CO-RE. Major distros have set it, such as in Ubuntu 20.10, Fedora 30, and RHEL 8.2. But if you know some of your customers are running something uncommon, please check and encourage them or the distro vendor to set CONFIG_DEBUG_INFO_BTF=y and CONFIG_DEBUG_INFO_BTF_MODULES=y to avoid pain in the future. (A quick check for this is sketched at the end of this post.)

    ## 4. Version 1 dashboard

    Now that you have one BPF observability tool in your product, it's time to add more. Here are the top ten tools you can run and present as a generic BPF observability dashboard, along with suggested visualizations:

    This is based on my [bcc Tutorial], and many also exist in bpftrace. I chose these to find the most performance wins with the fewest tools. Note that runqlat and profile can have noticeable overheads, so I'd run these tools for between 10 and 60 seconds only and generate a report. Some are low enough overhead to be run 24x7 if desired (e.g., execsnoop, biolatency, tcplife, tcpretrans).

    There is already documentation as man pages and example files in the bcc and bpftrace repositories that you can link to, to help your customers understand the tool output. E.g., here are the execsnoop(8) example files in bcc and bpftrace.

    Once you have this all working, you have version 1!

    ## bcc vs bpftrace

    The bcc tools are the easiest to use, as they usually have many command-line options. The bpftrace tools are easier to edit and customize, and bpftrace has a json output mode. If you're completely new to tracing, go with bcc. If you want to do some hacking and customizing of the tools, go with bpftrace. In the end, they are both good options.

    ## Case study: Netflix

    Netflix is building a new GUI that does this tool dashboard and more, based on the bpftrace versions of these tools. The architecture is:

    While the bpftrace binary is installed on all the target systems, the bpftrace tools (text files) live on a web server and are pushed out when needed. This means we can ensure we're always running the latest version of the tools by updating them in one place. This is currently part of our FlameCommander UI, which also runs flame graphs across the cloud. Our previous BPF GUI was part of [Vector], and used bcc, but we've since deprecated that. We'll likely open source the new one at some point and have a post about it on the Netflix tech blog.

    ## Case study: Facebook

    Facebook are advanced users of BPF, but deep details of how they run the tools fleet-wide aren't fully public. Based on the activity in bcc, and their development of the BTF and CO-RE technologies, I'd strongly suspect their solution is based on the bcc libbpf-tool versions.

    ## Porting Pitfalls

    BPF tracing tools are like application and kernel patches. They need constant updates to keep working across different software versions. Porting them to a different language and then not maintaining them may be like trying to apply a Linux 4.15 patch to Linux 5.12. If you're lucky, it blows up! If you're unlucky, the patch applies but corrupts some things in a subtle way that you don't notice until later. It depends on the tool.

    As an extreme example, I wrote cachestat(8) while on vacation in 2014 for use on the Netflix cloud, which was a mix of Linux 3.2 and 3.13 at the time. BPF didn't exist on those versions, so I used basic Ftrace capabilities that were available on Linux 3.2. I described this approach as [brittle] and a [sandcastle] that would need maintenance as the kernel changed. It was later ported to BPF with kprobes, and has now been rewritten and included in commercial observability products. Unsurprisingly, I've heard it has problems on newer kernels, printing output that doesn't make sense. It really needs an overhaul. When I (or someone) does that, anyone pulling updates from bcc will automatically get the fixed version, with no effort. Those that have rewritten it will need to rewrite theirs. I fear they won't, and customers will be running a broken version of cachestat(8) for years. Note that if BPF had been available on my target environment when I wrote cachestat(8), I would have coded it completely differently. People are porting something written for Linux 3.2 and running it on Linux 5.x.

    In a previous blog post, [An Unbelievable Demo], I talked about how something similar happened many years ago, where old tracing tool versions were used without updates.

    The problems I'm describing are specific to BPF software and kernel tracing. As a different example, my flame graph software has been rewritten over a dozen times, and since it's a simple and finished algorithm I don't see a big problem with that. I prefer people help with the newer [d3 version], but if people do their own it's no big deal. You can code it and it'll work forever. That's not the case with uprobe- and kprobe-based BPF tools, because they do need maintenance.

    ## Think like a sysadmin, not like a programmer

    In summary, start by checking if there's already a BPF agent for your monitoring systems, and if not, build one based on the existing [bcc] or [bpftrace] tools rather than rewriting everything from scratch. This is thinking like a sysadmin who installs and maintains software, and not like a programmer who codes everything. Install the bcc or bpftrace tools, add them to your observability product, and pull package updates as needed. That will be a quick and useful version 1. BPF up and running!

    I see people think like a programmer instead and feel they must start by learning bcc and BPF programming in depth. Then, having discovered everything is C or Python, some rewrite it all in a different language. First, learning bcc and BPF well takes weeks; learning the subtleties and pitfalls of system tracing can take months or years. To give you a taste of what you're in for, check out my [BPF Internals] talk. If you really want to do this and have the time, you certainly can (you'll probably wind up at tracing conferences and bumping into me: see you at Linux Plumber's or the Tracing Summit!). But if you're under some deadline to add BPF observability, try thinking like a sysadmin instead and just build upon the existing tools. That's the fast way. Think like a programmer later, if or when you have the time.

    Second, the BPF software, especially certain kprobe-based tools, requires ongoing maintenance. A tool may work on Linux 5.3 but break on 5.4, as a traced function was renamed or a new code path added. The BPF libraries and frameworks are also changing and evolving, most recently with the BTF and CO-RE support. This is something I hope people consider before choosing to rewrite them: Do you have a plan to rewrite all the updates as well, or will you end up stuck on an old port of the library? It's easier to pull updates of everything than to maintain your own versions.

    Finally, what if you have a great idea for a _better_ BPF library or framework than what we're using in bcc and bpftrace? Talk to us, try it out, innovate. We're at the start of the BPF era and there's lots more to explore. But please understand what exists first and the maintenance burden you are taking on. Your energies may be better spent creating something new, on top of what exists, than porting something old.

    [bcc]: https://github.com/iovisor/bcc
    [bpftrace]: https://github.com/iovisor/bpftrace
    [book]: /bpf-performance-tools-book.html
    [choosing]: /blog/2015-07-08/choosing-a-linux-tracer.html
    [An Unbelievable Demo]: /blog/2021-06-04/an-unbelievable-demo.html
    [d3 version]: https://github.com/spiermar/d3-flame-graph
    [bcc Tutorial]: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
    [brittle]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
    [sandcastle]: https://github.com/brendangregg/perf-tools/blob/master/fs/cachestat
    [BTF and CO-RE]: /blog/2020-11-04/bpf-co-re-btf-libbpf.html
    [Vector]: https://github.com/Netflix/vector
    [eBPF]: https://ebpf.io/
    [BPF Internals]: /blog/2021-06-15/bpf-internals.html
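
    As promised above, here is a quick way to check whether a kernel was built with BTF. This is a minimal sketch, assuming the usual distro location for the kernel config (some kernels expose /proc/config.gz instead):

    # Look for CONFIG_DEBUG_INFO_BTF=y (and CONFIG_DEBUG_INFO_BTF_MODULES=y):
    grep BTF /boot/config-$(uname -r)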

    July 02, 2021 02:00 PM

    June 30, 2021

    Linux Plumbers Conference: Real-time Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Real-time Microconference has been accepted into the 2021 Linux Plumbers Conference. Since 2004, the project that has become known as PREEMPT_RT, formerly the real-time patch, has improved the real-time and low-latency features of the Linux kernel. Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include: mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers, and more. The number of patches that still need integration has been significantly reduced, and the rest are mature enough to make their way into mainline Linux.

    The following accomplishments have been made as a result of last year’s microconference:

    This year’s topics to be discussed include:

    Come and join us in the discussion of controlling what tasks get to run on your machine and when.

    We hope to see you there.

    June 30, 2021 10:08 PM

    June 22, 2021

    Michael Kerrisk (manpages): man-pages-5.12 released

    Alex Colomar and I have released man-pages-5.12. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

    This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release includes more than 300 commits that changed around 180 manual pages.

    The most notable of the changes in man-pages-5.12 are the following:

    Special thanks to Alex, who was once again the largest contributor in this release!

    June 22, 2021 12:48 AM

    June 21, 2021

    Linux Plumbers Conference: Toolchains and Kernel Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Toolchains and Kernel Microconference has been accepted into the 2021 Linux Plumbers Conference. Toolchains are a central part of any development effort, as they create the executables from the code a developer writes. For applications to run efficiently on the operating system, there needs to be a strong understanding of the interface between the application and the kernel it runs on. This microconference is focused on the integration of toolchains and the Linux kernel.

    Since last year’s meet up, the following has been accomplished:

    This year’s topics to be discussed include:

    Come and join us in the discussion of making the toolchains work better with the Linux kernel.

    We hope to see you there.

    June 21, 2021 10:17 PM

    June 18, 2021

    Linux Plumbers Conference: Tracing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Tracing Microconference has been accepted into the 2021 Linux Plumbers Conference. Tracing in the Linux kernel is constantly improving. Tracing was officially added to Linux in 2008. Since then, more tooling has been constantly added to help out with visibility. The work is still ongoing, with Perf, ftrace, LTTng, and eBPF. User space tooling is expanding, and as the kernel gets more complex, so does the need for visibility into what is going on under the hood.

    Since the last tracing meetup at Linux Plumbers in 2019, a few accomplishments have come out of it:

    This year’s topics to be discussed include:

    Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

    We hope to see you there!

    June 18, 2021 09:15 PM

    June 14, 2021

    Brendan Gregg: USENIX LISA2021 BPF Internals (eBPF)

    For USENIX LISA2021 I gave a 40 minute deep dive talk on BPF internals for Linux, focusing on observability tracing tools. Since there are already BPF internals references online (listed in this post), I used the opportunity to create some new content, showing how bpftrace instrumentation works from user space down to machine code. I break it down into all the small components involved, where you'll find it's actually quite easy. The video is on [youtube]:

    The slides are on [slideshare] or as a [PDF]:
    Thanks to USENIX LISA for not only hosting this talk, but also for suggesting it. Internals talks can feel like they don't have strong take-aways, so I usually share that content in websites and books instead where people can browse as needed. But other USENIX events have had success with these "Core Principles" topics, so I gave it a try this time. How do you like it? As this is content that otherwise wouldn't exist without USENIX's help, my thanks to everyone who supports USENIX. Links from my references slide: - https://events.static.linuxfound.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf - Linux include/uapi/linux/bpf_common.h - Linux include/uapi/linux/bpf.h - Linux include/uapi/linux/filter.h - https://docs.cilium.io/en/v1.9/bpf/#bpf-guide - BPF Performance Tools, Addison-Wesley 2020 - https://ebpf.io/what-is-ebpf - http://www.brendangregg.com/ebpf.html - https://github.com/iovisor/bcc - https://github.com/iovisor/bpftrace Capabilities continue to be added to BPF, so to stay current you will need to keep an eye on updates to the Linux header files listed above. For high-frequency updates you can also subscribe to the [bpf-next] mailing list, or for low-frequency summaries search for "BPF" in the [KernelNewbies summaries]. There is also a substantially different implementation of BPF internals that I didn't cover at all in this talk: [eBPF on Windows] by Microsoft, only recently made public. In other BPF news, I just found out that my Addison-Wesley [BPF Performance Tools] book is in a snap 5-day sale [until June 19]. [youtube]: https://www.youtube.com/watch?v=_5Z2AU7QTH4 [slideshare]: https://www.slideshare.net/brendangregg/bpf-internals-ebpf [PDF]: /Slides/LISA2021_BPF_Internals.pdf [BPF Performance Tools]: /bpf-performance-tools-book.html [until June 19]: https://twitter.com/InformIT/status/1404569134042603520 [bpf-next]: http://vger.kernel.org/vger-lists.html#bpf [KernelNewbies summaries]: https://kernelnewbies.org/LinuxVersions [eBPF on Windows]: https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows/

    June 14, 2021 02:00 PM

    Linux Plumbers Conference: IoThree’s Company Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the IoThree’s Company Microconference has been accepted into the 2021 Linux Plumbers Conference. As everyday devices become more connected to the internet, the infrastructure around them constantly needs to be developed. Linux is showing up more in products that are not normally considered to be computers, but now need to interact with a central location (cloud). This brings new challenges that need to be addressed.

    Last year’s meetup produced the following:

    This year’s topics to be discussed include:

    Come and join us in some heated but productive discussions in making your everyday devices communicate with the world around them.

    We hope to see you there.

    June 14, 2021 12:35 AM

    June 04, 2021

    Matthew Garrett: Mike Lindell's Cyber "Evidence"

    Mike Lindell, notable for absolutely nothing relevant in this field, today filed a lawsuit against a couple of voting machine manufacturers in response to them suing him for defamation after he claimed that they were covering up hacks that had altered the course of the US election. Paragraph 104 of his suit asserts that he has evidence of at least 20 documented hacks, including the number of votes that were changed. The citation is just a link to a video called Absolute 9-0, which claims to present sufficient evidence that the US supreme court will come to a 9-0 decision that the election was tampered with.

    The claim is that Lindell was provided with a set of files on the 9th of January, and gave these to some cyber experts to verify. These experts identified them as packet captures. The video contains scrolling hex, and we are told that this is the raw encrypted data from the files. In reality, the hex values correspond very clearly to printable ASCII, and appear to just be the Pennsylvania voter roll. They're not encrypted, and they're not packet captures (they contain no packet headers).

    20 of these packet captures were then selected and analysed, giving us the tables contained within Exhibit 12. The alleged source IPs appear to correspond to the networks the tables claim, and the latitude and longitude presumably just come from a geoip lookup of some sort (although clearly those values are far too precise to be accurate). But if we look at the target IPs, we find something interesting. Most of them resolve to the website for the county that was the nominal target (eg, 198.108.253.104 is www.deltacountymi.org). So, we're supposed to believe that in many cases, the county voting infrastructure was hosted on the county website.

    Unfortunately we're not given the destination port, but 198.108.253.104 isn't listening on anything other than 80 and 443. We're told that the packet data is encrypted, so presumably it's over HTTPS. So, uh, how did they decrypt this to figure out how many votes were switched? If Mike's hackers have broken TLS, they really don't need to be dealing with this.

    We're also given some background information on how it's impossible to reconstruct packet captures after the fact (untrue), or that modifying them would change their hashes (true, but in the absence of known good hash values that tells us nothing), but it's pretty clear that nothing we're shown actually demonstrates what we're told it does.

    In summary: yes, any supreme court decision on this would be 9-0, just not the way he's hoping for.

    Update: It was pointed out that this data appears to be part of a larger dataset. This one is even more dubious - it somehow has MAC addresses for both the source and destination (which is impossible), and almost none of these addresses are in actual issued ranges.

    comment count unavailable comments

    June 04, 2021 05:49 AM

    Linux Plumbers Conference: Performance and Scalability Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Performance and Scalability Microconference has been accepted into the 2021 Linux Plumbers Conference.

    All parts of the Linux ecosystem, kernel and userspace, should account for performance and scalability. The purpose of this microconference is for developers from different projects to meet and collaborate, as the entire stack must perform well for the user to see good results. Because performance and scalability are very generic topics, this microconference focuses on issues that may also be addressed in other, more specific sessions.

    The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.

    Here are some of the outcomes from the last time the event was held in 2018:

    This year’s topics tentatively include:

    Come and join us in the discussion of improving performance and scalability of your system.

    We hope to see you there.

    June 04, 2021 01:12 AM

    June 03, 2021

    Brendan Gregg: An Unbelievable Demo

    This is the story of the most unbelievable demo I've been given in the world of open source. You can't make this stuff up. It was 2005, and I felt like I was in the eye of a hurricane. I was an independent performance consultant and Sun Microsystems had just released DTrace, a tool that could instrument all software. This gave performance analysts like myself X-ray vision. While I was busy writing and publishing advanced performance tools using DTrace (my open source [DTraceToolkit] and other [DTrace tools], aka scripts), I noticed something odd: I was producing more DTrace tools than were coming out of Sun itself. Perhaps there was some internal project that was consuming all their DTrace expertise?


    DTraceToolkit v0.96 tools (2006)

    As I wasn't a Sun Microsystems employee I wasn't privy to Sun's internal projects. However, I was doing training and consulting for Sun, helping their customers with system administration and performance. Sun sometimes invited me to their own customer meetings and other events I might be interested in, as a local expert. I was living in Sydney, Australia. This time I was told that there was a Very Important Person visiting from the US whom I'd want to meet. I didn't recognize the name, but was told that he was a DTrace expert and developer at Sun, and was on a world tour demonstrating Sun's new DTrace-based product. Ah-hah – this must be the internal project! But this would be no ordinary project. I'd seen some amazing technologies from Sun, but I'd never seen a developer on a world tour. This was going to be big, and would likely blow away my earlier DTrace work. The VIP was returning to Sydney for a few days before going to the next Australian city, so we agreed to meet at the Sun Sydney office.

    ## The Meeting

    The DTrace expert arrived wearing casual business attire and a heavy American accent, and seemed a bit weary from his world tour. He had just been to South Africa and New Zealand, and listed other countries and cities he was heading to next. Two other Australian Sun staff joined the meeting, and one introduced me with: "Brendan teaches some classes for us, and has been doing some DTrace stuff." Low-key introductions are the norm in Australia (especially for Australians) and I wondered whether he knew of this cultural difference. Another difference was that there were few roles in Australia for engineers in 2005, unlike the US. The Sun Microsystems Australia jobs, for example, were all in support and none in development, and other tech giants had not yet arrived. So back then in Australia you could find amazing engineers doing whatever roles were available. I tried to expand on the "stuff" a bit by saying that I'd written the DTraceToolkit, but he wasn't impressed. He didn't recognize my name, nor had he heard of the DTraceToolkit. To him, I was just some random guy.

    He was kind enough to give me a quick demo anyway. His DTrace product was an add-on for a larger Sun GUI that I was already familiar with. After it loaded, he showed how you could run one of several DTrace tools by double clicking an icon. Either the raw output would be printed in a separate window, or the results would be shown as a line graph. This seemed __quite underwhelming__. The GUI already had this functionality: Showing the raw output of tools or drawing a line graph. I was hoping for a new GUI feature. The only new work was the tools themselves, of which there were several. He gave a quick sales pitch about the new and amazing observability they provided, something he must have said many times to impress customers. I got the feeling he wasn't expecting me to properly appreciate their value. But I _did_ understand these tools, since I had coded similar functionality for my own DTraceToolkit. They were useful, but...I was expecting a hurricane of awesome _new_ DTrace content.

    "I've done these before – I've written tools that do these things myself!"

    "Yeah, sure." He didn't quite say it, but gave me a look like he didn't really believe me, or that I could even truly understand what they were. This was an important innovation by Sun Microsystems, a US-based multinational company worth billions. I was just some random Aussie.

    ## Socket Tracing

    I browsed the GUI icons for something new, and the closest was a tool for tracing socket I/O. I had tried this in 2004 ([socketsnoop.d]) and published it as open source, but my tool was incomplete: I didn't have access to the kernel source code, so I had to figure out everything the hard way using black box analysis. It worked for most TCP traffic types but not others, which I warned about in the script comments. I'd also not included it in the DTraceToolkit yet as I didn't consider it finished. So of all the tools he had, I was most interested to see this one. Sun could do a much better job just by referring to the source code they were instrumenting, and actually finish this tool.

    "Can I see the socket I/O script?" I fired up a terminal. He looked alarmed at first, as if I wasn't supposed to look behind the curtain, then realized another selling feature: "Well, sure, you could even add more tools to the GUI!" and after a pause, added "if you have them". Sure, I have them all right.

    He gave me a path to start looking under, and after a bit of searching I found the directory with all the tools he had been demoing. The tools all had familiar names. One was even called socketsnoop.d. A new possibility dawned on me. No way.

    I printed socketsnoop.d. The screen filled with _my own script_. It was the same incomplete attempt I had hacked up a year earlier, and published as open source. It included some weird code that only made sense when I wrote it (use of PFORMAT, prior to defaultargs) and was written in my earlier coding style. I was looking at _my own fucking script_.

    "This is MY script."

    I printed the other tools and saw the same – they were _all mine_. This hot new Sun product that Mr. VIP was touring the world showing off was actually just my own open source tools. My jaw was on the floor. He didn't seem to believe me.

    ## You Can't Do That

    I used grep to search all his tools for my name, which was in the header comment of all my tools, to prove beyond a doubt that these were mine. But I found nothing. My name had been stripped. Some of my tools had even included the line:
        # Author: Brendan Gregg  [Sydney, Australia]
    
    And now, here he was, in Sydney, Australia, trying to sell Brendan Gregg's tools to Brendan Gregg.

    One of the Australian Sun staff interrupted: "Those say copyright Sun Microsystems." Most of my tools had my own copyright and a GPLv2 or CDDL license. But these only had Sun's standard copyright message, and the open source licenses had been stripped.

    "You deleted my name! And the copyrights and licenses!"

    The other Aussie added, to the VIP: "You can't do that."

    A silence fell over the room as the magnitude of what had happened sunk in. While some at Sun were encouraging open source contributions and building a community, others were ripping off that same community. Taking their work, changing the license and copyrights, and then selling it.

    The VIP wasn't prepared for this and had a look of confusion. He didn't say much, other than that he didn't know what had happened, and that he may have gotten the tools from someone else already like this (ie, don't blame me). He seemed to be only half believing what we were saying. The meeting ended quickly.

    I suggested that he get newer copies of my tools, directly from the DTraceToolkit, since these older versions from my homepage were out of date, and some had errors that I had already fixed. I also reminded him to keep my name, copyright, and license on all of them. In his defense, perhaps the meeting may have gone differently had I not been given a low-key Australian introduction. That's an Australian cultural problem (tall poppy syndrome). To an Australian, introductions in the US can sound boastful, but they can also be useful as a quick way to share one's specialties.

    ## Other Cases

    Of all the tools I had published as open source, I still can't believe socketsnoop.d was included. It wasn't even very good. Later on I wrote much better socket tools (in my [DTrace] and [BPF] books).

    A few years later, Apple added dozens of my tools to OS X. They left my name, copyright, and CDDL open source license intact, and even improved and enhanced some of them. Years later, Oracle did the same for Oracle Solaris 11, and the BSD community did for FreeBSD. My thanks to all of you.

    You might say that this wasn't really Sun the company doing this, but rather, a careless individual. But there was something in Sun's culture that contributed to this kind of carelessness. It was something I and my consulting colleagues had run into before: The belief at Sun that only Sun could make good use of its own technologies, and anything created outside of Sun was trash. When these Sun employees found something that was good, they were inclined to assume it came from Sun, and it was therefore safe to reuse and rebrand (and relicense) as they assumed they already held the copyrights.

    There were also others at Sun that did try hard to do the right thing by me and my work. On at least four other occasions my DTraceToolkit was built into observability products, without stripping licenses. (In one case they wanted to relicense to GPL, and talked to me and Sun legal about it, but that's another story.) This also wasn't the last time someone unwittingly tried to sell me my own work, it was just the first. I've learned not to tell sales people that I invented what they are showing me, as they then give me funny looks like I'm a crazy person, but instead to simply say "I have a lot of experience with that technology" and leave it at that.

    I'm reminded of this first case since my BPF tools are now appearing in observability products, and will grow to a scale much bigger than my DTrace tools. I'll write about it more in future posts, but my immediate advice to developers is this: Please do not rewrite my BPF tools and the bcc libraries; try to build upon them as-is (either bcc Python or bcc libbpf-tool versions) and fetch regular updates. This is because they are works-in-progress, and rewriting (forking) them divides engineering resources and will have your customers using out of date versions. (Note that I think my flame graph software is different: Since it is a simple and finished algorithm that doesn't need much maintenance, I don't see a big problem with people rewriting it. It is nice to get some thanks, however, just as I have done for those that inspired flame graphs.)

    As for the unbelievable demo: This wasn't the great DTrace product I imagined when hearing about a world tour. It was, in fact, my own tools. I suspect that it's not uncommon for an open source developer to discover, at some point, that their own code has been rebranded. But the circumstance in this case may be a little unusual. A US developer got a world tour for software he didn't write, which included giving a sales pitch and demo in Australia, unwittingly, to the author. I don't think he even said thank you.

    [socketsnoop.d]: http://www.brendangregg.com/DTrace/socketsnoop.d
    [DTrace]: /dtrace.html
    [BPF]: /bpf-performance-tools-book.html
    [DTraceToolkit]: /dtracetoolkit.html
    [DTrace tools]: /dtrace.html

    June 03, 2021 02:00 PM

    June 02, 2021

    Matthew Garrett: Producing a trustworthy x86-based Linux appliance

    Let's say you're building some form of appliance on top of general purpose x86 hardware. You want to be able to verify the software it's running hasn't been tampered with. What's the best approach with existing technology?

    Let's split this into two separate problems. The first is to do as much as we can to ensure that the software can't be modified without our consent[1]. This requires that each component in the boot chain verify that the next component is legitimate. We call the first component in this chain the root of trust, and in the x86 world this is the system firmware[2]. This firmware is responsible for verifying the bootloader, and the easiest way to do this on x86 is to use UEFI Secure Boot. In this setup the firmware contains a set of trusted signing certificates and will only boot executables with a chain of trust to one of these certificates. Switching the system into setup mode from the firmware menu will allow you to remove the existing keys and install new ones.

    (Note: You shouldn't use the trusted certificate directly for signing bootloaders - instead, the trusted certificate should be used to sign another certificate and the key for that certificate used to sign your bootloader. This way, if you ever need to revoke the signing certificate, you can simply sign a new one with the trusted parent and push out a revocation update instead of having to provision new keys)
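
    A sketch of that two-level hierarchy using the Python cryptography package (names, key sizes, and lifetimes here are illustrative assumptions; actual bootloader signing would be done with tooling such as sbsign):

        import datetime
        from cryptography import x509
        from cryptography.x509.oid import NameOID
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa

        def name(cn):
            return x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, cn)])

        def make_cert(subject, issuer, pubkey, signing_key, days):
            now = datetime.datetime.utcnow()
            return (x509.CertificateBuilder()
                    .subject_name(name(subject))
                    .issuer_name(name(issuer))
                    .public_key(pubkey)
                    .serial_number(x509.random_serial_number())
                    .not_valid_before(now)
                    .not_valid_after(now + datetime.timedelta(days=days))
                    .sign(signing_key, hashes.SHA256()))

        # Long-lived trusted certificate, enrolled in the firmware's db.
        root_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        root = make_cert("Appliance Root", "Appliance Root",
                         root_key.public_key(), root_key, days=3650)

        # Shorter-lived certificate that actually signs boot images; it can
        # be revoked and replaced without re-provisioning the firmware db.
        boot_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        boot = make_cert("Boot Signing", "Appliance Root",
                         boot_key.public_key(), root_key, days=365)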

    But what do you want to sign? In the general purpose Linux world, we use an intermediate bootloader called Shim to bridge from the Microsoft signing authority to a distribution one. Shim then verifies the signature on grub, and grub in turn verifies the signature on the kernel. This is a large body of code that exists because of the use cases that general purpose distributions need to support - primarily, booting on arbitrary off-the-shelf hardware, and allowing arbitrary and complicated boot setups. This is unnecessary in the appliance case, where the hardware target can be well defined, where there's no need for interoperability with the Microsoft signing authority, and where the boot configuration can be extremely static.

    We can skip all of this complexity using systemd-boot's unified Linux image support. This has the format described here, but the short version is that it's simply a kernel and initramfs linked into a small EFI executable that will run them. Instructions for generating such an image are here, and if you follow them you'll end up with a single static image that can be directly executed by the firmware. Signing this avoids dealing with a whole host of problems associated with relying on shim and grub, but note that you'll be embedding the initramfs as well. Again, this should be fine for appliance use-cases, but you'll need your build system to support building the initramfs at image creation time rather than relying on it being generated on the host.

    At this point we have a single image that can be verified by the firmware and will get us to the point of a running kernel and initramfs. Unless you've got enough RAM that you can put your entire workload in the initramfs, you're going to want a filesystem as well, and you're going to want to verify that that filesystem hasn't been tampered with. The easiest approach to this is to use dm-verity, a device-mapper layer that uses a hash tree to verify that the filesystem contents haven't been modified. The kernel needs to know what the root hash is, so this can either be embedded into your initramfs image or into the kernel command line. Either way, it'll end up in the signed boot image, so nobody will be able to tamper with it.
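
    dm-verity's hash tree is conceptually simple: hash every data block, then hash the concatenated hashes, and repeat until a single root hash remains. A toy sketch of the idea (the real on-disk format adds a salt and fixed layouts, and is produced by veritysetup format; this assumes non-empty data):

        import hashlib

        BLOCK = 4096

        def root_hash(data: bytes) -> bytes:
            # Level 0: one hash per data block.
            level = [hashlib.sha256(data[i:i + BLOCK]).digest()
                     for i in range(0, len(data), BLOCK)]
            per_node = BLOCK // hashlib.sha256().digest_size  # hashes per tree node
            # Hash groups of child hashes until a single root remains.
            while len(level) > 1:
                level = [hashlib.sha256(b"".join(level[i:i + per_node])).digest()
                         for i in range(0, len(level), per_node)]
            return level[0]

    Changing any byte of the partition changes the root hash, and since the root hash travels inside the signed boot image, the tampering is detected when the affected block is read.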

    It's important to note that a dm-verity partition is read-only - the kernel doesn't have the cryptographic secret that would be required to generate a new hash tree if the partition is modified. So if you require the ability to write data or logs anywhere, you'll need to add a new partition for that. If this partition is unencrypted, an attacker with access to the device will be able to put whatever they want on there. You should treat any data you read from there as untrusted, and ensure that it's validated before use (ie, don't just feed it to a random parser written in C and expect that everything's going to be ok). On the other hand, if it's encrypted, remember that you can't just put the encryption key in the boot image - an attacker with access to the device is going to be able to dump that and extract it. You'll probably want to use a TPM-sealed encryption secret, which will be discussed later on.

    At this point everything in the boot process is cryptographically verified, and so should be difficult to tamper with. Unfortunately this isn't really sufficient - on x86 systems there's typically no verification of the integrity of the secure boot database. An attacker with physical access to the system could attach a programmer directly to the firmware flash and rewrite the secure boot database to include keys they control. They could then replace the boot image with one that they've signed, and the machine would happily boot code that the attacker controlled. We need to be able to demonstrate that the system booted using the correct secure boot keys, and the only way we can do that is to use the TPM.

    I wrote an introduction to TPMs a while back. The important thing to know here is that the TPM contains a set of Platform Configuration Registers that are large enough to contain a cryptographic hash. During boot, each component of the boot process will generate a "measurement" of other security critical components, including the next component to be booted. These measurements are a representation of the data in question - they may simply be a hash of the object being measured, or the hash of a structure containing various pieces of metadata. Each measurement is passed to the TPM, along with the PCR it should be measured into. The TPM takes the new measurement, appends it to the existing value, and then stores the hash of this concatenated data in the PCR. This means that the final PCR value depends not only on the measurement, but also on every previous measurement. Without breaking the hash algorithm, there's no way to set the PCR to an arbitrary value. The hash values and some associated data are stored in a log that's kept in system RAM, which we'll come back to later.
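
    The extend operation itself is tiny: the new PCR value is the hash of the old value concatenated with the measurement. A sketch of the semantics (assuming the SHA-256 PCR bank; the event names are made up):

        import hashlib

        def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
            # H(old PCR || measurement): order matters, so the final value
            # encodes the entire sequence of measurements.
            return hashlib.sha256(pcr + measurement).digest()

        pcr = bytes(32)  # PCRs start zeroed
        for event in (b"secure boot config", b"db certificates"):
            pcr = pcr_extend(pcr, hashlib.sha256(event).digest())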

    Different PCRs store different pieces of information, but the one that's most interesting to us is PCR 7. Its use is documented in the TCG PC Client Platform Firmware Profile (section 3.3.4.8), but the short version is that the firmware will measure the secure boot keys that are used to boot the system. If the secure boot keys are altered (such as by an attacker flashing new ones), the PCR 7 value will change.

    What can we do with this? There's a couple of choices. For devices that are online, we can perform remote attestation, a process where the device can provide a signed copy of the PCR values to another system. If the system also provides a copy of the TPM event log, the individual events in the log can be replayed in the same way that the TPM would use to calculate the PCR values, and then compared to the actual PCR values. If they match, that implies that the log values are correct, and we can then analyse individual log entries to make assumptions about system state. If a device has been tampered with, the PCR 7 values and associated log entries won't match the expected values, and we can detect the tampering.
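
    The verifier's side of that replay is just folding the extend operation over the event log and comparing the result with the quoted value; a sketch, assuming the quote's signature has already been checked:

        import hashlib

        def replay(log_digests) -> bytes:
            pcr = bytes(32)
            for digest in log_digests:
                pcr = hashlib.sha256(pcr + digest).digest()
            return pcr

        def log_is_consistent(quoted_pcr: bytes, log_digests) -> bool:
            # A match implies the log faithfully describes what was
            # measured; the individual events can then be inspected.
            return replay(log_digests) == quoted_pcr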

    If a device is offline, or if there's a need to permit local verification of the device state, we still have options. First, we can perform remote attestation to a local device. I demonstrated doing this over Bluetooth at LCA back in 2020. Alternatively, we can take advantage of other TPM features. TPMs can be configured to store secrets or keys in a way that renders them inaccessible unless a chosen set of PCRs have specific values. This is used in tpm2-totp, which uses a secret stored in the TPM to generate a TOTP value. If the same secret is enrolled in any standard TOTP app, the value generated by the machine can be compared to the value in the app. If they match, the PCR values the secret was sealed to are unmodified. If they don't, or if no numbers are generated at all, that demonstrates that PCR 7 is no longer the same value, and that the system has been tampered with.
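
    TOTP itself is small enough to sketch: both sides derive a short code from the shared secret and the current time, so matching codes imply the TPM was still willing to release the secret (ie, PCR 7 is unchanged). A standard-library sketch of RFC 6238 with common defaults:

        import hashlib, hmac, struct, time

        def totp(secret: bytes, now=None, step=30, digits=6) -> str:
            counter = struct.pack(">Q", int((now or time.time()) // step))
            mac = hmac.new(secret, counter, hashlib.sha1).digest()
            offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226)
            code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
            return str(code % 10 ** digits).zfill(digits)

    The device's tpm2-totp output and the phone app should show the same digits for the same 30-second window.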

    Unfortunately, TOTP requires that both sides have possession of the same secret. This is fine when a user is making that association themselves, but works less well if you need some way to ship the secret on a machine and then separately ship the secret to a user. If the user can simply download the secret via some API, so can an attacker. If an attacker has the secret, they can modify the secure boot database and re-seal the secret to the new PCR 7 value. That means having to add some form of authentication, along with a strong binding of machine serial number to a user (in order to avoid someone with valid credentials simply downloading all the secrets).

    Instead, we probably want some mechanism that uses asymmetric cryptography. A keypair can be generated on the TPM, which will refuse to release an unencrypted copy of the private key. The public key, however, can be exported and stored. If it's acceptable for a verification app to connect to the internet then the public key can simply be obtained that way - if not, a certificate can be issued to the key, and this exposed to the verifier via a QR code. The app then verifies that the certificate is signed by the vendor, and if so extracts the public key from that. The private key can have an associated policy that only permits its use when PCR 7 has an appropriate value, so the app then generates a nonce and asks the user to type that into the device. The device generates a signature over that nonce and displays that as a QR code. The app verifies the signature matches, and can then assert that PCR 7 has the expected value.
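
    The verification half of that flow is plain signature checking. A sketch with the Python cryptography package (in the real flow the private key never leaves the TPM and signing only succeeds while PCR 7 satisfies the key's policy; the key type here is my assumption):

        from cryptography.exceptions import InvalidSignature
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import ec

        # Stand-in for the TPM-resident key; only the public half is exported.
        device_key = ec.generate_private_key(ec.SECP256R1())
        public_key = device_key.public_key()  # shipped via API or QR-encoded cert

        nonce = b"verifier-chosen-nonce"      # typed into the device
        signature = device_key.sign(nonce, ec.ECDSA(hashes.SHA256()))  # device side

        try:  # app side
            public_key.verify(signature, nonce, ec.ECDSA(hashes.SHA256()))
            print("PCR 7 policy satisfied: boot state as expected")
        except InvalidSignature:
            print("signature invalid: device state cannot be attested")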

    Once we can assert that PCR 7 has the expected value, we can assert that the system booted something signed by us and thus infer that the rest of the boot chain is also secure. But this is still dependent on the TPM obtaining trustworthy information, and unfortunately the bus that the TPM sits on isn't really terribly secure (TPM Genie is an example of an interposer for i2c-connected TPMs, but there's no reason an LPC one can't be constructed to attack the sort usually used on PCs). TPMs do support encrypted communication channels, but bootstrapping those isn't straightforward without firmware support. The easiest way around this is to make use of a firmware-based TPM, where the TPM is implemented in software running on an ancillary controller. Intel's solution is part of their Platform Trust Technology and runs on the Management Engine, AMD run it on the Platform Security Processor. In both cases it's not terribly feasible to intercept the communications, so we avoid this attack. The downside is that we're then placing more trust in components that are running much more code than a TPM would and which have a correspondingly larger attack surface. Which is preferable is going to depend on your threat model.

    Most of this should be achievable using Yocto, which now has support for dm-verity built in. It's almost certainly going to be easier using this than trying to build on top of a general purpose distribution. I'd love to see this become a largely push-button "receive secure image" process, so I might have a go at that if I have some free time in the near future.

    [1] Obviously technologies that can be used to ensure nobody other than me is able to modify the software on devices I own can also be used to ensure that nobody other than the manufacturer is able to modify the software on devices that they sell to third parties. There's no real technological solution to this problem, but we shouldn't allow the fact that a technology can be used in ways that are hostile to user freedom to cause us to reject that technology outright.
    [2] This is slightly complicated due to the interactions with the Management Engine (on Intel) or the Platform Security Processor (on AMD). Here's a good writeup on the Intel side of things.

    comment count unavailable comments

    June 02, 2021 04:36 PM

    May 28, 2021

    Brendan Gregg: Moving my US tech job to Australia

    I've moved from the San Francisco Bay Area to Sydney, Australia, where I will continue the best job so far of my career: Performance engineering at Netflix. I'm grateful for the support of Netflix engineering management, Netflix HRBPs, and others for helping to make this happen. While my move is among the first from the Linux cloud teams, Netflix has had staff in Australia for years (for content, marketing, and the FreeBSD OCA). It's been a privilege and an adventure to work in Silicon Valley with so many amazing people. But I'm now excited about my new adventure: Doing an advanced tech role remotely from Australia.

    I know others who have also left the Bay Area or are planning to. Back in 2015 we'd have BPF (iovisor) meetups in Santa Clara and most contributors would be there in person, with some having travelled. Now we're more scattered, either to other US cities or worldwide. As another indicator of tech moving elsewhere, last year brought the [headline]: "Bay Area's share of VC deals predicted to fall below 20% for first time in 2021."

    Day to day things won't be much different. I'm still online, doing the same work, answering the same emails. And many of us expect (when travel is possible) to make regular visits to the US for company-wide meetings and events. I think some coworkers will still see me occasionally in the US office and won't even realize I've moved.

    Why Australia?

    When I told people I was moving to Australia they'd guess why: "Is it because of X? Or Y? ... or Z?" Well, the answer is yes, all of the above. I began discussing Australian tech roles with different companies in Jan 2020. The pandemic then added another reason to move. Both the US and Australia have their pros and cons, and I have many favorite places and people in both (sorry I didn't come say goodbye: We'll meet again). But in the end I'm a proud Australian and I do prefer Australia for various reasons, many of which Deirdré wrote about in [Why move to Australia?]. Additional reasons for me included visa uncertainty (and the abuse it leads to), voting rights, and complex international taxation. (Disclaimer: Netflix is an exception, as they have been great with visa workers including myself.)

    Another reason is that the tech market became stronger in Australia. I moved to the US in 2006 as there were many more opportunities there, especially in kernel engineering and performance. Now, in 2021, Australia has a thriving tech market. Sydney has AWS and Google offices and even a small Netflix office, just to name a few. There is also a wider variety of roles available. If you want to do kernel engineering work you no longer need to move to California to work for Sun Microsystems in the MPK17 building. You can work on Linux anywhere.

    Linux is Already Remote

    Linux has been described as the world's most successful open source project, and it's all engineers working remotely. There's no Linux kernel headquarters where all the engineers sit in an open office layout, typing furiously then dashing for the break room coffee during kernel builds, and where maintainers can yell across the room at someone for their bad patch (when it's Linus yelling, everyone takes off their headphones to listen). That doesn't happen. Engineers are remote, and may only meet once or twice a year at Linux kernel conferences. And it's worked very well for years.

    Another example of remote work I've already done is book writing. Last year I published [Systems Performance 2nd Edition], which I wrote from my home office with help from remote contributors. The entire project was run via emails, a Google drive, and Google docs, and was delivered to the publisher on time.

    Making it Work

    While tech workers are well suited for remote work (savvy with communications technologies) there are benefits to office work, and I don't think remote work is for everyone. (One benefit I'll miss is playing in the Netflix cricket team.) In the future I'd expect hybrid teams, where the remote workers visit the office on a regular cadence (e.g., once a quarter) for meetings. This is a model that's already been successfully used by some teams, including at Netflix.

    As for work hours, I set my own schedule where I start around 7am, giving between 3 and 5 hours overlap with California time (depending on daylight savings). About once a month I'll have an early morning meeting (e.g., 4am). Back when I did [SRE oncall] for Netflix I'd have more wakeups at unpredictable times, so this feels easier to manage. (I also had prior jobs in the Bay Area where I'd be in the office most days past midnight, so compared to that this is like a health retreat!) As more people move to other timezones I think this will improve further. Some meetings may move to an asynchronous format, and others may be run twice for world coverage, at 9am and 4pm California time.

    To work remote I think you have to really want it and be willing to put in extra effort, including doing the occasional early meeting. Personally, I use a stopwatch to help me stay productive: I pause it whenever I have an interruption, and measure how many hours of uninterrupted work I get done each day, log it, and then plot it on graphs to see the trends. Yes, I'm performance analyzing myself. It's been a slow process, but I've been figuring out how to become more productive each day. It's really satisfying to finish a full day's work and then realize I'm no longer in the Bay Area, but instead have a two minute walk to the beach. It's just one of many reasons to put in that extra effort.

    [Why move to Australia?]: http://www.beginningwithi.com/2020/12/01/why-move-to-australia/
    [headline]: https://www.bizjournals.com/sanjose/news/2020/12/14/bay-area-vc-deal-share-predicted-to-fall-below-20.html
    [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html
    [SRE oncall]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html

    May 28, 2021 02:00 PM

    May 23, 2021

    David Sterba: Authenticated hashes for btrfs (part 1)

    There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun, but there's always more to it, even if it sounds easy to implement.

    Johannes T., at that time at SUSE, sent the patchset adding support for SHA256 [1], along with a Labs conference paper summarizing existing solutions and giving details about the proposed implementation and use cases.

    The first version of the patchset got some feedback; issues were found and some ideas suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes. Until now I've assumed you know what an authenticated hash means, but for clarity and in simple terms: it's a checksum that depends on a key. The main point is that it's impossible to generate the same checksum for given data without knowing the key, where impossible is meant in the cryptographic sense: there's a near-zero probability of doing it by chance, and a brute-force attack is not practical.
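
    As a minimal illustration of the idea, here's a keyed checksum using HMAC-SHA256 from the Python standard library (HMAC is one standard construction of a keyed hash; the exact construction used by the patchset may differ):

        import hashlib, hmac

        key = b"secret key, eg. from the kernel keyring"
        data = b"filesystem block contents"

        # The checksum depends on both the data and the key.
        csum = hmac.new(key, data, hashlib.sha256).digest()

        # Verification recomputes the checksum with the same key; without
        # the key, forging a value for modified data is infeasible.
        ok = hmac.compare_digest(
            csum, hmac.new(key, data, hashlib.sha256).digest())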

    Auth hash, fsverity

    A notable existing solution is fsverity, which works in a read-only fashion: the key is securely hidden and used only to verify that data read from the media haven't been tampered with. A typical use case is an OS image in your phone. But that's not all. OS images appear in all sorts of boxed devices and IoT. Nowadays, with the explosion of edge computing, assuring the integrity of the end devices is a fundamental requirement.

    Where btrfs can add some value is read AND write support with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, then use something else, this is not your use case, or maybe there's another way to make sure the key won't be compromised easily. That is beyond the scope of what a filesystem can do, though.

    As an example use case of a writable filesystem with an authenticated hash: detecting outside tampering with on-disk data, eg. while the filesystem was unmounted. Filesystem metadata formats are public, and interesting data can be located by patterns on the device, so changing a few bytes and updating the checksum(s) is not hard.

    There’s one issue that was brought up and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify a basic integrity of the data. Ie. without the key it’s not possible to say if the data are valid as if a basic checksum was used. This might be still useful for a read-only access to the filesystem, but absence of key makes this impossible.

    Existing implementations

    As was noted in the LWN discussion [2], ZFS uses two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It's said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. The hashes are stored by simply trimming each to its first 16 bytes and storing them consecutively. As both hashes are cryptographically strong, the first 16 bytes should provide enough strength despite the truncation. (16 bytes being 128 bits.)
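
    In code, the described split would look something like this (a sketch of the layout only, taking the comment's description of ZFS at face value; HMAC-SHA512 stands in here for the authenticated hash):

        import hashlib, hmac

        HALF = 16  # bytes kept from each hash; 2 x 16 fills a 32-byte slot

        def split_csum(key: bytes, data: bytes) -> bytes:
            auth = hmac.new(key, data, hashlib.sha512).digest()  # authenticated
            plain = hashlib.sha256(data).digest()                # plain
            return auth[:HALF] + plain[:HALF]

    Without the key, only the plain half can be recomputed and checked, which is exactly the dependency issue discussed above.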

    When I was thinking about that, I had a different idea of how to do it. Not that copying the scheme would not work for btrfs: anything the linux kernel crypto API provides is usable, so the same is achievable. I'm not judging the decisions about which hashes to use or how to do the split; it works and I don't see a problem in the strength. Where I see potential for an improvement is performance, without sacrificing strength too much. Trade-offs.

    The CPU (software) implementation of SHA256 is comparatively slower than checksums with hardware aids (like the CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so we now compete against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better. But the support is not as widespread as eg. for CRC32C. Anyway, there's always choice and hardware improves over time. The number of hashes may seem to explode, but as long as it's manageable inside the filesystem, we take it. And a coffee, please.
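
    To get a rough feel for the software-only differences from userspace (an illustration only: results vary by CPU, Python overhead is included, and zlib's CRC32 is not the CRC32C the kernel uses):

        import hashlib, timeit, zlib

        data = bytes(4096)  # one filesystem block
        for name, fn in [
            ("sha256",  lambda: hashlib.sha256(data).digest()),
            ("blake2b", lambda: hashlib.blake2b(data).digest()),
            ("crc32",   lambda: zlib.crc32(data)),  # plain CRC32, not CRC32C
        ]:
            print(name, timeit.timeit(fn, number=100_000), "s per 100k blocks")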

    Secondary hash

    The proposed checksum scheme is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is a given. There are two of them and that's fine. I'm not drawing an exact parallel with ZFS; the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. It's the non-cryptographic hash that can be debated. I want to call it the secondary hash, with the obvious meaning that it's not too important by default and comes second when the authenticated hash is available.

    We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.

    Another design point was where to do the split and truncation. As XXHASH has a fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for the full XXHASH.

    Here we are, we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree what to do when the checksum partially matches. For a single checksum it’s a simple yes/no decision. The partial match is the interesting case:

    This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome, but for two hashes it's not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification, and has to reflect the key availability, eg. in case the data are auto-repaired during scrub or when there's a copy.
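
    A sketch of the proposed layout and the four verification outcomes (both concrete choices here are mine for illustration: HMAC-SHA256 stands in for the authenticated hash, and the third-party xxhash package provides XXH64):

        import hashlib, hmac
        import xxhash  # third-party package

        AUTH_BYTES = 24  # 192 bits of the authenticated hash
        SEC_BYTES = 8    # 64 bits, the full XXH64 digest

        def make_csum(key: bytes, data: bytes) -> bytes:
            auth = hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]
            return auth + xxhash.xxh64(data).digest()

        def verify(csum: bytes, data: bytes, key=None):
            # Returns (auth_ok, secondary_ok); auth_ok is None without a key.
            secondary_ok = csum[AUTH_BYTES:] == xxhash.xxh64(data).digest()
            if key is None:
                return (None, secondary_ok)
            auth = hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]
            return (hmac.compare_digest(csum[:AUTH_BYTES], auth), secondary_ok)

    Callers then have to map these tuples onto actions: a full match is good, a full mismatch is a plain checksum error, and the mixed cases are where the interesting decisions (corruption vs tampering, key availability) live.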

    Performance considerations

    The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH, for each metadata and data block, vs slow SHA512 and slow SHA256. As far as I can tell it's also possible to select a SHA256/SHA256 split in ZFS, but that can't beat SHA256/XXHASH.

    The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation assumed, for simplicity, that the raw key bytes are provided to the kernel and to the userspace utilities. This is maybe OK for a prototype, but it can't survive until a final release under any circumstances. There's key management wired deep into the linux kernel, and there's a library for the whole API and command line tools. We ought to use that. Pass the key by name, not as raw bytes.

    Key management has its own pitfalls and surprises (key owned vs possessed), but let's assume that there's a standardized way to obtain the key bytes from the key name. In the kernel it's "READ_USER_KEY_BYTES", in userspace it's either keyctl_read from libkeyutils or a raw syscall to keyctl. Problem solved, on the low level. But, well, don't try that over ssh.
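
    For the userspace side, reading a key's payload by its serial number might look like the following sketch (libkeyutils via Python's ctypes; the library name and error handling are simplified assumptions):

        import ctypes

        keyutils = ctypes.CDLL("libkeyutils.so.1", use_errno=True)
        keyutils.keyctl_read.restype = ctypes.c_long

        def read_key_bytes(key_serial: int) -> bytes:
            # A NULL buffer asks only for the payload length.
            n = keyutils.keyctl_read(key_serial, None, 0)
            if n < 0:
                raise OSError(ctypes.get_errno(), "keyctl_read")
            buf = ctypes.create_string_buffer(n)
            if keyutils.keyctl_read(key_serial, buf, n) < 0:
                raise OSError(ctypes.get_errno(), "keyctl_read")
            return buf.raw[:n]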

    Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify the data, or even to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.

    This results in a global option, like btrfs --auth-key 1234 inspect-internal dump-tree, compared to btrfs inspect-internal dump-tree --auth-key 1234. This is not finalized, but a global option is now the preferred choice.

    Final words

    I have a prototype that does not work in all cases but at least passes mkfs and mount. The number of checksum verification cases got beyond what I was able to fix by the time of writing this. I think this has enough matter on its own, so I'm pushing it out as part 1. There are open questions regarding the command line interface, and also some kind of proof or discussion regarding attacks is still needed. Stay tuned.

    References:

    May 23, 2021 10:00 PM

    May 22, 2021

    Brendan Gregg: What is Observability

    It's a made-up computer word that my word processor decorates with a wiggly red you-can't-spell line. At least it did until I clicked "Add to Dictionary" (it got too annoying as I was writing a book on computer observability). Some people abbreviate it as o11y.

    Observability: The ability to observe.

    Observe-ability. Observability. In computer engineering we use it to describe the tools, data sources, and methods for understanding (observing!) how a technology is operating. We don't use the _real_ word "observable" since that implies the wrong thing. Imagine "observable metrics": Are there metrics that _aren't_ observable?

    Using observability in sentences:

    - What observability tools are installed? (Means: What tools exist that only read state?)
    - What observability does that database have? (Means: What metrics and logs does it have?)
    - How do you do observability? (Means: What products do you use for metrics, tracing, etc.?)
    - Let me try some observability first. (Means: Let me look at the system without changing it.)

    Wait, aren't all performance tools observability tools? No. _Experimental_ tools change the state of the system to understand it. For example, benchmarks. As an analogy, a car's dashboard is a collection of observability tools that let you understand how the car is operating (speed, rpm, temperature). A car's 0-60 mph time is an _experiment_.

    When I was a performance consultant I'd show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn't touch: Which were observability tools and which were experimental tools. "I'll start with observability tools only" is something I'd say at the start of every engagement. Note that observability tools aren't completely harmless: Their execution consumes resources, usually negligible, but in some cases it's enough to perturb the target of study. This is the "observer effect."

    Another use of the term observability is as a reminder to switch between tool types, and not to get stuck on one. A colleague (Roch Bourbonnais, from memory) once told me:
    "You have two hands. Observability and experimentation."

    It stuck with me as it also makes the point that when you're only using one type to solve a performance problem __you're working one-handed__.

    May 22, 2021 02:00 PM

    May 20, 2021

    Linux Plumbers Conference: Scheduler Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Scheduler Microconference has been accepted into the 2021 Linux Plumbers Conference! The scheduler is a core piece of the Linux kernel, deciding what process gets to run when, where, and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference is responsible for helping many of these topics make progress.

    At last year’s meet up, the Scheduler microconference achieved the following results:

    Not only were enhancements made, but the meetup also helped prove that some topics were not feasible and we do not need to spend more time on them.

    This year’s topics to be discussed include:

    Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

    May 20, 2021 01:23 AM

    May 14, 2021

    Linux Plumbers Conference: Confidential Computing Microconference Accepted into 2021 Linux Plumbers Conference

    We are pleased to announce that the Confidential Computing Microconference has been accepted into the 2021 Linux Plumbers Conference! In this microconference we will discuss how Linux can support encryption technologies which protect data during processing on the CPU. Examples are AMD SEV, Intel TDX, IBM Secure Execution for s390x and ARM Secure Virtualization. These are recent additions compared to technologies which protect data while in transit (SSL, VPNs) and at rest (disk encryption).

    The Linux kernel recently gained support for SEV-ES and support for Intel TDX is upcoming. AMD SEV will be further enhanced by Secure Nested Paging (SNP). Support for these technologies requires intrusive changes to the Linux kernel for memory integrity and secure interrupt delivery to virtual machines. Designing these changes in a way that works for different confidential computing technologies is one goal of this microconference.

    Topics to be included, but not limited to, are:

    Please come and join us in the discussion for solutions to the open problems for supporting these technologies.

    We hope to see you there!

    May 14, 2021 01:09 AM

    May 13, 2021

    James Bottomley: The Community Corrosive Effects of CLAs

    As one of the kernel DCO advocates, I’ve written many times about using the DCO instead of a CLA for copyright and patent contributions under open source licences. In spite of my obvious biases, I’ll try to give a factual overview of the cases for the DCO and CLA system. First, it should be noted that both the DCO and any CLA are types of Contribution Agreements (a set of terms by which contributors are agreeing to be bound). It should also be acknowledged that the DCO is a far more recent invention than CLAs. The DCO was pioneered by the Linux kernel in 2004 (having been designed by Diane Peters, then of OSDL) and was subsequently adopted by a broad range of open source projects.

    However, in legal terms, the DCO is much less well understood than a standard CLA-type agreement between the contributor and some entity, which is largely the reason you find a number of lawyers still advocating for the use of CLAs in various open source projects: because they’d like to stick with something that has more miles on it, or because they’re invested in the older model of community, largely pioneered by Apache. The biggest problem today is that the operation of most CLAs is asymmetrical: they take from the contributor more rights than the open source code actually needs, so let’s begin with a summary of each type of Contribution Agreement.

    DCO

    The DCO is a legal representation by the contributor to everyone who might ever use the code. It requires no second party on the other side to counter-sign it or act as the receiving entity, so it exactly mirrors the inbound=outbound licensing model first coined by Richard Fontana. The DCO explicitly grants to all downstream recipients only the exact rights the Open Source licence requires (and nothing more). In this sense it is fully symmetrical: the rights granted by the contributor are the same as the rights received by the downstream (i.e. inbound=outbound). Every contributor under the DCO retains their own copyright (or their company does if the contribution is a work for hire).

    The main alleged disadvantage of the DCO is that it encourages distributed ownership and makes it very hard to change the licence of the project, because each contributor has only granted the rights necessary for the current licence; so if the new one requires more or different rights, all the current contributors have to re-grant those new or different rights (which can be a huge number of people for large, long-running projects).

    Since the DCO is a representation to everyone and requires no receiving entity, the project collecting the code doesn’t require any formal legal entity, like a foundation, to operate, and thus the DCO gives rise to a truly lightweight structure for any project. The other big advantage of the DCO is that all of the representations are tracked by the Signed-off-by: tag on the commit, which goes in the git repository of the project code, so anyone with a clone of the repository has complete access to information about who changed what and where their DCO signoff is.

    CLA

    All current Open Source CLAs are structured as agreements between the contributor and a second party. Most often, the second party is a Foundation or a Corporation, making them quite heavyweight in terms of setup, admin and overhead. Every current CLA that I know about takes more rights from the contributor than the open source licence actually requires. For instance the Apache Individual CLA grants the right to copy, derive and sublicence to the Apache foundation, who then relicence the contribution to the project, usually under the Apache 2.0 licence. This is a classic asymmetric grant because the Apache foundation receives far more rights in the contribution than it grants to the downstream recipients. The FSF CLA is even more extreme because they require assignment of the copyright (so they will own the code and you, the author, will have no further right or interest in it except possibly for minimal moral rights to be named the author).

    Apart from the asymmetric grant, which places the receiving entity in a privileged position in the ecosystem, the other problem with CLAs is that they’re legal agreements, so they require a lawyer to prepare them, a mechanism to ensure people sign them and a mechanism to keep all the signatures … sometimes this can be in filing cabinets, if paper instead of electronic copies are used. This repository of agreements then isn’t available to anyone except the tracking entity, meaning that if someone needs to know if John Doe signed a CLA, they have to reach out and ask. In some cases the actual filing cabinets got lost as projects changed offices, so some CLA-based projects don’t actually have complete records of all their CLAs.

    CLAs Catalyse Community Corrosion

    The main driver of community corrosion is the temptation to abuse a position of power (this temptation becomes irresistible over time because, as Baron Acton put it, “all power corrupts”). Since CLAs by their nature force a power imbalance between the contributor and the receiving entity, they act as focal points for this corrosion. Communities are very sensitive to what they see as their work being misused, so the fastest way to lose community trust is to abuse the power the CLA gave you to go against the community itself. There are numerous examples of this in the Corporate World, the most topical one today being the Elastic change from Apache 2.0 to SSPL to better monetize the code the community contributed freely to.

    One might think the solution to this is never to sign a CLA if the holder of the power imbalance is a corporation … i.e. only do it if the other entity is a not-for-profit foundation. But ask yourself, how much do you trust the people running the foundation, and do its bylaws guarantee your rights in the code? Relicensing for commercial gain isn’t the only way the community could be abused, so how sure are you that the power you’re handing to a foundation (which, after all, is an entity governed by some type of board, all of whom likely have political agendas) won’t be abused? To see some examples of foundations not being in tune with their community, one only has to look at the FSF and Richard Stallman. Based on all of this I conclude, like Drew DeVault, that you should never sign a CLA under any circumstances.

    The bottom line is that if you do sign a CLA some decision will happen at some point that you don’t agree with but which you already gave away the power to block because of the rights imbalance inherent in the CLA you signed. Inevitably this decision will cause you to feel betrayed because your views are being ignored and as a contributor you feel you should be heard, so you’ll sour on the project. This is the community corrosion catalyst buried deep inside all CLAs.

    One final thing to note is that it is possible to craft a CLA that only takes the rights it needs, in the same way the DCO does; it’s just that no project I know of has ever done this. However, even if this experiment were attempted, you would still need a recipient entity, plus all the infrastructure to do signing and track the signed agreements, so you’d still be better off using a lightweight DCO process.

    Conclusion: For Community Small is Beautiful

    The way to avoid the community corrosion problem is to do everything minimally: use a DCO to take only the rights the downstream requires, and avoid all the heavyweight recipient, signing and tracking infrastructure. Don’t set up a foundation unless you absolutely need an entity, say to handle cash, and if you must set one up, never give it any control over the project (like appointing a change control or architecture control board, for instance). Everything you set up should be as small as possible and clearly serve the project and its community. Above all, don’t use a CLA, because it will cause a rights imbalance that corrodes your community and it will require a large amount of overhead to run.

    May 13, 2021 10:51 PM