18 Jun 2026

Performance Has Layers

Steve Karam
Product Assurance

One of the best things about building the whole stack is also one of the toughest: you own all the problems. When a packet leaves a customer’s guest, travels through a virtual NIC, crosses our software switch, hits the OPTE data path, rides the physical fabric, and arrives at another guest, it passes through several distinct components.

And if you build all of them, as we do, then every one of those components is a place you can make the system faster. It beats the hell out of a five-way call with different vendor support teams.

But just because you can speed up those components, it doesn’t mean any single layer is “the” performance story. Each one helps a different kind of workload, manifests as a specific type of bottleneck, and, when cleared, reveals another layer of wait.

[Figure: step chart showing CPU utilization rising as successive bottlenecks are cleared.]

So let’s talk about a handful of these layers, what each one does, and why owning all of them is the thing that lets us keep building high-performance on-prem infrastructure.

A specific need

Over our last couple of releases, we’ve focused heavily on the network; namely, the introduction of end-to-end IPv6 plumbing and jumbo frames. Introducing these to a self-contained rack involved a lot of decisions, along with thorough investigation of the expected outcomes. We started by looking at the layers and knobs we could influence. From the wire up:

The physical fabric: Links between racks and the switches that carry them.
Frame size: How many bytes ride in a single packet (the Maximum Transmission Unit, or MTU).
Offloads: Whether segmentation and checksums happen in the guest, or are handed off to underlying infrastructure that can do them more efficiently.
CPU placement: Where the virtual NIC’s worker threads run relative to the guest’s own vCPUs.
Flow parallelism: How many independent connections the work is spread across.
Protocol: IPv6 is the rack’s native underlay; the only question is whether a guest speaks IPv6 or IPv4 on top of it.

Each of these knobs helps one or more types of workload. The rest of this post walks through them, roughly in the order a packet would encounter them, and notes what each one is good for.

Frames, and the wire that carries them

Start with the most visible knob, and the one that took quite a bit of work to turn: the size of a packet.

Size matters because the network does a roughly fixed amount of work per packet, mostly independent of how many bytes the packet carries. Every packet has its headers parsed at several layers. Every packet runs the gauntlet of firewall rules, route lookups, address resolution, and connection-tracking state, much of it in software in the data plane. So if you want to move a megabyte and each packet holds 1500 bytes, you pay that per-packet tax around 700 times. Raise the packet to 9000 bytes and you pay it roughly 120 times. Same data, one-sixth the overhead.
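To make the arithmetic concrete, here is a minimal Python sketch of the packet count. It assumes a megabyte of 2**20 bytes and treats the frame size as pure payload, ignoring headers. The function name is ours, for illustration only.

```python
import math

MEGABYTE = 2**20  # assumption: binary megabyte, 1,048,576 bytes


def packets_needed(total_bytes: int, bytes_per_packet: int) -> int:
    """Packets required to carry total_bytes; the last one may be partly full."""
    return math.ceil(total_bytes / bytes_per_packet)


standard = packets_needed(MEGABYTE, 1500)  # 700 packets
jumbo = packets_needed(MEGABYTE, 9000)     # 117 packets

# The per-packet work scales with the packet count, so this ratio is the
# reduction in per-packet overhead for the same amount of data.
print(standard, jumbo, round(standard / jumbo, 1))
```

The ratio comes out at about 6, which is where the one-sixth figure comes from.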

Inside a rack, we have always taken advantage of this. Guests advertise TCP segmentation offload (TSO), so a guest hands us buffers up to 64 KB at a time over virtio. We carry them across a 9000-byte underlay, set the segment size to leave room for encapsulation, and let the sled’s physical NIC slice the big buffer back into wire-sized packets on the way out. This is tunneled TSO, and it is why traffic inside a VPC has effectively been riding jumbo frames for a long time, and why internal throughput has always been high.

The conservative part was at the rack’s edge. Traffic leaving the rack travels onto networks we do not control, and the safe assumption for an arbitrary external path is the Internet-standard 1500-byte MTU. So that is what we advertised to guests for external traffic.

The major public clouds make the same choice, but it is worth being precise about why, because the word “external” means something very different to them than it does to us. For a hyperscaler, the internal network is an entire availability zone: a building, or several, stitched together by a fabric they own. Almost everything a workload talks to lives inside that boundary, and the only thing that is genuinely “external” is the public Internet. Falling back to a 1500-byte MTU at that edge costs them little, because the edge is far away and rarely on the fast path.

For a single Oxide rack, the boundary sits much closer in. “External” is everything past one rack, and a lot of what lives there is not the public Internet at all: it is the rack in the next row, reached over a fabric the same operator owns and provisioned. That operator often does control the upstream and knows exactly what it can carry. Applying the hyperscaler’s conservative default to that link treats a known, fat, operator-owned path as if it were the open Internet, and leaves a great deal of performance on the table.

External jumbo frames let those operators opt out of the conservative default. The end-user-visible part is a single boolean on an instance. The interesting part is everything that boolean has to coordinate underneath it, because a frame size is only real if every layer of the stack agrees on it:

The guest learns its MTU over virtio (VIRTIO_NET_F_MTU), which Propolis derives from the data link viona attaches to. So the control plane has to create the OPTE/xde data link at the larger MTU in the first place.
The xde driver previously hardcoded 1500. That became a parameter threaded through the create-xde ioctl, through opteadm, and into the kernel module.
A guest also learns its IPv6 MTU from router advertisements, which OPTE generates, and that generator hardcoded 1500 too. So a guest could be configured for 8500 and still quietly use 1500 for IPv6 until the advertisement itself learned the larger number.
And Nexus gained two flags, fleet-level (operator) and instance-level (user), both required, so the capability exists for end users only once an operator has confirmed the surrounding network can take it.
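The two-flag gate can be sketched in a few lines of Python. This is an illustration of the rule only, not Nexus code: the function and parameter names are hypothetical, and the real control plane may structure this differently.

```python
STANDARD_MTU = 1500  # conservative default for external traffic
JUMBO_MTU = 8500     # jumbo size offered to guests


def external_mtu(fleet_jumbo_enabled: bool, instance_jumbo_enabled: bool) -> int:
    """Return the external MTU for an instance.

    Jumbo applies only when the operator (fleet level) and the user
    (instance level) have both opted in; otherwise the default holds.
    """
    if fleet_jumbo_enabled and instance_jumbo_enabled:
        return JUMBO_MTU
    return STANDARD_MTU


def advertised_mtus(fleet_jumbo_enabled: bool, instance_jumbo_enabled: bool) -> dict:
    """Every place the guest learns its MTU must report the same value."""
    mtu = external_mtu(fleet_jumbo_enabled, instance_jumbo_enabled)
    return {"virtio": mtu, "ipv6_router_advertisement": mtu}
```

Deriving both advertised values from one decision is the point: the failure described above, where virtio said one thing and the router advertisement said another, cannot happen if there is a single source for the number.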

One boolean on the surface; a dozen coordinated turns underneath. It was great being able to bring this capability to our customers. It also meant we found the rough edges ourselves. For example, testing the jumbo path prototype surfaced an interrupt-handling issue in the virtual NIC when oversize frames arrived, and because the driver is ours, we tracked it down and fixed it before the feature shipped, rather than opening a ticket with a vendor and waiting.

Why 8500 and not 9000

The underlay is 9000 bytes, but we offer guests 8500 to allow for headroom. Every guest frame gets wrapped for the underlay: Ethernet, IPv6, UDP, and a Geneve header, plus Geneve options we use to carry things like the segment size. The strict floor of that overhead is on the order of 70 to 78 bytes. We reserve 500. The extra is deliberate, as more Geneve features are coming, and the routing protocol will eventually add extension-header overhead in the data plane. There is also a hardware limit to respect: a switch parses packet headers only up to a fixed depth, often the first 256 or 512 bytes, and cannot see any header that falls past that window. Keep the encapsulation stack well inside it and the switch can still read the fields it needs to make forwarding decisions; let the stack grow too deep and those fields disappear from view. 8500 is a number that is easy to grow later and very hard to shrink once customers build expectations around it.
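The headroom arithmetic is easy to check. The sketch below uses the standard fixed header sizes and assumes 8 bytes of Geneve options for the upper figure. The actual option layout is not spelled out here, so treat that number as an assumption.

```python
# Fixed header sizes in bytes (standard values).
ETHERNET = 14     # Ethernet header, no VLAN tag
IPV6 = 40         # fixed IPv6 header, no extension headers
UDP = 8
GENEVE_BASE = 8   # Geneve header without options

UNDERLAY_MTU = 9000
GUEST_MTU = 8500


def encap_overhead(geneve_option_bytes: int = 0) -> int:
    """Bytes added around a guest frame for a given amount of Geneve options."""
    if geneve_option_bytes % 4 != 0:
        # Geneve expresses option length in multiples of 4 bytes.
        raise ValueError("Geneve options must be a multiple of 4 bytes")
    return ETHERNET + IPV6 + UDP + GENEVE_BASE + geneve_option_bytes


floor = encap_overhead()             # 70 bytes with no options
with_options = encap_overhead(8)     # 78 bytes, assuming 8 bytes of options
reserved = UNDERLAY_MTU - GUEST_MTU  # 500 bytes held back
spare = reserved - with_options      # 422 bytes left for future growth

print(floor, with_options, reserved, spare)
```

Under those assumptions, a bit over 400 bytes remain for new Geneve options and extension headers before the guest-facing number would have to change.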

What about a per-port knob?

Traditional switches put a configurable MTU on every port. We deliberately do not, and the reasoning is a clean example of designing for the cloud rather than for the box. As our networking engineer Ryan Goodfellow puts it, “MTU is a property of a path, not of a port.” A single switch port can be the gateway to several upstream paths at once: one that carries jumbo frames, one capped at 1420 bytes, and the 1500-byte public Internet. There is no single port MTU that is correct for all three. Pinning the port to the smallest just penalizes the larger paths without helping the small one, since oversize packets bound for the small path get dropped and signaled either way. In many cases, users will want the effective MTU for a communication path to be determined dynamically by the operating system’s path MTU discovery (PMTUD).
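A small Python sketch shows why no single port value works. The three paths mirror the example above; the 9000-byte figure for the jumbo path and all names are assumptions for illustration.

```python
GUEST_MTU = 8500  # jumbo size offered to guests

# Hypothetical upstream paths behind one switch port, with their path MTUs.
PATHS = {
    "jumbo-capable fabric": 9000,  # assumed value for the jumbo path
    "capped tunnel": 1420,
    "public Internet": 1500,
}


def with_port_mtu(port_mtu: int) -> dict[str, int]:
    """Largest usable packet per path when one MTU is pinned on the port."""
    return {
        name: min(GUEST_MTU, port_mtu, path_mtu)
        for name, path_mtu in PATHS.items()
    }


def with_path_discovery() -> dict[str, int]:
    """Largest usable packet per path when each path is discovered separately."""
    return {name: min(GUEST_MTU, path_mtu) for name, path_mtu in PATHS.items()}


pinned = with_port_mtu(min(PATHS.values()))
discovered = with_path_discovery()

print(pinned)      # every path is held to 1420
print(discovered)  # each path gets what it can actually carry
```

Pinning the port to the smallest path drags the other two down to 1420 bytes, while per-path discovery leaves the capped tunnel at 1420 and lets the other paths run at their own limits.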

So instead of a port dial, we put the decision where it belongs: on the instance, gated by a fleet-level switch the operator controls. That is the same protection a traditional network gets from port MTU and MSS clamping, expressed in a way that fits a cloud.

What the setting buys, and the fabric under it

Path	MTU 1500	MTU 8500	Change
Internal VPC, A to B	52.44	55.73	+6%
Internal VPC, B to A	51.70	55.57	+7%
External, A to B	7.70	32.67	+324%
External, B to A	7.96	34.19	+329%

(Gbps, per pair, on a 40G external uplink.)

Internal VPC traffic barely moves, because it was already riding large frames; its remaining cost is per-byte, not per-packet. External traffic, which had been paying the per-packet tax at 1500, improves by roughly 4x, and the gap between internal and external throughput closes from about 7x to about 1.7x.
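For readers who want to reproduce those ratios, here is the arithmetic in Python, using the A to B rows of the table. The helper name is ours.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after."""
    return (after - before) / before * 100


# (MTU 1500, MTU 8500) throughput in Gbps, A to B rows from the table.
internal = (52.44, 55.73)
external = (7.70, 32.67)

internal_gain = round(pct_change(*internal))  # about +6%
external_gain = round(pct_change(*external))  # about +324%

# Gap between internal and external throughput at each MTU.
gap_at_1500 = round(internal[0] / external[0], 1)  # about 6.8x
gap_at_8500 = round(internal[1] / external[1], 1)  # about 1.7x

print(internal_gain, external_gain, gap_at_1500, gap_at_8500)
```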

Bigger frames also want more wire to stretch out on. Our lab fabric moved external traffic from a single 40G uplink to a set of 100G links, with traffic spread across them by equal-cost multipath (ECMP) routing. On that fabric a single pair reached 60 Gbps with the packet tax reduction.

Jumbo frames are not a go-fast button

One note, also in Ryan’s own framing: jumbo frames are not a network go-fast button. They help only flows that actually send many messages larger than 1500 bytes. Benchmarks like iperf3 show an outsized benefit because they are built to send the largest possible frames as fast as possible, which is close to the best case. A real workload whose messages are small on average may see a modest benefit, or none. The right way to read the 4x above is “the ceiling moved a lot,” not “every workload gets 4x.”

The one rule for jumbo frames

There is one operational sharp edge worth detailing, because it is the difference between jumbo helping and jumbo hurting.

Jumbo frames only work if every hop between two endpoints can carry them. When a packet is too large for some link along its path, that link is supposed to drop it and send back an ICMP “fragmentation needed” (or, for IPv6, “packet too big”) message, and the sender’s path-MTU discovery lowers its frame size in response. The trouble is that these signals get dropped surprisingly often, by a firewall, by a router policy that filters ICMP, by an overloaded hop. When that happens the sender never learns to back off. You get the classic black hole: the TCP handshake completes, and then the application stalls and transfers nothing. It is worse for IPv6, where the relevant ICMPv6 is less reliably handled and routers drop oversize packets unconditionally.

So the rule is: before you enable jumbo, audit the whole path. Confirm every hop carries the larger frame and/or returns the needs-fragmentation signal. Verify the negotiated MTU on a live connection rather than trusting the config. And consider enabling packetization-layer path-MTU discovery in the guest (net.ipv4.tcp_mtu_probing on Linux), which gives the kernel a second, in-band way to find the right frame size without depending on ICMP at all. Jumbo is a big lever for external traffic, and a safe one, as long as the path underneath it is consistent.
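The idea behind in-band discovery can be shown with a toy simulation. This is a simplified sketch, not the Linux kernel's algorithm: it models a path as a list of hop MTUs, treats an oversize probe as silently lost (as on a path that filters ICMP), and binary-searches for the largest size that gets through. The 1280-byte floor is the IPv6 minimum link MTU, and the function names are hypothetical.

```python
def probe_succeeds(size: int, hop_mtus: list[int]) -> bool:
    """A probe crosses the path only if every hop can carry it.

    No ICMP is modeled: an oversize probe is simply lost.
    """
    return size <= min(hop_mtus)


def discover_path_mtu(hop_mtus: list[int], low: int = 1280, high: int = 8500) -> int:
    """Find the largest size in [low, high] that crosses the path.

    Assumes a probe of size `low` always succeeds.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the search always advances
        if probe_succeeds(mid, hop_mtus):
            low = mid       # mid fits, so look for something larger
        else:
            high = mid - 1  # mid was lost, so look for something smaller
    return low


print(discover_path_mtu([9000, 9000, 9000]))  # consistent jumbo path
print(discover_path_mtu([9000, 1500, 9000]))  # one standard hop in the middle
```

The second call settles on 1500 even though no hop ever reports the problem, which is the property that makes probing a useful backstop when the ICMP signal is filtered.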

Where work runs

If frame size is about how much work there is, this layer is about where that work happens. Two knobs live here, and neither is one a customer turns directly.

The first is offloads, which we touched on above. Offloads decide which layer does the heavy per-packet lifting. Segmentation offload lets a guest hand off a big buffer and let a lower layer chop it into wire-sized packets in bulk, rather than the guest doing it one packet at a time. Getting that right across both IPv4 and IPv6 is part of what it means to bring a protocol to full parity.

The second is CPU placement: which cores actually run the network stack’s work. A guest with many vCPUs is only as fast as the stack’s ability to spread its packet processing across cores. We refined how the virtual NIC’s worker threads are placed relative to a guest’s vCPUs, and the result is that throughput now scales cleanly as guests get larger. On the 64-core, 128-thread parts we test on, a 96-vCPU guest runs wider than the physical cores, onto the second hardware thread of many of them. With the worker threads placed to handle exactly that, a 96-vCPU guest sustains as much (or more) external throughput as a 64-vCPU one.

Flow parallelism

A single connection has limitations. Many of our tests on the 100G fabric reached that limit at around 60 Gbps, one way, per VM-pair, and it held steady whether we pushed 32, 64, or 128 streams through one connection, or ran several iperf3 processes side by side.

The way past it is not a bigger single flow; it is more flows. Spread the work across multiple connections with distinct 5-tuples, landing on different sleds, and ECMP fans them across the fabric automatically. Aggregate throughput then scales with the number of independent flows rather than bumping into a ceiling. This happens to match how production systems actually behave: many connections, not one giant one.
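A toy model of ECMP makes the point. Real switches use their own hardware hash functions; the SHA-256 hash, the link count of 4, and the addresses and ports below are all assumptions chosen for illustration.

```python
import hashlib

LINKS = 4  # assumed number of equal-cost uplinks


def ecmp_link(five_tuple: tuple, link_count: int) -> int:
    """Pick an uplink by hashing the flow's 5-tuple.

    Every packet of a flow hashes the same way, so one flow stays on one link.
    """
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big") % link_count


# (source address, destination address, protocol, source port, destination port)
one_flow = ("fd00::1", "fd00::2", 6, 40000, 5201)

# 64 connections that differ only in source port, so each has a distinct 5-tuple.
many_flows = [("fd00::1", "fd00::2", 6, 40000 + i, 5201) for i in range(64)]

links_for_one = {ecmp_link(one_flow, LINKS) for _ in range(10)}
links_for_many = {ecmp_link(flow, LINKS) for flow in many_flows}

print(len(links_for_one), len(links_for_many))
```

One flow always lands on a single link no matter how many packets it sends, while the set of distinct flows spreads across the available links.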

The practical guidance on this layer:

Design for multiple parallel flows rather than a single connection.
Place throughput-critical pairs on unshared sleds. A single sled on a 100G link tops out around 90 Gbps of combined in-plus-out traffic regardless of how many VMs generate it, so do not put the receive side of one critical flow on the same sled as the send side of another. Oxide’s anti-affinity rules can be used for this task.
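The placement rule can be checked with simple bookkeeping. The sketch below is a planning aid under stated assumptions, not an Oxide API: sled names and flow rates are hypothetical, and the 90 Gbps budget is the approximate per-sled figure quoted above.

```python
from collections import defaultdict

SLED_BUDGET_GBPS = 90  # approximate combined in-plus-out ceiling per sled


def sled_load(flows):
    """Sum traffic per sled; a flow counts on its sender and on its receiver."""
    load = defaultdict(float)
    for src_sled, dst_sled, gbps in flows:
        load[src_sled] += gbps  # outbound on the sending sled
        load[dst_sled] += gbps  # inbound on the receiving sled
    return dict(load)


def overloaded(flows):
    """Sleds whose combined traffic exceeds the budget."""
    return sorted(
        sled for sled, gbps in sled_load(flows).items() if gbps > SLED_BUDGET_GBPS
    )


# Two 60 Gbps flows. In the first layout sled-b receives one and sends the other.
shared = [("sled-a", "sled-b", 60), ("sled-b", "sled-c", 60)]
separate = [("sled-a", "sled-b", 60), ("sled-c", "sled-d", 60)]

print(overloaded(shared))    # ['sled-b']
print(overloaded(separate))  # []
```

In the shared layout sled-b would need 120 Gbps of combined traffic, so both flows suffer; moving the second pair onto its own sleds keeps every sled within budget.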

IPv6 is a first-class citizen

IPv6 is not a protocol we bolted on for compatibility. The rack’s underlay is IPv6: every sled-to-sled hop, every Geneve-encapsulated guest frame, rides IPv6 by default. It is the rack’s native tongue. So the work across these releases was not teaching the rack to speak IPv6; it was extending that first-class status all the way into the guest, end-to-end, and then tuning until there was no tax for using it.

In our tests on the 40G uplink, at the same MTU, on the same boundary path, IPv4 and IPv6 now deliver the same throughput: about 34.5 Gbps for IPv4 and about 34.9 Gbps for IPv6, per pair, at jumbo. There is no inherent IPv6 penalty, and on a rack whose fabric was IPv6 from the start there should not be. If you are designing an IPv6-first environment, and many of our customers are, that parity matters.

One IPv6 caveat: path-MTU discovery is less forgiving for IPv6 than for IPv4, so the jumbo path-audit rule from the frames section matters a little more when you are running IPv6.

The layers form a cohesive whole

Pull back and the through-line is not any single number. It is that a platform built from the ground up has a knob at every layer, and a team that owns every layer can keep turning them. The fabric, the frame size, the offloads, the CPU placement, the flow distribution, the protocol handling: these are not separate products bolted together; they are one system, and tuning one with full knowledge of the others is something only the owner of the whole stack can do.

External jumbo frames are the clearest example. What a user sees is one boolean. What it took was a coordinated change across the virtual NIC, the OPTE data path, the kernel driver, the router advertisements, and the control plane, plus a deliberately chosen frame size with the headroom reasoning written in, plus a decision about where in the system the knob even belongs.

That is also why this is not a one-time announcement. There is no final version of a network stack, or an end to tuning. There is the next layer, and the one under it, all the way to the core. We will keep peeling.