One of the considerations in designing the Oxide rack is which parts we expect to be accessible, and by what means. The Oxide rack is designed to live in a data center, with exclusive access via the network. The only reason an engineer should ever need to physically visit a rack is to replace a failing part, such as a disk. Our Service Processor (SP) is accessible via the management network.
During some of our first attempts at putting our next-generation Cosmo sled into an Oxide rack, we would see the Service Processor drop off the network. This is a tricky situation to debug: without network access, we have limited insight into the state of the SP itself. Debugging started from the state of the rest of the system (the original Hubris bug may contain spoilers for this blog post!):
- The AMD host CPU was still alive, meaning the full system itself still had power.
- The SP itself was not broadcasting over the management network that it was alive.
- There were no increases in network data counters coming from the SP.
- The fans were spinning at a constant elevated rate. The service processor is responsible for fan control, so this was an indication the fan controller may have fallen back to emergency full-power mode.
- This was not reproducible on a sled outside a rack.
The Service Processor runs our custom operating system, Hubris. Each portion of the system (networking, thermal control, update, etc.) is written as a separate task. Hubris is not a true real-time operating system with deadline guarantees, but it does have the notion of task priorities. One of our working theories was a software bug causing task starvation: if the networking task was unable to run because some other task was eating up all the CPU time, it would not be able to respond over the network. A likely culprit for such starvation is a task that has gotten into an infinite crash loop, with all CPU time being spent restarting it. We adjusted the task restart logic to add a longer delay, to catch this case (see the sketch below). We also wanted to be able to observe whether the SP was still making progress even when we lacked network access, so we switched our chassis LED from "always on" to blinking.
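To make the idea concrete, here is a sketch of what such a restart delay looks like. This is illustrative Rust, not the actual Hubris kernel code, and all of the names and values are made up:

```rust
type Ticks = u64;

/// Hypothetical delay before a faulted task is restarted.
const RESTART_DELAY_TICKS: Ticks = 100;

enum TaskState {
    Healthy,
    Faulted,
}

struct Task {
    state: TaskState,
    restart_at: Ticks,
}

/// Instead of restarting a faulted task immediately (which lets a task
/// stuck in a crash loop eat all available CPU time), defer the restart
/// so lower-priority tasks get a chance to run in the meantime.
fn on_task_fault(task: &mut Task, now: Ticks) {
    task.state = TaskState::Faulted;
    task.restart_at = now + RESTART_DELAY_TICKS;
}
```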
We were fortunate to be able to reproduce the issue with these debug changes in place, but the results were still confusing: in some cases we would see the LED stuck on, and in others stuck off. The task responsible for LED blinking ran near the top of the priority order, which limited the number of places where a task could be stuck.
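This makes sense once you consider what a proof-of-life blinker looks like. In sketch form (the names and interval here are illustrative, not the actual Oxide task):

```rust
/// Illustrative stand-in for the real GPIO driver call.
fn led_toggle() {
    // ...drive the chassis LED pin...
}

/// Illustrative stand-in for the OS sleep primitive.
fn sleep_for_ms(_ms: u64) {}

fn blinker_task() -> ! {
    loop {
        led_toggle();
        // If anything monopolizes the CPU above this task's priority,
        // this loop stops running and the LED freezes in whichever
        // state it was last left: sometimes on, sometimes off.
        sleep_for_ms(500);
    }
}
```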
One of the many advantages of writing Hubris in Rust is eliminating entire bug classes, such as buffer overflows. A category of issues Hubris is still particularly prone to, however, is stack overflows. This is because Hubris requires manual sizing of stacks for tasks, and calculating maximum stack size has proven tricky. Our ability to detect undersized stacks has improved with the addition of the emit-stack-sizes feature, but we can still hit some edge cases.
When a stack overflow occurs, the task safely restarts. A stack overflow in the kernel, however, could potentially produce similar behavior: a system that looks like it isn't making progress. Unfortunately for us, the stack margins on the kernel were relatively large (512 bytes!), so this was an unlikely cause.
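(For the curious: a common way to measure margins like that is a stack watermark, sketched below. This is the general trick, not necessarily how Hubris measures it.)

```rust
/// Pattern written over the whole stack region before first use.
const STACK_FILL: u32 = 0xDEAD_F00D;

/// On ARM the stack grows downward, so words of the fill pattern that
/// were never overwritten accumulate at the low end of the region. The
/// count of surviving words gives the remaining margin in bytes.
fn stack_margin_bytes(stack: &[u32]) -> usize {
    stack.iter().take_while(|&&w| w == STACK_FILL).count() * 4
}
```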
At this point, we really needed to get more debugging information out of the system. For manufacturing purposes, we have SWD debug headers. These are not expected to be used on a production system, and especially not on a system in a running rack. With the assistance of coworkers in the Oxide office, we had to do some creative cable pulling to get a probe attached.

Fortunately, our cable attachment paid dividends: we reproduced the issue with the probe attached! This was not immediately fruitful, though: the debug probe was unable to actually halt the CPU, which limited our ability to extract diagnostic information. Our Service Processor is an STM32H7 (a Cortex-M7 core), and the number of ways to put such a system into a state where it cannot even be halted by the debugger is limited.
This put our focus on identifying which parts of the system could cause such behavior. A major change from our first-generation Gimlet system was the addition of an FPGA to control more parts of the system, such as host flash. This FPGA is connected using a simple, old-school parallel bus, like the sort you might use for RAM, and is accessed via the STM32H7 Flexible Memory Controller (FMC). As stated in the manual (Section 22.1 of RM0433):
Its main purposes are:

* to translate AXI transactions into the appropriate external device protocol
* to meet the access time requirements of the external memory devices
One way a CPU can potentially get stuck is if it never receives a bus acknowledgement from an external device. A bug in the FPGA timing, for example, could result in the CPU hanging forever when attempting to read a register. To test this theory, we created an FPGA test image with a register that, when read, would intentionally hang the FMC bus. This produced behavior very similar to what we had observed, and was a strong indicator that we were looking at the right part of the system.
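The test itself boils down to a single volatile read of the poisoned register. As a sketch (the FMC bank 1 base address is real for the STM32H7, but the register offset is made up for illustration):

```rust
/// NOR/PSRAM bank 1 of the STM32H7 FMC starts here.
const FMC_BASE: usize = 0x6000_0000;
/// Hypothetical offset of the FPGA test register that never ACKs.
const HANG_REG_OFFSET: usize = 0x40;

fn read_hang_register() -> u32 {
    let reg = (FMC_BASE + HANG_REG_OFFSET) as *const u32;
    // If the FPGA never completes the bus transaction, this volatile
    // read stalls the memory controller and the CPU hangs right here,
    // deaf even to the debug probe.
    unsafe { core::ptr::read_volatile(reg) }
}
```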
We typically rely on full system dumps to debug Hubris problems, which is not possible unless we can halt the CPU. ARM CPUs do support vector catch, though: it's possible to configure the CPU so that on reset, it halts before executing the first instruction. Our hope was that a vector-catch reset would unstick the CPU without trampling over our existing state. This did work: we lost the running register state, including the program counter, but the rest of the Hubris state in RAM was preserved across the reset and looked reasonably consistent. We could see which Hubris task was running, but nothing there looked like it was accessing the FMC.
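Mechanically, vector catch is just a couple of register writes from the debug probe. Here is a rough host-side sketch using the probe-rs crate; the chip name is illustrative and the exact incantations may differ, but the DEMCR and AIRCR addresses are architectural:

```rust
use probe_rs::{MemoryInterface, Permissions, Session};

fn main() -> Result<(), probe_rs::Error> {
    // Attach to the target over SWD (chip name is illustrative).
    let mut session = Session::auto_attach("STM32H753ZITx", Permissions::default())?;
    let mut core = session.core(0)?;

    // DEMCR (0xE000_EDFC): setting VC_CORERESET (bit 0) makes the core
    // halt at the reset vector, before the first instruction runs.
    let demcr = core.read_word_32(0xE000_EDFC)?;
    core.write_word_32(0xE000_EDFC, demcr | 1)?;

    // AIRCR (0xE000_ED0C): write VECTKEY (0x05FA << 16) together with
    // SYSRESETREQ (bit 2) to request a system reset.
    core.write_word_32(0xE000_ED0C, 0x05FA_0004)?;
    Ok(())
}
```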
Our hardware engineers reviewed the FPGA timings and found that we might not have been meeting the timing constraints required by the memory interface. We merged the fix and figured that the vector-catch dumps were just inconsistent, most likely due to the cache. When we ran experiments with the cache turned off, the dumps were consistent, but we never reproduced the actual issue.
We continued Hubris development as usual over the next several weeks. One of the changes we worked on during this period was related to our measured boot work. Our Root of Trust (RoT) is responsible for taking a hash of the SP flash at bootup, which eventually gets used by higher-level software. To achieve the security properties we need, the SP may reset itself multiple times in a row at first bootup. While testing this change, we saw the same symptoms come back: the Cosmo SP would disappear from the network and appear dead. This change turned out to be incredibly good at reproducing the issue, turning a reproduction time of potentially 24+ hours into approximately 10-20 minutes. The initial dumps still didn't show a smoking gun, but we remained highly suspicious of the FMC bus, since only a limited set of causes could produce such symptoms.
The high reproduction rate gave us a chance to try many experiments, none of which were fruitful:
- Adjusting the rate at which we reset, and the number of resets before booting normally
- Clearing the FPGA bitstream an extra time
- Restricting tasks from accessing the FMC bus
- Removing whole tasks that seemed to be unrelated
Finally, staring at the STM32H7 manual provided an insight: maybe the processor itself was performing accesses on the FMC bus that we weren't expecting! Modern processors hold a large amount of internal state that isn't directly visible to the programmer. Outside of certain synchronization points or explicit cache instructions, a programmer cannot know when a CPU will pull data into or out of the cache. A CPU writing data from the cache back to memory is itself a memory access, so the CPU can be making memory accesses to addresses unrelated to the current program counter.
Hubris utilizes the Memory Protection Unit (MPU) to provide isolation between tasks and enforce privilege levels. Our configuration uses the MPU for unprivileged tasks, but uses the default memory map for the (privileged) kernel. In tasks, the FMC is mapped as uncached Device memory. Based on our reading of the STM32H7 manual, it turned out our chosen base address for the FMC bus had a default memory type of Normal, cached. This means the FMC had different attributes depending on whether it was being accessed from a task or from the kernel.
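For a sense of what the task-side mapping involves, here is a sketch of programming an ARMv7-M MPU region as Device memory over the FMC window, using the cortex-m crate. The region number and size are illustrative, and this is not the actual Hubris configuration code:

```rust
use cortex_m::peripheral::Peripherals;

fn map_fmc_as_device() {
    let p = Peripherals::take().unwrap();
    unsafe {
        // Select MPU region 0 (illustrative choice).
        p.MPU.rnr.write(0);
        // Region base address: FMC NOR/PSRAM bank 1.
        p.MPU.rbar.write(0x6000_0000);
        // RASR: ENABLE | SIZE=25 (2^26 = 64 MiB) | AP=full access,
        // with TEX=0, C=0, B=1, S=1 => shareable Device memory.
        p.MPU.rasr.write(
            (1 << 0)            // ENABLE
                | (25 << 1)     // SIZE
                | (1 << 16)     // B
                | (1 << 18)     // S
                | (0b011 << 24), // AP: read/write at any privilege
        );
    }
}
```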
Section A3.5.7 of the ARMv7-M Architecture Reference Manual is entirely devoted to mismatched memory attributes and which properties are lost in this situation. Based on discussion with our hardware engineers, the line "Preservation of the size of accesses" was the most suspicious: our FPGA interface was designed for 32-bit accesses, and 16-bit or 8-bit accesses could potentially cause problems.
It’s important to note that the kernel was never intentionally accessing the FMC through the Normal Cached mapping. The most likely scenario was:
1. The CPU, running an unprivileged task that accesses the FMC, issues a store that makes it into the processor's store buffer.
2. An interrupt occurs, switching us into privileged mode, which uses the default memory map.
3. The store hits the cache, because the default memory map says that address is cacheable.
4. The cache later writes to memory in ways outside the expected Device memory attributes.
One of the last lines of section A3.5.7 is "Arm strongly recommends that software does not use mismatched attributes for aliases of the same location." The default ARM memory map (which the kernel relies on) assigns different attributes to different sections of the address space, and one of those sections is set up exactly the way we want: Device memory, no caching. It turns out the STM32H7 FMC supports changing its base address to appear in that section of the address space, likely to avoid the very problem we were facing. The final fix was changing the base address to the section with matching attributes. We've seen no instances of this issue since that fix was merged.
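For the shape of the fix: RM0433 documents a BMAP field in FMC_BCR1 (bits 25:24) that remaps where the FMC banks appear in the address space. A sketch with raw register pokes, where the specific BMAP value is illustrative rather than the actual Hubris change:

```rust
/// FMC register block base on the STM32H7, per RM0433.
const FMC_BCR1: *mut u32 = 0x5200_4000 as *mut u32;

fn remap_fmc() {
    unsafe {
        let mut bcr1 = core::ptr::read_volatile(FMC_BCR1);
        // Illustrative BMAP value: remap the FMC bank so it appears in
        // an address region the ARMv7-M default memory map already
        // treats as Device memory, matching the tasks' MPU mapping.
        bcr1 = (bcr1 & !(0b11 << 24)) | (0b01 << 24);
        core::ptr::write_volatile(FMC_BCR1, bcr1);
    }
}
```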
Transparency continues to be an Oxide value. Debugging modern CPUs often involves diving into areas with very little of it: "under what circumstances will you be unable to access your memory bus?" is a tricky question to answer. Our debugging efforts this time were aided by documentation from ARM and ST that eventually explained our problem. Given the difficulty of debugging this issue, highlighting this potential pitfall in vendor documentation would benefit all customers. Oxide hopes all hardware vendors continue to document as much of their parts as possible for the benefit of their customers.