A Disappearing Service Processor
2 days ago
- #Hardware Debugging
- #Rust
- #Embedded Systems
- The Oxide rack is designed to be managed over the network rather than through physical access; the Service Processor (SP) is reached via the management network.
- Debugging an issue where the SP dropped off the network involved examining system state such as CPU activity, network counters, and fan behavior.
- Hubris, the custom OS for the SP, schedules tasks by priority. One theory was that a task stuck in an infinite crash loop was starving other tasks, which led to adjustments in task restart delays.
- Stack overflows in Hubris were also considered, since stacks are sized manually, but the kernel stack had a large margin, making this an unlikely cause.
- Debugging escalated to using the SWD debug headers, which revealed that the CPU could not be halted and pointed to potential issues with the FPGA and the FMC bus.
- A vector catch reset was used to preserve Hubris state in RAM, which aided debugging without fully halting the CPU.
- FPGA timing issues were identified and fixed, but the problem persisted, which prompted further investigation of CPU cache behavior and memory access attributes.
- The root cause was a mismatch in memory attributes between kernel and task accesses to the FMC bus; it was resolved by placing the FMC base address in a region with device memory attributes.
- The solution highlights the importance of vendor documentation in debugging complex hardware-software interactions.
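
The task-starvation theory from the summary can be illustrated with a toy model. The sketch below is not Hubris's scheduler; it is a minimal strict-priority simulation in Rust, and the names `Task`, `simulate`, and `restart_delay`, as well as the convention that a lower number means a more important task, are assumptions made for illustration. It shows how a crash-looping high-priority task that restarts immediately can consume every tick, and how a restart delay leaves room for other tasks such as networking.

```rust
/// Toy task model (hypothetical, not Hubris's real data structures).
struct Task {
    /// Assumption of this sketch: a lower number is more important.
    priority: u8,
    /// Tick at which the task next becomes runnable.
    ready_at: u64,
    /// True if the task faults every time it runs (a crash loop).
    crashes: bool,
    /// Number of ticks in which the task was given the CPU.
    ran: u64,
}

/// Simulates `ticks` scheduler ticks. A crashing task becomes runnable
/// again `restart_delay` ticks after it faults. Returns ticks run per task.
fn simulate(ticks: u64, restart_delay: u64) -> Vec<u64> {
    let mut tasks = vec![
        // High-priority task stuck in a crash loop.
        Task { priority: 0, ready_at: 0, crashes: true, ran: 0 },
        // Lower-priority task, standing in for the network stack.
        Task { priority: 1, ready_at: 0, crashes: false, ran: 0 },
    ];
    for now in 0..ticks {
        // Strict priority: pick the most important runnable task.
        let next = tasks
            .iter_mut()
            .filter(|t| t.ready_at <= now)
            .min_by_key(|t| t.priority);
        if let Some(t) = next {
            t.ran += 1;
            if t.crashes {
                // The task faults and is held back until the delay expires.
                t.ready_at = now + 1 + restart_delay;
            }
        }
    }
    tasks.iter().map(|t| t.ran).collect()
}

fn main() {
    let no_delay = simulate(1000, 0);
    let with_delay = simulate(1000, 9);
    println!("no delay:   crasher={} network={}", no_delay[0], no_delay[1]);
    println!("with delay: crasher={} network={}", with_delay[0], with_delay[1]);
}
```

In this model, a restart delay of zero gives the lower-priority task no ticks at all, while a delay of nine ticks leaves it nine out of every ten. Real schedulers are more involved, so the numbers only illustrate the shape of the problem.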