80386 Early Start Memory Access
7 hours ago
- #FPGA
- #80386
- #Performance Optimization
- Intel's 80386 featured an 'Early Start' mechanism to hide memory latency by initiating the next instruction's address calculations in the last cycle of the current instruction.
- The z386 FPGA core, after adding Early Start and other optimizations, achieved ao486-class performance, with significant improvements in benchmarks like Doom (16.6 to 23.0 FPS) and 3DBench.
- Early Start relies on microcode and forwarding networks to handle data hazards, including a corner case that causes the POPAD bug.
- Implementing Early Start in z386 required forwarding logic for registers and stack pointers, with effective addresses computed combinationally in the i_pop cycle.
- Further performance gains came from tightening the store queue, issuing reads/writes earlier, splitting the cache (16KB+16KB), and implementing early branch redirect for direct relative branches.
- A wider frontend with a 32-byte prefetch queue, single-cycle structural decoder, and improved refill bandwidth reduced decode-queue-empty stalls from ~20% to under 10%.
- Holding the clock at 85 MHz while keeping the CPI improvements required adder carry-chain fusion and other timing cleanups.
- z386 0.4 matches or exceeds ao486 performance and offers an alternative open-source x86 core. It does not yet boot Windows, and it would benefit from bug reports and community contributions.
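
The Early Start idea summarized above can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the 80386 microcode or the z386 RTL: the `Instr` class, the `total_cycles` function, the one-cycle `ADDR_CYCLES` cost, and the per-instruction cycle counts are all hypothetical values chosen for illustration. It shows an address calculation being hidden in the previous instruction's last cycle, and why a forwarding path is needed when that previous instruction writes the base register in the same cycle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    name: str
    cycles: int                       # made-up execution cycles, excluding address calc
    base_reg: Optional[str] = None    # register used to form a memory address, if any
    writes_reg: Optional[str] = None  # register written in the final cycle, if any


ADDR_CYCLES = 1  # assumed cost of one effective-address calculation


def total_cycles(program: List[Instr], early_start: bool, forwarding: bool = True) -> int:
    """Count cycles for a straight-line program in this toy model."""
    total = 0
    prev: Optional[Instr] = None
    for ins in program:
        total += ins.cycles
        if ins.base_reg is not None:
            # Hazard: the previous instruction writes the base register in the
            # very cycle in which the early address calculation would read it.
            hazard = prev is not None and prev.writes_reg == ins.base_reg
            # The calculation overlaps the previous instruction's last cycle only
            # if Early Start is on and the hazard (if any) is covered by forwarding.
            overlapped = early_start and prev is not None and (forwarding or not hazard)
            if not overlapped:
                total += ADDR_CYCLES
        prev = ins
    return total


program = [
    Instr("MOV EBX, imm", cycles=2, writes_reg="EBX"),
    Instr("MOV EAX, [EBX]", cycles=4, base_reg="EBX", writes_reg="EAX"),
    Instr("ADD ECX, [ESI]", cycles=6, base_reg="ESI", writes_reg="ECX"),
]

print(total_cycles(program, early_start=False))                    # 14
print(total_cycles(program, early_start=True))                     # 12
print(total_cycles(program, early_start=True, forwarding=False))   # 13
```

In this model, each memory-operand instruction saves one cycle when its address calculation is overlapped. Without forwarding, the second instruction loses that saving because `EBX` is still being written when its address would be formed; a real design must either forward the new value or fall back to a late start.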