Post-transformer inference: 224× compression of Llama-70B with improved accuracy
- #transformer-free inference
- #model compression
- #low-rank manifold
- Introduces a validated method for removing transformers from the inference path while improving downstream accuracy.
- A frozen 70B-parameter Llama model can be replaced by a 256-dimensional "meaning field" extracted from seven of its internal activation layers (an extraction sketch follows this list).
- A lightweight compressor (AN1) reduces the fields by 224× with an average gain of +1.81 percentage points on classification tasks (see the bottleneck sketch below).
- A 30M-parameter student model regenerates these fields directly from raw text, enabling transformer-free inference at 60× higher throughput with minimal accuracy loss (a student sketch follows this list).
- Task-aligned semantics in transformers occupy a low-rank manifold, with 72–99% of variance captured by the top one to three principal dimensions (see the PCA check below).
- Establishes Field Processing Units (FPUs) as a post-transformer compute primitive that replaces deep stacks of matrix multiplications with shallow operations over fields (a probe sketch closes this list).
- Results are averaged over five seeds with statistical significance reported; ablations isolate causal contributions.
- The Zenodo release includes the scientific manuscript and a baseline AN1 Core implementation, with proprietary optimizations removed, to enable independent verification.
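
One reading of the first two bullets is that the raw field is a concatenation of pooled activations from seven layers, which AN1 then compresses to 256 dimensions. The sketch below follows that reading: it taps seven hidden states of a frozen Llama checkpoint and mean-pools over tokens. The checkpoint name, tap-layer indices, and pooling choice are assumptions for illustration, not the paper's recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"       # any Llama checkpoint works for the sketch
TAP_LAYERS = [8, 20, 32, 44, 56, 68, 80]  # hypothetical: seven tap points in an 80-layer model

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModel.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
lm.eval()

@torch.no_grad()
def raw_field(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    out = lm(**ids, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each (1, seq, hidden)
    pooled = [out.hidden_states[i].mean(dim=1) for i in TAP_LAYERS]
    return torch.cat(pooled, dim=-1)      # (1, 7 * 8192) = (1, 57344)
```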
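The 224× figure is consistent with that reading: seven 8192-dimensional Llama-70B hidden states concatenate to 7 × 8192 = 57,344 dimensions, and 57,344 / 256 = 224. Below is a minimal bottleneck compressor under that assumption; the actual AN1 architecture and training objective are not described in this summary, so the reconstruction loss is illustrative only.

```python
import torch
import torch.nn as nn

IN_DIM = 7 * 8192   # concatenated hidden states (57,344 dims)
FIELD_DIM = 256     # compressed meaning field: 57,344 / 256 = 224x smaller

class An1Sketch(nn.Module):
    """Hypothetical linear bottleneck standing in for the AN1 compressor."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(IN_DIM, FIELD_DIM)
        self.dec = nn.Linear(FIELD_DIM, IN_DIM)

    def forward(self, x):
        z = self.enc(x)           # (batch, 256) compressed field
        return self.dec(z), z     # reconstruction plus the field itself

model = An1Sketch()
x = torch.randn(4, IN_DIM)
recon, field = model(x)
loss = nn.functional.mse_loss(recon, x)  # assumed reconstruction objective
```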
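For the 30M-parameter student, a minimal attention-free sketch: a shallow convolutional encoder distilled to match the teacher's 256-dimensional fields. The vocabulary size, architecture, and MSE distillation loss are all assumptions; the release's student may differ substantially.

```python
import torch
import torch.nn as nn

class FieldStudent(nn.Module):
    """Hypothetical attention-free student that maps token ids to a field."""
    def __init__(self, vocab=32000, d=512, field_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.mix = nn.Sequential(  # shallow, attention-free token mixing
            nn.Conv1d(d, d, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=5, padding=2), nn.GELU(),
        )
        self.head = nn.Linear(d, field_dim)

    def forward(self, ids):                  # ids: (batch, seq)
        h = self.emb(ids).transpose(1, 2)    # (batch, d, seq) for Conv1d
        h = self.mix(h).mean(dim=-1)         # pool over token positions
        return self.head(h)                  # (batch, 256) predicted field

student = FieldStudent()
teacher_field = torch.randn(8, 256)          # would come from the frozen pipeline
pred = student(torch.randint(0, 32000, (8, 128)))
loss = nn.functional.mse_loss(pred, teacher_field)  # field distillation
```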
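The low-rank claim suggests a simple check: run PCA over a set of task-labeled fields and report the cumulative explained variance of the leading components. The sketch below uses synthetic data with a planted two-dimensional subspace; real fields would come from the pipeline above.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in: 1,000 fields with a dominant 2-D task subspace plus noise
basis = rng.normal(size=(2, 256))
fields = rng.normal(size=(1000, 2)) @ basis + 0.05 * rng.normal(size=(1000, 256))

centered = fields - fields.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("top-1/2/3 cumulative variance:", np.cumsum(explained)[:3])
```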
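The FPU bullet contrasts deep matmul stacks with shallow operations over fields. One concrete, and again assumed, instance is a single linear probe serving a downstream task directly from the 256-dimensional field:

```python
import torch
import torch.nn as nn

probe = nn.Linear(256, 4)    # e.g., a 4-way classification head (hypothetical task)
field = torch.randn(1, 256)  # a field from AN1 or the student above
logits = probe(field)        # one shallow matmul instead of a deep decoder stack
print(logits.argmax(dim=-1))
```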