
Post-transformer inference: 224× compression of Llama-70B with improved accuracy

2 days ago
  • #transformer-free inference
  • #model compression
  • #low-rank manifold
  • Introduces a verified method to eliminate transformers from inference while improving downstream accuracy.
  • A frozen 70B-parameter Llama model can be replaced by a 256-dimensional "meaning field" extracted from seven internal activation layers.
  • A lightweight compressor (AN1) reduces the fields by 224× with an average gain of +1.81 percentage points on classification tasks (a field-extraction sketch follows this list).
  • A 30M-parameter student model regenerates these fields from raw text, enabling transformer-free inference at 60× higher throughput with minimal accuracy loss (see the distillation sketch below).
  • Task-aligned semantics in transformers occupy a low-rank manifold, with 72–99% of the variance captured by the top one to three dimensions (see the PCA check below).
  • Establishes Field Processing Units (FPUs) as a post-transformer compute primitive, replacing deep stacks of matrix multiplications with shallow field operations (see the final sketch below).
  • Results are averaged over five seeds with statistical significance reported; ablations isolate causal contributions.
  • The Zenodo release includes the scientific manuscript and a baseline AN1 Core system implementation, with proprietary optimizations removed, for independent verification.
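
A minimal sketch of the field-extraction idea, under stated assumptions: mean-pool hidden states from seven layers of the frozen model and project the concatenation to 256 dimensions. The tap layers, the pooling, and the untrained projection are placeholders, not the AN1 recipe. One sanity check that the numbers cohere: for Llama-70B (hidden size 8192), concatenating seven layers gives 7 × 8192 = 57,344 dims, and 57,344 / 256 = 224, matching the stated compression ratio.

```python
# Sketch only: pool hidden states from seven layers of a frozen LM and
# project the concatenation to a 256-dim "meaning field". Layer indices,
# pooling, and the projection are illustrative, not the paper's AN1 method.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"  # gated/huge; any small LM works to try this
LAYERS = [4, 8, 12, 16, 20, 24, 28]  # seven tap points (assumed, not from the paper)
FIELD_DIM = 256

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()
proj = torch.nn.Linear(len(LAYERS) * lm.config.hidden_size, FIELD_DIM, bias=False)

@torch.no_grad()
def meaning_field(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hs = lm(**ids).hidden_states                     # (n_layers + 1) tensors of [1, T, d]
    pooled = torch.cat([hs[i].mean(dim=1) for i in LAYERS], dim=-1)  # [1, 7*d]
    return proj(pooled).squeeze(0)                   # [256]
```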
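A distillation sketch for the student: a small model regresses the teacher's field from raw token ids with plain MSE. The brief does not describe the actual 30M-parameter architecture; the GRU here is a stand-in chosen only because it is transformer-free, and all sizes are toy values.

```python
# Distillation sketch: small transformer-free student regresses teacher fields.
# Architecture and sizes are placeholders, not the release's 30M-param student.
import torch
import torch.nn as nn

class FieldStudent(nn.Module):
    def __init__(self, vocab=32000, dim=512, field_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(dim, field_dim)

    def forward(self, ids):                 # ids: [B, T] token ids
        h, _ = self.rnn(self.emb(ids))      # [B, T, dim]
        return self.head(h.mean(dim=1))     # [B, field_dim]

student = FieldStudent()
opt = torch.optim.AdamW(student.parameters(), lr=3e-4)
ids = torch.randint(0, 32000, (8, 64))      # toy batch of token ids
target = torch.randn(8, 256)                # teacher fields for the same texts
loss = nn.functional.mse_loss(student(ids), target)
loss.backward()
opt.step()
```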
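The low-rank claim is directly checkable: fit PCA on a batch of fields from a labeled task and read off how much variance the leading components explain. The random matrix below is only a stand-in for real stacked fields.

```python
# PCA check of the low-rank claim; replace the random data with real fields.
import numpy as np
from sklearn.decomposition import PCA

fields = np.random.randn(1000, 256)          # placeholder for stacked meaning fields
pca = PCA(n_components=10).fit(fields)
top3 = pca.explained_variance_ratio_[:3].sum()
print(f"variance in top 3 PCs: {top3:.1%}")  # brief reports 72-99% on task-aligned fields
```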
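Finally, the FPU primitive itself is not specified in this brief; as one way to picture "shallow field operations", a downstream task can be served by a tiny head over the 256-dim field instead of a full deep forward pass. The head shape and class count are arbitrary.

```python
# Shallow "field operation" stand-in: classification as a tiny map over the
# 256-dim field rather than a deep transformer forward pass. Illustrative only.
import torch
import torch.nn as nn

NUM_CLASSES = 4                                  # arbitrary example task
fpu_head = nn.Sequential(
    nn.Linear(256, 128), nn.GELU(), nn.Linear(128, NUM_CLASSES)
)
field = torch.randn(8, 256)                      # fields from the student above
logits = fpu_head(field)                         # [8, NUM_CLASSES]
```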