
Post-transformer inference: 224× compression of Llama-70B with improved accuracy

2 days ago
  • #transformer-free inference
  • #model compression
  • #low-rank manifold
  • Introduces a verified method to eliminate transformers from inference while improving downstream accuracy.
  • A frozen 70B-parameter Llama model can be replaced by a 256-dimensional "meaning field" extracted from seven internal activation layers.
  • A lightweight compressor (AN1) reduces the fields by 224× with an average gain of +1.81 percentage points on classification tasks (a field-extraction sketch follows this list).
  • A 30M-parameter student model regenerates these fields from raw text, enabling transformer-free inference at 60× higher throughput with minimal accuracy loss (see the distillation sketch below).
  • Task-aligned semantics in transformers occupy a low-rank manifold, with 72–99% of the variance captured by the top one to three dimensions (see the PCA check below).
  • Establishes Field Processing Units (FPUs) as a post-transformer compute primitive, replacing deep stacks of matrix multiplications with shallow field operations (see the final sketch below).
  • Results are averaged over five seeds with statistical significance reported; ablations isolate causal contributions.
  • The Zenodo release includes the scientific manuscript and a baseline AN1 Core system implementation, with proprietary optimizations removed, for independent verification.
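
A minimal sketch of the field-extraction idea, under stated assumptions: mean-pool hidden states from seven layers of the frozen model and project the concatenation to 256 dimensions. The tap layers, the pooling, and the untrained projection are placeholders, not the AN1 recipe. One sanity check that the numbers cohere: for Llama-70B (hidden size 8192), concatenating seven layers gives 7 × 8192 = 57,344 dims, and 57,344 / 256 = 224, matching the stated compression ratio.

```python
# Sketch only: pool hidden states from seven layers of a frozen LM and
# project the concatenation to a 256-dim "meaning field". Layer indices,
# pooling, and the projection are illustrative, not the paper's AN1 method.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"  # gated/huge; any small LM works to try this
LAYERS = [4, 8, 12, 16, 20, 24, 28]  # seven tap points (assumed, not from the paper)
FIELD_DIM = 256

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()
proj = torch.nn.Linear(len(LAYERS) * lm.config.hidden_size, FIELD_DIM, bias=False)

@torch.no_grad()
def meaning_field(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hs = lm(**ids).hidden_states                     # (n_layers + 1) tensors of [1, T, d]
    pooled = torch.cat([hs[i].mean(dim=1) for i in LAYERS], dim=-1)  # [1, 7*d]
    return proj(pooled).squeeze(0)                   # [256]
```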
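A distillation sketch for the student: a small model regresses the teacher's field from raw token ids with plain MSE. The brief does not describe the actual 30M-parameter architecture; the GRU here is a stand-in chosen only because it is transformer-free, and all sizes are toy values.

```python
# Distillation sketch: small transformer-free student regresses teacher fields.
# Architecture and sizes are placeholders, not the release's 30M-param student.
import torch
import torch.nn as nn

class FieldStudent(nn.Module):
    def __init__(self, vocab=32000, dim=512, field_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(dim, field_dim)

    def forward(self, ids):                 # ids: [B, T] token ids
        h, _ = self.rnn(self.emb(ids))      # [B, T, dim]
        return self.head(h.mean(dim=1))     # [B, field_dim]

student = FieldStudent()
opt = torch.optim.AdamW(student.parameters(), lr=3e-4)
ids = torch.randint(0, 32000, (8, 64))      # toy batch of token ids
target = torch.randn(8, 256)                # teacher fields for the same texts
loss = nn.functional.mse_loss(student(ids), target)
loss.backward()
opt.step()
```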
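The low-rank claim is directly checkable: fit PCA on a batch of fields from a labeled task and read off how much variance the leading components explain. The random matrix below is only a stand-in for real stacked fields.

```python
# PCA check of the low-rank claim; replace the random data with real fields.
import numpy as np
from sklearn.decomposition import PCA

fields = np.random.randn(1000, 256)          # placeholder for stacked meaning fields
pca = PCA(n_components=10).fit(fields)
top3 = pca.explained_variance_ratio_[:3].sum()
print(f"variance in top 3 PCs: {top3:.1%}")  # brief reports 72-99% on task-aligned fields
```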
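Finally, the FPU primitive itself is not specified in this brief; as one way to picture "shallow field operations", a downstream task can be served by a tiny head over the 256-dim field instead of a full deep forward pass. The head shape and class count are arbitrary.

```python
# Shallow "field operation" stand-in: classification as a tiny map over the
# 256-dim field rather than a deep transformer forward pass. Illustrative only.
import torch
import torch.nn as nn

NUM_CLASSES = 4                                  # arbitrary example task
fpu_head = nn.Sequential(
    nn.Linear(256, 128), nn.GELU(), nn.Linear(128, NUM_CLASSES)
)
field = torch.randn(8, 256)                      # fields from the student above
logits = fpu_head(field)                         # [8, NUM_CLASSES]
```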