Hierarchical Autoregressive Modeling for Memory-Efficient Language Generation
4 months ago
- #Language Generation
- #Machine Learning
- #Efficiency
- PHOTON introduces a hierarchical autoregressive model for efficient language generation.
- It replaces flat token scanning with vertical, multi-resolution context access.
- PHOTON maintains a hierarchy of latent streams, each summarizing the context at a different temporal resolution.
- Experiments show PHOTON achieves a better throughput-quality trade-off than Transformer-based baselines.
- PHOTON reduces KV-cache traffic, offering up to 1000x higher throughput per unit memory.
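The post does not give PHOTON's internals, but the core idea — replacing a flat token cache with multi-resolution summaries so memory grows slowly with context length — can be sketched. The function below is a hypothetical illustration (the name `multires_context`, the pooling scheme, and all parameters are assumptions, not PHOTON's actual method): it keeps only the most recent tokens at full resolution plus one mean-pooled summary vector per coarser level.

```python
import numpy as np

def multires_context(tokens: np.ndarray, window: int = 4, levels: int = 3):
    """Build a multi-resolution context pyramid (illustrative sketch only).

    A flat cache keeps all n token vectors; here we keep the most recent
    `window` tokens at full resolution plus one pooled summary vector per
    coarser level, so the cache holds O(window + levels) vectors.
    """
    context = [tokens[-window:]]      # finest level: recent raw tokens
    history = tokens[:-window]        # older tokens to be summarized
    stride = window
    for _ in range(levels):
        if len(history) == 0:
            break
        chunk = history[-stride:]     # next-older span at a coarser stride
        # collapse the span into a single summary vector
        context.append(chunk.mean(axis=0, keepdims=True))
        history = history[:-stride]
        stride *= 2                   # each level covers twice the span
    # order coarse-to-fine, oldest summaries first
    return np.concatenate(context[::-1], axis=0)

# Toy demo: 64 tokens with 8-dim embeddings
rng = np.random.default_rng(0)
toks = rng.standard_normal((64, 8))
ctx = multires_context(toks)
print(ctx.shape)  # (7, 8): 3 summaries + 4 recent tokens, vs. 64 for a flat cache
```

The memory saving in the sketch comes from attending to 7 vectors instead of 64; a real hierarchical model would learn the summarization rather than mean-pool, but the cache-size argument is the same.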