Hasty Briefs


How Taalas "prints" an LLM onto a chip

3 days ago
  • #ASIC
  • #LLM
  • #Hardware
  • Taalas has developed an ASIC chip that runs Llama 3.1 8B at 17,000 tokens per second, which is significantly faster and more efficient than GPU-based systems.
  • The chip hardwires the model's weights directly onto the silicon, eliminating the need for constant data fetching from memory, thus overcoming the memory bandwidth bottleneck.
  • Taalas uses a 'magic multiplier' technique that performs a 4-bit multiplication with a single transistor, greatly improving density and efficiency.
  • The chip uses no external DRAM/HBM; instead it holds the KV cache and LoRA adapters in on-chip SRAM, sidestepping the supply-chain issues associated with DRAM.
  • Taalas designed a base chip with a generic grid of logic gates, allowing customization of the top layers for different models, reducing development time.
  • The development of the Llama 3.1 8B chip took two months, which is considered fast in the custom chip industry but slow compared to AI software development cycles.
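The memory-bandwidth bottleneck the chip sidesteps can be illustrated with back-of-the-envelope arithmetic: in single-stream decoding, every generated token must stream all model weights from memory, so memory bandwidth caps tokens per second. The bandwidth figure and fp16 precision below are illustrative assumptions (roughly one H100's HBM3), not numbers from the article.

```python
# Rough decode-throughput ceiling for a memory-bound LLM:
# each generated token reads every weight from memory once.
params = 8e9                 # Llama 3.1 8B parameter count
bytes_per_param = 2          # fp16 weights (assumed)
bandwidth = 3.35e12          # bytes/s, ~H100 HBM3 (assumed)

weight_bytes = params * bytes_per_param       # 16 GB read per token
max_tokens_per_s = bandwidth / weight_bytes   # bandwidth-limited ceiling

print(f"~{max_tokens_per_s:.0f} tok/s single-stream ceiling")
```

GPUs recover aggregate throughput by batching many streams over each weight read; hardwiring the weights into silicon removes the per-token weight traffic entirely, which is what makes a 17,000 tok/s figure plausible.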
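The article does not describe the 'magic multiplier' circuit itself, so the sketch below only illustrates the numerical side of 4-bit weights: quantizing to a signed 4-bit range with a scale factor and computing a dequantized dot product. The quantization scheme and all names are assumptions for illustration, not Taalas's design.

```python
import numpy as np

# Illustrative 4-bit weight quantization (not Taalas's circuit):
# weights stored as signed int4 plus a per-tensor scale.
rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)   # original fp32 weights
x = rng.standard_normal(8).astype(np.float32)   # activations

scale = np.abs(w).max() / 7                     # map into [-7, 7]
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)

approx = float((q * scale) @ x)                 # 4-bit approximation
exact = float(w @ x)                            # full-precision reference
print(approx, exact)
```

The hardware's job is to make the `q * x` multiplies extremely cheap; the article claims Taalas gets each one down to a single transistor.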
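Keeping the KV cache in on-chip SRAM is helped by the fact that Llama 3.1 8B uses grouped-query attention (8 KV heads versus 32 query heads), which shrinks the cache considerably. The sizing below uses the model's published architecture; the fp16 precision and context length are illustrative assumptions, since the article gives no capacity figures.

```python
# KV cache size per token for Llama 3.1 8B (published architecture).
layers = 32
kv_heads = 8        # grouped-query attention
head_dim = 128
bytes_per_val = 2   # fp16 (assumed; real hardware may quantize further)

per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(f"{per_token / 1024:.0f} KiB per token")

context = 8192      # illustrative context length (assumed)
print(f"{per_token * context / 2**20:.0f} MiB for {context} tokens")
```

At fp16 a long context would still outgrow typical on-chip SRAM budgets, which suggests aggressive KV quantization or modest context windows in practice; the article does not say which.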