Hasty Briefs


How Taalas "prints" an LLM onto a chip

3 days ago
  • #ASIC
  • #LLM
  • #Hardware
  • Taalas has developed an ASIC chip that runs Llama 3.1 8B at 17,000 tokens per second, which is significantly faster and more efficient than GPU-based systems.
  • The chip hardwires the model's weights directly onto the silicon, eliminating the need for constant data fetching from memory, thus overcoming the memory bandwidth bottleneck.
  • Taalas uses a 'magic multiplier' technique that performs a 4-bit multiplication with a single transistor, greatly improving density and efficiency.
  • The chip uses no external DRAM/HBM; instead it holds the KV cache and LoRA adapters in on-chip SRAM, sidestepping the supply-chain issues associated with DRAM.
  • Taalas designed a base chip with a generic grid of logic gates, allowing customization of the top layers for different models, reducing development time.
  • The development of the Llama 3.1 8B chip took two months, which is considered fast in the custom chip industry but slow compared to AI software development cycles.
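The memory-bandwidth bottleneck the chip sidesteps can be illustrated with back-of-the-envelope arithmetic: in single-stream decoding, every generated token must stream all model weights from memory, so memory bandwidth caps tokens per second. The bandwidth figure and fp16 precision below are illustrative assumptions (roughly one H100's HBM3), not numbers from the article.

```python
# Rough decode-throughput ceiling for a memory-bound LLM:
# each generated token reads every weight from memory once.
params = 8e9                 # Llama 3.1 8B parameter count
bytes_per_param = 2          # fp16 weights (assumed)
bandwidth = 3.35e12          # bytes/s, ~H100 HBM3 (assumed)

weight_bytes = params * bytes_per_param       # 16 GB read per token
max_tokens_per_s = bandwidth / weight_bytes   # bandwidth-limited ceiling

print(f"~{max_tokens_per_s:.0f} tok/s single-stream ceiling")
```

GPUs recover aggregate throughput by batching many streams over each weight read; hardwiring the weights into silicon removes the per-token weight traffic entirely, which is what makes a 17,000 tok/s figure plausible.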
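The article does not describe the 'magic multiplier' circuit itself, so the sketch below only illustrates the numerical side of 4-bit weights: quantizing to a signed 4-bit range with a scale factor and computing a dequantized dot product. The quantization scheme and all names are assumptions for illustration, not Taalas's design.

```python
import numpy as np

# Illustrative 4-bit weight quantization (not Taalas's circuit):
# weights stored as signed int4 plus a per-tensor scale.
rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)   # original fp32 weights
x = rng.standard_normal(8).astype(np.float32)   # activations

scale = np.abs(w).max() / 7                     # map into [-7, 7]
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)

approx = float((q * scale) @ x)                 # 4-bit approximation
exact = float(w @ x)                            # full-precision reference
print(approx, exact)
```

The hardware's job is to make the `q * x` multiplies extremely cheap; the article claims Taalas gets each one down to a single transistor.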
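Keeping the KV cache in on-chip SRAM is helped by the fact that Llama 3.1 8B uses grouped-query attention (8 KV heads versus 32 query heads), which shrinks the cache considerably. The sizing below uses the model's published architecture; the fp16 precision and context length are illustrative assumptions, since the article gives no capacity figures.

```python
# KV cache size per token for Llama 3.1 8B (published architecture).
layers = 32
kv_heads = 8        # grouped-query attention
head_dim = 128
bytes_per_val = 2   # fp16 (assumed; real hardware may quantize further)

per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(f"{per_token / 1024:.0f} KiB per token")

context = 8192      # illustrative context length (assumed)
print(f"{per_token * context / 2**20:.0f} MiB for {context} tokens")
```

At fp16 a long context would still outgrow typical on-chip SRAM budgets, which suggests aggressive KV quantization or modest context windows in practice; the article does not say which.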