Hasty Briefs

Tiny-LLM – a course on serving LLMs on Apple Silicon, for systems engineers

a year ago
  • #tutorial
  • #LLM
  • #MLX
  • A tutorial on LLM serving using MLX, aimed at systems engineers.
  • The codebase is built on MLX array/matrix APIs, without high-level neural network APIs.
  • Goal: Learn the techniques behind efficiently serving LLMs (e.g., the Qwen2 models).
  • The tiny-llm book is available at https://skyzh.github.io/tiny-llm/.
  • Join skyzh's Discord server to study with the tiny-llm community.
  • Structured weekly chapters covering topics like Attention, RoPE, Grouped Query Attention, and more.
  • Advanced topics include KV Cache, Quantized Matmul, Flash Attention, Continuous Batching, and Speculative Decoding.
  • Future topics include Paged Attention, Prefill-Decode Separation, Scheduler, Parallelism, AI Agent, and Streaming API Server.
  • Other topics not covered: quantized/compressed KV cache.
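The course's premise is building these pieces from raw array operations rather than high-level neural network APIs. As a rough illustration of what that looks like (sketched here in NumPy rather than MLX, since the actual tiny-llm code is not reproduced in this summary), the attention topic boils down to implementing scaled dot-product attention by hand:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) arrays for a single attention head.
    # scores[i, j] measures how much query i attends to key j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # weighted sum of values

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8): one output vector per query position
```

In MLX the structure is essentially the same, with `mx.array` operations in place of NumPy ones; the later chapters (KV cache, flash attention, etc.) are optimizations of this basic kernel.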