Hasty Briefs

Tiny-LLM – a course on serving LLMs on Apple Silicon, for systems engineers

a year ago
  • #tutorial
  • #LLM
  • #MLX
  • A tutorial on LLM serving using MLX, aimed at systems engineers.
  • The codebase is built on MLX array/matrix APIs, without high-level neural network APIs.
  • Goal: Learn the techniques behind efficiently serving LLMs (e.g., the Qwen2 models).
  • The tiny-llm book is available at https://skyzh.github.io/tiny-llm/.
  • Join skyzh's Discord server to study with the tiny-llm community.
  • Structured weekly chapters covering topics like Attention, RoPE, Grouped Query Attention, and more.
  • Advanced topics include KV Cache, Quantized Matmul, Flash Attention, Continuous Batching, and Speculative Decoding.
  • Future topics include Paged Attention, Prefill-Decode Separation, Scheduler, Parallelism, AI Agent, and Streaming API Server.
  • Other topics not covered: quantized/compressed KV cache.
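The course's premise is building these pieces from raw array operations rather than high-level neural network APIs. As a rough illustration of what that looks like (sketched here in NumPy rather than MLX, since the actual tiny-llm code is not reproduced in this summary), the attention topic boils down to implementing scaled dot-product attention by hand:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) arrays for a single attention head.
    # scores[i, j] measures how much query i attends to key j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # weighted sum of values

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8): one output vector per query position
```

In MLX the structure is essentially the same, with `mx.array` operations in place of NumPy ones; the later chapters (KV cache, flash attention, etc.) are optimizations of this basic kernel.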