Tiny-LLM – a course on serving LLMs on Apple Silicon for systems engineers
a year ago
- #tutorial
- #LLM
- #MLX
- A tutorial on LLM serving using MLX, aimed at systems engineers.
- The codebase is built directly on MLX array/matrix APIs, without the high-level neural network APIs, so every layer is implemented from scratch (a minimal sketch of this style follows the list below).
- Goal: learn the techniques behind efficiently serving LLMs (e.g., the Qwen2 family).
- The tiny-llm book is available at https://skyzh.github.io/tiny-llm/.
- Join skyzh's Discord server to study with the tiny-llm community.
- Structured as weekly chapters covering topics such as Attention, RoPE, Grouped Query Attention, and more.
- Advanced topics include KV Cache (a conceptual sketch also follows the list below), Quantized Matmul, Flash Attention, Continuous Batching, and Speculative Decoding.
- Future topics include Paged Attention, Prefill-Decode Separation, Scheduler, Parallelism, AI Agent, and Streaming API Server.
- Other topics not covered: quantized/compressed KV cache.
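
To give a feel for the "array APIs only" style the course uses, here is a minimal sketch of scaled dot-product attention written purely with `mlx.core` operations. This is an illustrative example, not the course's actual code; the function name, argument shapes, and masking convention are assumptions.

```python
import mlx.core as mx

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, head_dim) arrays; no mlx.nn layers involved.
    scale = q.shape[-1] ** -0.5
    scores = (q * scale) @ mx.swapaxes(k, -2, -1)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g. -inf above the diagonal
    weights = mx.softmax(scores, axis=-1)
    return weights @ v

# Example: self-attention over random activations (batch, heads, seq, head_dim).
q = mx.random.normal((1, 8, 16, 64))
out = scaled_dot_product_attention(q, q, q)
print(out.shape)  # (1, 8, 16, 64)
```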
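
Likewise, the KV Cache chapter can be pictured as growing the key/value arrays across decode steps so each new token attends to everything generated so far. A conceptual sketch under that assumption (the class and method names are illustrative, not the course's API):

```python
import mlx.core as mx

class KVCache:
    """Accumulates keys/values along the sequence axis, one decode step at a time."""

    def __init__(self):
        self.keys = None
        self.values = None

    def update(self, k, v):
        # k, v: (batch, heads, new_tokens, head_dim) for the current step.
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = mx.concatenate([self.keys, k], axis=2)
            self.values = mx.concatenate([self.values, v], axis=2)
        return self.keys, self.values
```

Real implementations typically preallocate and reuse buffers rather than concatenating on every step; this sketch only shows the conceptual shape of the technique.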