MLX-Serve: A Native LLM Runtime for Apple Silicon
- #macOS App
- #AI Inference
- #OpenAI-Compatible
- An inference server and macOS menu bar app, built in Zig and Swift, that works as a drop-in replacement for the OpenAI API with chat completions, streaming, tool calling, embeddings, and logprobs.
- Uses direct MLX-C bindings with no Python runtime for fast inference, reuses the KV cache across requests, and runs quantized MLX-format models from HuggingFace, with 7 built-in tools and prompt-based skills extendable via markdown files.
- Streams responses in real time over SSE with automatic tool-call detection for multi-turn reasoning, ships a native macOS app for managing models and chats, and supports models from Google, Alibaba, Meta, and Mistral AI across a range of parameter sizes.
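Because the server advertises OpenAI API compatibility, a standard chat-completions request body should work against it. The sketch below builds such a request with one tool definition; the base URL, port, and model name are assumptions for illustration, not values documented by MLX-Serve.

```python
import json

# Hypothetical local endpoint; MLX-Serve's actual host/port may differ.
BASE_URL = "http://localhost:8080/v1/chat/completions"

# An OpenAI-style chat completion request with one tool definition.
# The schema follows the OpenAI Chat Completions API that MLX-Serve
# is compatible with; the model name below is illustrative.
payload = {
    "model": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in Cupertino?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "stream": True,  # ask the server to stream tokens over SSE
}

body = json.dumps(payload)
print(json.loads(body)["tools"][0]["function"]["name"])  # -> get_weather
```

Any OpenAI client library pointed at the local base URL should be able to send this payload unchanged.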
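The real-time streaming mentioned above follows the usual OpenAI SSE format: each event is a `data: {json}` line carrying a delta, terminated by `data: [DONE]`. A minimal parser for that format, run here against a simulated stream rather than a live server, might look like:

```python
import json

def parse_sse_chunks(raw: str) -> str:
    """Concatenate delta content from OpenAI-style SSE lines.

    Each event line looks like `data: {json}`; the stream ends
    with `data: [DONE]`.
    """
    text = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # role-only deltas carry no text
            text.append(delta["content"])
    return "".join(text)

# Simulated stream, shaped like what an OpenAI-compatible server emits:
raw = (
    'data: {"choices":[{"delta":{"role":"assistant"}}]}\n'
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n'
    'data: {"choices":[{"delta":{"content":", world"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse_chunks(raw))  # -> Hello, world
```

In a real client the same loop would iterate over the HTTP response body line by line; tool-call deltas would arrive under `delta["tool_calls"]` instead of `delta["content"]`.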