
Show HN: Serve 100 Large AI models on a single GPU with low impact to TTFT

14 days ago
  • #AI
  • #GPU Optimization
  • #Inference Engine
  • Flashtensors is a blazing-fast inference engine that loads models from SSD to GPU VRAM up to 10x faster than alternative loaders.
  • Hot-swap large models in under 2 seconds, drastically reducing cold-start times.
  • Traditional model loaders slow down workflows with painful startup times; flashtensors removes this loading bottleneck.
  • Host hundreds of models on a single device and hot-swap them on demand with minimal effect on user experience.
  • Run Agentic workflows on constrained devices like robots and wearables.
  • Use cases include Affordable Personalized AI, Serverless AI Inference, On-Prem Deployments, Robotics, and Local Inference.
  • Install via pip and use commands like 'flash start', 'flash pull', and 'flash run' to manage and execute models.
  • Python API allows configuration, model registration, loading, inference, and cleanup (see the sketch after this list).
  • Benchmarks show flashtensors is ~4–6× faster than safetensors, with cold starts under 5 seconds even for 32B-parameter models.
  • Future plans include Docker Integration, Inference Server, SGLang Integration, LlamaCPP Integration, Dynamo Integration, and Ollama Integration.
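The post summarizes the Python workflow only at a high level, so the following is a minimal sketch of the configure → register → load → infer → cleanup cycle that bullet describes. Every name in it (EngineConfig, FlashEngine, register, load, infer, unload) is a hypothetical placeholder standing in for the real flashtensors API, not the library's confirmed interface; consult the project's README for the actual calls.

```python
# Hypothetical sketch of the lifecycle described above: configure the engine,
# register a model, load it into VRAM, run inference, then release the memory.
# NOTE: all class and method names here are illustrative placeholders, not
# flashtensors' confirmed API.

from dataclasses import dataclass


@dataclass
class EngineConfig:
    # Where serialized weights live on the SSD and which GPU to target (assumed fields).
    model_dir: str = "/models"
    device: str = "cuda:0"


class FlashEngine:
    """Placeholder engine mirroring the register -> load -> infer -> cleanup cycle."""

    def __init__(self, config: EngineConfig):
        self.config = config
        self.registry: dict[str, str] = {}   # model id -> path on disk
        self.loaded: set[str] = set()        # models currently resident in VRAM

    def register(self, model_id: str, path: str) -> None:
        # Make a model known to the engine without touching the GPU yet.
        self.registry[model_id] = path

    def load(self, model_id: str) -> None:
        # In the real engine this would be the fast SSD -> VRAM transfer path.
        self.loaded.add(model_id)

    def infer(self, model_id: str, prompt: str) -> str:
        if model_id not in self.loaded:
            self.load(model_id)              # hot-swap on demand
        return f"[{model_id}] completion for: {prompt}"

    def unload(self, model_id: str) -> None:
        self.loaded.discard(model_id)        # free VRAM for the next model


if __name__ == "__main__":
    engine = FlashEngine(EngineConfig())
    engine.register("example-32b", "/models/example-32b")
    print(engine.infer("example-32b", "Summarize today's sensor log."))
    engine.unload("example-32b")
```

On the command line, the same lifecycle is driven by the 'flash start', 'flash pull', and 'flash run' commands mentioned above.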