FastVLM: Dramatically Faster Vision Language Model from Apple
- #Computer Vision
- #Machine Learning
- #AI Efficiency
- FastVLM introduces FastViTHD, a hybrid vision encoder for high-resolution images that outputs fewer visual tokens and significantly reduces encoding time.
- The smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster time-to-first-token (TTFT) and a 3.4x smaller vision encoder.
- Larger variants built on the Qwen2-7B LLM surpass Cambrian-1-8B with a 7.9x faster TTFT while using a single image encoder.
- Includes a demo iOS app that showcases on-device performance on mobile hardware.
- Training and inference instructions are provided, building on the LLaVA codebase.
- Setup involves creating a conda environment and installing the package with pip (see the setup sketch after this list).
- Pretrained checkpoints are available for the FastVLM-0.5B, FastVLM-1.5B, and FastVLM-7B models (a download sketch follows the list).
- Instructions are provided for running inference on PyTorch and on Apple Silicon (a PyTorch inference sketch follows the list).
- Citation details and acknowledgments included for the CVPR 2025 paper.
- Repository and model licenses require review before use.
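
For the environment setup summarized above, the steps follow the usual conda-plus-pip pattern. A minimal sketch, assuming the environment name `fastvlm` and Python 3.10 (both are illustrative choices, not confirmed by this summary):

```bash
# Create and activate an isolated environment for the repo
# (environment name and Python version are assumptions).
conda create -n fastvlm python=3.10 -y
conda activate fastvlm

# Install the repository in editable mode, assuming a standard
# setup.py/pyproject.toml at the repo root, as in the LLaVA codebase.
pip install -e .
```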
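For the pretrained checkpoints, the typical workflow is a single download script. A hedged sketch, assuming a helper script named `get_models.sh` at the repo root and a `checkpoints/` output directory (both assumptions; check the repo README):

```bash
# Fetch the pretrained FastVLM checkpoints (0.5B, 1.5B, 7B).
# Script name and output directory are assumptions, not confirmed here.
bash get_models.sh   # checkpoints are expected under ./checkpoints/
```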
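For PyTorch inference, repos built on the LLaVA codebase typically expose a single-image prediction script. A minimal sketch, assuming a `predict.py` entry point with LLaVA-style flags (`--model-path`, `--image-file`, `--prompt`); verify the exact script and flag names against the README:

```bash
# Run single-image inference with a downloaded checkpoint.
# The checkpoint directory, image path, and flag names are assumptions
# based on the LLaVA-style CLI; adjust them to your local layout.
python predict.py \
    --model-path ./checkpoints/fastvlm-0.5b \
    --image-file ./example.png \
    --prompt "Describe the image in detail."
```

Inference on Apple Silicon follows separate instructions in the repository, per the summary above.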