FastVLM: Dramatically Faster Vision Language Model from Apple
- #Computer Vision
- #Machine Learning
- #AI Efficiency
- FastVLM introduces FastViTHD, a hybrid vision encoder for high-resolution images that outputs fewer visual tokens and significantly reduces encoding time.
- The smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster time-to-first-token (TTFT) and a 3.4x smaller vision encoder.
- Larger variants built on the Qwen2-7B LLM surpass Cambrian-1-8B with a 7.9x faster TTFT while using a single image encoder.
- Includes a demo iOS app that showcases on-device performance on mobile hardware.
- Training and inference instructions are provided, building on the LLaVA codebase.
- Setup involves creating a conda environment and installing the package with pip (see the setup sketch after this list).
- Pretrained checkpoints are available for the FastVLM-0.5B, FastVLM-1.5B, and FastVLM-7B models (a download sketch follows the list).
- Instructions are provided for running inference on PyTorch and on Apple Silicon (a PyTorch inference sketch follows the list).
- Citation details and acknowledgments included for the CVPR 2025 paper.
- Repository and model licenses require review before use.
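
For the environment setup summarized above, the steps follow the usual conda-plus-pip pattern. A minimal sketch, assuming the environment name `fastvlm` and Python 3.10 (both are illustrative choices, not confirmed by this summary):

```bash
# Create and activate an isolated environment for the repo
# (environment name and Python version are assumptions).
conda create -n fastvlm python=3.10 -y
conda activate fastvlm

# Install the repository in editable mode, assuming a standard
# setup.py/pyproject.toml at the repo root, as in the LLaVA codebase.
pip install -e .
```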
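For the pretrained checkpoints, the typical workflow is a single download script. A hedged sketch, assuming a helper script named `get_models.sh` at the repo root and a `checkpoints/` output directory (both assumptions; check the repo README):

```bash
# Fetch the pretrained FastVLM checkpoints (0.5B, 1.5B, 7B).
# Script name and output directory are assumptions, not confirmed here.
bash get_models.sh   # checkpoints are expected under ./checkpoints/
```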
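For PyTorch inference, repos built on the LLaVA codebase typically expose a single-image prediction script. A minimal sketch, assuming a `predict.py` entry point with LLaVA-style flags (`--model-path`, `--image-file`, `--prompt`); verify the exact script and flag names against the README:

```bash
# Run single-image inference with a downloaded checkpoint.
# The checkpoint directory, image path, and flag names are assumptions
# based on the LLaVA-style CLI; adjust them to your local layout.
python predict.py \
    --model-path ./checkpoints/fastvlm-0.5b \
    --image-file ./example.png \
    --prompt "Describe the image in detail."
```

Inference on Apple Silicon follows separate instructions in the repository, per the summary above.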