FastVLM: Dramatically Faster Vision Language Model from Apple

a year ago
  • #Computer Vision
  • #Machine Learning
  • #AI Efficiency
  • FastVLM introduces FastViTHD, a hybrid vision encoder that outputs fewer tokens and cuts encoding time for high-resolution images.
  • The smallest variant outperforms LLaVA-OneVision-0.5B while delivering an 85x faster time-to-first-token (TTFT) with a 3.4x smaller vision encoder.
  • Larger variants built on the Qwen2-7B LLM outperform Cambrian-1-8B while using a single image encoder and delivering a 7.9x faster TTFT.
  • A demo iOS app is included to show the model's performance on mobile devices.
  • Training and inference instructions are provided and build on the LLaVA codebase.
  • Setup consists of creating a conda environment and installing the package with pip.
  • Pretrained checkpoints are available for the FastVLM-0.5B, FastVLM-1.5B, and FastVLM-7B models.
  • Inference instructions cover both PyTorch and Apple Silicon.
  • Citation details for the CVPR 2025 paper and acknowledgments are included.
  • Review the repository and model licenses before use.
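
The TTFT speedups quoted above are ratios of two latencies. The following is a minimal sketch of how such a figure could be measured and compared. It assumes TTFT means the wall-clock time from submitting an image and prompt until the first generated token arrives, so it covers both vision encoding and LLM prefill. The names `time_to_first_token`, `speedup`, `encode_image`, and `generate_tokens` are hypothetical and are not part of the FastVLM or LLaVA code; the dummy stand-ins only simulate latency with `time.sleep`.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def time_to_first_token(
    encode_image: Callable[[Any], Any],
    generate_tokens: Callable[[Any, str], Iterable[Any]],
    image: Any,
    prompt: str,
) -> Tuple[float, Any]:
    """Return (seconds until the first token, the first token).

    Both callables are hypothetical placeholders for a vision encoder
    and a streaming LLM decoder.
    """
    start = time.perf_counter()
    visual_tokens = encode_image(image)              # vision encoding
    stream = generate_tokens(visual_tokens, prompt)  # prefill + decode
    first_token = next(iter(stream))                 # stop at the first token
    return time.perf_counter() - start, first_token


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio of baseline TTFT to candidate TTFT (values above 1 mean faster)."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Dummy stand-ins (assumptions for illustration only).
    def slow_encoder(image):
        time.sleep(0.05)
        return ["v"] * 100

    def fast_encoder(image):
        time.sleep(0.01)
        return ["v"] * 10

    def decoder(visual_tokens, prompt):
        # Simulated prefill cost grows with the number of visual tokens.
        time.sleep(0.0005 * len(visual_tokens))
        yield "first"
        yield "second"

    slow, _ = time_to_first_token(slow_encoder, decoder, None, "Describe the image.")
    fast, _ = time_to_first_token(fast_encoder, decoder, None, "Describe the image.")
    print(f"simulated speedup: {speedup(slow, fast):.1f}x")
```

In this sketch, a smaller token count shortens the simulated prefill as well as the encoding step, which is why reducing visual tokens affects TTFT in two places. Real measurements would also need warm-up runs and averaging over several images.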