Show HN: Marlin-2B: a tiny VLM to extract structured information from videos
2 days ago
- #dense-captioning
- #temporal-grounding
- #video-vlm
- Instructions for using NemoStation/Marlin-2B with Transformers and other tools.
- Marlin is a 2B-parameter video vision-language model (VLM) for dense captioning and temporal grounding; its authors report state-of-the-art performance.
- It includes two convenience methods: caption() for structured scene and event captions, and find() for natural-language temporal queries.
- Training had two stages: supervised fine-tuning (SFT) on curated data, followed by SimPO preference optimization, using a mix of public annotations and Gemini-generated data.
- The quickstart covers installation requirements, loading the model, and calling the caption() and find() methods.
- Advanced usage allows raw inference via generate() for custom prompts, with notes on output formatting.
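
For readers unfamiliar with the SimPO stage mentioned above, here is a minimal sketch of the SimPO objective for one preference pair, written in plain Python. It follows the published SimPO formulation (length-normalized log-probabilities of the chosen and rejected responses, with a target margin); it is not taken from Marlin's training code. The function name `simpo_loss` and the default `beta` and `gamma` values are illustrative assumptions, not the settings used for Marlin.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,   # illustrative value, not Marlin's setting
    gamma: float = 1.0,  # illustrative value, not Marlin's setting
) -> float:
    """SimPO loss for a single (chosen, rejected) response pair.

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  number of tokens in each response.
    """
    # Implicit reward: average per-token log-probability, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # The chosen response should beat the rejected one by at least gamma.
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

Unlike DPO, this objective needs no reference model, because the reward is computed from the policy's own length-normalized log-probabilities. The loss shrinks as the chosen response becomes more likely per token than the rejected one.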