Hasty Briefs (beta)

Show HN: Marlin-2B: a tiny VLM to extract structured information from videos

2 days ago
  • #dense-captioning
  • #temporal-grounding
  • #video-vlm
  • Instructions for using NemoStation/Marlin-2B with Transformers and other tools.
  • Marlin is a 2B-parameter video VLM for dense captioning and temporal grounding, described as achieving state-of-the-art performance.
  • It includes two convenience methods: caption() for structured scene and event captions, and find() for natural-language temporal queries.
  • Training involved two stages: supervised fine-tuning (SFT) on curated data, followed by SimPO preference optimization. The training data mixed public annotations with Gemini-generated data.
  • The quickstart code demonstrates loading the model and calling the caption() and find() methods; system requirements for installation are also listed.
  • Advanced usage allows raw inference via generate() for custom prompts, with notes on output formatting.
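
For readers unfamiliar with the SimPO stage mentioned in the training bullet above, here is a minimal sketch of the general SimPO objective: a length-normalized log-probability margin between a preferred and a rejected response, passed through a logistic loss. This is not Marlin's training code. The function names and the beta and gamma values are illustrative assumptions, and the summary does not state which hyperparameters were used.

```python
import math


def sequence_reward(token_logprobs, beta):
    """Length-normalized implicit reward: beta times the mean token log-probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return beta * sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    chosen_logprobs / rejected_logprobs: per-token log-probabilities that the
    policy assigns to the preferred and the rejected response.
    beta (reward scale) and gamma (target margin) are illustrative values only.
    """
    margin = (
        sequence_reward(chosen_logprobs, beta)
        - sequence_reward(rejected_logprobs, beta)
        - gamma
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities for a preferred and a rejected caption.
    preferred = [-0.5, -0.5]
    rejected = [-2.0, -2.0]
    print(simpo_loss(preferred, rejected))
```

Because the reward is averaged over tokens, a longer response does not earn a higher reward simply by being longer, and no separate reference model is needed to compute the loss.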