Hasty Briefs

Show HN: Largest open-source multimodal AI dataset

2 days ago
  • #AI development
  • #multimodal dataset
  • #cross-modal retrieval
  • Introduces a large, high-quality dataset spanning five modalities: caption, image, video, audio, and point cloud.
  • Dataset consists of three parts: a large automatically generated dataset (>100M samples), a human-rated subset (~1M ratings), and a consensus-based evaluation set (3.5K data points).
  • Aims to accelerate development of multimodal applications with strong cross-modal retrieval performance.
  • Dataset composition pairs data across all five modalities, balancing scale with quality.
  • Pre-training pool aggregates ~6.7M captions, each matched with the top-16 candidates per modality (a retrieval sketch follows this list).
  • Post-training subset involves human annotators to ensure quality and diversity.
  • A zero-shot classification benchmark is introduced for audio–point cloud retrieval, with low error rates (a zero-shot sketch follows this list).
  • Data health measures include integrity checks, responsible content filtering, licensing transparency, and leakage controls.
  • Baseline model provided for embedding all five modalities into a common space, with room for improvement.
  • Encourages new applications and improvements, such as attention over full token sequences and quality-weighted objectives (a quality-weighted loss sketch appears after this list).
  • Get started by downloading partitions from GitHub, prototyping with precomputed embeddings, and using provided baseline code.
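As a rough illustration of cross-modal retrieval over the precomputed embeddings, here is a minimal sketch. The file names, array shapes, and the `caption_emb`/`image_emb` variables are assumptions for illustration, not the dataset's actual partition layout.

```python
import numpy as np

# Hypothetical file names; the real partition layout on GitHub may differ.
caption_emb = np.load("caption_embeddings.npy")  # (N, d) caption vectors
image_emb = np.load("image_embeddings.npy")      # (M, d) image vectors

# L2-normalize so dot products become cosine similarities.
caption_emb /= np.linalg.norm(caption_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

# For each caption, rank all images and keep the top-16 candidates,
# mirroring the top-16-per-modality pairing described above.
scores = caption_emb @ image_emb.T               # (N, M) similarity matrix
top16 = np.argsort(-scores, axis=1)[:, :16]      # indices of the best matches
```

The same pattern applies to any of the other modality pairs once their embeddings are loaded into a common space.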
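Zero-shot classification in a shared embedding space typically works by embedding class names as captions and assigning each sample to the nearest class vector. The sketch below assumes a generic `embed_text` stand-in for the baseline caption encoder and an illustrative prompt template; it is not the dataset's evaluation code.

```python
import numpy as np

def zero_shot_classify(sample_emb, class_names, embed_text):
    """Assign each sample to the class whose caption embedding is closest.

    `embed_text` is an assumed stand-in for the baseline caption encoder;
    the prompt template is illustrative only.
    """
    class_emb = np.stack([embed_text(f"a recording of a {c}") for c in class_names])
    class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)
    sample_emb = sample_emb / np.linalg.norm(sample_emb, axis=1, keepdims=True)
    return np.argmax(sample_emb @ class_emb.T, axis=1)  # predicted class index
```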
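One way to read "quality-weighted objectives" is a contrastive (InfoNCE-style) loss where each pair's human rating scales its contribution, so noisy matches pull on the model less. The weighting scheme below is an assumption, not the authors' recipe.

```python
import torch
import torch.nn.functional as F

def quality_weighted_infonce(z_a, z_b, ratings, temperature=0.07):
    """Contrastive loss over paired embeddings z_a, z_b of shape (B, d),
    with per-pair human ratings in [0, 1] used as loss weights.
    The weighting scheme is illustrative, not the dataset authors' method."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                     # (B, B) logits
    targets = torch.arange(z_a.size(0), device=z_a.device)   # matched pairs
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    # Down-weight low-rated pairs so low-quality matches contribute less.
    return (ratings * per_pair).sum() / ratings.sum()
```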