Show HN: Largest open-source multimodal AI dataset
- #AI development
- #multimodal dataset
- #cross-modal retrieval
- Introduces a large, high-quality dataset spanning five modalities: text captions, images, video, audio, and point clouds.
- Dataset consists of three parts: an automatically generated pool (>100M samples), a human-rated subset (~1M ratings), and a consensus-based evaluation set (3.5K data points).
- Aims to accelerate development of multimodal applications with strong cross-modal retrieval performance.
- Composition pairs data across all five modalities, balancing scale with quality.
- Pre-training pool aggregates ~6.7M captions, each matched with its top-16 candidates per modality (see the matching sketch after this list).
- Post-training subset is built with human annotators to ensure quality and diversity.
- Zero-shot classification benchmark introduced, covering audio-to-point-cloud retrieval with low error rates (zero-shot sketch below).
- Data health measures include integrity checks, responsible content filtering, licensing transparency, and leakage controls.
- Baseline model provided for embedding all five modalities into a common space, with room for improvement (baseline sketch below).
- Encourages new applications and improvements such as attention over full token sequences and quality-weighted training objectives (loss sketch below).
- Get started by downloading partitions from GitHub, prototyping with the precomputed embeddings, and building on the provided baseline code (retrieval sketch below).
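
How the pre-training pool's top-16 matching is implemented isn't spelled out in the post; below is a minimal sketch, assuming per-modality embeddings are already computed and L2-normalized. The function and array names (`top_k_candidates`, `caption_emb`, `modality_emb`) are hypothetical, not the dataset's actual schema.

```python
# Minimal sketch of top-k candidate matching for a pre-training pool.
# Assumes caption and per-modality embeddings are already L2-normalized;
# names and shapes are hypothetical.
import numpy as np

def top_k_candidates(caption_emb: np.ndarray, modality_emb: np.ndarray, k: int = 16) -> np.ndarray:
    """Return indices of the k nearest modality items for each caption.

    caption_emb: (num_captions, dim), modality_emb: (num_items, dim),
    both assumed L2-normalized so the dot product is cosine similarity.
    """
    sims = caption_emb @ modality_emb.T                 # (num_captions, num_items)
    # argpartition selects the top-k set cheaply; argsort then orders it best-first.
    top_k = np.argpartition(-sims, k, axis=1)[:, :k]
    rows = np.arange(sims.shape[0])[:, None]
    order = np.argsort(-sims[rows, top_k], axis=1)
    return np.take_along_axis(top_k, order, axis=1)     # (num_captions, k)

# Toy usage: 4 captions matched against 100 audio clips in a 512-d space.
rng = np.random.default_rng(0)
caps = rng.normal(size=(4, 512));  caps /= np.linalg.norm(caps, axis=1, keepdims=True)
audio = rng.normal(size=(100, 512)); audio /= np.linalg.norm(audio, axis=1, keepdims=True)
print(top_k_candidates(caps, audio).shape)              # (4, 16)
```
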
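The post doesn't describe the zero-shot benchmark's exact protocol; a common recipe is to embed class names as captions and assign each audio or point-cloud sample to its nearest class embedding. The sketch below assumes that recipe and uses random arrays in place of real encoder outputs.

```python
# Hedged sketch of zero-shot classification over a shared embedding space:
# class names are embedded as captions, and each sample is assigned to the
# nearest class embedding. The benchmark's real classes and prompts are assumptions.
import numpy as np

def zero_shot_classify(sample_emb: np.ndarray, class_emb: np.ndarray) -> np.ndarray:
    """sample_emb: (n_samples, dim), class_emb: (n_classes, dim), both L2-normalized.
    Returns the index of the highest-similarity class for each sample."""
    return np.argmax(sample_emb @ class_emb.T, axis=1)

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(1)
classes = rng.normal(size=(10, 512)); classes /= np.linalg.norm(classes, axis=1, keepdims=True)
samples = rng.normal(size=(32, 512)); samples /= np.linalg.norm(samples, axis=1, keepdims=True)
labels = rng.integers(0, 10, size=32)
preds = zero_shot_classify(samples, classes)
print("accuracy:", float((preds == labels).mean()))     # ~chance on random data
```
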
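The released baseline's architecture isn't detailed in the post; the PyTorch sketch below shows one way to embed all five modalities into a common space: a small projection head per modality over precomputed features, with L2 normalization so dot products act as cosine similarities. The feature dimensions, head design, and 512-d embedding size are assumptions.

```python
# Sketch of a five-modality embedding model: one projection head per modality
# maps precomputed features into a shared, L2-normalized space. Not the
# released baseline's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["caption", "image", "video", "audio", "point_cloud"]

class SharedSpaceProjector(nn.Module):
    def __init__(self, feature_dims: dict, embed_dim: int = 512):
        super().__init__()
        self.heads = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
            for m, d in feature_dims.items()
        })

    def forward(self, modality: str, features: torch.Tensor) -> torch.Tensor:
        # Project into the shared space and L2-normalize so dot products are cosine sims.
        return F.normalize(self.heads[modality](features), dim=-1)

# Toy usage with assumed per-modality feature sizes.
model = SharedSpaceProjector({"caption": 768, "image": 1024, "video": 1024,
                              "audio": 768, "point_cloud": 384})
z_img = model("image", torch.randn(8, 1024))     # (8, 512) shared-space embeddings
z_txt = model("caption", torch.randn(8, 768))
sims = z_img @ z_txt.T                           # cross-modal similarity matrix
```
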
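One of the suggested improvements, a quality-weighted objective, could look like the sketch below: a symmetric InfoNCE loss where each pair's term is scaled by its human quality rating. The weighting scheme and temperature are assumptions, not the authors' formulation.

```python
# Hedged sketch of a quality-weighted contrastive (InfoNCE) objective:
# each caption-item pair's loss is scaled by its human quality rating in [0, 1].
import torch
import torch.nn.functional as F

def quality_weighted_info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
                              quality: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) L2-normalized embeddings of paired items;
    quality: (batch,) ratings used to weight each pair's loss term."""
    logits = z_a @ z_b.T / temperature                  # (batch, batch) similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    loss_ab = F.cross_entropy(logits, targets, reduction="none")
    loss_ba = F.cross_entropy(logits.T, targets, reduction="none")
    per_pair = 0.5 * (loss_ab + loss_ba)                # symmetric per-pair loss
    return (quality * per_pair).sum() / quality.sum().clamp_min(1e-8)
```
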
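For prototyping with the precomputed embeddings, something like the following is usually enough for a first cross-modal retrieval test. The file paths and `.npy` layout are hypothetical; the actual partition format and download locations are documented in the repo.

```python
# Getting-started sketch: load precomputed embeddings and run caption-to-audio
# retrieval. File names and array layout are hypothetical; see the repo README.
import numpy as np

captions = np.load("embeddings/captions.npy")   # (n_captions, dim), assumed L2-normalized
audio = np.load("embeddings/audio.npy")         # (n_audio, dim)

query = captions[0]                             # pick one caption as the query
scores = audio @ query                          # cosine similarity against all audio clips
top10 = np.argsort(-scores)[:10]                # indices of the 10 best-matching clips
print(top10, scores[top10])
```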