Achieving 10,000x training data reduction with high-fidelity labels
- #data-curation
- #LLM-fine-tuning
- #active-learning
- A new active learning method cuts the training data needed to fine-tune LLMs by up to four orders of magnitude, the 10,000x of the title.
- The method focuses on high-fidelity labels to improve model alignment with human experts.
- In experiments, training sets shrank from roughly 100,000 crowdsourced examples to under 500 expert-labeled ones, while alignment with experts improved by up to 65%.
- The process clusters the model's labeled examples and prioritizes the most confusing ones, those near the boundary where differently labeled clusters overlap, for expert review (see the sketch after this list).
- Cohen’s Kappa measures alignment between the model and human experts, correcting raw agreement for chance; values above 0.8 are considered excellent (a worked example follows below).
- The larger model tested (3.25B parameters) benefited most from curated data, achieving 55-65% better alignment.
- The method scales: the same curation loop applies to datasets of hundreds of billions of examples.
- High-quality labels (Kappa > 0.8) are essential; without them, the curated sets do not outperform crowdsourced data.
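
Here is a minimal sketch of the confusion-driven selection step, assuming example embeddings and model-assigned binary labels are already in hand. The function name, the KMeans clustering, and the cross-center distance heuristic are illustrative assumptions, not the authors' published code:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def pick_confusing_pairs(embeddings, model_labels, n_clusters=10, n_pairs=50):
    """Cluster positives and negatives separately, then surface the closest
    cross-label cluster pairs: regions where the two label populations
    overlap are where the model is most confused."""
    pos = embeddings[model_labels == 1]
    neg = embeddings[model_labels == 0]
    pos_centers = KMeans(n_clusters=n_clusters, n_init=10).fit(pos).cluster_centers_
    neg_centers = KMeans(n_clusters=n_clusters, n_init=10).fit(neg).cluster_centers_

    # Distance between every positive and negative cluster center;
    # small distances flag overlapping (confusable) regions.
    dists = cdist(pos_centers, neg_centers)
    closest = np.argsort(dists, axis=None)[:n_pairs]

    # For each overlapping center pair, pull the nearest real example from
    # each side and queue the pair for expert labeling.
    review_pairs = []
    for flat_idx in closest:
        pi, ni = np.unravel_index(flat_idx, dists.shape)
        p = pos[np.argmin(np.linalg.norm(pos - pos_centers[pi], axis=1))]
        n = neg[np.argmin(np.linalg.norm(neg - neg_centers[ni], axis=1))]
        review_pairs.append((p, n))
    return review_pairs
```

The resulting expert labels then serve double duty: part of the set scores the model's current alignment, and part feeds the next fine-tuning round.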
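And a quick worked example of the alignment metric. Cohen's Kappa corrects the observed agreement p_o for the agreement p_e two labelers would reach by chance: kappa = (p_o - p_e) / (1 - p_e). The toy labels below are made up purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

expert = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical expert labels
model  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # hypothetical model labels

# Observed agreement p_o = 8/10; chance agreement p_e = 0.6*0.6 + 0.4*0.4 = 0.52,
# so kappa = (0.8 - 0.52) / (1 - 0.52) ≈ 0.58, well short of the 0.8 bar.
print(cohen_kappa_score(expert, model))  # ≈ 0.583
```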