Eval-maxing an AI FFmpeg command generator
17 days ago
- #Kiln
- #Fine-Tuning
- #AI Development
- Creating an AI project from start to finish with Kiln.
- Covered steps include creating evals, generating synthetic data, and validating with human ratings.
- Evaluating prompt/model pairs to find the best way to run the task.
- Fine-tuning models with synthetic training data and evaluating results.
- Iterating on the project with new evals and prompts as it evolves.
- Setting up collaboration using Git and GitHub.
- Demo project: a natural-language-to-ffmpeg command builder (an illustrative input/output pair is sketched after this list).
- Key findings: GPT-4.1 outperformed the other models tested, and fine-tuning boosted performance by 21%.
- Initial high eval scores were tempered by bugs, requiring iteration on product evals.
- The process included creating correctness evals (a minimal structural-check sketch follows this list), generating synthetic data, and manual labeling.
- Experimentation with prompts and models confirmed GPT-4.1's dominance.
- Fine-tuning involved various base models and providers, with promising results (see the training-record sketch after this list).
- Iteration included fixing bugs, adding product goals, and setting up Git collaboration.
- Next steps: improve evals, iterate on model+prompt, and consider more fine-tuning if needed.
- Kiln is a free, open tool for optimizing AI systems.
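To make the demo task concrete, here is a minimal sketch of the kind of input/output pair the natural-language-to-ffmpeg builder handles. The request wording and the exact flags are illustrative assumptions, not examples taken from the project.

```python
# Hypothetical example pair for the natural-language-to-ffmpeg task.
# The request text and the generated flags are illustrative only.
example = {
    "input": "Convert clip.mov to a 720p MP4 with H.264 video and AAC audio",
    "output": "ffmpeg -i clip.mov -vf scale=-2:720 -c:v libx264 -c:a aac clip.mp4",
}
```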
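Correctness evals for generated commands can start with cheap structural checks before any LLM-as-judge or human rating pass. The sketch below is an assumption about what such a check might look like, not Kiln's eval API: it parses the candidate command and verifies a few basic invariants.

```python
import shlex

def structurally_valid(command: str, expected_input: str) -> bool:
    """Cheap sanity check on a generated ffmpeg command.

    Hypothetical helper for a correctness eval: it does not execute
    ffmpeg, it only verifies basic structure before more expensive
    grading (LLM-as-judge or human review).
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes or similar parse errors
    if not tokens or tokens[0] != "ffmpeg":
        return False
    # The requested input file should follow an -i flag.
    inputs = [tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == "-i"]
    return expected_input in inputs

print(structurally_valid(
    "ffmpeg -i clip.mov -vf scale=-2:720 -c:v libx264 -c:a aac clip.mp4",
    "clip.mov",
))  # True
```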
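Fine-tuning providers commonly accept chat-formatted JSONL training data. The record below is a hedged sketch of what one synthetic training example for this task might look like; the system prompt, file name, and schema details are assumptions, not Kiln's export format.

```python
import json

# Hypothetical chat-formatted training record (one JSONL line) built from
# a synthetic example; the "messages" schema is the common chat format
# accepted by several fine-tuning providers.
record = {
    "messages": [
        {"role": "system", "content": "You translate natural-language requests into a single ffmpeg command."},
        {"role": "user", "content": "Convert clip.mov to a 720p MP4 with H.264 video and AAC audio"},
        {"role": "assistant", "content": "ffmpeg -i clip.mov -vf scale=-2:720 -c:v libx264 -c:a aac clip.mp4"},
    ]
}

# Append the record to a training file, one JSON object per line.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```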