Hasty Briefs

Eval-maxing an AI FFmpeg command generator

18 days ago
  • #Kiln
  • #Fine-Tuning
  • #AI Development
  • Creating an AI project from start to finish with Kiln.
  • Covered steps include creating evals, generating synthetic data, and validating with human ratings.
  • Evaluating prompt/model pairs to find the best way to run tasks.
  • Fine-tuning models with synthetic training data and evaluating results.
  • Iterating on the project with new evals and prompts as it evolves.
  • Setting up collaboration using Git and GitHub.
  • Demo project: a natural-language-to-FFmpeg command builder (a minimal sketch of the task follows this list).
  • Key findings: GPT-4.1 outperformed the other models, and fine-tuning boosted performance by 21%.
  • Initial high eval scores were tempered by bugs, requiring iteration on product evals.
  • Process included creating correctness evals, generating synthetic data, and manually labeling results (the synthetic-data and eval steps are sketched after this list).
  • Experimenting with prompts and models revealed GPT-4.1's dominance (see the comparison sketch below).
  • Fine-tuning covered various base models and providers, with promising results (see the fine-tuning sketch at the end).
  • Iteration included fixing bugs, adding product goals, and setting up Git collaboration.
  • Next steps: improve evals, iterate on model+prompt, and consider more fine-tuning if needed.
  • Kiln is a free, open tool for optimizing AI systems.
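
The brief doesn't include code, but the core task is easy to picture. Here is a minimal, hypothetical sketch of the generator itself, calling the OpenAI Python SDK directly rather than going through Kiln's app; the prompt wording, function name, and example request are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an FFmpeg expert. Given a plain-English request, reply with a "
    "single runnable ffmpeg command and nothing else."
)

def build_ffmpeg_command(request: str, model: str = "gpt-4.1",
                         system_prompt: str = SYSTEM_PROMPT) -> str:
    """Translate a natural-language request into an ffmpeg command."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": request},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(build_ffmpeg_command("Convert input.mov to a 720p MP4 with H.264 video and AAC audio"))
```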
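
Synthetic data generation can be approximated the same way: ask a strong model for varied user requests, which later become eval inputs and fine-tuning examples. Kiln has its own synthetic-data tooling, so treat this as an illustrative stand-in; the prompt text and `topic_hints` are invented. It reuses `client` from the sketch above.

```python
def generate_synthetic_requests(n: int, topic_hints: list[str],
                                model: str = "gpt-4.1") -> list[str]:
    """Generate varied user requests to use as eval and training inputs."""
    prompt = (
        f"Write {n} distinct, realistic requests a user might make of an "
        f"ffmpeg assistant. Cover topics such as: {', '.join(topic_hints)}. "
        "Return exactly one request per line, with no numbering."
    )
    response = client.chat.completions.create(  # `client` from the first sketch
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-• ").strip() for line in lines if line.strip()]

requests = generate_synthetic_requests(
    50, ["format conversion", "trimming", "audio extraction", "subtitles", "GIF creation"]
)
```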
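
For the correctness eval, the article relies on model-based evals validated against human ratings. A bare-bones LLM-as-judge version might look like this; the PASS/FAIL rubric is an assumption, and judge labels should be spot-checked against human ratings as the post describes.

```python
JUDGE_PROMPT = """You are grading an AI-generated ffmpeg command.

User request:
{request}

Generated command:
{command}

Reply with exactly one word: PASS if the command is valid ffmpeg syntax and
would accomplish the request, otherwise FAIL."""

def judge_correctness(request: str, command: str,
                      judge_model: str = "gpt-4.1") -> bool:
    """LLM-as-judge correctness check."""
    response = client.chat.completions.create(  # `client` from the first sketch
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(request=request, command=command)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```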
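
Comparing prompt/model pairs then reduces to running a shared eval set through each combination and tallying judge pass rates, roughly as below. The model list and prompt variants are placeholders; it builds on `build_ffmpeg_command` and `judge_correctness` from the earlier sketches.

```python
from itertools import product
from statistics import mean

MODELS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4o-mini"]   # illustrative candidates
PROMPTS = {
    "zero_shot": SYSTEM_PROMPT,
    "few_shot": SYSTEM_PROMPT + "\n\nExamples:\n<a few request/command pairs>",
}

def compare_pairs(eval_requests: list[str]) -> None:
    """Run every model+prompt pair over a shared eval set and print pass rates."""
    for model, (prompt_name, prompt) in product(MODELS, PROMPTS.items()):
        passes = [
            judge_correctness(req, build_ffmpeg_command(req, model=model,
                                                        system_prompt=prompt))
            for req in eval_requests
        ]
        print(f"{model:>14} / {prompt_name:<9}: {mean(passes):.0%} pass rate")
```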
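
Finally, the fine-tuning step. Kiln dispatches tunes to several providers; a direct OpenAI-only version would mean writing the human-approved examples to chat-format JSONL and starting a job, roughly as follows. The base-model snapshot name is an assumption; check your provider's list of tunable models.

```python
import json

def write_training_file(examples: list[dict], path: str = "ffmpeg_train.jsonl") -> str:
    """Write approved (request, command) pairs in chat-format JSONL."""
    with open(path, "w") as f:
        for ex in examples:
            record = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},  # from the first sketch
                {"role": "user", "content": ex["request"]},
                {"role": "assistant", "content": ex["command"]},
            ]}
            f.write(json.dumps(record) + "\n")
    return path

def start_finetune(path: str, base_model: str = "gpt-4.1-mini-2025-04-14") -> str:
    """Upload the training file and start a fine-tuning job; returns the job id."""
    uploaded = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=uploaded.id, model=base_model)
    return job.id
```

Re-running the same judge eval on the tuned model is what makes a claim like the article's 21% improvement measurable.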