A Year of Fast Apply – Our Path to 10k Tokens per Second
6 months ago
- #AI
- #Machine Learning
- #Software Development
- Released the Fast Apply model a year ago, focusing on fine-tuning small, specialized models for code-specific tasks.
- Open-sourced training insights leading to Relace Apply 3, capable of 10k+ tokens per second with state-of-the-art accuracy.
- Highlighted the inefficiency of regenerating unchanged code with expensive LLMs, proposing a lightweight diff application solution.
- Introduced the concept of using an LLM as a merge algorithm to handle pathological diffs and infer intent, improving accuracy.
- Detailed dataset production for training, emphasizing quality and diversity over size, with a focus on real production data.
- Explained the evaluation process for merges, categorizing outcomes into six types to ensure high-quality training data.
- Utilized LLM-as-a-judge to scale up dataset filtering, achieving a low false positive rate for reliable training examples.
- Adopted LoRA for efficient model training, allowing specialization without catastrophic forgetting of general coding knowledge.
- Achieved 10k+ tok/s with speculative decoding, exploiting the strong prior that most of a merged file copies the original verbatim, so many draft tokens can be verified in parallel.
- Showcased Relace Apply 3's improvements in merge accuracy, context length, and speed, positioning it as a market leader.
- Reflected on Fast Apply's impact over the past year, highlighting its role in making structured code edits reliable.
- Announced hiring for researchers and engineers to continue developing specialized models for coding tasks.
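The "LLM as a merge algorithm" idea above can be sketched as a prompt that hands the model the original file plus an abbreviated edit snippet and asks for the fully merged file. The tag names, elision marker, and instruction wording below are illustrative assumptions, not Relace's actual prompt format:

```python
def build_merge_prompt(original: str, edit_snippet: str) -> str:
    """Assemble a prompt asking a model to output the full merged file.

    The edit snippet elides unchanged regions (here with a
    '// ... existing code ...' marker); the model's job is to infer
    where the snippet applies and reproduce everything else intact.
    """
    return (
        "Apply the edit snippet to the original file and output the complete merged file.\n"
        "Unchanged regions in the snippet are elided with '// ... existing code ...'.\n\n"
        f"<original>\n{original}\n</original>\n\n"
        f"<edit>\n{edit_snippet}\n</edit>"
    )

# Example: the snippet only shows the changed function body.
prompt = build_merge_prompt(
    original="def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n",
    edit_snippet="// ... existing code ...\ndef sub(a, b):\n    return b - a\n",
)
```

Because the model regenerates the whole file rather than applying a textual patch, pathological diffs (ambiguous anchors, moved code) become an inference problem instead of a string-matching failure.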
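The LLM-as-a-judge filtering step might look like the following minimal sketch: a judge model sees (original, edit, merged) and emits a verdict, and only accepted examples enter the training set. The JSON verdict schema and prompt wording are assumptions for illustration; `llm_call` stands in for any prompt-to-string model client:

```python
import json

def judge_merge(original: str, edit: str, merged: str, llm_call) -> bool:
    """Ask a judge model whether `merged` correctly applies `edit` to `original`.

    `llm_call` is any callable mapping a prompt string to a response string;
    the {"verdict": "pass"|"fail"} schema here is an illustrative assumption.
    """
    prompt = (
        'Review this code merge. Reply with JSON {"verdict": "pass"} or {"verdict": "fail"}.\n'
        f"<original>{original}</original>\n<edit>{edit}</edit>\n<merged>{merged}</merged>"
    )
    return json.loads(llm_call(prompt)).get("verdict") == "pass"

def filter_dataset(examples, llm_call):
    """Keep only examples the judge accepts.

    A low false-positive rate here is what keeps bad merges from
    contaminating the training set.
    """
    return [
        ex for ex in examples
        if judge_merge(ex["original"], ex["edit"], ex["merged"], llm_call)
    ]
```

In practice the judge prompt would encode the outcome taxonomy from the evaluation step; this sketch collapses it to a binary keep/drop decision.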
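The LoRA bullet above rests on a simple construction: the pretrained weight stays frozen and only a low-rank additive update is trained, so fine-tuning cannot overwrite the base model's general coding knowledge. A dependency-free sketch (shapes and placement are illustrative; real adapters typically sit on attention projection matrices):

```python
def matmul(X, Y):
    """Plain-Python matrix product: rows of X times columns of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """LoRA forward pass: y = x(W + (alpha/r) * A B).

    W (d_in x d_out) is the frozen pretrained weight; only the low-rank
    factors A (d_in x r) and B (r x d_out) are trained. B is initialized
    to zeros, so training starts exactly at the base model's behavior.
    """
    base = matmul(x, W)                     # frozen path
    update = matmul(matmul(x, A), B)        # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(br, ur)] for br, ur in zip(base, update)]
```

Since the update has rank at most `r`, the number of trainable parameters is a small fraction of the full matrix, which is what makes specialization cheap.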
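The speculative-decoding speedup comes from the fact that most of a merged file copies the original, so the original file itself is a nearly free draft. The toy below shows one greedy speculation round; a real implementation verifies all k draft tokens in a single batched forward pass, whereas this sketch calls the target model per token as a sequential stand-in:

```python
def speculative_round(target_next_token, draft_tokens, prefix, k=8):
    """One round of greedy speculative decoding.

    Propose the next k tokens straight from the draft (here, the original
    file), then check them against the target model's greedy choices.
    Accept the longest agreeing prefix; on the first disagreement, take
    the target's token instead and stop. If all k drafts agree, the
    verification pass yields one extra target token for free.
    """
    accepted = []
    for t in draft_tokens[:k]:
        choice = target_next_token(prefix + accepted)
        if choice == t:
            accepted.append(t)
        else:
            accepted.append(choice)  # first mismatch: keep target's token
            break
    else:
        accepted.append(target_next_token(prefix + accepted))  # bonus token
    return accepted
```

When the merge leaves long spans untouched, almost every draft token is accepted, so each verification pass emits many tokens, which is how a per-pass cost turns into 10k+ tokens per second.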